Contrary to popular belief, the out-of-specification problem started years before the Barr Decision.
Production lots and tests with a specification range of two standard deviations will produce random rejections five percent of the time, as a result of extreme statistical variation. Techniques based on sound statistical reasoning were developed to deal with out-of-specification (OOS) test results. The temptation to bend the rules and lower the reject rate led to abuses, however. The most common of these was to test a sample repeatedly until a passing result was produced. In 1993, Barr Laboratories lost a lawsuit on this and related points and the judge's decision led to new interpretations of FDA rules, including the requirement that an investigation be initiated before a replicate sample can be tested. These rules and others incorporated into FDA guidance documents reflect a misunderstanding of important statistical principles.
Dealing with out-of-specification (OOS) test results has been a general manufacturing concern for more than 80 years.1 It arises because it is statistically plausible that five percent of lots and tests will fall outside accepted limits, even if the product actually meets specifications.
The bigger problem is that many manufacturers have incorrectly applied retesting procedures and averages. Such erroneous application of statistical methods is probably due to poor training in mathematics in some cases and to unethical efforts to avoid discarding lots in others. The most significant abuse of statistical methods has been to test lots repeatedly until a sample falls within the specification range, and then to accept a lot based on one passing result. This method is known as "testing into compliance."
This approach to OOS results became a major problem following the 1993 lawsuit between the US government and Barr Laboratories.2 Peculiar judicial conclusions and subsequent US Food and Drug Administration (FDA) actions created a major problem out of a minor quality control (QC) problem. In this article, we trace this history with an emphasis on the 15 years since the Barr Decision. A key part of the story is that poor training in mathematics and a lack of statistical thinking combine to confuse workers.
Before discussing the history of the out-of-specification (OOS) problem, it is useful to examine some basic tenets underlying lot release testing and the use of statistics.
All Measurements are Approximate
Scientists realize that all measurements are uncertain at some level and are taught that the standard deviation is the parameter that estimates the degree of this uncertainty. For the pharmaceutical analyst, this idea is very important when making quality control (QC) measurements because the analyst must balance the cost of making measurements against the needed level of certainty.
Unlike their counterparts in academia, industrial QC analysts are not expected to produce test results that are accurate and precise to the maximum number of significant figures that are possible. In most cases, the analyst's supervisors will not provide the equipment or the time to make measurements of that type, but will only provide what is necessary to determine if a product lot meets specifications.
Of course, the occurrence of OOS results also raises the question of whether the specifications themselves have been properly set. If the specifications are set improperly, we will consistently see OOS results, because the manufacturing process itself cannot meet the specifications that were set for it. This article does not deal with such circumstances, however; the OOS problem addressed here applies to stable and controlled processes with realistic requirements, in which an OOS result is a rare event.
Variability Can be Measured
The experienced QC scientist knows, when setting specifications, that individual units of a product will vary because of process variations that affect both samples and whole lots. In addition, variation in the test method itself is layered on top of process variations. Therefore, the result of a single test is affected by multiple sources of variation, and may be misleading unless the degree of variation arising from the different sources is understood. That is why a specification has ranges.
The statistically trained analyst tries to understand these variations by testing several replicates to obtain an average (mean, x-mean) and a standard deviation (s). The standard deviation is a measure of the variation of the test results and may be expressed as a percentage of the mean (100 * s/x-mean); this is called a coefficient of variation (CV). A small CV is believed to represent a test that is more precise than a test with a large CV. The usual procedure is to create specifications that allow for test results to vary in a range of two standard deviations about the mean. Statistically, this creates a range that captures 95% of expected test results. The problem lies with the remaining 5%.
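The calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than a validated procedure: the replicate values are hypothetical, and the function names are invented for this example.

```python
import statistics

def summarize_replicates(results):
    """Return the mean, sample standard deviation (s), and CV in percent."""
    mean = statistics.mean(results)
    s = statistics.stdev(results)   # sample standard deviation (n - 1 divisor)
    cv_percent = 100 * s / mean     # coefficient of variation
    return mean, s, cv_percent

def two_sigma_range(mean, s):
    """Return the range of two standard deviations about the mean."""
    return mean - 2 * s, mean + 2 * s

# Hypothetical replicate results, in percent of label claim
results = [99.2, 100.4, 100.1, 99.6, 100.7]
mean, s, cv_percent = summarize_replicates(results)
low, high = two_sigma_range(mean, s)

# Fraction of a normal population that falls within two standard deviations
unit_normal = statistics.NormalDist()
coverage = unit_normal.cdf(2) - unit_normal.cdf(-2)

print(f"mean = {mean:.2f}, s = {s:.3f}, CV = {cv_percent:.2f}%")
print(f"two-sigma range: {low:.2f} to {high:.2f}")
print(f"normal coverage within two standard deviations: {coverage:.4f}")
```

For a normal distribution the coverage of two standard deviations is about 95.45%, which is the figure usually rounded to 95%. With only a handful of replicates, s is itself an estimate, so a range built from it is approximate.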
Approximately 5% of the time, a confluence of random events can occur, resulting in a test result that is outside the 95% range of the specification. This situation could occur as a result of what is known as extreme statistical variation, even if nothing is wrong with the product lot or process. In rare instances, extreme statistical variation could produce test results indicating that an entire product lot is OOS; more commonly, however, the OOS result is associated with a single test sample.
To lessen the effect of extreme statistical variation, most QC analysts make their measurements using replicates. The averaging of replicates is a method for controlling the effects of variability.1 The procedures for calculating the number of replicates required for a given level of risk and test method variation have been known for a long time.2
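As a rough sketch of why averaging helps and how a replicate count might be chosen, the following assumes normally distributed results, a known test standard deviation `sigma`, and a process centered on its target. The function names and numbers are illustrative assumptions. This is the textbook normal-theory calculation, and not necessarily the procedure given in the cited references.

```python
import math
from statistics import NormalDist

def standard_error(sigma, n):
    """Standard deviation of the mean of n independent replicates."""
    return sigma / math.sqrt(n)

def replicates_needed(sigma, half_width, alpha=0.05):
    """Smallest n for which the mean of n replicates from an on-target lot
    falls outside target +/- half_width with probability at most alpha."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    return math.ceil((z * sigma / half_width) ** 2)

# Hypothetical method: sigma = 1.0 unit, specification = target +/- 1.0 unit
n = replicates_needed(sigma=1.0, half_width=1.0, alpha=0.05)
print("replicates needed:", n)
print("standard error of the mean:", standard_error(1.0, n))
```

Because the standard error shrinks with the square root of n, halving the uncertainty of the reported mean requires four times as many replicates.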
It is also well known that when using a 95% interval for a specification range, the 5% possibility of obtaining an OOS result (even on an acceptable product, because of extreme statistical variation) can be dealt with by repeating the test on the same sample or set of samples. The idea behind this approach is that if an OOS test result is caused by an extreme statistical variation that occurs 5% of the time, then there is a 95% probability that the retest will not show the effects of this extreme variation. To a certain extent, this procedure led to what FDA has called "reflexive retesting." But this method, when used properly, is a valid approach. Only when abused does it truly become "reflexive retesting."
The OOS problem did not arise from reflexive retesting, however, but rather from an incorrect extension of the procedure, which led to "testing into compliance," essentially a result of the fact that management hates to reject a batch.
The process of "testing into compliance" resulted from a reversal of the thinking that originally led to reflexive retesting. In "testing into compliance," an unethical manufacturer hopes that even a bad lot will produce a passing test result, as a result of extreme statistical variation. Consequently, failing test results are ignored and retests are ordered until extreme variation produces a passing test result. The passing result is accepted and the lot is released based on that result. Instances where seven to eight retests were ordered in an attempt to obtain a passing result are known.
If a company actually believes that such a passing result shows that the product is of good quality, it is engaging in fallacious reasoning. There is no reason to believe that the results obtained from the retests are really different from the original test result. In fact, a result from a retest should be nothing more than another member of the population of test results that are generated by random variation. The fact that a passing test result is pleasing to management does not make it more valid than a result that indicates a failure to meet a specification. From a statistical point of view, as long as these results arise from properly performed tests on the same sample, they are all members of a population of test results, each of which represents a legitimate estimate of the property specified.
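The arithmetic behind this argument is simple. Assuming independent retests, each with the same probability `p_pass` of giving a passing result, the chance that at least one of `k` tests passes is 1 - (1 - p_pass)^k. The sketch below uses a hypothetical failing lot; the 5% pass rate and the function name are assumptions made only for illustration.

```python
def prob_at_least_one_pass(p_pass, k):
    """Probability that at least one of k independent tests passes."""
    return 1 - (1 - p_pass) ** k

# Hypothetical failing lot whose individual tests pass only 5% of the time
for k in (1, 2, 4, 8):
    print(k, round(prob_at_least_one_pass(0.05, k), 3))
```

Under these assumptions, eight tests give roughly one chance in three of producing at least one passing result, even though nothing about the lot has changed. Selecting that one result and ignoring the rest does not make it a better estimate than the others.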
The Proper Use of Retesting
There is a legitimate and proper way to use retesting, however. If a test shows that a batch is OOS, the trained QC analyst considers three possibilities:
1. Has process variation created a whole lot that is OOS?
2. Has process variation created a single sample that is OOS?
3. Did the OOS test result occur because the test was performed incorrectly?
Therefore, when confronted with an OOS test result—whether it is a single number or an average—the logical next step is to perform a retest.
In some situations, sample limitations or questions about the validity of the original sample make it necessary to perform the retest on a new sample from the lot. In such cases, the analyst should conduct the retest under conditions that allow it to be considered a legitimate test of the lot, so that there is no reason to believe that the retest is any less valid than the original test. If the result of the retest confirms the initial OOS conclusion, the analyst should accept the failure and reject the lot.
If the retest shows a passing test result, however, the analyst is in a quandary, because both results are equally valid. The situation in which the analyst is faced with opposite conclusions from equally valid results is common when the initial OOS result is caused by extreme statistical variation. The experienced analyst knows that it is necessary to conduct additional retests to reach a high level of confidence that the lot is really acceptable.
The OOS problem arose because of companies whose managers would immediately accept a passing result and discard a previous failing result when there was no scientifically defensible reason for doing so. This ended up in court, in the case of United States v. Barr Laboratories.
In 1993, Barr Laboratories was sued by the US government (i.e., the US Food and Drug Administration) regarding a whole set of issues, including the way the company dealt with OOS results.3–5 Barr lost, and the judge who heard the case, Judge Wolin, issued a ruling commonly referred to as the Barr Decision. The Barr Decision made the OOS problem into a major problem for the QC laboratory by creating a regulatory requirement under which, following an OOS result, an investigation must be initiated before any retesting can be done. Consequently, OOS results arising from random variation must be investigated before the one action (i.e., retesting) that would show whether the result was a random event can be taken. This creates additional work for QC laboratories, and an intense desire to simplify investigations by blaming OOS results on laboratory error. This can create situations in which retesting leads to testing a product into compliance with repeated claims of laboratory error. An outside observer might conclude that the laboratory is not competent to be performing the test.
Most QC supervisors who have received basic statistical training know that statistical formulas can be used to calculate the proper number of replicates needed to overcome a single failing result. The number of replicates is based on previous data concerning the variability of the product and test method. In the Barr Decision, however, the judge offered the opinion that seven passing results are needed to overcome one OOS result. This caused a number of companies to adopt a "seven replicate rule" when confronted with an OOS test result. This procedure and the testimony that originally led to the judge's conclusion were completely without scientific foundation.
Following the 1993 Barr Decision, the OOS problem was formalized by FDA in a draft guidance document, Out of Specification (OOS) Test Results for Pharmaceutical Production, issued in September 1998.6 Although it was a draft document, for many years it was the only guidance document available on this subject.
Like the Barr Decision, the FDA's draft OOS guidance document required that any single OOS result must be investigated. The guidance also introduced procedures for investigating OOS test results. It made clear recommendations for the actions that should be taken during initial laboratory investigations and formal investigations (with reporting requirements) for situations in which the OOS result cannot be attributed to laboratory error. The recommendations detailed the elements required for the investigations and the reports that would be generated. The responsibilities of the analyst who obtains an OOS test result and that analyst's supervisor were described.
In addition to the investigation requirement, the guidance document incorporated many other elements of the Barr Decision. Thankfully, it did not perpetuate the erroneous idea of seven passing test results overcoming one failing result. Unfortunately, other odd ideas, particularly related to averaging, were maintained.
The judge in the Barr Decision made the statement that "averaging conceals variation." That statement was true, as far as it went, but the purpose of averaging is to seek the central value of a normal distribution, and the measurement of variation should be made using the standard deviation. This point appears to have been misunderstood by the regulators who prepared the OOS draft guidance.
The only concession to the use of standard deviations seems to be two statements in the draft guidance document.6 In the section on averaging, the draft guidance notes that when measuring content uniformity, the analyst should also report the standard deviation of the results. The same section also says, "Unexpected variation in replicate determinations should trigger investigation and documentation requirements." This statement suggests that standard deviations should be monitored and a specification set so that the expected level of variation will be known, and actions taken in the face of an unusual level of variation.
This idea that a standard deviation might be used to detect excessive variation was not mentioned in the Barr Decision. However, the requirement that all individual test results used to calculate an average must also meet the specification individually was retained from the Barr Decision. This requirement clearly showed that the distinction between averages and single test results was not understood, despite the fact that it is one of the most basic ideas of statistics. The document also failed to consider the fact that a mean and the standard error of that mean could be within a specification even if some of the individual results that make up the mean lie outside of the specification range.
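A small numerical sketch shows how this can happen. The specification limits, the replicate values, and the choice of an interval of two standard errors about the mean are all hypothetical, chosen only to illustrate the point.

```python
import math
import statistics

def mean_with_two_se(results):
    """Return the mean and the interval mean +/- 2 standard errors."""
    mean = statistics.mean(results)
    se = statistics.stdev(results) / math.sqrt(len(results))
    return mean, (mean - 2 * se, mean + 2 * se)

# Hypothetical specification and replicate results (percent of label claim)
spec_low, spec_high = 95.0, 105.0
results = [94.8, 100.5, 101.0, 99.9, 100.3, 100.1]

mean, (ci_low, ci_high) = mean_with_two_se(results)
individuals_outside = [x for x in results if not spec_low <= x <= spec_high]
interval_within_spec = spec_low <= ci_low and ci_high <= spec_high

print(f"mean = {mean:.2f}, interval = {ci_low:.2f} to {ci_high:.2f}")
print("individual results outside specification:", individuals_outside)
print("mean interval within specification:", interval_within_spec)
```

Here one replicate lies below the lower limit, yet the mean and its interval sit comfortably inside the specification. The low value still leaves its mark on the standard deviation, which is where unusual variation should be looked for.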
When Averaging is Justified
Many of the problems from OOS results appear to arise from confusion between Barr's practices regarding content and blend uniformity tests and situations in which averaging is justified. Averaging is justified when the analyst has a good reason to believe that all test results should be identical. For instance, aliquots taken from a large, well-mixed solution may be assumed to be identical. On the other hand, when performing content and blend uniformity testing, the assumption must be that test results will not be uniform and that the lot uniformity must be proven. In such cases, averaging is not justified unless there are also tight limits on the standard deviation. When test procedures are developed, the test developer must state the reasons for believing that the test aliquots will be uniform to justify averaging the results. Otherwise, averaging should not be conducted.
The Barr Decision and subsequent FDA rules about handling OOS test results also had an impact on the use of outlier tests. The judge in the Barr case ruled that since the United States Pharmacopoeia (USP) mentioned the use of the outlier test in conjunction with biological tests but not with chemical tests, it could be used with biological tests but not with chemical tests. Given that outlier tests are well established in the theory of statistics, which is a branch of applied mathematics, this was tantamount to stating that since the USP does not specifically mention geometry, one could not use geometric considerations in pharmaceutical calculations.7 The judge apparently believed that the application of mathematics and natural laws is subject to judicial restrictions.
The USP quickly took action to bring chemical tests under the auspices of the outlier test, but the FDA's acceptance of the judge's ruling showed a remarkable level of prejudice against the outlier test procedure. Outlier testing is widely used and accepted in diverse fields of science and technology. The formulas used for outlier testing have a firm foundation in the mathematical framework of statistics, provided that the underlying hypothesis of the test is affirmed. This hypothesis is that the outlier is a member of a second population of test results that contaminates the set of observations that are supposedly from a first population.
It appeared that FDA felt that the use of outlier testing would make it too easy to discard an OOS test result. This was a misrepresentation and misinterpretation of the outlier test. Although in some industries outlier testing is used to discard unusual individual observations, in the case of pharmaceutical test results, the detection of an outlier must result in an investigation into its cause. Also, statistical theory says that finding a true outlier must be an infrequent event. The frequent finding of outliers must cast doubt upon the specificity of a testing procedure or raise questions about the appropriate use of the test, as this suggests that contamination of test results with results from a second population is a frequent event.
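To make the idea concrete, the sketch below computes the statistic used in Grubbs' test, one commonly used outlier test. The article does not say which outlier test is meant, so the choice of Grubbs' test is an assumption, as are the data and the function names. The critical value is not computed here; it must be taken from a published table for the sample size and significance level in use.

```python
import statistics

def grubbs_statistic(results):
    """Return (G, suspect): the largest absolute deviation from the mean,
    in units of the sample standard deviation, and the value producing it."""
    mean = statistics.mean(results)
    s = statistics.stdev(results)
    suspect = max(results, key=lambda x: abs(x - mean))
    return abs(suspect - mean) / s, suspect

def flag_for_investigation(results, critical_value):
    """Return (suspect or None, G). A result is flagged when G exceeds the
    caller-supplied critical value; flagging means investigate, not discard."""
    g, suspect = grubbs_statistic(results)
    return (suspect if g > critical_value else None), g

# Hypothetical replicate results (percent of label claim)
results = [94.8, 100.5, 101.0, 99.9, 100.3, 100.1]

# ASSUMPTION: 1.887 is used here as the tabulated two-sided 5% critical value
# for n = 6; confirm it against a published table before relying on it.
flagged, g = flag_for_investigation(results, critical_value=1.887)
print(f"G = {g:.3f}, flagged for investigation: {flagged}")
```

The function reports a suspect value and nothing more. In keeping with the argument above, a flagged result is the start of an investigation into its cause, and a procedure that flags results frequently is itself a reason for concern.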
1996 Proposed GMP Regulations
In 1996—two years before issuing the draft OOS guidance—FDA proposed new good manufacturing practices (GMP) regulations to cover the OOS problem.8 These regulations were well written and actually served as useful guidance on many GMP subjects in addition to the OOS problem. A proposed definition of the term out-of-specification was given that would have been consistent with 21 CFR 211.160(b). The regulations would have rewritten 21 CFR 211.192 to clarify the OOS problem, and required investigations and standard operating procedures, just as required later by the draft guidance document. Unfortunately, these regulations were never finalized and were rescinded before the final guidance document covering the OOS problem was issued in 2006.
In October 2006, 13 years after the Barr Decision and eight years after issuing the draft guidance document, FDA issued the final guidance document covering the OOS problem.9 The final guidance document is similar to the draft guidance document, and retains many problems from the draft. For example, the final document retains some of the peculiar statements related to averaging and statistical tests. The confusion between testing for uniformity and the use of uniform test samples is also maintained.
One significant difference between the draft OOS guidance and the final version, however, is the restriction of the application of the guidance document to chemical tests. "This guidance applies to chemistry-based laboratory testing of drugs regulated by CDER," the document says. "It is directed toward traditional drug testing and release methods."
The problem is that the nature of "chemical tests regulated by CDER" and the excluded biological tests have not been clearly defined. The guidance document mentions in vivo tests and immunoassays as being among the biological assays that the guidance does not cover. A separate 1998 guidance document (on postapproval changes to analytical testing laboratory sites) notes that "Biological tests include animal, cell culture or biochemical based testing that measures a biological, biochemical, or physiological response."10
The final guidance document also includes a peculiar reference to microbiological testing and the inherent variability of biological tests. It refers to microbiological testing and the use of outlier testing with biological tests as given in the USP, even after insisting that the guidance document does not cover biological tests. These statements create the impression that the document was written by a committee whose members did not communicate with each other.
One point that the 2006 guidance did clarify is the types of materials subject to these chemical tests. It states that, "These laboratory tests are performed on active pharmaceutical ingredients, excipients, and other components, in-process materials, and finished drug products ..." This helps to clarify that the guidance document does not apply to the testing of material that is not a drug substance or does not contain the drug substance.
This clarification is important because, although the original OOS problem was assumed to arise from the failure to meet final product specifications, there was disagreement in the industry about whether the problem should be extended to any failure to meet a specification or if certain types of specifications were not affected. The final guidance document makes it clear that any chemical-test–related specification mentioned in submissions to the agency, established by the manufacturer, or found in Drug Master Files and official compendia can produce OOS results that must be reviewed.
The guidance excludes specifications and criteria related to process analytical technology because they do not use single test results to make batch release decisions. Specifications used for in-process checks and adjustments of process parameters are also exempted from OOS considerations.
A need for corrective actions and preventive actions (CAPA) is expressed in a footnote. It notes that good manufacturing practices for devices require CAPA and states that 21 CFR 211.192 contains an implicit requirement for CAPA in response to the OOS result. It suggests that when the draft guidance for quality systems is finalized, the need for CAPA responses will be explicitly recognized.
Contrary to popular belief, the problem with out-of-specification (OOS) test results did not begin with the Barr Decision in 1993, but existed well before then. The problem has been known to and understood by quality control laboratory workers since the 1920s,1 but it was not until the 1990s that poor training in mathematics and a lack of statistical thinking combined to make the problem what it is today.
The US Food and Drug Administration's 1998 draft guidance on investigating OOS test results, and the final guidance issued in 2006, repeat many of the misconceptions stated in the Barr Decision. These misconceptions relate to the appropriate use of averages, retesting, outlier tests, and standard deviation. The biggest problem of all, however, is the requirement that any OOS test result be investigated before any retesting is done to determine whether or not the result arises because of extreme statistical variation.
The final guidance document restricts its application to chemical tests only, which could lead analysts dealing with biological tests to breathe a sigh of relief. In fact, this just creates more confusion and leaves a gap in terms of the regulatory expectations with regard to biological tests.
Given these problems with the existing documents, I believe a revised guidance document may be issued in the future.
Lot release testing is important in the pharmaceutical industry because of the potential for OOS products to harm patients. FDA often cites this as the basis for its concerns about OOS test results. There are many other industries, however, where the improper handling of OOS test results could also cause great harm to customers. It is curious that QC workers in the automotive and aerospace industries are not confronted with equally extensive government regulation of OOS test results. This is an interesting point that the regulators may wish to ponder.
Steven S. Kuwahara, PhD, is a principal consultant at GXP BioTechnology LLC, Sunnyvale, CA, 408.530.9338. email@example.com
1. Shewhart WA, Economic control of quality of manufactured product. New York: Van Nostrand Company; 1931. 50th Anniversary Commemorative Reissue, Milwaukee, WI: ASQ Quality Press; 1980.
2. United States vs. Barr Laboratories, Inc. Civil Action No. 92-1744, US District Court for the District of New Jersey: 812 F. Supp. 458. 1993 US Dist. Lexis 1932; 4 February 1993, as amended 30 March 1993.
3. ISPE and GMP Institute. FDA's summary of Judge Wolin's interpretation of GMP issues contained in the court's ruling in USA vs. Barr Laboratories. 2002 April 16. Available at URL: www.gmp1st.com/barrsum.htm.
4. Kuwahara SS. The Barr Decision, part 1, coping with failing test results. BioPharm 1996 Sept;9(9):24-29.
5. Kuwahara SS. The Barr Decision, part 2, Its impact on outlier tests, averages, and validation studies. BioPharm 1996 Sept;9(9):40–45.
6. CDER, FDA, DHHS. Draft guidance for industry: investigating out of specification (OOS) test results for pharmaceutical production. Bethesda Md.;1998 September.
7. Kuwahara SS. Outlier testing: its history and application. BioPharm 1997 Feb;10(2):64–67.
8. FDA, DHHS. 21 CFR Parts 210 and 211. Current good manufacturing practice: amendment of certain requirements for finished pharmaceuticals. Proposed Rule. FR 61, (87):20103–20115; Bethesda Md.; 1996.
9. CDER, FDA, DHHS. Guidance for industry investigating out of specification (OOS) test results for pharmaceutical production. Bethesda Md.; 2006 Oct.
10. CDER, FDA, DHHS. Guidance for industry: PAC-ATLS: post approval changes - analytical testing laboratory sites. Bethesda Md.; 1998 April.