Specification Setting: Setting Acceptance Criteria from Statistics of the Data

November 1, 2006
Terry Orchard

BioPharm International, BioPharm International-11-01-2006, Volume 19, Issue 11
Page Number: 40–45

This article shows how Probabilistic Tolerance Intervals of the form, "We are 99% confident that 99% of the measurements will fall within the calculated tolerance limits" can be used to set acceptance limits using production data that are approximately Normally distributed. If the production measurements are concentrations of residual compounds that are present in very low concentrations, it may be appropriate to set acceptance limits by fitting a Poisson or an Exponential Distribution.


The International Conference on Harmonization (ICH) Q6 defines specifications as "a list of tests, references to analytical procedures, and appropriate acceptance criteria which are numerical limits, ranges, or other criteria for the tests described."1 These numerical limits and ranges are for engineering criteria such as dimensions or hardness, performance criteria such as the quantity of drug delivered by a device, and toxicological criteria such as the maximum permitted level of a residual chemical compound. Ideally, the acceptance criteria will be set to meet well-established requirements such as those for product performance. However, we do not always have the necessary information to do this and we are often forced to establish the likely range of acceptable values to be seen in production data.

When reviewing acceptance criteria set using preproduction data, regulators tend to favor the 3-sigma limits that are used for control charts. In practice, we have found that limits set using data from a small number of batches are almost always too tight. Thus, we want to increase the limits to 3.5 or 3.8-sigma to allow for the variability of estimating sigma from small samples. This variability was recognized many years ago, as the following quote from Deming indicates:

Shewhart, in his Statistical Method from the Viewpoint of Quality Control (The Graduate School, Department of Agriculture, Washington, 1939) demonstrates how futile it is to attempt to estimate the magnitude of sigma from samples of 4, but that samples of 100 are excellent, while samples of 1000 give practically perfect results.2

This article describes how to calculate "3-sigma" acceptance criteria for sample sizes up to 200. In addition to these acceptance criteria for data that can be categorized as "Distributed like Normal," a method is described for discrete data that can be described as "Distributed like a Poisson or Exponential."

ACCEPTANCE CRITERIA FOR MEASUREMENTS "DISTRIBUTED LIKE NORMAL"

The measured levels of residual 1,3-diacetyl benzene in rubber seals usually have an approximately Normal distribution. The histogram of the level in 62 batches of seals is shown in Figure 1.

Figure 1. Histogram of 1,3-diacetyl benzene

The 62 values appear to be sampled from a Normal distribution. If we knew that the data are Normally distributed and we knew the mean and standard deviation, we could use a tolerance interval to establish the likely range of acceptable values.

Tolerance intervals are similar to the confidence intervals on the mean that many engineers and scientists use and understand. Confidence limits indicate the range that we expect to contain the mean, whereas tolerance intervals indicate the range that we expect to contain a specified percentage of the population.

If we know the mean (μ) and standard deviation (σ) of a Normal distribution, we can calculate that 2.5% of the distribution will be above μ + 1.96 * σ and 2.5% of the distribution will be below μ – 1.96 * σ. Thus, the lower and upper limits for a 95% tolerance interval are μ – 1.96 * σ and μ + 1.96 * σ.

A two-sided tolerance interval is relevant if we require acceptable values to be less than an upper limit and more than a lower limit. A one-sided tolerance interval is relevant if we require acceptable values to be less than an upper limit or more than a lower limit. A 95% two-sided interval is shown in Figure 2.

Figure 2. A 95% two-sided interval

The one-sided 95% intervals are similar, but we expect 5% of the values to be above the upper limit or 5% below the lower limit. For one-sided 95% intervals, the upper 95% limit is μ + 1.64 * σ and the lower 95% limit is μ – 1.64 * σ.
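For readers working outside Excel, these known-parameter limits are easy to compute. The following Python sketch uses only the standard library; the values μ = 100 and σ = 5 are hypothetical.

```python
from statistics import NormalDist

def tolerance_limits(mu, sigma, coverage=0.95, two_sided=True):
    """Tolerance limits for a Normal distribution with KNOWN mu and sigma."""
    if two_sided:
        z = NormalDist().inv_cdf(1 - (1 - coverage) / 2)  # 1.96 for 95%
    else:
        z = NormalDist().inv_cdf(coverage)                # 1.64 for 95%
    return mu - z * sigma, mu + z * sigma

# Hypothetical example: mu = 100, sigma = 5
lo, hi = tolerance_limits(100, 5)                      # two-sided: (90.2, 109.8)
upper = tolerance_limits(100, 5, two_sided=False)[1]   # one-sided upper: 108.2
```

The probabilistic multipliers introduced below replace the fixed 1.96 or 1.64 when μ and σ must be estimated from a sample.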

If we do not know μ and σ, we must use early production test results to determine if the distribution can be regarded as Normal and to estimate the mean and standard deviation. As was pointed out earlier, estimates of μ and σ calculated from samples are subject to uncertainty due to sampling variability. This uncertainty requires us to adopt a probabilistic approach to the tolerance limits such that we make statements like the following: "We are 99% confident that 99% of the measurements will fall within the calculated tolerance limits." The uncertainty, or lack of confidence, is reflected in sigma multipliers that are larger than would be the case if we knew μ and σ. The increase depends on the sample size. The comments attributed to Shewhart suggest that the effect of sample variability will be close to zero for samples larger than 250.

Methods of calculating probabilistic tolerance limits can be found in the National Institute of Standards and Technology (NIST)/SEMATECH e-Handbook of Statistical Methods.3

The method for a two-sided tolerance interval can be found in a 1969 paper attributed to Howe.4 This author calculates a variable MUL to provide us with a level of confidence of C% that the limits (x mean – MUL * s) and (x mean + MUL * s) will contain D% of the distribution.

The method for calculating probabilistic one-sided tolerance intervals that is set out in the NIST/SEMATECH e-Handbook of Statistical Methods is attributed to Natrella.5 The upper limit providing a level of confidence of C% that D% of the values will be less than it is given by (x mean + MU * s). The lower limit is defined similarly as (x mean – ML * s).

Since the effect of sample variability will be close to zero for samples larger than 250, when constructing Table 1, the aim was to produce values of the sigma multipliers (MUL, ML, and MU) that are around 3.0 when the sample size is 250. Thus, C% was set to 99%, with D% set to 99.25% for the two-sided multipliers and 99.625% for the one-sided multipliers.

Table 1. Two-sided multipliers (MUL) and one-sided multipliers (MU or ML) for sample sizes up to N = 200

We can use the residual 1,3-diacetyl benzene results shown in Figure 1 as an example. Typically, there is an upper specification limit for such residual compounds. The statistical software Minitab was used to test for Normality and to calculate estimates of the mean and standard deviation.

The Anderson-Darling A-squared statistic was 0.24 and the P-value was 0.75. A P-value larger than 0.05 indicates that the distribution is not significantly different from Normal. Thus, we can regard the distribution of the residual 1,3-diacetyl benzene results as Normal.

The one-sided tolerance interval procedure was used because only an upper limit is needed. Table 1 shows the value of MU for a sample size of 62 to be 3.46. The calculation set out below resulted in an upper limit of 460 μg/g:

Mean = 245.7 μg/g

Standard deviation = 61.91 μg/g

Sample size = 62

Sigma multiplier = 3.46

Upper specification limit = 460 μg/g
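The same arithmetic can be reproduced in a few lines of Python, using the figures above:

```python
# Upper one-sided tolerance limit: mean + MU * s,
# with MU = 3.46 read from Table 1 for a sample size of 62.
mean = 245.7          # ug/g
s = 61.91             # ug/g
MU = 3.46
upper_limit = mean + MU * s
print(round(upper_limit))   # 460 ug/g
```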

What can be done when the distribution looks Normal but fails a test?

A similar upper specification limit was required for 1,4-diacetyl benzene, but this time the Anderson-Darling test indicated that the distribution of the sample was significantly different from a Normal.

A plot of the data suggests that the lack of Normality may be partially due to outliers. These are values so extreme that they are unlikely to belong with the other values. Tests for outliers are described in the NIST/SEMATECH e-Handbook.3 Grubbs' test identified two values, 230 and 241, as potential outliers. Both were much larger than the next highest value, 181. The data were reviewed and it was decided that these two values might have been recording errors that should have been 130 and 141. They were thus removed from the data and the Anderson-Darling test was repeated. This showed that the distribution of the 60 values was not significantly different from a Normal. The mean and standard deviation for the specification limit were calculated using the 60 values and an upper limit was calculated.

In this example, removing the outliers was very effective in dealing with a distribution that was significantly different from a Normal. Values flagged as potential outliers should not be removed without a review of the data, particularly when the data are from preproduction runs that may not fully represent the range to be found in future. If it is decided not to remove the outliers, or if no outliers are found, the pressure to produce a specification limit may force us to calculate the mean and standard deviation even though the Anderson-Darling test shows that the distribution is significantly different from a Normal. The resulting specification limit would have additional uncertainty but it could be used temporarily and recalculated when more data become available.

Acceptance limits for the largest and smallest of Normally distributed measurements

Sometimes the rules for accepting batches are something like the following: "The mean of a sample of 30 parts should be between LL and UL and also the largest value should not be more than X and the smallest value not less than Y." For example, the means must be greater than 50 and less than 60 and also no individual part in a sample of 30 can be smaller than 47 or larger than 63.

These rules require acceptance limits for the means and for individual parts. These limits are calculated separately using the tolerance interval approach. The total standard deviation (σT) is used for the individual values and the standard deviation of the means (σM) is used for the means. Multipliers of 3.0 are typically used for both because several hundred parts are measured.

We can use Excel to calculate the expected failure rate for each of the specification limits: NORMDIST(LL, μ, σ, TRUE) for the lower limit and (1 – NORMDIST(UL, μ, σ, TRUE)) for the upper limit. If we use 3-sigma limits for LL and UL, both expected failure rates are 0.0013 (0.13%).

Since there is only one mean per batch, the limits for the means lead to an expected batch failure rate of 0.13%. However, since each part in the sample of 30 has a 0.0013 chance of being above the upper limit, we can calculate that the acceptance limit for individual parts leads to an expected batch failure rate of 3.8%.

1 – (Chance that none of 30 parts are above the upper limit) = 1 – (1 – 0.0013)^30 = 0.038 (or 3.8%).

Likewise,

1 – (Chance that none are below the lower limit) = 1 – (1 – 0.0013)^30 = 0.038.
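This batch failure-rate arithmetic can be checked directly; a minimal Python sketch:

```python
# Each of 30 parts independently has a 0.0013 chance of falling outside
# a 3-sigma limit, so the chance that at least one does is:
p = 0.0013
n = 30
batch_failure = 1 - (1 - p) ** n
print(round(batch_failure, 3))   # 0.038, i.e. 3.8%
```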

To have a batch failure rate of 0.13% based on the individual parts, we need to select values of UL and LL such that:

1 – (ND(UL))^30 = 0.0013 and 1 – (1 – ND(LL))^30 = 0.0013.

In these equations, ND(X) is used to denote the Excel statement NORMDIST(X,μ, σ,TRUE).

The "Solver" function of Excel can be used to find the values of LL and UL. Solver is found under the "Tools" menu of Excel. If Solver does not appear in the Tools menu, look for it in the "Add-ins" menu of Tools.

Table 2 shows the means, standard deviations, and minimum and maximum values for hardness measurements of samples of 30 rubber seals from 21 batches and the 3-sigma acceptance limits.

Table 2. Statistics of Hardness measurements of samples of 30 parts from 21 batches

An Anderson-Darling test indicated that the distribution of all 630 hardness measurements was not significantly different from a Normal.

For the lower acceptance limit (LL), the target cell for Solver contained:

=1 – (1 – NORMDIST(LL, 58.91, 1.649, TRUE))^30 and Solver found that LL = 52.4 made this equal to 0.0013.

For the upper acceptance limit (UL), the target cell for Solver contained:

=1 – (NORMDIST(UL, 58.91, 1.649, TRUE))^30 and Solver found that UL = 65.4, which made this equal to 0.0013.
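Solver is convenient, but because the Normal CDF is monotonic the equations can also be inverted directly. A Python sketch using only the standard library and the Table 2 estimates (mean 58.91, standard deviation 1.649):

```python
from statistics import NormalDist

# Batch fails on the max if any of 30 parts exceeds UL:
#   1 - F(UL)**30 = 0.0013       =>  F(UL) = (1 - 0.0013) ** (1 / 30)
# Batch fails on the min if any part falls below LL:
#   1 - (1 - F(LL))**30 = 0.0013 =>  F(LL) = 1 - (1 - 0.0013) ** (1 / 30)
nd = NormalDist(mu=58.91, sigma=1.649)
q = (1 - 0.0013) ** (1 / 30)
UL = nd.inv_cdf(q)        # ~65.4, matching the Solver result
LL = nd.inv_cdf(1 - q)    # ~52.4
```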

Thus, for the Hardness measurements, the rules for accepting batches would be as follows: "The mean of 30 parts should be between 56 and 62 and also the largest value should not be more than 65.4 and the smallest value not less than 52.4."

Acceptance limits for compounds "Distributed like a Poisson or an Exponential"

The residual chemicals in our rubber seals are typically at very low levels, mostly less than the limit of quantification (LOQ). If the LOQ is 1 µg/g, there will be a large number reported as <1, some reported as 1, a few as 2, and rarely any above 5 or 6. Table 3 contains a typical example. These are the measured concentrations of residual phenanthrene for 443 batches. Figure 3 shows the histogram.

Table 3. Fitting a Poisson distribution to 443 measurements of Phenanthrene

The histogram suggests that a Poisson distribution could provide a useful description of the frequency distribution and the values predicted by a Poisson distribution have been added to Figure 3. The predicted Poisson counts indicate that a probabilistic upper specification limit could be calculated by fitting a Poisson distribution to the data.

The Poisson distribution gives the expected frequencies with which rare events occur 0, 1, 2, etc. times in a sample of fixed time intervals. Since the events are rare, most of the intervals will have no events, some will have 1, a few will have 2, and hardly any will have 5 or 6. Thus, we can use the Poisson distribution by equating the number of μg/g in a sample to the number of events in a time interval.

Figure 3. Residual phenanthrene in 443 batches of seals

The Poisson distribution is given by:

f(x) = e^(–μ) * μ^x / x!

It has only one parameter, μ, which is estimated by the mean of the measurements. The measurements of concentration in μg/g are denoted by x, and f(x) is the proportion of the distribution having the value x. x! denotes the product of the integers from 1 up to x; for example, 3! = 1 * 2 * 3 = 6. The value <1 denotes quantities that are less than the limit of quantification. We set these to 0.

We can verify if the Poisson distribution is an acceptable fit to our data by calculating and summing quantities called chi-squares. These are the squared differences between the observed and expected counts adjusted to be comparable by dividing by the expected count:

Chi-square = (Observed Count – Expected Count)^2 / Expected Count

The expected frequency (or count) is calculated by multiplying f(x) by the sample size (Expected = N * f(x)).

The Poisson percentages can be calculated in Excel using the formula "=POISSON(E118,F134,FALSE)." The cell reference E118 contains an integer μg/g value and F134 the parameter, μ. Excel truncates any values of X that are not integers.

The P-value of the chi-square test can be calculated in Excel by "= CHIDIST(M15,M16)." Cell M15 contains the sum of chi squares and M16 contains the degrees of freedom. The degrees of freedom are two less than the number of chi-square values in the sum. If the P-value is larger than 0.05, the Poisson distribution is a good fit to the data.

If the distribution is not a good fit, a better-fitting distribution can be produced by estimating the value of μ that minimizes the sum of the chi-squares. This can be done iteratively, increasing and decreasing μ in ever smaller steps until the minimum is found, or by using the Solver function of Excel to minimize the cell containing the sum of chi-squares by changing the cell containing μ.

Table 3 shows the counts, probabilities, and chi-squares for 443 residual phenanthrene measurements.

The mean calculated by (292*0 + 116*1 + 25*2 + 8*3 + 2*4) / 443 is 0.447. This gave a sum of chi-squares of 9.95. The P-value for a sum of 8 chi-squares was 0.127, indicating an acceptable fit. Although this is acceptable, Solver was used to find the minimum chi-square estimate of μ. The value of μ was 0.470, the sum of 8 chi-squares was 9.25, and the P-value was 0.160.
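The whole fit can be reproduced with the standard library alone. In this sketch the observed counts for 0 to 4 μg/g are taken from Table 3, and zero counts are assumed for 5 to 7 μg/g (eight categories, matching the sum of 8 chi-squares):

```python
import math

observed = [292, 116, 25, 8, 2, 0, 0, 0]    # counts for 0..7 ug/g
N = sum(observed)                            # 443 batches
mean = sum(x * n for x, n in enumerate(observed)) / N   # ~0.447

def poisson_pmf(x, mu):
    return math.exp(-mu) * mu ** x / math.factorial(x)

expected = [N * poisson_pmf(x, mean) for x in range(len(observed))]
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))   # ~9.95

# Chi-square survival function has a closed form for even df (df = 8 - 2 = 6)
def chi2_sf_even(x, df):
    return math.exp(-x / 2) * sum((x / 2) ** j / math.factorial(j)
                                  for j in range(df // 2))

p_value = chi2_sf_even(chi_sq, 6)    # ~0.127
```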

Since the chi-square calculation divides the squared difference by the expected frequency, there may be one or two very large chi-squares in the tail where the frequencies are small. This could lead to a rejection of the goodness of fit even though the distribution fits well for most of the measurements. If the fit looks acceptable for the majority, we can group all the last few observed and expected counts to produce a chi-square for the group. Since we lose degrees of freedom when we group to achieve a better fit, the smaller sum of chi-squares that results could still have a P-value that is less than 0.05. For the Phenanthrene measurements we can see large chi-squares for 3 and 4 µg/g. Grouping the observed and expected counts for 2 to 7 µg/g produced a chi-square of 0.12. The sum of the resulting three chi-squares was 1.28 and the P-value was 0.26.

Having fitted a distribution, the estimated upper specification limit is the value associated with a cumulative percentage of 99.9%. This is roughly equivalent to that of a μ + 3 * σ upper limit for a Normal distribution.

Based on these 443 residual phenanthrene measurements, the upper specification would be set at 4 μg/g.
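That limit can be confirmed by accumulating the fitted Poisson probabilities until 99.9% is reached; a standard-library sketch:

```python
import math

# Smallest x whose cumulative fitted-Poisson probability reaches 99.9%
mu = 0.447          # fitted mean for the phenanthrene data
cum, x = 0.0, 0
while True:
    cum += math.exp(-mu) * mu ** x / math.factorial(x)
    if cum >= 0.999:
        break
    x += 1
print(x)    # 4 ug/g, the upper specification limit
```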

When the histogram of measurements shows a long tail to the right with a few values as high as 10 or 12 μg/g, an exponential distribution may be an appropriate alternative to a Poisson.

The exponential distribution is given by:

f(x) = λ * e^(–λx)

In this equation, x denotes the observed measurement values, which must be greater than 0. The parameter, λ, is estimated by the reciprocal of the mean of the measurements (λ = 1 / mean).

The exponential distribution is typically used to estimate the expected units of time that occur between consecutive occurrences of an event, for example the number of days between breakdowns of a machine. In our application of the distribution, we replace units of time with units of concentration in μg/g. An exponential distribution is fitted to the data in a similar way to the Poisson. In Excel, the function EXPONDIST(D87, L96, FALSE) is used to calculate the percentage in the interval centered on the value in cell D87 when the value of λ is in cell L96. Since the exponential is fitted to the mid-point of intervals, for example 1 is the center of 0.51 to 1.50, the value <1 is replaced by 0.3.

The cumulative percentage for a value of x is given by (1 – e^(–λx)), and so the value of x corresponding to a required cumulative percentage (P) is x = –ln(1 – P) / λ. Thus, a 99.9% upper limit can be calculated as 6.91 / λ. Since λ is estimated by the reciprocal of the mean of the measurements, this is 6.91 times the mean.
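As a numeric check of the 6.91 factor, assuming a hypothetical mean of 1.5 μg/g:

```python
import math

mean = 1.5                            # hypothetical sample mean in ug/g
lam = 1 / mean                        # exponential rate parameter
upper = -math.log(1 - 0.999) / lam    # 99.9% limit = 6.91 * mean
print(round(upper, 2))                # 10.36
```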

More information about the Poisson and Exponential distributions can be found in the NIST/SEMATECH e-Handbook of Statistical Methods.3 Information about Solver, POISSON, and EXPONDIST is provided in Excel's Help option. Minitab provides the option to add an exponential distribution plot to a histogram.

REFERENCES

1. US Food and Drug Administration. International Conference on Harmonization; Guidance on specifications: Test procedures and acceptance criteria for biotechnological/biological products. Federal Register; 1999 Aug 19 [cited 2006 October 13];64(159):44934. Available from: www.fda.gov/cber/gdlns/ichtest.pdf

2. Deming WE. Some theory of sampling. NY: Dover Publications; 1966.

3. NIST/SEMATECH e-handbook of statistical methods. 2003 June 1 (updated 2006 July 18) [cited 2006 October 13]. Available from: www.itl.nist.gov/div898/handbook.

4. Howe WG. Two-sided tolerance limits for normal populations-some improvements. J of Am Statistical Assoc 1969;64:610-20.

5. Natrella MG. National Bureau of Standards handbook 91: Experimental statistics. Washington, DC: US Department of Commerce; 1963.

Excel procedures to apply all these methods can be obtained from terry.orchard@bespak.com

Terry Orchard is a statistician at Bespak Europe, Blackhill Dr., Featherstone Road, Wolverton Mill South, Milton Keynes, Buckinghamshire, MK12 5TS, England, + 44 (0)1908.552600, fax +44(0)1908.552613, terry.orchard@bespak.com

APPENDIX

Calculation of two-sided multipliers (MUL) by Howe's method and one-sided multipliers (MU/ML) by Natrella's method

Howe's method calculates a variable MUL to provide us with a level of confidence of C% that the limits (x mean – MUL * s) and (x mean + MUL * s) will contain D% of the distribution.

The sigma multiplier MUL is given by

MUL = sqrt( (N – 1) * (1 + 1/N) * (Z(1–D)/2)^2 / X²(N–1),C )........... [1]

in which N is the number of values in the sample, Z(1–D)/2 is the critical value of the Normal distribution that is exceeded with probability (1 – D)/2, and X²(N–1),C is the critical value of the chi-square distribution with (N – 1) degrees of freedom that is exceeded with probability C%.

Equation [1] looks complex, and the statement, "We are 99% confident that 99.25% of the measurements will fall within the calculated tolerance limits" requires a bit of thought, but it is not difficult to calculate MUL using Excel. For example, if cell J24 contains N, if C = 99% and D = 99.25%, the value of MUL is given by

= SQRT((J24 – 1)*(1 + (1 / J24))* (NORMINV(0.00375,0,1)^2) / (CHIINV(0.99,J24 – 1)))........... [2]
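Outside Excel, equation [1] can be evaluated with Python's standard library. Since the standard library has no chi-square inverse, this sketch substitutes the Wilson-Hilferty approximation for the chi-square critical value, which is very accurate for the sample sizes discussed here:

```python
from math import sqrt
from statistics import NormalDist

def howe_mul(n, c=0.99, d=0.9925):
    """Two-sided tolerance multiplier MUL, equation [1] (Howe's method)."""
    z_d = NormalDist().inv_cdf((1 - d) / 2)     # Z exceeded with prob (1-D)/2
    z_c = NormalDist().inv_cdf(1 - c)           # lower-tail Normal value for C
    df = n - 1
    h = 2 / (9 * df)
    chi2_c = df * (1 - h + z_c * sqrt(h)) ** 3  # Wilson-Hilferty: chi-square
                                                # value exceeded with prob C
    return sqrt(df * (1 + 1 / n) * z_d ** 2 / chi2_c)

# howe_mul(250) is close to 3.0, consistent with the choice of C and D;
# smaller samples give larger multipliers.
```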

Some values of the sigma multipliers calculated by Howe's method appear in Table 1. The entries in Table 1 were calculated with C set to 99% and D set to 99.25%.

These values of C and D were selected by noting that the usual practice for specifications calculated from the mean (x mean) and standard deviation (s) is to set the limits at +/– 3 times the standard deviation either side of the mean. C and D were thus selected so that the two-sided tolerance interval limits would be (x mean – 3 * s) and (x mean + 3 * s) when the sample size is around 250.

Natrella's method for probabilistic one-sided tolerance intervals calculates variables MU and ML to provide us with a level of confidence of C% that D% of the values will be less than an upper limit of (x mean + MU * s) or that D% of the values will be more than a lower limit of (x mean – ML * s). The calculations are the same for MU and ML:

MU = ( Z_D + sqrt( Z_D^2 – a * b ) ) / a........... [3]

a = 1 – Z_C^2 / (2 * (N – 1))........... [4]

b = Z_D^2 – Z_C^2 / N........... [5]

in which N is the number of values in the sample, Z_D is the critical value of the Normal distribution that is exceeded with probability (1 – D), and Z_C is the critical value of the Normal distribution that is exceeded with probability (1 – C).

MU can be calculated using Excel. For example, if cell B4 contains N, if C = 99% and D = 99.625%, the values of a and b are given by:

a = 1 – (NORMINV(0.99,0,1))^2 / (2*(B4 – 1))........... [6]

and

b = (NORMINV(0.99625,0,1))^2 – (NORMINV(0.99,0,1))^2 / B4........... [7]

If the result of a appears in cell B5 and that of b appears in cell B6, the value of MU is given by

= (NORMINV(0.99625,0,1) + SQRT((NORMINV(0.99625,0,1))^2 – B5*B6)) / B5........... [8]
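Equations [3] to [5] translate directly into Python (standard library only):

```python
from math import sqrt
from statistics import NormalDist

def natrella_mu(n, c=0.99, d=0.99625):
    """One-sided tolerance multiplier MU (= ML), equations [3]-[5]."""
    z_d = NormalDist().inv_cdf(d)
    z_c = NormalDist().inv_cdf(c)
    a = 1 - z_c ** 2 / (2 * (n - 1))
    b = z_d ** 2 - z_c ** 2 / n
    return (z_d + sqrt(z_d ** 2 - a * b)) / a

# natrella_mu(62) gives 3.46, the multiplier used in the worked example,
# and natrella_mu(250) is close to 3.0.
```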

Values for some sample sizes with C = 99% and D = 99.625% appear in Table 1. These values of C and D were selected so that the one-sided tolerance interval limits would be (x mean – 3 * s) and (x mean + 3 * s) when the sample size is around 250.

Two-sided multipliers (MUL) are calculated by Howe's method with C = 99% and D = 99.25% (in equation 1).

One-sided multipliers (MU / ML) are calculated by Natrella's method with C = 99% and D = 99.625% (in equations 3, 4, and 5).