Statistical Equivalence Testing for Assessing Bench-Scale Cleanability

February 1, 2010

BioPharm International

Volume 23, Issue 2

Page Number: 40–45

The two-one-sided t-test assesses the equivalency of two data sets.


Regulatory authorities expect biopharmaceutical manufacturing facilities to demonstrate that they have an effective and consistent cleaning process in place. For a multiproduct facility, bench-scale characterization offers a useful and cost-effective means to support cleaning validation by comparing the cleanability of a new product to a validated one. Because of the challenges posed by experimental variability in such evaluations, such relative cleanability assessments should be based on a sound statistical analysis. This article describes the application of a two-one-sided t-test (TOST) method to assess the comparability of two groups of cleanability data generated from a bench-scale study.

An effective cleaning process is critical to ensure that product quality attributes are not compromised by contamination or carryover through product contact equipment surfaces shared among different lots. Regulatory agencies, therefore, require all biopharmaceutical manufacturing facilities to establish effective and robust cleaning validation programs.1 Multiproduct facilities can use a worst case–based cleaning validation approach in which cleaning cycles are demonstrated to be capable of cleaning the most difficult-to-clean product; no further large-scale verification is needed for other products.2–4 Such an approach, however, requires that the cleanability of all new products be compared to the validated worst case. Bench-scale cleaning studies provide a useful tool to evaluate the relative cleanability of new products and determine the need for revalidation.4–7 In one of our earlier studies, a scale-down model was developed to study the effect of key operating parameters on the performance of the cleaning process.7 That bench-scale model uses stainless steel coupons spotted with product samples (Figure 1), followed by cleaning under simulated thermal and chemical conditions that are representative of large-scale cleaning cycles. Product removal from the coupon surface is visually monitored and the time required to clean the spot is recorded as the cleaning time.


In this study, we apply the bench-scale model to evaluate the relative cleanability of different protein products. Because of the variability observed in the cleaning times, data points were collected in replicates and the statistical error was estimated. After multiple cleaning time data points were generated for each product, a robust statistical method was needed to adequately assess the comparability of these cleaning time distributions. The two-one-sided t-test (TOST) is a commonly used statistical tool for comparability purposes, especially for method transfers between two laboratories, when the goal is to demonstrate equivalency between the receiving and transferring laboratory. This method is well accepted by the FDA and is widely used in the industry.8–10 This study applies TOST to compare the cleanability of protein drug products.

Figure 1


When comparing two or more groups of data, the more common approach is to determine whether the difference in group means (a group mean is the average of all data within the group) is large enough to be declared statistically significant. The null hypothesis is that the groups are not different. Declaring the difference statistically significant means that the null hypothesis is rejected: the groups represent different distributions of values and are in fact not equal. In practice, given a sufficient sample size, even differences too small to be meaningful may be declared statistically significant.

The opposite cannot be declared, however, when no statistically significant difference is observed. With the common t-test, one can only reject the null hypothesis, that is, show that the groups are different. This is inconvenient when the goal is to show comparability between two or more groups.

The TOST, an approach widely used in clinical trial statistics that is gaining popularity in pharmaceutical and biotech settings, is a method for declaring comparability or equivalence. It is built around comparing the group means, and the confidence interval of their difference, against predetermined equivalence limits. If the confidence interval of the mean difference falls within the predefined equivalence limits, then the true difference lies within those limits as well, making it possible to claim equivalency between the two data sets. The key goal of the cleanability assessment is to compare the cleanability of two products by such an equivalency test.

Experimental data generated during a cleaning characterization study using the bench-scale model showed that some inherent variability exists because of the nature of the cleaning process. Analyst and experimental error contribute further variability. To adequately establish the predefined equivalence limit, each component contributing to variability should be considered. If the equivalence limit is set too wide, the resolution of the method is reduced because it becomes more difficult to distinguish between two products. If the equivalence limit is set too narrow, products that are truly equivalent may fail the test. For the scale-down cleaning model, an evaluation of the different components of experimental variability showed that two times the upper 95% confidence limit of the standard deviation estimate of a controlled data set is adequate to differentiate between the cleanability of two products. Variability in the controlled data set is one of many potential justifications for an equivalence limit. Often, when specification or acceptance criteria are available, the maximum differences that ensure the capability of meeting those criteria may be used as equivalence limits.


The null hypothesis of the equivalence test states that the means of the cleaning times of two products differ by an amount θ or larger:

|μA – μB| ≥ θ

in which θ is the equivalence limit and μA and μB are the means of the two groups. To test for equivalence, the 90% confidence interval for the difference between the two group means is constructed. The null hypothesis that the groups differ by at least θ is rejected when the limits of the interval fall entirely within the ±θ bounds; comparability is thus demonstrated when the bounds of the 90% confidence interval of the mean difference fall entirely within ±θ, as shown in Figure 2.

Figure 2

Note that the width of the confidence interval increases with smaller sample sizes and with greater variability within each data group. The specifics of the sample size calculation are outside the scope of this article, but a larger sample size naturally results in a narrower confidence interval of the mean difference and hence makes declaring comparability easier. Likewise, although the equivalency test does not explicitly compare each group's variability, a wider variance results in wider confidence intervals, making it more difficult to declare comparability.

The equivalence limit was computed as two times the upper 95% confidence limit of the standard deviation estimate of the controlled data set. For the cleaning experiments, the equivalence limit was equal to 2 × [1.6 × 1.4] = 4.48, in which 1.6 was the standard deviation of the controlled data set (product A) and 1.4 was the multiplier for the upper 95% confidence limit of a standard deviation estimate, based on a sample size of 18.11 Using the upper confidence limit of the standard deviation estimate accounts for the uncertainty of such estimates at a given sample size.
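The 1.4 multiplier follows from the standard chi-square confidence interval for a standard deviation (described in the NIST/SEMATECH handbook cited in the references). As a sketch, assuming Python with SciPy is available (the function name is hypothetical), the multiplier and the resulting equivalence limit can be reproduced as follows:

```python
# Sketch: reproducing the equivalence limit described above.
# The upper one-sided 95% confidence limit of a standard deviation estimate s
# from n samples is s * sqrt((n - 1) / chi2.ppf(0.05, n - 1)), the standard
# chi-square interval for a variance. Assumes SciPy is available.
from math import sqrt
from scipy.stats import chi2

def equivalence_limit(s, n, alpha=0.05):
    """Two times the upper 95% confidence limit of the SD estimate."""
    multiplier = sqrt((n - 1) / chi2.ppf(alpha, n - 1))
    return 2 * s * multiplier

# Controlled data set from the article: s = 1.6, n = 18.
print(round(equivalence_limit(1.6, 18), 2))  # 4.48
```

For s = 1.6 and n = 18, the multiplier evaluates to about 1.40, reproducing the ±4.48 limit used in the article.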

Therefore, the acceptance criterion for equivalency was that the upper and lower confidence limit of the difference between the two means should be within ±4.48. The following two case studies show the application of this statistical approach to comparing the cleanability of different protein drug products.
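The acceptance criterion amounts to a simple interval check. A minimal Python sketch (the function name is hypothetical; the interval values are taken from the two case studies below):

```python
# Minimal sketch of the acceptance criterion: equivalence is declared only
# when the entire 90% confidence interval of the difference between the two
# group means lies within the +/- 4.48 equivalence bounds.

def is_equivalent(ci_lower, ci_upper, theta=4.48):
    """True when the whole CI of the mean difference falls inside +/- theta."""
    return -theta < ci_lower and ci_upper < theta

print(is_equivalent(0.0564, 1.5547))  # case study 2 interval -> True
print(is_equivalent(62.91, 70.36))    # case study 1 interval -> False
```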

Figure 3

Case Study 1: Products A and B are Not Equivalent

Two protein products were cleaned using the bench-scale method. A total of 18 cleaning-time data points were recorded for each product. Commercially available statistical software (JMP) was used to perform the TOST analysis.12 The one-way analysis "Fit Y by X" function was used with a set alpha level (probability of type I error) of 0.1, which corresponds to the 90% confidence interval discussed earlier. Figure 3 shows the distribution of cleaning times for the two products. The box-and-whisker plot (in red) represents the range and distribution of the data points. The box contains the middle 50% of the data, and the line across the middle of the box represents the median of the data set. The difference between the quartiles is the interquartile range. Each box has whiskers that extend from the edge of the box to the outermost data point within the boundary defined by the upper quartile + 1.5*(interquartile range) and the lower quartile – 1.5*(interquartile range).
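The whisker rule described above can be sketched in a few lines of Python, assuming NumPy is available; the cleaning-time values here are hypothetical:

```python
# Sketch of the whisker rule: whiskers extend to the outermost data points
# that fall within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]. Hypothetical cleaning times.
import numpy as np

times = np.array([10.2, 11.5, 9.8, 12.1, 10.9, 11.0, 10.4, 25.0])  # 25.0 is an outlier
q1, q3 = np.percentile(times, [25, 75])
iqr = q3 - q1                         # interquartile range
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = times[(times >= lower_fence) & (times <= upper_fence)]
print(inside.min(), inside.max())     # whisker endpoints (outlier excluded)
```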

Table 1. Upper and lower confidence limits of the difference between two groups as determined using the two-one-sided t-test (TOST)

Table 1 shows the output of the TOST analysis performed using JMP. The difference between the two group means represents the point estimate of the true difference between the means; it is calculated by subtracting the sample mean of data set A from the sample mean of data set B. The standard error (SE) of the difference between the two group means is calculated with the following equation:

SE = √(sA²/nA + sB²/nB)

in which sA is the standard deviation of group A, nA is the sample size of group A, and sB and nB represent the corresponding values for product B. This value provides an estimate of the variability of the difference between the two data sets. The degrees of freedom are adjusted based on the variability of each data set, as determined by the statistical software (JMP) using the Satterthwaite approximation.11 The 90% confidence interval for the difference between the two means is reflected by the upper confidence limit of 70.36 and the lower confidence limit of 62.91 for the difference between the two group means. Because the equivalence limit is ±4.48 and both the upper and lower confidence limits of the difference between the two means fall outside it, it is concluded that products A and B are not equivalent. Based on the average cleaning time and confidence interval, product B is considered more difficult to clean than product A.
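As a sketch of these calculations, assuming Python with NumPy and SciPy and using hypothetical cleaning-time data, the standard error of the mean difference, the Satterthwaite degrees of freedom, and the 90% confidence interval can be computed as follows:

```python
# Sketch: SE of the difference between two group means, Satterthwaite
# (Welch) degrees of freedom, and the 90% confidence interval of the mean
# difference. Cleaning-time data are hypothetical; assumes SciPy is available.
from math import sqrt
import numpy as np
from scipy.stats import t

a = np.array([5.1, 4.8, 5.6, 5.0, 5.3, 4.9])  # cleaning times, product A
b = np.array([5.9, 6.2, 5.7, 6.4, 6.0, 5.8])  # cleaning times, product B

va = a.var(ddof=1) / len(a)
vb = b.var(ddof=1) / len(b)
se = sqrt(va + vb)                    # SE = sqrt(sA^2/nA + sB^2/nB)
# Satterthwaite approximation for the degrees of freedom:
df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
diff = b.mean() - a.mean()            # point estimate of the true difference
half_width = t.ppf(0.95, df) * se     # two-sided 90% interval half-width
print(round(diff - half_width, 2), round(diff + half_width, 2))
```

The interval would then be checked against the ±4.48 equivalence limit, exactly as in the JMP output discussed above.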

In this case study, the products failed to meet cleanability equivalency mainly because of the large difference (66.64 min) in the mean cleaning times, as shown by the blue bar in Figure 2. It is also possible to fail the equivalency test when the two group means are similar but product B has a high degree of variability, resulting in broad confidence intervals such as the one shown by the red bar in Figure 2. In that scenario, the variability in product B should be further evaluated, and the cleanability ranking (B < A or B > A) can be made based on an appropriate risk assessment and business considerations.

Case Study 2: Products A and Y are Equivalent

The TOST analysis, as described in the previous case study, was repeated for two other products. Figure 4 shows the distribution of cleaning times for these two products: A and Y.

Figure 4

Table 2 shows the output of the TOST analysis using JMP. The 90% confidence interval for the difference between the two means is reflected by the upper confidence limit of 1.5547 and the lower confidence limit of 0.0564 for the difference between the two group means. Because both the upper and lower confidence limits fall within the equivalence limit of ±4.48, it is concluded that products A and Y are equivalent in terms of cleanability.

Table 2. Upper and lower confidence limits of the difference between two groups as determined using the two-one-sided t-test


To ensure consistency and adherence, a procedure should be established and analysts should be trained to perform these experiments. Because the method provides relative product cleanability, it is important that each experiment be conducted in a consistent manner. When performing cleaning evaluations to compare new products to the validated worst case, an additional consistency check can be incorporated: the data for a control molecule (e.g., a worst-case product) are compared to the established data set, or "gold standard," generated for the control during the characterization study. The same statistical method, the TOST, can be used for this comparison. For example, an analyst may need to determine the cleanability of new product N relative to validated product W, whose cleanability has been pre-established by prior characterization work. To confirm that the analyst performed the experiment adequately, the TOST can be used to compare the analyst's data for product W to the established data set. Equivalency between the two data sets demonstrates that the experiment was adequate and reliable.


The two-one-sided t-test (TOST) is a statistical method well accepted by the FDA and industry for evaluating the comparability of two groups of data. In the case of a scale-down cleaning evaluation, this statistical approach was applied to determine the relative cleanability of two products. The TOST compares the confidence interval of the difference between two group means to a predefined equivalence limit, which should be established by evaluating the variability involved in such experimental evaluations. As an additional check for analyst consistency, the TOST can be applied to ensure that data obtained from different analysts for a particular product (control molecule) are equivalent.


The authors thank Ed Walls and Erwin Freund (Process Development, Amgen, Inc.) for reviewing this work and providing their valuable suggestions.

Cylia Chen is a senior associate scientist, Nitin Rathore is a senior scientist, and Wenchang Ji is a principal scientist, all in drug product and device development, and Abe Germansderfer is a principal quality engineer, corporate quality, all at Amgen, Inc., Thousand Oaks, CA, 805.313.6393.


1. US Food and Drug Administration. Guidance for industry. Current good manufacturing practice for finished pharmaceuticals. Rockville, MD; 2004. Available from:

2. Sanchez JAM. Equipment cleaning validation within a multi-product manufacturing facility. BioPharm Int. 2006;19(20):38–49.

3. Mollah AH, White EK. Risk-based cleaning validation in biopharmaceutical API manufacturing. BioPharm Int. 2005;18(11):34–40.

4. Sharnez R, Lathia J, Kahlenberg D, Prabhu S. In situ monitoring of soil dissolution dynamics: A rapid and simple method for determining worst-case soils for cleaning validation. PDA J Pharm Sci Technol. 2004;58:203–14.

5. Le Blanc DA. Validated cleaning technologies for pharmaceutical manufacturing. CRC Press LLC;2000.

6. Rathore N, Qi W, Ji W. Cleaning characterization of protein drug products using UV-vis spectroscopy. Biotechnol Prog. 2008;24(3):684–90.

7. Rathore N, et al. Bench scale characterization of cleaning process design space for biopharmaceuticals. BioPharm Int. 2009;22(5):32–45.

8. Chambers D, et al. Analytical method equivalency: an acceptable analytical practice. Pharm Technol. 2005;9:64–80.

9. US Food and Drug Administration. Guidance for industry. Statistical approaches to establishing bioequivalence. Rockville, MD;2001.

10. US FDA. Guidance for industry. Bioavailability and bioequivalence studies for orally administered drug Products—general considerations. Rockville, MD;2003.

11. National Institute of Standards and Technology. NIST/SEMATECH e-Handbook of Statistical Methods. Available from:

12. SAS Institute Inc. JMP statistical discovery from SAS. Release 6. 2006.