Statistical Equivalence Testing for Assessing Bench-Scale Cleanability

The two-one-sided t-test compares the equivalency of two data sets.
Feb 01, 2010
Volume 23, Issue 2


Regulatory authorities expect biopharmaceutical manufacturing facilities to demonstrate that they have an effective and consistent cleaning process in place. For a multiproduct facility, bench-scale characterization offers a useful and cost-effective means to support cleaning validation by comparing the cleanability of a new product to that of a validated one. Because experimental variability poses challenges in such evaluations, relative cleanability assessments should be based on sound statistical analysis. This article describes the application of a two-one-sided t-test (TOST) method to assess the comparability of two groups of cleanability data generated from a bench-scale study.

An effective cleaning process is critical to ensure that product quality attributes are not compromised by contamination or carryover through product contact equipment surfaces shared among different lots. Regulatory agencies, therefore, require all biopharmaceutical manufacturing facilities to establish effective and robust cleaning validation programs.1 Multiproduct facilities can use a worst case–based cleaning validation approach in which cleaning cycles are demonstrated to be capable of cleaning the most difficult-to-clean product; no further large-scale verification is needed for other products.2–4 Such an approach, however, requires that the cleanability of all new products be compared to the validated worst case. Bench-scale cleaning studies provide a useful tool to evaluate the relative cleanability of new products and determine the need for revalidation.4–7 In one of our earlier studies, a scale-down model was developed to study the effect of key operating parameters on the performance of the cleaning process.7 That bench-scale model uses stainless steel coupons spotted with product samples (Figure 1), followed by cleaning under simulated thermal and chemical conditions that are representative of large-scale cleaning cycles. Product removal from the coupon surface is visually monitored and the time required to clean the spot is recorded as the cleaning time.

Figure 1
In this study, we apply the bench-scale model to evaluate the relative cleanability of different protein products. Because of the variability observed in the cleaning times, data points were collected in replicates and the statistical error was estimated. After multiple cleaning time data points were generated for each product, a robust statistical method was needed to adequately assess the comparability of these cleaning time distributions. The two-one-sided t-test (TOST) is a commonly used statistical tool for comparability purposes, especially for method transfers between two laboratories, when the goal is to demonstrate equivalency between the receiving and transferring laboratory. This method is well accepted by the FDA and is widely used in the industry.8–10 This study applies TOST to compare the cleanability of protein drug products.


When comparing two or more groups of data, the more common approach is to determine whether the difference in group means (a group mean is the average of all data within the group) is large enough to be declared statistically significant. The test statement, or null hypothesis, is that the groups are not different. Declaring the difference statistically significant means that the null hypothesis is rejected: the groups are taken to represent two or more different distributions of values and are not equal. In practice, given a sufficiently large sample size, even differences that are too small to be meaningful may be declared statistically significant.
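
The sample-size effect described above can be illustrated with a minimal sketch. The numbers are hypothetical (a 0.5-minute difference in mean cleaning time and a common standard deviation of 2 minutes) and are not taken from the study; the function name is also an assumption made for illustration.

```python
import math

def two_sample_t(mean_diff, sd, n):
    """t statistic for two groups of equal size n that share a common SD."""
    standard_error = sd * math.sqrt(2.0 / n)
    return mean_diff / standard_error

# Hypothetical values: the same small difference, evaluated at growing sample sizes.
for n in (5, 50, 500):
    print(n, round(two_sample_t(0.5, 2.0, n), 2))
```

Because the standard error shrinks with the square root of the sample size, the t statistic for a fixed difference keeps growing as replicates are added. The two-sided 5% critical value is roughly 1.96 for large samples (and larger for small ones), so a difference that is far from significant at n = 5 exceeds it at n = 500.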

The opposite cannot be declared, however, when no statistically significant difference is observed. Using the common t-test, one can only reject the null hypothesis and show that the groups are different; failing to reject it does not show that the groups are the same. This is inconvenient when the goal is to show comparability between two or more groups.

Widely used in clinical trial statistics and gaining popularity in pharmaceutical and biotech settings, the TOST is a method for declaring comparability or equivalence that is built around comparing group means and the confidence interval of their difference against predetermined equivalence limits. If the confidence interval of the mean difference lies entirely within the predefined equivalence limits, the true difference can be concluded, at the stated confidence level, to lie within the limits as well, making it possible to claim equivalency between two data sets. The key goal for the cleanability assessment is to compare the cleanability of the two products by an equivalency test.
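
The TOST logic described above can be sketched in a few dozen lines. This is a minimal illustration and not the authors' implementation: it assumes two independent groups with equal variances (pooled standard deviation), a symmetric equivalence limit, and a 5% significance level for each one-sided test. The cleaning times, the limit of 1.5 minutes, and all function names are hypothetical. Only the Python standard library is used, so the t distribution is evaluated numerically.

```python
import math
import statistics
from typing import NamedTuple, Sequence

def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(t: float, df: int, steps: int = 2000) -> float:
    """CDF by Simpson's rule on [0, |t|], using the symmetry of the density."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if t >= 0 else 0.5 - area

def t_quantile(p: float, df: int) -> float:
    """Quantile for p > 0.5, found by bisection on the CDF."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

class TostResult(NamedTuple):
    diff: float
    ci_low: float
    ci_high: float
    p_value: float
    equivalent: bool

def tost(a: Sequence[float], b: Sequence[float], limit: float,
         alpha: float = 0.05) -> TostResult:
    """Two one-sided t-tests against the limits -limit and +limit (pooled SD)."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * statistics.variance(a)
                  + (n2 - 1) * statistics.variance(b)) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = statistics.mean(a) - statistics.mean(b)
    # Test 1 rejects "diff <= -limit"; test 2 rejects "diff >= +limit".
    p_lower = 1 - t_cdf((diff + limit) / se, df)
    p_upper = t_cdf((diff - limit) / se, df)
    p_value = max(p_lower, p_upper)
    # The matching (1 - 2*alpha) confidence interval of the mean difference.
    half_width = t_quantile(1 - alpha, df) * se
    return TostResult(diff, diff - half_width, diff + half_width,
                      p_value, p_value < alpha)

# Hypothetical cleaning times in minutes for a validated and a new product.
validated = [10.2, 11.1, 9.8, 10.5, 10.9, 10.0]
new_product = [10.6, 10.1, 11.3, 9.9, 10.4, 10.8]
result = tost(validated, new_product, limit=1.5)
print(result)
```

Equivalence is declared only when both one-sided tests reject, which is the same as the 90% confidence interval of the mean difference falling entirely inside the equivalence limits. With the same hypothetical data, tightening the limit to 0.3 minutes would leave the interval extending past the limit, and equivalence could not be claimed.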

Experimental data generated during a cleaning characterization study using the bench-scale model showed that some inherent variability exists because of the nature of the cleaning process. In addition, analyst and experimental error contribute further variability. To adequately establish the predefined equivalence limit, each component contributing to variability should be considered. If the equivalence limit is set too wide, the resolution of the method may be reduced because it would be more difficult to distinguish between two products. If the equivalence limit is set too narrow, the results may not be accurate in assessing whether two products are truly equivalent. For the scale-down cleaning model, an evaluation of the different components of experimental variability showed that two times the upper 95% confidence limit of the standard deviation estimate of a controlled data set is adequate to differentiate between the cleanability of two products. Variability in the controlled data set is one of many potential equivalence limit justifications. Often, when specification or acceptance criteria are available, the maximum differences that ensure the capability of meeting these criteria may be used as equivalence limits.
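
The limit described above, two times the upper 95% confidence limit of the standard deviation of a controlled data set, can be computed as in the following sketch. Two assumptions are made here that the text does not state: the data are treated as normally distributed, so the confidence limit comes from the chi-square distribution, and the limit is taken as one-sided. The control cleaning times and function names are hypothetical.

```python
import math
import statistics

def gamma_p(a, x):
    """Regularized lower incomplete gamma function P(a, x), by power series."""
    if x <= 0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 1
    while term > total * 1e-15:
        term *= x / (a + n)
        total += term
        n += 1
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def chi2_quantile(p, df):
    """Chi-square quantile by bisection on the CDF P(df/2, x/2)."""
    lo, hi = 0.0, df + 10 * math.sqrt(2 * df) + 10
    for _ in range(80):
        mid = (lo + hi) / 2
        if gamma_p(df / 2, mid / 2) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def equivalence_limit(control, confidence=0.95):
    """Two times the one-sided upper confidence limit of the sample SD."""
    df = len(control) - 1
    sd = statistics.stdev(control)
    # Upper limit of sigma: s * sqrt(df / chi-square lower-tail quantile).
    upper_sd = sd * math.sqrt(df / chi2_quantile(1 - confidence, df))
    return 2 * upper_sd

# Hypothetical replicate cleaning times (minutes) from a controlled run.
control = [10.2, 11.1, 9.8, 10.5, 10.9, 10.0, 10.6, 10.1, 11.3, 9.9]
print(round(equivalence_limit(control), 2))
```

With few replicates the upper confidence limit sits well above the sample standard deviation, so the resulting equivalence limit widens as the controlled data set gets smaller.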
