Regulatory authorities expect biopharmaceutical manufacturing facilities to demonstrate that they have an effective and consistent
cleaning process in place. For a multiproduct facility, bench-scale characterization offers a useful and cost-effective means
to support cleaning validation by comparing the cleanability of a new product to a validated one. Because of the challenges
posed by experimental variability in such evaluations, such relative cleanability assessments should be based on a sound statistical
analysis. This article describes the application of a two-one-sided t-test (TOST) method to assess the comparability of two groups of cleanability data generated from a bench-scale study.
An effective cleaning process is critical to ensure that product quality attributes are not compromised by contamination or
carryover through product contact equipment surfaces shared among different lots. Regulatory agencies, therefore, require
all biopharmaceutical manufacturing facilities to establish effective and robust cleaning validation programs.1 Multiproduct facilities can use a worst case–based cleaning validation approach in which cleaning cycles are demonstrated
to be capable of cleaning the most difficult-to-clean product; no further large-scale verification is needed for other products.2–4 Such an approach, however, requires that the cleanability of all new products be compared to the validated worst case. Bench-scale
cleaning studies provide a useful tool to evaluate the relative cleanability of new products and determine the need for revalidation.4–7 In one of our earlier studies, a scale-down model was developed to study the effect of key operating parameters on the performance
of the cleaning process.7 That bench-scale model uses stainless steel coupons spotted with product samples (Figure 1), followed by cleaning under
simulated thermal and chemical conditions that are representative of large-scale cleaning cycles. Product removal from the
coupon surface is visually monitored and the time required to clean the spot is recorded as the cleaning time.
In this study, we apply the bench-scale model to evaluate the relative cleanability of different protein products. Because
of the variability observed in the cleaning times, data points were collected in replicates and the statistical error was
estimated. After multiple cleaning time data points were generated for each product, a robust statistical method was needed
to adequately assess the comparability of these cleaning time distributions. The two-one-sided t-test (TOST) is a commonly used statistical tool for comparability purposes, especially for method transfers between two laboratories,
when the goal is to demonstrate equivalency between the receiving and transferring laboratory. This method is well accepted
by the FDA and is widely used in the industry.8–10 This study applies TOST to compare the cleanability of protein drug products.
A STATISTICAL METHOD
When comparing two or more groups of data, the more common approach is to determine if the difference in group means (a group mean represents the average of all data within the group) is sufficiently large to be declared statistically significant.
The test statement, or null hypothesis, is that the groups are not different. Declaring the difference statistically
significant means that the null hypothesis is rejected: the groups are concluded to represent different distributions of values
and are not equal. In practice, given a sufficiently large sample size, even differences that are too small to be meaningful
may be declared statistically significant.
The opposite cannot be declared, however, when no statistically significant difference is observed: failing to reject the
null hypothesis does not demonstrate that the groups are equal. With the common t-test, one can only reject the null hypothesis, that is, show that the groups are different. This is a limitation when the goal is to show comparability between two or more groups.
An approach widely used in clinical trial statistics and which is gaining popularity in pharmaceutical and biotech settings,
the TOST is a method for declaring comparability or equivalence that is built around comparing the confidence interval
for the difference between two group means against predetermined equivalence limits. If the confidence interval for
the mean difference lies entirely within the predefined equivalence limits, the true difference can be concluded, at the stated
confidence level, to lie within those limits as well, thus making it possible to claim equivalency between two data sets. The key goal for the cleanability assessment is to compare
the cleanability of the two products by an equivalency test.
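The TOST logic described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: it assumes a symmetric equivalence limit, a pooled (equal-variance) standard error, and a one-sided significance level of 0.05. The function names `t_cdf` and `tost` are hypothetical, the t distribution is evaluated by simple numerical integration so that only the standard library is needed, and the cleaning times are made-up values.

```python
import math
import statistics


def t_cdf(x, df, steps=2000):
    """Student t CDF via Simpson's rule (stdlib-only stand-in for a stats library)."""
    coef = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(df * math.pi)

    def pdf(v):
        return coef * (1 + v * v / df) ** (-(df + 1) / 2)

    a = abs(x)
    h = a / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3  # probability mass between 0 and |x|
    return 0.5 + area if x >= 0 else 0.5 - area


def tost(group_a, group_b, limit, alpha=0.05):
    """Two one-sided t-tests with a pooled variance (assumes equal variances).

    Returns (p_value, equivalent). `limit` is the symmetric equivalence limit,
    in the same units as the data.
    """
    n_a, n_b = len(group_a), len(group_b)
    diff = statistics.mean(group_a) - statistics.mean(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * statistics.variance(group_a)
                  + (n_b - 1) * statistics.variance(group_b)) / df
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))

    # H0a: diff <= -limit, rejected when diff is far enough above -limit.
    p_lower = 1 - t_cdf((diff + limit) / se, df)
    # H0b: diff >= +limit, rejected when diff is far enough below +limit.
    p_upper = t_cdf((diff - limit) / se, df)

    # Equivalence is claimed only if BOTH one-sided tests reject.
    p_value = max(p_lower, p_upper)
    return p_value, p_value < alpha


# Made-up cleaning times (minutes), for illustration only; not study data.
validated = [10.0, 11.0, 12.0, 11.5, 10.5]
new_product = [10.2, 11.1, 11.9, 11.4, 10.6]
p, equivalent = tost(new_product, validated, limit=2.0)
print(f"TOST p-value = {p:.4f}, equivalent = {equivalent}")
```

Rejecting both one-sided hypotheses at a significance level of 0.05 is the same as finding that the 90% two-sided confidence interval for the mean difference falls entirely inside the equivalence limits. In practice, a validated statistics package would normally replace the hand-rolled `t_cdf`.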
Experimental data generated during a cleaning characterization study using the bench-scale model showed that some inherent
variability exists because of the nature of the cleaning process. In addition, analyst and experimental error contribute to
further variability. To adequately establish the predefined equivalence limit, each component contributing to variability
should be considered. If the equivalence limit is set too wide, the resolution of the method may be reduced because it would
be more difficult to distinguish between two products. If the equivalence limit is set too narrow, the results may not
be accurate in assessing whether two products are truly equivalent. For the scale-down cleaning model, an evaluation of the
different components of experimental variability showed that two times the upper 95% confidence limit of the standard deviation
estimate of a controlled data set is adequate to differentiate between the cleanability of two products. Variability in the
controlled data set is one of many potential equivalence limit justifications. Often, when specifications or acceptance criteria
are available, the maximum differences that ensure the capability of meeting these criteria may be used as equivalence limits.
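As a rough sketch of how an equivalence limit of this kind could be computed, the Python below takes a controlled data set, estimates its standard deviation, finds the upper confidence limit of that estimate from the chi-square distribution, and doubles it. Several points are assumptions rather than details given in the text: the confidence limit is treated as one-sided at 95%, the data are assumed to be approximately normal, and the names `chi2_cdf`, `chi2_quantile`, and `equivalence_limit` are hypothetical. The chi-square quantile is found by numerical integration and bisection so that only the standard library is required.

```python
import math
import statistics


def chi2_cdf(x, df, steps=2000):
    """Chi-square CDF by Simpson's rule (assumes df >= 2 so the density is finite at 0)."""
    if x <= 0:
        return 0.0
    log_norm = (df / 2) * math.log(2) + math.lgamma(df / 2)

    def pdf(v):
        if v == 0:
            return 0.5 if df == 2 else 0.0
        return math.exp((df / 2 - 1) * math.log(v) - v / 2 - log_norm)

    h = x / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return total * h / 3


def chi2_quantile(p, df):
    """Invert chi2_cdf by bisection."""
    lo, hi = 0.0, df + 10 * math.sqrt(2 * df) + 10
    for _ in range(60):
        mid = (lo + hi) / 2
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def equivalence_limit(control_times, alpha=0.05, multiplier=2.0):
    """multiplier x upper (1 - alpha) one-sided confidence limit of the SD."""
    df = len(control_times) - 1
    s = statistics.stdev(control_times)
    # Upper limit of sigma: s * sqrt(df / chi-square lower-alpha quantile).
    upper_sd = s * math.sqrt(df / chi2_quantile(alpha, df))
    return multiplier * upper_sd


# Made-up control cleaning times (minutes), for illustration only; not study data.
control = [10.0, 12.0, 11.0, 13.0, 9.0, 10.5, 11.5, 12.5, 9.5, 11.0]
print(f"Equivalence limit = {equivalence_limit(control):.2f} min")
```

With ten replicates, the upper confidence limit is roughly 1.65 times the sample standard deviation, so the resulting equivalence limit is about 3.3 standard deviations. A smaller controlled data set would widen the limit, because the standard deviation is then estimated less precisely.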