Demonstrating Comparability of Stability Profiles Using Statistical Equivalence Testing

March 1, 2011
Leslie Sidor|Rick Burdick

BioPharm International

Volume 24, Issue 3

Page Number: 40–45

The authors present an approach for testing statistical equivalence of two stability profiles.


Statistical comparisons are helpful in objectively assessing comparability between a historical (prechange) and new (postchange) manufacturing process, site, formulation, or delivery device. When the objective of the comparison is to demonstrate that the stability profiles (i.e., slopes of a performance attribute over time) of two processes are highly similar, an equivalence approach is recommended. The authors present an approach for testing statistical equivalence of two stability profiles and discuss the underlying concepts, selection of an equivalence acceptance criterion, sampling design considerations, and data analysis.

Regulatory bodies recognize and accept change as a normal part of manufacturing in a cGMP environment. Changes in scale, site of manufacture, manufacturing processes, formulation, and delivery devices are common as products progress through development, to commercialization, and finally, to commercial sustainability. These changes are often made to improve efficiency, to improve process control, or to meet product supply demands and patient needs. Because change is recognized as a necessary aspect of a product's life cycle, regulatory guidance has been developed to ensure that implemented changes have no adverse impact on the product's safety and efficacy.

The guidance in the International Conference on Harmonization Q5E guideline acknowledges that pre- and post-change conditions do not have to be identical, but there must be no negative impact on product safety and efficacy (1). Specifically, the guidance document states the following:

"The demonstration of comparability does not mean that the quality attributes of the pre-change and post-change product are identical, but they are highly similar and that the existing knowledge is sufficiently predictive to ensure that any difference in quality attributes have no adverse impact upon safety or efficacy of drug product."

It has been argued in recent years that statistical tests of equivalence provide the strongest evidence of process comparability. Two recent articles in the biopharmaceutical literature recommend equivalence testing (2, 3). These papers and others have focused on demonstrating average equivalence of two process means. Briefly, the test of average equivalence for means compares the means of two processes (i.e., historical and new). After the data are collected, the difference between the historical and new means is estimated using upper and lower 95% one-sided confidence bounds (i.e., a 90% two-sided confidence interval). Equivalence of the two processes is demonstrated if the resulting confidence interval falls within the range predefined to demonstrate comparable performance of the two processes. This range, based on scientific understanding, is centered symmetrically on a parameter difference of zero and runs from –EAC to +EAC, where EAC stands for equivalence acceptance criterion. The EAC is defined as the largest acceptable difference between the historical and new process means. Figure 1 graphically represents three result scenarios of an equivalence test of means along with the EAC range and the upper and lower confidence bounds on the difference between the two parameters.

For each of the three scenarios in Figure 1, an X represents the estimated difference between the average of the historical process and the new process. The horizontal line through the X represents the length of the confidence interval around the difference. If the interval falls entirely outside the range from –EAC to +EAC, the test of equivalence fails, and in fact a condition of nonequivalence has been demonstrated (see Scenario A). When the interval straddles the EAC, the test is considered inconclusive (i.e., statistical equivalence has neither been proven nor disproven) as shown in Scenario B (4). The test of equivalence passes when the confidence interval around the difference falls completely within the range from –EAC to +EAC. In this case, statistical equivalence has been proven with a type 1 error rate of 5% (see Scenario C).

Figure 1. Result scenarios for an equivalence test. EAC is the equivalence acceptance criterion, and X is the estimated difference between the average of the historical process and the new process. (All figures are courtesy of the authors.)

In the case where the test result is inconclusive (i.e., Scenario B), additional data would improve process understanding. From a statistical perspective, the additional data would shrink the confidence interval with minimal impact on the estimated difference X between the two process parameters. Ultimately, increasing the sample size would produce a conclusive result (i.e., Scenario A or Scenario C).
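The three scenarios in Figure 1 amount to a simple decision rule on the confidence bounds. The following is a minimal sketch of that rule, not code from the authors; the function name and the example intervals are hypothetical and were chosen only to exercise each branch.

```python
def classify_equivalence(lower, upper, eac):
    """Classify a two-sided 90% confidence interval against the range (-eac, +eac).

    lower, upper: confidence bounds on the difference between the two processes.
    eac: equivalence acceptance criterion (a positive number).
    """
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    if eac <= 0:
        raise ValueError("eac must be positive")
    if -eac < lower and upper < eac:
        # Interval lies completely inside the EAC range.
        return "C: equivalence demonstrated"
    if upper < -eac or lower > eac:
        # Interval lies completely outside the EAC range.
        return "A: nonequivalence demonstrated"
    # Interval straddles one (or both) of the EAC limits.
    return "B: inconclusive"


# Hypothetical intervals (not data from the article), with EAC = 1.
for lower, upper in [(1.2, 1.9), (0.6, 1.4), (-0.3, 0.4)]:
    print((lower, upper), classify_equivalence(lower, upper, eac=1.0))
```

An interval that touches a limit exactly is treated here as inconclusive; that boundary convention is an assumption of the sketch.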


The primary purpose of the equivalence test with stability data is to demonstrate that the difference in process slopes is less than the practically important threshold defined by the EAC. The steps employed in a test of average equivalence of slopes with stability data are the same steps used to demonstrate average equivalence of two process means. In particular, the three steps used to perform the average equivalence test of slopes are listed below.

1. Establish the EAC. This is the difference in process slopes thought to be of practical importance.

2. Determine the experimental design and required sample size to ensure acceptable type 1 and type 2 error (Refer to section "Step 2: Study design and determining sample size" for further discussion).

3. Perform the test of equivalence after the data are collected and interpret the results.

An example is now presented to demonstrate each of these three steps.

Step 1: Setting the equivalence acceptance criteria

The example considers a process in which four historical lots have been placed in a stressed temperature condition. In Figure 2, the response is measured over a specified length of time that is the same for each of the "typical" historical lots. The measured response is a purity measurement denoted as Y in percentage units on the vertical axis. The least squares slopes (in percentage of purity per month) were computed individually for each lot and are shown in Figure 2. Note that even for the same process, there is variation among the slopes across lots.
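The per-lot slopes described above are ordinary least-squares fits of the response against time. The sketch below shows that calculation in plain Python. The lot names and purity values are hypothetical placeholders, not the data behind Figure 2, and the time points (0, 2, 4, and 6 months) are an assumption.

```python
def least_squares_slope(times, values):
    """Return the ordinary least-squares slope of values regressed on times."""
    n = len(times)
    if n != len(values) or n < 2:
        raise ValueError("need at least two paired observations")
    t_bar = sum(times) / n
    y_bar = sum(values) / n
    sxx = sum((t - t_bar) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("time points must not all be identical")
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(times, values))
    return sxy / sxx


# Hypothetical purity results (%) at assumed time points; illustration only.
months = [0, 2, 4, 6]
hypothetical_lots = {
    "lot_1": [89.8, 86.7, 83.9, 80.6],
    "lot_2": [89.2, 87.5, 85.4, 84.1],
}

for name, purity in hypothetical_lots.items():
    slope = least_squares_slope(months, purity)
    print(f"{name}: slope = {slope:.3f} % per month")
```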

Figure 2. Four stressed stability lots (A, B, C, and D) from the historical process.

For stability data, an EAC represents the largest acceptable difference between the average slopes of the historical and new processes. It is recommended that the EAC for a specific attribute be derived from the following considerations:

  • Scientific knowledge of the critical quality attributes and the impact of degradation over time

  • Process understanding of stability data from material with clinical exposure

  • Variability among the lot slopes of the historical process (e.g., the historical slopes in Figure 2 range from –1.53% per month for lot A to –0.844% per month for lot D)

In the present example, the EAC is established at ±1% per month. To demonstrate equivalence, the true average slope of the new process cannot differ from the true average slope of the historical process by more than 1% per month.

Step 2: Study design and determining sample size

As noted above, once the EAC is established, the adequacy of the study design to assess comparability must be evaluated. This work is performed before the new process data are collected. Study design and sample-size calculations are critical when conducting a statistical test. As in all statistical investigations, there is a possibility that the conclusions may be incorrect. In the context of an equivalence test, two different errors exist. The first error is a false positive: declaring that two processes are equivalent when they are in fact not equivalent. This is called a type 1 error and represents the risk to the consumer, because stating that the two processes are equivalent when they are not could compromise efficacy or patient safety. The second error is a false negative: declaring that two processes are not equivalent when they are in fact equivalent. This is the type 2 error. The type 2 error represents the manufacturer's risk, because not claiming equivalence when equivalence exists will result in unnecessary delays in the implementation of the change. Table I organizes the declarative statement concerning equivalence relative to the true state of nature with the corresponding error type.

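Table I. Declared outcome of the equivalence test relative to the true state of nature, with the corresponding error type.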

For an equivalence test, the type 1 error is generally set at 5%. This represents the statistical test size and is the reason why the 90% confidence interval is computed in the testing process (5). Once the type 1 error rate is established, the desired type 2 error is obtained by selecting an appropriate number of new lots.

One advantage of the statistical equivalence approach is that the type 1 error rate (i.e., consumer risk) is established at the desired level and is not a function of the total sample size. Thus, the consumer is always protected at the desired level. The type 2 error is controlled through appropriate study design. When an experimental design has an adequate number of lots, the investigator is rewarded for proper experimental planning.

For the example, the type 1 error is set at 5%. In general, if the desired type 1 error is X%, a two-sided (100 – 2X)% confidence interval is required in the equivalence test (e.g., a 90% interval for X = 5). The type 2 error is computed under different study designs for comparing new lots to the four historical lots. The type 2 error is computed assuming the true difference between slopes is 0, and also when the true difference in slopes is 25% of the EAC (0.25% per month) and 50% of the EAC (0.5% per month). Table II lists the type 2 error rates for these different scenarios. These error rates were computed using computer simulation and the estimated variances of the historical data.

Table II. Type 2 error rates for the study designs and true slope differences considered.

Based on the information in Table II, four new lots provided adequate control of the type 2 error with measurements to be taken at 0, 2, 4, and 6 months for each of the four lots.
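A simulation of this kind can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' simulation: it draws the estimated slope difference and the variance estimate directly rather than simulating lot-level data, it covers only the design with four historical and four new lots, and it borrows the slope variance (0.0967) and t quantile (1.717 with 22 degrees of freedom) that the article reports in Step 3. The function name and the number of simulation runs are hypothetical choices.

```python
import math
import random

EAC = 1.0                # % per month, from the example
SLOPE_VARIANCE = 0.0967  # single-lot slope variance reported in Step 3 (assumed true value)
N_HIST = 4               # historical lots
N_NEW = 4                # new lots
DF = 22                  # degrees of freedom reported in Step 3
T_CRIT = 1.717           # t quantile for 22 df, upper 5%, as reported in Step 3


def simulate_type2_error(true_difference, n_sims=20000, seed=1):
    """Estimate the probability of failing the equivalence test.

    true_difference is the true difference in average slopes (% per month).
    The estimate is a type 2 error rate only when |true_difference| < EAC.
    """
    rng = random.Random(seed)
    lot_factor = 1.0 / N_HIST + 1.0 / N_NEW
    true_se = math.sqrt(SLOPE_VARIANCE * lot_factor)
    failures = 0
    for _ in range(n_sims):
        # Estimated difference in average slopes.
        est_difference = rng.gauss(true_difference, true_se)
        # Variance estimate: true variance times chi-square(DF) / DF.
        est_variance = SLOPE_VARIANCE * rng.gammavariate(DF / 2.0, 2.0) / DF
        margin = T_CRIT * math.sqrt(est_variance * lot_factor)
        # Pass only if the whole 90% interval lies inside (-EAC, +EAC).
        passes = (-EAC < est_difference - margin) and (est_difference + margin < EAC)
        if not passes:
            failures += 1
    return failures / n_sims


for delta in (0.0, 0.25, 0.5):
    print(f"true difference {delta:.2f}: type 2 error ~ {simulate_type2_error(delta):.3f}")
```

Evaluating other numbers of new lots would require the matching t quantile and degrees of freedom for each design, which this sketch does not compute.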

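Once the EAC and sample design have been established, it is helpful to represent the EAC graphically. Algebraically, equivalence is demonstrated if

$$-\mathrm{EAC} < (b_{\mathrm{Historic}} - b_{\mathrm{New}}) - \mathrm{ME} \quad \text{and} \quad (b_{\mathrm{Historic}} - b_{\mathrm{New}}) + \mathrm{ME} < +\mathrm{EAC} \tag{1}$$

where bHistoric is the average slope estimate of the historical process lots, bNew is the average slope estimate of the new process lots, and ME is the margin of error for a two-sided 90% confidence interval. The formula for the margin of error depends on the assumed statistical design. A formula is given for one particular case in the numerical example that follows.

Equation (1) can be rewritten as

$$b_{\mathrm{Historic}} - \mathrm{EAC} + \mathrm{ME} < b_{\mathrm{New}} < b_{\mathrm{Historic}} + \mathrm{EAC} - \mathrm{ME} \tag{2}$$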

Figure 3 plots the average slope for the historical process (bHistoric) for the four lots shown in Figure 2 with a y-intercept equal to the average of the y-intercepts across all four lots. The two red lines have the same y-intercept, but with slopes of (bHistoric)–EAC+ME and (bHistoric)+EAC–ME. If the average slope for the new process lots (bNew) falls within the red lines, then Equation 2 is satisfied, and equivalence of average slope is demonstrated. In this example, ME = 0.371% per month, (bHistoric) = –1.13% per month, y-intercept = 89.5%, and EAC = 1% per month. For purposes of Figure 3, ME is estimated using the variance estimates of only the historical data. The final value for ME will include variance estimates from both the historical and new processes.

Figure 3. Average slope of the historical process with graphical representation of equivalence acceptance criteria (EAC). Equivalence is demonstrated if new process average slope falls within red lines.
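The slopes of the two red lines in Figure 3 follow directly from the values quoted above. The short sketch below works through that arithmetic; the function names are hypothetical, the ME used is the preliminary value based on historical data only, and the six-month evaluation point is an assumption made for illustration.

```python
B_HISTORIC = -1.13  # average historical slope, % per month
EAC = 1.0           # equivalence acceptance criterion, % per month
ME = 0.371          # preliminary margin of error, % per month
INTERCEPT = 89.5    # average y-intercept of the historical lots, %


def acceptance_slopes(b_historic, eac, me):
    """Return the (lower, upper) slopes of the acceptance lines."""
    return b_historic - eac + me, b_historic + eac - me


def predicted_purity(intercept, slope, month):
    """Return the value of a straight line at the given month."""
    return intercept + slope * month


lower_slope, upper_slope = acceptance_slopes(B_HISTORIC, EAC, ME)
print(f"acceptance slopes: {lower_slope:.3f} to {upper_slope:.3f} % per month")

# Where each line sits at an assumed six-month time point.
for label, slope in [("lower", lower_slope), ("historical", B_HISTORIC), ("upper", upper_slope)]:
    print(f"{label} line at 6 months: {predicted_purity(INTERCEPT, slope, 6):.2f} %")
```

A new-process average slope between the two printed acceptance slopes would satisfy the rewritten equivalence condition for this preliminary ME.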

Step 3: Computing the equivalence test and interpreting the results

The parameter of interest in the equivalence test is the average difference between a historical process slope and a new process slope. For the example, the EAC is defined as 1% per month. A 90% two-sided confidence interval on the difference in average slopes between the historical and new processes is now computed with samples of lots from the two processes. If this confidence interval fits within the range from –1% per month to +1% per month, then equivalence is demonstrated. This is analogous to Scenario C in Figure 1. The formula for the 90% two-sided interval depends on the underlying model. For this example, the model that best fits the data assumes that lots are random. In this case, the 90% two-sided interval is

$$(b_H - b_N) \pm t_{(n_H + n_N)(T - 1) - 2;\,0.05} \sqrt{\text{Est. Slope Variance} \times \left(\frac{1}{n_H} + \frac{1}{n_N}\right)} \tag{3}$$

where nH and nN are the number of historical and new lots, respectively, and T is the number of time points for each profile. The term "Est. Slope Variance" in the formula is the estimated variance of a slope estimate based on a single lot. This variance is assumed to be equal for the two processes. In this example, bH = –1.13, bN = –1.56, T = 4, nH = nN = 4, Est. Slope Variance = 0.0967, and t22;0.05 = 1.717.

Figure 4. Confidence interval on difference for equivalence test of slopes.

The lower bound of the 90% two-sided confidence interval shown in Equation 3 is 0.05% per month, and the upper bound is 0.81% per month. Because the interval 0.05% per month to 0.81% per month is entirely contained in the range from –1% per month to 1% per month, equivalence has been demonstrated. Figure 4 shows the confidence interval for the difference in average slopes relative to the EAC. The fact that the confidence interval does not include the value 0 implies there is a statistically significant difference between the two slopes with a statistical test size of 0.10 (p-value = 0.0633).
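The reported bounds can be reproduced from the quantities listed for the example. The sketch below assumes the interval takes the form difference ± t × sqrt(Est. Slope Variance × (1/nH + 1/nN)), which matches the reported bounds to two decimal places; the function name is hypothetical, and the t quantile is taken from the article rather than computed.

```python
import math


def slope_difference_interval(b_hist, b_new, n_hist, n_new, slope_variance, t_crit):
    """Return (lower, upper) bounds on the difference in average slopes.

    slope_variance is the estimated variance of a single-lot slope estimate,
    assumed equal for both processes; t_crit is the one-sided 5% t quantile.
    """
    difference = b_hist - b_new
    margin = t_crit * math.sqrt(slope_variance * (1.0 / n_hist + 1.0 / n_new))
    return difference - margin, difference + margin


# Values quoted in the article's example.
lower, upper = slope_difference_interval(
    b_hist=-1.13, b_new=-1.56, n_hist=4, n_new=4,
    slope_variance=0.0967, t_crit=1.717,
)
eac = 1.0  # % per month

print(f"90% interval: {lower:.2f} to {upper:.2f} % per month")
print("equivalence demonstrated:", -eac < lower and upper < eac)
```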

Figure 5. Plots of historical and new process slopes and equivalence acceptance criteria (EAC). EAC is denoted by red lines.


Assessing product and process comparability culminates in the understanding of the consequence of a change as it relates to ensuring the safety and efficacy of products. The information that feeds the understanding includes historical product/process variability, the science of the molecules, and objective comparability acceptance criteria that provide the path forward in the decision making. At the end of the study, the goal is to make the desired change and state with confidence that the historical and new process/product are comparable with no adverse impact on product quality, safety, or efficacy.

The most appropriate methodology when assessing comparability is the use of an equivalence test. The steps for conducting an equivalence test are the same when comparing means or stability slopes. Specifically, one defines the EAC using scientific knowledge and historical process variability, determines an appropriate experimental design to minimize type 1 and type 2 errors, collects the data, and interprets the results. If the EACs are appropriately established, passing an equivalence test provides the strongest statistical evidence of comparability.

Leslie Sidor* is director of quality engineering and Rick Burdick is principal quality engineer, both at Amgen Global Quality Engineering; Darrin Cowley is director of product quality and Brent Kendrick is scientific director, both at Amgen Analytical Sciences.


1. ICH, Q5E Comparability of Biotechnological/Biological Products Subject to Change in Their Manufacturing Process (Geneva, June 2005).

2. D. Chambers et al., Pharm. Technol. 29 (9), 64–80 (2005).

3. C. Chen et al., BioPharm Int. 23 (2), 40–45 (2010).

4. M.J. Chatfield and P.J. Borman, Anal. Chem. 81 (24), 9841–9848 (2009).

5. S. Wellek, Testing Statistical Hypotheses of Equivalence (Chapman and Hall/CRC, Boca Raton, FL, 2003).