Equivalence Testing for Comparability

Published on: 
BioPharm International, BioPharm International-02-01-2015, Volume 28, Issue 2

Understanding how change events influence product performance is essential to routine drug development, transfer, and validation.

FDA’s guidance on comparability protocols (1) discusses the need and considerations for assessing any product or process change that may impact safety or efficacy of a drug product or drug substance. Areas to consider may include:

  • Changes to the manufacturing process 

  • Changes to the analytical procedure or analytical method

  • Changes in manufacturing equipment

  • Changes in location or manufacturing facilities

  • Changes to container closure systems

  • Changes in materials, concentrations, and/or formulation

  • Changes in process analytical technology (PAT) or process controls

  • Any change that may influence safety or efficacy of the product.

Hypothesis Testing

Generally, a comparability protocol includes one or more analytical methods, a study design, a representative data set, and associated acceptance criteria. The defined protocol is used to demonstrate comparability. Two types of data analysis are typically used: tests of statistical significance and tests of practical significance, or equivalence. A statistical significance test takes a difference of exactly zero as its null hypothesis. A practical significance test does not assume the difference is zero; instead, it asks whether the difference is small enough to be treated as practically zero. Testing for statistical significance (zero change) often detects real differences that are not practically meaningful, and it can also fail to identify differences that are practically meaningful.


The United States Pharmacopeia (USP) chapter <1033> (2) indicates the preference for equivalence testing over significance testing:

“This is a standard statistical approach used to demonstrate conformance to expectation and is called an equivalence test. It should not be confused with the practice of performing a significance test, such as a t-test, which seeks to establish a difference from some target value (e.g., 0% relative bias). A significance test associated with a P value > 0.05 (equivalent to a confidence interval that includes the target value for the parameter) indicates that there is insufficient evidence to conclude that the parameter is different from the target value. This is not the same as concluding that the parameter conforms to its target value. The study design may have too few replicates, or the validation data may be too variable to discover a meaningful difference from target. Additionally, a significance test may detect a small deviation from target that is practically insignificant.”

Equivalence Testing

Equivalence testing is used when one wants assurance that the means do not differ by too much. In other words, the means are practically equivalent. The analyst sets the threshold difference acceptance criteria for each parameter under test. The means are considered equivalent if the difference in the two groups is significantly lower than the upper practical limit and significantly higher than the lower practical limit. Typically the two one-sided t-test (TOST) is used to demonstrate equivalence once the acceptance criteria have been defined.
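The contrast that USP <1033> draws between a significance test and an equivalence test can be sketched numerically. The example below is a minimal illustration, not a validated procedure: the two scenarios, the practical limits of ±0.15, and the helper names `t_cdf` and `compare` are all assumptions made for this sketch. The Student's t distribution is integrated numerically only to keep the sketch free of dependencies; in practice a statistics package would supply it.

```python
import math

def t_cdf(x, df, steps=2000):
    """Student's t CDF by Simpson's rule on the density (stdlib only)."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda u: coef * (1 + u * u / df) ** (-(df + 1) / 2)
    h = abs(x) / steps
    area = pdf(0.0) + pdf(abs(x))
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    area *= h / 3
    cdf = 0.5 + area if x >= 0 else 0.5 - area
    return min(1.0, max(0.0, cdf))  # guard against tiny numerical overshoot

def compare(mean_diff, sd, n, lpl, upl, alpha=0.05):
    """Return (two-sided significance p-value, TOST equivalence verdict)."""
    se = sd / math.sqrt(n)
    df = n - 1
    # Significance test: null hypothesis is a difference of exactly zero.
    p_sig = 2 * (1 - t_cdf(abs(mean_diff) / se, df))
    # TOST: the difference must be significantly above LPL and below UPL.
    p_lower = 1 - t_cdf((mean_diff - lpl) / se, df)
    p_upper = t_cdf((mean_diff - upl) / se, df)
    return p_sig, max(p_lower, p_upper) < alpha

# Hypothetical scenario A: few, noisy replicates.
p_a, equiv_a = compare(mean_diff=0.05, sd=0.20, n=4, lpl=-0.15, upl=0.15)
# Hypothetical scenario B: many, precise replicates.
p_b, equiv_b = compare(mean_diff=0.03, sd=0.10, n=100, lpl=-0.15, upl=0.15)

print(f"A: significance p = {p_a:.3f}, equivalent = {equiv_a}")
print(f"B: significance p = {p_b:.4f}, equivalent = {equiv_b}")
```

In scenario A the significance test finds no difference, yet equivalence is not demonstrated because the data are too few and too variable. In scenario B the significance test flags a difference, yet that difference sits well inside the practical limits.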

Setting acceptance criteria for an equivalence test
Response parameters fall into three groups for the purpose of setting acceptance criteria: those with two-sided specifications (an upper specification limit [USL] and a lower specification limit [LSL]); those with a one-sided specification (an upper limit only or a lower limit only); and those with no specification limits, possibly just a target or set point. Practical differences should be viewed relative to a target or tolerance, or as a function of design margin (3). Acceptance criteria should be risk based (4). Higher risks should allow only small practical differences, and conversely, lower risks should allow larger practical differences. Scientific knowledge, product experience, and clinical relevance should be evaluated when justifying the risk. Another consideration is the potential influence on process capability (parts per million [PPM] failure rate) and/or out-of-specification (OOS) rates. If the product shifted by 10%, 15%, or 20%, for example, what would be the likely difference in OOS rates? Z-scores and area under the curve can be used to estimate the impact on PPM rates. A best practice is to always assess the OOS impact of the difference detected. The risk-based acceptance criteria in Table I are not absolutes; however, they are typical.

Table I: Risk-based acceptance criteria (high risk, medium risk, low risk).
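The z-score and area-under-the-curve estimate of OOS impact mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the article's calculation: the specification limits (7 to 8), the centered mean, the standard deviation of 0.15, the normality of the data, and the reading of a "10% shift" as 10% of tolerance are all hypothetical choices, and `oos_ppm` is a name invented for this sketch.

```python
from statistics import NormalDist

def oos_ppm(mean, sd, lsl, usl):
    """Expected out-of-specification rate in parts per million,
    assuming normally distributed results."""
    dist = NormalDist(mean, sd)
    below = dist.cdf(lsl)        # area under the curve below the LSL
    above = 1.0 - dist.cdf(usl)  # area under the curve above the USL
    return (below + above) * 1e6

# Hypothetical process: two-sided specification, centered mean.
LSL, USL = 7.0, 8.0
TARGET, SD = 7.5, 0.15
tolerance = USL - LSL

baseline = oos_ppm(TARGET, SD, LSL, USL)
print(f"no shift: {baseline:10.1f} PPM")

# Assumed interpretation: shift expressed as a percentage of tolerance.
rates = {}
for pct in (10, 15, 20):
    shifted_mean = TARGET + pct / 100 * tolerance
    rates[pct] = oos_ppm(shifted_mean, SD, LSL, USL)
    print(f"{pct}% shift: {rates[pct]:10.1f} PPM")
```

Comparing the shifted rates with the baseline gives a direct view of how much additional OOS risk a given practical limit would tolerate.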

USP <1033> agrees with a risk-based approach and the impact to OOS rates. The USP chapter states (2):

“The validation target acceptance criteria should be chosen to minimize the risks inherent in making decisions from bioassay measurements and to be reasonable in terms of the capability of the art. When there is an existing product specification, acceptance criteria can be justified on the basis of the risk that measurements may fall outside of the product specification.”

Conducting an equivalence test
The TOST approach to equivalence (Figure 1) is commonly used to demonstrate comparability. Two one-sided t-tests are constructed, and if both tests reject their null hypotheses, there is no practical difference and the results are considered comparable for that parameter. The mean is considered to be within the equivalence window, where there is no practical difference in performance. For parameters with a one-sided specification, such as impurities or purity, the acceptance criteria may not be a uniform distance from zero, because the risk from impurities lower than baseline is not the same as the risk from impurities higher than baseline. Equivalence is not just a window test of whether the observed difference falls inside the window; it accounts for key sources of variation, such as analytical and process error, to assure that the difference is significantly within the window. The difference must be significantly higher than the lower practical limit and significantly lower than the upper practical limit. A difference that is inside the window but not significantly within it may indicate excessive variation and/or insufficient sample size or power to demonstrate equivalence. Confidence intervals are also a best practice (Figure 2) and should be included in any equivalence test report.

Figure 1: Two one-sided t-test.


Figure 2: One-sided confidence intervals.




Application of equivalence tests to study designs

Anytime a statistical test is used, an equivalence alternative may be possible or even preferred. The following are study designs where equivalence testing is an option:

  • Comparison to a reference standard or target

  • Comparison between two groups

  • Comparison between n groups

  • Repeated measures or paired t-tests

  • Multiple factor equivalence testing

  • Comparison of slopes for stability 

  • Comparison of intercepts

  • Comparisons of curve parameters (linear or sigmoidal).

For the purposes of this paper, one example will be presented: comparison to reference or standard. The logic is similar for each study design.

Equivalence testing comparing performance to a standard
The following is the procedure for conducting an equivalence test against a standard:

1. Select the standard to be used in the comparison and assure the standard value is known.

2. Determine the upper and lower practical limits within which deviations are considered to be practically zero. Consider risk and the three groups of response parameters when setting practical limits. In this example, the parameter is pH and the risk is medium; therefore, a difference of 15% of tolerance was selected. The upper specification limit is a pH of 8 and the lower specification limit is a pH of 7, so the tolerance is 1 pH unit, the lower practical limit (LPL) is -0.15, and the upper practical limit (UPL) is 0.15.

3. Determine the power and sample size needed for the study design. A sample size calculator for a single mean (difference from standard) helps ensure sufficient sample size and power (5). For this example, the minimum sample size is 13, and a sample size of 15 was selected (2 over the minimum). Note that alpha is set to 0.1: 5% for one side and 5% for the other side (6). The sample size formula for one-sided tests is $n = (t_{1-\alpha} + t_{1-\beta})^2 (s/\delta)^2$, where $s$ is the estimated standard deviation and $\delta$ is the difference to be detected (Figure 3).


Figure 3: Sample size and power.


4. Subtract the measurements from the standard value. Use the differences in the equivalence test.

5. Perform two one-sided t-tests (Figure 4). The two hypothesized values are the lower practical limit (-0.15 pH from the standard) and the upper practical limit (0.15 pH from the standard).


Figure 4: Two one-sided t-tests for equivalence.


6. Calculate a p-value for each of the two tests, one against the UPL and one against the LPL (Figure 5).


Figure 5: p-value equations for upper and lower practical limits.


7. If both p-values are significant (<0.05), the results are considered practically equivalent.

8. Draw conclusions of equivalence. Make sure the scientific rationale for the risk assessment and the associated limits is documented.

9. Failure to demonstrate equivalence requires a proper root-cause analysis to determine why the instruments, probes, or methods are not correctly measuring the standard value within the risks and practical limits indicated.

10. It is not appropriate to change the acceptance criteria until the protocol passes equivalence and then set the passing limits as the acceptance criteria. This practice is not using a risk-based approach correctly and biases the statistical procedure.
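The calculation steps of the procedure above (steps 2 and 4 through 7) can be sketched for the pH example. This is a minimal illustration rather than the article's own analysis: the standard value of 7.50 and the 15 readings are invented for the sketch, `t_cdf` is a helper written here so the block needs no external library, and the sign convention (reading minus standard) is an assumption that does not affect the verdict when the limits are symmetric.

```python
import math
import statistics

def t_cdf(x, df, steps=2000):
    """Student's t CDF by Simpson's rule on the density (stdlib only)."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda u: coef * (1 + u * u / df) ** (-(df + 1) / 2)
    h = abs(x) / steps
    area = pdf(0.0) + pdf(abs(x))
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    area *= h / 3
    cdf = 0.5 + area if x >= 0 else 0.5 - area
    return min(1.0, max(0.0, cdf))  # guard against tiny numerical overshoot

# Step 1: standard value (hypothetical; the article does not state it).
STANDARD_PH = 7.50

# Step 2: practical limits as 15% of tolerance (medium risk).
LSL, USL = 7.0, 8.0
upl = 0.15 * (USL - LSL)
lpl = -upl

# Step 3: 15 readings of the standard (hypothetical data).
readings = [7.52, 7.48, 7.55, 7.50, 7.47, 7.53, 7.51, 7.49,
            7.56, 7.46, 7.54, 7.50, 7.52, 7.48, 7.53]

# Step 4: differences between each reading and the standard.
diffs = [r - STANDARD_PH for r in readings]
n = len(diffs)
mean_diff = statistics.mean(diffs)
se = statistics.stdev(diffs) / math.sqrt(n)
df = n - 1

# Steps 5 and 6: two one-sided t-tests and their p-values.
p_lower = 1 - t_cdf((mean_diff - lpl) / se, df)  # H0: difference <= LPL
p_upper = t_cdf((mean_diff - upl) / se, df)      # H0: difference >= UPL

# Step 7: equivalence requires both p-values below 0.05.
equivalent = max(p_lower, p_upper) < 0.05
print(f"mean difference = {mean_diff:.4f}, limits = [{lpl:.2f}, {upl:.2f}]")
print(f"p (LPL) = {p_lower:.3g}, p (UPL) = {p_upper:.3g}, equivalent = {equivalent}")
```

With real data, the same differences would also feed the confidence interval and the OOS impact assessment that the report should include.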


Equivalence testing is a concept every CMC team member needs to be familiar with, and the development team needs the expertise to make sure a systematic, statistically sound, risk-based approach is followed and integrated into comparability protocols. Statistical software with built-in sample size and equivalence testing features makes the design and reporting of results much easier and more reproducible. Including confidence intervals, and calculating and evaluating the PPM failure rates associated with the measured differences, completes the study and provides a meaningful and defensible report of comparability and equivalence.


1. FDA, Guidance for Industry, Comparability Protocols-Chemistry, Manufacturing, and Controls Information (CDER, Rockville, MD, 2003), www.fda.gov/downloads/drugs/ncecomplianceregulatoryinformation/guidances/ucm070545.pdf.
2. USP, USP <1033> Biological Assay Validation (2010).
3. ICH, Q6B, Specifications: Test Procedures and Acceptance Criteria for Biotechnological/Biological Products (ICH, March 1999).
4. ICH Q9, Quality Risk Management (ICH, November 2005).
5. T. Little, BioPharm International 27 (11) (November 2014).
6. NIST, Sample sizes required, Engineering Statistics Handbook, www.itl.nist.gov/div898/handbook/prc/section2/prc222.htm

Article Details

BioPharm International
Vol. 28, No. 2
Pages: 45–48


When referring to this article, please cite it as T. Little, “Equivalence Testing for Comparability,” BioPharm International 28 (2) 2015.