Statistical Approaches for Determining Comparability of Biosimilars

BioPharm International, August 2021 Issue, Volume 34, Issue 8
Pages: 42–45

Correct organization and appropriate methods for demonstrating biosimilar comparability are important for supporting regulatory filings.

Biosimilars go through the same rigorous evaluation of safety and efficacy as any other biological drug product or drug substance. Because there is prior knowledge and a grounded scientific understanding of the reference molecule, demonstrating comparability of the biosimilar molecule to the reference molecule may speed up and simplify the regulatory requirements for the biosimilar. Data requirements are determined on a case-by-case basis in consultation with FDA and/or other health authorities (1).

Comparability is multi-faceted and typically includes clinical trials (pharmacokinetic [PK] profiles), animal studies, analytical comparability (functional and/or structural) of the molecule, analytical method comparability where possible, process comparability (lot-to-lot), and in-process controls/monitors. In all cases, comparability is assessed between the reference drug/molecule and the biosimilar under evaluation.

Comparability for biosimilars is risk-based (2), and the level and complexity of the comparison are commensurate with the risk to safety and efficacy. Following the same risk-based approach, the statistical methods for demonstrating comparability can likewise be grouped into three tiers.

Three-tiered approach to comparability

It is useful to group comparability methods into three distinct categories. Tier one is the most rigorous basis of comparison and is typically used for critical quality attributes (CQAs), clinical performance, and structural and functional analytical comparability. Tier one also includes process comparability for CQAs as well as analytical method comparability for CQAs. Tier two is for in-process controls or for other lower priority quality attributes that may be included in the characterization of a molecule. Tier three is for in-process monitors or other visual comparisons when a quantitative assessment is not practical or possible (3).

Tier one

There are two techniques that may be used for determining comparability: 1) equivalence test and 2) K sigma comparison. Minimum sample sizes are three or more lots of the reference molecule and three or more lots of the comparison product. Measuring each reference and comparison lot three to six times will help to improve the understanding of the analytical error of the method. All analytical methods should have been qualified/validated prior to conducting the comparability studies. An equal number of lots in the comparison is recommended but not required. A sample size and power analysis of the comparison study design is generally required to demonstrate that the study design is adequately powered to reliably detect the mean differences used in the comparison. Evaluation of sample uniformity is also desirable but not required.

Tier one: comparability using equivalence test

The word biosimilar indicates that the protein/molecule in the comparability study is not bioidentical. Equivalence testing is used when one wants assurance that the means do not differ by too much. In other words, the means are practically equivalent. A threshold difference (the acceptance criterion) is set for each parameter under test. The means are considered equivalent if the difference between the two groups is significantly lower than the upper practical limit and significantly higher than the lower practical limit. Typically, a two one-sided t-test (TOST) is used to demonstrate equivalence once the acceptance criteria have been defined (4). Statistical software such as SAS/JMP should be used in the equivalence testing.
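The TOST logic described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the article: it assumes a Welch (unequal variance) form of the t-test, treats each value as one lot result, and uses invented data with practical limits of -4 and +4. The names `t_cdf` and `tost` are hypothetical helpers, and the Student-t CDF is integrated numerically only to keep the sketch free of third-party packages; validated statistical software should be used for any real submission.

```python
import math
from statistics import mean, variance


def t_cdf(x, df, steps=2000):
    """Student-t CDF by Simpson's rule (stdlib stand-in for a statistics package)."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(t):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))

    h = abs(x) / steps  # steps must be even for Simpson's rule
    area = pdf(0.0) + pdf(abs(x))
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    half = min(0.5, area * h / 3)  # integral from 0 to |x|, capped at 0.5
    return 0.5 + half if x >= 0 else 0.5 - half


def tost(reference, test, lower, upper, alpha=0.05):
    """Two one-sided Welch t-tests on mean(test) - mean(reference)."""
    n_r, n_t = len(reference), len(test)
    v_r, v_t = variance(reference) / n_r, variance(test) / n_t
    diff = mean(test) - mean(reference)
    se = math.sqrt(v_r + v_t)
    # Welch-Satterthwaite degrees of freedom (unequal variances assumed)
    df = (v_r + v_t) ** 2 / (v_r ** 2 / (n_r - 1) + v_t ** 2 / (n_t - 1))
    p_lower = 1.0 - t_cdf((diff - lower) / se, df)  # H0: diff <= lower limit
    p_upper = t_cdf((diff - upper) / se, df)        # H0: diff >= upper limit
    p_value = max(p_lower, p_upper)
    return {"diff": diff, "p_value": p_value, "equivalent": p_value < alpha}


# Hypothetical lot results for one CQA (illustrative values only).
reference_lots = [100.2, 99.1, 101.3, 98.7, 100.9, 99.8]
biosimilar_lots = [100.9, 101.8, 99.6, 101.1, 100.4, 102.0]
result = tost(reference_lots, biosimilar_lots, lower=-4.0, upper=4.0)
print(result)
```

Equivalence is declared only when both one-sided tests reject, which is why the larger of the two p-values is compared with alpha.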

Setting comparability acceptance criteria

There are three different groups of response parameters that are used in an equivalence test: 1) two-sided specifications (upper specification limit [USL] and lower specification limit [LSL]), 2) one-sided USL only or one-sided LSL only, and 3) no specification limits. Practical differences should be viewed relative to a target, tolerance, or as a function of design margin (5). Acceptance criteria should be risk-based (2); higher risks should allow only small practical differences and, conversely, lower risks should allow for marginally larger practical differences. Scientific knowledge, product experience, and clinical relevance should be evaluated when justifying the risk. The risk-based acceptance criteria shown in Table I are not absolutes; however, they are typical.

The example shown in Figure 1 has a practical limit of four between the means of the reference and comparison CQA. The confidence interval of the mean difference must fall fully inside the practical limits for the two to be considered comparable. Sample size and power for every equivalence test must be determined and reported in the comparability report. The example in Figure 1 indicates that, for this CQA, the reference and the biosimilar are comparable. Equivalence tests account for the difference in means, the variation in the samples, the sample size, and a defined risk level.
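As a rough illustration of the sample size and power reporting mentioned above, the sketch below uses the standard normal approximation to TOST power for symmetric limits. It assumes a known, common standard deviation and equal lot counts per group, so it tends to be optimistic for very small numbers of lots; the function names `tost_power` and `lots_needed` and all input values are hypothetical.

```python
import math
from statistics import NormalDist


def tost_power(n_per_group, sigma, margin, true_diff=0.0, alpha=0.05):
    """Approximate TOST power for symmetric limits (-margin, +margin)."""
    std_normal = NormalDist()
    se = sigma * math.sqrt(2.0 / n_per_group)  # SE of the mean difference
    z = std_normal.inv_cdf(1.0 - alpha)
    power = (std_normal.cdf((margin - true_diff) / se - z)
             + std_normal.cdf((margin + true_diff) / se - z) - 1.0)
    return max(0.0, power)  # the approximation can dip below zero


def lots_needed(sigma, margin, target_power=0.8, max_n=1000, **kwargs):
    """Smallest per-group lot count reaching the target power, or None."""
    for n in range(2, max_n + 1):
        if tost_power(n, sigma, margin, **kwargs) >= target_power:
            return n
    return None


# Illustrative inputs: SD of 1.0 and a practical limit of +/-1.5.
for n in (3, 6, 10):
    print(n, round(tost_power(n, sigma=1.0, margin=1.5), 3))
print("lots per group for 80% power:", lots_needed(sigma=1.0, margin=1.5))
```

An exact calculation based on the noncentral t distribution, as provided by statistical software, would be the natural replacement once the design is fixed.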

Tier one: K sigma means testing

Another acceptable method for showing comparability is based on a K sigma comparison. K sigma means testing is a bit simpler to perform and not as statistically rigorous. It takes the mean of the test article minus the mean of the reference and divides that difference by the reference standard deviation, using multiple lots and measurements per lot. This calculated value is called a z-score, and the absolute value of the z-score is reported as K sigma. Acceptance criteria are normally set at less than or equal to 1.5 K sigma of the reference to demonstrate comparability. Table II shows the same data used in the equivalence test reported in K sigma. The advantage of this approach is that it requires no defined specification limits and uses the sample means and variation to determine comparability.
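The K sigma calculation is simple enough to sketch directly. The example below assumes the reference standard deviation is the sample standard deviation of all supplied reference results (the article does not spell out how lots and replicate measurements are pooled), and the data and function names are invented for illustration.

```python
from statistics import mean, stdev


def k_sigma(reference, test):
    """Absolute z-score of the test mean relative to the reference."""
    z_score = (mean(test) - mean(reference)) / stdev(reference)
    return abs(z_score)


def is_comparable(reference, test, limit=1.5):
    """Apply the K sigma acceptance criterion (default 1.5)."""
    return k_sigma(reference, test) <= limit


# Hypothetical results for one CQA (illustrative values only).
reference_lots = [100.2, 99.1, 101.3, 98.7, 100.9, 99.8]
biosimilar_lots = [100.9, 101.8, 99.6, 101.1, 100.4, 102.0]
print(round(k_sigma(reference_lots, biosimilar_lots), 3),
      is_comparable(reference_lots, biosimilar_lots))
```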


Tier two range test

Tier two range tests are applied to in-process controls and those less critical quality attributes for which we still want to demonstrate comparability, but they do not have the same rigor as is used in the tier one equivalence or K sigma means tests. A formal risk assessment is used to determine whether a tier one, tier two, or tier three approach applies.

The following is the procedure for setting up and checking a range test:

  • Using the reference lots only, fit an appropriate distribution (e.g., normal, gamma, Weibull, etc.)
  • Set limits for the range at either 99% (2.576 K sigma) or 99.73% (3 K sigma) (Figure 2)
  • Using the reference limits, determine the percent of the comparison measurements that fall within the reference limits (Figure 3)
  • Set the acceptance criteria at greater than or equal to 85%, 90%, or 95% based on risk.

Limits set from the reference materials using 3 K sigma are 57.4 and 97.0 and are then applied to the biosimilar to show comparability. Percent actual (Figures 2 and 3) equals the percent of measurements of the biosimilar that are within the range of the reference samples.
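The range test procedure above can be sketched as follows, assuming a normal distribution fits the reference data (a gamma or Weibull fit would need different limit calculations). The helper names and data are hypothetical and are not the values behind Figures 2 and 3.

```python
from statistics import NormalDist, mean, stdev


def k_for_coverage(coverage):
    """Two-sided normal multiplier, e.g. about 2.576 for 99% coverage."""
    return NormalDist().inv_cdf(0.5 + coverage / 2)


def reference_limits(reference, coverage=0.9973):
    """Limits from the reference lots only: mean +/- K * SD."""
    k = k_for_coverage(coverage)
    m, s = mean(reference), stdev(reference)
    return m - k * s, m + k * s


def percent_within(values, low, high):
    """Percent of comparison measurements inside the reference limits."""
    return 100 * sum(low <= v <= high for v in values) / len(values)


# Hypothetical measurements (illustrative values only).
reference = [100.2, 99.1, 101.3, 98.7, 100.9, 99.8]
biosimilar = [100.9, 101.8, 99.6, 101.1, 100.4, 102.0, 104.0]
low, high = reference_limits(reference)        # 99.73%, about 3 K sigma
pct = percent_within(biosimilar, low, high)
acceptance = 85.0                              # risk-based threshold (assumed)
print(round(low, 2), round(high, 2), round(pct, 1), pct >= acceptance)
```

Tightening the acceptance threshold to 90% or 95% for higher-risk attributes only changes the final comparison, not the limits themselves.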

Tier three

Tier three is used for attributes that are simply monitored during the production process or for quality attributes where quantitative analysis is either not possible or not desirable. Tier three simply uses graphical comparison or pictures of the molecular structure, the growth curve, sensor profile, power in a mixer, etc. In most cases a side-by-side graphical comparison or an overlay of the graphical analysis may be used in the demonstration of comparability. For tier three, no acceptance criteria are used; however, it is important to point out visual comparability where areas are similar and/or differences have been visually detected in the comparison. Figure 4 shows a visual comparison of growth curves by location and change in scale.

Putting it all together

It is highly recommended to use a risk-based approach when performing biosimilar studies. Tables III and IV show examples of how that may be performed.

Organizing and reporting comparability becomes clear and concise when the responses are selected, placed in the appropriate tier based on their merit as a CQA or on their influence and uncertainty (risk) for safety and efficacy, and reported with the method of comparison, the acceptance criteria, and a conclusion. A tier three summary table is also recommended to summarize the attributes evaluated and any conclusions.


Conclusion

Demonstration of comparability for biosimilars is a critical element for many drug companies. FDA’s involvement in mapping out the data and study requirements for a specific drug is a critical step in the demonstration of comparability. Once the study areas have been defined and a clear road map of the attributes, sampling plans, sample sizes, and methods of comparison has been established, the productivity of the development team in designing and executing protocols for comparability will be greatly improved. Correct organization and appropriate methods for demonstration of comparability aid both internal and regulatory health authority review of reported values in filings and submissions.


References

1. FDA, “Biosimilars,” accessed July 19, 2021.
2. ICH, Q9 Quality Risk Management, Step 4 version (2005).
3. S. Chow, F. Song, and H. Bai, AAPS J 18, 670–677 (2016).
4. T. Little, BioPharm International 28 (2) 45–48 (2015).
5. ICH, Q6B Specifications: Test Procedures and Acceptance Criteria for Biotechnological/Biological Products, Step 4 version (1999).

About the author

Thomas A. Little, PhD, is president of Bioassay Sciences.



When referring to this article, please cite it as T.A. Little, “Statistical Approaches for Determining Comparability of Biosimilars,” BioPharm International 34 (8) 42–45 (2021).