Using Design of Experiments in Validation

Published in: BioPharm International, Volume 18, Issue 5 (May 1, 2005), pages 40–45

Saturated fractional factorial plans cut the number of trials by one-half or more, which saves time and money.

The statistics-based method called Design of Experiments (DoE) has a long history of use in optimizing products and processes. The technique is also well suited to running robustness trials as part of validation. A shift in emphasis from discovery to verification is all it takes. DoE techniques should improve both the efficiency and the effectiveness of trials.

George R. Bandurek, Ph.D.

A thorough validation is essential if a company wants to introduce a new product or process and not spend a year in troubleshooting. DoE is a powerful tool that gives early warning of potential problems. If a process can pass a validation carried out with DoE, then the number of problems in volume manufacture will be reduced.

DoE techniques ordinarily are used to optimize a product or process, where they are an efficient way of identifying factors that improve performance or save money. The methods and statistical tools are described in many books on the topic. Their application to validation is less common but examples in electronics and mechanical engineering have been published.1,2 The pharmaceutical industry probably avoids trying DoE in validation, for two reasons:

  • Regulatory requirements have encouraged people to use tried and tested methods.

  • Some companies prefer to follow the minimum route of Installation Qualification (IQ), Operational Qualification (OQ), and Performance Qualification (PQ), which will satisfy an auditor although it is not a reliable indicator of robustness in service.

FDA has proposed a new approach to validation that may further encourage the use of designed experiments.3 The advantages are simply better results for less work:

  • When compared to traditional one-at-a-time experiments, which force factors to the extremes they are likely to encounter, the number of trials is typically halved and potentially reduced to one tenth.

  • The DoE approach will identify the presence of unwelcome interactions between any two factors, something that one-at-a-time methods will always miss.


A validation test demonstrates that a product or process is fit for its intended purpose. A thorough test is a necessary stage-gate between development and production. Depending on the industry's jargon, it may be called validation, verification, qualification, acceptance, release, or some other name. A receiving company, department, or individual wants a validation that is thorough and complete because this reduces the risk that the product or process is in some way deficient. A failed validation is grounds for rejection and rightly puts the burden of fixing the problem on the supplier.

If validation is incomplete and the product passes despite its hidden faults, then the vendor may have made a sale, but the purchaser will ultimately be dissatisfied and unlikely to buy again. This even applies within a company where a development department creates a process that works in a laboratory but does not provide the same yields when it is scaled up.

The first experimental stage in validation is to show that a process (or product) works as intended when the factors that can affect performance are set to nominal or target levels. Typical factors are running speed, dimensions of a component, strength of a solution, and room temperature. This is the basis of an OQ and is often used to release a design from development to production. The risk is that variation in some of these factors will have a significant effect on performance. A good validation will run all the factors that could affect performance through their ranges of expected values.

A simplistic OQ + PQ approach relies mostly on random, natural variation to allow the factors to travel to their limits. The minimal OQ test is one run at nominal conditions although sometimes a "worst case" combination is also required.4 The PQ phase is to make a minimum of three batches, with the underlying assumption that any natural variation which is ever likely to occur will reveal itself. This is an optimistic view and as an unintended consequence, many products or processes require modifications and adjustments long after they enter service.

A robustness trial as referred to by FDA uses experiment rather than passive observation to see the effects of factors that could affect a product or process.5,6 If significant factors such as temperature, flowrate, or mix-ratio are deliberately forced to their extreme values, then the natural variation, which takes place in a year, can be simulated in a sequence of designed trials. This article focuses on the skills of identifying factors and running the lowest number of trials required to exercise them.


A secondary purpose of validation experiments is to identify the causes of poor performance. This is a symptom of unclear phases in a project. A development phase is the time to identify factors and improve performance. A validation gate at the end of development is the time to demonstrate that the performance meets specification. The expectation in a validation is that the product or process will pass a severe test, and not serve as another development experiment. If this line of thought is taken to its logical conclusion, a validation test becomes a "black box" activity; there is no need to know how a product or process creates an output from its inputs.

Development work is "white box," where it is essential to know how a process works so that it can be optimized. The compromise for validation is often "grey box" experiments, where some internal parameters are measured (in addition to the final output) or the effects of selected factors need to be studied in more detail. When this distinction between development and validation is fully understood, it is possible to make major gains in the efficiency of testing, reducing the number of trials, and simplifying the analysis. If a validation test fails it means that the development work was incomplete. If the results from the validation tests give an indication of the cause, this is a benefit beyond the primary purpose.


Some programs for robustness tests are easy to record in the plan even if they are inefficient. FDA allows the performance of a test at a "worst case" combination.


This may be realistic in some applications such as plastic injection molding, where the worst-case result might be one dimension that is oversize. The process conditions that make it large are generally well understood, and it is quite simple to run the process at the one extreme combination of settings to demonstrate adequate performance. However, even with injection molding the best engineers can make a mistake and miss a significant problem. With many other products or processes it is difficult to predict which combinations will be "worst case" for various elements of a specification.

When there are many factors that could affect performance and the "worst case" combination is unknown, a typical approach is to choose each factor in turn and take it to its expected limits. While one factor is being tested the others are kept at their nominal or standard values. The number of trials is twice the number of factors (assuming they are quantitative variables with a high and a low extreme) plus one trial with everything set to normal. One advantage of this approach is that the effect of each factor can be clearly identified, although this is not the primary purpose of validation. This approach has disadvantages: the number of trials can be high (twenty or thirty is not uncommon), and the effect of each factor is measured in isolation, so interactions can never be found.
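The one-at-a-time plan described above can be sketched in a few lines. This is only an illustration: the function name, factor names, and limits are hypothetical and do not come from any trial in this article.

```python
def one_at_a_time_plan(factors):
    """Build a one-at-a-time plan.

    `factors` maps a factor name to a (low, nominal, high) tuple.
    Returns a list of dicts, one per trial.
    """
    nominal = {name: levels[1] for name, levels in factors.items()}
    trials = [dict(nominal)]  # one trial with everything at nominal
    for name, (low, _, high) in factors.items():
        for extreme in (low, high):
            trial = dict(nominal)
            trial[name] = extreme  # only this factor leaves its nominal value
            trials.append(trial)
    return trials


# Hypothetical factors and limits, for illustration only.
example_factors = {
    "temperature_C": (30.0, 32.5, 35.0),
    "flow_rate_mL_min": (0.8, 1.0, 1.2),
    "mix_ratio": (0.9, 1.0, 1.1),
}

plan = one_at_a_time_plan(example_factors)
print(len(plan))  # twice the number of factors, plus the nominal trial
```

No trial in such a plan moves two factors away from nominal at the same time, which is why this approach cannot reveal an interaction.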

Interactions occur when the effect of one factor depends on another. For example, a variation in humidity within allowed limits may have a small effect on the amount of virus grown in an egg, as might a variation in the incubation time. The higher humidity might increase the amount of virus and the longer time might increase the amount of virus, but the combination of the two may promote the growth of enough mold to prevent a useful harvest from being collected. Validation experiments that only vary one factor at a time will miss this interaction, and it is prohibitively difficult to plan for all possible combinations of all the factors.



When the number of significant factors is small (up to five), it is realistic to use traditional designed experiments such as fractional factorials to investigate the combinations. The number of trials will be between eight and sixteen. The design of the experiments and the associated statistical analysis allows the effects of the individual factors and the interactions between them to be measured. If the validation master plan (VMP) requires this information, then this becomes the natural route to follow. However, if the number of possible significant factors is more than five, then it becomes expensive and time consuming to perform fractional factorial experiments. In this situation some "black box" compromises need to be invoked so a more efficient design of experiments can be selected.
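As an illustration of the upper end of that range, the sketch below builds a 16-run half fraction for five two-level factors. It assumes a common textbook generator (the fifth factor is set from the product of the other four); the article does not say which fractional factorial the author would choose, so treat the generator and function names as assumptions.

```python
from itertools import combinations, product


def half_fraction_five_factors():
    """Return a 16-run half fraction for five factors, levels coded 1 and 2."""
    runs = []
    for a, b, c, d in product((-1, 1), repeat=4):
        e = a * b * c * d  # assumed generator E = ABCD
        runs.append([1 if x == -1 else 2 for x in (a, b, c, d, e)])
    return runs


def all_pairs_covered(design):
    """True if every pair of columns shows all four level combinations."""
    n_factors = len(design[0])
    for i, j in combinations(range(n_factors), 2):
        seen = {(run[i], run[j]) for run in design}
        if len(seen) < 4:
            return False
    return True


design = half_fraction_five_factors()
print(len(design), all_pairs_covered(design))
```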

The essential requirements for DoE for validation are:

  • Quantitative factors such as temperature or flow rate are tested at both extremes of their specifications.

  • Qualitative factors such as reagent or supplier are tested at all available options; typically there are two, three, or four.

  • For any pair of factors, all possible combinations are tested so unwelcome interactions can be found.

  • The analysis reveals if the results are out of specification at any time.

  • The analysis predicts process capability from a measured average and spread.

  • Trials are kept to a minimum.

The secondary, negotiable requirements for DoE for validation are:

  • Interactions between any three factors are tested. Three-way interactions are much rarer than two-factor interactions and additional tests to check for them give diminishing returns.

  • The analysis reveals the magnitude of the effect of each factor; it is only needed for improvement work if validation fails.

  • The analysis reveals which factors have a large interaction; it is only needed if validation fails.


There is an efficient solution to these requirements, going back to 1893, when Jacques Hadamard published a set of mathematical matrices. The classical DoE name for the equivalent forms is saturated fractional factorials, but they are also called Plackett-Burman designs in honor of the two British statisticians who used them in the middle of the last century. In the 1980s they became known as Taguchi arrays when the eponymous Japanese engineer developed his approaches for quality improvement. The Taguchi L12 array (Table 1) is one example, which in our experience is an appropriate design in many different industries.

Table 1. Taguchi L12 Array

The columns in the array represent factors that are controlled for the trials and the digits 1 and 2 are the high and low settings (levels) for the factors. For example, Column 1 may be temperature, with Level 1 as 35°C and Level 2 as 30°C. Column 2 may be the supplier of a reagent with Level 1 as a European manufacturer and Level 2 as a US manufacturer. The array rows give the combinations that must be tested.

The pattern of ones and twos is balanced so that in each column there are six ones and six twos. Where there is a one in a column, for each of the other columns there are three ones and three twos in the corresponding rows. Similarly, where there is a two in one column, there are three ones and three twos in the corresponding rows for each of the other columns. This can be seen most easily if Column 1 is taken as the starting point; column 1 has Level 1 in the first six rows. Each of the other columns has three ones and three twos in these rows. The same is true if any other column is made the starting point. The consequences of this are:

  • The effect of any one factor can be measured, though it may be distorted or confounded by interactions between other factors

  • For any pair of factors, all of the possible combinations (1-1, 1-2, 2-1, 2-2) are tested so that an interaction has the opportunity to reveal itself

  • For any three factors, all the combinations (1-1-1, 1-1-2, 1-2-1, etc.) are tested so that an interaction which depends on all three can be found.
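The balance properties listed above can be checked mechanically. The sketch below does not reproduce Table 1. Instead it assumes the standard 12-run Plackett-Burman construction (a generator row shifted cyclically, plus one constant row), which gives an equivalent array whose row and column order may differ from Taguchi's printed L12. The function names are illustrative.

```python
from itertools import combinations

# Standard generator row for the 12-run Plackett-Burman design.
GENERATOR = "++-+++---+-"


def build_l12():
    """Return a 12 x 11 array of levels (1 or 2)."""
    n = len(GENERATOR)
    rows = [[1] * n]  # first trial: every factor at Level 1
    for shift in range(n):
        # Cyclic shift of the generator; "+" is coded as Level 2 here.
        rows.append([2 if GENERATOR[(j - shift) % n] == "+" else 1
                     for j in range(n)])
    return rows


def columns_balanced(array):
    """Each column holds six 1s (and therefore six 2s)."""
    return all([row[j] for row in array].count(1) == 6
               for j in range(len(array[0])))


def pairs_balanced(array):
    """Every pair of columns shows each of the 4 combinations 3 times."""
    for a, b in combinations(range(len(array[0])), 2):
        pairs = [(row[a], row[b]) for row in array]
        if any(pairs.count(combo) != 3
               for combo in [(1, 1), (1, 2), (2, 1), (2, 2)]):
            return False
    return True


def triples_covered(array):
    """Every set of three columns shows all 8 level combinations."""
    return all(len({(row[a], row[b], row[c]) for row in array}) == 8
               for a, b, c in combinations(range(len(array[0])), 3))


l12 = build_l12()
print(columns_balanced(l12), pairs_balanced(l12), triples_covered(l12))
```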

The L12 array uses 12 trials to obtain information about each of 11 factors, and it checks for interactions between pairs of factors and for any three factors. A one-at-a-time approach needs 22 trials (high and low for each factor), or 23 if a run at nominal settings is included, and gives no information about interactions. The gains in efficiency (number of trials) and effectiveness (information obtained) from this designed experiment are unambiguous. Other arrays and rules for modification allow similar trials to be designed for different numbers of factors and for factors with more than two levels each. We do not have space in this article to cover all possibilities.

Multiple columns of results are created when the twelve combinations are run, for example yield, an impurity level, and surface defects. If all the results are in specification, then the validation has passed the first level of analysis. The average and standard deviation of the results can be used to predict process capability or give a coefficient of variation. They also can set preliminary limits for statistical process control before there is a history of data from production.
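A minimal sketch of this first level of analysis follows. The results and specification limits are hypothetical, and the capability index used (distance from the average to the nearer limit, divided by three standard deviations) is one common definition assumed here; the article does not name a specific index.

```python
from statistics import mean, stdev


def summarize(results, lower_spec, upper_spec):
    """Return average, standard deviation, CV (%) and a capability index."""
    avg = mean(results)
    sd = stdev(results)  # sample standard deviation (n - 1 divisor)
    cv_percent = 100 * sd / avg
    # Assumed capability index: distance from the average to the nearer
    # specification limit, in units of three standard deviations.
    capability = min(upper_spec - avg, avg - lower_spec) / (3 * sd)
    in_spec = all(lower_spec <= r <= upper_spec for r in results)
    return {"average": avg, "sd": sd, "cv_percent": cv_percent,
            "capability": capability, "all_in_spec": in_spec}


# Hypothetical results and limits, for illustration only.
summary = summarize([48.0, 50.0, 52.0], lower_spec=40.0, upper_spec=60.0)
print(summary)
```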

If any of the results are out of specification or the predicted process capability is not acceptable, then the validation test has found a failure. Analysis and interpretation can point to the likely cause:

  • If one half of the results are out of specification, it is likely that one factor, the one where the pattern of ones and twos matches the good and bad results, is the cause of the problem.

  • If three of the results are out of specification, then it may be an interaction between two factors (Is there a 1-1, 1-2, or other combination for two factors that matches?) or it may be the build up of individual effects. Judgement, further statistical analysis, and probably a few more experiments are needed in this situation.

  • If one or two of the results are out of specification, then there may be an interaction between any three of the factors, or it may be that the individual effects of the factors build up to give an extreme result. Again, judgement, further experiments, and analysis are needed.
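The pattern matching described in the first two bullets can be sketched as follows. The array is again built from the standard Plackett-Burman generator (an assumption, since Table 1 is not reproduced here), column and trial indices are zero-based, and the function names are hypothetical.

```python
from itertools import combinations

GENERATOR = "++-+++---+-"  # standard 12-run Plackett-Burman generator row


def build_l12():
    """Return a 12 x 11 array of levels (1 or 2)."""
    n = len(GENERATOR)
    rows = [[1] * n]
    for shift in range(n):
        rows.append([2 if GENERATOR[(j - shift) % n] == "+" else 1
                     for j in range(n)])
    return rows


def single_factor_suspects(array, failed):
    """(column, level) pairs whose level pattern matches the failed trials."""
    suspects = []
    for col in range(len(array[0])):
        for level in (1, 2):
            rows = {i for i, row in enumerate(array) if row[col] == level}
            if rows == failed:
                suspects.append((col, level))
    return suspects


def pair_suspects(array, failed):
    """((col_a, col_b), (level_a, level_b)) combinations matching the failures."""
    suspects = []
    for a, b in combinations(range(len(array[0])), 2):
        for level_a in (1, 2):
            for level_b in (1, 2):
                rows = {i for i, row in enumerate(array)
                        if row[a] == level_a and row[b] == level_b}
                if rows == failed:
                    suspects.append(((a, b), (level_a, level_b)))
    return suspects


l12 = build_l12()

# Hypothetical failure pattern: every trial with column 3 at Level 2 fails.
failed_half = {i for i, row in enumerate(l12) if row[3] == 2}
print(single_factor_suspects(l12, failed_half))

# Hypothetical failure pattern: columns 0 and 1 at Levels 1 and 2 together.
failed_three = {i for i, row in enumerate(l12) if row[0] == 1 and row[1] == 2}
print(pair_suspects(l12, failed_three))
```

More than one pair of factors may match the same three trials, which is one reason the text asks for judgement and further experiments rather than a mechanical verdict.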

If the validation fails, then further development activity is needed to find the major sources of variation and control them. The results from validation trials are only an indication of the likely problem. To make the results more conclusive would require many more trials to be performed for the validation. This is a poor use of resources because it assumes validation will be a failure rather than a plan for success.


Supersaturated arrays are suitable for validation and exhibit even greater efficiency.


An array with only eight trials can be used to investigate 35 factors when a Taguchi approach needs 36 trials and a one-at-a-time method needs 70. The supersaturated array checks for the effect of high and low settings for each factor and for all interactions between pairs of factors. A "pass" from an experiment is still powerful because the effectiveness of the tests is maintained.
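The article does not print the eight-trial array. One construction that is consistent with the numbers quoted, and is offered here only as an assumption rather than as the author's design, is to use every balanced column of eight trials (four at each level) and keep one column from each mirror-image pair. That gives 35 columns, and the sketch checks that every pair of columns still shows all four level combinations.

```python
from itertools import combinations


def build_eight_run_array():
    """Return an 8 x 35 array of levels (1 or 2)."""
    n_trials = 8
    columns = []
    # Choose the 4 trials that get Level 1. Trial 0 is always included so
    # that a column and its mirror image (levels swapped) are not both used.
    for others in combinations(range(1, n_trials), 3):
        level_one_rows = {0, *others}
        columns.append([1 if t in level_one_rows else 2
                        for t in range(n_trials)])
    # Transpose so that rows are trials and columns are factors.
    return [list(row) for row in zip(*columns)]


def all_pairs_covered(array):
    """True if every pair of columns shows all four level combinations."""
    for a, b in combinations(range(len(array[0])), 2):
        if len({(row[a], row[b]) for row in array}) < 4:
            return False
    return True


array8 = build_eight_run_array()
print(len(array8), len(array8[0]), all_pairs_covered(array8))
```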

Poor information about the effects of individual factors is the compromise that pays for efficiency. It is not possible to state definitively which factors or interactions are the cause of a failure if one is found. Supersaturated means the number of degrees of freedom in the experiment is less than is needed to measure the main effects of the factors. Conventional statistical methods for analysis such as Analysis of Variance (ANOVA) and level averages can give misleading results, and so judgement must be used instead of a rigorous mathematical treatment. This is a serious disadvantage if these designs are used for development tests but is an acceptable compromise for validation. The requirement is to show that performance remains in specification while input factors vary within their allowed ranges.

One previously unpublished mildly supersaturated array is based on the L12 array. It has nine columns (factors) with two levels each, and one column that accommodates a factor with four levels (Table 2). In strict mathematical terms, there is one too many degrees-of-freedom to allow the effects of the factors to be separated.

Table 2. L12 Supersaturated Array with 4-level Factor
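The degrees-of-freedom count behind the term "supersaturated" is simple arithmetic: an experiment with n trials provides n - 1 degrees of freedom, and a factor with k levels uses k - 1 of them. A short sketch, with hypothetical function names:

```python
def degrees_of_freedom(n_trials, levels_per_factor):
    """Return (available, required) degrees of freedom for main effects."""
    available = n_trials - 1
    required = sum(levels - 1 for levels in levels_per_factor)
    return available, required


def is_supersaturated(n_trials, levels_per_factor):
    """True if the main effects need more degrees of freedom than exist."""
    available, required = degrees_of_freedom(n_trials, levels_per_factor)
    return required > available


# Standard L12: eleven two-level factors in twelve trials (saturated).
print(degrees_of_freedom(12, [2] * 11))
# Modified L12: nine two-level factors plus one four-level factor.
print(degrees_of_freedom(12, [2] * 9 + [4]))
# Eight trials with 35 two-level factors.
print(degrees_of_freedom(8, [2] * 35))
```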


The company wishes to validate an assay, based on a European Pharmacopoeia methodology, for the quantification of Polysorbate 80 in a tuberculosis (TB) control solution.


The aim of the protocol is to verify the method meets the relevant criteria for limit of detection, linearity, precision, and other properties.

Polysorbate 80 is a mixture of partial esters of various fatty acids, mainly oleic acid, with sorbitol and its anhydrides copolymerized with approximately 20 moles of ethylene oxide for each mole of sorbitol and sorbitol anhydrides. The fatty acid fraction can be of vegetable or animal origin.

After extraction, the TB samples will be analyzed by gas chromatography using palmitic acid as an internal standard. The concentration of Polysorbate 80 in both sample and standard can be determined from the ratio to the internal standard. The control solution matrix samples will be spiked with a 10,000-ppm Polysorbate 80 stock solution to give a polysorbate concentration of 50 ppm. Each sample will then be extracted, dried, and analyzed in triplicate on days 1, 2, 5, or 12 of a twelve-day period with the deliberate variations described in Table 3. The levels for the factors have been chosen to simulate the worst-case variation expected for each quantitative factor, or qualitative differences (operator and matrix).

Table 3. Factors and Levels for Polysorbate 80 Trials

The L12 supersaturated array for 10 factors allows us to challenge all potentially critical steps within the assay in twelve trials. This design will be used to prove the robustness of the assay (Table 4). In the array all the factors are set to Level 1 for Trial 1: the 1,000 ppm standard is TB control matrix, the time to heat is 25 min, the delay before adding BF3-methanol is zero, and so on. The settings take their given levels for the other trials, so for Trial 12 the standard is water, the time to heat is 35 min, and (Factor 10, Level 3) the delay before making the measurement is five days. The acceptance criterion for this part of the study is that the coefficient of variation (CV) must be less than 10%.

Table 4. L12 Supersaturated Array and Results for Polysorbate 80 Trials

The overall average and standard deviation of the 36 measured values indicate acceptance. The standard deviation seen in the trials is likely to be a pessimistic overestimate for production. This is because the limits for the factors are the extremes, and the results from the trials are all worst-case combinations. However, a validation is intended to show that worst-case does not cause problems, so this is the appropriate route to take. (In tolerance-design or value-engineering, the limits for the factors are set at less extreme levels since the intention is to save money, and a decent estimate of variation is preferred. Further discussion is outside the scope of this paper.)

Further examination of the results shows that the variation within any triplicate group is much smaller than the variation between different combinations of the array. This indicates negligible experimental error, and that the controlled factors account for most of the observed variation. An ANOVA calculation could be used here but it would be an unnecessary complication.
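A rough version of this comparison, without a full ANOVA, is to set the average variance inside the replicate groups against the variance of the group averages. The sketch below assumes equal group sizes and uses hypothetical numbers, not the values from Table 4.

```python
from statistics import mean, variance


def within_and_between(groups):
    """Compare replicate scatter with scatter between array rows.

    `groups` is a list of replicate lists, one per trial (equal sizes).
    Returns (average within-group variance, variance of group averages).
    """
    within = mean([variance(g) for g in groups])
    between = variance([mean(g) for g in groups])
    return within, between


# Hypothetical triplicates for two trials, for illustration only.
example_groups = [[49.0, 50.0, 51.0], [59.0, 60.0, 61.0]]
within, between = within_and_between(example_groups)
print(within, between)
```

The variance of the group averages still contains a share of the replicate scatter, so this is only a quick screen. It is not a substitute for ANOVA when a formal test is needed.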

The level averages are computed for each factor as the average result of the six rows with ones and the average result of the six rows with twos. For example, with Factor 1 the average recovery for the first six rows is 49.0 and for the last six is 51.9. Changing the matrix from TB control to water increases the recovery from 49.0 to 51.9.
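The level-average calculation can be sketched as below. To keep the example small it uses a four-trial, three-factor array with hypothetical results rather than the data in Table 4. The same function handles a column with more than two levels.

```python
from statistics import mean


def level_averages(array, results):
    """Return, for each factor, a dict {level: average result at that level}."""
    averages = []
    for col in range(len(array[0])):
        by_level = {}
        for row, result in zip(array, results):
            by_level.setdefault(row[col], []).append(result)
        averages.append({level: mean(values)
                         for level, values in sorted(by_level.items())})
    return averages


# Small orthogonal array and hypothetical results, for illustration only.
small_array = [
    [1, 1, 1],
    [1, 2, 2],
    [2, 1, 2],
    [2, 2, 1],
]
small_results = [10.0, 12.0, 20.0, 22.0]

for factor, averages in enumerate(level_averages(small_array, small_results), 1):
    print("Factor", factor, averages)
```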

An analysis of all the level averages shows which factors have the largest effects (Figure 1). Factor 5, the second time to heat, is the most significant, followed by the matrix, quantity of BF3-methanol, and the delay before measuring which all have roughly the same size of effect. The interpretation must be treated with caution because the numerical analysis cannot be rigorous. The effect of the four-level factor, delay time, confounds with each of the other factors, and any interaction between two factors confounds with the other main effects.

Figure 1. Level Averages from Polysorbate 80 Trials

To improve the coefficient of variation for recovery, return to the lab and run more development trials. This validation analysis gives a strong indication of which factors to investigate first, but it does not output a definitive statement as to which ones are significant and how much more tightly they need to be controlled.

A later review of the results concluded that the level settings were in many cases too wide apart. For example, volumes of 1.5 and 2.4 mL were chosen for the extreme settings of three constituents but the errors when using a pipette are much smaller than this. If more realistic levels had been used in the trials, then simulation shows that the coefficient of variation would have fallen below 5%.


This assay determines aluminum hydroxide in a sample by first dissolving it in acid to obtain free aluminum ions. These are then complexed with EDTA to take them out of solution by chelation. The excess EDTA is back-titrated with copper sulfate using a Radiometer 865 Autotitrator. Samples (water and saline solution) were spiked with 1.0 mg/mL for the trials. The L12 supersaturated array is again a convenient match for the factors and levels as shown in Table 5. The acceptance criterion is CV <5%.

Table 5. Factors and Levels for Aluminium Trials

The results in Table 6 reveal the variation within each triplicate set and between groups. The average and standard deviation of the 36 results give a coefficient of variation of 1.8%, which is well within the limit.

Table 6. L12 Supersaturated Array and Results for Aluminium Trials

If the validation had been a failure, then it would have been worthwhile to analyze the results in greater depth to identify the major sources of variation. The graph of level averages (Figure 2) shows that Factor 7 (quantity of EDTA solution), Factor 8 (time to heat at 100°C), and Factor 10 (days delay) have the largest effects. The first two agree with theoretical understanding of the reactions and the assay method. Adding different amounts of EDTA will give a corresponding bias to the results, and shorter heating times will prevent the reaction from full completion. The third of these, delay, suggests that either the solutions change with time or there is another source of variation.

Figure 2. Level Averages from Aluminium Trials

An ANOVA for these trials "steals" one degree of freedom from the error term to perform the calculations (Table 7). It also uses a pooling strategy, which combines the effects of the insignificant factors with the error term. The factors that were pooled into the error all had risk levels greater than 25%, which is a reasonable limit. After pooling, the highest risk level among the remaining factors is 5.4% for "quantity sulfuric acid," which indicates 94.6% confidence that the effect of this factor is real. The confidence levels for the other remaining factors are even higher.

Table 7. ANOVA for Aluminum Trials

The ANOVA confirms the interpretation from the level averages. The contribution column shows that the quantity of EDTA solution and the time to heat at 100°C between them account for 56% of the total variation observed in the whole experiment. The quantity of sulfuric acid and the choice of 1,000 ppm standard are both statistically significant, but they are of minor practical importance because their contributions to the variation are only 2% and 6%.
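A simplified sketch of a contribution calculation is shown below. It takes each factor's sum of squares as a percentage of the total sum of squares. This is an assumption about the general approach: the pooling and error-variance corrections behind Table 7 are not reproduced, and the array and results are hypothetical.

```python
from statistics import mean


def percent_contributions(array, results):
    """Return each factor's sum of squares as a percentage of the total."""
    grand = mean(results)
    ss_total = sum((r - grand) ** 2 for r in results)
    contributions = []
    for col in range(len(array[0])):
        by_level = {}
        for row, result in zip(array, results):
            by_level.setdefault(row[col], []).append(result)
        # Sum of squares between the level averages of this factor.
        ss_factor = sum(len(values) * (mean(values) - grand) ** 2
                        for values in by_level.values())
        contributions.append(100 * ss_factor / ss_total)
    return contributions


# Small orthogonal array and hypothetical results, for illustration only.
small_array = [
    [1, 1, 1],
    [1, 2, 2],
    [2, 1, 2],
    [2, 2, 1],
]
small_results = [10.0, 12.0, 20.0, 22.0]
print(percent_contributions(small_array, small_results))
```

In a supersaturated array the columns are not all mutually orthogonal, so contributions computed this way can overlap and need not add up to 100%.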

Factors that have a small effect give scope for further savings. Look at Factor 5, the pH after adding NaOH. The original procedure called for NaOH to be added in small amounts until a pH of 4 was obtained. The trials showed that pH values of 3.9 and 4.1 had no discernible effect on the assay. The procedure has now been simplified and a single, fixed quantity of NaOH is added in one shot. This reliably gives a pH value between 3.9 and 4.1. The time saved from this and other changes reduced the duration of the assay from 150 to 60 min. We gained a capacity increase for the laboratory and relieved the technicians from a tedious sequence of actions.

We want to reemphasize this: a rigorous mathematical analysis is not possible because there are not enough degrees of freedom to separate the main effects of the factors or the interactions between them. If it were not for the repeat measurements, the ANOVA would have been severely strained because the total degrees of freedom for the ten factors are more than the experiment can support. However, the interpretation presented agrees with theoretical and engineering understanding, is highly likely to be right, and serves as a guide for further development experiments if they are needed.


These methods are now routine at a leading UK biopharmaceutical company. The L12 supersaturated array with a four-level factor provides a suitable, standard design for qualification of a wide variety of assays. The qualification trials can be completed in four days, which includes the array and other tests. A selection of applications shown in Table 8 demonstrates both the range of assays and the savings in time.

Table 8. Efficiency Improvements for Different Assays

A further bonus occurred with some assays when the trials revealed that a wide range for some factors had only a very minor effect, even though the factors are tightly controlled in normal use. Control of these factors was relaxed to give a further saving.

Validation tests that employ DoE are much more efficient than simple, one-at-a-time approaches and yet they are also effective at identifying problems in a product or process. Techniques such as supersaturated arrays are particularly appropriate since they balance between reducing the number of trials and the need to test for interactions. They also quite often provide useful diagnostic information.

The methods are compatible with the demands of FDA and tie in closely with definitions of robustness trials. While they go beyond the current minimum requirements of a simplistic IQ + OQ + PQ, the advantages are clear when the requirement is to demonstrate that a product or process meets its specification and will continue to work reliably.


1. Bandurek GR, Hughes HL, Crouch D. The use of Taguchi methods in performance demonstrations. Quality and Reliability Engineering International 1990; 6:121-131.

2. Bandurek GR. Practical application of supersaturated arrays. Quality and Reliability Engineering International 1999; 15:123-133.

3. FDA. FDA proposal Sec.490.100 "Process Validation Requirements for Drug Products", CPG 7132c.08. Bethesda MD. 2004 March 12.

4. FDA CDER. "Guideline On General Principles Of Process Validation", Bethesda MD. 1987 May.

5. FDA. Guidance to Industry, "Q2B Validation of Analytical Procedures: Methodology", Bethesda MD. 1996 November.

6. Pifat, D. CBER. "Assay Validation", Bethesda MD.

7. European Pharmacopoeia. Monograph: Polysorbate 80 Assay by Gas Chromatograph. Strasbourg France. PH Eur 2000, 3rd edition.

George R. Bandurek, Ph.D., is principal of GRB Solutions Ltd, 9 Cissbury Road, Worthing, West Sussex BN14 9LD, England, phone and fax 44.1903.215175.