# Statistical Essentials—Part 4: Regression and Design of Experiments

Published in: BioPharm International, November 1, 2008, Volume 21, Issue 11, Pages 40–45

Well-designed experiments can reduce the risk of coming to an incorrect conclusion during a process characterization, assay validation, or process validation study.

This article is the fourth in the four-part series on essential statistical techniques for any scientist or engineer working in the biotechnology field. This installment deals with statistical ideas related to regression analysis and statistically designed experiments for the understanding of variability.

Steven Walfish

### SIMPLE LINEAR REGRESSION

Simple linear regression analysis is a statistical method to determine whether a relationship exists between two variables. It is sometimes referred to as correlation analysis, which is the statistical tool used to describe the degree to which one variable is linearly related to another. The correlation coefficient (ρ for the population, r for the sample) assesses the strength of the linear relationship between x and y. The correlation coefficient ranges from –1 to 1, with 1 being a perfect positive correlation, –1 a perfect negative correlation, and 0 indicating no linear relationship. Unfortunately, there is no universal threshold for what counts as a meaningful correlation. Depending on the nature of the analysis, a correlation of 0.7 might be considered sufficient, but in other applications, you might require a correlation of 0.95 for it to be a useful finding.
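As a minimal sketch of how the sample correlation coefficient is computed, the Python snippet below implements the usual formula r = Sxy / √(Sxx·Syy). The function name `pearson_r` and the data are hypothetical and chosen only for illustration; they do not come from the article.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustrative data (an assumption, not from the article)
x = [-2, -1, 0, 1, 2]
r_positive = pearson_r(x, [2 * v + 1 for v in x])  # exact increasing line
r_negative = pearson_r(x, [-3 * v for v in x])     # exact decreasing line
r_curved = pearson_r(x, [v ** 2 for v in x])       # exact but nonlinear (parabola)

print(r_positive, r_negative, r_curved)
```

The parabola case is included because it shows that r measures only the linear part of a relationship: y is completely determined by x there, yet the correlation is zero.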

In a statistically designed experiment, the experimental conditions are controlled by the experimenter. Observational data, on the other hand, usually do not come from a controlled set of experiments, but rather are a collection of observations from various sources. For example, if I wanted to see the correlation between height and weight in elementary school children, I could deliberately select equal numbers of boys and girls for the sample; this would be an example of a designed experiment. If I just collected data from all students without regard to gender, the results might be biased. The main drawback of observational data is the potential for having a disproportionate number of observations from one level of a factor, or having two or more factors changing simultaneously without controlling for the change.

Regression and correlation analysis can be applied to either observational data or a statistically designed experiment. The main differences are the conclusions that can be drawn and the knowledge that bias could be present in observational data.

### REGRESSION

The simple linear model has a single regressor, x, which has a straight-line relationship with y. Simple linear models have two regression coefficients: the slope and the intercept. The intercept (β₀) is the average y-value when x is equal to zero. The slope (β₁) is the change in the average y-value for a unit change in x. Additionally, all models have a random error component (εᵢ).

The simple linear model is:

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

Typically, we test if the slope is significantly different from 0. A slope of 0 would indicate that there is no relationship between x and y. Typically, we also test if the intercept is significantly different from 0. An intercept of 0 would indicate that when x = 0, then y = 0.
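The tests on the slope and intercept can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the article's own analysis: the function `fit_simple_linear`, the data, and the hard-coded critical value are all introduced here. The snippet computes the least-squares estimates, their standard errors, and the t statistics (estimate divided by standard error), which are then compared with a two-sided 5% critical value from a t table for n − 2 degrees of freedom.

```python
import math

def fit_simple_linear(x, y):
    """Least-squares fit of y = b0 + b1*x with t statistics for both coefficients."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # slope estimate
    b0 = mean_y - b1 * mean_x           # intercept estimate
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))        # residual standard deviation
    se_b1 = s / math.sqrt(sxx)
    se_b0 = s * math.sqrt(1 / n + mean_x ** 2 / sxx)
    return {
        "intercept": b0,
        "slope": b1,
        "t_intercept": b0 / se_b0,
        "t_slope": b1 / se_b1,
        "df": n - 2,
    }

# Hypothetical data (an assumption, not from the article)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
fit = fit_simple_linear(x, y)

# Two-sided 5% critical value of the t distribution with 3 degrees of freedom,
# taken from a standard t table (assumption: alpha = 0.05).
T_CRIT = 3.182
slope_significant = abs(fit["t_slope"]) > T_CRIT
intercept_significant = abs(fit["t_intercept"]) > T_CRIT

print(fit)
print("slope differs from 0:", slope_significant)
print("intercept differs from 0:", intercept_significant)
```

For these made-up values the slope is clearly different from zero while the intercept is not, which is the situation described as "the regression line runs through the origin."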

An intercept of 0 is equivalent to saying "the regression line runs through the origin."

Using regression requires adherence to some basic assumptions. The first assumption is that the errors (εᵢ) are independent and normally distributed with mean equal to zero and variance σ². The second assumption is that the x values are known without error. Unfortunately, this assumption is often violated, especially when performing linearity testing during method validation. If the x values are sample preparations with assigned values, the assignment of these values has some error. Usually the violation of this assumption has a minimal impact on the analysis. The third assumption is that the variance of εᵢ is constant over the range of x (homoscedasticity). The usual remedy for violating this assumption is a transformation of the data (for example, a log transformation, most commonly applied to the response y).

Some hazards of regression analysis include extrapolating beyond the range of the x values; allowing influential observations or outliers to produce misleading models; and forgetting that the regression of y on x is not the same as the regression of x on y.
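The last hazard is easy to check numerically. The sketch below, using hypothetical data and a helper named `slope` (both assumptions made for illustration), fits the line in both directions. If the two regressions were the same line, the slope of y on x would equal the reciprocal of the slope of x on y; instead, the product of the two slopes equals r², so they agree only when the correlation is perfect.

```python
def slope(x, y):
    """Least-squares slope for the regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

# Hypothetical data (an assumption, not from the article)
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

b_y_on_x = slope(x, y)             # change in y per unit change in x
b_x_on_y = slope(y, x)             # change in x per unit change in y
implied_slope = 1 / b_x_on_y       # x-on-y line redrawn with y on the vertical axis
r_squared = b_y_on_x * b_x_on_y    # product of the two slopes equals r squared

print(b_y_on_x, implied_slope, r_squared)
```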

A very popular statistic used to assess the "goodness" of the model is R². R² can be defined as the proportion of the variation in y explained by the model, often expressed as a percent. Unfortunately, R² is very sensitive to the distribution and spread of the data. A high R² does not necessarily mean a good fit.

Anscombe created a data set that highlighted the importance of graphing data first, before applying any statistical test to the data set (1). In Figure 1, you can see how four different data sets, each with nearly identical means, standard deviations, and fitted regression lines, can have very different data distributions, so that the regression coefficients alone are misleading. The graph of x1 versus y1 gives the appropriate regression line. The other graphs have a curvilinear relationship (x2 versus y2), an outlier (x3 versus y3), or a single influential point (x4 versus y4), leading to inappropriate conclusions.
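The point can be reproduced numerically. The following sketch fits a least-squares line to each of Anscombe's four published data sets; the helper `fit_line` is a name introduced here for illustration. All four fits return essentially the same slope, intercept, and R², which is why only a plot reveals the differences.

```python
def fit_line(x, y):
    """Return (slope, intercept, r_squared) for the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

# Anscombe's quartet (data sets 1 to 3 share the same x values)
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

results = [fit_line(x, y) for x, y in [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]]

for number, (slope, intercept, r_squared) in enumerate(results, start=1):
    print(f"set {number}: slope={slope:.3f} intercept={intercept:.3f} R^2={r_squared:.3f}")
```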

We should always treat the validity of the assumptions as doubtful and conduct a residual analysis to examine the adequacy of the model. We cannot detect violations of the assumptions by examining summary statistics alone. A residual is defined as the difference between the observed value and the predicted value from the model. A standardized residual is the residual divided by its estimated standard deviation; the standardized residuals have zero mean and approximately unit variance (like a standard normal variable). A standardized residual outside ±3 can be considered an outlier.
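A minimal sketch of the ±3 screening rule follows, assuming the simplest form of standardization (each residual divided by the residual standard deviation, with no leverage adjustment). The function name and the data, including the deliberately planted aberrant value, are hypothetical.

```python
import math

def standardized_residuals(x, y):
    """Residuals from the least-squares line divided by the residual standard deviation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    return [e / s for e in residuals]

# Hypothetical data: a straight line with small alternating noise (an assumption)
x = list(range(1, 21))
y = [2 * xi + 1 + (0.5 if xi % 2 == 0 else -0.5) for xi in x]
y[9] += 15  # plant one aberrant observation at x = 10

z = standardized_residuals(x, y)
flagged_x = [xi for xi, zi in zip(x, z) if abs(zi) > 3]

print("x values flagged as potential outliers:", flagged_x)
```

Note that with very small samples a single residual can never reach ±3 under this simple standardization, because the outlier itself inflates the standard deviation estimate; that is one more reason to plot the residuals rather than rely on the cutoff alone.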

Figure 1

Figures 2, 3, and 4 show an example of how to apply regression analysis for assay data. A potency assay is to be validated for the range 50–150% potency. The protocol tested five different theoretical potency samples, three times on each of three days (nine total observations per potency level, for a total of 45 data points). We have done the analysis three different ways—three data points per day per potency level (Figure 2); one data point per day per potency level (Figure 3); and one data point per potency level (Figure 4)—to compare the results.

Figure 2

The same data summarized by day or by concentration can give different R² values, while having no impact on the parameter estimates. Always use the raw data rather than summary data, unless the precision of the data (i.e., poor assay repeatability) warrants the use of the means.
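The effect of averaging can be illustrated with a small sketch. The potency values below are hypothetical (they are not the data behind Figures 2 to 4), and the layout is simplified to three replicates at each of five levels. With such a balanced layout, fitting the raw values and fitting the level means give the same slope and intercept, but averaging removes the replicate scatter and so raises R².

```python
from statistics import mean

def fit_line(x, y):
    """Return (slope, intercept, r_squared) for the least-squares line of y on x."""
    mean_x = mean(x)
    mean_y = mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy ** 2 / (sxx * syy)

# Hypothetical measured potency (%) at each theoretical potency level (assumption)
raw = {
    50: [47.0, 51.0, 53.5],
    75: [72.0, 76.5, 78.0],
    100: [96.0, 99.0, 103.5],
    125: [121.0, 126.0, 129.5],
    150: [146.0, 149.0, 153.5],
}

# Raw data: one (x, y) pair per individual measurement
x_raw = [level for level, reps in raw.items() for _ in reps]
y_raw = [value for reps in raw.values() for value in reps]

# Summary data: one (x, y) pair per level, using the mean of the replicates
x_mean = list(raw)
y_mean = [mean(reps) for reps in raw.values()]

slope_raw, intercept_raw, r2_raw = fit_line(x_raw, y_raw)
slope_mean, intercept_mean, r2_mean = fit_line(x_mean, y_mean)

print(f"raw data:    slope={slope_raw:.4f} intercept={intercept_raw:.4f} R^2={r2_raw:.5f}")
print(f"level means: slope={slope_mean:.4f} intercept={intercept_mean:.4f} R^2={r2_mean:.5f}")
```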

Figure 3

### DESIGN OF EXPERIMENTS

Well-designed experiments can reduce the risk of coming to an incorrect conclusion during a process characterization, assay validation, or process validation study. The primary goal is usually to extract the maximum amount of information regarding the factors from as few observations as possible. Typically, designs of experiments can be categorized into two classes: screening designs and optimization designs. Screening designs are smaller sets of experiments that are intended to identify the critical few factors from the many potential trivial factors. A screening design assumes a linear effect, usually at two different levels or settings of each factor. Typical screening designs include fractional factorial and Plackett-Burman designs. Optimization designs, sometimes called response surface designs, are larger experiments that investigate interactions between factors and nonlinear responses, and are conducted at more than two levels for each factor. Typical optimization designs include central composite and Box-Behnken designs.
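To make the idea of a two-level screening design concrete, the sketch below builds a full factorial for three factors and a half fraction of it. The factor names A, B, and C are hypothetical, and the generator C = A×B is one common choice rather than the only one. Levels are coded as −1 (low) and +1 (high).

```python
from itertools import product

factors = ["A", "B", "C"]  # hypothetical factor names (assumption)

# Full two-level factorial: every combination of low (-1) and high (+1), 2**3 = 8 runs
full_factorial = [
    dict(zip(factors, levels)) for levels in product([-1, 1], repeat=len(factors))
]

# Half fraction, 2**(3-1) = 4 runs: vary A and B freely and set C = A*B
half_fraction = [
    {"A": a, "B": b, "C": a * b} for a, b in product([-1, 1], repeat=2)
]

print("full factorial runs:", len(full_factorial))
for run in half_fraction:
    print(run)
```

The half fraction needs only four runs instead of eight, but the price is that the main effect of C cannot be separated from the A×B interaction. That trade-off is acceptable in screening, where the working assumption is that main effects dominate.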

Figure 4

The data from a well-designed experiment can be used to model the response as a function of the different factors. Regression methods as discussed previously can be applied to these data. Analysis of variance (ANOVA) is the statistical test used to assess differences between the means of the factor levels. The basis for ANOVA is the variability between factor levels compared with the average variability within a factor level.
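The comparison of between-level and within-level variability can be sketched as a one-way ANOVA F statistic. The function name and the three groups of observations are hypothetical; the resulting F value would be compared with a critical value from an F table for the stated degrees of freedom.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA on a list of groups."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variability between factor levels (group means around the grand mean)
    ss_between = sum(
        len(group) * (m - grand_mean) ** 2 for group, m in zip(groups, group_means)
    )
    # Variability within factor levels (observations around their own group mean)
    ss_within = sum(
        (value - m) ** 2 for group, m in zip(groups, group_means) for value in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical responses at three levels of one factor (assumption)
levels = [[10, 11, 12], [13, 14, 15], [16, 17, 18]]
f_stat, df_between, df_within = one_way_anova_f(levels)

print(f"F = {f_stat:.2f} on ({df_between}, {df_within}) degrees of freedom")
```

A large F means the differences between level means are large relative to the scatter within each level.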

### SUMMARY

A well-designed experiment makes it easier to understand different sources of variation. Analysis techniques such as regression and ANOVA help to partition the variation, either for predicting the response or for determining whether the differences seen between factor levels are larger than expected when compared with the variability seen within a factor level. This article only scratches the surface of the various statistical techniques available to the researcher or scientist. A more comprehensive course in statistics will help clarify the differences in the methods.

### REFERENCES

1. Anscombe FJ. Graphs in Statistical Analysis. American Statistician. 1973;27(1):17–21.

Steven Walfish is the president of Statistical Outsourcing Services, Olney, MD, 301.325.3129, steven@statisticaloutsourcingservices.com