Statistical Essentials—Part 4: Regression and Design of Experiments

A well-designed experiment can make it easier to understand the sources of variation.
Nov 01, 2008
Volume 21, Issue 11

Steven Walfish
This article is the fourth in the four-part series on essential statistical techniques for any scientist or engineer working in the biotechnology field. This installment deals with statistical ideas related to regression analysis and statistically designed experiments for the understanding of variability.


Simple linear regression analysis is a statistical method to determine if a relationship exists between two variables. It is closely related to, and sometimes loosely referred to as, correlation analysis, which is the statistical tool used to describe the degree to which one variable is linearly related to another. The correlation coefficient (ρ for a population, r for a sample) assesses the strength of the linear relationship between x and y. It ranges from –1 to 1, with values near 1 indicating a strong positive correlation, values near –1 a strong negative correlation, and values near 0 little or no linear relationship. Unfortunately, there is no universal threshold for what counts as a meaningful correlation. Depending on the nature of the analysis, a correlation of 0.7 might be considered sufficient, but in other applications, you might require a correlation of 0.95 for it to be a useful finding.
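To make the definition concrete, the following is a minimal sketch of how a sample correlation coefficient can be computed from paired data. The function name `pearson_r` and the height and weight values are hypothetical and chosen only for illustration; they do not come from any real study.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products about the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical illustration data (not from the article)
heights_cm = [110, 115, 120, 125, 130, 135]
weights_kg = [19.0, 21.5, 22.0, 25.5, 27.0, 30.5]

r = pearson_r(heights_cm, weights_kg)
print(f"r = {r:.3f}")
```

Whether the resulting value counts as "strong enough" still depends on the application, as noted above.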

In a statistically designed experiment, the experimental conditions are controlled by the experimenter. Observational data, on the other hand, usually do not come from a controlled set of experiments, but rather are a collection of observations from various sources. For example, if I wanted to see the correlation between height and weight in elementary school children, I could select equal numbers of boys and girls for the sample; this would be an example of a designed experiment. If I just collected data from all students without regard to gender, the results might be subject to bias. The main drawback of observational data is the potential for having a disproportionate number of observations from one level of a factor, or for having two or more factors change simultaneously without controlling for the change.

Regression and correlation analysis can be applied to either observational data or a statistically designed experiment. The main differences are the conclusions that can be drawn and the knowledge that bias could be present in observational data.


The simple linear model has a single regressor, x, which has a straight-line relationship with respect to y. Simple linear models have two regression coefficients: slope and intercept. The intercept (β0) is the average y-value when x is equal to zero. The slope (β1) is the change in y for a unit change in x. Additionally, all models have some random error component (εi).

The simple linear model is:

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

Typically, we test whether the slope is significantly different from 0; a slope of 0 would indicate that there is no linear relationship between x and y. We often also test whether the intercept is significantly different from 0; an intercept of 0 would indicate that the average y is 0 when x = 0.
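The fit and the two tests described above can be sketched in a few lines. This is a minimal illustration using ordinary least squares, not a substitute for a statistics package. The function name `fit_simple_linear`, the data values, and the constant `T_CRIT` are assumptions introduced for this example; `T_CRIT` is the two-sided 5% critical value of the t distribution with 6 degrees of freedom, which applies only because the hypothetical data set has n = 8 points.

```python
import math


def fit_simple_linear(x, y):
    """Ordinary least squares fit of y = b0 + b1*x, with t-statistics for b0 and b1."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance estimate uses n - 2 degrees of freedom
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sigma2 = sse / (n - 2)
    se_slope = math.sqrt(sigma2 / sxx)
    se_intercept = math.sqrt(sigma2 * (1.0 / n + mean_x ** 2 / sxx))
    return {
        "slope": slope,
        "intercept": intercept,
        "t_slope": slope / se_slope,              # tests H0: slope = 0
        "t_intercept": intercept / se_intercept,  # tests H0: intercept = 0
    }


# Hypothetical data, roughly y = 2x with small scatter (not from the article)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

T_CRIT = 2.447  # assumed: two-sided 5% critical value of t with n - 2 = 6 df

fit = fit_simple_linear(x, y)
slope_significant = abs(fit["t_slope"]) > T_CRIT
intercept_significant = abs(fit["t_intercept"]) > T_CRIT
print(fit, slope_significant, intercept_significant)
```

For these illustrative values the slope is clearly distinguishable from 0 while the intercept is not, which is the pattern one would expect when the line runs close to the origin.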

An intercept of 0 is equivalent to saying "the regression line runs through the origin."

Using regression requires adherence to some basic assumptions. The first assumption is that the errors (εi) are independent and normally distributed with mean equal to zero and variance σ². The second assumption is that the x values are known without error. Unfortunately, this assumption is often violated, especially when performing linearity testing during method validation. If the x values are sample preparations with assigned values, the assignment of these values carries some error. Usually the violation of this assumption has a minimal impact on the analysis. The third assumption is that the variance of εi is constant over the range of x (homoscedasticity). A remediation for violating this assumption is usually a transformation of the data (e.g., a log transformation).
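As a rough illustration of the constant-variance assumption and of how a transformation can help, the sketch below compares the residual spread in the upper half of the x range with that in the lower half, before and after a log transformation. Everything here is an assumption made for the example: the data are synthetic with a deliberately multiplicative (±10%) error, the helper names are hypothetical, the log is applied to both x and y, and the half-versus-half ratio is a crude diagnostic rather than a formal test.

```python
import math


def fit_line(x, y):
    """Return (intercept, slope) from an ordinary least squares fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residual_variance_ratio(x, y):
    """Mean squared residual in the upper half of x divided by that in the lower half."""
    intercept, slope = fit_line(x, y)
    pairs = sorted(zip(x, y))  # order observations by x
    residuals = [yi - (intercept + slope * xi) for xi, yi in pairs]
    half = len(residuals) // 2
    low = sum(r * r for r in residuals[:half]) / half
    high = sum(r * r for r in residuals[half:]) / (len(residuals) - half)
    return high / low


# Synthetic data: y = 2x with an alternating +/-10% multiplicative error,
# so the absolute scatter grows with x (heteroscedastic by construction).
x = list(range(1, 21))
y = [2.0 * xi * (1.1 if xi % 2 == 0 else 0.9) for xi in x]

ratio_raw = residual_variance_ratio(x, y)

# On the log scale a multiplicative error becomes additive, so the spread
# should be roughly constant across the range.
log_x = [math.log(v) for v in x]
log_y = [math.log(v) for v in y]
ratio_log = residual_variance_ratio(log_x, log_y)

print(f"raw scale ratio: {ratio_raw:.2f}, log scale ratio: {ratio_log:.2f}")
```

A ratio far from 1 on the raw scale and close to 1 after transformation suggests the transformation has stabilized the variance for this kind of error structure; real data would also warrant a residual plot.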
