This article is the fourth in the four-part series on essential statistical techniques for any scientist or engineer working
in the biotechnology field. This installment deals with statistical ideas related to regression analysis and statistically
designed experiments for understanding variability.
SIMPLE LINEAR REGRESSION
Simple linear regression analysis is a statistical method for determining whether a relationship exists between two variables. It
is closely related to correlation analysis, the statistical tool used to describe the degree to which one variable is linearly related to another. The correlation
coefficient (ρ for a population, r for a sample) measures the strength of the linear relationship between x and y. The correlation coefficient ranges from –1 to 1, with values near 1 indicating a strong positive correlation and values near –1 a strong negative
correlation. Unfortunately, the correlation coefficient can be hard to interpret on its own. Depending on the nature of
the analysis, a correlation of 0.7 might be considered sufficient, while other applications might require a correlation
of 0.95 for it to be a useful finding.
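A minimal sketch of computing a sample correlation coefficient in Python (the data values here are hypothetical, chosen only to illustrate a strong positive correlation):

```python
import numpy as np

# Hypothetical paired measurements (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.1])

# Pearson correlation coefficient r: the off-diagonal entry of the
# 2x2 correlation matrix
r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.4f}")
```

Because these points lie close to a straight line with positive slope, r comes out close to 1.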
In a statistically designed experiment, the experimental conditions are controlled by the experimenter. Observational data,
on the other hand, usually do not have a controlled set of experiments, but rather are a collection of observations from various
sources. For example, if I wanted to see the correlation between height and weight in elementary school children, I could
select equal numbers of boys and girls for the sample; this would be an example of a designed experiment. If I simply collected
data from all students without regard to gender, the results might be biased. The main drawback of observational
data is the potential for having a disproportionate number of observations from one level of a factor or having two or more
factors changing simultaneously without controlling for the change.
Regression and correlation analysis can be applied to either observational data or a statistically designed experiment. The
main differences are the conclusions that can be drawn and the knowledge that bias could be present in observational data.
The simple linear model has a single regressor, x, which has a straight-line relationship with y. Simple linear models have two regression coefficients: the slope and the intercept. The intercept (β0) is the average y-value when x is equal to zero. The slope (β1) is the change in y for a unit change in x. Additionally, all models have some random error component (εi).
The simple linear model is:
yi = β0 + β1xi + εi
Typically, we test if the slope is significantly different from 0. A slope of 0 would indicate that there is no relationship
between x and y. Typically, we also test if the intercept is significantly different from 0. An intercept of 0 would indicate that when x = 0, then y = 0.
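These tests can be sketched with scipy.stats.linregress, which reports the fitted slope and intercept along with a p-value for the test that the slope is zero (the data below are hypothetical, illustrative values):

```python
import numpy as np
from scipy import stats

# Hypothetical calibration-style data (illustrative values only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.2, 2.3, 3.9, 6.1, 8.0, 10.2])

# Least-squares fit of the simple linear model y = b0 + b1*x
fit = stats.linregress(x, y)

print(f"slope (b1):     {fit.slope:.3f}")
print(f"intercept (b0): {fit.intercept:.3f}")
print(f"p-value (H0: slope = 0): {fit.pvalue:.2e}")
```

A small p-value leads us to reject the hypothesis that the slope is zero, i.e., that there is no relationship between x and y.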
An intercept of 0 is equivalent to saying "the regression line runs through the origin." Using regression requires adherence
to some basic assumptions. We assume that the errors (εi) are independent and normally distributed with mean equal to zero and variance σ2. The second assumption is that the x values are known without error. Unfortunately, this assumption is often violated, especially when performing linearity
testing during method validation: if the x values are sample preparations with assigned values, the assignment of those values carries some error. Usually the
violation of this assumption has a minimal impact on the analysis. The third assumption is that the variance of the εi is constant over the range
of x (homoscedasticity). A common remedy when this assumption is violated is a transformation of the data (e.g., a log
transformation of the response).
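As a sketch of this remedy (using simulated, hypothetical data), a multiplicative error structure, whose scatter grows with x, becomes additive with roughly constant variance after taking logs:

```python
import numpy as np

# Simulated, hypothetical data whose scatter grows with x: the error is
# multiplicative, so the residual spread is not constant (heteroscedastic)
rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 50)
y = 2.0 * x * np.exp(rng.normal(0.0, 0.2, size=x.size))

# Taking logs turns the multiplicative error into an additive one with
# approximately constant variance, so a straight-line fit on the log
# scale satisfies the homoscedasticity assumption
b1, b0 = np.polyfit(np.log(x), np.log(y), 1)
print(f"log-scale slope: {b1:.3f}, log-scale intercept: {b0:.3f}")
```

On the log scale the fitted slope is close to 1 and the intercept close to log 2, recovering the underlying y = 2x relationship.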