This article is the third in a four-part series on essential statistical techniques for any scientist or engineer working in the biotechnology field. This installment deals with statistical methods for calculating confidence intervals, tolerance intervals, and capability analysis. The difference between the confidence interval and tolerance interval is explained.
In the last installment, the concept of hypothesis testing was presented. In hypothesis testing, we must state the assumed value of the population parameter. In constructing intervals, on the other hand, we determine a range that is expected to contain the true population parameter with a stated level of confidence. This confidence level is usually set at 95%, though intervals at 99% or 90% are also common. A confidence interval is used to estimate a population parameter such as the mean or standard deviation. A tolerance interval is used to estimate the range that contains a stated proportion of the individual values in the population. This distinction will become more apparent in the discussion of capability analysis.
Hypothesis testing determines how many standard errors (s/√n) the observed mean lies from the hypothesized population mean. The conclusion is either that the mean is statistically different or that there is a lack of evidence to reject the null hypothesis. A similar analysis can be performed using confidence intervals. If the hypothesized mean falls within the confidence interval, it is an indication that the sample mean is not statistically different from the hypothesized mean. We will concentrate on the confidence interval when the population variance is unknown. In that case, the t-distribution is used, because it takes into account the added uncertainty of estimating the population variance from the sample. The t-distribution is tabled by confidence level and degrees of freedom. The degrees of freedom are the number of observations used to estimate the sample standard deviation minus one. The formula for the confidence interval is as follows:

$$\bar{X} \pm t_{1-\alpha/2;\,n-1}\,\frac{s}{\sqrt{n}}$$
in which $\bar{X}$ is the sample mean, $t_{1-\alpha/2;\,n-1}$ is the t-value from the t-table with a confidence level of 1-α and n-1 degrees of freedom, s is the sample standard deviation, and n is the sample size used to estimate the mean and standard deviation. Using the protein concentration data from Part 2 of this article series (BioPharm International, June 2008), a confidence interval can be calculated. Table 1 shows the data and calculations for the confidence interval. If the theoretical concentration was thought to be 30, the 95% confidence interval shows that 30 is contained in the interval; therefore, the mean of 31.70 is not statistically different from the value of 30.
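The confidence interval calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the article's own worksheet: the data values are hypothetical (they are not the Table 1 data), the names `T_CRIT_95` and `mean_confidence_interval` are invented for this example, and the critical values are the usual two-sided 95% entries from a standard t-table, rounded to three decimals.

```python
import math
import statistics

# Two-sided 95% critical values of the t-distribution, keyed by degrees of
# freedom (standard t-table entries, rounded to three decimals).
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}


def mean_confidence_interval(values):
    """Return (mean, lower, upper) for a two-sided 95% CI on the mean."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)        # sample standard deviation (n - 1)
    std_err = s / math.sqrt(n)          # standard error of the mean
    half_width = T_CRIT_95[n - 1] * std_err
    return mean, mean - half_width, mean + half_width


# Hypothetical protein concentrations for illustration; NOT the Table 1 data.
concentrations = [29.0, 31.5, 33.0, 30.5, 34.0, 32.0]
mean, lower, upper = mean_confidence_interval(concentrations)

print(f"mean = {mean:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
print("hypothesized value 30 inside interval:", lower <= 30 <= upper)
```

If the hypothesized value lies between `lower` and `upper`, the sample mean is not statistically different from it at the 95% confidence level, which mirrors the reasoning applied to the value of 30 in the text.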
Another way to look at the relationship between a confidence interval and hypothesis testing is to run the hypothesis test that determines whether the observed mean of 31.70 is statistically different from 28.47 (the lower confidence limit). The formula for the t-test would be:

$$t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} = \frac{31.70 - 28.47}{s/\sqrt{n}}$$
which gives the same value as the t0.95;5 in Table 1.
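The equivalence between the interval and the test can be checked numerically. The sketch below assumes hypothetical data (not the Table 1 values) and a hard-coded two-sided 95% t-value for 5 degrees of freedom; the variable names are illustrative. It computes the lower confidence limit, then uses that limit as the hypothesized mean in a one-sample t statistic.

```python
import math
import statistics

# Two-sided 95% t-value for 5 degrees of freedom (standard t-table entry).
T_CRIT_95_DF5 = 2.571

# Hypothetical measurements for illustration; NOT the Table 1 data.
values = [29.0, 31.5, 33.0, 30.5, 34.0, 32.0]
n = len(values)
mean = statistics.mean(values)
std_err = statistics.stdev(values) / math.sqrt(n)   # s / sqrt(n)

# Lower limit of the 95% confidence interval for the mean.
lower_limit = mean - T_CRIT_95_DF5 * std_err

# One-sample t statistic with the lower limit as the hypothesized mean.
t_stat = (mean - lower_limit) / std_err

print(f"t statistic = {t_stat:.3f}, critical t = {T_CRIT_95_DF5:.3f}")
```

Because the lower limit sits exactly the critical number of standard errors below the sample mean, the t statistic recovers the tabled t-value, apart from floating-point rounding.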
For the two-sample case, the confidence interval is constructed for the difference between the two means; if the interval contains zero, the two means are not statistically different. The approach is similar to the one used in the above formula. Some textbooks use a pooled standard deviation (sp) when the population standard deviations are unknown but assumed to be equal, and the sample sizes are small (under 30):

$$s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}$$
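A short Python sketch of the two-sample case follows. It assumes the standard textbook form of the interval, (mean 1 - mean 2) ± t · sp · √(1/n1 + 1/n2) with n1 + n2 - 2 degrees of freedom, which may differ in detail from the formula the article printed. The two lots of data are hypothetical, and the function names are invented for this example.

```python
import math
import statistics


def pooled_std(sample1, sample2):
    """Pooled standard deviation, assuming equal population variances."""
    n1, n2 = len(sample1), len(sample2)
    var1 = statistics.variance(sample1)   # sample variance (n - 1 divisor)
    var2 = statistics.variance(sample2)
    return math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))


def difference_confidence_interval(sample1, sample2, t_crit):
    """Return (lower, upper) for the CI on mean(sample1) - mean(sample2)."""
    n1, n2 = len(sample1), len(sample2)
    diff = statistics.mean(sample1) - statistics.mean(sample2)
    std_err = pooled_std(sample1, sample2) * math.sqrt(1 / n1 + 1 / n2)
    return diff - t_crit * std_err, diff + t_crit * std_err


# Hypothetical measurements from two lots; values are illustrative only.
lot_a = [30.1, 31.4, 29.8, 32.0, 30.7]
lot_b = [31.0, 32.2, 30.5, 31.8, 32.5]

# Two-sided 95% t-value for 5 + 5 - 2 = 8 degrees of freedom (t-table entry).
T_CRIT_95_DF8 = 2.306

low, high = difference_confidence_interval(lot_a, lot_b, T_CRIT_95_DF8)
print(f"95% CI for the difference in means: ({low:.2f}, {high:.2f})")
print("interval contains zero:", low <= 0 <= high)
```

When the interval contains zero, as it does for these illustrative lots, there is no evidence at the 95% confidence level that the two means differ.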
Table 1. Data and calculations for the confidence interval