When analyzing data, you will sometimes find one value that is far from the others. Such a value is called an outlier. When you encounter an outlier, you may be tempted to delete it from the analysis. Assuming that the data point is not attributable to obvious experimental mistakes, do you keep it or delete it?
The analyst faces a choice. Either the outlier is a genuine random value from the underlying population, in which case you keep it, or it is an undetected error and therefore invalid, in which case it is best to delete it.
The problem is choosing the most likely possibility. Statistical calculations can answer this question: If the values were all sampled from a Gaussian ("normal") distribution, what is the chance that one value will be far away from the rest?
A well-known method is the extreme studentized deviation (ESD) test, also called Grubbs' test. After you set the probability limits, a calculation yields a number in the range of one to four, which is the critical point. "Bad" data lie beyond the critical point. We put quotes around "bad" because discarding a point is a judgment call.
Figure 1. The autocorrelation of the Wi's is smaller, and therefore better (more independent), than that of the Ui's.
The critical point depends upon the sample size, the data points themselves, and the probability values assigned by the analyst. Older desktop computer programs could not handle the underlying equations and converge properly, so numerical methods were used instead. Now, Microsoft Excel 2000 can handle the beta inverse function, and with this capability the critical point can be determined exactly to five decimal places.
The ESD has many advantages in testing whether an outlier is good data or bad data. It is simple and easy to compute. Tietjen and Moore state that for testing a single outlier, the power of ESD is optimal, because the test deals with the largest value of a data set after all points have been sorted by size (1). The ESD statistic also approaches statistical independence rapidly as the sample grows. But the exact determination of the critical point (C) remained challenging. Rosner, and Iglewicz and Hoaglin, used a simulation approach, which gives unsatisfactory results (2, 3). In a recent article, we showed that ESD has an asymptotic distribution that permitted us to derive its critical point analytically (4); the sample size, of course, must be sufficiently large.
Tutorial for Determining BIF
We also introduced an iterative Monte Carlo integration (IMCI) method to determine the value of C. Although the numerical solutions from IMCI differ significantly from those of Rosner's simulation method, we found that IMCI needs improvement to calculate a more accurate critical point. Another drawback of IMCI is that its results seem unstable across the three populations we tried, with N = 10^4, 10^5, and 10^6 random numbers in the program.
Those drawbacks are the reason we present a direct computational method based on the beta inverse function (BIF), which is available in Excel 2000. This is an analytical method, not a simulation approach, and we consider it more effective than simulation methods. It has the advantage of being independent of the iterative generation of random numbers, which makes it more efficient than IMCI. We found Excel 2000 useful in our previous article as well (4), and the reasons it was useful then still apply.
McCullough and Wilson reported that the 1997 version of Excel was not adequate for calculating the critical point (7). The new version is different and improved; it works well for this purpose.
We review briefly the math behind the critical point of ESD. After that, we show a direct computational method using BIF. The results of the IMCI method (4) and Rosner's simulation method (2) are presented for comparison.
Let X1, X2, ..., Xm be a random sample of size m from a normal distribution with mean µ and variance σ², and let X̄ and s² represent the sample mean and sample variance. We can transform the data to a statistic that focuses on deviation. Equation 1 defines Ui = (Xi − X̄)/s for all i = 1, 2, ..., m.
It has long been known that Ui is distributed as in Equation 2.
The generic beta function is B(a, b) = ∫0^1 t^(a−1) (1 − t)^(b−1) dt.
The Ui's tend toward statistical independence as m gets larger, because X̄ and s tend toward µ and σ.
Define Wi = |Ui| for all i = 1, 2, ..., m (Equation 4). The largest Wi is singled out as the ESD, as in Equation 5: ESD = max(W1, ..., Wm).
Furthermore, the distribution function G(w), defined in Equation 6, is the distribution of the Wi's, whose independence improves with increasing m.
So, if m is sufficiently large, we can analytically compute the critical point of ESD. An extreme observation is declared an outlier if the value of the ESD exceeds the critical point at the desired level of significance α.
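As a concrete illustration of Equations 1 and 5, here is a minimal sketch in Python; the sample data are invented for illustration, and Wi = |Ui| is assumed:

```python
import numpy as np

def esd_statistic(x):
    """Extreme studentized deviation: ESD = max |x_i - mean| / s."""
    x = np.asarray(x, dtype=float)
    u = (x - x.mean()) / x.std(ddof=1)  # Equation 1: U_i = (X_i - mean) / s
    return np.abs(u).max()              # Equation 5: largest W_i = |U_i|

# A small made-up sample with one value far from the rest
sample = [10.1, 9.8, 10.3, 10.0, 9.9, 14.5]
print(round(esd_statistic(sample), 4))
```

Note that the suspect value 14.5 produces the largest studentized deviation, so it is the one compared against the critical point.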
How large does the value of m need to be for the Wi's to be considered independent? An exact answer cannot be given, because we have no way of knowing the joint distribution of the Ui's. We can use the autocorrelation of the Wi's as an indication, because it detects nonrandomness; zero is the ideal.
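This autocorrelation check can be approximated numerically. The sketch below is our own construction, assuming Wi = |Ui|: it simulates standard-normal samples and pools all adjacent pairs (Wi, Wi+1) to estimate the lag-1 correlation.

```python
import numpy as np

def w_autocorrelation(m, n_samples=20_000, seed=0):
    """Estimate the lag-1 autocorrelation of the W_i's by simulation."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, m))
    # Studentize each row: U_i = (X_i - mean) / s, then W_i = |U_i|
    u = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, ddof=1, keepdims=True)
    w = np.abs(u)
    # Pool adjacent pairs (W_i, W_{i+1}) across all simulated samples
    a, b = w[:, :-1].ravel(), w[:, 1:].ravel()
    return float(np.corrcoef(a, b)[0, 1])

# Zero would mean perfect independence; the text reports roughly -0.1 for m = 10
print(round(w_autocorrelation(10), 3))
```

As m grows, the estimate should shrink toward zero, which is the behavior Figure 1 depicts.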
Table 1. Critical point C of the extreme studentized deviation (ESD) for α = 1%.
In Figure 1, for m = 5 to 100 in steps of 5, we present two lines. The red line represents the autocorrelation of the Wi's obtained from a simulation with N = 10^4 random numbers generated from the standard normal distribution. The blue line is the exact autocorrelation of the Ui's, which is equal to −1/(m − 1).
The figure shows that the autocorrelation of the Wi's is smaller in absolute value than that of the Ui's, indicating that the Wi's tend to be more independent. This is another advantage of the ESD calculation. For m = 10, the autocorrelation of the Wi's is about −0.1.
In the previous article we developed an integral equation that could be solved by numerical methods (4). The critical point C is the upper limit of Equation 6. All you need to do is specify α and use the data to derive equations for G(w) and g(w). This is not easy, because it is based on Equation 6; it is more convenient to find C based on Equation 2.
In our previous article, we proposed an IMCI method to determine the numerical value of C (4). The results of that method need to be improved to find a more accurate critical point. The beta inverse function (BIF) in Excel offers that accuracy; it is more effective than Rosner's method and more efficient than IMCI.
Let F represent the distribution function of a specific beta distribution:
Based on Equation 2, it can be shown that the value of C in Equation 6 is equal to
where
Fortuitously, K is the value of BIF at
We now have an easy way to customize the ESD critical point for any desired sample size m and level of significance α. Excel 2000 can handle this method of numerical calculation.
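The article's numbered equations are not reproduced here, so the sketch below rests on two stated assumptions: the standard distributional result Ui² · m/(m − 1)² ~ Beta(1/2, (m − 2)/2) for studentized deviations, and the independence approximation, under which each Wi must fall below C with probability (1 − α)^(1/m). It uses SciPy's beta quantile function in place of Excel's BETAINV:

```python
from scipy.stats import beta

def esd_critical_point(m, alpha=0.05):
    """Critical point C of the ESD via the beta inverse function (BIF).

    Assumes U_i**2 * m / (m - 1)**2 ~ Beta(1/2, (m - 2)/2) and treats the
    W_i's as independent, so each must fall below C with probability
    (1 - alpha)**(1/m).  beta.ppf is the SciPy analog of Excel's BETAINV.
    """
    p = (1 - alpha) ** (1.0 / m)          # per-observation probability
    k = beta.ppf(p, 0.5, (m - 2) / 2.0)   # K: beta inverse function at p
    return (m - 1) / m**0.5 * k**0.5      # C = ((m - 1) / sqrt(m)) * sqrt(K)

# m = 15, alpha = 5%: should land near the article's BIF value C3 = 2.54384
print(round(esd_critical_point(15, 0.05), 5))
```

In Excel the same quantity K would be obtained as =BETAINV((1-alpha)^(1/m), 0.5, (m-2)/2), which is the sense in which no simulation or iteration is required.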
We will use data representing cholesterol values for a group of 15 healthy, normal persons to show the benefit of the BIF method (8). For further information about the use of ESD with these data, see our previous article (4). In that article, we computed ESD = 2.63662, corresponding to "subject 15." In Table 2, at level of significance α = 5% and a sample size of m = 15, we see that the critical points given by Rosner, IMCI (for N = 10^6), and BIF (using Excel) are, respectively, C1 = 2.65, C2 = 2.54589, and C3 = 2.54384. Both the BIF and IMCI methods lead to the same conclusion: the largest extreme value is an outlier, because ESD > C2 > C3. Although the difference between C2 and C3 is only 0.00205, that difference might be important in certain circumstances. Rosner's critical point, however, leads to a different conclusion: "subject 15" is not an outlier. Modern computation helps us delete poor data.
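The decision logic of this comparison can be written out directly, using only the values reported above for the cholesterol example:

```python
# Values reported in the text for the cholesterol example (m = 15, alpha = 5%)
esd = 2.63662                 # ESD statistic for "subject 15"
critical_points = {
    "Rosner": 2.65,           # simulation
    "IMCI":   2.54589,        # N = 10^6 random numbers
    "BIF":    2.54384,        # beta inverse function in Excel
}

for method, c in critical_points.items():
    verdict = "outlier" if esd > c else "not an outlier"
    print(f"{method}: C = {c:.5f} -> subject 15 is {verdict}")
```

Only Rosner's larger critical point fails to flag subject 15, which is the disagreement the text describes.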
Table 2. Critical point C of the extreme studentized deviation (ESD) for α = 5%.
I would like to thank Dr. Steven Walfish at Human Genome Sciences for fruitful e-communications on the subject. I would also like to thank the anonymous reviewers for their constructive comments and suggestions.
(1) Tietjen, G.L. and Moore, R.H., "Some Grubbs-Type Statistics for the Detection of Several Outliers," Technometrics 14(3), 583-598 (1972).
(2) Rosner, B. "On the Detection of Many Outliers," Technometrics 17(2), 221-227 (1975).
(3) Iglewicz, B. and Hoaglin, D.C., How to Detect and Handle Outliers, Basic References in Quality Control, Vol. 16 (American Society for Quality, Milwaukee, WI, 1993).
(4) Djauhari, M.A., "Improving ESD Procedure for Outlier Testing," BioPharm 14(3), 42-46 (March 2001).
(5) The MathWorks, MATLAB 5.3.1, Natick, MA.
(6) Minitab, Inc., MINITAB 11, State College, PA.
(7) McCullough, B.D. and Wilson, B., "On the Accuracy of Statistical Procedures in Microsoft Excel 97," Comput. Stat. Data Anal., 31(1), 27-37 (1999).
(8) Bolton, S., Pharmaceutical Statistics: Practical and Clinical Applications, 2nd ed., Drugs and the Pharmaceutical Sciences series, Vol. 44 (Marcel Dekker, New York, NY, 1990), p. 356.