October 1, 2003

Maman A. Djauhari

**BioPharm International**, Volume 16, Issue 10

*When analyzing data, you will sometimes find one value that is far from the others. Such a value is called an outlier. When you encounter an outlier, you may be tempted to delete it from the analysis. Assuming that the data point is not attributable to obvious experimental mistakes, do you keep it or delete it?*

The analyst faces a choice. Either the outlier is a legitimate random value from the underlying population, in which case you should keep it, or it is the result of an undetected error and is therefore invalid, in which case it is best to delete it.

The problem is choosing the most likely possibility. Statistical calculations can answer this question: If the values were all sampled from a Gaussian ("normal") distribution, what is the chance that one value will be far away from the rest?

A well-known method is the extreme studentized deviation (ESD) test, also called Grubbs' test. After you set the probability limits, a calculation yields the critical point, a number in the range of one to four. "Bad" data lie beyond the critical point. We put quotes around "bad" because discarding a point is a judgment call.
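To make the statistic concrete, here is a minimal sketch of how an ESD value might be computed. It assumes the usual definition (largest absolute deviation from the sample mean, divided by the sample standard deviation with divisor n - 1). The function name `esd_statistic` and the data are hypothetical illustrations, not values from this article.

```python
import statistics

def esd_statistic(values):
    """Return (ESD, index of the most extreme observation).

    Assumes the usual definition: max |x - mean| / s, where s is the
    sample standard deviation (divisor n - 1).
    """
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    deviations = [abs(x - mean) / s for x in values]
    index = max(range(len(values)), key=deviations.__getitem__)
    return deviations[index], index

# Hypothetical data, for illustration only.
sample = [10.0, 11.0, 9.0, 10.0, 20.0]
esd, where = esd_statistic(sample)
print(f"ESD = {esd:.5f} at position {where}")
```

The value returned is then compared with a critical point for the chosen sample size and significance level.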

Figure 1. The autocorrelation of the *W _{i}s* is smaller, and therefore better (more independent), than that of the *U _{i}s*.

The critical point depends upon the sample size, all the data points, and the probability values assigned by the analyst. Older desktop computer programs could not handle the large equations and converge properly, so numerical methods were used instead. Microsoft Excel 2000 can now evaluate the beta inverse function, and with this capability the critical point can be determined to five decimal places.

The ESD has many advantages in testing whether an extreme value is good data or bad data. It is simple and easy to compute. Tietjen and Moore state that for testing a single outlier, the power of ESD is optimal, because the test deals with the largest value in the data set after all data have been sorted by size (1). When the data are random variables, the studentized deviations behind ESD approach statistical independence rapidly. But the exact determination of the critical point (*C*) is still challenging. Rosner, and Iglewicz and Hoaglin, used a simulation approach, but that gives unsatisfactory results (2, 3). In a recent article, we showed that ESD has an asymptotic distribution that permitted us to derive its critical point analytically (4). The sample size, of course, must be sufficiently large.
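For readers unfamiliar with simulation-based critical points, the general idea can be sketched as follows. This is a generic Monte Carlo estimate written for illustration. It is not Rosner's procedure and not the IMCI method, and the function names, replicate count, and seed are assumptions.

```python
import math
import random

def simulate_esd(m, rng):
    """Draw one standard-normal sample of size m and return its ESD."""
    x = [rng.gauss(0.0, 1.0) for _ in range(m)]
    mean = sum(x) / m
    s = math.sqrt(sum((v - mean) ** 2 for v in x) / (m - 1))
    return max(abs(v - mean) for v in x) / s

def monte_carlo_critical_point(m, alpha, replicates=20000, seed=2003):
    """Estimate the (1 - alpha) quantile of ESD by brute-force simulation."""
    rng = random.Random(seed)
    esds = sorted(simulate_esd(m, rng) for _ in range(replicates))
    index = min(replicates - 1, math.ceil((1.0 - alpha) * replicates) - 1)
    return esds[index]

print(monte_carlo_critical_point(15, 0.05))
```

An estimate of this kind changes with the seed and the number of replicates, which is the sort of instability that motivates an analytical calculation.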

Tutorial for Determining BIF

We also introduced an iterative Monte Carlo integration (IMCI) method to determine the value of C. Although the numerical solution using IMCI differs significantly from using Rosner's simulation method, we found that IMCI needs to be improved to calculate a more accurate critical point. Another drawback of IMCI is that its results seem unstable for three different populations. We used N = 10^{4}, 10^{5}, and 10^{6} random numbers in the program.

Those drawbacks are the reason we present a direct computational method based on the beta inverse function (BIF), which is available in Excel 2000. This is an analytical method and not a simulation approach. We consider it to be more effective than simulation methods. Because it does not depend on an iterative process of generating random numbers, it is also more efficient than IMCI. We found Excel 2000 useful in our previous article as well (4). The reasons given then are still valid, namely:

- It allows us to customize the critical point of ESD for all desired sample sizes and levels of significance.

- It performs as well as MATLAB (5.3.1) in calculating the critical point, and it is more attractive than MINITAB 11 (5, 6).

McCullough and Wilson reported that Excel 97 is not adequate for calculating the critical point (7). Excel 2000 is different and improved, and it works well for this purpose.

We review briefly the math behind the critical point of ESD. After that, we show a direct computational method using BIF. The results of the IMCI method (4) and Rosner's simulation method (2) are presented for comparison.

Let X_{1}, X_{2}, ..., X_{m} be a random sample of size *m* from a normal distribution with mean µ and variance σ^{2}, and let X̄ and *s*^{2} represent the sample mean and sample variance. We can transform the data to a statistic that focuses on deviation. Equation 1 defines *U _{i}* for all *i* = 1, 2, ..., *m*.

It has long been known that *U _{i}* is distributed as in Equation 2.

The generic Beta function is

The *U _{i}s* tend toward statistical independence as *m* increases.

Define

for all *i* = 1, 2, ..., *m*. The largest *W _{i}* is singled out as the *ESD*.

Furthermore, the distribution function *G(w)*, defined in Equation 6, is the distribution of the *W _{i}s*. Independence improves with increasing *m*.

So, if *m* is sufficiently large, we can analytically compute the critical point of *ESD*. An extreme observation will be declared an outlier if the value of the *ESD* exceeds the critical point at the desired level of significance *α*.
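The quantities discussed in this section can be summarized in the following sketch. It is a reconstruction under stated assumptions, not a copy of the article's typeset equations. It assumes that *U _{i}* is the usual studentized deviation with *s* computed using divisor *m* - 1, that *W _{i}* is its absolute value, and that the classical beta-distribution result for the scaled square of *U _{i}* is the one intended. The last line is the large-*m* independence approximation described in the text.

```latex
\begin{align*}
U_i &= \frac{X_i - \bar{X}}{s}, \qquad i = 1, 2, \ldots, m \\
\frac{m\,U_i^{2}}{(m-1)^{2}} &\sim \mathrm{Beta}\!\left(\tfrac{1}{2},\, \tfrac{m-2}{2}\right) \\
B(a, b) &= \int_0^1 t^{a-1}(1-t)^{b-1}\,dt \\
W_i &= |U_i| \\
\mathit{ESD} &= \max_{1 \le i \le m} W_i \\
G(w) &= P(W_i \le w) \\
P(\mathit{ESD} \le C) &\approx \left[G(C)\right]^{m} = 1 - \alpha
\end{align*}
```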

How large does the value of *m* need to be for the *W _{i}s* to be considered independent? An exact answer cannot be given for this question because we have no way of knowing the joint distribution of the *U _{i}s*. We can use the autocorrelation of the *W _{i}s* as an indication, because it detects nonrandomness. Zero is the ideal.
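One way such a check might be carried out is sketched below. The text does not say how the autocorrelation was computed, so this sketch assumes a lag-1 sample autocorrelation averaged over simulated standard-normal samples. The function names, replicate counts, and seed are hypothetical.

```python
import math
import random

def lag1_autocorrelation(values):
    """Lag-1 sample autocorrelation of a sequence."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    numer = sum((values[k] - mean) * (values[k + 1] - mean) for k in range(n - 1))
    return numer / denom

def average_autocorrelations(m, replicates=2000, seed=1):
    """Average lag-1 autocorrelation of the U_i and of the W_i = |U_i|."""
    rng = random.Random(seed)
    total_u = total_w = 0.0
    for _ in range(replicates):
        x = [rng.gauss(0.0, 1.0) for _ in range(m)]
        mean = sum(x) / m
        s = math.sqrt(sum((v - mean) ** 2 for v in x) / (m - 1))
        u = [(v - mean) / s for v in x]
        w = [abs(v) for v in u]
        total_u += lag1_autocorrelation(u)
        total_w += lag1_autocorrelation(w)
    return total_u / replicates, total_w / replicates

for m in (5, 20, 100):
    r_u, r_w = average_autocorrelations(m, replicates=500)
    print(f"m = {m:3d}: U autocorrelation = {r_u:+.4f}, W autocorrelation = {r_w:+.4f}")
```

Values closer to zero suggest behavior closer to independence.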

Table 1. Critical point C of the extreme studentized deviation (ESD) for α = 1%.

In Figure 1, for *m* = 5 to 100, in steps of 5, we present two lines. The red one represents the autocorrelation of the *W _{i}s* obtained from a simulation with

The figure shows that the autocorrelation of the *W _{i}s* is smaller (in absolute value) than that of the *U _{i}s*.

In the previous article we developed an integral equation that could be solved by numerical methods (4). The critical point *C* is the upper limit of Equation 6. All you need to do is to specify α, and use the data to derive equations for *G(w)* and *g(w)*. This is not easy because it is based on Equation 6. It is more convenient to find *C* based on Equation 2.

Nomenclature

In our previous article, we proposed an IMCI method to determine the numerical value of C (4). The results of that method need to be improved to find a more accurate critical point. The beta inverse function (BIF) using Excel offers that accuracy. This is more effective than the Rosner method and more efficient than the IMCI method.

Let *F* represent the distribution function of a specific beta distribution:

Based on Equation 2, it can be shown that the value of *C* in Equation 6 is equal to

where

Fortuitously, K is the value of BIF at

We now have an easy way to customize the *ESD* critical point for any desired sample size *m* and level of significance α. Excel 2000 can handle this method of numerical calculation.
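The calculation can be sketched outside a spreadsheet as follows. The sketch assumes that the critical point takes the form C = (m - 1)·sqrt(K / m), where K is the beta inverse with shape parameters 1/2 and (m - 2)/2 evaluated at (1 - α)^{1/m}. That form follows from the classical beta distribution of the scaled squared studentized deviation together with the independence approximation, and it is an assumption here rather than a quotation of the article's equations. The beta inverse is implemented with a standard continued-fraction routine and bisection so that the block needs only the Python standard library. In Excel the corresponding worksheet function is `BETAINV`.

```python
import math

def _beta_continued_fraction(a, b, x, max_iter=500, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for k in range(1, max_iter + 1):
        k2 = 2 * k
        # Even step of the recurrence.
        coef = k * (b - k) * x / ((a - 1.0 + k2) * (a + k2))
        d = 1.0 + coef * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + coef / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the recurrence.
        coef = -(a + k) * (a + b + k) * x / ((a + k2) * (a + 1.0 + k2))
        d = 1.0 + coef * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + coef / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h

def beta_cdf(x, a, b):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log1p(-x))
    front = math.exp(log_front)
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_continued_fraction(a, b, x) / a
    return 1.0 - front * _beta_continued_fraction(b, a, 1.0 - x) / b

def beta_inv(p, a, b):
    """Beta inverse by bisection on the CDF."""
    lo, hi = 0.0, 1.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if beta_cdf(mid, a, b) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def esd_critical_point(m, alpha):
    """Assumed form: C = (m - 1) * sqrt(K / m), K = BIF at (1 - alpha)^(1/m)."""
    p = (1.0 - alpha) ** (1.0 / m)
    k = beta_inv(p, 0.5, (m - 2) / 2.0)
    return (m - 1) * math.sqrt(k / m)

print(f"{esd_critical_point(15, 0.05):.5f}")
```

For *m* = 15 and α = 5%, this sketch agrees to about three decimal places with the BIF value of 2.54384 reported in the article for the cholesterol example.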

We will use cholesterol values for a group of 15 healthy, normal persons to show the benefit of the BIF method (8). For further information about the use of *ESD* with these data, see our previous article (4). In that article, we computed *ESD* = 2.63662, which corresponds to "subject 15." In Table 2, at level of significance α = 5% and a sample size of *m* = 15, the critical points given by Rosner, IMCI (for N = 10^{6}), and BIF (using Excel) are, respectively, C_{1} = 2.65, C_{2} = 2.54589, and C_{3} = 2.54384. Both the BIF and IMCI methods lead to the same conclusion, that the largest extreme value is an outlier, because *ESD* > C_{2} > C_{3}. Although the difference between C_{2} and C_{3} is only 0.00205, that difference might be important in certain circumstances. Rosner's critical point, however, leads to a different conclusion: "subject 15" is not an outlier, because *ESD* < C_{1}. Modern computation helps us delete poor data.
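The comparison in this example amounts to a few lines of arithmetic, sketched here with the values quoted in the text. The variable and function names are illustrative.

```python
# Values quoted in the text for the cholesterol example (m = 15, alpha = 5%).
esd = 2.63662
critical_points = {
    "Rosner": 2.65,
    "IMCI (N = 10^6)": 2.54589,
    "BIF (Excel)": 2.54384,
}

def is_outlier(esd_value, critical_point):
    """Declare an outlier when ESD exceeds the critical point."""
    return esd_value > critical_point

decisions = {name: is_outlier(esd, c) for name, c in critical_points.items()}
for name, flagged in decisions.items():
    verdict = "outlier" if flagged else "not an outlier"
    print(f"{name}: C = {critical_points[name]:.5f} -> subject 15 is {verdict}")

# Gap between the IMCI and BIF critical points.
gap = round(critical_points["IMCI (N = 10^6)"] - critical_points["BIF (Excel)"], 5)
print(f"C2 - C3 = {gap}")
```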

Table 2. Critical point C of the extreme studentized deviation (ESD) for α = 5%.

I would like to thank Dr. Steven Walfish at Human Genome Sciences for fruitful e-communications on the subject. I would also like to thank the anonymous reviewers for their constructive comments and suggestions.

(1) Tietjen, G.L. and Moore, R.H., "Some Grubbs-Type Statistics for the Detection of Several Outliers," *Technometrics* 14(3), 583-598 (1972).

(2) Rosner, B. "On the Detection of Many Outliers," *Technometrics* 17(2), 221-227 (1975).

(3) Iglewicz, B. and Hoaglin, D.C., *How to Detect and Handle Outliers, Basic References in Quality Control*, Vol. 16 (American Society for Quality, Milwaukee, WI, 1993).

(4) Djauhari, M.A., "Improving ESD Procedure for Outlier Testing," *BioPharm* 14(3), 42-46 (March 2001).

(5) The Mathworks, MATLAB 5.3.1, Natick, MA.

(6) Minitab, Inc., MINITAB 11, State College, PA.

(7) McCullough, B.D. and Wilson, B., "On the Accuracy of Statistical Procedures in Microsoft Excel 97," *Comput. Stat. Data Anal.*, 31(1), 27-37 (1999).

(8) Bolton, S., *Pharmaceutical Statistics: Practical and Clinical Applications*, 2nd ed., Drugs and the Pharmaceutical Sciences series, Vol. 44 (Marcel Dekker, New York, NY, 1990), p. 356.