Change-point analysis is an effective and powerful statistical tool for determining if and when a change in a data set has
occurred. The tool provides a confidence level that indicates the likelihood of the change. Change-point analysis can be used
in three distinct applications: 1) determining if improvements or process changes may have led to a shift in an output, 2)
problem solving, and 3) trend analysis. This paper describes how the tool can be used in the pharmaceutical industry for the
three applications. Two case studies are presented to show how change-point analysis was used to verify the potential effect
of process changes.
One question that is commonly asked during the analysis of time-ordered data is "Has there been a shift in the mean of the
data?" One technique for assessing if and when a shift has occurred is a cumulative sum chart (CUSUM chart). CUSUM charts
rely on a visual assessment of whether there is a change in the slope of the CUSUM plot. This technique works well with large
changes because they produce an obvious change in slope. Subtle changes in slope are more difficult to detect and may be missed.
Also, it is difficult to determine if a less pronounced change in slope represents a significant change. Potential change
points identified with CUSUM charts can be confirmed by running a t-test of the data before and after the suspected change point. However, a t-test should only be used if the data is normally distributed and the change point is known. Bandurek provides a review of
the use of CUSUM charts.1
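The confirmation step can be sketched in a few lines. This is a minimal illustration, assuming Welch's unequal-variance form of the t statistic (the text does not say which variant is intended). The function name `welch_t` and the values are hypothetical. Only the statistic is computed; converting it to a p-value requires a t distribution and is left out.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(before, after):
    """Welch t statistic for the difference in means (after minus before)."""
    # Standard error of the difference, using each segment's sample variance.
    se = sqrt(variance(before) / len(before) + variance(after) / len(after))
    return (mean(after) - mean(before)) / se

# Hypothetical values on either side of a suspected change point.
before = [10.0, 10.2, 9.8, 10.1, 9.9]
after = [11.0, 11.2, 10.8, 11.1, 10.9]
t_stat = welch_t(before, after)
```

A large absolute value of `t_stat` supports a shift in the mean between the two segments.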
Another statistical tool that uses CUSUM charts to look for shifts in data is change-point analysis. This approach goes one
step further by assigning a confidence level for each change detected. The confidence level is determined using a technique
known as bootstrapping, which takes the subjectivity out of a CUSUM analysis.
This paper describes how to conduct a change-point analysis and discusses three applications for the tool in pharmaceutical
process monitoring and control: demonstrating improvements, problem solving, and trend analysis. The use of change-point analysis
to demonstrate an improvement is highlighted in two examples.
HOW TO CONDUCT A CHANGE-POINT ANALYSIS
Change-point analysis uses cumulative sum and bootstrapping techniques to identify changes and assign a confidence level to
the change.2 First, a CUSUM chart is generated, which displays the cumulative sum of the differences between individual data values and
the mean. If there is no shift in the mean of the data, the chart will be relatively flat with no pronounced changes in slope.
Also, the range (the difference between the highest and lowest CUSUM values) will be small. A data set with a shift in the
mean will have a slope change at the data point where the change occurred, and the range will be relatively large.
Figure 1. Example of a CUSUM chart. Each point plotted represents the difference between an individual data point and the
mean, which is added to or subtracted from the previous point on the graph (depending on whether the difference between the
individual data point and the mean is positive or negative). The data for this CUSUM chart is shown in Figure 2.
Figure 1 shows an example of a CUSUM chart. Data for this chart is listed in the Excel table in Figure 2. Column B contains
the raw data and column C contains the cumulative sum. The cumulative sum at each data point is calculated by adding the difference
between the current value and the mean to the previous sum [i.e., $S_i = S_{i-1} + (X_i - \bar{X})$ for i = 1 to n, with $S_0 = 0$, where $S_i$ is the cumulative sum, $X_i$ is the current value, and $\bar{X}$ is the mean]. A CUSUM chart starting at zero will always end with $S_n = 0$, because the deviations from the mean sum to zero. If a CUSUM chart slopes down, it doesn't necessarily mean that the data are trending down. Rather, it indicates a period
in time when most of the data are below the mean. A sudden change in direction of a CUSUM indicates a shift in the average.
From the CUSUM chart in Figure 1, it appears that a change may have occurred at data point 20. At this point, the slope changes
direction and increases, indicating that most of the data points are now greater than the average. The point at which the
CUSUM chart is furthest from the baseline of zero is the estimated point of change. Interpreting the CUSUM chart would lead
one to the conclusion that a shift in the mean occurred at data point 20. This interpretation relies on a subjective assessment
as to whether there is a change in the slope.
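The CUSUM calculation and the "furthest from zero" rule described above can be sketched as follows. This is an illustrative sketch, not the spreadsheet used in the paper. The function names and the data values are hypothetical and are not the values from Figure 2.

```python
from itertools import accumulate

def cusum(values):
    """Return [S_0, ..., S_n] with S_0 = 0 and S_i = S_{i-1} + (X_i - mean)."""
    mean = sum(values) / len(values)
    return list(accumulate((x - mean for x in values), initial=0.0))

def furthest_from_zero(values):
    """Return the 1-based index i at which |S_i| is largest."""
    sums = cusum(values)
    return max(range(1, len(sums)), key=lambda i: abs(sums[i]))

# Hypothetical values, not the data from Figure 2.
data = [10.0] * 6 + [12.0] * 4
sums = cusum(data)
cusum_range = max(sums) - min(sums)   # highest minus lowest CUSUM value
change_index = furthest_from_zero(data)
```

For these hypothetical values the CUSUM falls steadily through the sixth point and then rises back to zero, so `change_index` is 6, the last value at the lower level.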
Figure 2. Excel table with data used to generate the CUSUM chart in Figure 1
Change-point analysis builds on a CUSUM chart by determining a confidence level for each change. Confidence levels are calculated
using a technique known as bootstrapping, whereby many random iterations of the data set are generated. For each randomized
data set, the corresponding cumulative sums are determined, along with the ranges for the sums. The percent of times that
the cumulative sum range for the original data exceeds the cumulative sum range for the randomized bootstrap data is the confidence
level. The idea behind this algorithm is that if a pronounced change has occurred, the range on the CUSUM chart for the original
data will be large, and randomizing the data will not lead to data sets with larger ranges, or will do so only infrequently.
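A minimal sketch of this bootstrap calculation is shown here. It assumes the bootstraps are random reorderings of the original data (sampling without replacement), a fixed random seed for repeatability, and a strict "exceeds" comparison. The function names, the default of 1,000 bootstraps, and the data values are hypothetical.

```python
import random
from itertools import accumulate

def cusum_range(values):
    """Difference between the highest and lowest CUSUM values (S_0 = 0 included)."""
    mean = sum(values) / len(values)
    sums = list(accumulate((x - mean for x in values), initial=0.0))
    return max(sums) - min(sums)

def confidence_level(values, n_bootstraps=1000, seed=0):
    """Percent of bootstraps whose CUSUM range is exceeded by the original range."""
    rng = random.Random(seed)          # assumed fixed seed, for repeatable results
    original = cusum_range(values)
    exceeded = 0
    for _ in range(n_bootstraps):
        shuffled = list(values)
        rng.shuffle(shuffled)          # random reordering of the original data
        if cusum_range(shuffled) < original:
            exceeded += 1
    return 100.0 * exceeded / n_bootstraps

# Hypothetical values with a shift in the mean after the tenth point.
data = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1, 9.9, 10.0,
        11.1, 10.9, 11.2, 11.0, 10.8, 11.3, 11.0, 10.9, 11.1, 11.2]
shift_confidence = confidence_level(data)
flat_confidence = confidence_level([5.0] * 10)
```

Drawing each bootstrap value with replacement (for example with `rng.choices(values, k=len(values))`) is an alternative design; the sketch uses reordering so that every bootstrap has the same mean as the original data.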
Figure 3. Bootstraps of original data from Figure 2. The average of each bootstrap data set is shown at the bottom. Bootstraps
are data sets in which the original data has been randomly reordered. For this example, 100 bootstraps were conducted, the
first 15 of which are shown here.
In the sample data set from Figure 2, 100 bootstraps were produced using the Excel formula =INDEX($B$3:$B$32,1+30*RAND(),1).
The first 15 bootstraps are shown in Figure 3. For each bootstrap data set, the corresponding CUSUM data are generated along
with the range (difference between the highest and lowest CUSUM values). The bootstrap CUSUM values are shown in Figure 4.
All the formulas used for the analysis are in Table 1.
Table 1. Microsoft Excel formulas for conducting a change-point analysis
The final step in determining the confidence level is to calculate the percent of times that the range for the original CUSUM
data exceeds the range for the bootstrap CUSUM data. The CUSUM data range for the bootstraps is in row 115 of Figure 4, whereas
the CUSUM range for the original data is in cell F3 of Figure 2. In this example, the confidence level was 99%. It is appropriate
to set a predetermined threshold confidence level beyond which a change is considered significant. Typically, 90% or 95% is used.
Figure 4. CUSUM values for the bootstrap data. The range for each data set is shown at the bottom in row 115.
The change at data point 20 that was indicated on the original CUSUM chart has been shown to have a 99% confidence level based
on 100 bootstraps. A histogram of the CUSUM ranges for the 100 bootstraps is shown in Figure 5. As shown in the histogram,
only one of the bootstrap ranges was greater than 9.8, which was the CUSUM range of the original data. Thus, the confidence
level was 99%.
Figure 5. Histogram of bootstrap CUSUM ranges. The CUSUM range for the original data is indicated by the red line at 9.8.
The original CUSUM chart hints at other, more subtle changes. These potential changes can now be assigned a confidence level
by dividing the data into two subsets: data points 1 to 19 and data points 20 to 30. Each data subset can then be subjected to
its own change-point analysis to see if the threshold confidence level is exceeded.
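The subset analysis can be sketched as follows, under the assumption that the detected change point is the first point of the second subset (so a change at point 20 gives points 1 to 19 and points 20 to 30). The function names, the 500 bootstraps, and the idealized, noise-free data are hypothetical.

```python
import random
from itertools import accumulate

def cusum_range(values):
    """Difference between the highest and lowest CUSUM values."""
    mean = sum(values) / len(values)
    sums = list(accumulate((x - mean for x in values), initial=0.0))
    return max(sums) - min(sums)

def confidence_level(values, rng, n_bootstraps=500):
    """Percent of random reorderings whose CUSUM range is below the original's."""
    original = cusum_range(values)
    exceeded = 0
    for _ in range(n_bootstraps):
        shuffled = list(values)
        rng.shuffle(shuffled)
        if cusum_range(shuffled) < original:
            exceeded += 1
    return 100.0 * exceeded / n_bootstraps

def analyze_subsets(values, change_point, rng):
    """Split at a 1-based change point and analyze each side separately."""
    before = values[:change_point - 1]   # e.g. points 1 to 19
    after = values[change_point - 1:]    # e.g. points 20 to 30
    return confidence_level(before, rng), confidence_level(after, rng)

# Hypothetical, idealized data: a second, smaller shift hidden in points 1 to 19.
data = [10.0] * 9 + [11.0] * 10 + [13.0] * 11
conf_before, conf_after = analyze_subsets(data, 20, random.Random(0))
```

Each returned confidence level would then be compared with the predetermined threshold; a subset that exceeds it can be split again in the same way.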
Potential changes in variation also can be assessed using the change-point analysis technique. Because biologics manufacturers
produce a relatively small number of batches each week, it is not always practical to perform a change-point analysis on standard
deviation. As an alternative, a change-point analysis is conducted on the difference between consecutive data points. For
the sample data from Figure 2, let $X_1, X_2, \ldots, X_{30}$ represent the 30 data points. From this, 15 consecutive differences, $D_1, D_2, \ldots, D_{15}$, are calculated as follows: $D_i = |X_{2i} - X_{2i-1}|$ for i = 1 to 15. The change-point analysis is then performed on $D_1$ through $D_{15}$.
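The paired differences can be computed as in the short sketch below. The function name and the example values are hypothetical; the pairing follows the formula above, in which each data point is used in exactly one difference.

```python
def paired_differences(values):
    """Return D_i = |X_{2i} - X_{2i-1}| for non-overlapping consecutive pairs."""
    # Step through the data two points at a time; a trailing unpaired point is ignored.
    return [abs(values[i + 1] - values[i]) for i in range(0, len(values) - 1, 2)]

# Hypothetical values, not the data from Figure 2.
example = [1.0, 3.0, 2.0, 2.5, 5.0, 4.0]
diffs = paired_differences(example)

# Thirty data points yield fifteen differences.
n_diffs = len(paired_differences([float(k) for k in range(30)]))
```

The resulting list of differences is then treated as the data set for the same CUSUM and bootstrap steps used for the mean.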
Microsoft Excel was used to perform the analysis described above. A commercially available software package known as Change-Point
Analyzer (Taylor Enterprises, Inc.) greatly simplifies the analysis. The remaining examples in this paper used Change-Point
Analyzer, version 2.2.