ABSTRACT
Changepoint analysis is an effective and powerful statistical tool for determining if and when a change in a data set has occurred. The tool provides a confidence level that indicates the likelihood of the change. Changepoint analysis can be used in three distinct applications: 1) determining if improvements or process changes may have led to a shift in an output, 2) problem solving, and 3) trend analysis. This paper describes how the tool can be used in the pharmaceutical industry for the three applications. Two case studies are presented to show how changepoint analysis was used to verify the potential effect of process changes.
Baxter Healthcare

One question that is commonly asked during the analysis of time-ordered data is "Has there been a shift in the mean of the data?" One technique for assessing if and when a shift has occurred is a cumulative sum chart (CUSUM chart). CUSUM charts rely on a visual assessment of whether there is a change in the slope of the CUSUM plot. This technique works well with large changes because they produce an obvious change in slope. Subtle changes in slope are more difficult to detect and may be missed. Also, it is difficult to determine if a less pronounced change in slope represents a significant change. Potential change points identified with CUSUM charts can be confirmed by running a t-test of the data before and after the suspected change point. However, a t-test should only be used if the data is normally distributed and the change point is known. Bandurek provides a review of the use of CUSUM charts.^{1}
Another statistical tool that uses CUSUM charts to look for shifts in data is changepoint analysis. This approach goes one step further by assigning a confidence level for each change detected. The confidence level is determined using a technique known as bootstrapping, which takes the subjectivity out of a CUSUM analysis.
This paper describes how to conduct a changepoint analysis and discusses three applications for the tool in pharmaceutical process monitoring and control: demonstrating improvements, problem solving, and trend analysis. The use of changepoint analysis to demonstrate an improvement is highlighted in two examples.
HOW TO CONDUCT A CHANGEPOINT ANALYSIS
Figure 1. Example of a CUSUM chart. Each plotted point is the previous point on the graph plus the difference between the current data point and the mean, so the curve rises when a data point is above the mean and falls when it is below. The data for this CUSUM chart is shown in Figure 2.

Changepoint analysis uses cumulative sum and bootstrapping techniques to identify changes and assign a confidence level to the change.^{2} First, a CUSUM chart is generated, which displays the cumulative sum of the differences between individual data values and the mean. If there is no shift in the mean of the data, the chart will be relatively flat with no pronounced changes in slope. Also, the range (the difference between the highest and lowest cumulative sums) will be small. A data set with a shift in the mean will have a slope change at the data point where the change occurred, and the range will be relatively large.
Figure 2. Excel table with data used to generate the CUSUM chart in Figure 1

Figure 1 shows an example of a CUSUM chart. Data for this chart is listed in the Excel table in Figure 2. Column B contains the raw data and column C contains the cumulative sum. The cumulative sum at each data point is calculated by adding the difference between the current value and the mean to the previous sum [i.e., S_{i} = S_{i-1} + (X_{i} - Xbar) for i = 1 to n, with S_{0} = 0, where S_{i} is the cumulative sum, X_{i} is the current value, and Xbar is the mean]. A CUSUM chart starting at zero will always end with S_{n} = 0. If a CUSUM chart slopes down, it doesn't necessarily mean that the data are trending down. Rather, it indicates a period in time when most of the data are below the mean. A sudden change in direction of a CUSUM indicates a shift in the average. From the CUSUM chart in Figure 1, it appears that a change may have occurred at data point 20. At this point, the slope changes direction and increases, indicating that most of the data points are now greater than the average. The point at which the CUSUM chart is furthest from the baseline of zero is the estimated point of change. Interpreting the CUSUM chart would lead one to the conclusion that a shift in the mean occurred at data point 20. This interpretation relies on a subjective assessment as to whether there is a change in the slope.
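The cumulative sum calculation can also be sketched outside of a spreadsheet. The Python below is a minimal illustration, not the worksheet from Figure 2: the function names and the ten-point data set are made up for the example. It computes the cumulative sums starting from zero and returns the point at which the CUSUM is furthest from the baseline.

```python
from itertools import accumulate


def cusum(values):
    """Return [S_0, S_1, ..., S_n], where S_0 = 0 and
    S_i = S_(i-1) + (X_i - mean of all values)."""
    mean = sum(values) / len(values)
    return list(accumulate((x - mean for x in values), initial=0.0))


def furthest_point(values):
    """1-based number of the data point at which the CUSUM is
    furthest from the baseline of zero."""
    s = cusum(values)
    return max(range(1, len(s)), key=lambda i: abs(s[i]))


# Hypothetical data: five values at one level, then five at a higher level.
data = [10.0] * 5 + [12.0] * 5
sums = cusum(data)
print(sums)                  # falls while values are below the mean, then rises
print(furthest_point(data))  # point where the slope changes direction
```

Whether the furthest point is read as the last value at the old level or the first value at the new level is a convention that this sketch leaves to the analyst.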
Figure 3. Bootstraps of original data from Figure 2. The average of each bootstrap data set is shown at the bottom. Bootstraps are data sets in which the original data has been randomly reordered. For this example, 100 bootstraps were conducted, the first 15 of which are shown here.

Changepoint analysis builds on a CUSUM chart by determining a confidence level for each change. Confidence levels are calculated using a technique known as bootstrapping, whereby many random iterations of the data set are generated. For each randomized data set, the corresponding cumulative sums are determined, along with the ranges for the sums. The percent of times that the cumulative sum range for the original data exceeds the cumulative sum range for the randomized bootstrap data is the confidence level. The idea behind this algorithm is that if a pronounced change has occurred, the range on the CUSUM chart for the original data will be large, and randomizing the data will not lead to data sets with larger ranges, or will do so only infrequently.
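The bootstrap comparison described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the spreadsheet used in this paper: the function names, the default of 1,000 bootstraps, the fixed seed, and the sample data are all hypothetical, and each bootstrap is produced by reordering the original values without replacement.

```python
import random
from itertools import accumulate


def cusum_range(values):
    """Difference between the highest and lowest cumulative sums."""
    mean = sum(values) / len(values)
    sums = list(accumulate((x - mean for x in values), initial=0.0))
    return max(sums) - min(sums)


def change_confidence(values, n_bootstraps=1000, seed=0):
    """Percentage of bootstraps whose CUSUM range is smaller than
    the CUSUM range of the data in its original order."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    original = cusum_range(values)
    smaller = 0
    for _ in range(n_bootstraps):
        # Assumption: a bootstrap is a random reordering of the data.
        reordered = rng.sample(values, len(values))
        if cusum_range(reordered) < original:
            smaller += 1
    return 100.0 * smaller / n_bootstraps


# Hypothetical data with an obvious shift, and data with no shift at all.
shifted = [10.0] * 15 + [12.0] * 15
flat = [5.0] * 20
print(change_confidence(shifted))
print(change_confidence(flat))
```

A data set with a pronounced shift should return a value near 100, because reordering almost never reproduces a CUSUM range as large as the original. The result is then compared against the chosen threshold confidence level.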
Table 1. Microsoft Excel formulas for conducting a changepoint analysis

In the sample data set from Figure 2, 100 bootstraps were produced using the Excel formula =INDEX($B$3:$B$32,1+30*RAND(),1). The first 15 bootstraps are shown in Figure 3. For each bootstrap data set, the corresponding CUSUM data are generated along with the range (difference between the highest and lowest CUSUM values). The bootstrap CUSUM values are shown in Figure 4. All the formulas used for the analysis are in Table 1.
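For readers working outside Excel, the resampling formula above can be mirrored in Python. The sketch below assumes that INDEX truncates a fractional row number to an integer, so that 1+30*RAND() selects one of rows 1 to 30 with equal probability; the function name and the seed are hypothetical.

```python
import random


def resample_like_index_formula(values, rng):
    """Mimic =INDEX(range, 1 + n*RAND(), 1) evaluated once per row:
    each output row is an independent random pick from the original
    column (int() truncates, as INDEX is assumed to do)."""
    n = len(values)
    return [values[int(n * rng.random())] for _ in range(n)]


rng = random.Random(0)            # fixed seed for a repeatable illustration
original = list(range(1, 31))     # stand-in for the 30 values in B3:B32
bootstrap = resample_like_index_formula(original, rng)
print(bootstrap)
```

Because every row is drawn independently, a value from the original column can appear more than once in a bootstrap, or not at all. A strict random reordering would instead use `rng.sample(values, len(values))`.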
Figure 4. CUSUM values for the bootstrap data. The range for each data set is shown at the bottom in row 115.

The final step in determining the confidence level is to calculate the percent of times that the range for the original CUSUM data exceeds the range for the bootstrap CUSUM data. The CUSUM data range for the bootstraps is in row 115 of Figure 4, whereas the CUSUM range for the original data is in cell F3 of Figure 2. In this example, the confidence level was 99%. It is appropriate to set a predetermined threshold confidence level beyond which a change is considered significant. Typically, 90% or 95% is selected.
Figure 5. Histogram of bootstrap CUSUM ranges. The CUSUM range for the original data is indicated by the red line at 9.8.

The change at data point 20 that was indicated on the original CUSUM chart has been shown to have a 99% confidence level based on 100 bootstraps. A histogram of the CUSUM ranges for the 100 bootstraps is shown in Figure 5. As shown in the histogram, only one of the bootstrap ranges was greater than 9.8, which was the CUSUM range of the original data. Thus, the confidence level was 99%.
The original CUSUM chart hints at other, more subtle changes. These potential changes can now be assigned a confidence level by dividing the data into two subsets: data points 1 to 19 and data points 20 to 30. Each data subset can then be subjected to its own changepoint analysis to see if the threshold confidence level is exceeded.
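The split-and-repeat procedure can be sketched as a short recursive routine. The Python below is one possible reading, not the method implemented in any particular software package. Its assumptions are labeled in the comments: the CUSUM extreme is treated as the last point at the old level, subsets shorter than a hypothetical `min_size` are not analyzed, and the names, threshold, bootstrap count, and data are illustrative.

```python
import random
from itertools import accumulate


def cusum(values):
    """Cumulative sums of deviations from the mean, starting at zero."""
    mean = sum(values) / len(values)
    return list(accumulate((x - mean for x in values), initial=0.0))


def confidence(values, n_bootstraps, rng):
    """Percentage of reordered data sets with a smaller CUSUM range."""
    def cusum_range(v):
        s = cusum(v)
        return max(s) - min(s)

    original = cusum_range(values)
    smaller = sum(
        cusum_range(rng.sample(values, len(values))) < original
        for _ in range(n_bootstraps)
    )
    return 100.0 * smaller / n_bootstraps


def find_changes(values, threshold=95.0, n_bootstraps=500, min_size=5, seed=0):
    """Return 1-based data point numbers at which a new level starts."""
    rng = random.Random(seed)
    changes = []

    def search(lo, hi):  # analyze values[lo:hi]
        segment = values[lo:hi]
        if len(segment) < min_size:  # assumption: skip very short subsets
            return
        if confidence(segment, n_bootstraps, rng) < threshold:
            return
        s = cusum(segment)
        m = max(range(1, len(s)), key=lambda i: abs(s[i]))
        split = lo + m  # assumption: extreme = last point at the old level
        if split >= hi:
            return
        changes.append(split + 1)
        search(lo, split)   # subset before the change
        search(split, hi)   # subset from the change onward

    search(0, len(values))
    return sorted(changes)


# Hypothetical data with a single shift starting at data point 16.
print(find_changes([10.0] * 15 + [12.0] * 15))
```

Each subset is tested against the same threshold confidence level, so the search stops on its own once no further change reaches that level.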
Potential changes in variation also can be assessed using the changepoint analysis technique. Because biologics manufacturers produce a relatively small number of batches each week, it is not always practical to perform a changepoint analysis on standard deviation. As an alternative, a changepoint analysis is conducted on the difference between consecutive data points. For the sample data from Figure 2, let X_{1}, X_{2}, ..., X_{30} represent the 30 data points. From this, 15 consecutive differences, D_{1}, D_{2}, ..., D_{15}, are calculated as follows: D_{i} = X_{2i} - X_{2i-1} for i = 1 to 15. The changepoint analysis is then performed on D_{1} through D_{15}.
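As a brief illustration of the differencing step, the Python sketch below forms D_{i} = X_{2i} - X_{2i-1} from non-overlapping pairs of consecutive points. The function name and the sample values are hypothetical; the resulting list would then be passed to the same changepoint analysis used for the mean.

```python
def paired_differences(values):
    """Return [D_1, D_2, ...] with D_i = X_(2i) - X_(2i-1), using
    non-overlapping pairs; a trailing unpaired value is ignored."""
    return [values[2 * i + 1] - values[2 * i] for i in range(len(values) // 2)]


# Hypothetical data: six points give three differences.
example = [1, 3, 2, 7, 5, 5]
print(paired_differences(example))                 # [2, 5, 0]
print(len(paired_differences(list(range(30)))))    # 15
```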
Microsoft Excel was used to perform the analysis described above. A commercially available software package known as ChangePoint Analyzer (Taylor Enterprises, Inc.) greatly simplifies the analysis. The remaining examples in this paper used ChangePoint Analyzer, version 2.2.