Characterizing a Bioprocess with Advanced Data Analytics

Modeling at various stages of the data analytics continuum aids scale comparison of a bioreactor.
Mar 01, 2018
Volume 31, Issue 3, pg 18–23

Advanced data analytics tools are used by industry to find the golden nuggets in historical data, to aid in process development, to fine-tune production, and to achieve long-term improvements in product quality and throughput. In recent years, four key stages of data analytics have been defined (1), which can be seen as parts of a data analytics continuum (see Figure 1). Handling these stages diligently with advanced data analytics tools is expected to give manufacturers a competitive edge.

Along the data analytics continuum (described in detail in Figure 1), the most advanced challenge is being able to predict what will happen in the future and, in the event of an undesirable outcome, prescribe certain activities or interventions to prevent it from happening. Although looking into the future is of greatest commercial interest, value is created at every stage of data analytics; how much depends on the specific need and on the tools and approach applied to the analytics process.

Figure 1. Illustration of the data analytics continuum. At the base level of descriptive analytics, data aggregation and data mining provide insight into the past and answer: "What has happened?" Diagnostic analytics, characterized by drill-down, data discovery, mining, and correlations, examines data or content to answer the question: "Why did it happen?" Predictive analytics uses statistical models and forecast techniques to understand the future and answer: "What could happen?" At the pinnacle of data analytics, prescriptive analytics uses optimization and simulation algorithms to advise on possible outcomes and answer: "What should we do?" or "How can we avoid this happening?" All figures are courtesy of the authors.

In the bioprocessing industry, some or all stages of the data analytics continuum are applicable. In early-stage development, advanced data analytics is used to compare batches, for example, to figure out how to modify cell expansion steps so that they lead to higher cell densities and product titers. In late-stage development, advanced data analytics is used when scaling up manufacturing processes to verify comparable performance at different scales. And in full production, advanced data analytics is used for real-time bioprocess monitoring and early fault detection of batches deviating from good, normal operating conditions.

A primary objective in bioprocess development and scale-up is to accomplish a consistent, uniform, and predictable environment across scales. The following case study describes the essentials of a bioprocess characterization and scale comparison project at various stages of the data analytics continuum.

Case study

In total, a development project for a fed-batch cell-culture biological process encompassed 75 batches. Metabolite measurements were taken once per day. The metabolite data included cell performance metrics, selected process conditions, and nutrient concentrations. In addition to the batch evolution trajectory measurements, initial and final batch conditions were registered. Some calculated parameters were added to the trajectory data to enhance information content and assist in interpretation. These data come from three operating scales (micro, micro+, and pilot) and include both runs at stable operating conditions and some runs planned using design of experiments (DOE) (2).

Descriptive analytics

There are multiple uses of these types of data. In the descriptive analytics stage, data analytics tools such as partial least squares (PLS) or orthogonal partial least squares (OPLS) can be used for batch evolution modeling to understand what happened within and between batches during the batch lifetime (3). Popular graphic representations involve control charts displaying time evolution trajectories of important process variables or model statistics to diagnose which batches conform to good normal developmental trajectories and which batches do not (see Figure 2).

Figure 2. Control chart from a data analytics model capturing batch evolution. The horizontal axis shows batch lifetime in days. The vertical axis is the charted statistic (t), which is a parameter of the data analytics model. The red dashed lines represent upper and lower control limits. Each batch is represented by a single line. As long as a batch line stays within the control limits, normal operating conditions are inferred.
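
The logic behind a control chart of this kind can be sketched in a few lines. The example below is only an illustration under stated assumptions: the limits are set at the mean plus or minus three standard deviations of the charted statistic across reference batches at each time point (the article does not state how its limits were derived), and the trajectories are made-up numbers, not the case study data. The function names are hypothetical.

```python
from statistics import mean, stdev

def control_limits(reference, n_sigma=3.0):
    """Per-time-point (lower, center, upper) limits from reference batches.

    reference: list of batches, each a list of the charted statistic per day.
    Assumption: limits are center +/- n_sigma standard deviations.
    """
    limits = []
    for day_values in zip(*reference):  # one tuple per day, across batches
        center = mean(day_values)
        spread = stdev(day_values)
        limits.append((center - n_sigma * spread, center, center + n_sigma * spread))
    return limits

def out_of_control_days(batch, limits):
    """Return the day indices at which a batch leaves the control limits."""
    return [day for day, (value, (lo, _, hi)) in enumerate(zip(batch, limits))
            if value < lo or value > hi]

# Synthetic score trajectories (hypothetical values, not the article's data).
reference_batches = [
    [-2.0, -1.0, 0.1, 1.0, 2.1],
    [-2.2, -0.9, 0.0, 1.1, 1.9],
    [-1.9, -1.1, -0.1, 0.9, 2.0],
    [-2.1, -1.0, 0.2, 1.0, 2.2],
]
limits = control_limits(reference_batches)

new_batch = [-2.0, -1.0, 0.1, 2.5, 2.0]  # deviates at day index 3
flagged = out_of_control_days(new_batch, limits)
print(flagged)
```

A batch whose list of flagged days is empty would be read as conforming to normal operating conditions in the sense of the chart.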

Another perspective often used in descriptive analytics is to compare batch-to-batch variations by representing the complete batch trajectory in a single summary. An often-used data analytics tool for this comparison is principal component analysis (PCA) (3). A typical output from a PCA is shown in Figure 3, in which color coding represents batch scale and marker size represents peak viable cell density (VCD). The pilot scale (green markers) has the most consistent operation with high peak VCD: its markers are all of similar size and lie close together. The micro scale (blue markers) is most similar to the pilot scale. It has consistent operation, but there is considerable variation in peak VCD, with some very low VCD runs shown by the smaller markers. The micro+ scale (red markers) demonstrates extensive variation in its process performance, with large variation in peak VCD.

Figure 3. Scatter graph arising from a batch-scale comparison model. Each plotted point represents one completed batch. Points that are close to each other correspond to batches of similar process operating conditions. Points that are far away from each other correspond to batches of very different process operating conditions. Symbol color indicates batch scale: blue is micro, red is micro+, and green is pilot. Symbol size indicates final peak viable cell density (VCD). In particular, there is a conspicuous deviation on the micro+ scale for some batches marked by the frame at the top of the graph.
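
A score plot like this one comes from projecting one summary row per batch onto a few principal components. The following is a minimal sketch, assuming each batch has already been condensed into a single row of variables and that the data are autoscaled before the decomposition; it uses NumPy's singular value decomposition rather than any particular commercial package, and the two groups are synthetic stand-ins for operating scales.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Autoscale X (rows = batches) and return scores, loadings, explained variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0                      # guard against constant columns
    Z = (X - mean) / std
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    loadings = Vt[:n_components].T
    explained = (S ** 2) / np.sum(S ** 2)    # fraction of variance per component
    return scores, loadings, explained[:n_components]

# Synthetic batch summaries (hypothetical): two groups of 10 batches, 6 variables.
rng = np.random.default_rng(0)
scale_a = rng.normal(0.0, 1.0, size=(10, 6))
scale_b = rng.normal(0.0, 1.0, size=(10, 6)) + 6.0   # shifted operating conditions
X = np.vstack([scale_a, scale_b])

scores, loadings, explained = pca_scores(X)
# Batches from the two groups fall on opposite sides of the first component.
print(scores[:10, 0].mean(), scores[10:, 0].mean())
```

Plotting the two score columns against each other, colored by group, gives a scatter graph of the same type as Figure 3.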

Diagnostic analytics

Subsequent to descriptive analytics, diagnostic analytics can be performed. In this case study, the batch-scale comparison model is interrogated to uncover why the differences detected (shown in Figure 3) occurred. A data scientist might be interested here in comparing batches between scales. Given the deviating cluster on the micro+ scale, an equally relevant scenario would be to first try to interpret the split of the batches on this scale. Regardless of the interpretation objective, similar diagnostic pathways are used.

Figure 4 shows a plot often used in diagnostic analytics, known as a contribution plot. This plot was created to find sources of variation contributing to the deviating cluster of micro+ batches framed at the top of Figure 3. Figure 4 shows, for example, that this cluster of batches started with a higher-than-average sodium trajectory (brown line) and a lower-than-average osmolality profile (purple line); other interesting features are also revealed. Conceptually similar plots can be readily created to interpret scale differences.

Figure 4. Contribution plot visualizing process parameter contribution to the discovered deviation for the framed batches shown in Figure 3. Parameter lines far away from zero (both in the positive and negative numerical direction) point to variables that contribute to the deviation. These variables should be considered when taking action to remove batch and/or scale differences.
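
One simple way to obtain numbers of this kind is to express, for each variable, how far the deviating group's average lies from the overall average in units of standard deviation. The sketch below assumes exactly that simplified definition; software implementations may additionally weight the differences by model loadings, and the article does not specify which variant was used. The variable names and values are invented for illustration.

```python
import numpy as np

def group_contributions(X, group_idx, variable_names):
    """Scaled difference between a group's mean and the overall mean, per variable."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0                       # guard against constant columns
    diff = (X[group_idx].mean(axis=0) - mean) / std
    return dict(zip(variable_names, diff))

# Hypothetical batch summaries: columns are sodium, osmolality, glucose.
names = ["sodium", "osmolality", "glucose"]
X = np.array([
    [140, 300, 5.0],
    [141, 302, 5.2],
    [139, 298, 4.9],
    [140, 301, 5.1],
    [142, 299, 5.0],
    [141, 300, 4.8],
    [150, 280, 5.1],   # last three rows form the deviating group
    [151, 279, 4.9],
    [149, 281, 5.0],
])
deviating = [6, 7, 8]

contrib = group_contributions(X, deviating, names)
# Rank variables by how strongly they separate the group from the rest.
ranked = sorted(contrib, key=lambda name: abs(contrib[name]), reverse=True)
print(ranked)
```

Variables at the top of the ranking, whether positive or negative, are the ones to examine first, which mirrors how the lines far from zero are read in a contribution plot.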

Predictive analytics

The third stage of the data analytics continuum corresponds to predictive analytics. The term predictive analytics may carry different connotations for different people. In one sense, predictive analytics may refer to the act of, given a fresh set of batch process measurements, “classifying” (i.e., predicting) the currently evolving batch against a process monitoring model to uncover whether this batch conforms to, or deviates from, normal operating conditions. In another sense, predictive analytics may mean an attempt to explain (i.e., predict) final batch critical quality attributes (CQAs), such as peak VCD or titer, in terms of their correlations to batch critical process parameters (CPPs). Irrespective of the connotation of the term, predictive analytics to find out what will happen is conveniently carried out by using tools like DOE, PCA, PLS, and OPLS (2, 3).
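
The second sense, predicting a final quality attribute from process parameters, can be illustrated with a small PLS regression. The sketch below implements single-response PLS with the classical NIPALS iterations in NumPy; it is a teaching example under stated assumptions (mean-centering only, no unit-variance scaling, no cross-validation), not the authors' modeling workflow. The data are synthetic: the four columns stand in for hypothetical CPPs and the response for a hypothetical CQA.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model with NIPALS.

    Returns regression coefficients for mean-centered data plus the means
    needed to center new observations.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean            # working residuals
    W, P, q = [], [], []
    for _ in range(n_components):
        w = E.T @ f                          # weight vector from covariance with y
        w = w / np.linalg.norm(w)
        t = E @ w                            # scores
        tt = t @ t
        p = E.T @ t / tt                     # X loadings
        c = (f @ t) / tt                     # y loading
        E = E - np.outer(t, p)               # deflate X
        f = f - c * t                        # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.column_stack(W), np.column_stack(P), np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)   # coefficients in original variable space
    return coef, x_mean, y_mean

def pls1_predict(X_new, coef, x_mean, y_mean):
    """Predict the response for new rows using the fitted coefficients."""
    return (X_new - x_mean) @ coef + y_mean

# Synthetic data (hypothetical): 30 batches, 4 process parameters, 1 response.
rng = np.random.default_rng(42)
X = rng.normal(size=(30, 4))
true_coef = np.array([1.0, -0.5, 0.0, 0.25])
y = X @ true_coef + rng.normal(scale=0.05, size=30)

coef, x_mean, y_mean = pls1_fit(X, y, n_components=2)
y_hat = pls1_predict(X, coef, x_mean, y_mean)
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(round(r2, 3))
```

In practice the number of components would be chosen by cross-validation, and the fitted model would be applied to batches that were not used for fitting.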

Prescriptive analytics

Prescriptive analytics is the last stage of the data analytics continuum. At this stage, the focus lies on forecasting the future, either to drive manufacturing toward an optimal situation or to prevent certain undesirable process hiccups from happening. In real-time applications, the stages of predictive and prescriptive analytics are often intimately linked (4). According to reference 4, “one approach that has become common in biologics manufacturing is the use of multivariate modeling and calibrate models to determine how a process should run. We turn such models into control charts and track a process as it is running.” Beyond this real-time monitoring step, reference 4 describes model predictive control tools, which are available in the form of a real-time bioprocess forecasting, control, and optimization toolbox. With this toolbox, “smart changes” (i.e., an “advised future”) to setpoints of bioprocess CPPs that result in optimized production are given.
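
The idea of advising setpoints from a model can be shown with a much simpler stand-in than model predictive control: an exhaustive search over candidate setpoints, scored by a predictive model. Everything in the sketch below is an assumption made for illustration, including the response surface, the parameter names, and the candidate ranges; it is not the toolbox described in reference 4.

```python
def predicted_titer(temperature_c, ph):
    """Hypothetical response surface standing in for a fitted process model."""
    return 5.0 - 0.5 * (temperature_c - 36.5) ** 2 - 8.0 * (ph - 7.0) ** 2

def advise_setpoints(model, temperature_grid, ph_grid):
    """Grid search: return the candidate setpoints with the highest prediction."""
    best = max((model(t, p), t, p) for t in temperature_grid for p in ph_grid)
    return {"predicted": best[0], "temperature_c": best[1], "ph": best[2]}

# Candidate setpoints within assumed operating ranges.
temperature_grid = [35.0 + 0.5 * i for i in range(7)]       # 35.0 to 38.0
ph_grid = [round(6.8 + 0.1 * i, 1) for i in range(5)]       # 6.8 to 7.2

advice = advise_setpoints(predicted_titer, temperature_grid, ph_grid)
print(advice)
```

A real-time controller would repeat such an optimization as new measurements arrive and would account for process dynamics and constraints, which this static search does not.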


Conclusion

The data analytics continuum is a way to concatenate the four major steps of data analytics into an organized and chronological work order. Alongside this continuum, the dimensions of “business value” and “complexity” are often appended to further enrich the understanding of the concept. The continuum is general in the sense that its stages apply to any type of data (e.g., production, finance, medical), and the different stages may be addressed using different types of data analytical tools. For example, as shown in this article, the various stages of bioprocess characterization and monitoring are stratifiable according to the steps of the continuum.

This article illustrates how evaluation of batch type data, including initial and final conditions and batch evolution trajectories, can be accomplished using projection methods such as PCA, PLS, and OPLS. Such methods, preferably prudently combined with DOE, enable powerful scale-up modeling possibilities of bioreactors to support process development and improvement in a predictable, timely, and cost-effective manner. The main conclusion of the use case cited was that micro- and pilot-scale batches performed similarly, albeit with some variability in peak VCD, and that the batches of the micro+ scale deviated substantially and with a strong clustering among themselves. By using contribution-based diagnostics, reasons for scale differences and clustering were visualized, interpreted, and mitigated.


References

1. Gartner, “2017 Planning Guide for Data and Analytics,” (accessed Jan. 31, 2018).

2. L. Eriksson, et al., Design of Experiments: Principles and Applications (MKS Umetrics AB, Sweden, 3rd Ed., January 2008).

3. L. Eriksson, et al., Multi- and Megavariate Data Analysis: Basic Principles and Applications (MKS Umetrics AB, Sweden, 3rd Ed., May 2013).

4. C. McCready, Bioprocess International (November 2017) (accessed Jan. 31, 2018). 

Article Details

BioPharm International
Vol. 31, No. 3
Pages: 18–23


When referring to this article, please cite it as L. Eriksson and C. McCready, "Characterizing a Bioprocess with Advanced Data Analytics," BioPharm International 31 (3) 2018.

About the authors

Lennart Eriksson, PhD, is senior lecturer and data scientist, and Chris McCready is lead data scientist, both at Sartorius Stedim Data Analytics.
