Recent Trends in Data Analytics for Upstream Process Workflows

BioPharm International, January 2022, Volume 35, Issue 1
Pages: 20–25

Modeling techniques can improve process control and monitoring in biopharmaceutical production.

Upstream process development in biologics has seen several improvements in robustness, productivity, and stability. The wealth of bioprocess data generated can be used by modeling and data analytics tools, such as mechanistic modeling, machine learning (ML), and artificial intelligence (AI), to gain process knowledge and make predictions. Although many applications of such techniques are found in industry, their use is not yet widespread. This review revisits the data analytics modeling methodologies for upstream processes, provides a perspective on their potential applications across the upstream workflow from process development to validation in the biopharma industry, and outlines how they can be used along the value chain to improve process robustness, process control, and process monitoring through quality by design (QbD) and process analytical technology (PAT) applications.

In addition to predictive power and improved process understanding, the application of mathematical modeling-based methods in biopharma process development offers increased visibility and agility, which can potentially improve productivity. The role of mathematical modeling for drug development with respect to QbD has already been exemplified by FDA. Furthermore, there is an increased focus on mathematical modeling approaches because of their robustness, predictive power, and the process understanding they provide, driven by the PAT and QbD concepts defined by FDA and the European Medicines Agency.

A typical end-to-end upstream process workflow consists of cell-line development, selection of appropriate clones, process development, scale-up, risk assessment, scale-down model (SDM) development, process characterization, and technology transfer to manufacturing. This review focuses only on the modeling techniques used across upstream process development for biopharmaceutical therapeutics.

Process development

Upstream process development includes conceptualization of the process train, including media and feed development, and optimization of bioreactor parameters for a successful scale-up. It also identifies and assesses critical process parameters (CPPs) that influence the critical quality attributes (CQAs) of the product through activities such as process characterization studies. Different types of bioprocess controls are available, and model predictive control (MPC) appears more promising than the others. Applications of statistical, mechanistic, machine-learning, and hybrid models in upstream process development, optimization, and characterization are summarized in Figure 1, while the available commercial technologies for upstream process development are presented in Table I.

Because of the presence of a large number of correlated decision variables and objectives, the statistical techniques are best suited for cell-culture processing and are applied for defining the design space; improving cell growth, titer, and glycosylation; performing root cause analysis; predicting CQAs; studying interactions for scale-up parameters; scaling-up/scaling-down from clone to bench-scale; and controlling process parameters across scales. During process development, scale-down models are important because they can replicate results at larger scales, where it is impractical to perform factorial experiments. Pfizer developed a system wherein they used a total of 180 micro-bioreactors to assess scaled-down parameters compared to conventional 3-L bioreactors (1). State-of-the-art statistical modeling tools like SIMCA and MODDE from Sartorius, JMP from SAS, Unscrambler from CAMO, and DesignExpert from Statease have also contributed to the ubiquity of statistical methods in the biologics industry.
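To illustrate why factorial experiments quickly become impractical at larger scales, the following minimal sketch enumerates a full factorial design. The factor names and levels are hypothetical and chosen only for illustration; they are not taken from the studies cited above.

```python
from itertools import product

# Hypothetical factors and levels (assumptions for illustration only).
factors = {
    "temperature_C": [35.0, 36.5, 37.0],
    "pH": [6.8, 7.0, 7.2],
    "feed_rate_pct": [2.0, 4.0],
}


def full_factorial(factor_levels):
    """Return every combination of factor levels as a list of run dictionaries."""
    names = list(factor_levels)
    return [dict(zip(names, combo)) for combo in product(*factor_levels.values())]


runs = full_factorial(factors)
print(len(runs))  # 3 * 3 * 2 = 18 runs
print(runs[0])
```

Each added factor multiplies the run count by its number of levels, which is one reason small-volume parallel systems are attractive for this kind of study.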

The primary purpose of mechanistic models is to develop a mechanism-based relationship between inputs and outputs, and they can be used as a predictive tool, once validated. Flux balance analysis (FBA) was applied at UCB Pharma by describing the evolution of intracellular fluxes for four industrial cell lines through a curated Chinese hamster ovary (CHO) cell genome scale metabolic model (GSM) (2), which exemplifies the utility of GSMs. Another study at Bristol-Myers Squibb demonstrated media optimization and process understanding through FBA (3).
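For readers unfamiliar with FBA, its core is commonly written as a linear program. The formulation below uses standard textbook notation that does not appear in the cited studies: S is the stoichiometric matrix (metabolites by reactions), v is the vector of reaction fluxes, c holds the objective weights (for example, a weight of one on a biomass reaction), and v_min and v_max are flux bounds.

```latex
\begin{aligned}
\max_{v}\quad & c^{\top} v \\
\text{subject to}\quad & S\,v = 0, \\
& v_{\min} \le v \le v_{\max}
\end{aligned}
```

The equality constraint expresses the steady-state assumption that intracellular metabolites do not accumulate, while the bounds encode measured uptake rates and reaction reversibility.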

In most situations, detailing every aspect through microscopic cellular models is analytically and computationally expensive. For such scenarios, macroscopic kinetic models can provide enough information to optimize processes, test hypotheses, and make predictions. The simplest unstructured-unsegregated Monod kinetics-based models remain the most commonly used, even for multicomponent CHO growth kinetics. Despite the limitations in parameter estimation methods and kinetic data, the abundance of multi-omics data makes it possible to add regulatory, signaling, and product-related information and eventually form large-scale models. These models can be applied to gain process understanding and for process optimization, process design, and process control. The kinetic model equations can also be combined with complex metabolic pathways to describe the dynamics of cell culture trends and help tackle process challenges. Capabilities demonstrated by kinetic modeling of CHO cell culture include analysis of metabolic behavior, glycosylation prediction through feed development, incorporation of temperature shifts, and explanation of glycolysis through a kinetic-metabolic model. Similar methodologies are also applied to explain the dynamics of suspension culture in other expression systems, such as baculovirus and Vero cell systems.
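As a minimal sketch of an unstructured-unsegregated Monod model, the code below simulates a batch culture with one limiting substrate using simple Euler integration. All parameter values, units, and function names are assumptions for illustration and are not fitted to any CHO data set.

```python
def monod_growth_rate(substrate, mu_max, k_s):
    """Specific growth rate (1/h) from Monod kinetics."""
    return mu_max * substrate / (k_s + substrate)


def simulate_batch(x0, s0, mu_max, k_s, y_xs, t_end, dt=0.1):
    """Euler integration of dX/dt = mu*X and dS/dt = -mu*X/Y_xs.

    Returns a list of (time, biomass, substrate) tuples.
    """
    x, s = x0, s0
    trajectory = [(0.0, x, s)]
    n_steps = int(round(t_end / dt))
    for step in range(1, n_steps + 1):
        mu = monod_growth_rate(s, mu_max, k_s)
        dx = mu * x * dt
        ds = -dx / y_xs
        if s + ds < 0.0:
            # Do not consume more substrate than is available.
            ds = -s
            dx = s * y_xs
        x += dx
        s += ds
        trajectory.append((step * dt, x, s))
    return trajectory


# Hypothetical values: biomass in 1e6 cells/mL, glucose in mM, time in h.
trajectory = simulate_batch(
    x0=0.3, s0=30.0, mu_max=0.03, k_s=1.0, y_xs=0.3, t_end=240.0, dt=0.1
)
final_time, final_biomass, final_substrate = trajectory[-1]
print(final_time, final_biomass, final_substrate)
```

Because growth and consumption are coupled through a single yield coefficient, the quantity X + Y_xs * S stays constant in this sketch, which is a quick sanity check on any implementation.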

Because mechanistic models cannot be implemented for processes that are not yet fully understood, and data-driven models are unreliable beyond the range of their input data set, hybrid modeling is a promising approach that combines the mechanistic and data-driven frameworks to obtain better flexibility and robustness. A few comprehensive reviews discuss hybrid models as an element of process systems engineering (4). Combining GSM, bioreactor kinetics, and artificial neural network (ANN) analysis, Insilico Biotechnology has claimed to have a commercial hybrid modeling workflow, also known as a ‘digital twin’, for biopharmaceutical process development (5) and clone selection (6). DataHow is another such organization; it provides stand-alone software as well as consulting services to apply deterministic knowledge-based hybrid models for process development and process monitoring. Novasign also provides software solutions for hybrid modeling based on mechanistic and statistical models, and has demonstrated the intensified design of experiments (DoE) approach on an industrial Escherichia coli production process (7).
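The hybrid idea can be sketched in a few lines: a mechanistic term captures what is understood, and a data-driven term learns what it misses. The example below is an assumption-laden toy, not any vendor's workflow. The mechanistic part is a Monod rate, the data-driven part is a straight-line fit (standing in for an ANN), and the observations are synthetic values generated from a made-up temperature effect.

```python
def monod_rate(substrate, mu_max=0.03, k_s=1.0):
    """Mechanistic part: Monod specific growth rate (hypothetical parameters)."""
    return mu_max * substrate / (k_s + substrate)


def synthetic_observed_rate(substrate, temperature):
    """Synthetic 'measurement' with a temperature effect the Monod term ignores."""
    return monod_rate(substrate) * (1.0 + 0.05 * (temperature - 37.0))


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic training conditions: (glucose in mM, temperature in deg C).
conditions = [(5.0, 35.0), (10.0, 36.0), (20.0, 37.0), (15.0, 36.5), (8.0, 35.5)]

# Data-driven part: learn the ratio of observed to mechanistic rate vs. temperature.
temperatures = [temp for _, temp in conditions]
ratios = [synthetic_observed_rate(s, temp) / monod_rate(s) for s, temp in conditions]
slope, intercept = fit_line(temperatures, ratios)


def hybrid_rate(substrate, temperature):
    """Hybrid prediction: mechanistic rate scaled by the learned correction."""
    return monod_rate(substrate) * (slope * temperature + intercept)


print(hybrid_rate(12.0, 36.2), synthetic_observed_rate(12.0, 36.2))
```

Here the correction multiplies the mechanistic output (a serial arrangement); an additive residual model is an equally common choice.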

To ensure that product heterogeneity of complex and fragile therapeutic proteins remains within predefined specifications, bioprocesses must be monitored and controlled through well-established measurement systems that enable real-time control. This monitoring is also crucial to ensure that the workflow remains within the recommendations of FDA on PAT (8), which incorporates quality risk management and process understanding. According to the location of the analytical system, bioreactor monitoring techniques can be classified as off-line, on-line, and at-line. Off-line methods for process monitoring are time-consuming and therefore cannot provide a real-time picture of the ongoing process. Ideally, the sensor should be non-invasive, non-destructive, robust, fast, sensitive, and adaptable to the dynamic conditions of the bioreactor. It should also be able to generate good-quality multivariate data for data analysis techniques like chemometrics. Several spectroscopic techniques, such as ultraviolet-visible (UV-vis), near-infrared (NIR), mid-infrared (MIR), dielectric spectroscopy, Raman spectroscopy, and fluorescence spectroscopy, have been investigated for their usefulness in bioprocess monitoring. Of these, NIR and Raman spectroscopy are the most popular in mammalian cell culture (9,10) and cell therapy (11). A direct comparison between NIR and Raman reveals the superiority of the latter in terms of sensitivity, selectivity, and limit of detection. However, because NIR is more precise but less accurate, it is less affected than Raman by minor perturbations in spectra (12).

Several monitoring systems, such as ProCellics by Merck KGaA, Raman RXN2 analyzer by Kaiser, Matrix-F FT-NIR by Bruker, MB 3600 analyzer series by ABB, EVO i200 biomass system by Hamilton, InVia confocal Raman microscope by Renishaw, and NIRSystems Process Analytics process spectrophotometer by Foss Analytics, as well as analysis software, such as Bio4C PAT Raman Software by Merck KGaA, iC Raman by Kaiser, and SIMATIC SIPAT by Siemens, are available in the market.

Many advances have been made in the real-time monitoring of CPPs, such as glucose, lactate, cell viability, amino acids, post-translational modifications (e.g., N-glycosylation), and even viral titers. The large spectral datasets produced by such spectroscopic measurements can be used to extract information through multivariate data analysis (MVDA). Typically, predictive models are built using partial least squares (PLS) regression or principal component analysis (PCA), both of which are readily available in instrument software. Biogen has patented a PLS-based method that uses Raman spectra to monitor manufacturing-scale bioreactors up to 4000 L (13), and another publication describes product quality control using a feedback-loop process automation platform for glucose (14). Similar to the Raman-based PLS model, NIR spectroscopy is also used for online glucose monitoring during bioreactor scale-up. In addition to statistical tools, mechanistic models and machine learning methods, such as support vector machines and neural networks, have also been implemented in recent years. One study at Janssen found that Cubist performed better than other statistical and machine learning models (15), which shows that the performance of prediction algorithms is strongly dependent on the characteristics of the data and that the choice should be made holistically. Amgen has proposed and validated an automated machine learning-based approach for calibration, assessment, and maintenance of Raman models (16). However, one of the most interesting concepts is the application of hybrid models for bioprocess monitoring and control. Early applications for mammalian cell cultures date to the mid-1990s and were based on ANN, Monod kinetics, and fuzzy logic. Other research groups have used ANN to compensate for prediction errors of first-principle models.
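To make the PLS step concrete, the following sketch implements single-response PLS (PLS1, NIPALS-style deflation) in plain Python and calibrates it on synthetic four-channel "spectra". The pure-component spectrum, baseline offsets, and concentrations are invented for illustration; in practice the PLS implementation in the instrument software or a statistics package would be used instead.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_pls1(spectra, y, n_components):
    """Fit PLS1 by successive deflation; returns means and (w, p, q) per component."""
    n = len(spectra)
    x_mean = [sum(col) / n for col in zip(*spectra)]
    y_mean = sum(y) / n
    x_res = [[v - m for v, m in zip(row, x_mean)] for row in spectra]
    y_res = [v - y_mean for v in y]
    components = []
    for _ in range(n_components):
        w = [dot(col, y_res) for col in zip(*x_res)]  # weights: X'y
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:
            break  # nothing left to explain
        w = [v / norm for v in w]
        t = [dot(row, w) for row in x_res]  # scores
        tt = dot(t, t)
        p = [dot(col, t) / tt for col in zip(*x_res)]  # X loadings
        q = dot(y_res, t) / tt  # y loading
        x_res = [[v - ti * pj for v, pj in zip(row, p)] for row, ti in zip(x_res, t)]
        y_res = [yi - q * ti for yi, ti in zip(y_res, t)]
        components.append((w, p, q))
    return {"x_mean": x_mean, "y_mean": y_mean, "components": components}


def predict_pls1(model, spectrum):
    """Predict the response for one spectrum using the same deflation sequence."""
    x = [v - m for v, m in zip(spectrum, model["x_mean"])]
    y_hat = model["y_mean"]
    for w, p, q in model["components"]:
        t = dot(x, w)
        y_hat += q * t
        x = [v - t * pj for v, pj in zip(x, p)]
    return y_hat


# Synthetic data: signal = concentration * pure spectrum + a flat baseline offset.
PURE_SPECTRUM = [0.1, 0.5, 1.0, 0.3]


def make_spectrum(concentration, baseline):
    return [concentration * s + baseline for s in PURE_SPECTRUM]


train_conc = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train_base = [0.3, 0.1, 0.4, 0.2, 0.5, 0.0]
train_spectra = [make_spectrum(c, b) for c, b in zip(train_conc, train_base)]

model = fit_pls1(train_spectra, train_conc, n_components=2)
predicted = predict_pls1(model, make_spectrum(3.5, 0.25))
print(predicted)
```

Two latent variables are enough here because the synthetic spectra vary in only two ways (concentration and baseline); real spectra need the number of components chosen by cross-validation.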

Conclusions and future prospects

Surveys have pointed out the impact of laborious data pre-processing steps in the data analytics chain, which become even more tedious for bioprocessing data, where high levels of heterogeneity are present. To improve productivity over paper processes, electronic laboratory notebooks (ELNs) integrated into laboratory information management systems (LIMS) have now become the industry norm to document experiments, find and reuse information, and promote efficient collaboration. Initiatives like Findable, Accessible, Interoperable, Reusable (FAIR) involving major biopharmaceutical players are already in place (17).

Early-stage cell-line development is boosted by rapid clone screening platforms (e.g., ClonePix, Thermo Fisher) and high-throughput productivity and CQA analysis platforms (e.g., LabChip, Perkin Elmer, and Octet, Sartorius), which make the clone screening process more robust and efficient. The high-dimensional data generated by such instruments can be used for multivariate/statistical modeling, and deep learning models can additionally be used to select high-performance clones. Various bottom-up mechanistic approaches, such as constraint-based modeling (CBM) and omics-based technologies, have been proposed. However, obtaining industry-relevant output through modeling techniques is still in its infancy. In this regard, cybernetic modeling approaches (18), which have already shown success for microbial systems, can also be implemented for mammalian systems. More efforts are expected in robust multi-objective modeling and predictive computational frameworks for glycosylation optimization, which is a critical component in biosimilar upstream development.

The data-driven modeling methodologies described in this article for optimization, monitoring, and control are largely manual and require considerable human intervention. Moreover, the objectives and the data generated vary widely at every step of the drug development process, which poses a challenge because the default hyperparameters (e.g., number of hidden layers and nodes) of machine learning or hybrid models are often suboptimal for a given problem. The conventional procedure of manually trying different hyperparameter combinations can be ineffective for complex problems. A modeling platform incorporating data preprocessing, feature extraction, model selection, and hyperparameter optimization can be implemented through automated machine learning in R&D as well as biomanufacturing.
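A minimal sketch of automated hyperparameter selection is shown below. It replaces manual trial and error with a grid search scored by leave-one-out cross-validation. For brevity the model is a k-nearest-neighbour regressor, so the hyperparameter is the neighbour count k rather than the hidden layers and nodes of a neural network, and the data are synthetic.

```python
def knn_predict(train, x_new, k):
    """Average the responses of the k training points closest to x_new."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k


def loo_cv_error(data, k):
    """Mean squared error under leave-one-out cross-validation."""
    total = 0.0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]
        total += (knn_predict(train, x, k) - y) ** 2
    return total / len(data)


def grid_search(data, candidates):
    """Score every candidate and return the best one with all scores."""
    scores = {k: loo_cv_error(data, k) for k in candidates}
    best = min(scores, key=scores.get)
    return best, scores


# Synthetic data: a linear trend with an alternating disturbance.
data = [(float(i), 0.5 * i + (0.4 if i % 2 else -0.4)) for i in range(15)]
candidates = [1, 2, 3, 5, 7]

best_k, scores = grid_search(data, candidates)
print(best_k, scores)
```

The same loop structure extends to several hyperparameters by iterating over their combinations, and to random or Bayesian search by changing how candidates are proposed.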

For process monitoring, process development, and process control, some recent unconventional process control strategies, such as a novel glucose control strategy through oxygen transfer rate, a gassing-based pH control strategy, and a lactate-based feeding strategy, could be explored even further. Up-and-coming monitoring technologies like dielectric spectroscopy have shown promise for biomanufacturing. More complex deep learning algorithms can be used for soft sensor modeling to account for perturbations in monitoring bioprocesses. Novel strategies based on hybrid modeling and oxygen transfer flux can be applied alongside standard practices for scaling-up. The most appropriate combination of monitoring technique, instrument sensitivity, and modeling algorithm should be selected for best results. This selection is more crucial for continuous manufacturing, for which measurement, monitoring, and control tools must be highly robust and accurate.

The application of computational fluid dynamics (CFD) and compartment modeling would benefit organizations during scale-down, scale-up, and technology transfer, which are often hampered by poorly understood fluid dynamics.

Although other industries have readily adopted digital twins, they remain scarce in the biopharma industry. The application of digital technologies will reduce the capital expenditure of drug development and manufacturing by reducing experimentation and timelines, improving control and knowledge, and overcoming regulatory bottlenecks. Advances in allied fields, such as software and monitoring instrumentation, will directly affect their successful implementation. As highlighted in Figure 2, the traditional and innovative technologies are categorized based on their present state of applicability. Industry needs to develop synergy between bioprocess subject matter experts, automation engineers, and data scientists for the smooth implementation of these technologies.


References

1. R. Legmann et al., Biotechnol. Bioeng. 104 (6), 1107–20 (2009).
2. C. Calmels et al., Metab. Eng. Commun. 9 (March) (2019).
3. Z. Huang et al., Biochem. Eng. J. 160 (January) 107638 (2020).
4. S. Badr and H. Sugiyama, Curr. Opin. Chem. Eng. 27, 121–128 (2020).
5. S. Nargund and K. Mauch, “Hybrid Models for Biopharmaceutical Process Development—Making the Best of Mechanistic Know-How and Data-Driven Insights,”, Jan. 7, 2020.
6. O. Popp et al., Biotechnol. Bioeng. 113 (9) 2005–2019 (2016).
7. B. Bayer, G. Striedner, and M. Duerkop, Biotechnol. J. 15 (9) 1–10 (2020).
8. FDA, Guidance for Industry, PAT—A Framework for Innovative Pharmaceutical Development, Manufacturing, and Quality Assurance (CDER, October 2004).
9. M.R. Riley, C.D. Okeson, and B.L. Frazier, Biotechnol. Prog. 15 (6) 1133–1141 (1999).
10. R.C. Rowland-Jones et al., Biotechnol. Prog. 33 (2) 337–346 (2017).
11. M.O. Baradez et al., Front. Med. 5 (MAR) (2018).
12. M. Li et al., Biochem. Eng. J. 137, 205–213 (2018).
13. B. Berry and J. Moretto, “Cross-Scale Modeling of Bioreactor Cultures Using Raman Spectroscopy,” US patent 10,563,163, February 2020.
14. B.N. Berry et al., Biotechnol. Prog. 32 (1), 224–234 (2016).
15. C. Rafferty et al., Biotechnol. Prog. 36 (4) (2020).
16. A. Tulsyan et al., Biotechnol. Bioeng. 117 (2) 406–416 (2020).
17. J. Wise et al., Drug Discov. Today 24 (4) 933–938 (2019).
18. L. Aboulmouna et al., Curr. Opin. Chem. Eng. 30 (December) 120–127 (2020).

About the authors

Prashant Pokhriyal is associate scientist; Prateek Gupta is senior general manager; Sridevi Khambhampaty is vice-president; Rajesh Ullanat is executive vice-president; and Mili Pathak* is principal scientist, Process Science, R&D; all at Intas Pharmaceuticals Ltd. (Biopharma Division), Ahmedabad, Gujarat, India.

*To whom all correspondence should be addressed.


When referring to this article, please cite it as P. Pokhriyal, et al., “Recent Trends in Data Analytics for Upstream Process Workflows,” BioPharm International 35 (1) 20–25 (2022).