Nurturing Knowledge from Disparate Data Streams

February 1, 2019

BioPharm International

Volume 32, Issue 2

Page Number: 10–14

Leveraging vast quantities of analytical data requires digitalization and platform integration.

Analytical instruments are increasing in sensitivity and accuracy. Pharmaceutical methods applied today generate increasing quantities of data. These data are stored and manipulated using a variety of software platforms from different suppliers. Pharmaceutical scientists must be able to integrate and interpret these data in an efficient manner that ensures reliability, security, and regulatory compliance.

Digitalization creates difficulties

Today’s laboratory is highly dependent on a range of instrumental technologies, all reliant on data output that has to be interpreted, managed, and maintained in a retrievable, secure, and non-corruptible manner, according to Mark Rogers, global technical director at SGS Life Sciences. The volume and complexity of these data make the move to electronic capture inescapable, he adds.

For many of these techniques, however, the inevitable evolution from paper-based data collection procedures to electronic systems has not been without difficulties. “It may be argued, for example, that the relatively simple process of recovering and reviewing chart recordings from a well-maintained paper archive has advantages over the recovery of data from older software/operating platforms that may have become obsolete, making data difficult to retrieve,” Rogers explains.

Regulatory considerations have also now become intrinsically linked to the traceability, security, and non-corruptibility of scientific data, again resulting in difficulties with older equipment, he notes. “The electronic modernization of older, but well-established and effective, equipment may be one of the most significant and ongoing challenges,” asserts Rogers.

In addition, despite advances in the digitalization of analytical data, significant human intervention and manipulation are still required during any decision-making process based on data acquisition. For lot release, once a specification (identity, purity, physical form, etc.) has been established and test methods validated for a drug substance, there are many human intervention steps involved to get from a request for release to fulfilling that request, explains Andrew Anderson, vice-president of innovation and informatics strategy at ACD/Labs. Spectra must be collected and separation analyses performed. The data obtained must be rendered as a table, chart, or other picture that describes the results, which must then be compared to reference standards to determine whether they conform.

“Many of these steps cannot be digitized or even easily automated. Bringing data back into the decision-making mechanism is very difficult without specialized informatic tools that handle all different types of data. This challenge is the greatest latent pain for development organizations,” Anderson asserts.

Volume and complexity

There will also always be challenges with getting the right data, according to Anderson. “There are trade-offs between project timelines and the data that can possibly be generated and ultimately handled by existing infrastructure,” he explains.

One example relates to the move to a quality-by-design approach to process development. With this approach comes a geometric expansion in the quantity of data being acquired to characterize a process. Each possible parameter variant requires a data point for the modeling system used to determine the optimum operating conditions.
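The growth described above can be illustrated with a short sketch. It assumes a full-factorial design, in which every combination of parameter levels is one experimental run; the article does not specify a design type, and the parameter names and levels below are hypothetical.

```python
from itertools import product


def full_factorial(levels_per_factor):
    """Return one run (a dict of settings) per combination of factor levels."""
    names = list(levels_per_factor)
    level_lists = [levels_per_factor[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*level_lists)]


# Hypothetical process parameters, three levels each (illustrative values only).
factors = {
    "temperature_C": [30, 35, 40],
    "pH": [6.8, 7.0, 7.2],
    "feed_rate_mL_per_h": [5, 10, 15],
}

runs = full_factorial(factors)
print(len(runs))  # 3 * 3 * 3 = 27 runs

# Adding a fourth three-level parameter triples the number of runs.
factors["agitation_rpm"] = [100, 150, 200]
runs_four = full_factorial(factors)
print(len(runs_four))  # 3 ** 4 = 81 runs
```

Each added parameter multiplies the run count by its number of levels, and every run typically yields several analytical data sets. Designs that sample only part of this space exist to limit that growth, at the cost of resolving fewer interactions.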

In addition, the data itself is becoming more complex. A nuclear magnetic resonance (NMR) spectrum provides a significant amount of detailed information about a given molecule and is therefore attractive for small-molecule API proof-of-structure analysis. Interpretation of this type of spectrum, such as comparison with a reference standard to confirm identity, requires an expert, however. “A method for plant operators to use for determining that a lot spectrum conforms to a standard should be rugged and allow for simple acquisition and validation. Ultraviolet (UV) analysis is one example. Such a method leaves some doubt, however; NMR provides more fidelity, so there is a tradeoff between absolute confirmation and practicality,” Anderson observes.

The complexity of analytical experiments will, in fact, always outpace the ability to implement them in quality processes, he adds. “There are often challenges to implementing more complex analytical techniques in a quality paradigm. All of the aspects of quality assurance are increasing in complexity: software systems and how they handle data, quality practices, standard operating procedures, and test methods. Each of these factors must be considered to ensure that drug products introduced to the market are of the highest quality,” notes Anderson.

It is not only the volume and complexity of raw data being generated that is a key issue for Marco Galesio, R&D team leader in the analytical chemistry development group at Hovione, but the lack of specific and user-friendly software to treat and interpret these data. “Most of the time there is a lot of relevant data being generated, but due to lack of suitable software and sufficient time, it is not possible to take the most out of it,” he says.



Bigger challenges for biologics

Biologic drugs pose some additional challenges beyond those associated with chemical APIs. Small molecules have one structure and composition. Because biologic drug substances are produced in cells or microorganisms, their structures and compositions can vary slightly from batch to batch, leading to inherent variability, according to Tiffani A. Manolis, senior director of Agilent Technologies’ global pharma strategic program.

Added to this variability is the complexity of large-molecule structures, which necessitates the use of a greater number of more complex analytical methodologies, according to Íñigo Rodríguez-Mendieta, technical client manager biopharmaceuticals, SGS Life Sciences. In most cases, specific expertise is required to interpret large-molecule data, notes Constança Cacela, associate director of R&D in the analytical chemistry development group at Hovione.

The necessity to fully define the relatively complex and diverse structural features of large molecule entities has led to expansion in the variability and sophistication of the analytical tools in the biologics field, according to Rodríguez-Mendieta. “Many of these techniques have traditionally been employed in the academic sector where data interpretation relies heavily on subjective manual review, and this is often difficult to translate to machine-based understanding,” he says. Approaches to digitalization have included automated reference to library data as in the case of circular dichroism (CD) and/or the application of complex algorithms as in the case of mass spectrometry (MS)-based sequencing. “None of these approaches has yet been perfected, however, particularly for large-molecule applications where structural diversity can prove a significant impediment to accurate interpretation,” adds Rodríguez-Mendieta.

The key challenge lies in the fact that with biologic molecules, identity has various meanings related to structure (primary, secondary, etc.), post-translational modifications (PTMs), etc. “Orthogonal analyses are required to ensure full characterization. Chromatography may be used to determine purity, while other methods must be used to determine the identity (sequence, location of PTMs, etc.). Structural assays are often augmented with behavior immunoassays. All of these data must be ‘assembled’ together to create the complete picture. That can be challenging to do digitally, because orthogonal data are often generated in a variety of vendor formats,” Anderson explains.

The same issues exist for biosimilar comparability studies, Rodríguez-Mendieta notes. “Results have to be considered holistically, which remains a considerable challenge, as the data can be somewhat contradictory and often produced from a wide range of techniques with instrumentation from multiple manufacturers with different data platforms,” he says.

Next-generation drug substances can complicate the situation even further. Anderson points to antibody-drug conjugates (ADCs) as an example. “ADCs comprise an antibody, a small-molecule cytotoxic payload, and a linker of some kind, which can be an oligomer or other small to medium-sized component. The dynamic range of MS, NMR, or other instruments must therefore be sufficiently wide to allow analysis of small and large molecules, and of course also validatable,” he explains.

Difficulties in MS and NMR

Separation techniques such as gas chromatography (GC), high-performance liquid chromatography (HPLC), capillary electrophoresis (CE), and gel electrophoresis, when coupled to UV, Raman, infrared, and fluorescence detection, do not by themselves provide sufficient data, particularly for large molecules, according to Cacela. “Data generated by these techniques must often be validated by more comprehensive techniques such as mass spectrometry,” she notes.

In general, however, LC–MS methods, especially high-resolution MS (quadrupole time-of-flight [Q-TOF], triple quadrupole, etc.), generate overwhelmingly large data files, according to Manolis. “Data sets are becoming very large and pose challenges to any IT [information technology] infrastructure as they are transmitted across networks for storing and analysis,” she observes.

An additional challenge is related to data interpretation. The acquisition of MS data occurs in much less time than it takes to interpret it, according to Anderson. “With TOF instruments it is possible to gather detailed spectra in just three minutes, so the technique can be useful for screening experiments or monitoring the progress of a synthetic reaction or cell culture. The data files that result can be a gigabyte in size, which is challenging not only from a hard disc perspective, but for interpretation, which can take much longer,” he comments. Furthermore, deep expertise is required to accurately evaluate and filter the meaningful data, Cacela adds.

Similarly, the volume of data generated in NMR and surface plasmon resonance analyses can be challenging to manage.

Issues with data abstraction

Another challenge given the volume and complexity of data generated today is the need for some sort of data abstraction in order to be able to efficiently leverage analytical data in the decision-making process. “Very large data sets need to be transformed in some way to make them manageable and interpretable in a practical and timely manner. The difficulty, however, is the potential for loss of information,” asserts Anderson.

Reduction of data creates risk from a total analysis point of view, he adds. Centroided MS data vs. full-profile MS data is one example. The former involves a mathematical transformation of the data that provides simplification and easier interpretation, but the risk of missing important details must be weighed against that simplification.
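A minimal sketch of that trade-off follows. It assumes a deliberately simple centroiding rule (each contiguous run of points above a threshold is replaced by its intensity-weighted mean m/z and summed intensity); instrument vendors use their own, more elaborate algorithms, and the function name and spectrum values here are hypothetical.

```python
def centroid(mz, intensity, threshold=0.0):
    """Reduce a profile spectrum to a list of (m/z, total intensity) centroids.

    Simplifying assumption: a peak is any contiguous run of points whose
    intensity exceeds `threshold`.
    """
    peaks = []
    run = []  # (m/z, intensity) points of the peak currently being read

    def close_run():
        if run:
            total = sum(i for _, i in run)
            weighted_mz = sum(m * i for m, i in run) / total
            peaks.append((weighted_mz, total))
            run.clear()

    for m, i in zip(mz, intensity):
        if i > threshold:
            run.append((m, i))
        else:
            close_run()
    close_run()  # handle a peak that reaches the end of the spectrum
    return peaks


# Hypothetical profile data: ten sampled points containing two peaks.
mz = [100.00, 100.01, 100.02, 100.03, 100.04,
      100.05, 100.06, 100.07, 100.08, 100.09]
intensity = [0, 10, 40, 10, 0, 0, 5, 20, 5, 0]

peaks = centroid(mz, intensity)
print(len(mz), "profile points ->", len(peaks), "centroids")
```

The ten profile points collapse to two centroid pairs, which is the simplification. The cost is that peak width and shape are discarded, so a shoulder from a partially resolved species would be merged into a single centroid under this rule.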

There is a big push, according to Anderson, to find ways to use data in a non-reduced way for more practical interpretations. One example involves modifying the way signals are acquired. In NMR, for instance, Anderson points to nonuniform sampling, which involves pulsing the sample at nonuniform frequencies and allows the reconstruction of spectra close to what would be obtained with full-frequency pulsing but in a smaller data file.

There is also a focus on the application of machine learning and artificial intelligence in this area, letting machines identify trends and make observations that humans could probably find but don’t have time to do, according to Anderson.



A multi-modal view

A theme underlying all of the above areas of concern is the need for software platforms that enable viewing of data generated by different instruments from different suppliers. “The systems we have today generate huge amounts of data. But these same systems are not necessarily capable of integrating these large data sets and providing researchers with actionable information in a multi-modal view,” Manolis states. While data generation and usability of technologies are improving with better user interfaces and easier-to-use systems and applications, the integration of data sets and connectivity between data sets from different technologies and vendors remains a challenge.

“Software analysis tools, rather than only gather data, should also be able to analyze and interpret that data and provide information as the output,” asserts Lucia Sousa, associate analytical chemist for R&D in the analytical chemistry development group at Hovione. “Vendors are creating software that interprets data, but even these advanced systems still require great effort from the user. Scientists in the laboratory should have knowledge in their fields of expertise and not need to be experts in software applications as well,” she adds. 

Throughout pharma discovery, development, and commercial manufacturing, data are acquired and stored on separate and disparate systems that contain both structured and unstructured information, according to Manolis. The challenge is further complicated by collaboration with outsourcing partners, suppliers, distributors, and government agencies that have their own disparate systems. Overcoming these challenges will dramatically improve researchers’ abilities to make faster decisions and will ultimately accelerate drug development and approval, she adds.

Analytical manufacturers are working with pharma companies and others to build software solutions that ease these processes. The Agilent data system OpenLab CDS ChemStation Edition, according to Manolis, is the first to support the new Allotrope Data Format (ADF), an emerging standard developed by a consortium of pharmaceutical companies.

“ADF standardizes the collection, exchange, and storage of analytical data captured in laboratory workflows and enables labs to transfer and share that data across platforms,” Manolis explains. “There needs to be continued focus on evolving customer requirements concerning analysis and integration of data sets, compliant solutions for analysis and storage, and the IT infrastructure challenges around large data sets, specifically those generated from high-resolution MS analyses,” she continues.

It is an exciting time with respect to advances in data analysis and interpretation, according to Anderson. “Instrument manufacturers and software vendors like ACD/Labs are facing a new paradigm. No longer are we dealing with the integration of monolithic systems through document exchange. Copying and pasting spectra into a word document isn’t good enough any longer. Integration of not only data, but applications is needed to provide a seamless user experience. Decision-making interfaces of the future will be interoperable and able to leverage a variety of heterogeneous and orthogonal data,” he explains.

The outlook for software solutions that address pharma customer needs is positive, and the space is fast-moving and competitive, agrees Manolis. “Broadly speaking, there are many organizations evaluating and using tools that were not designed specifically for pharma but perhaps for business or consumers, and these organizations have been adapting them to their needs. Moving forward, we see these boundaries continuing to blur as solutions are delivered from a variety of software vendors.”

Importance of data science experience

Pharmaceutical scientists will need to adapt as well. An understanding of data science is becoming essential. “This experience is key for analytical scientists to understand the data and challenge the data with the right questions and subsequent analyses,” Manolis says. “To avoid data silos and facilitate the sharing of data, the expertise gained by the data owner will be invaluable when developing ways to use existing information or to integrate additional data,” she adds.

Furthermore, implementing end-to-end data integration requires capabilities that include trusted sources of data and documents, robust quality, and maintenance of data integrity. “The objective is to tackle the most important data first to achieve a rapid return on investment. Identifying the right infrastructure will therefore be of paramount importance,” states Manolis.

Hovione also believes that data science will allow data-driven predictions from big data through modeling and statistical inference. “Data science will allow the extraction of meaningful information for decision making and consequently allow increased efficiency in pharma laboratories,” says Galesio. He does note, however, that many pharma researchers still lack expertise in this field. “Going forward it will be especially relevant to combine both data science and analytical expertise. We believe many companies are investing more and more in this area,” Galesio observes.

Many pharma organizations are building structures or have built structures in the past 5–10 years specifically supporting data science and analysis, according to Manolis. “We would assume this investment will continue as modalities and data sets continue to increase in complexity,” she concludes.



When referring to this article, please cite it as C. Challener, “Nurturing Knowledge from Disparate Data Streams,” BioPharm International 32 (2) 2019.


