Commentary|Events|June 24, 2026

Building a Strong Data Foundation for AI in Drug Manufacturing

AI models can deliver transformative insights into drug manufacturing, but only when fed complete, traceable, and representative datasets.

Manufacturing biologics is a complex, multistep process that relies on numerous instruments to support production, monitoring, and analysis. In many facilities, these systems operate independently, requiring laboratory personnel to manually collect, integrate, and interpret data from multiple sources.

Centralizing that information in a unified platform could help train AI tools to identify process deviations and support troubleshooting, potentially improving manufacturing efficiency and product quality. However, realizing those benefits depends on establishing a robust data foundation and implementing the technology effectively.

Bad data can lead to poor predictions. In a study of 300,000 retail employee shifts conducted by Harvard Business School, researchers found that 84% of the shifts recommended by an AI scheduling tool were ultimately overridden by managers.¹

Although the model was built using a large dataset, investigators later discovered that nearly 8% of the training data were inaccurate. As a result, the tool failed to consistently deliver reliable recommendations, eroding user trust and limiting adoption.

The experience demonstrates how data quality issues can undermine AI initiatives before they have an opportunity to prove their value. Gartner has similarly predicted that, through 2026, organizations will abandon 60% of AI projects that are not supported by AI-ready data.²

A central challenge arises from the diverse array of laboratory instrumentation. Many instruments rely on proprietary data platforms that create isolated data silos.

Some devices provide no automated capture at all, forcing scientists to rely on copying and pasting from Excel and CSV files or manually transcribing data. This fragmentation complicates compliance efforts, as organizations must still demonstrate where data originated, who generated it, and how it moved through internal systems.

Even when IT departments implement user authentication or machine-level controls, these measures primarily satisfy security requirements rather than scientific auditability. Capturing this data and applying a model or AI tool to it can help identify manufacturing patterns and process deviations that might otherwise go unnoticed until much later in production.

By improving process visibility and enabling earlier intervention, these tools have the potential to reduce risk and improve manufacturing outcomes. However, success depends on building a strong data foundation.

Key questions to consider include:

What defines success?
What kinds of data will you need?
How will you get the data?
How much data do you need?
How do you make sense of the data?

Best Practices in AI Use

Define Use Case, Success Criteria, and Data Requirements First

Start with a question that the data should answer. Decide which task the AI should perform (prediction, anomaly detection, or optimization), what error bounds are acceptable, and which regulatory constraints apply.

Next, map required signals. List every process variable, analytical measurement, and metadata field needed to answer the question.

Prioritize variables that domain experts identify as causal or confounding. Keep in mind, however, that some relevant variables may not yet be identified, so err on the side of capturing more data.

Inventory and Classify Data Sources

Create a data catalog that records instrument type, communication protocol (RS-232, TCP/IP, FTP, file share, etc.), sampling frequency, and owner.

Classify data into three tiers: process, analytical, and metadata. Treat metadata (method IDs, operator notes, instrument UID, site) as first-class data because it enables traceability and correct interpretation.
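A data catalog of this kind can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed schema: the class names, field names, and example sources are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tier(Enum):
    PROCESS = "process"
    ANALYTICAL = "analytical"
    METADATA = "metadata"


@dataclass(frozen=True)
class CatalogEntry:
    source_id: str                          # hypothetical internal identifier
    instrument_type: str
    protocol: str                           # e.g. "RS-232", "TCP/IP", "FTP", "file share"
    sampling_frequency_hz: Optional[float]  # None for event-driven or manual sources
    owner: str
    tier: Tier


# Example entries; every value here is made up for illustration.
catalog = [
    CatalogEntry("bioreactor-01", "bioreactor controller", "TCP/IP", 1.0,
                 "process engineering", Tier.PROCESS),
    CatalogEntry("hplc-02", "HPLC", "file share", None,
                 "analytical lab", Tier.ANALYTICAL),
    CatalogEntry("batch-notes", "operator notes", "file share", None,
                 "manufacturing", Tier.METADATA),
]


def by_tier(entries, tier):
    """Return the catalog entries that belong to one tier."""
    return [entry for entry in entries if entry.tier is tier]


for entry in by_tier(catalog, Tier.ANALYTICAL):
    print(entry.source_id, entry.protocol, entry.owner)
```

Keeping the tier as an explicit field makes it easy to check that metadata sources are cataloged alongside process and analytical sources rather than treated as an afterthought.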

Data Categories

There are many real-world data capture challenges across analytical instruments, process control systems, and laboratory workflows. Broadly speaking, manufacturing data falls into three categories: process data (PLC/SCADA/OPC tags), analytical data (instrument outputs from HPLCs, blood gas analyzers, etc.), and metadata (instrument UIDs, methods, operator observations).

Further complicating data integration, instruments that perform similar functions may communicate in entirely different ways. For example, two blood gas analyzers, such as the Siemens RapidPoint 500e and the Radiometer ABL90 FLEX PLUS, may generate comparable measurements but use different communication protocols or data transmission methods.

As a result, bringing data together into a unified format often requires significant integration effort.
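The integration effort can be pictured as one small parser per instrument, each emitting the same record shape. The sketch below assumes two invented payloads, a CSV export from "analyzer A" and a JSON message from "analyzer B"; they are not the actual output formats of any vendor's instrument, and all field names are hypothetical.

```python
import csv
import io
import json

# Hypothetical payloads for two analyzers that report comparable measurements.
ANALYZER_A_CSV = "sample,pH,pCO2_mmHg\nS-001,7.21,38.5\n"
ANALYZER_B_JSON = (
    '{"sampleId": "S-002", "results": '
    '{"ph": 7.18, "pco2": {"value": 5.2, "unit": "kPa"}}}'
)

KPA_TO_MMHG = 7.50062  # 1 kPa is approximately 7.50062 mmHg


def parse_analyzer_a(text):
    """Parse the CSV export into unified records."""
    rows = csv.DictReader(io.StringIO(text))
    return [
        {
            "sample_id": row["sample"],
            "ph": float(row["pH"]),
            "pco2_mmhg": float(row["pCO2_mmHg"]),
            "source": "analyzer_a",
        }
        for row in rows
    ]


def parse_analyzer_b(text):
    """Parse the JSON message into unified records, converting kPa to mmHg."""
    message = json.loads(text)
    pco2 = message["results"]["pco2"]
    value = pco2["value"] * KPA_TO_MMHG if pco2["unit"] == "kPa" else pco2["value"]
    return [
        {
            "sample_id": message["sampleId"],
            "ph": message["results"]["ph"],
            "pco2_mmhg": round(value, 2),
            "source": "analyzer_b",
        }
    ]


unified = parse_analyzer_a(ANALYZER_A_CSV) + parse_analyzer_b(ANALYZER_B_JSON)
for record in unified:
    print(record)
```

Downstream code then sees only the unified record shape, so adding a third instrument means writing one more parser rather than touching the analysis.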

Case Study

Coffee Roast Pilot (Analogy for Batch Processes)

The pilot used coffee roasting and brewing as an analogy for biologics batch processes to demonstrate multivariate analysis with AI. The goal was to understand which method led to the best brewed coffee. The coffee was characterized for aroma, taste, flavor, bitterness, and acidity, along with everything that described the beans themselves.

Three data types:

1. Process variables

Roaster temperature
Drum speed
Time

2. Analytical measures

Refractometer reading
pH

3. Metadata

Grind size
Water temperature
Perceived flavor

After 10 runs, the pilot showed that the batch count was too small: the statistical conclusions were underpowered, and the dataset had to be expanded. After 20 runs, a model reached statistical significance, which illustrates how much a successful model depends on adequate data.

Automate Data Capture at the Source

“NIH studies show that 3% to 9% of manually transcribed data is erroneous.”³⁻⁵

Automate data capture at the source whenever possible to avoid transcription errors and duplicate records. Many legacy capture methods, including manual copy and paste, flash drives, and serial terminals, introduce risk and data loss.

Use standardized drivers and connectors that support automated uploads and common communication protocols; this reduces duplication and simplifies onboarding of heterogeneous instruments.

Retain Failed Runs and Out-of-Specification Data

Do not discard failed runs or out-of-specification results. AI models trained only on successful experiments cannot learn to recognize failure modes or process deviations, and may hallucinate plausible but incorrect values.

Including the full range of Design of Experiments (DOE) conditions, failed experiments among them, helps create more representative models and reduces the risk of biased inferences. If data must be excluded, such as in cases of instrument malfunction, document the reason and store that annotation as metadata.
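One way to put this into practice is to flag exclusions rather than delete records. The following is a minimal sketch under that assumption; the record fields and run IDs are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunRecord:
    run_id: str
    outcome: str                      # "pass", "fail", or "out_of_spec"
    excluded: bool = False
    exclusion_reason: Optional[str] = None


def exclude(run, reason):
    """Flag a run as excluded. The record itself is kept, never deleted."""
    if not reason.strip():
        raise ValueError("an exclusion must carry a documented reason")
    run.excluded = True
    run.exclusion_reason = reason
    return run


def training_set(runs):
    """Keep passing, failed, and out-of-spec runs; drop only annotated exclusions."""
    return [run for run in runs if not run.excluded]


runs = [
    RunRecord("R-01", "pass"),
    RunRecord("R-02", "fail"),
    RunRecord("R-03", "out_of_spec"),
    RunRecord("R-04", "fail"),
]
exclude(runs[3], "instrument malfunction: pH probe drift")

selected = training_set(runs)
print([run.run_id for run in selected])
```

Note that failed and out-of-specification runs stay in the training set; only the run with a documented instrument malfunction is left out, and its record and reason remain available for review.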

Ensure Data Provenance, Audit Trails, and GLP/GMP Alignment

Capture key contextual information alongside the data, including timestamps, instrument unique identifiers, method versions, operator information, and transfer logs. This information, often referred to as “data provenance,” supports reproducibility and regulatory review.

As organizations adopt more digital workflows, regulators are placing greater emphasis on audit trails and compliance with GxP requirements. For critical datasets used in model development, organizations should also verify data integrity through mechanisms such as checksums, hashes, or signed logs.
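As a rough illustration of the checksum idea, the sketch below stores a SHA-256 digest next to the contextual fields at capture time and recomputes it later to confirm the payload is unchanged. It uses only the Python standard library; the record layout and the example identifiers are assumptions made for illustration.

```python
import hashlib
from datetime import datetime, timezone


def sha256_hex(payload: bytes) -> str:
    """Return the SHA-256 digest of the raw bytes as a hex string."""
    return hashlib.sha256(payload).hexdigest()


def provenance_record(payload, instrument_uid, method_version, operator):
    """Bundle the contextual fields with a checksum of the captured data."""
    return {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "instrument_uid": instrument_uid,
        "method_version": method_version,
        "operator": operator,
        "sha256": sha256_hex(payload),
    }


def verify(payload, record):
    """Check that the payload still matches the checksum taken at capture."""
    return sha256_hex(payload) == record["sha256"]


raw = b"sample,pH\nS-001,7.21\n"  # hypothetical instrument export
record = provenance_record(raw, "HPLC-0042", "M-12 v3", "operator-17")

print(verify(raw, record))          # unchanged payload
print(verify(raw + b" ", record))   # payload altered after capture
```

A checksum only detects change; it says nothing about who made it, which is why it is paired here with operator, instrument, and method fields.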

Regulatory Considerations

Regulatory expectations in scientific, clinical, and manufacturing environments continue to evolve as organizations transition from paper-based workflows to fully digital ecosystems. Under the broader GxP framework, encompassing GLP, GCP, and GMP, regulators increasingly require organizations to demonstrate complete traceability, integrity, and provenance of all scientific data.

The middle letter identifies the domain: L for laboratory, C for clinical, and M for manufacturing. The frameworks vary in strictness, with GLP generally less stringent than GCP and GMP. Nevertheless, across all domains, regulators now scrutinize not only the final data but also the processes, systems, and controls that generate and transport it.

Depending on whether GxP or non-GxP requirements apply, data capture may need to comply with 21 CFR Part 11, which requires demonstrable traceability, immutability, and attribution. Audit files should be tamper-proof, so that even internal personnel cannot alter historical records. This design supports trust in the data lifecycle and aligns with regulatory requirements.
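One common building block for such audit trails is a hash chain, in which every entry embeds the hash of the entry before it. The sketch below is an assumption about how this could look, not a description of any particular product and not a complete Part 11 implementation; a hash chain makes alterations detectable rather than impossible, so it still needs access controls and protected storage around it.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body):
    """Hash an entry body in a canonical (sorted-key) JSON form."""
    canonical = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def append_entry(log, user, action, timestamp):
    """Append an entry whose hash covers its content and the previous hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = {"user": user, "action": action,
            "timestamp": timestamp, "prev_hash": prev_hash}
    log.append({**body, "hash": _digest(body)})


def chain_is_intact(log):
    """Recompute every hash and link; any edited entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev_hash"] != prev_hash or _digest(body) != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True


audit_log = []
append_entry(audit_log, "operator-17", "uploaded result S-001", "2026-06-24T09:00:00Z")
append_entry(audit_log, "reviewer-03", "approved result S-001", "2026-06-24T09:30:00Z")
print(chain_is_intact(audit_log))

# Simulate an after-the-fact edit to a historical record.
tampered = [dict(entry) for entry in audit_log]
tampered[0]["action"] = "deleted result S-001"
print(chain_is_intact(tampered))
```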

Collaborate Closely with IT and Cybersecurity

Engage IT early; the success of a digital transformation project depends on them. Opening ports, setting firewall rules, and granting access to process networks all require IT coordination.

Where possible, negotiate one-way data flows between segmented networks to reduce risk. IT departments are the gatekeepers, and working with them is essential to avoid outages and security incidents.

Demonstrate to IT that the process will be secure: it should avoid insecure transfer methods (such as uncontrolled USB drives) and rely on secure, auditable transfer mechanisms that have no capability to alter the data.

Pilot, Quantify, and Iterate

Run pilots to estimate sample size. In the coffee example, at least 20 runs were needed to reach statistical significance. Use pilot studies to determine the number of batches, replicates, and parameter ranges needed.
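A pilot's variability can be turned into a rough run-count estimate with a standard power calculation. The sketch below uses the textbook normal-approximation formula for comparing two conditions, n = 2((z_(1-α/2) + z_power)σ/δ)², which is not necessarily the method used in the coffee pilot. The pilot scores and the target difference `delta` are hypothetical.

```python
import math
from statistics import NormalDist, stdev


def runs_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Approximate runs needed per condition to detect a mean difference `delta`.

    sigma: standard deviation of the response, estimated from pilot runs
    delta: smallest difference between two conditions worth detecting
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)


# Hypothetical pilot scores (not data from the coffee study).
pilot_scores = [6.8, 7.4, 6.1, 7.9, 7.0, 6.5, 7.7, 6.9, 7.2, 6.4]
sigma = stdev(pilot_scores)

print(f"pilot standard deviation: {sigma:.2f}")
print("runs per condition:", runs_per_group(sigma, delta=0.5))
```

Because the required count grows with the square of σ/δ, halving the difference to be detected roughly quadruples the number of runs, which is why a small pilot can look adequate and still be underpowered.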

Measure data quality metrics (missingness, duplication rate, transcription error rate) and set thresholds before scaling.
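Two of these metrics, missingness and duplication rate, can be computed directly from captured records. The sketch below is one possible formulation; the threshold values and record fields are hypothetical. Transcription error rate is left out because measuring it requires comparing entries against a trusted source.

```python
def missingness(records, fields):
    """Fraction of expected field values that are absent or empty."""
    total = len(records) * len(fields)
    missing = sum(
        1 for record in records for field in fields
        if record.get(field) in (None, "")
    )
    return missing / total if total else 0.0


def duplication_rate(records, key_fields):
    """Fraction of records that repeat an already-seen key."""
    keys = [tuple(record.get(field) for field in key_fields) for record in records]
    return 1 - len(set(keys)) / len(keys) if keys else 0.0


# Hypothetical acceptance limits, to be agreed before scaling up.
THRESHOLDS = {"missingness": 0.05, "duplication_rate": 0.01}

records = [
    {"sample_id": "S-001", "ph": 7.21, "operator": "op-17"},
    {"sample_id": "S-002", "ph": None, "operator": "op-17"},
    {"sample_id": "S-002", "ph": None, "operator": "op-17"},
    {"sample_id": "S-003", "ph": 7.30, "operator": ""},
]

metrics = {
    "missingness": missingness(records, ["sample_id", "ph", "operator"]),
    "duplication_rate": duplication_rate(records, ["sample_id"]),
}
failures = [name for name, value in metrics.items() if value > THRESHOLDS[name]]

print(metrics)
print("metrics over threshold:", failures)
```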

Standardize Formats and Units

Normalize units and method identifiers at ingestion. Record whether each value was measured by volume (typical at lab scale), measured by mass (typical at larger scales), or adjusted, and convert to canonical units before transfer.

Standardizing on mass for product additions keeps values consistent across scales. Adopt a controlled vocabulary for method names, sample IDs, and site identifiers to avoid semantic mismatches.
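A normalization step at ingestion might look like the sketch below. It assumes grams as the canonical unit for additions and a known density for volume-to-mass conversion; the unit tables, vocabulary entries, and density value are hypothetical.

```python
VOLUME_TO_ML = {"mL": 1.0, "L": 1000.0}
MASS_TO_G = {"mg": 0.001, "g": 1.0, "kg": 1000.0}

# Hypothetical controlled vocabulary: free-text method names -> canonical IDs.
METHOD_VOCABULARY = {
    "ph meas": "PH_MEASUREMENT",
    "ph measurement": "PH_MEASUREMENT",
    "hplc purity": "HPLC_PURITY",
}


def normalize_addition(value, unit, density_g_per_ml=None):
    """Convert an addition to grams and record how it was originally measured."""
    if unit in MASS_TO_G:
        grams, basis = value * MASS_TO_G[unit], "mass"
    elif unit in VOLUME_TO_ML:
        if density_g_per_ml is None:
            raise ValueError("density is required to convert volume to mass")
        grams, basis = value * VOLUME_TO_ML[unit] * density_g_per_ml, "volume"
    else:
        raise ValueError(f"unknown unit: {unit!r}")
    return {"value_g": grams, "measured_by": basis,
            "original_value": value, "original_unit": unit}


def canonical_method(name):
    """Map a free-text method name to its controlled-vocabulary ID."""
    key = " ".join(name.lower().split())
    if key not in METHOD_VOCABULARY:
        raise KeyError(f"method name not in controlled vocabulary: {name!r}")
    return METHOD_VOCABULARY[key]


lab_scale = normalize_addition(50, "mL", density_g_per_ml=1.02)
plant_scale = normalize_addition(2, "kg")

print(lab_scale)
print(plant_scale)
print(canonical_method("  pH   Meas "))
```

Rejecting unknown units and method names outright, rather than passing them through, surfaces semantic mismatches at ingestion instead of during model training.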

Conclusion

People and process matter as much as technology. Successful data programs require domain experts, data engineers, and IT to collaborate and agree on priorities.

Data engineering, instrument drivers, secure connectors, and validation are non-trivial investments, but budgeting for them is essential to avoid downstream model failures. The regulatory landscape demands unified, tamper-proof, and fully traceable scientific data systems.

For models that will inform regulated decisions, design data capture with auditability and traceability from day one. AI models can deliver transformative insights into drug manufacturing, but only when fed complete, traceable, and representative datasets.

Teams should define use cases and success metrics first, automate and standardize data capture, preserve failed and boundary data, and work closely with IT to ensure secure, auditable pipelines. Iterative pilots will reveal the true data volume and quality requirements needed for robust models.

About the Author

Brendan Lucey, Chief Commercial Officer, Kivi Bio.

References

1. Rand B. Bad data, bad results: when AI struggles to create staff schedules. Harvard Business School Working Knowledge. February 27, 2025. Accessed June 24, 2026. https://www.library.hbs.edu/working-knowledge/bad-data-bad-results-when-ai-struggles-to-create-staff-schedules
2. Edjlali R. Lack of AI-ready data puts AI projects at risk. Gartner Newsroom. February 26, 2025. Accessed June 24, 2026. https://www.gartner.com/en/newsroom/press-releases/2025-02-26-lack-of-ai-ready-data-puts-ai-projects-at-risk
3. Feng JE, Anoushiravani AA, Tesoriero PJ, Ani L, Meftah M, Schwarzkopf R, Leucht P. Transcription error rates in retrospective chart reviews. Orthopedics. 2020;43(5):e404–e408. doi:10.3928/01477447-20200619-10
4. Mays JA, Mathias PC. Measuring the rate of manual transcription error in outpatient point-of-care testing. J Am Med Inform Assoc. 2019;26(3):269–272. doi:10.1093/jamia/ocy170
5. Wilkinson MD, Dumontier M, Aalbersberg IJ, et al. The FAIR guiding principles for scientific data management and stewardship. Sci Data. 2016;3:160018. doi:10.1038/sdata.2016.18