Knowledge Management Strategies for Biologics Research

The sheer volume of research and drug discovery data can thwart efforts at integrating research knowledge. An automated database system with useful algorithms could help solve this problem.
May 01, 2005
Volume 18, Issue 5

Christopher Larsen, Ph.D.
Information and Knowledge Management (KM) are crucial components in any life sciences research and development strategy. Without an effective information management strategy, there is little point in doing biological research. This article focuses on the technology and guidance required to achieve good KM. We contend that the literature represents the largest source of information to be reaped, and also that a variety of software tools are needed to successfully integrate information from all relevant sources.

Provided here is an outline for establishing good KM practice, with discussions of how to integrate heterogeneous information concerning proteins, complexes, genes, compounds, and their interactions, modifications, mutations, and expression. Potential pitfalls of these strategies are presented, along with companies offering products that address them. Some of this is forward-looking, because current software suites cover only a fraction of the available data.

THE NEED FOR CONTENT INTEGRATION

The amount of information available to a life sciences researcher is extremely large, dynamic, and rapidly growing. The scientific literature alone is expanding at an astronomical rate, and advances in high-throughput technologies have significantly increased the amount of information that researchers must assess. They are faced with the daunting challenge of efficiently identifying, managing, and analyzing relevant scientific information from both public and proprietary sources.

Our scientific goal as informaticists is a complete conceptual unification of content: data, information, and knowledge. This content is heterogeneous in both format and origin. It is found in the literature, industrial assay output, research abstracts, conference meetings, and personal communications. It is disseminated in various formats including raw text, email, web documents, traditional literature, patents, notebooks, spreadsheets, relational databases, marked-up text (XML/HTML), tables, and lists. These sources may address overlapping subsets of human macromolecules and metabolites, and they may not exhibit much formatting regularity.

Financially, there is good reason to use effective informatics. A successful informatics approach would keep data from being redundantly generated, make precise content searches possible, and allow rapid exchange of ideas in a research environment. One report from the Boston Consulting Group suggests that $200 million and two years could be shaved off a drug's development cost and timeline by using informatics effectively.1 Other reports from Acumen mirror this conclusion, suggesting that 20 to 40 percent of total time spent on a typical proteomics project is wasted on searching for appropriate information.2

REQUIREMENTS FOR A SUCCESSFUL KM APPROACH

Given these problems, what must we do to successfully integrate a host of unrelated knowledge sources? At least three prerequisites are needed to unite heterogeneous research data: 1) standardized vocabularies, ontologies, and systematic nomenclatures for describing biomedical research knowledge, 2) a relational database platform capable of accepting all the data, and 3) import tools such as scripts, application programming interfaces (APIs), natural language processing (NLP), and data loaders that identify, capture, and integrate the data.

The resulting solution will help researchers share, access, import, standardize, write to, and link to data. Data cannot be well organized, searchable, and manipulatable if it exists in layers and formats without a common underlying structure.
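As a rough illustration of the second and third prerequisites, the sketch below stores facts from different sources under one canonical entity name in a relational database. It is a minimal sketch, not a recommended design: the table names, column names, and the helper functions `build_store`, `load_fact`, and `facts_for` are all hypothetical, and the statements loaded in the usage example are placeholders rather than real findings.

```python
import sqlite3

def build_store():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE entity (
            id INTEGER PRIMARY KEY,
            canonical_name TEXT UNIQUE NOT NULL,  -- one row per molecule
            kind TEXT NOT NULL                    -- e.g. protein, gene, compound
        );
        CREATE TABLE fact (
            id INTEGER PRIMARY KEY,
            entity_id INTEGER NOT NULL REFERENCES entity(id),
            source TEXT NOT NULL,                 -- e.g. literature, assay, patent
            statement TEXT NOT NULL
        );
    """)
    return conn

def load_fact(conn, name, kind, source, statement):
    """Import one fact, reusing the entity row if the name is already known."""
    conn.execute(
        "INSERT OR IGNORE INTO entity (canonical_name, kind) VALUES (?, ?)",
        (name, kind),
    )
    (entity_id,) = conn.execute(
        "SELECT id FROM entity WHERE canonical_name = ?", (name,)
    ).fetchone()
    conn.execute(
        "INSERT INTO fact (entity_id, source, statement) VALUES (?, ?, ?)",
        (entity_id, source, statement),
    )

def facts_for(conn, name):
    """Return (source, statement) pairs for one entity, in load order."""
    rows = conn.execute(
        "SELECT f.source, f.statement FROM fact f "
        "JOIN entity e ON e.id = f.entity_id "
        "WHERE e.canonical_name = ? ORDER BY f.id",
        (name,),
    )
    return rows.fetchall()

# Usage with placeholder statements (not real data).
conn = build_store()
load_fact(conn, "SUMO-1", "protein", "literature", "placeholder statement A")
load_fact(conn, "SUMO-1", "protein", "assay", "placeholder statement B")
print(facts_for(conn, "SUMO-1"))
```

Because both facts point at the same entity row, a single query retrieves content that originated in different formats, which is the kind of common underlying structure the paragraph above calls for.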

Standard vocabularies are needed so the user can search, navigate content, or use higher-level functions and metadata. For example, "rough ER" and "smooth ER" are both terms for endoplasmic reticulum, but the two must be kept conceptually separate, and a parent-child relationship needs to be built into the storage system ("ER" is the parent; "rough ER" and "smooth ER" are the children). The biologist understands the large difference between the functions of the two ER subtypes; the computer does not.
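The parent-child relationship described above could be represented, in a minimal sketch, as a mapping from each child term to its parent. The `PARENT` table and the helper names `ancestors`, `descendants`, and `expand_query` are hypothetical and only illustrate the idea; a real ontology would hold many more terms and relationship types.

```python
# Hypothetical mini-ontology: each child term maps to its parent term.
PARENT = {
    "rough ER": "ER",
    "smooth ER": "ER",
}

def ancestors(term):
    """Return the chain of parent terms above `term`, nearest first."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain

def descendants(term):
    """Return every term that has `term` somewhere among its ancestors."""
    return sorted(t for t in PARENT if term in ancestors(t))

def expand_query(term):
    """Expand a search term so that a parent also matches its children."""
    return [term] + descendants(term)

print(expand_query("ER"))        # parent term plus both subtypes
print(expand_query("rough ER"))  # a subtype stays distinct from its sibling
```

With this structure, a search for "ER" can retrieve records filed under either subtype, while a search for "rough ER" does not pull in "smooth ER" records.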

In relational databases that accept such data, search terms also need to cross-reference molecular synonyms; otherwise, data access becomes fragmented, with records for the same molecule scattered under different names. SwissProt is an excellent source of protein synonyms.3 By uniting different protein names such as Sentrin, SUMO, SUMO_1, SUMO-1, ySUMO, and others, database searches and content organization can be complete. These vocabularies are also necessary to mine unstructured literature data.
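One way to unite such names, sketched here under stated assumptions, is to map every known alias to a single canonical name after stripping case and punctuation differences. The choice of "SUMO-1" as the canonical form, the `SYNONYMS` table, and the `canonical_name` helper are hypothetical; in practice the alias list would be drawn from a curated source rather than typed by hand.

```python
import re

# Hypothetical synonym table: canonical name -> known aliases.
# The aliases are the ones listed in the text; picking "SUMO-1"
# as the canonical form is an assumption made for illustration.
SYNONYMS = {
    "SUMO-1": ["Sentrin", "SUMO", "SUMO_1", "SUMO-1", "ySUMO"],
}

def _key(name):
    """Lowercase and drop non-alphanumerics so 'SUMO_1' and 'SUMO-1' collide."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

# Reverse index from normalized alias to canonical name.
LOOKUP = {
    _key(alias): canonical
    for canonical, aliases in SYNONYMS.items()
    for alias in aliases
}

def canonical_name(name):
    """Return the canonical name for any known alias, or None if unknown."""
    return LOOKUP.get(_key(name))

print(canonical_name("sumo_1"))   # SUMO-1
print(canonical_name("Sentrin"))  # SUMO-1
print(canonical_name("actin"))    # None (not in this small table)
```

Storing and querying records under the canonical name means a search on any one alias reaches the same set of records.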

NLP approaches can be created with standard vocabulary syntax parsers that mine the literature with high speed and recall. Also, collection and archiving methods will fail without a strong vocabulary set.
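To show how a vocabulary drives literature mining, the sketch below tags known terms in a sentence by simple dictionary matching. This is a deliberately simple stand-in for a syntax parser, not an NLP system: the `VOCABULARY` table, the `tag_terms` function, and the example sentence are all hypothetical.

```python
import re

# Hypothetical vocabulary: surface form -> canonical term.
VOCABULARY = {
    "Sentrin": "SUMO-1",
    "SUMO-1": "SUMO-1",
    "SUMO": "SUMO-1",
    "rough ER": "rough ER",
    "smooth ER": "smooth ER",
}

# Try longer terms first so "SUMO-1" is matched before the shorter "SUMO".
_terms = sorted(VOCABULARY, key=len, reverse=True)
_pattern = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(t) for t in _terms) + r")(?!\w)",
    re.IGNORECASE,
)
_by_lower = {term.lower(): canon for term, canon in VOCABULARY.items()}

def tag_terms(text):
    """Return (matched text, canonical term) pairs in order of appearance."""
    return [
        (m.group(1), _by_lower[m.group(1).lower()])
        for m in _pattern.finditer(text)
    ]

# Made-up sentence used only to exercise the matcher.
sentence = "Sentrin, also called SUMO-1, was observed near the rough ER."
print(tag_terms(sentence))
```

A term missing from the vocabulary is never tagged, which is one concrete sense in which collection methods depend on a strong vocabulary set.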
