Information and Knowledge Management (KM) are crucial components in any life sciences research and development strategy.
Without an effective information management strategy, there is little point in doing biological research. This article focuses
on the technology and guidance required to achieve good KM. We contend that the literature is the largest source of
information to be mined, and that a variety of software tools is needed to integrate information from all relevant
sources.
Christopher Larsen, Ph.D.
Provided here is an outline for establishing good KM practice, with discussions of how to integrate heterogeneous information
concerning proteins, complexes, genes, compounds, and their interactions, modifications, mutations, and expression. Potential
pitfalls of these strategies are presented, along with companies whose products address them. Some of the discussion is
forward-looking, because current software suites cover only a fraction of the available data.
THE NEED FOR CONTENT INTEGRATION
The amount of information available to a life sciences researcher is extremely large, dynamic, and rapidly growing. The scientific
literature alone is expanding at an astronomical rate, and advances in high-throughput technologies have significantly increased
the amount of information that researchers must assess. They are faced with the daunting challenge of efficiently identifying,
managing, and analyzing relevant scientific information from both public and proprietary sources.
Our scientific goal as informaticists is a complete conceptual unification of content: data, information, and knowledge.
This content is heterogeneous in both format and origin. It is found in the literature, industrial assay output, research
abstracts, conference meetings, and personal communications. It is disseminated in various formats including raw text, email, web documents,
traditional literature, patents, notebooks, spreadsheets, relational databases, marked-up text (XML/HTML), tables, and lists.
It may address overlapping subsets of human macromolecules and metabolites and may not exhibit much formatting regularity.
Financially, there is good reason to use effective informatics. A successful informatics approach would prevent redundant
data generation, make precise content searches possible, and allow rapid exchange of ideas in a research environment.
One report from the Boston Consulting Group suggests that using informatics effectively could cut $200 million and two years
from a drug's development.1 Other reports from Acumen mirror this conclusion, suggesting that 20 to 40 percent of the total
time spent on a typical proteomics project is wasted on searching for appropriate information.2
REQUIREMENTS FOR A SUCCESSFUL KM APPROACH
Given these problems, what must we do to successfully integrate such disparate knowledge? At least three prerequisites
are needed to unite heterogeneous research data: 1) standardized vocabularies, ontologies, and systematic nomenclatures for
describing biomedical research knowledge, 2) a relational database platform capable of accepting all the data, and 3) import
tools such as scripts, application programming interfaces (APIs), natural language processing (NLP), and data loaders that
identify, capture, and integrate the data.
The resulting solution will help researchers share, access, import, standardize, write to, and link to data. Data cannot be
well organized, searchable, or manipulable if they exist in layers and formats without a common underlying structure.
Standard vocabularies are needed so that users can search, navigate content, and use higher-level functions or metadata.
For example, the endoplasmic reticulum (ER) has two subtypes, rough ER and smooth ER, that need to be separated conceptually,
and a parent-child relationship needs to be built into the storage system for them ("ER" is the parent, "rough ER"
and "smooth ER" the children). The biologist understands the large difference between the functions of the two; the computer does not.
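The parent-child relationship described above can be sketched in a few lines of Python. This is a minimal illustration only, assuming a hypothetical in-memory vocabulary; the names PARENT, descendants, and is_a are invented for this example and do not come from any particular product or ontology.

```python
# Hypothetical controlled vocabulary: each term maps to its parent term.
# A parent of None marks the top of the hierarchy.
PARENT = {
    "ER": None,
    "rough ER": "ER",
    "smooth ER": "ER",
}


def descendants(term, parent=PARENT):
    """Return the term followed by every term below it in the hierarchy."""
    found = [term]
    for child, child_parent in parent.items():
        if child_parent == term:
            found.extend(descendants(child, parent))
    return found


def is_a(term, ancestor, parent=PARENT):
    """Return True if term is the ancestor itself or lies beneath it."""
    while term is not None:
        if term == ancestor:
            return True
        term = parent.get(term)
    return False


# A query for the parent term can be expanded to include both subtypes,
# while the two subtypes remain distinct from each other.
expanded_query = descendants("ER")
```

With such a structure, a search for "ER" can be expanded to retrieve records annotated with either subtype, while a search for "rough ER" does not return "smooth ER" records.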
In relational databases that accept such data, search terms also need to cross-reference molecular synonyms; otherwise the
same molecule is retrieved under one name and missed under another. SwissProt is an excellent source of protein synonyms.3
Uniting different protein names such as Sentrin, SUMO, SUMO_1, SUMO-1, ySUMO, and others allows database searches and content
organization to be complete. These vocabularies are also necessary to mine unstructured literature data.
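One way to picture synonym cross-referencing is a lookup table that maps every known name to a single canonical label. The sketch below is an assumption-laden illustration: the table is hand-written from the names listed above rather than loaded from SwissProt, the choice of "SUMO-1" as the canonical label is arbitrary, and the function names are hypothetical.

```python
# Hand-written synonym table for illustration; a real system would load
# synonyms from a curated source. The canonical label is an arbitrary choice.
SYNONYMS = {
    "SUMO-1": ["Sentrin", "SUMO", "SUMO_1", "SUMO-1", "ySUMO"],
}


def build_index(synonyms):
    """Map each lowercased synonym to its canonical label."""
    index = {}
    for canonical, names in synonyms.items():
        for name in names:
            index[name.lower()] = canonical
    return index


INDEX = build_index(SYNONYMS)


def canonical_name(query, index=INDEX):
    """Return the canonical label for a query, or None if it is unknown."""
    return index.get(query.strip().lower())
```

A search layer could call canonical_name on the user's query and on each stored record name, so that a search for "Sentrin" also finds records filed under "SUMO_1".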
NLP approaches can be built on parsers that use these standard vocabularies to mine the literature with high speed and recall.
Collection and archiving methods will likewise fail without a strong vocabulary set.
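The simplest vocabulary-driven form of literature mining is dictionary matching: scanning text for any known name and reporting its canonical label. The sketch below shows only that step, not a full NLP parser. The vocabulary, the canonical label "SUMO-1", the sample sentence, and the function name tag_terms are all assumptions made for illustration.

```python
import re

# Hypothetical vocabulary: lowercased synonym -> canonical label.
VOCAB = {
    "sentrin": "SUMO-1",
    "sumo": "SUMO-1",
    "sumo_1": "SUMO-1",
    "sumo-1": "SUMO-1",
    "ysumo": "SUMO-1",
}


def tag_terms(text, vocab=VOCAB):
    """Return (matched text, canonical label) pairs found in the text."""
    # Try longer names first so "SUMO-1" is not matched as just "SUMO".
    names = sorted(vocab, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in names)
    # The lookarounds prevent matches inside longer words.
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)
    return [(m.group(0), vocab[m.group(0).lower()]) for m in pattern.finditer(text)]


# Invented sample sentence, used only to exercise the function.
sample = "Sentrin (also called SUMO-1) was detected in the sample."
hits = tag_terms(sample)
```

Because every hit carries a canonical label, mentions that use different names in different papers can be collected under one entry, which is what makes the vocabulary set a prerequisite for archiving as well as for mining.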