ECCB

Sessions

Saturday
October 1, 2005

AREA 10: TEXT MINING
Area chairs: Christian Blaschke and Lynette Hirschman

OP-43 Probabilistic Model for Mining Implicit "Chemical Compound - Gene" Relations from Literature
Shanfeng Zhu (1), Yasushi Okuno (2), Gozoh Tsujimoto (2), Hiroshi Mamitsuka (1)
1) Bioinformatics Center, Institute for Chemical Research, Kyoto University, 2) Graduate School of Pharmaceutical Sciences, Kyoto University

ABSTRACT:
Motivation: Chemical compounds have received growing emphasis in molecular biology, and "chemical genomics" has attracted a great deal of attention in recent years. An important issue in current molecular biology is therefore to identify biologically related chemical compounds (more specifically, drugs) and genes. Co-occurrence of biological entities in the literature is a simple, comprehensive and popular technique for finding associations between these entities. Our focus is to mine implicit "chemical compound and gene" relations from co-occurrence in the literature.
Results: We propose a probabilistic model, called the mixture aspect model (MAM), and an algorithm for estimating its parameters, to efficiently handle different types of co-occurrence datasets at once. We examined the performance of our approach not only by cross-validation using data generated from MEDLINE records but also by a test using an independent human-curated dataset of relationships between chemical compounds and genes in the ChEBI database. We ran experiments on three different types of co-occurrence datasets (i.e. compound-gene, gene-gene and compound-compound co-occurrences) in both cases. The experimental results show that MAM trained on all datasets outperformed simple models trained on other combinations of datasets, with the difference being statistically significant in all cases. In particular, we found that incorporating compound-compound co-occurrences is the most effective in improving predictive performance. Finally, we computed the likelihoods of all unknown compound-gene (more specifically, drug-gene) pairs using our approach, selected the top 20 pairs according to their likelihoods, and validated them from biological, medical and pharmaceutical viewpoints.
CONTACT: zhusf@kuicr.kyoto-u.ac.jp
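
As a rough illustration of the literature co-occurrence counts that this abstract takes as input, the following Python sketch counts how many records mention a given compound together with a given gene. It does not implement the mixture aspect model itself, whose details are not given here. The function name, the whitespace tokenization, and the placeholder entity names (`compound_a`, `gene_1`, and so on) are all assumptions made for the example, and the records are invented toy strings, not MEDLINE data.

```python
from collections import Counter
from itertools import product


def cooccurrence_counts(records, compounds, genes):
    """Count, for each (compound, gene) pair, the records mentioning both.

    Hypothetical helper: entity recognition is reduced to exact matching
    of lowercase, whitespace-separated tokens, which is an assumption.
    """
    counts = Counter()
    for text in records:
        tokens = set(text.lower().split())
        found_compounds = sorted(c for c in compounds if c in tokens)
        found_genes = sorted(g for g in genes if g in tokens)
        # Each record contributes at most one count per pair.
        for pair in product(found_compounds, found_genes):
            counts[pair] += 1
    return counts


# Invented toy records with placeholder entity names.
records = [
    "compound_a binds gene_1",
    "gene_1 and gene_2 are co-expressed",
    "compound_a and compound_b both affect gene_1",
]
compounds = {"compound_a", "compound_b"}
genes = {"gene_1", "gene_2"}

counts = cooccurrence_counts(records, compounds, genes)
```

The same function could be called with two gene sets or two compound sets to obtain the gene-gene and compound-compound counts mentioned above, although pairing an entity with itself would then need to be filtered out.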

OP-44 Implementing the iHOP Concept for Navigation of Biomedical Literature
Robert Hoffmann (1), Alfonso Valencia (1)
1) National Center of Biotechnology

ABSTRACT:
Motivation: The World Wide Web has profoundly changed the way in which we access information. Searching the internet is easy and fast, but more importantly, the interconnection of related content makes it intuitive and closer to the associative organisation of human memory. However, the information retrieval tools currently available to researchers in biology and medicine lag far behind the possibilities that the layman has come to expect from the internet.
Results: By using genes and proteins as hyperlinks between sentences and abstracts, the information in PubMed can be converted into one navigable resource. iHOP (Information Hyperlinked over Proteins) is an online service that provides this gene-guided network as a natural way of accessing millions of PubMed abstracts and brings all the advantages of the internet to scientific literature research. Navigating across interrelated sentences within this network is closer to human intuition than the use of conventional keyword searches, and allows for stepwise and controlled acquisition of information. Moreover, this literature network can be superimposed upon experimental interaction data to facilitate the simultaneous analysis of novel and existing knowledge. The network presented in iHOP currently contains 5 million sentences and 40,000 genes from Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, Danio rerio, Arabidopsis thaliana, Saccharomyces cerevisiae, and Escherichia coli.
Availability: iHOP is publicly accessible at http://www.pdg.cnb.uam.es/UniPub/iHOP/
CONTACT: hoffmann@cnb.uam.es
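
The idea of using genes as hyperlinks between sentences can be sketched as a pair of lookup tables: one from each gene to the sentences that mention it, and one from each sentence to its genes. The Python below is a minimal sketch under that assumption and is not the iHOP implementation. The function names `build_network` and `hop`, the exact-token matching, and the placeholder gene names are hypothetical, and the sentences are invented.

```python
from collections import defaultdict


def build_network(sentences, gene_names):
    """Index sentences by the genes they mention (exact token match).

    gene_names must be a set of lowercase names; real gene recognition
    is far harder than this and is assumed away here.
    """
    gene_to_sentences = defaultdict(list)
    sentence_to_genes = {}
    for sent_id, sentence in enumerate(sentences):
        mentioned = gene_names & set(sentence.lower().split())
        sentence_to_genes[sent_id] = mentioned
        for gene in mentioned:
            gene_to_sentences[gene].append(sent_id)
    return gene_to_sentences, sentence_to_genes


def hop(gene, gene_to_sentences, sentence_to_genes):
    """Return the genes reachable from `gene` through a shared sentence."""
    linked = set()
    for sent_id in gene_to_sentences.get(gene, []):
        linked |= sentence_to_genes[sent_id]
    linked.discard(gene)
    return sorted(linked)


# Invented toy sentences with placeholder gene names.
sentences = [
    "gene_a activates gene_b",
    "gene_b represses gene_c",
    "gene_d is mentioned alone",
]
gene_names = {"gene_a", "gene_b", "gene_c", "gene_d"}

g2s, s2g = build_network(sentences, gene_names)
from_a = hop("gene_a", g2s, s2g)
from_b = hop("gene_b", g2s, s2g)
```

Repeated calls to `hop` walk the network one step at a time, which mirrors the stepwise navigation described in the abstract.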

OP-45 Expert knowledge without the expert: Integrated analysis of gene expression and the literature to derive active functional contexts
Robert Küffner (1), Ralf Zimmer (1)
1) LMU Munich

ABSTRACT:
The interpretation of expression data without appropriate expert knowledge is difficult and usually limited to exploratory data analysis such as clustering and detecting differentially regulated gene sets. On the other hand, comparing experimental results against manually compiled knowledge resources might limit or bias the perspective on the data. Thus, manual analysis by experts is required to obtain confident predictions about the involved biological processes. We present an algorithm to simultaneously derive interpretations of expression measurements together with biological hypotheses from biomedical publications. Gene clusters are selected to exhibit both a coherent gene expression pattern and a coherent literature profile. Manual intervention by an expert, whether in specifying prior knowledge or during the analysis procedure, is not required. The approach scales to realistic applications and does not rely on controlled vocabularies or pathway resources. We validated our algorithm by analyzing a current juvenile arthritis dataset. A number of gene clusters and accompanying concepts are identified as an interpretation of the data, and they coincide well with the phenotype and biological processes known to be involved in the disease. We demonstrate that the generated clusters are both more sensitive and more specific than GeneOntology categories detected on the same data. The method allows for in-depth investigation of subsets of genes together with their associated concepts and publications.
CONTACT: Robert.Kueffner@bio.ifi.lmu.de
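
One way to picture a cluster that is coherent in both expression and literature is to require a minimum average pairwise similarity in each view. The Python below is only a sketch of that idea and is not the authors' algorithm, which the abstract does not specify. The choice of Pearson correlation for expression, cosine similarity for term profiles, the thresholds `min_expr` and `min_lit`, and all names and toy values are assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cosine(u, v):
    """Cosine similarity of two sparse term-weight dictionaries."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv)


def mean_pairwise(genes, data, similarity):
    """Average similarity over all gene pairs (needs at least two genes)."""
    pairs = list(combinations(genes, 2))
    return sum(similarity(data[a], data[b]) for a, b in pairs) / len(pairs)


def is_coherent(genes, expression, literature, min_expr=0.8, min_lit=0.5):
    """Accept a gene set only if it is coherent in both views (assumed rule)."""
    return (mean_pairwise(genes, expression, pearson) >= min_expr
            and mean_pairwise(genes, literature, cosine) >= min_lit)


# Invented toy data: expression profiles and term-weight profiles.
expression = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [4, 3, 2, 1]}
literature = {
    "g1": {"term_x": 1.0, "term_y": 1.0},
    "g2": {"term_x": 1.0, "term_y": 1.0},
    "g3": {"term_z": 1.0},
}

accepted = is_coherent(["g1", "g2"], expression, literature)
rejected = is_coherent(["g1", "g3"], expression, literature)
```

Requiring both thresholds at once, rather than summing the two scores, keeps a strong signal in one view from masking incoherence in the other. That is a design choice of this sketch rather than something stated in the abstract.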
