Next Generation Microbial Phenomics (#323)
Biologists and non-biologists alike intuitively connect to the natural world and its underlying scientific principles through phenotypes. For centuries, microbiologists have composed detailed taxonomic descriptions of microorganisms that include phenomic data such as morphology, ecology, metabolism, and host-cell interactions. In combination with gene sequence and genomic analyses, phenomic information has been important for understanding the evolution of microbial traits, possible horizontal gene transfer events, and the co-evolution of host-associated microbes. These types of studies require large datasets. Obtaining large phylogenetic datasets has become relatively easy; however, pulling together the phenomic characters from the rich legacy of microbial literature is tedious. The goal of this project is to develop automated natural-language processing tools that assemble phenomic data matrices mined from legacy taxonomic texts. These matrices can then be used to map phenomic characters onto phylogenetic trees for explicit analysis of microbial trait evolution and for visualizing the microbial tree of life. CharaParser is a natural-language processing tool that was developed to analyze the text of phenomic descriptions and produce structured output, and it has been used successfully with plant and insect descriptions (Cui 2012). We tested CharaParser with microbial descriptions but found that these descriptions, which often include chemical terms and growth conditions, differ substantially from those of other taxa and were not recognized well by CharaParser. We also tested two other software tools, Stanford Parser and Open-Source Chemistry Analysis Routines (OSCAR), for incorporation into a new natural-language processing tool for microbial descriptions. To help develop a new algorithm, we are searching for microbial ontologies.
These approaches will be tested against existing hand-generated phenomic matrices (Blank and Sanchez-Baracaldo 2010) to assess their accuracy.
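The abstract does not say how accuracy will be measured. One plausible approach, sketched here purely as an illustration, is to treat each (taxon, character, state) cell as a unit and compute precision, recall, and F1 of the automatically extracted matrix against the hand-generated one. The function name `score_matrix`, the taxon labels, and the character names and states below are all hypothetical placeholders, not data from the cited study.

```python
def score_matrix(extracted, reference):
    """Cell-level precision, recall, and F1 of an extracted phenomic
    matrix against a hand-generated reference matrix.

    Both arguments map taxon -> {character: state}. This layout is an
    assumption made for the sketch, not a format defined by the project.
    """
    def cells(matrix):
        # Flatten the nested dict into a set of (taxon, character, state) triples.
        return {
            (taxon, character, state)
            for taxon, characters in matrix.items()
            for character, state in characters.items()
        }

    extracted_cells = cells(extracted)
    reference_cells = cells(reference)
    true_positives = len(extracted_cells & reference_cells)

    precision = true_positives / len(extracted_cells) if extracted_cells else 0.0
    recall = true_positives / len(reference_cells) if reference_cells else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy matrices (placeholder taxa and character states).
reference = {
    "taxon_A": {"cell_shape": "rod", "motility": "motile", "heterocysts": "absent"},
    "taxon_B": {"cell_shape": "coccus", "motility": "nonmotile"},
}
extracted = {
    "taxon_A": {"cell_shape": "rod", "motility": "nonmotile"},
    "taxon_B": {"cell_shape": "coccus", "motility": "nonmotile"},
}

precision, recall, f1 = score_matrix(extracted, reference)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case three of the four extracted cells match the reference, and three of the five reference cells are recovered. Exact-match scoring is the simplest choice; a real evaluation would probably also need to handle synonymous character states, which is one reason a microbial ontology would help.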
- H. Cui, 2012, CharaParser for fine-grained semantic annotation of organism morphological descriptions. Journal of the American Society for Information Science and Technology 63(4): 738–754, doi:10.1002/asi.22618.
- C. E. Blank and P. Sanchez-Baracaldo, 2010, Timing of morphological and ecological innovations in the cyanobacteria – a key to understanding the rise in atmospheric oxygen. Geobiology 8: 1–23, doi:10.1111/j.1472-4669.2009.00220.x.