ENIGMA researchers are excited to announce a new tool, PaperBLAST, which mines the text of publications for information about proteins and their homologs. It will make it much easier to interpret genomic data from a wide range of non-model organisms.
Large-scale genome sequencing has identified millions of protein-coding genes whose functions are unknown. Many of these proteins are similar to characterized proteins from other organisms, but much of this information is missing from annotation databases and remains hidden in the scientific literature. To make this information accessible, PaperBLAST uses EuropePMC to search the full text of scientific articles for references to genes. PaperBLAST also takes advantage of curated resources that link protein sequences to scientific articles (Swiss-Prot, GeneRIF, and EcoCyc). PaperBLAST’s database includes 366,014 different protein sequences linked to 890,102 scientific articles. Given a protein of interest, PaperBLAST quickly finds similar proteins that are discussed in the literature and presents snippets of text from relevant articles or from the curators.
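To give a feel for the literature-search step, here is a minimal sketch of how a query for a gene identifier could be sent to Europe PMC's public REST search service. It is an illustration only, not PaperBLAST's actual code: the function name and the locus tag "b2097" are placeholders chosen for this example, and the sketch only builds the request URL rather than contacting the service.

```python
from urllib.parse import urlencode

# Public Europe PMC REST search endpoint.
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_gene_query_url(gene_id: str, page_size: int = 25) -> str:
    """Build a Europe PMC search URL for articles mentioning a gene identifier.

    Hypothetical helper for illustration; PaperBLAST's own pipeline may
    construct its queries differently.
    """
    params = {
        # Quote the identifier so it is searched as an exact phrase.
        "query": f'"{gene_id}"',
        "format": "json",
        "pageSize": page_size,
    }
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"


# "b2097" is a placeholder locus tag used only for this example.
url = build_gene_query_url("b2097")
print(url)
```

Fetching the resulting URL returns a JSON list of matching articles, which a tool like this could then link back to the protein sequence for that identifier.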
PaperBLAST: Text-mining papers for information about homologs.
M. N. Price and A. P. Arkin (2018). mSystems, 10.1128/mSystems.00039-17
PaperBLAST is available at http://papers.genomics.lbl.gov