Title
Closing the loop: from paper to protein annotation using supervised Gene Ontology classification.
Abstract
Curating gene function from the literature with Gene Ontology (GO) concepts is a particularly time-consuming task in genomics, and bioinformatics support is in high demand to keep up with the flow of publications. In 2004, the first BioCreative challenge already included a task on automatic assignment of GO concepts from full text. At that time, results were judged to fall far short of the performance required by real curation workflows. In particular, supervised approaches produced the most disappointing results because of a lack of training data. Ten years later, the available curation data have grown massively. In 2013, the BioCreative IV GO task revisited automatic GO assignment. For this task, we investigated the power of our supervised classifier, GOCat. GOCat computes similarities between an input text and already curated instances contained in a knowledge base to infer GO concepts. Subtask A consisted of selecting GO evidence sentences for a relevant gene in a full text. For this, we designed a state-of-the-art supervised statistical approach, using a naive Bayes classifier and the official training set, and obtained fair results. Subtask B consisted of predicting GO concepts from the previous output. For this, we applied GOCat and reached leading results, up to 65% hierarchical recall in the top 20 returned concepts. Contrary to previous competitions, machine learning this time outperformed standard dictionary-based approaches. Thanks to BioCreative IV, we were able to design a complete workflow for curation: given a gene name and a full text, the system selects evidence sentences for curation and delivers highly relevant GO concepts. The observed performance is sufficient for use in a real semiautomatic curation workflow. GOCat is available at http://eagl.unige.ch/GOCat/.
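The abstract describes GOCat as inferring GO concepts from the similarity between an input text and already curated instances. A minimal sketch of that general idea (nearest-neighbour voting) is given here for illustration only. The bag-of-words cosine similarity, the function names, the value of `k` and the toy knowledge base are all assumptions; the abstract does not specify GOCat's actual features, weighting or data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_go(text, knowledge_base, k=2, top_n=20):
    """Rank GO concepts by similarity-weighted votes from the k nearest curated instances.

    knowledge_base is a list of (curated_text, [GO identifiers]) pairs.
    """
    query = Counter(tokenize(text))
    scored = [(cosine(query, Counter(tokenize(doc))), terms)
              for doc, terms in knowledge_base]
    neighbours = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    votes = Counter()
    for similarity, terms in neighbours:
        for term in terms:
            votes[term] += similarity
    # Keep only concepts that received a non-zero vote, best first.
    return [term for term, score in votes.most_common(top_n) if score > 0]


# Toy knowledge base: the sentences are invented for illustration, not real curation data.
TOY_KB = [
    ("kinase phosphorylates substrate using ATP", ["GO:0016301", "GO:0005524"]),
    ("protein binds ATP in the catalytic domain", ["GO:0005524"]),
    ("repair of damaged DNA after irradiation", ["GO:0006281"]),
]

print(predict_go("the kinase domain binds ATP", TOY_KB))
```

In this sketch, each neighbour votes for its GO concepts with a weight equal to its similarity to the query, so concepts shared by several close neighbours rise to the top of the ranked list.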
Year: 2014
DOI: 10.1093/database/bau088
Venue: DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION
Field: Data mining, Information retrieval, Gene ontology, Computer science, Data curation, Software, Protein Annotation, Bioinformatics, Knowledge base, Classifier (linguistics), Molecular Sequence Annotation, Workflow
DocType: Journal
Volume: 2014
ISSN: 1758-0463
Citations: 2
PageRank: 0.37
References: 19
Authors: 4
Name              Order  Citations  PageRank
Julien Gobeill    1      302        30.42
Emilie Pasche     2      99         15.93
Dina Vishnyakova  3      113        11.16
P Ruch            4      650        38.72