Title
Investigating heterogeneous protein annotations toward cross-corpora utilization.
Abstract
BACKGROUND: The number of corpora (collections of structured texts) has been increasing as a result of the growing interest in applying natural language processing methods to biological texts. Many named entity recognition (NER) systems have been developed based on these corpora. However, the biomedical community has as yet no general consensus regarding named entity annotation; thus, the resources are largely incompatible, and it is difficult to compare the performance of systems developed on divergently annotated resources. On the other hand, from a practical application perspective, it is desirable to utilize as many existing annotated resources as possible, because annotation is costly. Integrating the heterogeneous annotations in these resources therefore becomes a task of interest.
RESULTS: We explore the potential sources of incompatibility among gene and protein annotations made for three common corpora: GENIA, GENETAG and AIMed. To show the inconsistency in the corpora annotations, we first tackle the incompatibility problem caused by corpus integration, and we quantitatively measure the effect of this incompatibility on protein mention recognition. We find that F-score performance declines markedly when training with integrated data instead of pure data; in some cases, the performance drops by nearly 12%. This degradation may be caused by the newly added heterogeneous annotations, and cannot be fixed without an understanding of the heterogeneities that exist among the corpora. Motivated by the result of this preliminary experiment, we further qualitatively analyze a number of possible sources for these differences, and investigate the factors that would explain the inconsistencies by performing a series of well-designed experiments. Our analyses indicate that incompatibilities in the gene/protein annotations exist mainly in the following four areas: the boundary annotation conventions, the scope of the entities of interest, the distribution of annotated entities, and the ratio of overlap between annotated entities. We further suggest that almost all of the incompatibilities can be prevented by properly considering these four aspects.
CONCLUSION: Our analysis covers the key similarities and dissimilarities that exist among the diverse gene/protein corpora. This paper serves to improve our understanding of the differences among the three studied corpora, which can in turn lead to a better understanding of the performance of protein recognizers that are based on them.
Year: 2009
DOI: 10.1186/1471-2105-10-403
Venue: BMC Bioinformatics
Keywords: proteins, natural language processing, bioinformatics, algorithms, computational biology, genes, microarrays
Field: Annotation, Computer science, Block (data storage), Named entity, Natural language processing, Protein Annotation, Artificial intelligence, Bioinformatics, Named-entity recognition
DocType: Journal
Volume: 10
Issue: 1
ISSN: 1471-2105
Citations: 29
PageRank: 0.71
References: 24
Authors: 5
Name              Order  Citations  PageRank
Yue Wang          1      29         0.71
Jin-Dong Kim      2      1705       92.21
Rune Sætre        3      560        28.49
Sampo Pyysalo     4      1941       100.14
Jun-ichi Tsujii   5      1973       219.85