Investigating heterogeneous protein annotations toward cross-corpora utilization. - Citegraph

Paper Info

Title
Investigating heterogeneous protein annotations toward cross-corpora utilization.

Abstract
BACKGROUND: The number of corpora (collections of structured texts) has been increasing as a result of the growing interest in applying natural language processing methods to biological texts. Many named entity recognition (NER) systems have been developed based on these corpora. However, the biomedical community has not yet reached a general consensus on named entity annotation; thus, the resources are largely incompatible, and it is difficult to compare the performance of systems developed on divergently annotated resources. On the other hand, from a practical application perspective, it is desirable to use as many existing annotated resources as possible, because annotation is costly. Integrating the heterogeneous annotations in these resources is therefore a task of interest.
RESULTS: We explore the potential sources of incompatibility among gene and protein annotations made for three common corpora: GENIA, GENETAG and AIMed. To show the inconsistency in the corpora annotations, we first tackle the incompatibility problem caused by corpus integration, and we quantitatively measure the effect of this incompatibility on protein mention recognition. We find that the F-score declines tremendously when training with integrated data instead of pure data; in some cases, the performance drops by nearly 12%. This degradation may be caused by the newly added heterogeneous annotations, and cannot be fixed without an understanding of the heterogeneities that exist among the corpora. Motivated by the result of this preliminary experiment, we further qualitatively analyze a number of possible sources for these differences, and investigate the factors that would explain the inconsistencies by performing a series of well-designed experiments. Our analyses indicate that incompatibilities in the gene/protein annotations exist mainly in the following four areas: the boundary annotation conventions, the scope of the entities of interest, the distribution of annotated entities, and the ratio of overlap between annotated entities. We further suggest that almost all of the incompatibilities can be prevented by properly considering these four aspects.
CONCLUSION: Our analysis covers the key similarities and dissimilarities that exist among the diverse gene/protein corpora. This paper serves to improve our understanding of the differences in the three studied corpora, which can in turn lead to a better understanding of the performance of protein recognizers that are based on the corpora.
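
Illustrative note (not part of the paper): the abstract names boundary annotation conventions as one source of incompatibility. The sketch below is a minimal illustration of why that matters for F-score, assuming mentions are represented as character-offset spans. The function names, the spans, and the relaxed (any-overlap) matching criterion are hypothetical choices made for this example and are not the authors' evaluation code.

```python
from typing import Set, Tuple

# A mention is a (start, end) character span, end exclusive (an assumption).
Span = Tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """Return True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def f_score(gold: Set[Span], pred: Set[Span], relaxed: bool = False) -> float:
    """Mention-level F-score with exact or relaxed (any-overlap) matching."""
    if relaxed:
        # A predicted span counts as correct if it overlaps any gold span,
        # and a gold span counts as found if any predicted span overlaps it.
        matched_pred = sum(any(overlaps(p, g) for g in gold) for p in pred)
        matched_gold = sum(any(overlaps(g, p) for p in pred) for g in gold)
    else:
        # Exact matching: both boundaries must agree.
        matched_pred = matched_gold = len(gold & pred)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical case: one convention annotates "p53 protein" as (0, 11),
# while a recognizer trained under another convention predicts "p53" as (0, 3).
gold = {(0, 11), (20, 25)}
pred = {(0, 3), (20, 25)}

strict = f_score(gold, pred)                # boundary mismatch is an error
loose = f_score(gold, pred, relaxed=True)   # boundary mismatch is tolerated
print(strict, loose)
```

In this toy case the only disagreement is where the first mention ends, yet exact matching penalizes it as both a false positive and a false negative, which is one way that mixing corpora with different boundary conventions could depress a strictly scored F-score.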

Year	DOI	Venue
2009	10.1186/1471-2105-10-403	BMC Bioinformatics
Keywords	Field	DocType
proteins,natural language processing,bioinformatics,algorithms,computational biology,genes,microarrays	Annotation,Computer science,Block (data storage),Named entity,Natural language processing,Protein Annotation,Artificial intelligence,Bioinformatics,Named-entity recognition	Journal
Volume	Issue	ISSN
10	1	1471-2105
Citations	PageRank	References
29	0.71	24
Authors
5

Authors (5 rows)

Name	Order	Citations	PageRank
Yue Wang	1	29	0.71
Jin-Dong Kim	2	1705	92.21
Rune Sætre	3	560	28.49
Sampo Pyysalo	4	1941	100.14
Jun-ichi Tsujii	5	1973	219.85
