Abstract |
---|
Motivation: Biomedical research findings are typically disseminated through publications. To simplify access to domain-specific knowledge while supporting the research community, several biomedical databases devote significant effort to manual curation of the literature, a labor-intensive process. The first step toward biocuration requires identifying articles relevant to the specific area on which the database focuses. Thus, automatically identifying publications relevant to a specific topic within a large volume of publications is an important task toward expediting the biocuration process and, in turn, biomedical research. Current methods focus on textual contents, typically extracted from the title-and-abstract. Notably, images and captions are often used in publications to convey pivotal evidence about processes, experiments and results. Results: We present a new document classification scheme, using both image and caption information, in addition to titles-and-abstracts. To use the image information, we introduce a new image representation, namely Figure-word, based on class labels of subfigures. We use word embeddings for representing captions and titles-and-abstracts. To utilize all three types of information, we introduce two information integration methods. The first combines Figure-words and textual features obtained from captions and titles-and-abstracts into a single larger vector for document representation; the second employs a meta-classification scheme. Our experiments and results demonstrate the usefulness of the newly proposed Figure-words for representing images. Moreover, the results showcase the value of Figure-words, captions and titles-and-abstracts in providing complementary information for document classification; when combined, these three sources of information lead to improved overall classification performance. |
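
The two integration methods named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the classifiers, the feature dimensions, or how Figure-words are encoded, so every function name, the toy feature values, and the stand-in classifiers (`toy_base`, `toy_meta`) below are assumptions made purely for illustration.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def concatenate_features(figure_words: Sequence[float],
                         caption_emb: Sequence[float],
                         title_abstract_emb: Sequence[float]) -> Vector:
    """Integration method 1: join the three sources into one larger vector."""
    return [*figure_words, *caption_emb, *title_abstract_emb]


def meta_classify(sources: Sequence[Sequence[float]],
                  base_classifiers: Sequence[Callable[[Sequence[float]], float]],
                  meta_classifier: Callable[[Sequence[float]], bool]) -> bool:
    """Integration method 2: one base classifier per source scores the
    document; a meta-classifier decides from those scores."""
    if len(sources) != len(base_classifiers):
        raise ValueError("need exactly one base classifier per source")
    scores = [clf(x) for clf, x in zip(base_classifiers, sources)]
    return meta_classifier(scores)


def toy_base(features: Sequence[float]) -> float:
    # Hypothetical stand-in for a trained base model:
    # sigmoid of the mean feature value, giving a score in (0, 1).
    return 1.0 / (1.0 + math.exp(-sum(features) / len(features)))


def toy_meta(scores: Sequence[float]) -> bool:
    # Hypothetical stand-in for a trained meta-classifier:
    # label the document relevant if the mean base score is at least 0.5.
    return sum(scores) / len(scores) >= 0.5


# Toy feature values (assumed, not taken from the paper).
figure_words = [1.0, 0.0, 2.0]      # image-derived features
caption_emb = [0.2, -0.1]           # caption embedding
title_abstract_emb = [0.4, 0.3]     # title-and-abstract embedding

doc_vector = concatenate_features(figure_words, caption_emb, title_abstract_emb)
is_relevant = meta_classify(
    [figure_words, caption_emb, title_abstract_emb],
    [toy_base, toy_base, toy_base],
    toy_meta,
)
```

In the first scheme a single classifier would be trained on `doc_vector`; in the second, each source keeps its own model and only their scores are combined, which lets the sources have very different dimensionalities.
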
Year | DOI | Venue |
---|---|---|
2021 | 10.1093/bioinformatics/btab331 | BIOINFORMATICS |
DocType | Volume | Issue |
Conference | 37 | Supplement_1 |
ISSN | Citations | PageRank |
1367-4803 | 0 | 0.34 |
References | Authors | |
0 | 10 |
Name | Order | Citations | PageRank |
---|---|---|---|
Pengyuan Li | 1 | 16 | 5.81 |
Xiangying Jiang | 2 | 4 | 2.42 |
Gongbo Zhang | 3 | 0 | 2.03 |
Juan Trelles Trabucco | 4 | 0 | 0.34 |
Daniela Raciti | 5 | 0 | 0.34 |
Cynthia Smith | 6 | 0 | 0.34 |
Martin Ringwald | 7 | 11 | 2.31 |
G Elisabeta Marai | 8 | 136 | 20.43 |
Cecilia Arighi | 9 | 0 | 0.34 |
Hagit Shatkay | 10 | 910 | 96.13 |