Title
Pre-trained models, data augmentation, and ensemble learning for biomedical information extraction and document classification
Abstract
Large volumes of publications are being produced in the biomedical sciences at an ever-increasing speed. To deal with this large amount of unstructured text data, effective natural language processing (NLP) methods need to be developed for tasks such as document classification and information extraction. The BioCreative Challenge was established to evaluate the effectiveness of information extraction methods in the biomedical domain and to facilitate their development as a community-wide effort. In this paper, we summarize our work and what we have learned from the latest round, BioCreative Challenge VII, in which we participated in all five tracks. Overall, we found three key components for achieving high performance across a variety of NLP tasks: (1) pre-trained NLP models, (2) data augmentation strategies and (3) ensemble modelling. These three strategies need to be tailored to the specific tasks at hand to achieve high-performing baseline models, which are usually good enough for practical applications. When further combined with task-specific methods, additional improvements (usually rather small) can be achieved, which might be critical for winning competitions.
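The three-component recipe named in the abstract (pre-trained models, data augmentation, ensembling) can be pictured with a short, purely illustrative sketch. The snippet below is not the authors' code: the word-dropout augmenter, the soft-voting ensemble and the placeholder scoring functions (model_a, model_b) are hypothetical stand-ins for fine-tuned, pre-trained biomedical encoders applied to a binary document-classification task.

```python
# Illustrative sketch only (hypothetical helpers, not the paper's pipeline):
# 1) create noisy training copies (data augmentation),
# 2) average the predictions of several fine-tuned models (ensemble modelling).

import random
from typing import Callable, List, Sequence


def dropout_augment(text: str, p: float = 0.1, n_copies: int = 2, seed: int = 0) -> List[str]:
    """Simple word-dropout augmentation: return noisy copies of a training document."""
    rng = random.Random(seed)
    copies = []
    for _ in range(n_copies):
        words = text.split()
        kept = [w for w in words if rng.random() > p] or words  # never drop everything
        copies.append(" ".join(kept))
    return copies


def soft_vote(prob_fns: Sequence[Callable[[str], float]], text: str) -> float:
    """Ensemble by averaging each model's positive-class probability."""
    probs = [fn(text) for fn in prob_fns]
    return sum(probs) / len(probs)


if __name__ == "__main__":
    # Placeholder "models": in practice these would be different pre-trained
    # biomedical encoders fine-tuned on the augmented training set.
    model_a = lambda t: 0.80 if "mutation" in t else 0.30
    model_b = lambda t: 0.65 if "protein" in t else 0.40

    train_doc = "The BRCA1 mutation alters protein binding."
    augmented = dropout_augment(train_doc)             # extra noisy training copies
    score = soft_vote([model_a, model_b], train_doc)   # averaged ensemble score
    print(augmented)
    print(f"ensemble positive-class probability: {score:.2f}")
```

In practice, the placeholder scoring functions would be replaced by fine-tuned transformer checkpoints, and the augmentation step would use task-appropriate strategies rather than random word dropout.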
Year
2022
DOI
10.1093/database/baac066
Venue
DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION
DocType
Journal
Volume
2022
ISSN
1758-0463
Citations
0
PageRank
0.34
References
0
Authors
12
Name | Order | Citations | PageRank
Arslan Erdengasileng | 1 | 0 | 0.34
Qing Han | 2 | 0 | 0.34
Tingting Zhao | 3 | 0 | 0.34
Shubo Tian | 4 | 0 | 0.68
Xin Sui | 5 | 340 | 31.49
Keqiao Li | 6 | 0 | 0.34
Wanjing Wang | 7 | 0 | 0.34
Jian Wang | 8 | 0 | 0.34
Ting Hu | 9 | 0 | 0.34
Feng Pan | 10 | 0 | 0.34
Yuan Zhang | 11 | 0 | 0.34
Jinfeng Zhang | 12 | 86 | 10.11