Abstract |
---|
Software tools are now essential to research and applications in the biomedical domain. However, existing software repositories are built mainly through manual curation, which is time-consuming and does not scale. In this study, we manually annotated software names in 1,120 MEDLINE abstracts and titles and used this corpus to develop and evaluate machine learning-based named entity recognition systems for biomedical software. Two feature engineering strategies were proposed: (1) domain knowledge features and (2) unsupervised word representation features derived from clustered and binarized word embeddings. Under inexact matching criteria, our best system achieved an F-measure of 91.79% for recognizing software in titles and 86.35% for recognizing software in both titles and abstracts. We then used the developed system to create a biomedical software catalog with 19,557 entries. This study demonstrates the feasibility of using natural language processing methods to automatically build a high-quality software index from biomedical literature. |
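
The abstract reports F-measures under inexact matching criteria but does not define them. The following is a minimal sketch of one common reading, assuming that a predicted software mention counts as correct when its character span overlaps a gold span at all. The names `overlaps` and `inexact_prf` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """Return True when two half-open spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def inexact_prf(gold: List[Span], pred: List[Span]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 where any overlap counts as a match.

    Assumption: this overlap rule is only one possible definition of
    inexact matching; the paper may use a different one.
    """
    # A prediction is correct if it overlaps at least one gold span.
    matched_pred = sum(1 for p in pred if any(overlaps(p, g) for g in gold))
    # A gold span is found if at least one prediction overlaps it.
    matched_gold = sum(1 for g in gold if any(overlaps(g, p) for p in pred))
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

With two gold mentions and three predictions, two of which overlap a gold mention, this gives a precision of 2/3, a recall of 1, and an F-measure of 0.8. Exact matching would instead require both span boundaries to agree.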
Year | DOI | Venue |
---|---|---|
2020 | 10.1177/1460458219869490 | Health Informatics Journal |
Keywords | DocType | Volume |
biomedical literature, biomedical software, biomedical software index, named entity recognition, natural language processing | Journal | 26 |
Issue | ISSN | Citations |
SP1.0 | 1460-4582 | 1 |
PageRank | References | Authors |
0.35 | 0 | 7 |
Name | Order | Citations | PageRank |
---|---|---|---|
Qiang Wei | 1 | 133 | 30.22 |
Yaoyun Zhang | 2 | 56 | 14.30 |
Muhammad Amith | 3 | 22 | 9.01 |
Rebecca Lin | 4 | 3 | 1.14 |
Jenay Lapeyrolerie | 5 | 1 | 0.35 |
Cui Tao | 6 | 35 | 12.77 |
Hua Xu | 7 | 650 | 69.76 |