Abstract | ||
---|---|---|
In this paper we exploit a combination of off-the-shelf tools for extracting a machine-understandable representation of phenotypes and other related concepts that concern the diagnosis and treatment of diseases. These are tested against a gold standard EPR collection that has been annotated with Unified Medical Language System (UMLS) concept identifiers: the ShARE/CLEF 2013 corpus for disorder detection. We evaluate four pipelines as stand-alone systems and then attempt to optimise semantic-type-based performance using several learn-to-rank (LTR) approaches: three pairwise and one listwise. We observed that whilst Apache cTAKES tended to outperform the other stand-alone systems overall, with strong recall (R = 0.57), its precision was low (P = 0.09), leading to a low-to-moderate F1 measure (F1 = 0.16). Moreover, there is substantial variation in system performance across semantic types for disorders. For example, the semantic type Findings (T033) seemed to be very challenging for all systems. Combining systems within LTR improved F1 substantially (F1 = 0.24), particularly for Disease or syndrome (T047) and Anatomical abnormality (T190). Whilst recall is improved markedly, precision remains a challenge (P = 0.15, R = 0.59). |
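
The F1 values quoted in the abstract can be checked against the reported precision and recall. The sketch below assumes F1 is the usual balanced harmonic mean of precision and recall, which the abstract does not state explicitly; the function name `f1_score` is a hypothetical helper, not something taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Balanced F1: harmonic mean of precision and recall (assumed definition)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs as reported in the abstract.
ctakes_f1 = f1_score(0.09, 0.57)  # stand-alone Apache cTAKES
ltr_f1 = f1_score(0.15, 0.59)     # systems combined with learn-to-rank

print(f"cTAKES stand-alone F1 = {ctakes_f1:.2f}")  # prints 0.16
print(f"LTR combination F1 = {ltr_f1:.2f}")        # prints 0.24
```

Under that assumption, both computed values round to the figures given in the abstract (0.16 and 0.24).
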
Year | DOI | Venue |
---|---|---|
2015 | 10.1186/s13326-015-0019-z | Journal of Biomedical Semantics |
Keywords | Field | DocType |
Latent Dirichlet Allocation, Semantic Type, Unified Medical Language System, Word Sense Disambiguation, Entity Recognition | Data science, Data mining, Latent Dirichlet allocation, Text mining, Computer science, Concept selection, Word-sense disambiguation, Expressivity | Journal |
Volume | Issue | ISSN |
6 | 1 | 2041-1480 |
Citations | PageRank | References |
2 | 0.40 | 25 |
Authors | ||
3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Nigel Collier | 1 | 18 | 5.07 |
Anika Oellrich | 2 | 157 | 13.61 |
Tudor Groza | 3 | 219 | 24.89 |