Text Simplification Tools: Using Machine Learning to Discover Features that Identify Difficult Text - Citegraph

Paper Info

Title
Text Simplification Tools: Using Machine Learning to Discover Features that Identify Difficult Text

Abstract
Although providing understandable information is a critical component in healthcare, few tools exist to help clinicians identify difficult sections in text. We systematically examine sixteen features for predicting the difficulty of health texts using six different machine learning algorithms. Three represent new features not previously examined: medical concept density, specificity (calculated using word-level depth in MeSH), and ambiguity (calculated using the number of UMLS Metathesaurus concepts associated with a word). We examine these features for a binary prediction task on 118,000 simple and difficult sentences from a sentence-aligned corpus. Using all features, random forests is the most accurate with 84% accuracy. Model analysis of the six models and a complementary ablation study shows that the specificity and ambiguity features are the strongest predictors (24% combined impact on accuracy). Notably, a training size study showed that even with a 1% sample (1,062 sentences) an accuracy of 80% can be achieved.
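As a rough illustration of the three new features named in the abstract, the sketch below computes them for a single sentence. It is a minimal sketch under stated assumptions, not the authors' implementation: the lookup tables `TOY_MESH_DEPTH` and `TOY_UMLS_CONCEPTS` are hypothetical stand-ins for MeSH and the UMLS Metathesaurus, the tokenizer is deliberately naive, and averaging word-level values over the sentence is an assumed aggregation that the abstract does not specify.

```python
# Hypothetical toy tables: real values would come from MeSH (tree depth)
# and the UMLS Metathesaurus (number of concepts associated with a word).
TOY_MESH_DEPTH = {"disease": 1, "myocardial": 3, "infarction": 4}
TOY_UMLS_CONCEPTS = {"disease": 1, "myocardial": 1, "infarction": 2, "cold": 6}

PUNCTUATION = ".,;:!?()"


def tokenize(sentence):
    """Naive whitespace tokenizer that lowercases and strips punctuation."""
    tokens = (word.strip(PUNCTUATION).lower() for word in sentence.split())
    return [token for token in tokens if token]


def concept_density(tokens):
    """Fraction of tokens that map to at least one (toy) medical concept."""
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in TOY_UMLS_CONCEPTS)
    return hits / len(tokens)


def mean_over_known(tokens, table):
    """Average the table value over tokens present in the table (assumed aggregation)."""
    values = [table[token] for token in tokens if token in table]
    return sum(values) / len(values) if values else 0.0


def sentence_features(sentence):
    """Return the three sentence-level features as a dict."""
    tokens = tokenize(sentence)
    return {
        "concept_density": concept_density(tokens),
        # Specificity: mean word-level depth in the (toy) MeSH hierarchy.
        "specificity": mean_over_known(tokens, TOY_MESH_DEPTH),
        # Ambiguity: mean number of (toy) UMLS concepts per known word.
        "ambiguity": mean_over_known(tokens, TOY_UMLS_CONCEPTS),
    }


features = sentence_features("Myocardial infarction is a disease.")
```

Feature dictionaries of this shape could then be fed to any binary classifier, such as the random forest the abstract reports as most accurate.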

Year: 2014
DOI: 10.1109/HICSS.2014.330
Venue: System Sciences
Keywords: binary prediction task, critical component, difficult sentence, discover features, machine learning, ambiguity feature, different machine, text simplification tools, umls metathesaurus concept, difficult section, complementary ablation study, combined impact, training size study, identify difficult text, text analysis, health care, natural language processing, learning artificial intelligence
Field: Text simplification, Text mining, Unified Modeling Language, Computer science, Feature extraction, Natural language processing, Artificial intelligence, Encyclopedia, Random forest, Ambiguity, Machine learning, Binary number
DocType: Conference
ISSN: 1060-3425
Citations: 3
PageRank: 0.40
References: 18
Authors: 4

Authors (4 rows)

Name	Order	Citations	PageRank
David Kauchak	1	363	25.92
Obay Mouradi	2	26	2.13
Christopher Pentoney	3	3	0.40
Gondy Leroy	4	528	47.72
