Title
Pomona at SemEval-2016 Task 11: Predicting Word Complexity Based on Corpus Frequency.
Abstract
We introduce a word-frequency-based classifier for the SemEval 2016 complex word identification task (#11). Words with lower frequency are predicted as complex based on a threshold optimized for G-score. We examine three corpora for calculating frequencies and find that English Wikipedia performs best (ranked 13th on the SemEval task), followed by the Google Web Corpus and, lastly, Simple English Wikipedia. Bagging is also shown to slightly improve the performance of the classifier. Overall, we find word frequency to be a strong predictor of complexity. On the SemEval “test” set, a frequency classifier that uses the optimal frequency threshold performs on par with the best submitted system, and a system trained using only 500 labeled examples split from the test set achieves results only slightly below the best submitted system.
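The thresholding idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes G-score is the harmonic mean of accuracy and recall on the complex class (the abstract does not define it), that unseen words get frequency 0, and that candidate thresholds are taken from the observed frequencies. The names `g_score`, `predict`, and `best_threshold` are hypothetical, and bagging is not covered.

```python
from typing import Dict, List, Tuple


def g_score(gold: List[int], pred: List[int]) -> float:
    """Harmonic mean of accuracy and recall (assumed definition of G-score)."""
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    positives = sum(gold)
    true_pos = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    recall = true_pos / positives if positives else 0.0
    if accuracy + recall == 0:
        return 0.0
    return 2 * accuracy * recall / (accuracy + recall)


def predict(words: List[str], freqs: Dict[str, int], threshold: int) -> List[int]:
    """Label a word complex (1) when its corpus frequency is below the threshold."""
    # Assumption: words missing from the corpus count as frequency 0.
    return [1 if freqs.get(w, 0) < threshold else 0 for w in words]


def best_threshold(
    words: List[str], gold: List[int], freqs: Dict[str, int]
) -> Tuple[int, float]:
    """Exhaustively try each observed frequency as a threshold; keep the best G-score."""
    candidates = sorted({freqs.get(w, 0) for w in words})
    # One extra candidate above the maximum lets every word be labeled complex.
    candidates.append(candidates[-1] + 1)
    best = (candidates[0], -1.0)
    for t in candidates:
        score = g_score(gold, predict(words, freqs, t))
        if score > best[1]:
            best = (t, score)
    return best
```

Given a list of training words, their 0/1 complexity labels, and a frequency dictionary built from a corpus such as English Wikipedia, `best_threshold` returns the chosen threshold and its training G-score; `predict` then applies that threshold to new words.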
Year: 2016
Venue: SemEval@NAACL-HLT
Field: SemEval, Ranking, Word lists by frequency, Computer science, Speech recognition, Natural language processing, Artificial intelligence, Classifier (linguistics), Test set
DocType: Conference
Citations: 2
PageRank: 0.41
References: 6
Authors: 1
Name: David Kauchak
Order: 1
Citations: 363
PageRank: 25.92