Pomona at SemEval-2016 Task 11: Predicting Word Complexity Based on Corpus Frequency. - Citegraph

Paper Info

Title
Pomona at SemEval-2016 Task 11: Predicting Word Complexity Based on Corpus Frequency.

Abstract
We introduce a word frequency-based classifier for the SemEval 2016 complex word identification task (Task 11). Words whose frequency falls below a threshold are predicted to be complex, with the threshold optimized for G-score. We examine three corpora for calculating frequencies and find that English Wikipedia performs best (ranked 13th on the SemEval task), followed by the Google Web Corpus and, lastly, Simple English Wikipedia. Bagging is also shown to slightly improve the performance of the classifier. Overall, we find word frequency to be a strong predictor of complexity. On the SemEval “test” set, a frequency classifier that uses the optimal frequency threshold performs on par with the best submitted system, and a system trained using only 500 labeled examples split from the test set achieves results only slightly below the best submitted system.
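
The thresholding idea in the abstract can be illustrated with a minimal sketch. This is not the authors' code. It assumes that G-score is the harmonic mean of accuracy and recall (the definition used by the Task 11 organizers, which this page does not spell out), that frequencies come from a plain word-to-count dictionary, and that bagging averages thresholds fitted on bootstrap resamples. The function names (`g_score`, `predict`, `fit_threshold`, `bagged_threshold`) and the toy data are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple


def g_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of accuracy and recall (assumed Task 11 definition)."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    positives = sum(y_true)
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    recall = true_pos / positives if positives else 0.0
    if accuracy + recall == 0:
        return 0.0
    return 2 * accuracy * recall / (accuracy + recall)


def predict(words: Sequence[str], freq: Dict[str, int], threshold: float) -> List[int]:
    """Label a word complex (1) when its corpus frequency is below the threshold."""
    return [1 if freq.get(w.lower(), 0) < threshold else 0 for w in words]


def fit_threshold(words: Sequence[str], labels: Sequence[int],
                  freq: Dict[str, int]) -> Tuple[float, float]:
    """Try each observed frequency as a cut-off and keep the best G-score."""
    candidates = sorted({freq.get(w.lower(), 0) for w in words})
    candidates.append(candidates[-1] + 1)  # cut-off that marks every word complex
    best_t, best_g = candidates[0], -1.0
    for t in candidates:
        g = g_score(labels, predict(words, freq, t))
        if g > best_g:  # strict comparison keeps the lowest best threshold
            best_t, best_g = t, g
    return best_t, best_g


def bagged_threshold(words: Sequence[str], labels: Sequence[int],
                     freq: Dict[str, int], n_bags: int = 10, seed: int = 0) -> float:
    """Assumed bagging scheme: average thresholds fitted on bootstrap resamples."""
    rng = random.Random(seed)
    n = len(words)
    thresholds = []
    for _ in range(n_bags):
        idx = [rng.randrange(n) for _ in range(n)]
        t, _ = fit_threshold([words[i] for i in idx], [labels[i] for i in idx], freq)
        thresholds.append(t)
    return sum(thresholds) / len(thresholds)


# Toy, made-up frequencies and labels, for illustration only.
toy_freq = {"the": 1000, "cat": 500, "sat": 400, "perspicacious": 2, "obfuscate": 5}
toy_words = ["the", "cat", "sat", "perspicacious", "obfuscate"]
toy_labels = [0, 0, 0, 1, 1]

if __name__ == "__main__":
    print(fit_threshold(toy_words, toy_labels, toy_freq))
    print(bagged_threshold(toy_words, toy_labels, toy_freq))
```

Because the only learned parameter is a single cut-off, a classifier of this kind can be fitted from a small labeled sample, which is consistent with the abstract's observation about training on only 500 examples.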

Year	2016
Venue	SemEval@NAACL-HLT
Field	SemEval, Ranking, Word lists by frequency, Computer science, Speech recognition, Natural language processing, Artificial intelligence, Classifier (linguistics), Test set
DocType	Conference
Citations	2
PageRank	0.41
References	6
Authors	1

Authors (1 row)

Name	Order	Citations	PageRank
David Kauchak	1	363	25.92
