Title
Comparing and combining a semantic tagger and a statistical tool for MWE extraction
Abstract
Automatic extraction of multiword expressions (MWEs) presents a tough challenge for the NLP community and corpus linguistics. Although numerous knowledge-based symbolic approaches and statistically driven algorithms have been proposed, efficient MWE extraction remains an unsolved issue. In this paper, we evaluate the Lancaster UCREL Semantic Analysis System (henceforth USAS; Rayson, P., Archer, D., Piao, S., McEnery, T., 2004. The UCREL semantic analysis system. In: Proceedings of the LREC-04 Workshop, Beyond Named Entity Recognition: Semantic Labelling for NLP Tasks, Lisbon, Portugal, pp. 7-12) for MWE extraction, and explore the possibility of improving USAS by incorporating a statistical algorithm. Developed at Lancaster University, the USAS system automatically annotates English corpora with semantic category information. Employing a large-scale, semantically classified multiword expression template database, the system is also capable of detecting many multiword expressions, as well as assigning semantic field information to the MWEs extracted. Whilst USAS therefore offers a unique tool for MWE extraction, allowing us to both extract and semantically classify MWEs, it can sometimes suffer from low recall. Consequently, we have been comparing USAS, which employs a symbolic approach, to a statistical tool, which is based on collocational information, in order to determine the pros and cons of these different tools and, more importantly, to examine the possibility of improving MWE extraction by combining them. As we report in this paper, we have found a highly complementary relation between the different tools: USAS missed many domain-specific MWEs (law/court terms in this case), and the statistical tool missed many commonly used MWEs that occur with low frequency (lower than three in this case).
Because of this complementary relation, we propose that MWE coverage can be significantly increased by combining a lexicon-based symbolic approach and a collocation-based statistical approach.
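The combination proposed in the abstract can be illustrated with a minimal sketch. This is not the USAS system or the statistical tool evaluated in the paper: the lexicon lookup is a hypothetical stand-in for the USAS template database, pointwise mutual information (PMI) is an assumed collocation measure that the abstract does not name, and all function names and the `min_pmi` threshold are invented for illustration. Only the frequency cut-off of three is taken from the abstract.

```python
import math
from collections import Counter


def lexicon_mwes(tokens, lexicon):
    """Symbolic side: keep bigrams listed in a hand-built MWE lexicon."""
    return {bg for bg in zip(tokens, tokens[1:]) if bg in lexicon}


def statistical_mwes(tokens, min_freq=3, min_pmi=1.0):
    """Statistical side: keep bigrams that are frequent and strongly associated.

    PMI is an assumed association measure, not the one used in the paper.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(n_uni - 1, 1)
    found = set()
    for (w1, w2), freq in bigrams.items():
        if freq < min_freq:  # bigrams seen fewer than three times are dropped
            continue
        p_joint = freq / n_bi
        p_indep = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        if math.log2(p_joint / p_indep) >= min_pmi:
            found.add((w1, w2))
    return found


def combined_mwes(tokens, lexicon, **kwargs):
    """Union of both candidate sets, so each side covers the other's misses."""
    return lexicon_mwes(tokens, lexicon) | statistical_mwes(tokens, **kwargs)


# Toy data: "crown court" is a frequent domain term missing from the lexicon;
# "of course" is a common MWE that occurs only once.
tokens = ("crown court sat . crown court rose . crown court ruled . "
          "of course he agreed").split()
lexicon = {("of", "course")}

print(sorted(combined_mwes(tokens, lexicon)))
# [('crown', 'court'), ('of', 'course')]
```

In this toy corpus the lexicon alone misses the court term and the frequency-filtered statistic alone misses the low-frequency expression, which mirrors the complementary behaviour the abstract reports.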
Year: 2005
DOI: 10.1016/j.csl.2004.11.002
Venue: Computer Speech & Language
Keywords: usas system, multiword expression, mwe extraction, semantic tagger, mwe coverage, different tool, henceforth usas, automatic extraction, statistical tool, whilst usas, complementary relation, low frequency, knowledge base
Field: Expression (mathematics), Computer science, Natural language processing, Corpus linguistics, Artificial intelligence, Semantic field, Collocation, Computational linguistics, Speech recognition, Statistical algorithm, Lexicon, Named-entity recognition, Machine learning
DocType: Journal
Volume: 19
Issue: 4
ISSN: Computer Speech & Language
Citations: 24
PageRank: 1.33
References: 12
Authors (4)
Name               Order  Citations  PageRank
Scott S. L. Piao   1      93         12.65
Paul Rayson        2      538        54.59
Dawn Archer        3      32         3.31
Tony McEnery       4      53         8.87