Title
Term extraction from sparse, ungrammatical domain-specific documents
Abstract
Existing term extraction (TE) systems have predominantly targeted large, well-written document collections, which provide reliable statistical and linguistic evidence to support term extraction. In this article, we address the term extraction challenges posed by sparse, ungrammatical texts with domain-specific content, such as customer complaint emails and engineers' repair notes. To this end, we present ExtTerm, a novel term extraction system. As our core innovations, we accurately detect rare (low-frequency) terms, overcoming the issue of data sparsity. These rare terms may denote critical events, but they are often missed by extant TE systems. ExtTerm also precisely detects multi-word terms of arbitrary length, e.g. with more than 2 words. This is achieved by exploiting fundamental theoretical notions underlying term formation, and by developing a technique to compute the collocation strength between any number of words. We thereby address a limitation of existing TE systems, which are primarily designed to identify terms with 2 words. Furthermore, we show that open-domain (general) resources, such as Wikipedia, can be exploited to support domain-specific term extraction and can thus compensate for the unavailability of domain-specific knowledge resources. Our experimental evaluations reveal that ExtTerm outperforms a state-of-the-art baseline in extracting terms from a domain-specific, sparse and ungrammatical real-life text collection.
Year: 2013
DOI: 10.1016/j.eswa.2012.10.067
Venue: Expert Syst. Appl.
Keywords: existing term extraction system, domain-specific term extraction, domain-specific content, rare term, novel term extraction system, detects multi-word term, term extraction, term formation, ungrammatical domain-specific document, te system, domain-specific knowledge resource, text mining, natural language processing, business intelligence
Field: Ontology, Data mining, Information retrieval, Computer science, Unavailability, Extant taxon, Artificial intelligence, Business intelligence, Machine learning, Collocation
DocType: Journal
Volume: 40
Issue: 7
ISSN: 0957-4174
Citations: 8
PageRank: 0.48
References: 21
Authors: 2
Name, Order, Citations, PageRank
Ashwin Ittoo1616.58
Gosse Bouma248370.88