Title
Mining Infrequent High-Quality Phrases from Domain-Specific Corpora
Abstract
Phrase mining is a fundamental task in text analysis, with downstream applications such as named entity recognition, topic modeling, and relation extraction. In this paper, we focus on mining high-quality phrases from domain-specific corpora, with special attention to infrequent ones. Previous methods may miss infrequent high-quality phrases at the candidate selection stage. They also rely on explicit features to mine phrases and rarely consider implicit features. In addition, completeness is rarely considered explicitly when evaluating phrase quality. We propose a novel approach that uses a sequence labeling model to capture infrequent phrases and employs implicit semantic features and contextual POS tag statistics to measure meaningfulness and completeness, respectively. Experiments on four real-world corpora demonstrate that our method achieves significant improvements over previous state-of-the-art methods across different domains and languages.
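As a rough illustration of how a sequence labeling model can surface candidate phrases regardless of their frequency, the sketch below decodes per-token tags into phrase spans. The B/I/O tag scheme, the function name `bio_to_phrases`, and the example sentence are assumptions made for illustration; this record does not specify the tag set or model the paper actually uses.

```python
from typing import List, Tuple


def bio_to_phrases(tokens: List[str], tags: List[str]) -> List[Tuple[int, int, str]]:
    """Turn per-token B/I/O tags into (start, end, text) spans; end is exclusive.

    Assumed (hypothetical) scheme: "B" opens a phrase, "I" continues it,
    and any other tag is treated as outside a phrase.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    spans: List[Tuple[int, int, str]] = []
    start = None  # index where the currently open phrase began, if any

    for i, tag in enumerate(tags):
        if tag == "B":
            # A new phrase starts; close the previous one first.
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
            start = i
        elif tag == "I":
            # Tolerate an "I" with no preceding "B" by opening a phrase here.
            if start is None:
                start = i
        else:
            # Outside tag: close any open phrase.
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
                start = None

    # Close a phrase that runs to the end of the sentence.
    if start is not None:
        spans.append((start, len(tokens), " ".join(tokens[start:])))
    return spans


if __name__ == "__main__":
    tokens = ["acute", "myeloid", "leukemia", "is", "treated", "with", "chemotherapy"]
    tags = ["B", "I", "I", "O", "O", "O", "B"]
    for span in bio_to_phrases(tokens, tags):
        print(span)
```

Because every tagged span becomes a candidate, a phrase that occurs only once in the corpus is still extracted, which is the property a frequency-based candidate filter lacks.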
Year
2020
DOI
10.1145/3340531.3412029
Venue
CIKM '20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 2020
DocType
Conference
ISBN
978-1-4503-6859-9
Citations
0
PageRank
0.34
References
23
Authors
8
Name
Order
Citations
PageRank
Li Wang13815.46
Wei Zhu202.37
Sihang Jiang301.69
Sheng Zhang400.34
KeQiang Wang593.77
Yuan Ni6114.61
Guotong Xie717.51
Yanghua Xiao848254.90