Title
Tibetan Multi-word Expressions Identification Framework Based on News Corpora
Abstract
This paper presents a framework for identifying Tibetan multi-word expressions (MWEs). The framework has two phases. In the first phase, sentences are segmented and high-frequency word-based n-grams are extracted using Nagao's N-gram statistical algorithm and the Statistical Substring Reduction algorithm. In the second phase, Tibetan MWEs are identified by combining context analysis with language model-based analysis. The framework uses three strategies: context analysis, two-word coupling degree, and Tibetan syllable inside-word probability. In the experiments, we evaluate the effectiveness of the three strategies on a small test set and compare results for different granularities of context analysis. On the small test corpus, an F-score above 75% is achieved when words are segmented in pre-processing. On a larger corpus, P@N (N = 800) exceeds 85%, which indicates that the framework also works well on larger corpora. Overall, the results show acceptable performance for Tibetan MWE identification.
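The first phase and the P@N measure mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: plain counting stands in for Nagao's algorithm, the reduction step is a naive quadratic version of the substring-reduction idea (drop an n-gram when a longer n-gram with the same frequency contains it), and all function names, the `min_freq` threshold, and the placeholder tokens are assumptions made for illustration.

```python
from collections import Counter


def count_ngrams(sentences, max_n=3):
    """Count word-based n-grams (2 <= n <= max_n) over pre-segmented sentences."""
    counts = Counter()
    for words in sentences:
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts


def is_contiguous_sub(short, long):
    """Return True if tuple `short` occurs as a contiguous run inside tuple `long`."""
    k = len(short)
    return any(long[i:i + k] == short for i in range(len(long) - k + 1))


def reduce_substrings(counts, min_freq=2):
    """Keep frequent n-grams, dropping any that a longer frequent n-gram
    with the same frequency contains (naive O(n^2) comparison)."""
    frequent = {g: c for g, c in counts.items() if c >= min_freq}
    kept = {}
    for g, c in frequent.items():
        covered = any(
            len(h) > len(g) and frequent[h] == c and is_contiguous_sub(g, h)
            for h in frequent
        )
        if not covered:
            kept[g] = c
    return kept


def precision_at_n(ranked, gold, n):
    """P@N: fraction of the top-n ranked candidates that appear in the gold set."""
    top = ranked[:n]
    if not top:
        return 0.0
    return sum(1 for cand in top if cand in gold) / len(top)


# Placeholder tokens stand in for segmented Tibetan words (hypothetical data).
sentences = [["a", "b", "c", "d"], ["a", "b", "c", "e"], ["x", "a", "b"]]
candidates = reduce_substrings(count_ngrams(sentences))
ranked = sorted(candidates, key=candidates.get, reverse=True)
gold = {("a", "b", "c")}
print(ranked, precision_at_n(ranked, gold, 2))
```

In this toy run the bigram `("b", "c")` is dropped because the trigram `("a", "b", "c")` occurs equally often and contains it, while `("a", "b")` survives because it occurs more often than any longer n-gram that contains it.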
Year
2016
DOI
10.1007/978-3-319-50496-4_2
Venue
Lecture Notes in Computer Science
Keywords
Tibetan Multi-word expression, Two-word coupling degree, Inside word probability
DocType
Conference
Volume
10102
ISSN
0302-9743
Citations
0
PageRank
0.34
References
5
Authors
3
Name            Order   Citations   PageRank
Minghua Nuo     1       11          4.22
Congjun Long    2       8           4.67
Huidan Liu      3       16          5.09