Tibetan Multi-word Expressions Identification Framework Based on News Corpora - Citegraph

Paper Info

Title
Tibetan Multi-word Expressions Identification Framework Based on News Corpora

Abstract
This paper presents a framework for identifying Tibetan multi-word expressions (MWEs). The framework has two phases. In the first phase, sentences are segmented and high-frequency word-based n-grams are extracted using Nagao's N-gram statistical algorithm and the Statistical Substring Reduction Algorithm. In the second phase, Tibetan MWEs are identified by combining context analysis with language model-based analysis. The framework uses three strategies: context analysis, two-word Coupling Degree, and Tibetan syllable inside word probability. In the experiments, we evaluate the effectiveness of the three strategies on a small test set and compare results for context analysis at different granularities. On the small test corpus, an F-score above 75% is achieved when words are segmented in pre-processing. On a larger corpus, P@N (N = 800) exceeds 85%, which indicates that the framework also works well on larger corpora. The experimental results show acceptable performance for Tibetan MWE identification.
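
The first phase and the P@N metric mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a plain frequency counter stands in for Nagao's algorithm, substring reduction is assumed to mean "drop an n-gram when a longer n-gram containing it has the same frequency", and all function names, thresholds, and the toy tokens are hypothetical.

```python
from collections import Counter


def count_ngrams(sentences, max_n=3, min_freq=2):
    """Count word n-grams (2 <= n <= max_n) over pre-segmented sentences.

    A plain Counter is used here for brevity; it is a stand-in for
    Nagao's N-gram statistics, not a reimplementation of it.
    """
    counts = Counter()
    for words in sentences:
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    # Keep only n-grams that reach the (assumed) frequency threshold.
    return {gram: c for gram, c in counts.items() if c >= min_freq}


def contains(longer, shorter):
    """Return True if `shorter` occurs as a contiguous run inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def reduce_substrings(freqs):
    """Drop n-grams covered by a longer n-gram with the same frequency.

    Quadratic in the number of candidates; fine for a sketch only.
    """
    kept = {}
    for gram, c in freqs.items():
        redundant = any(
            len(other) > len(gram) and oc == c and contains(other, gram)
            for other, oc in freqs.items()
        )
        if not redundant:
            kept[gram] = c
    return kept


def precision_at_n(ranked, gold, n):
    """P@N: fraction of the top-n ranked candidates that are in `gold`."""
    top = ranked[:n]
    if not top:
        return 0.0
    return sum(1 for cand in top if cand in gold) / len(top)


# Toy data with placeholder tokens (not real Tibetan words).
sentences = [["a", "b", "c"], ["a", "b", "c"], ["b", "c", "d"]]
candidates = reduce_substrings(count_ngrams(sentences))
```

In this toy input, the bigram ("a", "b") occurs exactly as often as the trigram ("a", "b", "c") that contains it, so it is dropped, while ("b", "c") is kept because it is more frequent than any longer n-gram containing it.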

Year	DOI	Venue
2016	10.1007/978-3-319-50496-4_2	Lecture Notes in Computer Science
Keywords	DocType	Volume
Tibetan Multi-word expression,Two-word coupling degree,Inside word probability	Conference	10102
ISSN	Citations	PageRank
0302-9743	0	0.34
References	Authors
5	3

Authors (3 rows)

Name	Order	Citations	PageRank
Minghua Nuo	1	11	4.22
Congjun Long	2	8	4.67
Huidan Liu	3	16	5.09
