Title
Text segmentation by language using minimum description length
Abstract
The problem addressed in this paper is to segment a given multilingual document into segments for each language and then identify the language of each segment. The problem was motivated by an attempt to collect a large amount of linguistic data for non-major languages from the web. The problem is formulated in terms of obtaining the minimum description length of a text, and the proposed solution finds the segments and their languages through dynamic programming. Empirical results demonstrating the potential of this approach are presented for experiments using texts taken from the Universal Declaration of Human Rights and Wikipedia, covering more than 200 languages.
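The formulation in the abstract can be illustrated with a minimal sketch. This record does not describe the paper's actual coding scheme, so everything below is an assumption made for illustration: the character unigram models, the add-one smoothing, the fixed per-segment overhead `overhead_bits`, and the function names `train_unigram` and `segment` are all hypothetical. The sketch only shows the general idea of choosing segment boundaries and language labels that minimise the total number of bits needed to encode the text.

```python
import math
from collections import Counter


def train_unigram(sample, alphabet):
    """Return a code length in bits for each character (add-one smoothed).

    Assumption: a character unigram model stands in for whatever
    language model the paper actually uses.
    """
    counts = Counter(sample)
    total = len(sample) + len(alphabet)
    return {c: -math.log2((counts[c] + 1) / total) for c in alphabet}


def segment(text, models, overhead_bits):
    """Find the segmentation and labeling with minimum description length.

    models: {language: {char: bits}}; every character of `text` must
    appear in each model. overhead_bits is an assumed fixed cost for
    describing one segment (its boundary and its language label).
    """
    n = len(text)
    langs = list(models)
    # prefix[l][i] = bits needed to encode text[:i] under language l
    prefix = {l: [0.0] * (n + 1) for l in langs}
    for l in langs:
        for i, ch in enumerate(text):
            prefix[l][i + 1] = prefix[l][i] + models[l][ch]

    # best[j] = minimum description length of text[:j]
    best = [0.0] + [math.inf] * n
    back = [None] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            for l in langs:
                cost = best[i] + overhead_bits + prefix[l][j] - prefix[l][i]
                if cost < best[j]:
                    best[j] = cost
                    back[j] = (i, l)

    # Follow the back-pointers to recover (language, segment) pairs.
    segments = []
    j = n
    while j > 0:
        i, l = back[j]
        segments.append((l, text[i:j]))
        j = i
    return segments[::-1]
```

With two toy "languages" trained on `"abababab"` and `"xyxyxyxy"`, calling `segment("ababxyxy", models, 2.0)` splits the text at the point where switching models saves more bits than the extra segment costs. This quadratic search is only practical for short texts.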
Year
2012
Venue
ACL
Keywords
text segmentation, empirical result, universal declaration, non-major language, proposed solution, multilingual document, minimum description length, large amount, dynamic programming, linguistic data, human rights
Field
Declaration, Dynamic programming, Computer science, Minimum description length, Human rights, Text segmentation, Natural language processing, Artificial intelligence
DocType
Conference
Volume
P12-1
Citations
8
PageRank
0.71
References
8
Authors
2
Name                   Order   Citations   PageRank
Hiroshi Yamaguchi      1       8           0.71
Kumiko Tanaka-Ishii    2       261         36.69