Title
Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages.
Abstract
The Leipzig Corpora Collection offers free online access to 136 monolingual dictionaries enriched with statistical information. In this paper we describe current advances of the project in collecting and processing text data automatically for a large number of languages. Our main interest lies in "low-density" languages, for which only little text data exists online. The aim of this approach is to create monolingual dictionaries and statistical information for a large number of new languages and to expand the existing dictionaries, opening up new possibilities for linguistic typology and other research. The paper focuses on the infrastructure for the automatic acquisition of large amounts of monolingual text in many languages from various sources. Preliminary results of the text collection are presented. The largely language-independent framework for preprocessing, cleaning, and creating the corpora and for computing the necessary statistics is also described.
Year
2012
Venue
LREC 2012 - EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION
Keywords
corpus creation, text acquisition, minority languages
DocType
Conference
Citations
4
PageRank
0.52
References
2
Authors
3
Name             Order   Citations   PageRank
Dirk Goldhahn    1       11          5.22
Thomas Eckart    2       11          7.52
Uwe Quasthoff    3       195         26.62