Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. - Citegraph

Paper Info

Title
Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages.

Abstract
The Leipzig Corpora Collection offers free online access to 136 monolingual dictionaries enriched with statistical information. In this paper we describe current advances of the project in collecting and processing text data automatically for a large number of languages. Our main interest lies in languages of "low density", for which only little text data exists online. The aim of this approach is to create monolingual dictionaries and statistical information for a large number of new languages and to expand the existing dictionaries, opening up new possibilities for linguistic typology and other research. The focus of this paper is the infrastructure for the automatic acquisition of large amounts of monolingual text in many languages from various sources. Preliminary results of the text collection are presented. The largely language-independent framework for preprocessing, cleaning, and creating the corpora and for computing the necessary statistics is also described.

Year	Venue	Keywords
2012	LREC 2012 - EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION	corpus creation,text acquisition,minority languages
DocType	Citations	PageRank
Conference	4	0.52
References	Authors
2	3

Authors (3 rows)

Name	Order	Citations	PageRank
Dirk Goldhahn	1	11	5.22
Thomas Eckart	2	11	7.52
Uwe Quasthoff	3	195	26.62
