Title
Making Monolingual Corpora Comparable: a Case Study of Bulgarian and Croatian.
Abstract
This paper describes the first steps towards the creation of a Bulgarian-Croatian comparable corpus. It is based on two newspaper subcorpora drawn from larger reference corpora of Bulgarian and Croatian. Initially we rely on more extralinguistically oriented but methodologically cleaner parameters of similarity, such as specific topics, a predefined time span, and data size. The idea of 'light' and 'hard' comparable corpora is introduced. At this stage we aim at producing a 'light' bilingual comparable corpus. The algorithm for identifying lexical similarity and aligning linguistic units is presented, and the initial experiments are outlined.
Year
2004
Venue
LREC
Field
Lexical similarity, Bulgarian, Computer science, Speech recognition, Newspaper, Natural language processing, Artificial intelligence, Corpus linguistics, Croatian, Linguistics
DocType
Conference
Citations
5
PageRank
0.49
References
2
Authors
4
Name             Order   Citations   PageRank
Bozo Bekavac     1       13          4.26
Petya Osenova    2       360         46.00
Kiril Simov      3       139         29.75
Marko Tadić      4       80          15.61