Title
Repetition and language models and comparable corpora
Abstract
I will discuss a couple of non-standard features that I believe could be useful for working with comparable corpora. Dotplots have been used in biology to find interesting DNA sequences. Biology is interested in ordered matches, which show up as (possibly broken) diagonals in dotplots. Information Retrieval is more interested in unordered matches (e.g., cosine similarity), which show up as squares in dotplots. Parallel corpora have both squares and diagonals multiplexed together. The diagonals tell us what is a translation of what, and the squares tell us what is in the same language. I would expect dotplots of comparable corpora to contain lots of diagonals and squares, though the diagonals would be shorter and more subtle in comparable corpora than in parallel corpora.
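The dotplot idea in the abstract can be illustrated with a minimal sketch. This is not the author's implementation: the function names `dotplot`, `longest_diagonal`, and `density` are hypothetical, and exact token equality is assumed as the matching criterion. A cell (i, j) is set when token i of one sequence equals token j of the other. Ordered matches then appear as runs along a diagonal, while unordered matches only raise the overall density of a block.

```python
def dotplot(tokens_a, tokens_b):
    """Binary matrix: cell (i, j) is 1 when tokens_a[i] == tokens_b[j]."""
    return [[1 if a == b else 0 for b in tokens_b] for a in tokens_a]


def longest_diagonal(matrix):
    """Length of the longest unbroken run of 1s along any
    top-left-to-bottom-right diagonal (an ordered match)."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    run = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(rows):
        for j in range(cols):
            if matrix[i][j]:
                # Extend the run that ends at the upper-left neighbour, if any.
                prev = run[i - 1][j - 1] if i > 0 and j > 0 else 0
                run[i][j] = prev + 1
                best = max(best, run[i][j])
    return best


def density(matrix):
    """Fraction of filled cells (an order-insensitive match score)."""
    total = sum(len(row) for row in matrix)
    return sum(sum(row) for row in matrix) / total if total else 0.0


reference = "the cat sat".split()
same_order = "the cat sat".split()
reversed_order = "sat cat the".split()

ordered = dotplot(same_order, reference)
unordered = dotplot(reversed_order, reference)

# Same vocabulary in both cases, so the block density is identical,
# but only the same-order pair produces a long diagonal.
print(longest_diagonal(ordered), density(ordered))
print(longest_diagonal(unordered), density(unordered))
```

In this toy case both comparisons fill the same fraction of cells, which is the order-insensitive signal, while the diagonal length separates the ordered match from the reordered one.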
Year
2011
Venue
BUCC@ACL/IJCNLP
Keywords
comparable corpus, language model, interesting DNA sequence, cosine similarity, unordered match, parallel corpus, information retrieval, non-standard feature
Field
Cosine similarity, Parallel corpora, Speech recognition, Artificial intelligence, Natural language processing, Language model, Mathematics
DocType
Conference
Citations
0
PageRank
0.34
References
0
Authors
1
Name, Order, Citations, PageRank
Ken Church, 1, 37, 3.74