Repetition and language models and comparable corpora - Citegraph

Paper Info

Title
Repetition and language models and comparable corpora

Abstract
I will discuss a couple of non-standard features that I believe could be useful for working with comparable corpora. Dotplots have been used in biology to find interesting DNA sequences. Biology is interested in ordered matches, which show up as (possibly broken) diagonals in dotplots. Information retrieval is more interested in unordered matches (e.g., cosine similarity), which show up as squares in dotplots. Parallel corpora have both squares and diagonals multiplexed together. The diagonals tell us what is a translation of what, and the squares tell us what is in the same language. I would expect dotplots of comparable corpora to contain lots of diagonals and squares, though the diagonals would be shorter and more subtle in comparable corpora than in parallel corpora.
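
The squares and diagonals described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (dotplot, render, block_density) and the toy English and French token lists are hypothetical, chosen only so that the two patterns are visible. A dot is placed at (i, j) whenever token i equals token j, and the text is a toy "parallel corpus" formed by concatenating the two halves.

```python
def dotplot(tokens_x, tokens_y):
    """Return a 0/1 matrix with a dot at (i, j) when tokens_x[i] == tokens_y[j]."""
    return [[1 if a == b else 0 for b in tokens_y] for a in tokens_x]


def render(matrix):
    """Draw the matrix as text: '#' for a dot, '.' for an empty cell."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in matrix)


def block_density(matrix, rows, cols):
    """Fraction of cells that contain a dot inside the given block."""
    cells = [matrix[i][j] for i in rows for j in cols]
    return sum(cells) / len(cells)


# Hypothetical toy data: function words differ by language, while the
# name and the number are shared and appear in the same order.
english = ["the", "Paris", "the", "2011"]
french = ["le", "Paris", "le", "2011"]
text = english + french

matrix = dotplot(text, text)
n = len(english)

# Same-language block: denser, because function words repeat (a "square").
same_language = block_density(matrix, range(0, n), range(0, n))
# Cross-language block: sparse, with dots only where shared tokens line up
# in order (a short, broken "diagonal").
cross_language = block_density(matrix, range(0, n), range(n, 2 * n))

print(render(matrix))
print(same_language, cross_language)
```

In this toy case the same-language block is denser than the cross-language block, and the only cross-language dots fall on a single offset diagonal where the shared tokens occur in the same order. Real corpora would need far more text, and typically character n-grams rather than whole words, before such patterns become visible.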

Year	Venue	Keywords
2011	BUCC@ACL/IJCNLP	comparable corpus,language model,interesting dna sequence,cosine similarity,unordered match,parallel corpus,information retrieval,non-standard feature
Field	DocType	Citations
Cosine similarity,Parallel corpora,Speech recognition,Artificial intelligence,Natural language processing,Language model,Mathematics	Conference	0
PageRank	References	Authors
0.34	0	1

Authors (1 row)

Name	Order	Citations	PageRank
Ken Church	1	37	3.74

Cited by (0 rows)

References (0 rows)
