Abstract |
---|
I will discuss a couple of non-standard features that I believe could be useful for working with comparable corpora. Dotplots have been used in biology to find interesting DNA sequences. Biology is interested in ordered matches, which show up as (possibly broken) diagonals in dotplots. Information Retrieval is more interested in unordered matches (e.g., cosine similarity), which show up as squares in dotplots. Parallel corpora have both squares and diagonals multiplexed together. The diagonals tell us what is a translation of what, and the squares tell us what is in the same language. I would expect dotplots of comparable corpora to contain lots of diagonals and squares, though the diagonals would be shorter and more subtle in comparable corpora than in parallel corpora. |
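
The dotplot idea in the abstract can be illustrated with a minimal sketch. This is not the author's implementation: the function names (`dotplot`, `longest_diagonal`, `render`) and the exact-token matching rule are assumptions chosen for illustration. A dot is placed at cell (i, j) whenever token i of one sequence equals token j of the other, so an ordered match appears as a run of dots along a diagonal.

```python
from collections import defaultdict


def dotplot(tokens_x, tokens_y):
    """Return the set of (i, j) cells where tokens_x[i] == tokens_y[j]."""
    # Index the positions of each token in tokens_y so we avoid
    # comparing every pair of positions explicitly.
    positions = defaultdict(list)
    for j, tok in enumerate(tokens_y):
        positions[tok].append(j)
    return {
        (i, j)
        for i, tok in enumerate(tokens_x)
        for j in positions.get(tok, [])
    }


def longest_diagonal(dots):
    """Length of the longest unbroken run of dots along a diagonal."""
    best = 0
    for (i, j) in dots:
        if (i - 1, j - 1) in dots:
            continue  # this cell continues a run that started earlier
        length = 1
        while (i + length, j + length) in dots:
            length += 1
        best = max(best, length)
    return best


def render(dots, n_rows, n_cols):
    """Draw the dotplot as text: '#' for a match, '.' otherwise."""
    return "\n".join(
        "".join("#" if (i, j) in dots else "." for j in range(n_cols))
        for i in range(n_rows)
    )


if __name__ == "__main__":
    x = "the cat sat".split()
    y = "on the cat sat down".split()
    dots = dotplot(x, y)
    print(render(dots, len(x), len(y)))
    print("longest diagonal:", longest_diagonal(dots))
```

In this toy case the three shared words occur in the same order, so they form a single diagonal of length 3. Shuffling the words of one sequence would keep the same number of dots but break up the diagonal, which is the ordered versus unordered distinction the abstract draws.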
Year | Venue | Keywords |
---|---|---|
2011 | BUCC@ACL/IJCNLP | comparable corpus, language model, interesting DNA sequence, cosine similarity, unordered match, parallel corpus, information retrieval, non-standard feature |
Field | DocType | Citations |
---|---|---|
Cosine similarity, Parallel corpora, Speech recognition, Artificial intelligence, Natural language processing, Language model, Mathematics | Conference | 0 |
PageRank | References | Authors |
---|---|---|
0.34 | 0 | 1 |
Name | Order | Citations | PageRank |
---|---|---|---|
Ken Church | 1 | 37 | 3.74 |