Abstract |
---|
We present a new major release of the OpenSubtitles collection of parallel corpora. The release is compiled from a large database of movie and TV subtitles and includes a total of 1689 bitexts spanning 2.6 billion sentences across 60 languages. The release also incorporates a number of enhancements in the preprocessing and alignment of the subtitles, such as the automatic correction of OCR errors and the use of meta-data to estimate the quality of each subtitle and score subtitle pairs. |
Year | Venue | Keywords |
---|---|---|
2016 | LREC 2016 - Tenth International Conference on Language Resources and Evaluation | Parallel corpora, Bitext alignment, Statistical Machine Translation |

Field | DocType | Citations |
---|---|---|
Computer science, Parallel corpora, Speech recognition, Subtitle, Preprocessor, Natural language processing, Artificial intelligence | Conference | 43 |

PageRank | References | Authors |
---|---|---|
1.51 | 6 | 2 |

Name | Order | Citations | PageRank |
---|---|---|---|
Pierre Lison | 1 | 146 | 12.35 |
Jörg Tiedemann | 2 | 744 | 69.87 |