Abstract | ||
---|---|---|
Finding word boundaries in continuous speech is challenging as there is little or no equivalent of a 'space' delimiter between words. Popular Bayesian non-parametric models for text segmentation use a Dirichlet process to jointly segment sentences and build a lexicon of word types. We introduce DP-Parse, which uses similar principles but only relies on an instance lexicon of word tokens, avoiding the clustering errors that arise with a lexicon of word types. On the Zero Resource Speech Benchmark 2017, our model sets a new speech segmentation state-of-the-art in 5 languages. The algorithm monotonically improves with better input representations, achieving yet higher scores when fed with weakly supervised inputs. Despite lacking a type lexicon, DP-Parse can be pipelined to a language model and learn semantic and syntactic representations as assessed by a new spoken word embedding benchmark. |
Year | Venue | DocType |
---|---|---|
2022 | Transactions of the Association for Computational Linguistics | Journal |
Volume | ISSN | Citations |
10 | 2307-387X | 0 |
PageRank | References | Authors |
0.34 | 0 | 8 |
Name | Order | Citations | PageRank |
---|---|---|---|
Robin Algayres | 1 | 2 | 0.68 |
Tristan Ricoul | 2 | 0 | 0.34 |
Julien Karadayi | 3 | 0 | 0.34 |
Hugo Laurençon | 4 | 0 | 0.34 |
Salah Zaiem | 5 | 2 | 1.03 |
Abdelrahman Mohamed | 6 | 15 | 1.70 |
Benoît Sagot | 7 | 326 | 49.52 |
Emmanuel Dupoux | 8 | 238 | 37.33 |