Abstract | ||
---|---|---|
We propose a method for building a spoken-language text corpus for a spoken-language system. Conventional methods of building a new corpus include transcribing recorded conversations, collecting text from existing documents, and writing original texts. However, these approaches often run into difficulties, such as insufficient corpus size and low cost-effectiveness, when preparing text data in the applied system's domain. To address these issues, we have developed a method that uses "germ dialogs": short scripted dialogs that writers can continue or replace in a logical sequence that sounds natural. This allows the corpus to be expanded in a cost-effective manner. Our results show that the proposed method yields text with a higher degree of adequacy for the system's domain than conventional methods. The text data collected with the proposed method are used to generate a language model for our speech translation system between English and Japanese. |
Year | Venue | Keywords |
---|---|---|
2005 | EAMT | language model, cost effectiveness, data collection |
DocType | Citations | PageRank |
---|---|---|
Conference | 0 | 0.34 |
References | Authors | |
---|---|---|
2 | 5 | |
Name | Order | Citations | PageRank |
---|---|---|---|
Naoki Asanoma | 1 | 0 | 0.68 |
Setsuo Yamada | 2 | 63 | 15.78 |
Osamu Furuse | 3 | 171 | 31.55 |
Masahiro Oku | 4 | 32 | 5.91 |
NTT Cyber | 5 | 6 | 1.62 |