Paper Info

Title
Using of heterogeneous corpora for training of an ASR system.

Abstract
The paper summarizes the development of the LVCSR system built as part of the Pashto speech-translation system at the SCALE (Summer Camp for Applied Language Exploration) 2015 workshop on speech-to-text translation for low-resource languages. Pashto was chosen as a good proxy for a low-resource language, exhibiting multiple phenomena that make the development of speech-recognition and speech-to-text-translation systems hard. Even when the amount of data is seemingly sufficient, the data originates from multiple sources, and preliminary experiments reveal that there is little to no benefit in merging (concatenating) the corpora; more elaborate ways of making use of all of the data must be worked out. This paper concentrates only on the LVCSR part and presents a range of techniques that were found to be useful for benefiting from multiple different corpora.

Year	2017
Venue	arXiv: Computation and Language
Field	Computer science, Artificial intelligence, Natural language processing, Concatenation, System development, Pashto, Merge (version control)
DocType	Journal
Volume	abs/1706.00321
Citations	0
PageRank	0.34
References	3
Authors	6

Authors (6 rows)

Name	Order	Citations	PageRank
Jan Trmal	1	235	20.91
Gaurav Kumar	2	82	5.49
Vimal Manohar	3	54	7.99
Sanjeev Khudanpur	4	2155	202.00
Matt Post	5	414	35.72
Paul McNamee	6	425	38.59
