Multi-language Speech Collection for NIST LRE.

Paper Info

Title
Multi-language Speech Collection for NIST LRE.

Abstract
The Multi-language Speech (MLS) Corpus supports NIST's Language Recognition Evaluation series by providing new conversational telephone speech and broadcast narrowband data in 20 languages/dialects. The corpus was built with the intention of testing system performance in the matter of distinguishing closely related or confusable linguistic varieties, and careful manual auditing of collected data was an important aspect of this work. This paper lists the specific data requirements for the collection and provides both a commentary on the rationale for those requirements as well as an outline of the various steps taken to ensure all goals were met as specified. LDC conducted a large-scale recruitment effort involving the implementation of candidate assessment and interview techniques suitable for hiring a large contingent of telecommuting workers, and this recruitment effort is discussed in detail. We also describe the telephone and broadcast collection infrastructure and protocols, and provide details of the steps taken to pre-process collected data prior to auditing. Finally, annotation training, procedures and outcomes are presented in detail.

Year	Venue	Keywords
2016	LREC 2016 - Tenth International Conference on Language Resources and Evaluation	language recognition, speech, telephone, broadcast
Field	DocType	Citations
Computer science, Speech recognition, NIST, Natural language processing, Artificial intelligence, Multi language	Conference	0
PageRank	References	Authors
0.34	1	5

Authors (5 rows)

Name	Order	Citations	PageRank
Karen Sparck Jones	1	1158	363.97
Stephanie Strassel	2	512	58.41
Kevin Walker	3	65	21.51
David Graff	4	71	23.77
Jonathan Wright	5	5	2.24
