Title
Bias And Statistical Significance In Evaluating Speech Synthesis With Mean Opinion Scores
Abstract
Listening tests and Mean Opinion Scores (MOS) are the most commonly used techniques for the evaluation of speech synthesis quality and naturalness. These are invaluable in the assessment of subjective qualities of machine-generated stimuli. However, there are a number of challenges in understanding the MOS scores that come out of listening tests. Primarily, we advocate for the use of non-parametric statistical tests in the calculation of statistical significance when comparing listening test results. Additionally, based on the results of 46 legacy listening tests, we measure the impact of two sources of bias. Bias introduced by individual participants and synthesized text can have a dramatic impact on observed MOS scores. For example, we find that on average the mean difference between the highest and lowest scoring rater is over 2 MOS points (on a 5-point scale). From this observation, we caution against using any statistical test without adjusting for this bias, and provide specific non-parametric recommendations.
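As a rough illustration of the kind of analysis the abstract describes, the sketch below measures the spread of per-rater mean scores and then compares two systems with an exact sign test on within-rater differences. Pairing within each rater cancels a constant rater offset. The abstract does not name the specific tests the paper recommends, so the sign test is only one common non-parametric choice, swapped in here for illustration. The ratings, the rater and system names, and all function names are hypothetical.

```python
from collections import defaultdict
from math import comb
from statistics import mean


def rater_means(ratings):
    """Mean score given by each rater across everything they rated."""
    by_rater = defaultdict(list)
    for rater, _system, score in ratings:
        by_rater[rater].append(score)
    return {rater: mean(scores) for rater, scores in by_rater.items()}


def paired_differences(ratings, sys_a, sys_b):
    """Per-rater mean(sys_a) - mean(sys_b); a constant rater offset cancels."""
    scores = defaultdict(lambda: defaultdict(list))
    for rater, system, score in ratings:
        scores[rater][system].append(score)
    # Only raters who scored both systems contribute a paired difference.
    return [mean(s[sys_a]) - mean(s[sys_b])
            for s in scores.values() if s[sys_a] and s[sys_b]]


def sign_test(diffs):
    """Exact two-sided sign test p-value; zero differences are dropped."""
    pos = sum(d > 0 for d in diffs)
    neg = sum(d < 0 for d in diffs)
    n = pos + neg
    if n == 0:
        return 1.0
    k = min(pos, neg)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical toy data: (rater, system, score on a 1-5 scale).
# Raters r2/r5 score high and r3/r6 score low regardless of system.
ratings = [
    ("r1", "A", 4), ("r1", "B", 3),
    ("r2", "A", 5), ("r2", "B", 4),
    ("r3", "A", 3), ("r3", "B", 2),
    ("r4", "A", 4), ("r4", "B", 3),
    ("r5", "A", 5), ("r5", "B", 4),
    ("r6", "A", 3), ("r6", "B", 2),
]

means = rater_means(ratings)
rater_spread = max(means.values()) - min(means.values())
diffs = paired_differences(ratings, "A", "B")
p_value = sign_test(diffs)

print(f"spread of rater means: {rater_spread}")
print(f"sign test p-value: {p_value}")
```

In this toy data every rater prefers system A by one point, yet the raters' own means differ by two points, which is why the comparison is made within raters and not on pooled scores.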
Year
2017
DOI
10.21437/Interspeech.2017-479
Venue
18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION
Keywords
speech synthesis, listening tests, mean opinion score
Field
Speech synthesis, Pattern recognition, Computer science, Speech recognition, Artificial intelligence, Statistical significance
DocType
Conference
ISSN
2308-457X
Citations
4
PageRank
0.39
References
0
Authors
2
Name
Order
Citations
PageRank
Andrew Rosenberg1122.53
Bhuvana Ramabhadran21779153.83