Title
Further meta-evaluation of broad-coverage surface realization
Abstract
We present the first evaluation of the utility of automatic evaluation metrics on surface realizations of Penn Treebank data. Using outputs of the OpenCCG and XLE realizers, along with ranked WordNet synonym substitutions, we collected a corpus of generated surface realizations. These outputs were then rated and post-edited by human annotators. We evaluated the realizations using seven automatic metrics, and analyzed correlations obtained between the human judgments and the automatic scores. In contrast to previous NLG meta-evaluations, we find that several of the metrics correlate moderately well with human judgments of both adequacy and fluency, with the TER family performing best overall. We also find that all of the metrics correctly predict more than half of the significant system-level differences, though none are correct in all cases. We conclude with a discussion of the implications for the utility of such metrics in evaluating generation in the presence of variation. A further result of our research is a corpus of post-edited realizations, which will be made available to the research community.
Year
2010
Venue
EMNLP
Keywords
research community, surface realization, TER family, automatic score, automatic metrics, post-edited realization, automatic evaluation metrics, broad-coverage surface realization, human annotators, human judgment, Penn Treebank data
Field
Ranking, Computer science, Fluency, Artificial intelligence, Natural language processing, Treebank, WordNet, Machine learning
DocType
Conference
Volume
D10-1
Citations
8
PageRank
0.68
References
16
Authors
4
Name | Order | Citations | PageRank
Dominic Espinosa | 1 | 71 | 3.71
Rajakrishnan Rajkumar | 2 | 94 | 6.72
Michael White | 3 | 101 | 7.24
Shoshana Berleant | 4 | 8 | 0.68