Abstract
---
In the automatic evaluation of generative question answering (GenQA) systems, it is difficult to assess the correctness of generated answers because of their free-form nature. In particular, widely used n-gram similarity metrics often fail to discriminate incorrect answers because they weight all tokens equally. To alleviate this problem, we propose the KPQA metric, a new metric for evaluating the correctness of GenQA. Specifically, our new metric assigns a different weight to each token via keyphrase prediction, thereby judging whether a generated answer sentence captures the key meaning of the reference answer. To evaluate our metric, we create high-quality human judgments of correctness on two GenQA datasets. Using these human-evaluation datasets, we show that our proposed metric has a significantly higher correlation with human judgments than existing metrics across various datasets. Code for the KPQA metric will be available at https://github.com/hwanheelee1993/KPQA.
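To make the core idea concrete, below is a minimal sketch of importance-weighted token overlap; it is not the authors' released implementation. The `weights` dictionary stands in for the output of a trained keyphrase-prediction model, and the `weighted_f1` helper, the default weight of 0.1, and the toy answers are all illustrative assumptions.

```python
# Sketch: score a generated answer by token overlap with the reference,
# where each token contributes its predicted importance rather than
# counting equally (the failure mode of plain n-gram metrics).

from collections import Counter

def weighted_f1(generated, reference, weights):
    """F1 over token overlap, with each token weighted by importance.

    `weights` maps a token to its importance; in KPQA such weights come
    from a keyphrase-prediction model, but here they are hand-supplied.
    """
    gen = generated.lower().split()
    ref = reference.lower().split()
    overlap = Counter(gen) & Counter(ref)  # multiset intersection

    def mass(tokens):
        # 0.1 is an assumed default weight for non-keyphrase tokens.
        return sum(weights.get(t, 0.1) for t in tokens)

    overlap_mass = sum(weights.get(t, 0.1) * n for t, n in overlap.items())
    precision = overlap_mass / mass(gen) if gen else 0.0
    recall = overlap_mass / mass(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: "paris" is the keyphrase, so an answer missing it scores
# low even though it shares many filler tokens with the reference.
w = {"paris": 1.0, "capital": 0.5}
print(weighted_f1("the capital of france is paris", "paris is the capital", w))   # ~0.94
print(weighted_f1("the capital of france is london", "paris is the capital", w))  # ~0.52
```

In this toy run, the answer that misses the keyphrase "paris" drops far more under the weighted score than it would under uniform token-overlap F1 (which would give it 0.6), illustrating the discriminative behavior the metric targets.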
Year | Venue | DocType |
---|---|---
2021 | NAACL-HLT | Conference |
Citations | PageRank | References
---|---|---
0 | 0.34 | 0
Authors (7)
---
Name | Order | Citations | PageRank |
---|---|---|---
Hwanhee Lee | 1 | 0 | 0.34 |
Seung-hyun Yoon | 2 | 160 | 26.47 |
Franck Dernoncourt | 3 | 149 | 35.39 |
Doo Soon Kim | 4 | 12 | 2.05 |
Trung H. Bui | 5 | 86 | 21.88 |
Joongbo Shin | 6 | 10 | 2.58 |
Kyomin Jung | 7 | 394 | 37.38 |