
Abstract
The evaluation of generated text has received renewed attention in recent years with the introduction of model-based metrics. These metrics correlate more strongly with human judgments and seemingly overcome many issues of earlier n-gram-based metrics from the symbolic age. In this work, we examine the recently introduced metrics BERTScore, BLEURT, NUBIA, MoverScore, and Mark-Evaluate (Petersen). We investigate their sensitivity to different types of semantic deterioration (part-of-speech drop and negation), word order perturbations, word drop, and the common problem of repetition. No metric showed appropriate behaviour under negation, and none of them was consistently sensitive to the other issues listed above.
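To illustrate the kind of perturbation probe described here, the sketch below scores a paraphrase and a negated variant of a reference sentence with BERTScore. This is a minimal illustration, not the authors' experimental setup: it assumes the `bert-score` Python package, and the example sentences are invented.

```python
# Minimal sketch of a negation-sensitivity probe for a model-based metric.
# Assumes the `bert-score` package (pip install bert-score); sentences are
# hypothetical examples, not taken from the paper's test data.
from bert_score import score

reference = ["The hotel room was clean and quiet."]
paraphrase = ["The room in the hotel was quiet and clean."]
negated = ["The hotel room was not clean and not quiet."]

# BERTScore returns per-pair precision, recall, and F1 tensors.
_, _, f1_para = score(paraphrase, reference, lang="en")
_, _, f1_neg = score(negated, reference, lang="en")

print(f"F1 (paraphrase): {f1_para.item():.4f}")
print(f"F1 (negation):   {f1_neg.item():.4f}")
```

A metric that handles negation appropriately should score the negated candidate markedly lower than the paraphrase; the finding reported above is that none of the examined metrics behaves this way.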
| Item Type: | Conference or Workshop Item (Poster) |
|---|---|
| Faculties: | Mathematics, Computer Science and Statistics > Statistics; Mathematics, Computer Science and Statistics > Statistics > Chairs/Working Groups > Methods for Missing Data, Model Selection and Model Averaging |
| Subjects: | 500 Science > 510 Mathematics |
| URN: | urn:nbn:de:bvb:19-epub-92533-3 |
| Place of Publication: | Dublin, Ireland |
| Item ID: | 92533 |
| Date Deposited: | 01. Jul 2022, 09:41 |
| Last Modified: | 01. Jul 2022, 09:41 |