Automatic evaluations are language-independent. Although they provide practical measures of linguistic quality, they say nothing about content. Assessments based on human judgement are therefore complementary.
In natural language generation (NLG), several metrics address the difficulty of comparing a source text (written by a human being) with a target text (produced by software). BLEU, ROUGE, METEOR, NIST and WER assign a score by comparing the word sequences (N-grams) of the source and target texts and their frequencies.
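To make the N-gram comparison concrete, here is a minimal Python sketch (the helper names ngrams and ngram_overlap are illustrative, not taken from any of the metrics' reference implementations) of how two tokenized texts can be broken into N-grams and their overlap counted:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all N-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_overlap(candidate, reference, n):
    """Count how many N-grams of the candidate also appear in the reference,
    clipping each N-gram count to its frequency in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    return sum(min(count, ref[gram]) for gram, count in cand.items())

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()
# Matching bigrams: ('the', 'cat'), ('on', 'the'), ('the', 'mat') -> 3
print(ngram_overlap(candidate, reference, 2))
```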
The automatic evaluation metrics
BLEU (Bilingual Evaluation Understudy)
BLEU gives equal weight to all N-grams. When the score reaches 1, the N-grams of the source and target texts match exactly. This metric was developed by IBM and is commonly used in machine translation.
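The following is a simplified sentence-level BLEU sketch, assuming a single reference and the usual choice of N-grams up to length 4; it implements clipped N-gram precision with a brevity penalty, not the official multi-reference formulation:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU sketch: geometric mean of clipped N-gram
    precisions for n = 1..max_n, multiplied by a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        matches = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        precisions.append(matches / total)
    if min(precisions) == 0:  # geometric mean is 0 if any precision is 0
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: penalize candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * geo_mean

reference = "the cat is on the mat".split()
candidate = "the cat is on the mat".split()
print(bleu(candidate, reference))  # identical texts -> 1.0
```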
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
ROUGE is recall-oriented: it rewards the proportion of the reference's N-grams that are recovered in the target text. It is a family of metrics; the most common, ROUGE-N, computes the proportion of N-grams of length N from a reference text that also appear in the generated text. Its variants correspond to different methods of computation (ROUGE-S, ROUGE-L, ROUGE-W, ROUGE-2 and ROUGE-SU). ROUGE is commonly used for evaluating automatic text summarization.
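A minimal ROUGE-N sketch, assuming a single reference: the score is the proportion of the reference's N-grams that are recovered in the candidate (an N-gram recall):

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=2):
    """ROUGE-N sketch: proportion of the reference's N-grams that also
    appear in the candidate (an N-gram recall)."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    matches = sum(min(count, cand[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return matches / total if total else 0.0

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()
print(rouge_n(candidate, reference, n=1))  # 5 of the 6 reference unigrams are recalled
```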
NIST (National Institute of Standards and Technology)
NIST is an adaptation of BLEU. While BLEU gives equal weight to all N-grams, NIST gives more weight to the less frequent, and therefore more informative, N-grams. NIST has been reported to correlate better with human judgements than BLEU.
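A hedged sketch of the NIST idea, considerably simplified from the official definition (single reference, no brevity penalty): each matched N-gram is weighted by an information value derived from reference counts, so rare N-grams contribute more than frequent ones:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def nist_sketch(candidate, reference, max_n=5):
    """Simplified NIST-style score: matched N-grams are weighted by an
    information value computed from the reference, so rare N-grams count
    for more than frequent ones."""
    # Pre-compute reference N-gram counts for every order we need.
    ref = {n: ngram_counts(reference, n) for n in range(1, max_n + 1)}
    total_words = max(len(reference), 1)

    def info(gram):
        # info(w1..wn) = log2(count(w1..wn-1) / count(w1..wn))
        n = len(gram)
        denom = ref[n][gram]
        numer = total_words if n == 1 else ref[n - 1][gram[:-1]]
        return math.log2(numer / denom) if denom else 0.0

    score = 0.0
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        matched = (g for g in cand if g in ref[n])
        score += sum(info(g) for g in matched) / max(sum(cand.values()), 1)
    return score

reference = "the black cat sat on the mat".split()
candidate = "the black cat sat on the mat".split()
print(nist_sketch(candidate, reference))
```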
METEOR (Metric for Evaluation of Translation with Explicit Ordering)
METEOR combines a recall rate (frequency) and a precision rate (relevance) in its formula. It is based on explicit word-to-word matches between the source and target texts, where a match can be the exact word or a morphological variant of the word.
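A minimal METEOR-style sketch, assuming exact-word matches only (the real metric also matches stems and synonyms and adds a fragmentation penalty): unigram precision and recall are combined into a harmonic mean that weights recall more heavily:

```python
from collections import Counter

def meteor_sketch(candidate, reference, alpha=0.9):
    """Simplified METEOR-style score: harmonic mean of unigram precision
    and recall, with recall weighted more heavily (no stemming, no synonym
    matching, no fragmentation penalty)."""
    cand, ref = Counter(candidate), Counter(reference)
    matches = sum(min(count, ref[word]) for word, count in cand.items())
    if matches == 0:
        return 0.0
    precision = matches / len(candidate)
    recall = matches / len(reference)
    # Weighted harmonic mean; alpha close to 1 makes recall dominate.
    return precision * recall / (alpha * precision + (1 - alpha) * recall)

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()
print(round(meteor_sketch(candidate, reference), 3))
```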
WER (Word Error Rate)
WER is a word-level edit distance: it counts the substitutions, insertions and deletions needed to turn the target text into the reference, based on explicit correspondence (exact word or morphological variant), and divides by the length of the reference. This metric is commonly used in the field of speech recognition.
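A short WER sketch: the word-level Levenshtein (edit) distance between candidate and reference, divided by the reference length:

```python
def wer(candidate, reference):
    """Word Error Rate sketch: word-level Levenshtein distance
    (substitutions + insertions + deletions) divided by the number of
    words in the reference."""
    # Dynamic-programming edit-distance table, one row at a time.
    previous = list(range(len(candidate) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, cand_word in enumerate(candidate, start=1):
            cost = 0 if ref_word == cand_word else 1
            current.append(min(
                previous[j] + 1,         # word missing from the candidate (deletion)
                current[j - 1] + 1,      # extra word in the candidate (insertion)
                previous[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        previous = current
    return previous[-1] / max(len(reference), 1)

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()
print(round(wer(candidate, reference), 3))  # one substitution out of six words -> 0.167
```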
References
- Agarwal Abhaya and Lavie Alon. METEOR, M-BLEU and M-TER: evaluation metrics for high-correlation with human rankings of machine translation output. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 115–118. Association for Computational Linguistics, 2008.
- Banerjee Satanjeev and Lavie Alon. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, 2005.
- Belz Anja and Reiter Ehud. Comparing automatic and human evaluation of NLG systems. In EACL, 2006.
- Belz Anja and Reiter Ehud. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4):529–558, 2009.
- Lin Chin-Yew. ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81, 2004.
- Dale Robert and White Michael. Shared tasks and comparative evaluation in natural language generation, 2012.
- Morris A. Cameron, Maier Viktoria, and Green Phil. From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition. In INTERSPEECH, 2004.
- Tomás Jesús, Mas Josep Àngel, and Casacuberta Francisco. A quantitative method for machine translation evaluation. In Proceedings of the EACL 2003 Workshop on Evaluation Initiatives in Natural Language Processing: are evaluation methods, metrics and resources reusable?, pages 27–34. Association for Computational Linguistics, 2003.
Readability scores and edit distance
- Flesch Rudolph. A new readability yardstick. Journal of Applied Psychology, 32(3):221, 1948.
- Kincaid J. Peter, Fishburne Jr Robert P., Rogers Richard L., and Chissom Brad S. Derivation of new readability formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for navy enlisted personnel. Technical report, DTIC Document, 1975.
- Li Yujian and Liu Bo. A normalized Levenshtein distance metric. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1091–1095, 2007.