Can Large Language Models Be an Alternative to Human Evaluations? Paper • 2305.01937 • Published May 3, 2023
RARR: Researching and Revising What Language Models Say, Using Language Models Paper • 2210.08726 • Published Oct 17, 2022
QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization Paper • 2112.08542 • Published Dec 16, 2021
SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization Paper • 2111.09525 • Published Nov 18, 2021
ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks Paper • 2303.15056 • Published Mar 27, 2023
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment Paper • 2303.16634 • Published Mar 29, 2023
Large Language Models Are State-of-the-Art Evaluators of Translation Quality Paper • 2302.14520 • Published Feb 28, 2023
RAGAS: Automated Evaluation of Retrieval Augmented Generation Paper • 2309.15217 • Published Sep 26, 2023
Measuring Attribution in Natural Language Generation Models Paper • 2112.12870 • Published Dec 23, 2021
ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems Paper • 2311.09476 • Published Nov 16, 2023
L-Eval: Instituting Standardized Evaluation for Long Context Language Models Paper • 2307.11088 • Published Jul 20, 2023
FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation Paper • 2305.14251 • Published May 23, 2023
Leveraging Large Language Models for NLG Evaluation: A Survey Paper • 2401.07103 • Published Jan 13, 2024
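A common thread across several of these papers (e.g., G-Eval and the ChatGPT-annotation and translation-quality studies) is the LLM-as-judge pattern: prompt a strong model with an evaluation rubric and have it score an output directly. Below is a minimal sketch of that pattern, assuming the OpenAI v1 Python client; the model name, rubric wording, and 1-5 scale are illustrative assumptions, not the exact protocol of any paper above.

```python
# Minimal LLM-as-judge sketch. The rubric, scale, and model are
# illustrative assumptions; see the individual papers for their
# exact prompting and score-aggregation schemes.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge_coherence(source: str, summary: str, model: str = "gpt-4") -> int:
    """Ask an LLM to rate a summary's coherence on a 1-5 scale."""
    prompt = (
        "You will be given a source document and a summary.\n"
        "Rate the COHERENCE of the summary on a scale of 1 (incoherent) "
        "to 5 (fully coherent). Reply with the integer only.\n\n"
        f"Source:\n{source}\n\nSummary:\n{summary}\n\nScore:"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic scoring for reproducibility
    )
    return int(response.choices[0].message.content.strip())
```

In practice, single integer scores from one sample are coarse; G-Eval, for instance, reduces ties by weighting each possible score by its token probability, and the annotation studies average over multiple criteria and runs.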