Collections including paper arxiv:2303.16634

Collection 1
- A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions (Paper • 2311.05232 • Published)
- Lynx: An Open Source Hallucination Evaluation Model (Paper • 2407.08488 • Published)
- RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models (Paper • 2401.00396 • Published • 3)
- MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents (Paper • 2404.10774 • Published • 2)

Collection 2
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena (Paper • 2306.05685 • Published • 29)
- Generative Judge for Evaluating Alignment (Paper • 2310.05470 • Published • 1)
- Humans or LLMs as the Judge? A Study on Judgement Biases (Paper • 2402.10669 • Published)
- JudgeLM: Fine-tuned Large Language Models are Scalable Judges (Paper • 2310.17631 • Published • 32)

Collection 3
- JudgeLM: Fine-tuned Large Language Models are Scalable Judges (Paper • 2310.17631 • Published • 32)
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena (Paper • 2306.05685 • Published • 29)
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (Paper • 2303.16634 • Published • 3)
- Prometheus: Inducing Fine-grained Evaluation Capability in Language Models (Paper • 2310.08491 • Published • 53)

Collection 4
- Can Large Language Models Be an Alternative to Human Evaluations? (Paper • 2305.01937 • Published • 2)
- Decontextualization: Making Sentences Stand-Alone (Paper • 2102.05169 • Published)
- RARR: Researching and Revising What Language Models Say, Using Language Models (Paper • 2210.08726 • Published • 1)
- SummEval: Re-evaluating Summarization Evaluation (Paper • 2007.12626 • Published)

Collection 5
- CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution (Paper • 2401.03065 • Published • 10)
- Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation (Paper • 2305.01210 • Published • 4)
- AGIBench: A Multi-granularity, Multimodal, Human-referenced, Auto-scoring Benchmark for Large Language Models (Paper • 2309.06495 • Published • 1)
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI (Paper • 2311.16502 • Published • 35)

Collection 6
- DSI++: Updating Transformer Memory with New Documents (Paper • 2212.09744 • Published • 1)
- Where to start? Analyzing the potential value of intermediate models (Paper • 2211.00107 • Published)
- INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback (Paper • 2305.14282 • Published)
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (Paper • 2303.16634 • Published • 3)

Collection 7
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (Paper • 2303.16634 • Published • 3)
- miracl/miracl-corpus (Viewer • Updated • 77.2M • 5.94k • 44)
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena (Paper • 2306.05685 • Published • 29)
- How is ChatGPT's behavior changing over time? (Paper • 2307.09009 • Published • 23)

Collection 8
- JudgeLM: Fine-tuned Large Language Models are Scalable Judges (Paper • 2310.17631 • Published • 32)
- AgentTuning: Enabling Generalized Agent Abilities for LLMs (Paper • 2310.12823 • Published • 35)
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (Paper • 2303.16634 • Published • 3)
- GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for Reasoning Problems (Paper • 2310.12397 • Published • 1)

Collection 9
- JudgeLM: Fine-tuned Large Language Models are Scalable Judges (Paper • 2310.17631 • Published • 32)
- Prometheus: Inducing Fine-grained Evaluation Capability in Language Models (Paper • 2310.08491 • Published • 53)
- Generative Judge for Evaluating Alignment (Paper • 2310.05470 • Published • 1)
- Calibrating LLM-Based Evaluator (Paper • 2309.13308 • Published • 11)