Post
Some of my results from experimenting with hallucination detection techniques for LLMs 🫨🔍
First, the two main ideas used in the experiments—using token probabilities and LLM-Eval scores—are taken from these three papers:
1. Looking for a Needle in a Haystack: A Comprehensive Study of Hallucinations in Neural Machine Translation (2208.05309)
2. SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models (2303.08896)
3. LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models (2305.13711)
In the first two, the authors claim that averaging the sentence-level token probabilities is the best heuristic for detecting hallucinations. My results point in the same direction: there is a weak positive correlation between the average token probability and the ground truth. 🤔
The nice thing about this method is that it is essentially free to implement: we only need the token probabilities the model already produces for the generated text.
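To make that concrete, here is a minimal sketch of how the average token probability of a generation could be computed with a Hugging Face transformers causal LM. The model name and prompt are just placeholders, not the setup from the papers:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal LM that returns generation scores works the same way.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=20,
        do_sample=False,
        output_scores=True,
        return_dict_in_generate=True,
    )

# Log-probabilities of the tokens that were actually generated.
transition_scores = model.compute_transition_scores(
    out.sequences, out.scores, normalize_logits=True
)
token_logprobs = transition_scores[0]

avg_prob = token_logprobs.exp().mean().item()  # average token probability
avg_logprob = token_logprobs.mean().item()     # or average log-probability

# Lower averages = the model was less confident = higher hallucination risk.
print(f"avg token prob: {avg_prob:.3f}, avg log-prob: {avg_logprob:.3f}")
```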
The third paper proposes an evaluation scheme where we make an extra call to an LLM and kindly ask it to rate, on a scale from 0 to 5, how good the generated text is across a set of criteria. 📝🤖
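Here is a rough sketch of what such an evaluation call could look like. The prompt wording, criteria, and model name are my own placeholders rather than the exact setup from the paper:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder criteria and prompt, loosely in the spirit of LLM-Eval.
EVAL_PROMPT = """You are evaluating the response below.
Rate it on a scale from 0 to 5 for each criterion:
appropriateness, content, grammar, relevance.
Answer only with a JSON object, e.g. {{"appropriateness": 4, "content": 3, "grammar": 5, "relevance": 4}}.

Context: {context}
Response: {response}"""

def llm_eval(context: str, response: str) -> dict:
    # One extra LLM call per prediction we want to score.
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": EVAL_PROMPT.format(context=context, response=response)}],
        temperature=0,
    )
    return json.loads(completion.choices[0].message.content)

scores = llm_eval(
    context="What is the capital of France?",
    response="The capital of France is Lyon.",
)
print(scores)  # e.g. {"appropriateness": 4, "content": 1, "grammar": 5, "relevance": 3}
```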
I was able to reproduce similar results to those in the paper. There is a moderate positive correlation between the ground truth scores and the ones produced by the LLM.
Of course, this method is much more expensive since we would need one extra call to the LLM for every prediction that we would like to evaluate, and it is also very sensitive to prompt engineering. 🤷