Adding SummEdits to leaderboard?
Hey,
First of all, great initiative!
A bit of self-promotion here: I'd like to propose adding a relevant benchmark, SummEdits, to the leaderboard.
SummEdits is a benchmark we introduced at EMNLP. It frames hallucination detection specifically in the context of summarization and covers ten textual domains (all English).
Humans achieve very high performance on the benchmark, and there's still a gap between GPT-4 and humans. It's framed as a binary classification task, which makes it very easy to evaluate. In total, there are ~6,000 annotated samples, but they could be subsampled if needed.
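To give a rough idea of how simple the evaluation could be, here is a minimal sketch of scoring binary predictions and subsampling per domain. The field names (`domain`, `label`, `pred`), the toy records, and the helper functions are hypothetical and are not taken from the dataset schema or from any harness. Balanced accuracy is used here as one reasonable choice for a binary task, not as a confirmed leaderboard metric.

```python
import random
from collections import defaultdict


def balanced_accuracy(labels, preds):
    """Mean of per-class recall for binary labels (0 or 1)."""
    recalls = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        if idx:  # skip a class that does not occur in the labels
            recalls.append(sum(preds[i] == cls for i in idx) / len(idx))
    if not recalls:
        raise ValueError("no labelled samples to score")
    return sum(recalls) / len(recalls)


def subsample_per_domain(samples, n_per_domain, seed=0):
    """Keep at most n_per_domain samples from each domain (hypothetical helper)."""
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for s in samples:
        by_domain[s["domain"]].append(s)
    subset = []
    for domain in sorted(by_domain):
        group = by_domain[domain]
        subset.extend(rng.sample(group, min(n_per_domain, len(group))))
    return subset


# Toy records with assumed field names; real data would come from the dataset.
samples = [
    {"domain": "news", "label": 1, "pred": 1},
    {"domain": "news", "label": 0, "pred": 1},
    {"domain": "news", "label": 0, "pred": 0},
    {"domain": "legal", "label": 1, "pred": 0},
    {"domain": "legal", "label": 1, "pred": 1},
    {"domain": "legal", "label": 0, "pred": 0},
]

labels = [s["label"] for s in samples]
preds = [s["pred"] for s in samples]
score = balanced_accuracy(labels, preds)
subset = subsample_per_domain(samples, n_per_domain=2)
print(f"balanced accuracy: {score:.3f}, subsample size: {len(subset)}")
```

Subsampling per domain rather than globally keeps all domains represented in the smaller evaluation set.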
We've already put the data on HF here: https://huggingface.co/datasets/Salesforce/summedits
And if you guys are interested, I'm happy to help integrate if it'd be helpful (I imagine it'd be fairly easy).
Cheers,
Philippe
@philippelaban since you are at it, can you please add it to https://github.com/EdinburghNLP/awesome-hallucination-detection with a pull request? :)
> And if you guys are interested, I'm happy to help integrate if it'd be helpful (I imagine it'd be fairly easy).
Sure, we can also do it together in 15-30 min! If you can add it to https://github.com/EleutherAI/lm-evaluation-harness, adding it to the leaderboard will be immediate.