Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
Abstract
Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from various LMs. However, concerns including transparency, controllability, and affordability strongly motivate the development of open-source LMs specialized in evaluations. On the other hand, existing open evaluator LMs exhibit critical shortcomings: 1) they issue scores that significantly diverge from those assigned by humans, and 2) they lack the flexibility to perform both direct assessment and pairwise ranking, the two most prevalent forms of assessment. Additionally, they do not possess the ability to evaluate based on custom evaluation criteria, focusing instead on general attributes like helpfulness and harmlessness. To address these issues, we introduce Prometheus 2, a more powerful evaluator LM than its predecessor that closely mirrors human and GPT-4 judgements. Moreover, it is capable of processing both direct assessment and pairwise ranking formats grouped with user-defined evaluation criteria. On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and agreement with humans and proprietary LM judges among all tested open evaluator LMs. Our models, code, and data are all publicly available at https://github.com/prometheus-eval/prometheus-eval.
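For concreteness, here is a hedged sketch of what the two evaluation formats look like when driven by a user-defined rubric. The prompt strings below are illustrative paraphrases, not the exact templates shipped with prometheus-eval; refer to the repository for the official ones.

```python
# Illustrative sketch (not the exact prometheus-eval templates): the two
# evaluation formats Prometheus 2 supports, both driven by a user-defined rubric.

CUSTOM_RUBRIC = """Is the response factually accurate and well supported?
Score 1: Mostly incorrect or unsupported claims.
Score 3: Partially correct, with notable gaps.
Score 5: Accurate, complete, and well supported."""

def direct_assessment_prompt(instruction: str, response: str, rubric: str) -> str:
    """Ask the evaluator to grade a single response on a 1-5 scale."""
    return (
        f"###Task: Evaluate the response against the rubric and give a score from 1 to 5.\n"
        f"###Instruction: {instruction}\n"
        f"###Response: {response}\n"
        f"###Rubric: {rubric}\n"
        f"###Feedback:"
    )

def pairwise_ranking_prompt(instruction: str, response_a: str, response_b: str, rubric: str) -> str:
    """Ask the evaluator to pick the better of two responses ('A' or 'B')."""
    return (
        f"###Task: Compare the two responses against the rubric and answer 'A' or 'B'.\n"
        f"###Instruction: {instruction}\n"
        f"###Response A: {response_a}\n"
        f"###Response B: {response_b}\n"
        f"###Rubric: {rubric}\n"
        f"###Feedback:"
    )
```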
Community
@AdinaY Thanks for your interest in our paper!
You can access the models here:
https://huggingface.co/prometheus-eval/prometheus-8x7b-v2.0
https://huggingface.co/prometheus-eval/prometheus-7b-v2.0
Here's the GitHub repo, where we've prepared (possibly) every functionality you might need:
https://github.com/prometheus-eval/prometheus-eval
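For a quick start, here is a minimal sketch of loading the 7B checkpoint with Hugging Face transformers; the generation settings are illustrative defaults rather than official recommendations, and the exact prompt format is documented on the model cards linked above.

```python
# Minimal sketch: load prometheus-7b-v2.0 with standard transformers APIs.
# Generation settings below are illustrative, not official recommendations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "prometheus-eval/prometheus-7b-v2.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "..."  # fill in with a direct-assessment or pairwise-ranking prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Decode only the newly generated tokens (the evaluator's feedback and verdict).
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```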
Hi @seungone et al, congrats on the paper and the release!
I was just wondering whether you experimented with multi-turn settings, e.g. critiquing the last assistant response(s) while using a whole conversation as input instead of a single instruction.
Also, since responses to a given instruction can be conditioned by the system prompt, did you consider adding a system prompt to the template, or did you run any ablations on that?
Thanks in advance!
Hey @alvarobartt , thanks for your interest!
We ran experiments using MT-Bench, which is a multi-turn chat-based benchmark.
All you have to do is place the whole interaction in the {instruction} placeholder of the template and insert the latest response in the {response} placeholder; see the sketch below.
Also, we appended the system prompt to the {instruction} placeholder as well. Please let us know your experience after using Prometheus 2 :)
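A minimal sketch of that formatting, assuming a direct-assessment template with {instruction} and {response} placeholders; the bracketed role tags and helper name below are illustrative, not part of the official template.

```python
# Hypothetical helper showing the approach described above: flatten a multi-turn
# conversation (system prompt included) into the {instruction} placeholder and
# put only the latest assistant reply into {response}.

def to_prometheus_fields(system_prompt: str, turns: list[dict]) -> tuple[str, str]:
    """turns: chronological list of {"role": "user"|"assistant", "content": str}."""
    assert turns and turns[-1]["role"] == "assistant", "last turn must be the response to judge"
    history = [f"[System]\n{system_prompt}"] if system_prompt else []
    history += [f"[{t['role'].capitalize()}]\n{t['content']}" for t in turns[:-1]]
    instruction = "\n\n".join(history)   # goes into the {instruction} placeholder
    response = turns[-1]["content"]      # goes into the {response} placeholder
    return instruction, response

instruction, response = to_prometheus_fields(
    "You are a helpful assistant.",
    [
        {"role": "user", "content": "Summarize the plot of Hamlet."},
        {"role": "assistant", "content": "Hamlet seeks revenge for his father's murder..."},
        {"role": "user", "content": "Now in one sentence."},
        {"role": "assistant", "content": "A prince avenges his father's death at great cost."},
    ],
)
```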