Abstract
We posit that to achieve superhuman agents, future models require superhuman feedback to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may be bottlenecked by human performance level; moreover, these separate frozen reward models cannot learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training, not only does instruction-following ability improve, but so does the model's ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While only a preliminary study, this work opens the door to the possibility of models that can continually improve along both axes.
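The self-rewarding loop described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `generate` and `judge_score` are hypothetical placeholders standing in for the same underlying LLM (sampling candidate responses, and scoring them with a Fig.-2-style LLM-as-a-Judge prompt, respectively).

```python
def generate(prompt, n=4):
    # Hypothetical stand-in: the model samples n candidate responses per prompt.
    return [f"{prompt} answer{'!' * i}" for i in range(n)]

def judge_score(prompt, response):
    # Hypothetical stand-in for LLM-as-a-Judge scoring on a 0-5 rubric;
    # a simple length heuristic is used here just so the sketch runs.
    return min(5.0, len(response) / 10)

def build_preference_pairs(prompts):
    """Turn self-judged generations into (prompt, chosen, rejected) DPO triples."""
    pairs = []
    for p in prompts:
        # Score each candidate with the model-as-judge and sort ascending.
        scored = sorted(generate(p), key=lambda r: judge_score(p, r))
        chosen, rejected = scored[-1], scored[0]
        # Keep the pair only if the judge actually distinguishes them.
        if judge_score(p, chosen) > judge_score(p, rejected):
            pairs.append((p, chosen, rejected))
    return pairs

pairs = build_preference_pairs(["Explain DPO."])
```

In each iteration, triples like these would drive one round of DPO training, and the improved model then serves as both generator and judge for the next iteration.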
Community
Very nice paper! Do you plan to share the code to the community? Best, Yannick
Hi! Thanks! This was my prev answer on Twitter about model+code release:
"Well, as you know, releasing models got way harder for corps in the current landscape, and we're a small team in FAIR Labs [+ NYU] with limited resources (e.g., not part of the Llama team). For code, we've also had some approvals + other issues .. but hopeful to get there soon."
Nice paper, I was just waiting for cool news like this.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking (2023)
- Aligning Large Language Models with Human Preferences through Representation Engineering (2023)
- DRLC: Reinforcement Learning with Dense Rewards from LLM Critic (2024)
- Secrets of RLHF in Large Language Models Part II: Reward Modeling (2024)
- Data-Efficient Alignment of Large Language Models with Human Feedback Through Natural Language (2023)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
Let the model reward itself. What could go wrong?
Amazing work!!!
I have some naive concerns and wonder if you could help me. Suppose the initial SFT-trained model still requires alignment (for example, it has some ethical issues). Will this kind of problem be solved by the iterative training? My concern is that the model acting as a judge cannot penalize such outputs, because no human feedback, nor a discriminator trained on human feedback, is involved in the entire training process.
Thanks!!!
The reward model is implemented via LLM-as-a-Judge with a chosen prompt (see Fig. 2 in the paper), so it would actually be very easy to incorporate safety rewards, e.g. for ethics issues: simply rewrite the Fig. 2 prompt to include the wording that would aim to make it safe. Of course, I'm not discounting that humans should be in the loop somewhere in the optimal system.
@davegoldblatt Let humans reward the model. What could go wrong?
(by which I mean, these are challenging problems requiring fundamental research .. whichever way you go..)
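To make the prompt-rewriting suggestion above concrete, here is an illustrative sketch. The rubric text below paraphrases the additive scoring idea of Fig. 2 and adds a safety deduction that is *not* in the original paper; `parse_score` is a hypothetical helper for reading the judge's output.

```python
import re

# Illustrative judge prompt: paraphrases the additive 5-point rubric of
# Fig. 2 and adds a safety criterion (an assumption, not the paper's prompt).
JUDGE_PROMPT = (
    "Review the user's question and the response using an additive "
    "5-point rubric: award points for relevance, coverage, and clarity, "
    "and deduct 1 point if the response is unsafe or unethical.\n"
    "Conclude with the line: Score: <total points>"
)

def parse_score(judge_output: str) -> int:
    """Extract the integer score from the judge model's raw output."""
    match = re.search(r"Score:\s*(\d+)", judge_output)
    return int(match.group(1)) if match else 0

# Mock judge output; a real system would feed JUDGE_PROMPT plus the
# question/response pair to the same LLM being trained.
print(parse_score("The response is helpful and safe.\nScore: 4"))  # → 4
```

The safety criterion then shapes the preference pairs directly: unsafe responses score lower and end up as "rejected" examples, so the DPO step penalizes them without a separate human-trained discriminator.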
Thanks for your reply. Another question: would the self-rewarding, iterative training method work for multimodal LLMs (a unified model, based on an LLM, that can understand and generate both text and images via a (de)tokenizer)? Current MLLMs are mainly trained with SFT and their generation capability is not that good, so I wonder whether this could be an approach to improve MLLMs.
This is amazing, will try it tonight ;-)
I have been through the code; it seems to work well.
Do you have a notebook with a real dataset (no mocks) and a model from Hugging Face?
I guess your x-transformers config is Llama 70B… Is it compatible with any model?
With this (and other automatically improving / self-correcting LLM approaches), has the MAD paper (https://huggingface.co/papers/2307.01850) been debunked?
How AI Trains Itself: Inside Self-Rewarding Language Models
Links:
- Subscribe: https://www.youtube.com/@Arxflix
- Twitter: https://x.com/arxflix
- LMNT (Partner): https://lmnt.com/