WARM: On the Benefits of Weight Averaged Reward Models
Abstract
Aligning large language models (LLMs) with human preferences through reinforcement learning from human feedback (RLHF) can lead to reward hacking, where LLMs exploit failures in the reward model (RM) to achieve seemingly high rewards without meeting the underlying objectives. We identify two primary challenges when designing RMs to mitigate reward hacking: distribution shifts during the RL process and inconsistencies in human preferences. As a solution, we propose Weight Averaged Reward Models (WARM), first fine-tuning multiple RMs, then averaging them in the weight space. This strategy follows the observation that fine-tuned weights remain linearly mode connected when they share the same pre-training. By averaging weights, WARM improves efficiency compared to the traditional ensembling of predictions, while improving reliability under distribution shifts and robustness to preference inconsistencies. Our experiments on summarization tasks, using best-of-N and RL methods, show that WARM improves the overall quality and alignment of LLM predictions; for example, a policy RL fine-tuned with WARM has a 79.4% win rate against a policy RL fine-tuned with a single RM.
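To make the core recipe concrete, here is a minimal PyTorch sketch of the weight-averaging step (not the authors' implementation): `reward_models` is assumed to be a list of identically-architected `torch.nn.Module` reward models, each fine-tuned from the same pre-trained initialization so that their weights stay linearly mode connected, and a uniform average is used for simplicity.

```python
# Minimal sketch of WARM-style weight averaging (assumes all RMs share the
# same architecture and pre-trained initialization; uniform weights).
import copy
import torch


def average_reward_models(reward_models):
    """Return a model whose parameters are the element-wise mean of the
    parameters of several fine-tuned reward models."""
    assert len(reward_models) > 0
    averaged = copy.deepcopy(reward_models[0])
    state_dicts = [rm.state_dict() for rm in reward_models]
    avg_state = {}
    for key in state_dicts[0]:
        # Stack the corresponding tensor from every RM, take the mean,
        # then cast back to the original dtype.
        stacked = torch.stack([sd[key].float() for sd in state_dicts])
        avg_state[key] = stacked.mean(dim=0).to(state_dicts[0][key].dtype)
    averaged.load_state_dict(avg_state)
    return averaged
```

The averaged model is then used as a single RM (e.g., `warm_rm = average_reward_models([rm_1, rm_2, rm_3])`), so scoring costs the same as one RM rather than a full prediction ensemble.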
Community
How is the "Control Reward" calculated?
I was unable to locate a definition for this term.
Hi, author here. Thank you for the question. The answer is in the first paragraph of Section 5, where we state: "we leverage a PaLM-XS RM for pointwise control reward reaching 80.1% accuracy on the OOD dataset. As verified in our experiments, this control RM also detects hacking, as it benefits from a larger architecture and a disjoint pretraining compared to the PaLM-XXS RMs of interest". In other words, the control reward is also an RM, trained on the same dataset but with a larger architecture and a different pretraining. This pointwise control reward lets us plot absolute scores, and the observations are consistent with the "pairwise oracle preference metric".
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- Uncertainty-Penalized Reinforcement Learning from Human Feedback with Diverse Reward LoRA Ensembles (2023)
- Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking (2023)
- Secrets of RLHF in Large Language Models Part II: Reward Modeling (2024)
- DRLC: Reinforcement Learning with Dense Rewards from LLM Critic (2024)
- Language Model Alignment with Elastic Reset (2023)
How is the "pairwise oracle preference metric" calculated?
The pairwise oracle preference metric is described in the first paragraph of Section 5 and further detailed in Appendix B.2. Roughly, we follow the best AI labelling approach from RLAIF (https://arxiv.org/abs/2309.00267).
So in Figure 7, the win or loss is labelled by the PaLM-XS RM (i.e. the control reward) instead of GPT-4 or a human?
Thanks for the question. No, actually, in Figure 7 the win rate is computed with the oracle preference metric, i.e., the AI labelling approach from RLAIF with a "PaLM-L model prompted with chain-of-thought".
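For readers who want the aggregation spelled out, below is a hypothetical sketch of how pairwise judgments from such an AI labeller could be turned into a win rate; the `judge` callable stands in for the prompted PaLM-L model and is purely an assumption, and details such as the prompt format and tie handling follow the RLAIF paper rather than this sketch.

```python
# Hedged sketch: aggregate an AI labeller's pairwise preferences into a win rate.
def pairwise_win_rate(prompts, outputs_a, outputs_b, judge):
    """Fraction of prompts on which the judge prefers policy A's output.

    `judge(prompt, a, b)` is a hypothetical callable returning "A" or "B",
    standing in for the PaLM-L model prompted with chain-of-thought.
    """
    wins = sum(judge(p, a, b) == "A"
               for p, a, b in zip(prompts, outputs_a, outputs_b))
    return wins / len(prompts)
```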