WARM: On the Benefits of Weight Averaged Reward Models
Abstract
Aligning large language models (LLMs) with human preferences through reinforcement learning from human feedback (RLHF) can lead to reward hacking, where LLMs exploit failures in the reward model (RM) to achieve seemingly high rewards without meeting the underlying objectives. We identify two primary challenges when designing RMs to mitigate reward hacking: distribution shifts during the RL process and inconsistencies in human preferences. As a solution, we propose Weight Averaged Reward Models (WARM), first fine-tuning multiple RMs, then averaging them in the weight space. This strategy follows the observation that fine-tuned weights remain linearly mode connected when they share the same pre-training. By averaging weights, WARM improves efficiency compared to the traditional ensembling of predictions, while improving reliability under distribution shifts and robustness to preference inconsistencies. Our experiments on summarization tasks, using best-of-N and RL methods, show that WARM improves the overall quality and alignment of LLM predictions; for example, a policy RL fine-tuned with WARM has a 79.4% win rate against a policy RL fine-tuned with a single RM.
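To make the core recipe concrete, here is a minimal PyTorch sketch of the weight-averaging step (not the authors' implementation): `reward_models` is assumed to be a list of identically-architected `torch.nn.Module` reward models, each fine-tuned from the same pre-trained initialization so that their weights stay linearly mode connected, and a uniform average is used for simplicity.

```python
# Minimal sketch of WARM-style weight averaging (assumes all RMs share the
# same architecture and pre-trained initialization; uniform weights).
import copy
import torch


def average_reward_models(reward_models):
    """Return a model whose parameters are the element-wise mean of the
    parameters of several fine-tuned reward models."""
    assert len(reward_models) > 0
    averaged = copy.deepcopy(reward_models[0])
    state_dicts = [rm.state_dict() for rm in reward_models]
    avg_state = {}
    for key in state_dicts[0]:
        # Stack the corresponding tensor from every RM, take the mean,
        # then cast back to the original dtype.
        stacked = torch.stack([sd[key].float() for sd in state_dicts])
        avg_state[key] = stacked.mean(dim=0).to(state_dicts[0][key].dtype)
    averaged.load_state_dict(avg_state)
    return averaged
```

The averaged model is then used as a single RM (e.g., `warm_rm = average_reward_models([rm_1, rm_2, rm_3])`), so scoring costs the same as one RM rather than a full prediction ensemble.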
Community
How is the "Control Reward" calculated?
I was unable to locate a definition for this term.
Hi, author here. Thank you for the question. The answer is in the first paragraph of Section 5, where we state: "we leverage a PaLM-XS RM for pointwise control reward reaching 80.1% accuracy on the OOD dataset. As verified in our experiments, this control RM also detects hacking, as it benefits from a larger architecture and a disjoint pretraining compared to the PaLM-XXS RMs of interest". In other words, the control reward is also an RM, trained on the same dataset but with a larger architecture and a different pretraining. This pointwise control reward lets us plot absolute scores, and the observations are consistent with the "pairwise oracle preference metric".
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- Uncertainty-Penalized Reinforcement Learning from Human Feedback with Diverse Reward LoRA Ensembles (2023)
- Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking (2023)
- Secrets of RLHF in Large Language Models Part II: Reward Modeling (2024)
- DRLC: Reinforcement Learning with Dense Rewards from LLM Critic (2024)
- Language Model Alignment with Elastic Reset (2023)
How is the "pairwise oracle preference metric" calculated?
The pairwise oracle preference metric is described in the first paragraph of Section 5 and further detailed in Appendix B.2. Roughly, we follow the best AI labelling approach from RLAIF (https://arxiv.org/abs/2309.00267).
So in Figure 7, the win or loss is labelled by the PaLM-XS RM (i.e. the control reward) instead of GPT-4 or a human?
Thanks for the question. No, actually, in Figure 7 the win rate is computed with the oracle preference metric, i.e., the AI labelling approach from RLAIF with a "PaLM-L model prompted with chain-of-thought".
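For readers who want the aggregation spelled out, below is a hypothetical sketch of how pairwise judgments from such an AI labeller could be turned into a win rate; the `judge` callable stands in for the prompted PaLM-L model and is purely an assumption, and details such as the prompt format and tie handling follow the RLAIF paper rather than this sketch.

```python
# Hedged sketch: aggregate an AI labeller's pairwise preferences into a win rate.
def pairwise_win_rate(prompts, outputs_a, outputs_b, judge):
    """Fraction of prompts on which the judge prefers policy A's output.

    `judge(prompt, a, b)` is a hypothetical callable returning "A" or "B",
    standing in for the PaLM-L model prompted with chain-of-thought.
    """
    wins = sum(judge(p, a, b) == "A"
               for p, a, b in zip(prompts, outputs_a, outputs_b))
    return wins / len(prompts)
```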