ORPO: Monolithic Preference Optimization without Reference Model Paper • 2403.07691 • Published Mar 12 • 62
Teaching Large Language Models to Reason with Reinforcement Learning Paper • 2403.04642 • Published Mar 7 • 46
Best Practices and Lessons Learned on Synthetic Data for Language Models Paper • 2404.07503 • Published Apr 11 • 29
Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks Paper • 2404.14723 • Published Apr 23 • 10
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework Paper • 2405.11143 • Published May 20 • 34
Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms Paper • 2406.02900 • Published Jun 5 • 11
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs Paper • 2402.14740 • Published Feb 22 • 10
HelpSteer2: Open-source dataset for training top-performing reward models Paper • 2406.08673 • Published Jun 12 • 16
Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback Paper • 2406.09279 • Published Jun 13 • 1
Understanding the performance gap between online and offline alignment algorithms Paper • 2405.08448 • Published May 14 • 14
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision Paper • 2312.09390 • Published Dec 14, 2023 • 32
Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint Paper • 2312.11456 • Published Dec 18, 2023 • 1
RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment Paper • 2304.06767 • Published Apr 13, 2023 • 2
Self-Play Preference Optimization for Language Model Alignment Paper • 2405.00675 • Published May 1 • 24
Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs Paper • 2406.10216 • Published Jun 14 • 2
AgentInstruct: Toward Generative Teaching with Agentic Flows Paper • 2407.03502 • Published Jul 3 • 48
Online Merging Optimizers for Boosting Rewards and Mitigating Tax in Alignment Paper • 2405.17931 • Published May 28
Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning Paper • 2405.00451 • Published May 1
Foundations of Reinforcement Learning and Interactive Decision Making Paper • 2312.16730 • Published Dec 27, 2023
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents Paper • 2408.07199 • Published Aug 13 • 20
Disentangling Length from Quality in Direct Preference Optimization Paper • 2403.19159 • Published Mar 28
Contrastive Preference Learning: Learning from Human Feedback without RL Paper • 2310.13639 • Published Oct 20, 2023 • 24
Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment Paper • 2408.06266 • Published Aug 12 • 9
Training Language Models to Self-Correct via Reinforcement Learning Paper • 2409.12917 • Published Sep 19 • 134
HelpSteer2-Preference: Complementing Ratings with Preferences Paper • 2410.01257 • Published Oct 2 • 19
A Critical Evaluation of AI Feedback for Aligning Large Language Models Paper • 2402.12366 • Published Feb 19 • 3
Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning Paper • 2410.08146 • Published Oct 10
RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning Paper • 2410.02089 • Published Oct 2 • 11
SALSA: Soup-based Alignment Learning for Stronger Adaptation in RLHF Paper • 2411.01798 • Published 13 days ago • 8