Reward model based on `deberta-v3-large-tasksource-nli`, fine-tuned on Anthropic/hh-rlhf

The model was fine-tuned for 1 epoch with a learning rate of 1e-5.

The data are described in the paper: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.

Validation accuracy is currently the best publicly reported: 75.16% (vs. 69.25% for OpenAssistant/reward-model-deberta-v3-large-v2).
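
For readers unfamiliar with how accuracy is usually computed for a reward model on preference data, here is a minimal sketch. It assumes (the card does not state this) that accuracy means the fraction of (chosen, rejected) response pairs in which the model scores the chosen response higher. The function name `pairwise_accuracy` and the scores below are hypothetical and purely illustrative; they are not outputs of this model.

```python
from typing import Sequence


def pairwise_accuracy(
    chosen_scores: Sequence[float], rejected_scores: Sequence[float]
) -> float:
    """Fraction of pairs where the chosen response outscores the rejected one.

    Assumption: this mirrors the usual preference-pair accuracy metric;
    the model card does not spell out its exact evaluation procedure.
    """
    if len(chosen_scores) != len(rejected_scores):
        raise ValueError("score lists must have the same length")
    if not chosen_scores:
        raise ValueError("need at least one pair")
    # A pair counts as correct only when the chosen score is strictly higher.
    correct = sum(c > r for c, r in zip(chosen_scores, rejected_scores))
    return correct / len(chosen_scores)


# Hypothetical reward scores for four (chosen, rejected) pairs.
chosen = [1.2, 0.3, -0.5, 2.0]
rejected = [0.4, 0.9, -1.1, 1.5]
accuracy = pairwise_accuracy(chosen, rejected)
print(f"{accuracy:.2%}")
```

In practice the two score lists would come from running the reward model on each response of every validation pair; ties are counted as errors here, which is a design choice rather than something the card specifies.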

Dataset used to train sileod/deberta-v3-large-tasksource-rlhf-reward-model: Anthropic/hh-rlhf

Evaluation results

Accuracy on the Anthropic/hh-rlhf validation set (self-reported): 0.7516
