Here's a thought: instead of telling LLMs what to do, show them!
Language models are aligned to emulate the collective voice of many, resulting in outputs that align with no one in particular.
DITTO, from Stanford University, proposes that LLMs can be tuned with fewer than 10 samples!
What's DITTO? Demonstration ITerated Task Optimization (definitely came up with the acronym first!)
Here is the step-by-step implementation (a code sketch follows the list):
Initialization: Start with a reference language model (LM), a set of expert demonstrations, a sample size, and a sampling frequency.
Supervised Fine-Tuning (SFT): Fine-tune the reference LM on the expert demonstrations to create an initial policy P0.
Iterative Comparison Sampling: For each iteration t:
Sample multiple completions from the policy Pt for each demonstration to create a new dataset Dt.
Construct a batch of comparisons where the demonstrations are ranked higher than all sampled model outputs from the current and previous iterations.
Policy Update:
Update the policy Pt using a Direct Preference Optimization (DPO) algorithm, which incorporates feedback from the batch of comparisons.
Increment the iteration and repeat the sampling and updating process until convergence.
Result: After sufficient iterations, the final policy P aligns more closely with the expert demonstrations, effectively tuning the LM to reflect user-specific preferences and behaviors.
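To make the loop concrete, here is a minimal Python sketch of the procedure above. This is my reading of the steps in the post, not the authors' code: the LM pieces (sft, sample_completions, dpo_update) are hypothetical stand-in stubs so the structure runs end to end, and dpo_loss only illustrates the pairwise DPO objective. In practice you would plug in a real LM and a real DPO trainer.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Pairwise DPO objective for one (chosen, rejected) comparison:
    -log sigmoid(beta * [(logp_c - ref_logp_c) - (logp_r - ref_logp_r)])."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# --- hypothetical stand-ins for the LM pieces (replace with real models/trainers) ---
def sft(reference_lm, demonstrations):
    """Supervised fine-tuning of the reference LM on expert demos -> initial policy P0."""
    return dict(reference_lm, sft_on=len(demonstrations))

def sample_completions(policy, prompt, n):
    """Sample n completions from the current policy for a demonstration's prompt."""
    return [f"{prompt}::sample_t{policy.get('step', 0)}_{i}" for i in range(n)]

def dpo_update(policy, comparisons, beta=0.1):
    """One DPO-style update on the comparison batch (placeholder)."""
    return {**policy, "step": policy.get("step", 0) + 1, "n_pairs": len(comparisons)}

def ditto(reference_lm, demonstrations, n_samples=4, n_iterations=3):
    policy = sft(reference_lm, demonstrations)   # SFT init -> P0
    replay = []                                  # samples kept from all past iterations
    for t in range(n_iterations):
        # 1) sample completions from the current policy P_t for each demo prompt
        new_samples = [
            (demo["prompt"], s)
            for demo in demonstrations
            for s in sample_completions(policy, demo["prompt"], n_samples)
        ]
        replay.extend(new_samples)
        # 2) build comparisons: each expert demo is preferred over every sampled
        #    output (current and earlier iterations) for the same prompt
        comparisons = [
            (demo["completion"], sample)         # (chosen, rejected)
            for demo in demonstrations
            for prompt, sample in replay
            if prompt == demo["prompt"]
        ]
        # 3) update the policy with DPO on this comparison batch
        policy = dpo_update(policy, comparisons)
    return policy

if __name__ == "__main__":
    demos = [{"prompt": "Write a greeting", "completion": "Hey there! ..."}]
    print(ditto(reference_lm={"name": "base-lm"}, demonstrations=demos))
    print(round(dpo_loss(-5.0, -7.0, -6.0, -6.5), 3))  # toy DPO loss value
```

The key structural point is that every expert demonstration is paired against every completion sampled so far for the same prompt, so the preference data grows across iterations without any extra human labels.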
According to the paper's evaluations, DITTO outperforms few-shot prompting, supervised fine-tuning, and other self-play methods by an average of 19 points in win rate.
Paper: Show, Don't Tell: Aligning Language Models with Demonstrated Feedback (arXiv:2406.00888)