@anakin87 on Hugging Face

Ok, you're finally convinced that synthetic data works... ⚗️

𝐍𝐨𝐰 𝐲𝐨𝐮 𝐰𝐚𝐧𝐭 𝐭𝐨 𝐠𝐞𝐧𝐞𝐫𝐚𝐭𝐞 𝐚𝐧 𝐢𝐧𝐬𝐭𝐫𝐮𝐜𝐭𝐢𝐨𝐧 𝐝𝐚𝐭𝐚𝐬𝐞𝐭 𝐟𝐨𝐫 𝐟𝐢𝐧𝐞-𝐭𝐮𝐧𝐢𝐧𝐠 𝐢𝐧 𝐚 𝐥𝐚𝐧𝐠𝐮𝐚𝐠𝐞 𝐨𝐭𝐡𝐞𝐫 𝐭𝐡𝐚𝐧 𝐄𝐧𝐠𝐥𝐢𝐬𝐡.
But how do you get started?

I explore how to do this with Magpie in my new article
https://huggingface.co/blog/anakin87/multilingual-magpie

---

🐦‍⬛ 𝐖𝐡𝐚𝐭 𝐢𝐬 𝐌𝐚𝐠𝐩𝐢𝐞?

It's a recent technique for creating synthetic instruction datasets.

Magpie is based on a simple but ingenious idea 👇
If you prompt an instruction-tuned model with only a pre-query template (the chat-template tokens that open a user turn), the model generates a plausible user query/instruction.

Here's an example:
model: Llama-3-8B-Instruct
pre-query template: "<|begin_of_text|><|start_header_id|>user<|end_header_id|>"
generated user instruction: "What are some of the responsibilities of a commercial pilot?"

You can then feed this instruction back into the same model to get the assistant response.

By repeating this process, it's possible to generate large synthetic datasets with relatively little effort.
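
The two steps above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: `complete` is a hypothetical callable that sends a raw prompt to the model and returns its continuation (stopping at the end-of-turn token), `stub_complete` is a canned stand-in so the example runs offline, and the `\n\n` separator and `<|eot_id|>` token are assumed from the Llama 3 chat format rather than taken from this post.

```python
from typing import Callable, Dict, List

# Pre-query template quoted in the post for Llama-3-8B-Instruct.
PRE_QUERY_TEMPLATE = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>"

# Assumptions from the Llama 3 chat format (not stated in the post).
HEADER_SEP = "\n\n"
EOT = "<|eot_id|>"
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>" + HEADER_SEP


def magpie_pair(complete: Callable[[str], str]) -> Dict[str, str]:
    """Generate one instruction/response pair using the same model twice."""
    # Step 1: the user turn is open but empty, so the model writes the instruction.
    instruction = complete(PRE_QUERY_TEMPLATE + HEADER_SEP).strip()
    # Step 2: close the user turn, open the assistant turn, and get the response.
    prompt = PRE_QUERY_TEMPLATE + HEADER_SEP + instruction + EOT + ASSISTANT_HEADER
    response = complete(prompt).strip()
    return {"instruction": instruction, "response": response}


def magpie_dataset(complete: Callable[[str], str], n: int) -> List[Dict[str, str]]:
    """Repeat the two-step process n times to build a small dataset."""
    return [magpie_pair(complete) for _ in range(n)]


def stub_complete(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; returns canned text."""
    if prompt.endswith(ASSISTANT_HEADER):
        return "(canned assistant response)"
    return "What are some of the responsibilities of a commercial pilot?"
```

In real use, `stub_complete` would be replaced by a call to your inference backend with sampling enabled, so that repeated calls yield different instructions.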

🪄 The authors demonstrate that using these datasets for Supervised Fine-Tuning (SFT) can yield strong performance, even competitive with the original instruct model.

🧗𝐆𝐞𝐧𝐞𝐫𝐚𝐭𝐢𝐧𝐠 𝐧𝐨𝐧-𝐄𝐧𝐠𝐥𝐢𝐬𝐡 𝐝𝐚𝐭𝐚

Most language models are trained primarily on English text, so they tend to produce data in English.

How can we overcome this?

Earlier approaches were complex or costly.

Then @mrm8488 found a simple solution: add the target language to the pre-query template.
For Spanish, the template becomes "<|begin_of_text|><|start_header_id|>user<|end_header_id|>spanish:".

This method works for Spanish and German!

❌ Unfortunately, it does not work well for other languages (🇮🇹, 🇳🇱, ...)
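
For reference, the language-prefix trick amounts to a one-line change to the template. The helper below is a hypothetical illustration (the function name is not from the post or any library); it simply reproduces the Spanish template quoted above.

```python
# Pre-query template quoted in the post for Llama-3-8B-Instruct.
PRE_QUERY_TEMPLATE = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>"


def language_pre_query(language: str) -> str:
    """Append the lowercase target language and a colon to the pre-query template."""
    return f"{PRE_QUERY_TEMPLATE}{language.lower()}:"


spanish_template = language_pre_query("Spanish")
```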

👇

💡 𝐌𝐚𝐠𝐩𝐢𝐞 𝐰𝐢𝐭𝐡 𝐬𝐲𝐬𝐭𝐞𝐦 𝐦𝐞𝐬𝐬𝐚𝐠𝐞

I had another idea: use the system message to steer generation towards a specific language.

The system message should be written in the target language. Rendered in English, it looks like this:
"You are an artificial intelligence that answers users' questions in TARGET_LANGUAGE in a useful and detailed way. The user asks complex questions in TARGET_LANGUAGE."

It is a simple approach, but it might work...
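
As a rough sketch of what this looks like in practice, the system message can be placed in front of the open user turn. The token layout below assumes the Llama 3 chat format, and the function and constant names are hypothetical; the English message is used only for readability, since in real use it would be translated into the target language.

```python
# English rendering of the system message from the post; translate it into
# the target language before real use.
SYSTEM_MESSAGE_EN = (
    "You are an artificial intelligence that answers users' questions in "
    "{language} in a useful and detailed way. The user asks complex "
    "questions in {language}."
)


def system_pre_query(system_message: str) -> str:
    """Build a pre-query template with a system turn (assumed Llama 3 layout)."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_message}<|eot_id|>"
        # The user turn is left open so the model writes the instruction.
        "<|start_header_id|>user<|end_header_id|>\n\n"
    )


italian_template = system_pre_query(SYSTEM_MESSAGE_EN.format(language="Italian"))
```

The model then continues from the open user turn, exactly as in plain Magpie, but with the system message steering the language of the generated instruction.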

It turns out the authors had a similar idea, which they included in the latest revision of their paper. 🎉

🍪 Resources

Magpie paper and repository: https://huggingface.co/papers/2406.08464 https://github.com/magpie-align/magpie

Magpie demo by @davanstrien : https://huggingface.co/spaces/davanstrien/magpie

Magpie Ollama Datagen by @mrm8488 : https://github.com/mrm8488/magpie-ollama-datagen

magpie-ultra dataset - massive dataset built with Magpie by Argilla: https://huggingface.co/datasets/argilla/magpie-ultra-v0.1

⚗️ distilabel framework - framework for synthetic data generation and AI feedback at scale: https://distilabel.argilla.io/latest/
