@loubnabnl on Hugging Face: "We've just published a detailed blog post on the creation of Cosmopedia…"

loubnabnl

posted an update Mar 20

We've just published a detailed blog post on the creation of the Cosmopedia dataset. We hope it will provide insights into generating synthetic data at scale for pre-training.
https://huggingface.co/blog/cosmopedia

Here are some key takeaways:
🎯 Prompt curation is crucial: we want to cover many topics with few duplicates.
📚 You can draw on various resources for diversity: different seed data, generation formats, and target audiences.
⚙️ A good technical stack matters: tools like llm-swarm for scalable generation, plus fast model training and evaluation.
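
To make the first two takeaways concrete, here is a minimal sketch of combining seed topics, formats, and audiences into prompts while skipping duplicates. It is not the Cosmopedia pipeline: the values, the prompt template, and the helper names (`normalize`, `build_prompts`) are all hypothetical and only illustrate the idea.

```python
from itertools import product

# Hypothetical example values; the post does not list the actual
# seed data, generation formats, or target audiences.
SEED_TOPICS = ["photosynthesis", "binary search", "Photosynthesis "]
FORMATS = ["textbook chapter", "blog post"]
AUDIENCES = ["young children", "college students"]


def normalize(text):
    """Lowercase and collapse whitespace so near-identical seeds match."""
    return " ".join(text.lower().split())


def build_prompts(seeds, formats, audiences):
    """Cross seeds with formats and audiences, dropping duplicate combinations."""
    seen = set()
    prompts = []
    for seed, fmt, audience in product(seeds, formats, audiences):
        key = (normalize(seed), fmt, audience)
        if key in seen:
            continue  # same topic, format, and audience already covered
        seen.add(key)
        # Hypothetical prompt template, for illustration only.
        prompts.append(f"Write a {fmt} about {normalize(seed)} for {audience}.")
    return prompts


prompts = build_prompts(SEED_TOPICS, FORMATS, AUDIENCES)
```

Here the two spellings of "photosynthesis" collapse into one topic, so each topic appears once per format and audience. This only catches exact matches after normalization; near-duplicates with different wording would need a fuzzier approach.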

Have a good read!

muhtasham

Mar 21

Amazing write-up!

In this post

loubnabnl Loubna Ben Allal
muhtasham Muhtasham Oblokulov