anakin87 (Stefano Fiorucci)

reacted to davanstrien's post with ❤️ 1 day ago

Post

1071

huggingface.co/DIBT is dead!

Long live https://huggingface.co/data-is-better-together!

We're working on some very cool projects so we're doing a bit of tidying of the Data is Better Together Hub org 🤓

replied to their post about 1 month ago

💡 𝐌𝐚𝐠𝐩𝐢𝐞 𝐰𝐢𝐭𝐡 𝐬𝐲𝐬𝐭𝐞𝐦 𝐦𝐞𝐬𝐬𝐚𝐠𝐞

I had another idea: use the system message to steer generation towards a specific language.

The system message should be in the target language, like:
"You are an artificial intelligence that answers users' questions in TARGET_LANGUAGE in a useful and detailed way. The user asks complex questions in TARGET_LANGUAGE."

It is a simple approach, but it might work...

It turns out the authors had a similar idea, which they included in the latest revision of their paper. 🎉

🍪 Resources

Magpie paper and repository: https://huggingface.co/papers/2406.08464 https://github.com/magpie-align/magpie

Magpie demo by @davanstrien : https://huggingface.co/spaces/davanstrien/magpie

Magpie Ollama Datagen by @mrm8488 : https://github.com/mrm8488/magpie-ollama-datagen

magpie-ultra dataset - massive dataset built with Magpie by Argilla: https://huggingface.co/datasets/argilla/magpie-ultra-v0.1

⚗️ distilabel framework - framework for synthetic data generation and AI feedback at scale: https://distilabel.argilla.io/latest/

posted an update about 1 month ago

Post

1033

Ok, you're finally convinced that synthetic data works... ⚗️

𝐍𝐨𝐰 𝐲𝐨𝐮 𝐰𝐚𝐧𝐭 𝐭𝐨 𝐠𝐞𝐧𝐞𝐫𝐚𝐭𝐞 𝐚𝐧 𝐢𝐧𝐬𝐭𝐫𝐮𝐜𝐭𝐢𝐨𝐧 𝐝𝐚𝐭𝐚𝐬𝐞𝐭 𝐟𝐨𝐫 𝐟𝐢𝐧𝐞-𝐭𝐮𝐧𝐢𝐧𝐠 𝐢𝐧 𝐚 𝐥𝐚𝐧𝐠𝐮𝐚𝐠𝐞 𝐨𝐭𝐡𝐞𝐫 𝐭𝐡𝐚𝐧 𝐄𝐧𝐠𝐥𝐢𝐬𝐡.
But how do you get started?

I explore how to do this with Magpie in my new article
https://huggingface.co/blog/anakin87/multilingual-magpie

---

🐦‍⬛ 𝐖𝐡𝐚𝐭 𝐢𝐬 𝐌𝐚𝐠𝐩𝐢𝐞?

It's a recent technique for creating synthetic instruction datasets.

Magpie is based on a simple but ingenious idea 👇
if you prompt an instruction-tuned model with a pre-query template, you can make it generate a plausible user query/instruction

Here's an example:
model: Llama-3-8B-Instruct
pre-query template: "<|begin_of_text|><|start_header_id|>user<|end_header_id|>"
generated user instruction: "What are some of the responsibilities of a commercial pilot?"

You can then feed this instruction back into the same model to get the assistant response.

By repeating this process, it's possible to generate large synthetic datasets with relatively little effort.

🪄 The authors demonstrate that using these datasets for Supervised Fine Tuning (SFT) can yield strong performance, even competitive with the original instruct model.

🧗𝐆𝐞𝐧𝐞𝐫𝐚𝐭𝐢𝐧𝐠 𝐧𝐨𝐧-𝐄𝐧𝐠𝐥𝐢𝐬𝐡 𝐝𝐚𝐭𝐚

Most Language Models are primarily trained on English texts, so they tend to produce data in English.

How can we overcome this?

Earlier approaches were complex or costly.

Then @mrm8488 found a simple solution: add the target language to the pre-query template.
For Spanish, the template becomes "<|begin_of_text|><|start_header_id|>user<|end_header_id|>spanish:".

This method works for Spanish and German!

❌ Unfortunately, it does not work well for other languages (🇮🇹, 🇳🇱, ...)

👇

1 reply

·

posted an update about 2 months ago

Post

1693

🕵🏻 𝐀𝐠𝐞𝐧𝐭𝐢𝐜 𝐑𝐀𝐆 𝐰𝐢𝐭𝐡 🦙 𝐋𝐥𝐚𝐦𝐚 3.2

I was excited to explore Llama 3.2, but as a simple 🇪🇺 EU guy, I don't have access to Meta's multimodal models 😿

🤔 So I thought: why not challenge the small 3B text model with Agentic RAG?

🎯 The plan:
- Build a system that tries to answer questions using a knowledge base.
- If the documents don't contain the answer, use Web search for additional context.

Check out my experimental notebook here: 📓 https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/llama32_agentic_rag.ipynb

My stack:
🏗️ haystack (https://haystack.deepset.ai/): open-source LLM orchestration framework
🦙 meta-llama/Llama-3.2-3B-Instruct
🦆🌐 free DuckDuckGo API, integrated with Haystack

✨ 𝘛𝘩𝘦 𝘳𝘦𝘴𝘶𝘭𝘵𝘴? 𝘌𝘯𝘤𝘰𝘶𝘳𝘢𝘨𝘪𝘯𝘨 - 𝘢 𝘧𝘦𝘸 𝘮𝘰𝘯𝘵𝘩𝘴 𝘢𝘨𝘰, 𝘵𝘩𝘪𝘴 𝘭𝘦𝘷𝘦𝘭 𝘰𝘧 𝘱𝘦𝘳𝘧𝘰𝘳𝘮𝘢𝘯𝘤𝘦 𝘧𝘳𝘰𝘮 𝘢 𝘴𝘮𝘢𝘭𝘭 𝘮𝘰𝘥𝘦𝘭 𝘸𝘰𝘶𝘭𝘥'𝘷𝘦 𝘣𝘦𝘦𝘯 𝘶𝘯𝘵𝘩𝘪𝘯𝘬𝘢𝘣𝘭𝘦!
This probably reflects the impressive IFEval score of the model (comparable to Llama 3.1 8B).

posted an update 3 months ago

Post

1082

𝐌𝐲 𝐟𝐢𝐫𝐬𝐭 𝐜𝐨𝐦𝐦𝐮𝐧𝐢𝐭𝐲 𝐚𝐫𝐭𝐢𝐜𝐥𝐞! 𝐒𝐞𝐥𝐞𝐜𝐭𝐢𝐯𝐞 𝐟𝐢𝐧𝐞-𝐭𝐮𝐧𝐢𝐧𝐠 𝐰𝐢𝐭𝐡 𝐒𝐩𝐞𝐜𝐭𝐫𝐮𝐦 🎯

Full walkthrough on how to get started with Spectrum and TRL for efficient fine-tuning.
📔 👣 https://huggingface.co/blog/anakin87/spectrum

---

Looking to fine-tune Language Models efficiently and save on computational resources?

One popular method is QLoRa, which quantizes the original model and trains low-rank adapters on top.
It's quite effective and uses less GPU than full fine-tuning.

However, QLoRa applies Low-Rank Adaptation uniformly across the entire model.

What if we could identify the most informative layers and only fine-tune those? 🤔

This is exactly what Spectrum does! 👇

🔬 Spectrum analyzes the weight matrices for all layers in a Language Model and calculates a Signal to Noise Ratio (SNR) for each one.
(It uses Random Matrix Theory and Marchenko-Pastur distribution to distinguish signal from noise.)

🎯 Based on a chosen percentage (say, 25%), Spectrum selects the most informative layers of each type (mlp.down_proj, self_attn.o_proj, etc.).

You can then ❄️ freeze the rest of the model and focus your 🏋️‍♂️ training on the chosen layers.

🏆 Results/Evaluation
- Spectrum is competitive with full fine-tuning and beats QLoRA on benchmarks.
- While QLoRA is more memory-efficient on a single GPU, Spectrum shines in distributed training setups.
- Great models trained with Spectrum: Dolphin models, Llama 3.1 Storm, numerous models by VAGO Solutions...

---

For a practical guide, check out the article above.

reacted to grimjim's post with 👀 3 months ago

Post

3227

I found this paper to be thought-provoking: "Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling" by Bansal, Hosseini, Agarwal, Tran, and Kazemi.
https://arxiv.org/abs/2408.16737
The direct implication is that smaller models could be used to create cost-effective synthetic datasets. And on that note, in the Gemma terms of use, Google explicitly claims no rights on outputs generated from those models, which means one is free to synthgen from the Gemma line. Meta's Llama 3 licence forbids synthetic generation of outputs if used to improve other models. Relevant Mistral, Qwen, and Yi models under the Apache 2.0 license are unrestricted for this purpose.

2 replies

·

replied to their post 3 months ago

This comment has been hidden

posted an update 3 months ago

Post

1623

💬 🇮🇹 Phi 3.5 mini ITA: a Small Language Model for Italian

Lately, I've spent some time fine-tuning language models.

Now I am happy to release Phi 3.5 mini ITA: a fine-tuned version of Phi-3.5-mini-instruct to improve performance on the Italian language

🔹 Small (3.82 B parameters) but capable model
🔹 128k context length

Chat with it on 🤗 Spaces: anakin87/Phi-3.5-mini-ITA
Model card: anakin87/Phi-3.5-mini-ITA

🗃️ Data
Supervised fine-tuning using a good mix of English and Italian data:
- mlabonne/FineTome-100k by @mlabonne
- efederici/capybara-claude-15k-ita by @efederici
🙏 Thanks to the authors for the datasets.

🎯 Targeted training with Spectrum
I used Spectrum, a relatively new technique for parameter-efficient learning.
The idea is to train only the layers of the model with high Signal-to-Noise Ratio (SNR) and ❄️ freeze the rest.
I trained the top 30% of model layers.

📝 Spectrum paper: https://arxiv.org/abs/2406.06623

📊 Vibe check and performance on Italian benchmarks seem encouraging

2 replies

·

reacted to efederici's post with ❤️ 3 months ago

Post

1551

Finally, I can post! 🚀

I created a Capybara-inspired Italian dataset by translating the initial instruction and running it through a pipeline to generate conversations. I used Claude Sonnet for translation and instruction generation, and Opus for generating the answers.

I hope this dataset proves useful for people working on 🇮🇹 language models.

⛁ Open sourcing the dataset here: efederici/capybara-claude-15k-ita

1 reply

·

reacted to gabrielmbmb's post with ❤️ 4 months ago

Post

2868

distilabel 1.3.0 is out! This release contains many core improvements and new tasks that help us building argilla/magpie-ultra-v0.1!

Distributed pipeline execution with Ray, new Magpie tasks, reward models, components for dataset diversity based on sentence embeddings, Argilla 2.0 compatibility and many more features!

Check the new release in GitHub: https://github.com/argilla-io/distilabel

reacted to Ameeeee's post with 🔥 4 months ago

Post

3534

❤️‍🔥 Just released version 2.0 of Argilla!

This small revolution includes:

🔌 You can now integrate with the Hugging Face Hub and get started in under five minutes.
🪂 A single Dataset class is now designed to handle multiple tasks.
🔧 It’s 100 times simpler to configure your dataset now with the new SDK!
📖 The documentation has been revamped to be cleaner and more user-friendly.
🍌 A new feature automates splitting annotation tasks among a team.
✍️ The layout has been made more flexible to accommodate many use cases.

Check out the release highlights for more details: https://github.com/argilla-io/argilla/releases/tag/v2.0.0

1 reply

·

reacted to merve's post with 😎🚀❤️ 4 months ago

Post

3656

At Hugging Face we have an open-source Cookbook with many applied AI recipes 📖
Here are some of the latest recipes contributed ⥥

- "Information Extraction with Haystack and NuExtract": Use Haystack and transformers to build structured data extraction pipelines using LLMs by @anakin87 https://huggingface.co/learn/cookbook/en/information_extraction_haystack_nuextract

- "Build RAG with Hugging Face and Milvus": Learn how to use Milvus with sentence transformers to build RAG pipelines https://huggingface.co/learn/cookbook/rag_with_hf_and_milvus

- "Code Search with Vector Embeddings and Qdrant": Search a codebase by building a retrieval pipeline using Qdrant and sentence transformers https://huggingface.co/learn/cookbook/code_search

- Data analyst agent: get your data’s insights in the blink of an eye ✨: great recipe by our own @m-ric showing how to build an agent that can do data analysis! 😱 https://huggingface.co/learn/cookbook/agent_data_analyst

reacted to giux78's post with ❤️ 4 months ago

Post

1639

We https://mii-llm.ai just released a new LLM Italian benchmark and a set of evaluation: MMLU-PRO-ITA

Thanks to @efederici who released efederici/MMLU-Pro-ita a machine translated version of MMLU-PRO and thanks to a community shared computational effort we published in the "Eval Aggiuntive" tab of https://huggingface.co/spaces/FinancialSupport/open_ita_llm_leaderboard the results on Italian open source LLMs.

If you want to deepen read the blog article on hf https://huggingface.co/blog/giux78/mmlu-pro-ita

replied to grimjim's post 5 months ago

Nice!

Have a look at my rap model (built with the same approach as MopeyMule): https://huggingface.co/anakin87/yo-Llama-3-8B-Instruct

reacted to grimjim's post with ❤️ 5 months ago

Post

2262

Below we experiment with negative merger weighting (-1.0!) using task arithmetic. Merge formula on the model card and in the repo itself.

This model is steered to behave opposite to what MopeyMule demonstrated.

Based on the implications of the merge technique, we also propose Orthogonalized Vector Adaptation (OVA). We also extract a LoRA of the counter-refusal abliteration steering vector.

The resulting merger is not a perfect model, but it's a behaviorally interesting model. The model name was inspired by a Philip K. Dick story.
grimjim/Llama-3-Perky-Pat-Instruct-8B

Refusal vector weights ready for use:
grimjim/Llama-3-Instruct-abliteration-OVA-8B
grimjim/Llama-3-Instruct-abliteration-LoRA-8B

3 replies

·

reacted to mrm8488's post with ❤️ 5 months ago

Post

4349

🚨Exciting news for the Multilingual Synthetic Data Community!🚨

I’ve taken inspiration from the MAGPIE paper on Llama-3-8B-instruct and extended its capabilities. Here’s what’s new!

🗞 The MAGPIE paper showcased that if you use the instruction-tuned version (Llama-3-8B-instruct) to generate synthetic instructions and then fine-tune the base version (Llama-3-8B) on this dataset, you can improve even the it-tuned version

🤔 While reading a script by Sebastian Raschka, PhD, I wondered: Could these advancements be replicated in other languages? Specifically, could they benefit non-English datasets?

🎉 And the answer is YES! At least for Spanish. I've successfully adapted the techniques for Spanish, proving the model's flexibility and multilingual capabilities.

👩‍💻 To make this accessible, I created a basic script (heavily inspired by the Sebastian Raschka one) that allows you to generate similar datasets using ollama models (initially phi and llama3) automatically and upload it to the Hugging Face Hub!
[Script](https://gist.github.com/mrm8488/4650a5e3cc45523798a527a3446eb312)

🔍 Explore the datasets 📚 generated using our new script!

- [Llama-3-8B](https://huggingface.co/datasets/mrm8488/dataset_llama3_5000_samples_es_4231_filtered)
- [Phi-3-medium](https://huggingface.co/datasets/mrm8488/dataset_phi3-medium_5000_samples_es_3906_filtered)
- [Phi-3-mini](https://huggingface.co/datasets/mrm8488/dataset_phi3_5000_samples_es_3282_filtered)

Note: These datasets have basic filtering. Apply additional quality filters before using them to fine-tune large language models.

Inspiration and base script:
https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama3-ollama.ipynb
https://www.linkedin.com/feed/update/urn:li:activity:7210982019751661568/

7 replies

·

posted an update 5 months ago

Post

1034

How to alter the behavior of a Language Model without fine-tuning or prompting? Say hello to 🎤 yo-Llama 🦙!

Model anakin87/yo-Llama-3-8B-Instruct

This experiment steers Llama-3-8B-Instruct to respond in a rap style.
How? Amplifying the rap direction in the activation space. 😎

𝐖𝐡𝐚𝐭 𝐬𝐩𝐚𝐫𝐤𝐞𝐝 𝐭𝐡𝐢𝐬 𝐢𝐝𝐞𝐚?

Lately, I got interested in mechanistic interpretability of LLMs.

💡 A recent paper, "Refusal in Language Models Is Mediated by a Single Direction," showed how to find the refusal direction in the activation space of Chat Language Models and either erase or amplify it.
A clever jailbreak method for open weights models.

Then, @failspy took it a step further by modifying the models to amplify different traits, such as making a model seem grumpy or irritable.

𝐇𝐨𝐰 𝐝𝐢𝐝 𝐈 𝐜𝐫𝐞𝐚𝐭𝐞 𝐲𝐨-𝐋𝐥𝐚𝐦𝐚?
(📓 notebook in the HF repository, heavily inspired by Failspy's work)

1️⃣ Load the Llama-3-8B-Instruct model.
2️⃣ Load 1024 examples from Alpaca (instruction dataset).
3️⃣ Prepare a system prompt to make the original model act like a rapper.
4️⃣ Run inference on the examples, with and without the system prompt, and cache the activations.
5️⃣ Compute the rap feature directions (one for each layer) from the activations.
6️⃣ Apply the feature directions one by one, checking the results on some examples.
7️⃣ Pick the best-performing feature direction.
8️⃣ Apply this feature direction and voilà!
yo-Llama-3-8B-Instruct is born! 🥳🎶

This was a fun experiment.

📚 Resources

Refusal in Language Models Is Mediated by a Single Direction - https://arxiv.org/abs/2406.11717

Uncensor any LLM with abliteration: great practical blog post by @mlabonne https://huggingface.co/blog/mlabonne/abliteration

Practical materials by @failspy
- abliterator library https://github.com/FailSpy/abliterator
- Llama-MopeyMule-3-8B-Instruct model (+ notebook) failspy/Llama-3-8B-Instruct-MopeyMule

reacted to fdaudens's post with ❤️ 5 months ago

Post

1871

Look at that 👀

Actual benchmarks have become too easy for recent models, much like grading high school students on middle school problems makes little sense. So the team worked on a new version of the Open LLM Leaderboard with new benchmarks.

Stellar work from @clefourrier @SaylorTwift and the team!

👉 Read the blog post: open-llm-leaderboard/blog
👉 Explore the leaderboard: open-llm-leaderboard/open_llm_leaderboard

1 reply

·

Stefano Fiorucci PRO

AI & ML interests

Articles

🇮🇹🇯🇵🇧🇷 Generating multilingual instruction datasets with Magpie 🐦‍⬛

Selective fine-tuning of Language Models with Spectrum

Organizations

anakin87's activity