David Berenstein

davidberenstein1957

AI & ML interests

Everything NLP and knowledge graphs

davidberenstein1957's activity

posted an update 2 days ago
🧶 We are launching distilabel DataCraft: get started with synthetic data using clicks and natural language!

🌊 Workflow
- Write down your custom GenAI use case
- Automatically generate system prompts
- Create sample datasets for quick iteration
- Produce full-scale datasets with customizable parameters
- Push generated datasets directly to the Hugging Face Hub

โšก๏ธ Powered by Argilla's distilabel and open source LLMs
๐Ÿ†“ Uses Free Serverless HF Inference Endpoints

💡 Use cases:
- Fine-tuning language models for specific domains
- Creating diverse datasets for robust model training
- Rapid prototyping of AI applications
- Generating synthetic data for privacy-sensitive projects

🚀 Start crafting your custom datasets today and do it more quickly, easily and privately with distilabel DataCraft!
argilla/distilabel-datacraft
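
For readers who prefer code over clicks, here is a minimal sketch of the kind of distilabel pipeline DataCraft builds for you, assuming access to the free serverless Inference Endpoints; the seed instruction and the target repo id are placeholders.

```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="datacraft-style-demo") as pipeline:
    # A hypothetical seed instruction; DataCraft derives these from your use case description.
    load = LoadDataFromDicts(
        data=[{"instruction": "Write a short FAQ entry about returns for an online shoe store."}]
    )
    generate = TextGeneration(
        llm=InferenceEndpointsLLM(  # free serverless Hugging Face Inference Endpoints
            model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
        )
    )
    load >> generate

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=True)
    distiset.push_to_hub("your-username/demo-synthetic-dataset")  # placeholder repo id
```

Scaling up to a full dataset is mostly a matter of swapping in a larger seed set and tweaking the generation parameters.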
posted an update 7 days ago
🦀 Is your SQL a bit rusty? I just created the Text To SQL Hub dataset explorer, which writes SQL queries over Hub datasets from natural-language input. It uses DuckDB, Llama 3.1 70B and the Hugging Face datasets-server API.

davidberenstein1957/text-to-sql-hub-datasets
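
The Space does the work for you, but the building blocks are simple. A rough sketch of the approach, where the dataset id, config and question are hypothetical and the generated SQL may need its markdown fences stripped:

```python
import duckdb
import pandas as pd
import requests
from huggingface_hub import InferenceClient

# 1. Sample a Hub dataset through the datasets-server API to learn its columns.
params = {"dataset": "scikit-learn/iris", "config": "default", "split": "train"}  # hypothetical
rows = requests.get(
    "https://datasets-server.huggingface.co/first-rows", params=params, timeout=30
).json()
columns = [feature["name"] for feature in rows["features"]]

# 2. Ask Llama 3.1 70B for a DuckDB query over a table named `data`.
client = InferenceClient("meta-llama/Meta-Llama-3.1-70B-Instruct")
question = "What is the average sepal length per species?"
prompt = (
    f"Table `data` has columns: {', '.join(columns)}. "
    f"Write one DuckDB SQL query that answers: {question}. Return only the SQL."
)
sql = client.chat_completion(
    messages=[{"role": "user", "content": prompt}], max_tokens=256
).choices[0].message.content

# 3. Load the sampled rows into DuckDB (via a pandas DataFrame) and run the query.
data = pd.DataFrame([r["row"] for r in rows["rows"]])
print(duckdb.sql(sql).df())
```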
posted an update 8 days ago
Distilabel and synthetic data community interviews - the outcomes

We've been doing interviews with community members to understand their needs around synthetic data. Many thanks to the participants. Note that the interviewees were sourced from our community, so the results will likely reflect that.

Things distilabel does well
- security and reliability, by caching generations and having serializable pipelines
- scaling up generation, by parallelising inference with Anyscale Ray
- solid implementations of state-of-the-art research papers

Things to improve
- communication about the fact that we already support structured generation (a sketch follows below)
- customization of existing prompt implementations is difficult
- creation of new tasks proves difficult
- arguments and parameters for tasks aren't available at first glance
- the learning curve can be steep
- more tutorials that represent real-life usage

Things to note
- people create both small-scale and large-scale datasets, up to millions of records
- people use synthetic data to move away from frontier model providers
- people mostly use 7B or 70B models for generation

Participate here: https://github.com/argilla-io/distilabel/issues
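
On the structured generation point: it is already there, it just needs better visibility. A hedged sketch, assuming the `structured_output` argument accepts a JSON format plus a Pydantic schema as described in the distilabel documentation:

```python
from pydantic import BaseModel

from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration


class Character(BaseModel):
    name: str
    role: str
    backstory: str


with Pipeline(name="structured-generation-demo") as pipeline:
    load = LoadDataFromDicts(data=[{"instruction": "Invent an RPG character."}])
    generate = TextGeneration(
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            # Constrain generations to valid JSON matching the Character schema.
            structured_output={"format": "json", "schema": Character},
        )
    )
    load >> generate

distiset = pipeline.run(use_cache=True)  # caching also covers the reliability point above
```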
posted an update 10 days ago
Interested in learning about everything Image?

With the rise of recent interest in Vision Language Models (VLMs), we decided to make a push to include an ImageField within Argilla! This means any open source developer can now work on better models for vision ML tasks too, and we would like to show you how.

We would love to introduce this new feature to you, so we've prepared a set of notebooks to go over some common image scenarios:
- fine-tune a CLIP retrieval model with Sentence Transformers (a minimal retrieval sketch follows at the end of this post)
- use ColPali + Qwen VL for RAG and log the results to Argilla
- image-generation preference: creating multi-modal preference datasets for free using Hugging Face Inference Endpoints

See you on Thursday!

https://lu.ma/x7id1jqu
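
As a teaser for the first notebook, here is a minimal CLIP retrieval sketch with Sentence Transformers (the image file names are placeholders); the notebook goes further and fine-tunes the model.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# CLIP embeds images and texts into the same vector space.
model = SentenceTransformer("clip-ViT-B-32")

image_embeddings = model.encode([Image.open("cat.jpg"), Image.open("dog.jpg")])  # placeholder files
text_embeddings = model.encode(["a photo of a cat", "a photo of a dog"])

# Rows are texts, columns are images; the highest score per row is the retrieved image.
print(util.cos_sim(text_embeddings, image_embeddings))
```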
posted an update 12 days ago
🌟 Argilla v2.1.0 goes multi-modal: Image Field, Dark Mode, enhanced Hugging Face Hub imports and more!

🖼 Image Field: Seamlessly work with multimodal datasets
🌓 Dark Mode: Reduce eye strain with our sleek new look
🤗 Enhanced Hugging Face Hub import with the SDK
🇪🇸 Spanish UI: Breaking language barriers

Plus more improvements to supercharge your model curation workflow!

Check out the full announcement for details and code examples: https://github.com/argilla-io/argilla/compare/v2.0.1...v2.1.0
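
A quick sketch of the new Image Field with the 2.x SDK, assuming a running Argilla instance; the API URL, key and image URL are placeholders.

```python
import argilla as rg

client = rg.Argilla(api_url="https://your-argilla-space.hf.space", api_key="your-api-key")  # placeholders

settings = rg.Settings(
    fields=[rg.ImageField(name="image")],
    questions=[rg.LabelQuestion(name="label", labels=["cat", "dog"])],
)
dataset = rg.Dataset(name="image-classification-demo", settings=settings, client=client)
dataset.create()

# Records can reference image URLs (or data URIs); values matching question names become suggestions.
dataset.records.log([{"image": "https://example.com/cat.png", "label": "cat"}])
```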
posted an update 14 days ago
🔥 Dataset Viber 0.3 launches with Synthesizer: synthesise data with a human in the loop, for free, using open source models through Argilla's distilabel, all within a quick-and-easy Gradio interface.

Why? No need to be all fancy and formal just to iterate on your data and get familiar with your prompts and the produced output. Under the hood, it relies on Hugging Face Inference Endpoints and the latest LLMs and VLMs, like Meta Llama 3.1 and Black Forest Labs' Flux models.

This adds to the interfaces that are already supported:
- CollectorInterface: Lazily collect data of model interactions without human annotation.
- AnnotatorInterface: Walk through your data and annotate it with models in the loop.
- Synthesizer: Synthesize data with distilabel in the loop.
- BulkInterface: Explore your data distribution and annotate in bulk.

โญ๏ธ Give some good vibes: https://github.com/davidberenstein1957/dataset-viber
posted an update 23 days ago
๐Ÿ†• ๐Ÿš€ ๐ŸŽ fast-sentence-transformers - simply, faster, sentence-transformers

- Released an initial version a while ago
- Archived it because of a cleaner solution described in a blog by Philipp Schmid
- Reimplemented it based on that cleaner solution
- Unarchived the project
- Packaged it up
- Released a 0.5 version

pip install fast-sentence-transformers

https://github.com/davidberenstein1957/fast-sentence-transformers
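
Usage stays a drop-in replacement for sentence-transformers, something along these lines (check the README for the exact interface):

```python
from fast_sentence_transformers import FastSentenceTransformer as SentenceTransformer

# Same encode API as sentence-transformers, with ONNX doing the heavy lifting.
encoder = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")
embeddings = encoder.encode(["Hello world", "Fast embeddings without a GPU"])
print(embeddings.shape)
```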
posted an update 28 days ago
🎉 Just dropped a fresh version of dataset-viber along with some cool, Gradio-based annotators! These tools aren't about formalities: they're here to help you quickly collect feedback and move your projects along to a more serious stage, ahem @argilla.

Some new features!
- manual import from a CSV or the Hugging Face Hub
- manual export to CSV or the Hub
- improved automated export to the Hub and CSV
- limit interaction with specific components
- stream data with custom next_input features (shout-out to Ben Burtenshaw for the suggestions)
- model in-the-loop support for all tasks

dataset-viber/gradio-annotators-66c5ce73d5e3bf99caa445b1
posted an update 30 days ago
🚀 We will be generating a preference dataset for DPO/ORPO and cleaning it with AI feedback during our upcoming meetup!

In this session, we'll walk you through the essentials of building a distilabel pipeline by exploring two key use cases: cleaning an existing dataset and generating a preference dataset for DPO/ORPO. You'll also learn how to make the most of AI feedback, integrating Argilla to gather human feedback and improve the overall data quality.

This session is perfect for you
- if you're getting started with distilabel or synthetic data
- if you want to learn how to use LLM inference endpoints for free
- if you want to discover new functionalities
- if you want to provide us with new feedback

Sign up here: https://lu.ma/dt0c7jru
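
If you cannot wait for the meetup, the preference-generation use case roughly looks like this in distilabel: two models answer each prompt, their answers are grouped, and an LLM judge provides AI feedback with UltraFeedback. This is a sketch, not the workshop code; the prompts dataset is a placeholder and is assumed to have an `instruction` column.

```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import GroupColumns, LoadDataFromHub
from distilabel.steps.tasks import TextGeneration, UltraFeedback

with Pipeline(name="preference-dataset-demo") as pipeline:
    load = LoadDataFromHub(repo_id="your-username/prompts", split="train")  # placeholder dataset

    # Two different models answer every prompt.
    generators = [
        TextGeneration(llm=InferenceEndpointsLLM(model_id=model_id))
        for model_id in (
            "meta-llama/Meta-Llama-3.1-8B-Instruct",
            "mistralai/Mistral-7B-Instruct-v0.3",
        )
    ]

    # Put both answers on one row so they can be compared.
    group = GroupColumns(
        columns=["generation", "model_name"],
        output_columns=["generations", "model_names"],
    )

    # AI feedback: an LLM judge rates the candidate responses.
    rate = UltraFeedback(
        llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3.1-70B-Instruct")
    )

    for generate in generators:
        load >> generate >> group
    group >> rate

distiset = pipeline.run(use_cache=True)
```

The rated rows can then be logged to Argilla for human review before training with DPO/ORPO.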
posted an update about 1 month ago
📣 Introducing Dataset Viber: your chill repo for data collection, annotation and vibe checks! 🎉

I've cooked up Dataset Viber, a set of cool tools designed to make data preparation for AI models easier, more approachable and enjoyable for standalone AI engineers and enthusiasts.

🔧 What Dataset Viber offers:
- CollectorInterface: Lazily collect model interaction data without human annotation
- AnnotatorInterface: Annotate your data with models in the loop
- BulkInterface: Explore data distribution and annotate in bulk
- Embedder: Efficiently embed data with ONNX-optimized speeds

🎯 Key features:
- Supports various tasks for text, chat, and image modalities
- Runs in .ipynb notebooks
- Logs data to local CSV or directly to Hugging Face Hub
- Easy to install via pip: pip install dataset-viber

It's not designed for team collaboration or production use, but rather as a fun and efficient toolkit for individual projects.
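
To give a flavour of the model-in-the-loop idea (a generic Gradio sketch, not Dataset Viber's own API): a zero-shot classifier proposes a label, the annotator confirms or corrects it, and the pair is appended to a CSV.

```python
import csv

import gradio as gr
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
LABELS = ["positive", "negative"]


def suggest(text: str) -> str:
    # Model in the loop: propose a label that the annotator can overrule.
    return classifier(text, candidate_labels=LABELS)["labels"][0]


def save(text: str, label: str) -> str:
    with open("annotations.csv", "a", newline="") as f:
        csv.writer(f).writerow([text, label])
    return f"Saved annotation: {label}"


with gr.Blocks() as demo:
    text = gr.Textbox(label="Text to annotate")
    label = gr.Radio(LABELS, label="Label")
    status = gr.Markdown()
    text.submit(suggest, inputs=text, outputs=label)  # pre-fill the label on Enter
    gr.Button("Save").click(save, inputs=[text, label], outputs=status)

demo.launch()
```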

Want to give it a try? Check out the repository link https://github.com/davidberenstein1957/dataset-viber/.

I'm excited to hear your feedback and learn how you vibe with your data. Feel free to open an issue or reach out if you have any questions or suggestions!

Some shoutouts:
- Gradio for the amazing backbone
- Daniel van Strien for some initial presentations I did on vibe checks
- Emily Omier for the workshop on structuring GitHub repo READMEs
- Hamel Husain for constantly reminding people to look at their data
- Philipp Schmid for his code for ONNX feature-extractors
- Ben Burtenshaw for the first PR
posted an update about 2 months ago
โš—๏ธ Find reusable synthetic data pipeline code and corresponding datasets on the @huggingface Hub.

Find your pipeline and use: $ distilabel pipeline run --config "hugging_face_dataset_url/pipeline.yaml"

Some components I used
- Embedded dataset viewer https://huggingface.co/docs/hub/main/en/datasets-viewer-embed
- Hugging Face fsspec https://huggingface.co/docs/huggingface_hub/main/en/guides/hf_file_system
- distilabel https://distilabel.argilla.io/latest/
- Gradio leaderboard by Freddy Boulton freddyaboulton/gradio_leaderboard
- Gradio modal by Ali Abid

Space: davidberenstein1957/distilabel-synthetic-data-pipeline-explorer
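
A small sketch of how to locate such a pipeline file programmatically with the Hugging Face file system, using a placeholder repo id, before re-running it with the CLI command above:

```python
from huggingface_hub import HfFileSystem

fs = HfFileSystem()
repo = "datasets/your-username/your-synthetic-dataset"  # placeholder repo id

# distilabel pushes the pipeline definition alongside the data, typically as pipeline.yaml.
yaml_files = [path for path in fs.ls(repo, detail=False) if path.endswith(".yaml")]
print(yaml_files)

# Reproduce the dataset locally:
#   distilabel pipeline run --config "https://huggingface.co/<repo>/resolve/main/pipeline.yaml"
```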
posted an update about 2 months ago
The Meta Llama 3.1 model series can be used for distilling and fine-tuning, but this requires annotated preference data, so I created a Human Feedback Collector based on Gradio that logs data directly to the Hugging Face Hub.

- Model meta-llama/Meta-Llama-3.1-8B-Instruct
- Data SFT, KTO and DPO data
- Runs on free Zero GPUs in Hugging Face Spaces
- Might need some human curation in Argilla
- Or provide some AI feedback with distilabel

https://huggingface.co/collections/davidberenstein1957/chatinterface-llm-human-feedback-collectors-66a22859c9e703d2af7500c1
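
The gist of such a collector, as a hedged Gradio sketch rather than the exact Space code: Llama 3.1 8B answers via serverless inference, and thumbs up/down clicks on assistant messages are captured as preference signal (persisting it to the Hub is left out here).

```python
import gradio as gr
from huggingface_hub import InferenceClient

client = InferenceClient("meta-llama/Meta-Llama-3.1-8B-Instruct")
feedback_log = []  # a real collector would push this to a Hub dataset


def respond(message, history):
    messages = []
    for user_msg, assistant_msg in history:
        messages += [
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": assistant_msg},
        ]
    messages.append({"role": "user", "content": message})
    reply = client.chat_completion(messages=messages, max_tokens=512).choices[0].message.content
    return "", history + [(message, reply)]


def record_vote(vote: gr.LikeData):
    # A thumbs up/down on an assistant message becomes (content, liked), i.e. KTO-style signal.
    feedback_log.append({"content": vote.value, "liked": vote.liked})


with gr.Blocks() as demo:
    chatbot = gr.Chatbot(label="Llama 3.1 8B")
    prompt = gr.Textbox(label="Message")
    prompt.submit(respond, inputs=[prompt, chatbot], outputs=[prompt, chatbot])
    chatbot.like(record_vote, None, None)

demo.launch()
```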
posted an update about 2 months ago
Questions about data, synthetic data, human feedback and data quality?

Argilla has moved its community from Slack to the Hugging Face Discord server!

Once you're part of the Hugging Face Discord, you can select "Channels & roles" and pick "Argilla" along with any of the other groups that interest you. "Argilla" covers anything about Argilla and distilabel, and it gives you access to 1) #argilla-distilabel-general, for all general discussions and news, and 2) #argilla-distilabel-help, for any usage-focused questions.
posted an update 3 months ago
โš—๏ธ Looking to get started with Synthetic data and AI Feedback?

I created this cool notebook for a workshop @davanstrien and I gave a couple of weeks back. It uses https://distilabel.argilla.io/dev/ and I think it is a good entry point for anyone with a practical interest in the topic.

https://colab.research.google.com/github/davanstrien/data-for-fine-tuning-llms/blob/main/03-synthetic-data-generation.ipynb
posted an update 6 months ago
🔥🆕🆕🔥 Dataset Drop: 4 KTO-signal-transformed versions of the much-loved Argilla DPO datasets.

KTO formats for:
- UltraFeedback Cleaned Binarized
- Distilabel Intel Orca
- Distilabel Capybara
- DPO mix

argilla/preference-datasets-for-kto-65f98314d7c1b04ab54d41a7

Paper claims :)

https://arxiv.org/abs/2402.01306

KTO matches or exceeds DPO performance at scales from 1B to 30B parameters. That is, taking a preference dataset of n DPO pairs and breaking it up into 2n examples for KTO can yield better generations, despite the model ostensibly learning from a weaker signal.

KTO can handle extreme data imbalances, matching DPO performance while using up to 90% fewer desirable examples (i.e., examples of good generations). Its success thus cannot be ascribed to the alignment data being sourced from a preference dataset.

When the pretrained model is sufficiently good, one can skip supervised finetuning and go straight to KTO without a loss in generation quality. In contrast, we find that without doing SFT first, DPO-aligned models are significantly worse at all scales.

Do you need something custom? Take a look at @davanstrien's guide on creating your own KTO dataset with Argilla and our community.

https://github.com/huggingface/data-is-better-together/tree/main/kto-preference
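
The transformation itself is small. A sketch with a placeholder repo id, assuming string columns named prompt/chosen/rejected (the actual Argilla datasets use their own column names, so adapt accordingly): each DPO pair becomes two KTO rows, one desirable and one undesirable.

```python
from datasets import load_dataset


def dpo_to_kto(batch):
    # One (prompt, chosen, rejected) pair -> two KTO examples with a boolean label.
    prompts, completions, labels = [], [], []
    for prompt, chosen, rejected in zip(batch["prompt"], batch["chosen"], batch["rejected"]):
        prompts += [prompt, prompt]
        completions += [chosen, rejected]
        labels += [True, False]
    return {"prompt": prompts, "completion": completions, "label": labels}


dpo = load_dataset("your-username/dpo-pairs", split="train")  # placeholder repo id
kto = dpo.map(dpo_to_kto, batched=True, remove_columns=dpo.column_names)
```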