David Berenstein

davidberenstein1957

AI & ML interests

Everything NLP and knowledge graphs

davidberenstein1957's activity

posted an update 2 days ago
🧶 We are launching distilabel DataCraft: get started with synthetic data using clicks and natural language!

🌊 Workflow
- Write down your custom GenAI use case
- Automatically generate system prompts
- Create sample datasets for quick iteration
- Produce full-scale datasets with customizable parameters
- Push generated datasets directly to the Hugging Face Hub

โšก๏ธ Powered by Argilla's distilabel and open source LLMs
๐Ÿ†“ Uses Free Serverless HF Inference Endpoints

💡 Use cases:
- Fine-tuning language models for specific domains
- Creating diverse datasets for robust model training
- Rapid prototyping of AI applications
- Generating synthetic data for privacy-sensitive projects

🚀 Start crafting your custom datasets today and do it more quickly, easily and privately with distilabel DataCraft!
argilla/distilabel-datacraft
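
For readers who prefer code over clicks, here is a minimal sketch of the kind of distilabel pipeline DataCraft builds for you, assuming access to the free serverless Inference Endpoints; the seed instruction and the target repo id are placeholders.

```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="datacraft-style-demo") as pipeline:
    # A hypothetical seed instruction; DataCraft derives these from your use case description.
    load = LoadDataFromDicts(
        data=[{"instruction": "Write a short FAQ entry about returns for an online shoe store."}]
    )
    generate = TextGeneration(
        llm=InferenceEndpointsLLM(  # free serverless Hugging Face Inference Endpoints
            model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
        )
    )
    load >> generate

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=True)
    distiset.push_to_hub("your-username/demo-synthetic-dataset")  # placeholder repo id
```

Scaling up to a full dataset is mostly a matter of swapping in a larger seed set and tweaking the generation parameters.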
posted an update 7 days ago
🦀 Is your SQL a bit rusty? I just created the Text To SQL Hub dataset explorer, which writes SQL queries over Hub datasets from natural-language input. It uses DuckDB, Llama 3.1 70B and the Hugging Face datasets-server API.

davidberenstein1957/text-to-sql-hub-datasets
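
The Space does the work for you, but the building blocks are simple. A rough sketch of the approach, where the dataset id, config and question are hypothetical and the generated SQL may need its markdown fences stripped:

```python
import duckdb
import pandas as pd
import requests
from huggingface_hub import InferenceClient

# 1. Sample a Hub dataset through the datasets-server API to learn its columns.
params = {"dataset": "scikit-learn/iris", "config": "default", "split": "train"}  # hypothetical
rows = requests.get(
    "https://datasets-server.huggingface.co/first-rows", params=params, timeout=30
).json()
columns = [feature["name"] for feature in rows["features"]]

# 2. Ask Llama 3.1 70B for a DuckDB query over a table named `data`.
client = InferenceClient("meta-llama/Meta-Llama-3.1-70B-Instruct")
question = "What is the average sepal length per species?"
prompt = (
    f"Table `data` has columns: {', '.join(columns)}. "
    f"Write one DuckDB SQL query that answers: {question}. Return only the SQL."
)
sql = client.chat_completion(
    messages=[{"role": "user", "content": prompt}], max_tokens=256
).choices[0].message.content

# 3. Load the sampled rows into DuckDB (via a pandas DataFrame) and run the query.
data = pd.DataFrame([r["row"] for r in rows["rows"]])
print(duckdb.sql(sql).df())
```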
posted an update 8 days ago
Distilabel and synthetic data community interviews - the outcomes

We've been doing interviews with community members to understand their needs around synthetic data. Many thanks to the participants. Note that the interviewees were sourced from our community, so the results will likely reflect that.

Things distilabel does well
- security and reliability, by caching generations and having serializable pipelines
- scaling up generation, by parallelising inference with Anyscale Ray
- solid implementations of state-of-the-art research papers

Things to improve
- communication about the fact that we already support structured generation (a sketch follows below)
- customization of existing prompt implementations is difficult
- creation of new tasks proves difficult
- arguments and parameters for tasks aren't available at first glance
- the learning curve can be steep
- more tutorials that represent real-life usage

Things to note
- people create both small-scale and large-scale datasets, up to millions of records
- people use synthetic data to move away from frontier model providers
- people mostly use 7B or 70B models for generation

Participate here: https://github.com/argilla-io/distilabel/issues
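
On the structured generation point: it is already there, it just needs better visibility. A hedged sketch, assuming the `structured_output` argument accepts a JSON format plus a Pydantic schema as described in the distilabel documentation:

```python
from pydantic import BaseModel

from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration


class Character(BaseModel):
    name: str
    role: str
    backstory: str


with Pipeline(name="structured-generation-demo") as pipeline:
    load = LoadDataFromDicts(data=[{"instruction": "Invent an RPG character."}])
    generate = TextGeneration(
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            # Constrain generations to valid JSON matching the Character schema.
            structured_output={"format": "json", "schema": Character},
        )
    )
    load >> generate

distiset = pipeline.run(use_cache=True)  # caching also covers the reliability point above
```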
posted an update 10 days ago
Interested in learning about everything Image?

With the rise of recent interest in Vision Language Models (VLMs), we decided to make a push to include an ImageField within Argilla! This means any open source developer can now work on better models for vision ML tasks too, and we would like to show you how.

We would love to introduce this new feature to you, so we've prepared a set of notebooks to go over some common image scenarios:
- fine-tune a CLIP retrieval model with Sentence Transformers (a minimal retrieval sketch follows at the end of this post)
- use ColPali + Qwen VL for RAG and log the results to Argilla
- image-generation preference: creating multi-modal preference datasets for free using Hugging Face Inference Endpoints

See you on Thursday!

https://lu.ma/x7id1jqu
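
As a teaser for the first notebook, here is a minimal CLIP retrieval sketch with Sentence Transformers (the image file names are placeholders); the notebook goes further and fine-tunes the model.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# CLIP embeds images and texts into the same vector space.
model = SentenceTransformer("clip-ViT-B-32")

image_embeddings = model.encode([Image.open("cat.jpg"), Image.open("dog.jpg")])  # placeholder files
text_embeddings = model.encode(["a photo of a cat", "a photo of a dog"])

# Rows are texts, columns are images; the highest score per row is the retrieved image.
print(util.cos_sim(text_embeddings, image_embeddings))
```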
posted an update 12 days ago
🌟 Argilla v2.1.0 goes multi-modal: Image Field, Dark Mode, enhanced Hugging Face Hub imports and more!

🖼 Image Field: Seamlessly work with multimodal datasets
🌓 Dark Mode: Reduce eye strain with our sleek new look
🤗 Enhanced Hugging Face Hub import with the SDK
🇪🇸 Spanish UI: Breaking language barriers

Plus more improvements to supercharge your model curation workflow!

Check out the full announcement for details and code examples: https://github.com/argilla-io/argilla/compare/v2.0.1...v2.1.0
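
A quick sketch of the new Image Field with the 2.x SDK, assuming a running Argilla instance; the API URL, key and image URL are placeholders.

```python
import argilla as rg

client = rg.Argilla(api_url="https://your-argilla-space.hf.space", api_key="your-api-key")  # placeholders

settings = rg.Settings(
    fields=[rg.ImageField(name="image")],
    questions=[rg.LabelQuestion(name="label", labels=["cat", "dog"])],
)
dataset = rg.Dataset(name="image-classification-demo", settings=settings, client=client)
dataset.create()

# Records can reference image URLs (or data URIs); values matching question names become suggestions.
dataset.records.log([{"image": "https://example.com/cat.png", "label": "cat"}])
```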
posted an update 14 days ago
🔥 Dataset Viber 0.3 launches with Synthesizer: synthesise data with a human in the loop, for free, using open source models through Argilla's distilabel, all within a quick-and-easy Gradio interface.

Why? No need to be all fancy and formal just to iterate on your data and get familiar with your prompts and the produced output. Under the hood, it relies on Hugging Face Inference Endpoints and the latest LLMs and VLMs, like Meta Llama 3.1 and Black Forest Labs' Flux models.

This adds to the interfaces that are already supported:
- CollectorInterface: Lazily collect data of model interactions without human annotation.
- AnnotatorInterface: Walk through your data and annotate it with models in the loop.
- Synthesizer: Synthesize data with distilabel in the loop.
- BulkInterface: Explore your data distribution and annotate in bulk.

โญ๏ธ Give some good vibes: https://github.com/davidberenstein1957/dataset-viber
posted an update 23 days ago
๐Ÿ†• ๐Ÿš€ ๐ŸŽ fast-sentence-transformers - simply, faster, sentence-transformers

- Released an initial version a while ago
- Archived it because of a cleaner solution described in a blog by Philipp Schmid
- Reimplemented it based on that cleaner solution
- Unarchived the project
- Packaged it up
- Released a 0.5 version

pip install fast-sentence-transformers

https://github.com/davidberenstein1957/fast-sentence-transformers
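
Usage stays a drop-in replacement for sentence-transformers, something along these lines (check the README for the exact interface):

```python
from fast_sentence_transformers import FastSentenceTransformer as SentenceTransformer

# Same encode API as sentence-transformers, with ONNX doing the heavy lifting.
encoder = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")
embeddings = encoder.encode(["Hello world", "Fast embeddings without a GPU"])
print(embeddings.shape)
```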
posted an update 28 days ago
🎉 Just dropped a fresh version of dataset-viber along with some cool, Gradio-based annotators! These tools aren't about formalities: they're here to help you quickly collect feedback and move your projects along to a more serious stage, ahem @argilla.

Some new features!
- manual import from a CSV or the Hugging Face Hub
- manual export to CSV or the Hub
- improved automated export to the Hub and CSV
- limit interaction with specific components
- stream data with custom next_input features (shout-out to Ben Burtenshaw for the suggestions)
- model in-the-loop support for all tasks

dataset-viber/gradio-annotators-66c5ce73d5e3bf99caa445b1
posted an update 30 days ago
🚀 We will be generating a preference dataset for DPO/ORPO and cleaning it with AI feedback during our upcoming meetup!

In this session, we'll walk you through the essentials of building a distilabel pipeline by exploring two key use cases: cleaning an existing dataset and generating a preference dataset for DPO/ORPO. You'll also learn how to make the most of AI feedback, integrating Argilla to gather human feedback and improve the overall data quality.

This session is perfect for you
- if you're getting started with distilabel or synthetic data
- if you want to learn how to use LLM inference endpoints for free
- if you want to discover new functionalities
- if you want to provide us with new feedback

Sign up here: https://lu.ma/dt0c7jru
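
If you cannot wait for the meetup, the preference-generation use case roughly looks like this in distilabel: two models answer each prompt, their answers are grouped, and an LLM judge provides AI feedback with UltraFeedback. This is a sketch, not the workshop code; the prompts dataset is a placeholder and is assumed to have an `instruction` column.

```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import GroupColumns, LoadDataFromHub
from distilabel.steps.tasks import TextGeneration, UltraFeedback

with Pipeline(name="preference-dataset-demo") as pipeline:
    load = LoadDataFromHub(repo_id="your-username/prompts", split="train")  # placeholder dataset

    # Two different models answer every prompt.
    generators = [
        TextGeneration(llm=InferenceEndpointsLLM(model_id=model_id))
        for model_id in (
            "meta-llama/Meta-Llama-3.1-8B-Instruct",
            "mistralai/Mistral-7B-Instruct-v0.3",
        )
    ]

    # Put both answers on one row so they can be compared.
    group = GroupColumns(
        columns=["generation", "model_name"],
        output_columns=["generations", "model_names"],
    )

    # AI feedback: an LLM judge rates the candidate responses.
    rate = UltraFeedback(
        llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3.1-70B-Instruct")
    )

    for generate in generators:
        load >> generate >> group
    group >> rate

distiset = pipeline.run(use_cache=True)
```

The rated rows can then be logged to Argilla for human review before training with DPO/ORPO.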
posted an update about 1 month ago
📣 Introducing Dataset Viber: your chill repo for data collection, annotation and vibe checks! 🎉

I've cooked up Dataset Viber, a set of cool tools designed to make data preparation for AI models easier, more approachable and enjoyable for standalone AI engineers and enthusiasts.

🔧 What Dataset Viber offers:
- CollectorInterface: Lazily collect model interaction data without human annotation
- AnnotatorInterface: Annotate your data with models in the loop
- BulkInterface: Explore data distribution and annotate in bulk
- Embedder: Efficiently embed data with ONNX-optimized speeds

🎯 Key features:
- Supports various tasks for text, chat, and image modalities
- Runs in .ipynb notebooks
- Logs data to local CSV or directly to Hugging Face Hub
- Easy to install via pip: pip install dataset-viber

It's not designed for team collaboration or production use, but rather as a fun and efficient toolkit for individual projects.
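
To give a flavour of the model-in-the-loop idea (a generic Gradio sketch, not Dataset Viber's own API): a zero-shot classifier proposes a label, the annotator confirms or corrects it, and the pair is appended to a CSV.

```python
import csv

import gradio as gr
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
LABELS = ["positive", "negative"]


def suggest(text: str) -> str:
    # Model in the loop: propose a label that the annotator can overrule.
    return classifier(text, candidate_labels=LABELS)["labels"][0]


def save(text: str, label: str) -> str:
    with open("annotations.csv", "a", newline="") as f:
        csv.writer(f).writerow([text, label])
    return f"Saved annotation: {label}"


with gr.Blocks() as demo:
    text = gr.Textbox(label="Text to annotate")
    label = gr.Radio(LABELS, label="Label")
    status = gr.Markdown()
    text.submit(suggest, inputs=text, outputs=label)  # pre-fill the label on Enter
    gr.Button("Save").click(save, inputs=[text, label], outputs=status)

demo.launch()
```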

Want to give it a try? Check out the repository link https://github.com/davidberenstein1957/dataset-viber/.

I'm excited to hear your feedback and learn how you vibe with your data. Feel free to open an issue or reach out if you have any questions or suggestions!

Some shoutouts:
- Gradio for the amazing backbone
- Daniel van Strien for some initial presentations I did on vibe checks
- Emily Omier for the workshop on structuring GitHub repo READMEs
- Hamel Husain for constantly reminding people to look at their data
- Philipp Schmid for his code for ONNX feature-extractors
- Ben Burtenshaw for the first PR
posted an update about 2 months ago
โš—๏ธ Find reusable synthetic data pipeline code and corresponding datasets on the @huggingface Hub.

Find your pipeline and use: $ distilabel pipeline run --config "hugging_face_dataset_url/pipeline.yaml"

Some components I used
- Embedded dataset viewer https://huggingface.co/docs/hub/main/en/datasets-viewer-embed
- Hugging Face fsspec https://huggingface.co/docs/huggingface_hub/main/en/guides/hf_file_system
- distilabel https://distilabel.argilla.io/latest/
- Gradio leaderboard by Freddy Boulton freddyaboulton/gradio_leaderboard
- Gradio modal by Ali Abid

Space: davidberenstein1957/distilabel-synthetic-data-pipeline-explorer
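
A small sketch of how to locate such a pipeline file programmatically with the Hugging Face file system, using a placeholder repo id, before re-running it with the CLI command above:

```python
from huggingface_hub import HfFileSystem

fs = HfFileSystem()
repo = "datasets/your-username/your-synthetic-dataset"  # placeholder repo id

# distilabel pushes the pipeline definition alongside the data, typically as pipeline.yaml.
yaml_files = [path for path in fs.ls(repo, detail=False) if path.endswith(".yaml")]
print(yaml_files)

# Reproduce the dataset locally:
#   distilabel pipeline run --config "https://huggingface.co/<repo>/resolve/main/pipeline.yaml"
```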
posted an update about 2 months ago
The Meta Llama 3.1 model series can be used for distilling and fine-tuning, but this requires annotated preference data, so I created a Human Feedback Collector based on Gradio that logs data directly to the Hugging Face Hub.

- Model meta-llama/Meta-Llama-3.1-8B-Instruct
- Data SFT, KTO and DPO data
- Runs on free Zero GPUs in Hugging Face Spaces
- Might need some human curation in Argilla
- Or provide some AI feedback with distilabel

https://huggingface.co/collections/davidberenstein1957/chatinterface-llm-human-feedback-collectors-66a22859c9e703d2af7500c1
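
The gist of such a collector, as a hedged Gradio sketch rather than the exact Space code: Llama 3.1 8B answers via serverless inference, and thumbs up/down clicks on assistant messages are captured as preference signal (persisting it to the Hub is left out here).

```python
import gradio as gr
from huggingface_hub import InferenceClient

client = InferenceClient("meta-llama/Meta-Llama-3.1-8B-Instruct")
feedback_log = []  # a real collector would push this to a Hub dataset


def respond(message, history):
    messages = []
    for user_msg, assistant_msg in history:
        messages += [
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": assistant_msg},
        ]
    messages.append({"role": "user", "content": message})
    reply = client.chat_completion(messages=messages, max_tokens=512).choices[0].message.content
    return "", history + [(message, reply)]


def record_vote(vote: gr.LikeData):
    # A thumbs up/down on an assistant message becomes (content, liked), i.e. KTO-style signal.
    feedback_log.append({"content": vote.value, "liked": vote.liked})


with gr.Blocks() as demo:
    chatbot = gr.Chatbot(label="Llama 3.1 8B")
    prompt = gr.Textbox(label="Message")
    prompt.submit(respond, inputs=[prompt, chatbot], outputs=[prompt, chatbot])
    chatbot.like(record_vote, None, None)

demo.launch()
```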
posted an update about 2 months ago
Questions about data, synthetic data, human feedback and data quality?

Argilla has moved its community from Slack to the Hugging Face Discord server!

Once you're part of the Hugging Face Discord, you can select "Channels & roles" and pick "Argilla" along with any of the other groups that interest you. "Argilla" covers anything about Argilla and distilabel, and it gives you access to 1) #argilla-distilabel-general, for all general discussions and news, and 2) #argilla-distilabel-help, for any usage-focused questions.
posted an update 3 months ago
โš—๏ธ Looking to get started with Synthetic data and AI Feedback?

I created this cool notebook for a workshop @davanstrien and I gave a couple of weeks back. It uses https://distilabel.argilla.io/dev/ and I think it is a good entry point for anyone with a practical interest in the topic.

https://colab.research.google.com/github/davanstrien/data-for-fine-tuning-llms/blob/main/03-synthetic-data-generation.ipynb
posted an update 6 months ago
🔥🆕🆕🔥 Dataset Drop: 4 KTO-signal-transformed versions of the much-loved Argilla DPO datasets.

KTO formats for:
- UltraFeedback Cleaned Binarized
- Distilabel Intel Orca
- Distilabel Capybara
- DPO mix

argilla/preference-datasets-for-kto-65f98314d7c1b04ab54d41a7

Paper claims :)

https://arxiv.org/abs/2402.01306

KTO matches or exceeds DPO performance at scales from 1B to 30B parameters. That is, taking a preference dataset of n DPO pairs and breaking it up into 2n examples for KTO can yield better generations, despite the model ostensibly learning from a weaker signal.

KTO can handle extreme data imbalances, matching DPO performance while using up to 90% fewer desirable examples (i.e., examples of good generations). Its success thus cannot be ascribed to the alignment data being sourced from a preference dataset.

When the pretrained model is sufficiently good, one can skip supervised finetuning and go straight to KTO without a loss in generation quality. In contrast, we find that without doing SFT first, DPO-aligned models are significantly worse at all scales.

Do you need something custom? Take a look at @davanstrien's guide on creating your own KTO dataset with Argilla and our community.

https://github.com/huggingface/data-is-better-together/tree/main/kto-preference
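
The transformation itself is small. A sketch with a placeholder repo id, assuming string columns named prompt/chosen/rejected (the actual Argilla datasets use their own column names, so adapt accordingly): each DPO pair becomes two KTO rows, one desirable and one undesirable.

```python
from datasets import load_dataset


def dpo_to_kto(batch):
    # One (prompt, chosen, rejected) pair -> two KTO examples with a boolean label.
    prompts, completions, labels = [], [], []
    for prompt, chosen, rejected in zip(batch["prompt"], batch["chosen"], batch["rejected"]):
        prompts += [prompt, prompt]
        completions += [chosen, rejected]
        labels += [True, False]
    return {"prompt": prompts, "completion": completions, "label": labels}


dpo = load_dataset("your-username/dpo-pairs", split="train")  # placeholder repo id
kto = dpo.map(dpo_to_kto, batched=True, remove_columns=dpo.column_names)
```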