Today is a huge day in Argilla’s history. We couldn’t be more excited to share this with the community: we’re joining Hugging Face!
We’re embracing a larger mission, becoming part of a brilliant and kind team and a shared vision about the future of AI.
Over the past year, we’ve been collaborating with Hugging Face on countless projects: becoming a launch partner of Docker Spaces, empowering the community to clean Alpaca translations into Spanish and other languages, launching argilla/notus-7b-v1 building on Zephyr’s learnings, running the Data is Better Together initiative with hundreds of community contributors, and releasing argilla/OpenHermesPreferences, one of the largest open preference-tuning datasets.
After more than 2,000 Slack messages and over 60 people collaborating for over a year, it already felt like we were part of the same team, pushing in the same direction. After a week of the smoothest transition you can imagine, we’re now the same team.
To those of you who’ve been following us, this won’t be a huge surprise, but it will be a big deal in the coming months. This acquisition means we’ll double down on empowering the community to build and collaborate on high-quality datasets, bring full support for multimodal datasets, and be in a better position to collaborate with the open-source AI community. For enterprises, it means the Enterprise Hub will unlock highly requested features like single sign-on and integration with Inference Endpoints.
As a founder, I am proud of the Argilla team. We’re now part of something bigger: a larger team with the same values, culture, and goals. Grateful to have shared this journey with my beloved co-founders Paco and Amélie.
Finally, huge thanks to the Chief Llama Officer @osanseviero for sparking this and being such a great partner during the acquisition process.
Would love to answer any questions you have, so feel free to add them below!
I created a Capybara-inspired Italian dataset by translating the initial instruction and running it through a pipeline to generate conversations. I used Claude Sonnet for translation and instruction generation, and Opus for generating the answers.
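For anyone curious about the mechanics, here is a minimal sketch of that kind of pipeline using the anthropic Python SDK. The prompts, turn structure, and helper names are illustrative assumptions, not the exact script I ran.

```python
# Illustrative sketch of the translate-then-converse pipeline (not the exact script used).
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

def ask(model: str, prompt: str) -> str:
    """Send a single user message to Claude and return the text of the reply."""
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

def build_conversation(english_instruction: str, num_turns: int = 2) -> list[dict]:
    # Step 1: translate the seed instruction into Italian with Sonnet.
    question = ask(
        "claude-3-sonnet-20240229",
        f"Translate the following instruction into Italian. Return only the translation.\n\n{english_instruction}",
    )
    conversation = []
    for _ in range(num_turns):
        conversation.append({"role": "user", "content": question})
        transcript = "\n\n".join(f'{m["role"]}: {m["content"]}' for m in conversation)
        # Step 2: answer each user turn with Opus.
        answer = ask(
            "claude-3-opus-20240229",
            f"Answer the last user message in Italian.\n\n{transcript}",
        )
        conversation.append({"role": "assistant", "content": answer})
        # Step 3: let Sonnet propose a follow-up question for the next turn.
        question = ask(
            "claude-3-sonnet-20240229",
            f"Write one short follow-up question in Italian for this conversation.\n\n{transcript}\n\nassistant: {answer}",
        )
    return conversation
```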
I hope this dataset proves useful for people working on 🇮🇹 language models.
@mik3ml just released ReDiX/wikipediaQA-ita, an interesting synthetic dataset generated from Wikipedia using a fine-tuned version of Mistral-7B specialized for the Italian language 🇮🇹.
A lot of these models use the MS MARCO Hard Negatives dataset, which I'm currently reformatting to be more easily usable. Notably, the reformatted data should work out of the box, without any pre-processing, for training embedding models in the upcoming Sentence Transformers v3.
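As a rough sketch of what "out of the box" means here, training on a triplet-format dataset with the v3 trainer API looks roughly like this; the dataset repository name and base model below are placeholders, and the exact column layout of the reformatted dataset may differ.

```python
# Sketch of training an embedding model on (anchor, positive, negative) triplets
# with the Sentence Transformers v3 trainer API. The dataset id and base model
# are placeholders; swap in the reformatted MS MARCO repository once published.
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("microsoft/mpnet-base")

# Triplet-format rows need no pre-processing before training.
train_dataset = load_dataset("your-username/msmarco-hard-negatives-triplets", split="train")

# In-batch negatives loss; the explicit hard negative makes each batch harder.
loss = MultipleNegativesRankingLoss(model)

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```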
This primer can serve as a comprehensive introduction to recent advances in interpretability for Transformer-based LMs for a technical audience, employing a unified notation to introduce network modules and present state-of-the-art interpretability methods.
Interpretability methods are presented with detailed formulations and categorized as either localizing the inputs or model components responsible for a particular prediction, or decoding information stored in learned representations. Various insights on the role of specific model components are then summarized, alongside recent work that uses model internals to guide model editing and mitigate hallucinations.
Finally, the paper provides a detailed picture of the open-source interpretability tools landscape, supporting the need for open-access models to advance interpretability research.
The Cauldron is a massive collection of 50 high-quality datasets, all converted to the user/assistant format, and ready to use to fine-tune any Vision Language Model.
The Cauldron covers a wide range of tasks, including general visual question answering, counting, captioning, text transcription, document understanding, chart/figure understanding, table understanding, visual reasoning, geometry, spotting differences between two images, and converting a screenshot to code.
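If you want to poke at it, loading one subset is a one-liner with datasets. The "ai2d" config below is just one example among the 50, and the row schema (one list of images plus user/assistant text pairs) is my reading of the dataset card.

```python
# Peek at one subset of The Cauldron; "ai2d" is one example config among the 50.
from datasets import load_dataset

ds = load_dataset("HuggingFaceM4/the_cauldron", "ai2d", split="train")

example = ds[0]
print(example["images"])        # list of PIL images for this sample
for turn in example["texts"]:   # list of user/assistant exchanges
    print("user:", turn["user"])
    print("assistant:", turn["assistant"])
```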
Is it time for the open-source AI robot revolution 🚀?
With @haixuantao and @Leyo, we’ve been playing with a low-cost DJI robot controlled by three local open-source AI models (Whisper, Idefics2, and Parler-TTS, all Apache 2.0) and orchestrated by Dora-rs.
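For a sense of how the three models chain together, here is a minimal single-process sketch of the speech → vision-language → speech loop using the transformers and parler_tts APIs. It leaves out the Dora dataflow and the robot control entirely, and the checkpoints, file names, and prompts are assumptions.

```python
# Single-process sketch of the speech -> vision-language -> speech loop.
# The real setup runs each model as a separate Dora-rs node; checkpoints and
# file names here are assumptions for illustration.
import soundfile as sf
from PIL import Image
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoModelForVision2Seq, AutoProcessor, AutoTokenizer, pipeline

# 1. Speech to text with Whisper.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
command = asr("mic_recording.wav")["text"]

# 2. Ground the spoken command in the robot's camera frame with Idefics2.
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
vlm = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b")
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": command}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
frame = Image.open("camera_frame.jpg")
inputs = processor(text=prompt, images=[frame], return_tensors="pt")
generated = vlm.generate(**inputs, max_new_tokens=64)
# Note: the decoded string includes the prompt; a real node would strip it.
reply = processor.batch_decode(generated, skip_special_tokens=True)[0]

# 3. Speak the reply out loud with Parler-TTS.
tts = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1")
tts_tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")
description = "A clear, friendly voice speaking at a normal pace."
audio = tts.generate(
    input_ids=tts_tokenizer(description, return_tensors="pt").input_ids,
    prompt_input_ids=tts_tokenizer(reply, return_tensors="pt").input_ids,
)
sf.write("reply.wav", audio.cpu().numpy().squeeze(), tts.config.sampling_rate)
```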