Victor Mustar PRO

victor

AI & ML interests

Building the UX of this website

Recent Activity

reacted to davidberenstein1957's post with πŸ”₯ 14 minutes ago
replied to elliesleightholm's post about 1 hour ago
reacted to elliesleightholm's post with πŸ€— about 1 hour ago

Articles

Organizations

victor's activity

reacted to davidberenstein1957's post with πŸ”₯ 14 minutes ago
view post
Post
89
πŸ€—πŸ”­ Introducing Observers: A Lightweight SDK for AI Observability πŸ”­πŸ€—

Observers is an open-source Python SDK that provides comprehensive observability for AI applications. Our library makes it easy to:

- Track and record interactions with AI models
- Store observations in multiple backends
- Query and analyse your AI interactions with ease

https://huggingface.co/blog/davidberenstein1957/observers-a-lightweight-sdk-for-ai-observability
replied to elliesleightholm's post about 1 hour ago
view reply

Great content! Marqo looks great πŸ€—

reacted to elliesleightholm's post with πŸ€— about 1 hour ago
reacted to cfahlgren1's post with ❀️ about 14 hours ago
view post
Post
2030
You can clean and format datasets entirely in the browser with a few lines of SQL.

In this post, I replicate the process @mlabonne used to clean the new microsoft/orca-agentinstruct-1M-v1 dataset.

The cleaning process consists of:
- Joining the separate splits together / add split column
- Converting string messages into list of structs
- Removing empty system prompts

https://huggingface.co/blog/cfahlgren1/the-beginners-guide-to-cleaning-a-dataset

Here's his new cleaned dataset: mlabonne/orca-agentinstruct-1M-v1-cleaned
  • 1 reply
Β·
reacted to jsulz's post with πŸ”₯ about 15 hours ago
view post
Post
779
When the XetHub crew joined Hugging Face this fall, @erinys and I started brainstorming how to share our work to replace Git LFS on the Hub. Uploading and downloading large models and datasets takes precious time. That’s where our chunk-based approach comes in.

Instead of versioning files (like Git and Git LFS), we version variable-sized chunks of data. For the Hugging Face community, this means:

⏩ Only upload the chunks that changed.
πŸš€ Download just the updates, not the whole file.
🧠 We store your file as deduplicated chunks

In our benchmarks, we found that using CDC to store iterative model and dataset version led to transfer speedups of ~2x, but this isn’t just a performance boost. It’s a rethinking of how we manage models and datasets on the Hub.

We're planning on our new storage backend to the Hub in early 2025 - check out our blog to dive deeper, and let us know: how could this improve your workflows?

https://huggingface.co/blog/from-files-to-chunks
replied to LukeNeumann's post about 22 hours ago
view reply

Not me but yes I think the community would do something with it.

reacted to monsoon-nlp's post with ❀️ about 22 hours ago
view post
Post
992
Great to see Tatta Bio release an embeddings version of their DNA/protein language model 🧬: tattabio/gLM2_650M_embed
reacted to AkimfromParis's post with πŸ‘ about 22 hours ago
view post
Post
799
πŸ‡―πŸ‡΅ The Open Japanese LLM Leaderboard created by LLM-jp 🌸 in partnership with HuggingFace πŸ€— was released today!

Blog: https://huggingface.co/blog/leaderboard-japanese
Space: llm-jp/open-japanese-llm-leaderboard

🌍 The leaderboard is available in both Japanese and English
πŸ“š Based on the evaluation tool, llm-jp-eval with more than 20 datasets for Japanese LLMs
πŸ“Š The leaderboard showcases all the metrics for NLP experts, plus averages for NLP beginners
πŸ’» For the comfort of users, we chose a horizontal UI, and implemented it in a light and dark theme on Gradio
πŸ”¬ The radar chart provides a very interesting visualization of metrics!
🌱 We are using the Japanese research platform, MDX, so please be patient!
⚑ LLM bigger than +70B will be evaluated soon…

How do you say β€œGPUs Go Brrr” in Japanese - > GPUγŒγƒ–γƒ³γƒ–γƒ³ο½ž! (To pronounce "GPU ga bunbun!") πŸ”₯
  • 4 replies
Β·
replied to LukeNeumann's post 1 day ago
view reply

You're a legend! and yes the dataset would be πŸ”₯

reacted to LukeNeumann's post with 🀯 1 day ago
view post
Post
1005
Nine years ago, I uploaded the first 8K resolution video to YouTube and I've been stockpiling 8K footage ever since: https://www.youtube.com/watch?v=sLprVF6d7Ug&t

Should @Overlaiapp release the first open-source 8K video dataset?

Could anyone even fine tune a model with this?πŸ˜…
Β·
reacted to davidberenstein1957's post with πŸ”₯ 1 day ago
view post
Post
1788
For anyone who struggles with NER or information extraction with LLM.

We showed an efficient workflow for token classification including zero-shot suggestions and model fine-tuning with Argilla, GliNER, the NuMind NuExtract LLM and SpanMarker. @argilla

Video: https://youtu.be/JvLpaYgNd84?feature=shared
Notebooks and slides included to try it yourself πŸ™‚
replied to their post 3 days ago
view reply

Thanks, we'll see what we can do about filtering model based on parameter size (imo no super clear way to do it well, we need to think about it a bit).

replied to their post 3 days ago
view reply

Something new is coming soon regarding all this πŸ‘€ (stay tuned)

reacted to bartowski's post with ❀️ 3 days ago
view post
Post
15840
In regards to the latest mistral model and GGUFs for it:

Yes, they may be subpar and may require changes to llama.cpp to support the interleaved sliding window

Yes, I got excited when a conversion worked and released them ASAP

That said, generation seems to work right now and seems to mimic the output from spaces that are running the original model

I have appended -TEST to the model names in an attempt to indicate that they are not final or perfect, but if people still feel mislead and that it's not the right thing to do, please post (civilly) below your thoughts, I will highly consider pulling the conversions if that's what people think is best. After all, that's what I'm here for, in service to you all !
Β·
reacted to hexgrad's post with πŸ”₯ 4 days ago
reacted to merve's post with πŸ”₯ 4 days ago
view post
Post
4629
OmniVision-968M: a new local VLM for edge devices, fast & small but performant
πŸ’¨ a new vision language model with 9x less image tokens, super efficient
πŸ“– aligned with DPO for reducing hallucinations
⚑️ Apache 2.0 license πŸ”₯

Demo hf.co/spaces/NexaAIDev/omnivlm-dpo-demo
Model NexaAIDev/omnivision-968M
  • 4 replies
Β·
reacted to prithivMLmods's post with πŸ€— 4 days ago
view post
Post
3830
Minimalistic Adapters πŸŽƒ

πŸš€Demo Here:
prithivMLmods/FLUX-LoRA-DLC

πŸš€Model:
{ Quote Tuner } : prithivMLmods/Flux.1-Dev-Quote-LoRA
{ Stamp Art } : prithivMLmods/Flux.1-Dev-Stamp-Art-LoRA
{ Hand Sticky } : prithivMLmods/Flux.1-Dev-Hand-Sticky-LoRA
{ Poster HQ } : prithivMLmods/Flux.1-Dev-Poster-HQ-LoRA
{ Ctoon Min } : prithivMLmods/Flux.1-Dev-Ctoon-LoRA

πŸš€Collection:
{ Flux LoRA Collection} : prithivMLmods/flux-lora-collections-66dd5908be2206cfaa8519be
{ LoRA Space Collection } : prithivMLmods/lora-space-collections-6714b72e0d49e1c97fbd6a32

πŸš€For More Visit
https://huggingface.co/strangerzonehf
.
.
.
πŸ€—@prithivMLmods
  • 3 replies
Β·
reacted to reach-vb's post with πŸ‘ 4 days ago
view post
Post
3914
What a brilliant week for Open Source AI!

Qwen 2.5 Coder by Alibaba - 0.5B / 1.5B / 3B / 7B / 14B/ 32B (Base + Instruct) Code generation LLMs, with 32B tackling giants like Gemnini 1.5 Pro, Claude Sonnet
Qwen/qwen25-coder-66eaa22e6f99801bf65b0c2f

LLM2CLIP from Microsoft - Leverage LLMs to train ultra-powerful CLIP models! Boosts performance over the previous SOTA by ~17%
microsoft/llm2clip-672323a266173cfa40b32d4c

Athene v2 Chat & Agent by NexusFlow - SoTA general LLM fine-tuned from Qwen 2.5 72B excels at Chat + Function Calling/ JSON/ Agents
Nexusflow/athene-v2-6735b85e505981a794fb02cc

Orca Agent Instruct by Microsoft - 1 million instruct pairs covering text editing, creative writing, coding, reading comprehension, etc - permissively licensed
microsoft/orca-agentinstruct-1M-v1

Ultravox by FixieAI - 70B/ 8B model approaching GPT4o level, pick any LLM, train an adapter with Whisper as Audio Encoder
reach-vb/ultravox-audio-language-model-release-67373b602af0a52b2a88ae71

JanusFlow 1.3 by DeepSeek - Next iteration of their Unified MultiModal LLM Janus with RectifiedFlow
deepseek-ai/JanusFlow-1.3B

Common Corpus by Pleais - 2,003,039,184,047 multilingual, commercially permissive and high quality tokens!
PleIAs/common_corpus

I'm sure I missed a lot, can't wait for the next week!

Put down in comments what I missed! πŸ€—
reacted to fdaudens's post with πŸš€ 6 days ago
view post
Post
1152
πŸͺ„ MagicQuill: AI that reads your mind for image edits! Point at what bugs you, and it suggests the perfect fixes. No more manual editing headaches. Try it here: AI4Editing/MagicQuill
reacted to cfahlgren1's post with πŸ‘€ 6 days ago
view post
Post
2190
Why use Google Drive when you can have:

β€’ Free storage with generous limitsπŸ†“
β€’ Dataset Viewer (Sorting, Filtering, FTS) πŸ”
β€’ Third Party Library Support
β€’ SQL Console 🟧
β€’ Security πŸ”’
β€’ Community, Reach, and Visibility πŸ“ˆ

It's a no brainer!

Check out our post on what you get instantly out of the box when you create a dataset.
https://huggingface.co/blog/researcher-dataset-sharing
  • 1 reply
Β·