Aymeric Roucher

m-ric

AI & ML interests

MLE at Hugging Face 🤗 LLMs, Agents, RAG, Multimodal.

m-ric's activity

posted an update 1 day ago
Are Agents capable enough for Data Science? ⇒ Measure their performance with DSBench 📊

A team from Tencent AI wanted to evaluate agentic systems on data science (DS) tasks, but they noticed that existing agentic benchmarks were severely limited in several aspects: they were restricted to text and did not include tables or images, were specific to certain packages, only performed exact-match evaluation…

➡️ So they set out to build a much more exhaustive approach, to finally make the definitive DS agent benchmark.

The DSBench dataset
▪️ DSBench has 466 data analysis tasks and 74 data modeling tasks
▪️ The tasks are sourced from ModelOff and Kaggle, the platforms hosting the most popular data science competitions
▪️ Differences from previous DS benchmarks:
❶ This benchmark leverages various modalities on top of text: images, Excel files, tables
❷ Complex tables: sometimes several tables must be combined to answer one question
❸ The context is richer, with longer descriptions.
▪️ Evaluation metric: the benchmark is scored with an LLM as a judge, using a specific prompt (a minimal sketch of this setup is shown below).

Insights from evaluating agents
▪️ Their evaluation confirms that using LLMs in an agent setup, for instance by allowing them to run a single step of code execution, is more costly (especially with multi-turn frameworks like AutoGen) but also much more performant than the vanilla LLM.
▪️ The sets of tasks solved by different models (like GPT-3.5 vs Llama-3-8B) have quite low overlap, which suggests that different models tend to try very different approaches.

This new benchmark is really welcome, can't wait to try transformers agents on it! 🤗

Read their full paper 👉 DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? (2409.07703)
posted an update 6 days ago
OpenAI finally reveals "🍓": crazy chain-of-thought-tuned model >> GPT-4o 💥

OpenAI had hinted at a mysterious "project strawberry" for a long time: they published this new model called "o1" 1 hour ago, and the performance is just mind-blowing.

🤯 Ranks among the top 500 students in the US in a qualifier for the USA Math Olympiad
🤯 Beats human-PhD-level accuracy by 8% on GPQA, a benchmark of hard science problems where the previous best was Claude 3.5 Sonnet with 59.4%
🤯 Scores 78.2% on the vision benchmark MMMU, making it the first model competitive with human experts
🤯 GPT-4o scored 60% on MATH ⇒ o1 scores 95%

How did they pull this off? Sadly, OpenAI keeps improving at "making cryptic AF reports that don't reveal any real info", so here are some excerpts:

💬 "o1 uses a chain of thought when attempting to solve a problem. Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes."

And of course, they decided to hide the content of this precious chain of thought. Would it be for maximum profit? Of course not, you awful capitalist, it's to protect users:

💬 "We also do not want to make an unaligned chain of thought directly visible to users."

They're right, it would certainly have hurt my feelings to see the internals of this model tearing apart math problems.

🤔 I suspect it could be not only CoT, but also some agentic behaviour where the model can just call a code executor. The kind of score improvements they show certainly looks like what you see with agents.

This model is being released immediately in ChatGPT and to some "trusted API users".

Let's start cooking to release the same thing in 6 months! 🚀
posted an update 6 days ago
Extracting your HTML webpages to markdown is now possible end-to-end with a simple LLM! 👏

Jina just released Reader-LM, which handles the whole pipeline of extracting markdown from HTML webpages.

A while ago, Jina had released a completely code-based, deterministic program to do this extraction, based on some heuristics: e.g., "if the text is in a <p> tag, keep it, but if it's hidden behind another one, remove it".

🤔 But they received complaints from users: some found the output too detailed, others not detailed enough, depending on the page.

➡️ So they decided: maybe heuristics were not enough; instead, they tried to train an LLM to do the complete extraction. This LLM does not need to be very strong, but it should handle a very long context: it's a challenging, "shallow-but-wide" architecture.

Technical insights:
2️⃣ models: Reader-LM-0.5B and 1.5B
⚙️ Two stages of training: first, short and simple HTML to get the basics, then ramp up to longer and harder HTML, up to 128k tokens
🔎 Uses contrastive search for decoding: this empirically reduces "repeating output" issues (see the sketch after this list)
➡️ Their models beat much larger models at HTML extraction 🔥
🤗 Weights available on HF (sadly, CC-BY-NC license): jinaai/reader-lm-1.5b
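As an aside, contrastive search is available out of the box in transformers' generate(); here is a minimal sketch, with hyperparameters and a toy HTML input that are illustrative, not Jina's exact settings:

```python
# Contrastive-search decoding sketch (illustrative settings, not Jina's exact recipe).
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "jinaai/reader-lm-1.5b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

html = "<html><body><h1>Hello</h1><p>Some content.</p></body></html>"
inputs = tokenizer(html, return_tensors="pt")

# Setting penalty_alpha > 0 together with top_k enables contrastive search.
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    penalty_alpha=0.6,  # degeneration penalty
    top_k=4,            # number of candidates re-ranked at each step
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```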
posted an update 7 days ago
Open LLMs are on fire right now! 🔥 DeepSeek-V2.5 and other top releases

Mistral AI just released Pixtral-12B, a vision model that seems to perform extremely well! On Mistral's own benchmarks, it beats the great Qwen2-7B and Llava-OV.

🤔 But Mistral's benchmarks evaluate in chain-of-thought, and even in CoT they show lower scores for other models than the scores those models already published in non-CoT, which is very strange… Evaluation is not a settled science!

But it's only the latest of a flurry of great models. Here are the ones currently squatting at the top of the Models Hub page:

❶ 🔊 Llama-3.1-8B Omni, a model built upon Llama-3.1-8B-Instruct that simultaneously generates text and speech responses with an extremely low latency of 250 ms (Moshi, Kyutai's 8B, did 140 ms)

❷ 🐟🗣️ Fish Speech v1.4, a text-to-speech model that supports 8 languages 🇬🇧🇨🇳🇩🇪🇯🇵🇫🇷🇪🇸🇰🇷🇸🇦 with extremely good quality for a light size (~1GB of weights) and low latency

❸ 🐳 DeepSeek-V2.5, a 236B model with 128k context length that combines the best of DeepSeek-V2-Chat and the more recent DeepSeek-Coder-V2-Instruct. Depending on the benchmark, it ranks just below Llama-3.1-405B. Released with a custom 'deepseek' license, quite commercially permissive.

❹ Solar Pro, published by Upstage: a 22B model (so inference fits on a single GPU) that comes in just under Llama-3.1-70B performance: MMLU: 79, GPQA: 36, IFEval: 84

❺ MiniCPM3-4B, a small model that claims very impressive scores, even beating much larger models like Llama-3.1-8B. Let's wait for more scores, because these look almost too good!

Let's keep looking, more good stuff is coming our way 🔭
posted an update 7 days ago
Arcee releases SuperNova, a better fine-tune of Llama-3.1-70B!

2️⃣ versions: 70B and 8B
🧠 Trained by distilling logits from Llama-3.1-405B
Used a clever compression method to reduce the dataset size from 2.9 petabytes down to 50 GB (they may share it in a paper)
⚙️ Not all benchmarks are improved: GPQA and MUSR go down a little
🤗 8B weights are available on HF (not the 70B)

Read their blog post 👉 https://blog.arcee.ai/arcee-supernova-training-pipeline-and-model-composition/
Model weights (8B) 👉 arcee-ai/Llama-3.1-SuperNova-Lite
posted an update 8 days ago
> Want to know how much an API LLM call costs you?

I've just made this Space that gets you the API price for any LLM call, for nearly all inference providers out there!

This is based on a comment by @victor under my HF Post a few months back, and leverages BerriAI's data for LLM prices.

Check it out here 👉 m-ric/text_to_dollars
replied to their post 8 days ago
posted an update 9 days ago
> Article read: Simple guide to LLM inference and to TGI

I've just read the article "LLM inference at scale with TGI" by @martinigoyanes. It's really good content, a must-read if you want a good low-level intro to LLM inference with TGI!

My takeaways:

How does inference work?
🧠 Prefill: the input prompt is tokenized on CPU, then transferred to GPU. Then one single forward pass generates the initial token.
🔄 Decode: the model generates ("decodes") tokens one by one, each time appending the new token to the current input of size N, then generating a new token again with this augmented input of length N+1. This loop ends either when a specific "end-of-sequence" token is generated or when the completion reaches a pre-specified maximum length. The sequence is then de-tokenized on CPU to yield text again.
⏱️ The speed of this step determines the time per output token, which directly translates to the key metric: throughput.

🤔 How was the separation between the two steps decided? And why does prefill include this strange generation of only one token at the end?
➡️ The cost of attention scales quadratically with the number of tokens, so it can really explode quickly.
To compensate for that, a really important technique called KV caching was devised: since, when generating token N+1, the key and value (K and V) matrices computed inside the Transformer are a simple extension of the K and V from the previous step, the model caches the K and V matrices between steps. Hence the separation: the prefill part prepares this KV cache, while decoding leverages it and extends it by one entry at each step (a minimal sketch of this loop is shown below).
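To make the prefill/decode split concrete, here is a minimal sketch of the loop with transformers; greedy decoding and gpt2 are used purely for illustration:

```python
# Minimal prefill/decode loop with an explicit KV cache (illustrative sketch).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "gpt2"  # any causal LM works for this illustration
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

input_ids = tokenizer("The key advantage of KV caching is", return_tensors="pt").input_ids

# Prefill: one forward pass over the whole prompt builds the KV cache
# and yields the logits used to pick the first generated token.
out = model(input_ids, use_cache=True)
past_key_values = out.past_key_values
next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
generated = [next_token]

# Decode: each step feeds only the newest token plus the cached K/V,
# and the cache grows by one position per step.
for _ in range(20):
    out = model(next_token, past_key_values=past_key_values, use_cache=True)
    past_key_values = out.past_key_values
    next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    generated.append(next_token)
    if next_token.item() == tokenizer.eos_token_id:
        break

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```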

TGI-specific takeaways:
⚙️ TGI has many SOTA techniques for decoding: Paged Attention, KV caching, Flash Attention…
🔀 TGI's router handles generations finishing early because of an EOS token: instead of static batching, it continuously batches requests to the inference engine & filters away finished requests.
posted an update 13 days ago
🤯 A new 70B open-weights LLM beats Claude-3.5-Sonnet and GPT-4o!

@mattshumer, CEO of HyperWrite AI, had an idea he wanted to try out: why not fine-tune LLMs to always output their thoughts in specific parts, delineated by <thinking> tags?

Even better: inside of that, you could nest other sections, to reflect critically on previous output. Let's name this part <reflection>. Planning is also put in a separate step.

He named the method "Reflection tuning" and set out to fine-tune a Llama-3.1-70B with it.

Well, it turns out it works mind-bogglingly well!

🤯 Reflection-70B beats GPT-4o, Sonnet-3.5, and even the much bigger Llama-3.1-405B!

TL;DR
🥊 This new 70B open-weights model beats GPT-4o, Claude Sonnet, et al.
⏰ 405B in training, coming soon
📚 Report coming next week
⚙️ Uses GlaiveAI synthetic data
🤗 Available on HF!

I'm starting an Inference Endpoint right now for this model to give it a spin!

Check it out 👉 mattshumer/Reflection-Llama-3.1-70B
posted an update 13 days ago
🚀 Where scaling laws are taking us: by 2028, AI clusters will reach the power consumption of entire countries

Reminder: "scaling laws" are empirical laws saying that if you keep multiplying your compute by 10x, your models will mechanically keep getting better and better.

To give you an idea, GPT-3 can barely write sentences, and GPT-4, which used only 15x its amount of compute, already sounds much smarter than some of my friends (although it's not really, or at least I haven't tested them side-by-side). So you can imagine how far a 100x over GPT-4 could take us.

🏎️ As a result, tech titans are racing to build the biggest models, and for this they need gigantic training clusters.

The picture below shows the growth of training compute: it is increasing at a steady exponential rate of 10x every 2 years. So let's take this progression a bit further:
- 2022: training starts for GPT-4: 10^26 FLOPs, costing around $100M
- 2024: today, companies start training on much larger clusters, like the "super AI cluster" of Elon Musk's xAI: 10^27 FLOPs, $1B
- 2026: by then, clusters will require 1 GW, i.e. around the full output of a nuclear reactor
- 2028: we reach cluster prices around 100 billion dollars, using 10 GW, more than the most powerful power stations currently in use in the US. This last size seems crazy, but Microsoft and OpenAI are already planning one.

Will AI clusters actually reach these crazy sizes, where they consume as much as entire countries?
➡️ Three key ingredients of training might be roadblocks to scaling up:
💸 Money: but given the potential market size for AGI, it's very unlikely that investors lose interest.
⚡️ Energy supply at a specific location
📚 Training data: we're already using 15 trillion tokens for Llama-3.1, when the Internet has something like 60 trillion.

🤔 I'd be curious to hear your thoughts: do you think we'll race all the way there?
posted an update 14 days ago
🥳 Transformers Agents now supports multi-agent systems!

Multi-agent systems were introduced in Microsoft's framework AutoGen. It simply means having several agents working together to solve your task instead of only one: this paradigm empirically yields better performance on most benchmarks. The reason for this better performance is conceptually simple: for many tasks, rather than using a do-it-all system, you would prefer to specialize units on sub-tasks. Here, having agents with separate tool sets and memories allows for efficient specialization.

You can now easily build hierarchical multi-agent systems with transformers.agents (not released yet, use the dev version).

To do so, encapsulate the agent in a ManagedAgent object. This object needs the arguments agent, name, and description, which will then be embedded in the manager agent's system prompt to let it know how to call this managed agent, as we also do for tools.

Cf. the example in the image, and the sketch below! We'll keep building on this paradigm in the upcoming weeks 🚀
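For reference, a minimal sketch of such a hierarchy, closely following the agents_advanced doc linked below; the web-search example and exact class names may differ slightly in the dev version:

```python
# Hierarchical multi-agent sketch with transformers.agents (dev version at the time of writing).
from transformers.agents import DuckDuckGoSearchTool, HfApiEngine, ManagedAgent, ReactCodeAgent

llm_engine = HfApiEngine()  # defaults to a hosted chat model on the HF Inference API

# A specialized agent that only does web search
web_agent = ReactCodeAgent(tools=[DuckDuckGoSearchTool()], llm_engine=llm_engine)

# Wrap it so that a manager agent knows how to call it: name + description
# get embedded in the manager's system prompt, just like tool descriptions.
managed_web_agent = ManagedAgent(
    agent=web_agent,
    name="web_search",
    description="Runs web searches for you. Give it your query as an argument.",
)

# The manager orchestrates its managed agents like tools
manager_agent = ReactCodeAgent(
    tools=[],
    llm_engine=llm_engine,
    managed_agents=[managed_web_agent],
)

manager_agent.run("Who is the CEO of Hugging Face?")
```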

Read more in the doc ๐Ÿ‘‰ https://github.com/huggingface/transformers/blob/main/docs/source/en/agents_advanced.md

Checkout an advanced multi-agent system that tops the GAIA leaderboard ๐Ÿ‘‰ https://github.com/aymeric-roucher/GAIA/blob/main/gaia_multiagent.py
posted an update 15 days ago
🚨 Human feedback for AI training: not the golden goose we thought?

I've just read a great paper where Cohere researchers raise significant questions about using human feedback to evaluate AI language models.

Human feedback is often regarded as the gold standard for judging AI performance, but it turns out it might be more like fool's gold: the study reveals that our human judgments are easily swayed by factors that have nothing to do with actual AI performance.

Key insights:
🧠 They test several models: Llama-2, Falcon-40B, Cohere Command 6B and 52B
🙅‍♂️ Refusing to answer tanks AI ratings more than getting facts wrong. We apparently prefer a wrong answer to no answer!

💪 Confidence is key (even when it shouldn't be): more assertive AI responses are seen as more factual, even when they're not. This could be pushing AI development in the wrong direction, with systems like RLHF.

🎭 The assertiveness trap: as AI responses get more confident-sounding, non-expert annotators become less likely to notice when they're wrong or inconsistent.

And a consequence of the above:
🔄 RLHF might backfire: using human feedback to train AI (Reinforcement Learning from Human Feedback) could accidentally make AI more overconfident and less accurate.

This paper means we need to think carefully about how we evaluate and train AI systems, to ensure we're rewarding correctness over appearances of it, like confident talk.

⛔️ Chatbot Arena's ELO leaderboard, based on crowdsourced votes from average Joes like you and me, might become completely irrelevant as models get smarter and smarter.

Read the paper 👉 Human Feedback is not Gold Standard (2309.16349)
posted an update 16 days ago
🤖 The AI Scientist: an agentic, fully-automated research pipeline for under $15 per paper

Researchers have just created an AI system that can conduct entire research projects from start to finish, potentially revolutionizing how scientific discoveries are made.

It doesn't just assist with specific tasks, it automates the entire research process, from generating ideas to writing and reviewing papers:
1 - brainstorm novel research directions, 2 - write and execute code for experiments & visualize results, get references, and even 3 - write up the findings in full academic paper format!

And it can do all this for under $15 per paper! 🤯

Key insights:
🧠 Generates novel research ideas across multiple topics (e.g. diffusion modeling, transformers, learning dynamics aka "grokking")
👨‍💻 Uses the open-source coding assistant Aider to implement ideas and run experiments. This is especially important, since this agentic assistant can iterate if it fails somewhere.
📊 Visualizes results and plans follow-up experiments (up to 5 rounds)
✍️ Writes full academic papers, including finding references using the Semantic Scholar API
🕵️ Runs a simulated peer review process to evaluate paper quality
💰 Total cost per paper is under $15. This system can generate "hundreds of interesting, medium-quality papers" in just a week!

Still not ready to fill ICLR with papers:
🔍 Ideas generated in one domain tend to be repetitive across different runs, and even across different language models
👀 Does not use vision capabilities to fix visual issues in plots
💭 Models occasionally hallucinate entire results tables
⇒ Only a few of the generated papers would actually meet the threshold for acceptance at a top AI conference

👉 Read their paper: The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery (2408.06292)
posted an update 22 days ago
🎮 A neural network simulates DOOM: Google researchers open the way for completely-AI-generated games!

Imagine if games were completely live-generated by an AI model: the NPCs and their dialogues, the storyline, and even the game environment. The player's in-game actions would have a real, lasting impact on the game story.

In a very exciting paper, Google researchers just gave us the first credible glimpse of this future.

➡️ They created GameNGen, the first neural model that can simulate a complex 3D game in real time. They use it to simulate the classic game DOOM running at over 20 frames per second on a single TPU, with image quality comparable to lossy JPEG compression. And it feels just like the real game!

Here's how they did it:
1. They trained an RL agent to play DOOM and recorded its gameplay sessions.
2. They then used these recordings to train a diffusion model to predict the next frame, based on past frames and player actions.
3. During inference, they use only 4 denoising steps (instead of the usual dozens) to generate each frame quickly.

Key insights:
🎮🤔 Human players can barely tell the difference between short clips (3 seconds) of the real game and the simulation
🧠 The model maintains game state (health, ammo, etc.) over long periods, despite having only 3 seconds of effective context length
🔄 They use "noise augmentation" during training to prevent quality degradation in long play sessions
🚀 The game runs on one TPU at 20 FPS with 4 denoising steps, or 50 FPS with model distillation (with some quality loss)

The researchers did not open-source the code, but I feel like we've just seen a part of the future being written!

Their paper (exploding the upvote counter) 👉 Diffusion Models Are Real-Time Game Engines (2408.14837)
In a similar vein, play @Jofthomas's 'Everchanging Quest' 🎮 Jofthomas/Everchanging-Quest
replied to their post 27 days ago

Yes that's why I added the big if behind!

posted an update 27 days ago
AI21 iterates with the new Jamba 1.5 release: a new standard for long-context use-cases! 🏅

@ai21labs used a different architecture to beat the status-quo Transformer models: the Jamba architecture combines classic Transformer layers with the new Mamba layers, for which the complexity is a linear (instead of quadratic) function of the context length.

What does this imply?

➡️ Jamba models are much more efficient for long contexts: faster (up to 2.5x faster for long contexts), they take less memory, and they also perform better at recalling everything in the prompt.

That means it's a new go-to model for RAG or agentic applications!

And the performance is not too shabby: Jamba 1.5 models are comparable in performance to similar-sized Llama-3.1 models! The largest model even outperforms Llama-3.1-405B on Arena-Hard.

✌️ Comes in 2 sizes: Mini (12B active/52B total) and Large (94B active/399B total)
📏 Both deliver 256k context length with low memory use: Jamba 1.5 Mini fits 140k of context on one single A100.
⚙️ New quantization method: ExpertsInt8 quantizes only the weights of the MoE layers, which account for 85% of the weights
🤖 Natively supports JSON format generation & function calling.
🔓 Permissive license *if your org makes <$50M revenue*

Available on the Hub 👉 ai21labs/jamba-15-66c44befa474a917fcf55251
Read their release blog post 👉 https://www.ai21.com/blog/announcing-jamba-model-family
posted an update 29 days ago
Google paper: scaling up inference compute beats 14x larger models 🚀

Remember scaling laws? These are empirical laws that say "the bigger your model, the better it gets". More precisely, "as your compute increases exponentially, loss decreases in a linear fashion". They have wild implications, suggesting that spending 100x more training compute would get you super-LLMs. That's why companies are racing to build the biggest AI superclusters ever, and Meta bought 350k H100 GPUs, which probably cost on the order of $10B.

But think of this: we're building huge reasoning machines, but we only ask them to do one pass through the model to get each token of the final answer: i.e., we expend minimal effort on inference. That's like building a Caterpillar truck and making it run on a lawnmower's motor. 🚚🛵 Couldn't we optimize this? 🤔

💡 So instead of scaling up training by training even bigger models on many more trillions of tokens, Google researchers explored this under-explored avenue: scaling up inference compute.

They combine two methods to use more compute: either a reviser that iterates to adapt the model distribution, or generating N different completions (for instance through beam search) and selecting only the best one using an additional verifier model (a minimal sketch of this best-of-N approach is shown below).
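Here is a minimal sketch of the best-of-N idea; generate_candidate and verifier_score are hypothetical helpers standing in for the policy model and the verifier model:

```python
# Best-of-N selection with a verifier (sketch; the two helpers are hypothetical).
from typing import Callable

def best_of_n(
    prompt: str,
    generate_candidate: Callable[[str], str],     # samples one completion from the policy model
    verifier_score: Callable[[str, str], float],  # scores a (prompt, completion) pair
    n: int = 16,
) -> str:
    """Spend more inference compute: sample N completions, keep the one the verifier prefers."""
    candidates = [generate_candidate(prompt) for _ in range(n)]
    return max(candidates, key=lambda completion: verifier_score(prompt, completion))
```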

They use a PaLM 2 model (released in May '23) on the MATH dataset: PaLM 2 has the advantage of getting low, but not zero, performance on MATH, so that improvements will be noticeable.

And the results show that, for the same fixed amount of inference compute:
💥 a smaller model with more effort spent on decoding beats a 14x bigger model using naive greedy sampling.

That means that you can divide your training costs by 14 and still get the same performance for the same inference cost!

Take that, scaling laws. Mark Zuckerberg, you're welcome, hope I can get some of these H100s.

Read the paper here 👉 Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (2408.03314)
posted an update about 1 month ago
Zero-math intro to AI history: from the 1950s to today's LLMs 📖

I wanted to structure my thinking about LLMs by going through their history since the 50s. This history is captivating, with the opposition between Connectionists (Rosenblatt, LeCun) and Symbolists, the first victories of "deep" neural networks, the revolution of Attention...

So I might have gone a bit too far! 😅

📝 I've made a long post summarizing the main stages of building LLMs: neural networks, optimization, backpropagation, attention layers...

✅ And I've made sure to keep it 100% horrible-LaTeX-math-free: the technical stuff is conveyed through graphs only, so it should be accessible to really anyone, even your grandfather (I'm sending it to mine right now).

Read it here in English 👉 https://aymeric-roucher.github.io/brief-history-of-ai/
For the French version 👉 https://aymeric-roucher.github.io/breve-histoire-de-l-ia/
posted an update about 2 months ago
SAM 2 released: new SOTA on segmentation, by combining synthetic data with human feedback 🚀

It's a model for object segmentation, for both image and video:
👉 input = a text prompt, or a click on a specific object
👉 output = the model draws a mask around the object. In video segmentation, the mask should follow the object's movements (it is then called a masklet)

💪 SAM 2 is 6x faster than the previous version, it now also works on video, and it beats the SOTA by far on both image and video segmentation tasks.

How did they pull that?

The main blocker for video segmentation was that data is really hard to collect: to build your training dataset, should you manually draw masks on every frame? That would be way too costly! ➡️ As a result, existing video segmentation datasets have a real lack of coverage: few examples, few masklets drawn.

💡 Key idea: the researchers decided to use a segmentation model to help them collect the dataset.

But then it's a chicken-and-egg problem: you need the model to create the dataset, and vice versa? 🤔

⇒ To solve this, they built a data generation system that they scaled up progressively over 3 successive manual annotation phases:

Step 1: Annotators use only SAM + manual editing tools on each frame ⇒ this creates 16k masklets across 1.4k videos

Step 2: Then they train a first SAM 2, add it in the loop to temporally propagate masks across frames, and correct by re-doing a mask manually whenever an error occurs ⇒ this gives a 5.1x speedup over data collection in phase 1! 🏃 Collects 60k masklets

Step 3: Now that SAM 2 is more powerful and has the "single click" prompting option, annotators can use it with simple clicks to re-annotate data.

They even add a completely automatic step to generate 350k more masklets!
And in turn, the model's performance gradually increases.

I find this a great example of combining synthetic data generation with human annotation 👍
posted an update about 2 months ago
Llama-3.1 models finally get their Chatbot Arena ranking 🎖️

Given the impressive benchmarks published by Meta for their Llama-3.1 models, I was curious to see how these models would compare to top proprietary models on Chatbot Arena.

Now we've got the results! LMSys released the ELO derived from thousands of user votes for the new models, and here are the rankings:

💥 The 405B model ranks 5th overall, ahead of GPT-4-Turbo! But behind GPT-4o, Claude-3.5-Sonnet and Gemini-Advanced.
👏 The 70B model climbs up to 9th rank! From 1206 ➡️ 1244.
👏 The 8B model improves from 1152 ➡️ 1170.

✅ This confirms that Llama-3.1 is a good contender for any task: each of its 3 model sizes is much cheaper to run than equivalent proprietary models!

For instance, here are the inference prices for the top models (with a small cost example below):
➤ GPT-4-Turbo inference price from OpenAI: $5/M input tokens, $15/M output tokens
➤ Llama-3.1-405B from HF API (for testing only): $3/M for input or output tokens (source linked in the first comment)
➤ Llama-3.1-405B from HF API (for testing only): free ✨
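To make these rates concrete, here is the arithmetic for one example call; the token counts are made up for illustration:

```python
# Cost of one example call at GPT-4-Turbo rates ($5/M input tokens, $15/M output tokens).
input_tokens, output_tokens = 2_000, 500
cost = input_tokens / 1e6 * 5 + output_tokens / 1e6 * 15
print(f"${cost:.4f}")  # $0.0175
```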

Get a head start on the HF API (resource by @andrewrreed) 👉 https://huggingface.co/learn/cookbook/enterprise_hub_serverless_inference_api
posted an update about 2 months ago
The huge cost of research on frontier LLMs 💸

Google DeepMind recently released a great paper that shows optimal hyperparameters to train across different regimes: Scaling Exponents Across Parameterizations and Optimizers, with data from 10,000 training runs.

One engineer decided to quantify the price of such a large-scale experiment.

😬 And the bill is hefty: ~$13M

This exact number is to be taken with a grain of salt, because many approximations were necessary to get to the final result.

⛔️ But still, this ballpark means that for this single experiment, the price is way over what most startups or research labs could afford.

This means that open-sourcing research is more important than ever, to put everyone in the ecosystem on a roughly equal footing. Don't let OpenAI run first, they'll keep everything for themselves!

Read the full post that quantifies the paper's cost 👉 https://152334h.github.io/blog/scaling-exponents/
posted an update about 2 months ago
Agentic data analyst: drop your data file, let the LLM do the analysis 📊⚙️

Need to do quick exploratory data analysis? ➡️ Get help from an agent.

I was impressed by Llama-3.1's capacity to derive insights from data. Given a CSV file, it makes quick work of exploratory data analysis and can derive interesting insights.

On the data from the Kaggle Titanic challenge, which records which passengers survived the Titanic wreck, it was able by itself to derive interesting trends like "passengers that paid higher fares were more likely to survive" or "survival rate was much higher for women than for men".

The cookbook even lets the agent build its own submission to the challenge, and it ranks in the top 3,000 out of 17,000 submissions: 👏 not bad at all! (A rough sketch of such an agent setup is shown below.)
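Such an agent can be set up in a few lines; this is a sketch loosely following the cookbook, with an illustrative prompt, model choice, and list of authorized imports:

```python
# Data-analyst agent sketch with transformers.agents (illustrative prompt and settings).
from transformers.agents import HfApiEngine, ReactCodeAgent

agent = ReactCodeAgent(
    tools=[],
    llm_engine=HfApiEngine("meta-llama/Meta-Llama-3.1-70B-Instruct"),
    additional_authorized_imports=["pandas", "numpy", "matplotlib.pyplot", "sklearn"],
)

agent.run(
    "Load 'titanic.csv' with pandas, run an exploratory data analysis, "
    "and report the three most interesting trends you find."
)
```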

Try it for yourself in this Space demo 👉 m-ric/agent-data-analyst
replied to Wauplin's post 2 months ago
posted an update 2 months ago
New decoding technique in transformers significantly reduces hallucinations 👏

DoLa decoding, which was a conference paper at ICLR '24, has just been merged into Transformers by @joaogante and Yung-Sung Chuang.
This new decoding method is simple yet extremely impressive!

Reminder: Decoder LLMs (the GPT kind of LLM, the most common one) generate their outputs one token at a time: at each step, given a current text, they compute a logit for each token in their vocabulary that should represent the probability of this token coming next.

Then they either pick the highest logit token (greedy decoding) or sample one with a probability defined by the logits (sampling).

The authors of DoLa wanted to improve that simple method.

They started from the established fact that transformer LMs encode low-level info (like basic syntax) in early layers, and more high-level info, like knowledge, in the later layers.

💡 This gave them their key idea: during decoding, rather than picking the token with the highest logit, why not pick the token with the most impressive increase in logit across layers?
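Since it is merged, trying it out is mostly a one-argument change in generate(); a minimal sketch, where the checkpoint and the dola_layers/repetition_penalty values are just illustrative defaults:

```python
# DoLa decoding sketch with transformers' generate() (illustrative settings).
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "huggyllama/llama-7b"  # any decoder LLM; this checkpoint is just an example
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer("What is the capital of France?", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    dola_layers="high",      # contrast the final layer against the higher layers
    repetition_penalty=1.2,  # recommended alongside DoLa to avoid repetition
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```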

This gives impressive results:
🚀 5% - 20% base-point increases across the benchmarks
🚀 For instance on TruthfulQA (open-ended), across all model sizes, the increase in truthfulness is 14 base points, which is around a 40% improvement compared to standard decoding!

🤔 Wouldn't decoding take longer because of this added contrasting step? 👉 The runtime increase is negligible, only 1 to 8%.

Paper added to my collection 👉 m-ric/optimization-mechanics-661d543a5fc6ca1dc84284a0
replied to their post 2 months ago
posted an update 2 months ago
One more cookbook:
Agent for self-correcting text-to-SQL 🧑‍💻

What if the query generated by your text-to-SQL pipeline is valid SQL but returns wrong results?
👉 We need to add a critique step

✅ That's very simple with an agent! (A rough sketch of the loop is shown below.)
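The core loop is roughly this; a sketch where generate_sql and critique_result are hypothetical LLM-backed helpers, not the cookbook's exact code:

```python
# Self-correcting text-to-SQL loop (sketch; generate_sql and critique_result are hypothetical helpers).
import sqlite3

def answer_with_sql(question, db_path, generate_sql, critique_result, max_tries=3):
    """Generate SQL, run it, let an LLM critique the result, and retry with that feedback if needed."""
    connection = sqlite3.connect(db_path)
    feedback, rows = "", None
    for _ in range(max_tries):
        query = generate_sql(question, feedback)          # LLM writes a query, given past feedback
        try:
            rows = connection.execute(query).fetchall()   # run it against the database
        except sqlite3.Error as error:
            feedback = f"The query failed with: {error}"  # execution error -> feed it back
            continue
        verdict = critique_result(question, query, rows)  # LLM judges whether the rows answer the question
        if verdict == "OK":
            break
        feedback = verdict                                # critique text -> feed it back
    return rows
```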

Check out the notebook! 👇
https://huggingface.co/learn/cookbook/agent_text_to_sql
posted an update 2 months ago
New cookbook!

I show how to make agentic RAG using Transformers Agents.

Compared to vanilla RAG, agentic RAG can:
✅ Reformulate the query
✅ Critique the retrieved content and re-retrieve if needed

➡️ Score increase of 8.5%! 💪 (judged by Llama-3-70B)

Read it here 👉 https://huggingface.co/learn/cookbook/agent_rag
replied to their post 3 months ago

It's not using GPT-4o for evaluation, evaluation is done with exact string match!

posted an update 3 months ago
You don't need "function calling fine-tuning" to build good agents ⛔

It's trendy to share models "fine-tuned for function calling"; but from my observations, this fine-tuning is neither necessary nor sufficient to build good agent systems.
To name only a few:
🐦‍⬛ Nexusflow/NexusRaven-V2-13B
⌘ CohereForAI/c4ai-command-r-plus
⛵️ mistralai/Mixtral-8x22B-Instruct-v0.1
"Fine-tuned for function calling" generally means "fine-tuned to generate function calls in correct JSON for extremely simple tasks". In other words, it means "improve the formatting of the tool calls".

Yet I discovered two things while improving Transformers Agents:
🧐 Even when used as JSON agents, these fine-tuned models don't perform very well
🏅 Good base models perform better without any fine-tuning, just plain prompting. (Llama-3-70B-Instruct, GPT-4o, Claude-3.5-Sonnet)

👇 The graph below shows the count of errors for my GPT-4o validation run on the GAIA benchmark: AgentParsingError and AgentExecutionError are the ones caused by incorrect formatting.
➤ As you can see, their count is already close to 0!
And given that GPT-4o is certainly not fine-tuned for our code tool-calling format, this shows that "function calling fine-tuning" is not necessary!

The hardest thing to get right in an agent is still to plan good task-solving trajectories over several steps.
To improve this, we could:
- Use more powerful base models
- Make tool calling datasets with complex solving trajectories
- Use RL! cc @lvwerra
posted an update 3 months ago
Transformers Agents reaches the top of the GAIA leaderboard! 🥳

We've been improving Transformers Agents a lot lately.

So with @sergeipetrov we set out to prove that it's the best agent framework out there.

To prove this, we set out to beat the GAIA leaderboard, the most comprehensive benchmark out there for evaluating LLM agents.
Its questions make you explore different flavours of pain:

🛠️ Require using tools, at least a web browser
🔢 Rigorous logic, with many questions having strong math aspects
🖼️ Multimodal: the agent has to handle all file types: 🔊, 🖼️, 🎬...
👣 Multi-step, with many questions requiring over 10 steps to be solved.

Some Level 3 questions are crazy hard 😳
> "In NASA's Astronomy Picture of the Day on 2006 January 21, two astronauts are visible, with one appearing much smaller than the other. As of August 2023, out of the astronauts in the NASA Astronaut Group that the smaller astronaut was a member of, which one spent the least time in space, and how many minutes did he spend in space, rounded to the nearest minute?"
(no file attached of course, the agent has to find all the info)

➡️ We used Transformers Agents' ReactCodeAgent, which writes its actions in code. We created a new planning component that we'll incorporate into the framework. More info soon in a blog post!

Results:
🚀 Our submission scores #2 overall on the test set and #1 on the validation set. On both sets we're the leading submission based on a public framework, beating Microsoft's AutoGen.
🥇 On both sets we are #1 on the hardest Level 3 questions, reaching nearly 20%.

Go check out the leaderboard 👉 gaia-benchmark/leaderboard
replied to their post 3 months ago
posted an update 3 months ago
💰 Get the price of any LLM API request ⇒ tokencost

I've just found out about AgentOps-AI/tokencost (https://github.com/AgentOps-AI/tokencost).
This library gives you the price of your calls to any LLM API: OpenAI, Anthropic, Mistral, AWS or Databricks...

For any model, you can pass as input either string prompts or message lists, and get as output either the price or the token count (see the small sketch below).
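A quick sketch of a call; the function names are as I recall them from the project's README, so double-check against the repo:

```python
# tokencost sketch: price both sides of one chat call.
from tokencost import calculate_completion_cost, calculate_prompt_cost

model = "gpt-3.5-turbo"
messages = [{"role": "user", "content": "Say hello!"}]
completion = "Hello! How can I help you today?"

prompt_cost = calculate_prompt_cost(messages, model)            # price of the input side
completion_cost = calculate_completion_cost(completion, model)  # price of the output side
print(f"Total cost: ${prompt_cost + completion_cost}")
```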

Congrats to the AgentOps-AI team: this will be very useful when trying to get a ballpark estimate of a project's price, to compare APIs, or for precise monitoring of usage!

✨ Daily reminder: running an A100 costs you exactly $0.00/hour (or €0.00 at current exchange rates) on an HF Space with ZeroGPU!
Learn more on ZeroGPU 👉 https://www.datacenterdynamics.com/en/news/hugging-face-launches-zerogpu-project-to-democratize-ai-gives-away-10-million-worth-of-compute/
posted an update 4 months ago
How does an agentic workflow use its LLM engine to solve tasks?

➡️ I made my first ever manim video to show just that:

Watch below how a React agent solves a simple task, by leveraging its memory to iterate on previous actions! 🎬👇

Read our blog post on Agents: https://huggingface.co/blog/agents
posted an update 4 months ago
Writing tool calls in code just works better than JSON 💪

I was really happy to learn today from @sergeipetrov that the paper "Executable Code Actions Elicit Better LLM Agents" was accepted at ICLR 2024!

As a reminder, an agent is a system in which you embed an LLM engine, to let it call tools.

These tools are meant, like an Iron Man suit, to supplement the LLM in areas where it isn't good.
🧑‍💻 For instance, your friendly LLM may be terrible at calculating powers of floating-point numbers ("What is X^0.2947?"), so it should use a calculator.
🔎 It may be terrible at knowing precise facts ("What was the date of the Golden Bull?"), so it should use a web browser.

So the agent system will prompt an agent with "Now you can use these tools: calculator, search,..."

But how should the agent express its actions?

All well-known frameworks let agents write their actions as JSON strings.

We preferred to go with formulating actions in code, which is much more versatile and concise, and allows chaining actions seamlessly: see the picture attached for an example where the code formulation really shines.

And the paper confirms our choice: the researchers show that compared to JSON or plain text, code is better both in conciseness and performance:
➤ Up to 30% fewer steps for the same actions (much more concise)
➤ Up to 20% higher performance on benchmarks

And we find additional benefits, for instance natural handling of variables. (Below is a small illustration of the difference between the two formats.)
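To make the contrast concrete, here is a made-up illustration of the two formats for the same two-step task; search, calculator and final_answer are placeholder tools, not a specific framework's API:

```python
# JSON-style tool calling needs one round-trip per call, and intermediate results
# must be shuttled back through the LLM as text:
#   {"tool": "search", "arguments": {"query": "population of Guadeloupe"}}
#   {"tool": "calculator", "arguments": {"expression": "<result of previous call> / 1628"}}
#
# A code action expresses the same plan in one shot, chaining calls and
# keeping intermediate results in variables:
population = search("population of Guadeloupe")
density = calculator(f"{population} / 1628")
final_answer(density)
```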

Read the paper here 📖 Executable Code Actions Elicit Better LLM Agents (2402.01030)
Get your ReactCodeAgent running with our Agents framework! 👉 https://huggingface.co/learn/cookbook/agents
posted an update 4 months ago
New guide in our Open-Source AI cookbook: Structured generation! ✨

Many LLM use cases involve generating outputs with a specific structure.

➡️ For instance, when using an LLM as a judge to evaluate another model's outputs, you need it to give you not only a score, but also the rationale for this score, and maybe a confidence level.
So you do not need only "score: 1", but rather a dictionary like:
{
     "rationale": "The answer does not match the true answer at all.",
     "score": 1,
     "confidence_level": 0.85
}


🤔 How do you force your LLM to generate such a structured output?

🗝️ Constrained decoding is a great technique to generate structured output: you can specify a grammar (= a set of rules) that the output should follow, and constrained decoding then forces the decoder to only pick tokens that respect your grammar.
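For the local route, a minimal sketch with outlines could look like this; the Pydantic schema mirrors the dictionary above, and the checkpoint is just an example:

```python
# Constrained JSON generation with outlines (sketch; the checkpoint is an example).
from pydantic import BaseModel
import outlines

class Judgment(BaseModel):
    rationale: str
    score: int
    confidence_level: float

model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")
generator = outlines.generate.json(model, Judgment)

result = generator("Judge this answer: 'The capital of France is Berlin.' Return your judgment.")
print(result)  # a Judgment instance, guaranteed to match the schema
```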

I've created a guide to show you how to use it, both via our Inference API and locally using outlines!

👉 Read it here: https://huggingface.co/learn/cookbook/structured_generation

Thank you @stevhliu for your great help in improving it!
posted an update 5 months ago
💰❌ Research for the very GPU-poor - Scaling laws replication

🎆 Good news: you can do cutting-edge research with a calculator and Microsoft Paint 2006!

The Chinchilla experiments (by Google DeepMind) ran hundreds of pre-trainings with models >1B parameters (I do not want to imagine how much that cost) to find the optimal ratio of model size vs training tokens. Why is this question so important?
Well, you only ever have access to a fixed amount of compute, counted in FLOPs (floating-point operations). So if your model is bigger, you will have less compute left to train on many tokens, and if you want to train on more tokens, your model will be smaller. When model trainings cost millions, you absolutely need to get this right.

The new paper "Chinchilla Scaling: A replication attempt" by Epoch AI sets out on the ambitious goal of reproducing this.

But since the authors do not have infinite money, they decided to directly rerun their computations from DeepMind's own experiments! They took the figure from the last experiment (cf. the slide below), measured point positions, picked color codes, and ended up reconstructing the underlying data.

💥 They then just fit the scaling laws proposed by the Chinchilla authors, but arrived at wildly different results! They find that, as a rough rule of thumb, you should use 20 training tokens for each parameter in your model, instead of the 70 obtained in the original paper. They also point out inconsistencies in the paper, and unrealistically narrow confidence intervals.
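As a quick worked example of that rule of thumb (the 70B parameter count is just an illustration):

```python
# Compute-optimal token budget under the "20 tokens per parameter" rule of thumb.
params = 70e9
tokens = 20 * params
print(f"{tokens / 1e12:.1f}T tokens")  # 1.4T tokens
```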

➡️ This only contradicts the results from the last (out of 3) experiments in the Chinchilla paper. And the model trained at the end of the Chinchilla paper still seems properly scaled.

✅ But it does show that a tiny bit more theoretical work can go a long way, especially given the huge financial cost that such an error can have!
posted an update 5 months ago
Paper Review: Rho-1 - Do not use all tokens equally in your training! ⚖️⛔️

A new paper topping Daily Papers questions a hidden assumption in LLM training:

🤔 Should we really use all tokens equally in our LLM's training?

Some tokens are more relevant than others, and some are mostly noise (just look up the history of SolidGoldMagikarp).

So this paper introduces Selective Language Modeling, which is actually really simple:
➡️ A specific metric measures the relevance of each token. Then, during training, only the top-k% of tokens for this relevance metric count in the loss calculation (a minimal sketch is shown below).
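A minimal sketch of such a selective loss, assuming per-token relevance scores are already computed (the paper uses the excess loss versus a reference model as the relevance metric):

```python
# Selective Language Modeling loss sketch: only the top-k% most relevant tokens count.
import torch
import torch.nn.functional as F

def selective_lm_loss(logits, labels, relevance_scores, keep_ratio=0.6):
    """
    logits: (batch, seq_len, vocab), labels: (batch, seq_len),
    relevance_scores: (batch, seq_len) per-token relevance (e.g. excess loss vs. a reference model).
    """
    per_token_loss = F.cross_entropy(
        logits.flatten(0, 1), labels.flatten(), reduction="none"
    ).view_as(labels).float()

    # Keep only the top-k% of tokens by relevance; the others get zero weight in the loss.
    k = max(1, int(keep_ratio * relevance_scores.numel()))
    threshold = relevance_scores.flatten().topk(k).values.min()
    mask = (relevance_scores >= threshold).float()

    return (per_token_loss * mask).sum() / mask.sum()
```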

The authors test this method by training models on the difficult MATH dataset (competition mathematics problems only).

➡️ Their technique seems like a new must-do in LLM training: training is much faster and reaches impressive performance!

Results:
◆ ⏱️ Training is 5x to 10x faster to reach equivalent performance compared to standard language modeling.
◆ 💪 Their 1B model gets close to GPT-4 chain-of-thought performance on MATH!
◆ 🚀 Their 7B model matches the performance of the state-of-the-art DeepSeek model of the same size, while trained on only 3% of the tokens

Additional insights 💡
◆ Datasets used for pre-training, even after pre-filtering, still contain a large proportion of noisy tokens 😖
◆ The authors show that when you reduce the loss on noisy tokens, you actually reduce accuracy (Figure 7). So Selective Language Modeling seems fundamental! ✅

Find great reads in @akhaliq's Daily Papers 👉 https://huggingface.co/papers
Paper added to my collection 👉 m-ric/spinning-up-in-llms-659e698f9dd5a71bd3f579a7
posted an update 5 months ago
New Space: AI Travel Planner 🗺️🏕️ Plan your next vacation in a few minutes!

I wanted to find out whether a powerful LLM like Mixtral-8x7B has geographical reasoning capabilities.
So I built a small Space that prompts the LLM to provide a JSON list of places based on a user input.

And the result was impressive! 🤯

⇒ It seems like Mixtral has a grasp of geographical concepts like North - South, or spatial alignment. 🧭 Not just describing these concepts, but really applying them in practice, for instance to successfully answer "give me 4 European cities that are aligned on the map". This is a nice example of an emergent capability, since nothing in the LLM's training data should prepare it for this specific task.

Anyway, I added API calls and a nice visualization on top of the LLM, streaming output, caching for the answers and locations... and ta-da! ✨ I got the AI Travel Planner.

Describe your trip to it, and it will come up with nice and convenient locations!

Try it here 👉 m-ric/ai-travel-planner

Thank you @freddyaboulton for the gradio_folium component, and @clem, @pngwn, @abidlabs for your ideas and support!
posted an update 6 months ago
[New Paper] All tokens should not require the same effort to compute! ⇒ Mixture of Depths 🫧🐠

Google Researchers were unhappy with the way current decoding generally works: all tokens go through the same layers, thus requiring exactly the same effort to compute.

Whereas in reality, completing the answer to a difficult math problem for instance should be more computationally intense than completing the text of the Declaration of Independence: ๐—ป๐—ผ๐˜ ๐—ฎ๐—น๐—น ๐˜๐—ผ๐—ธ๐—ฒ๐—ป๐˜€ ๐—ฎ๐—ฟ๐—ฒ ๐—ฐ๐—ฟ๐—ฒ๐—ฎ๐˜๐—ฒ๐—ฑ ๐—ฒ๐—พ๐˜‚๐—ฎ๐—น!

โžก๏ธ ๐—ง๐—ต๐—ฒ๐˜† ๐—ต๐—ฎ๐—ฑ ๐˜๐—ต๐—ถ๐˜€ ๐—ด๐—ฒ๐—ป๐—ถ๐˜‚๐˜€ ๐—ถ๐—ฑ๐—ฒ๐—ฎ: ๐Ÿ’ก ๐—ต๐—ฎ๐˜ƒ๐—ถ๐—ป๐—ด ๐—ฎ ๐˜๐—ผ๐—ธ๐—ฒ๐—ป ๐—ด๐—ผ ๐˜๐—ต๐—ฟ๐—ผ๐˜‚๐—ด๐—ต ๐—ฎ ๐—ฏ๐—น๐—ผ๐—ฐ๐—ธ ๐˜€๐—ต๐—ผ๐˜‚๐—น๐—ฑ ๐—ฏ๐—ฒ ๐—ผ๐—ฝ๐˜๐—ถ๐—ผ๐—ป๐—ฎ๐—น. The token can go through the block (thus undergoing expensive self-attention computation) or avoid it through a skip connection.
The routing decision is made at the block level: each block selects from the whole sequence the top-k tokens that will go through it, and the other tokens skip it. This lets you choose the exact capacity of a block, i.e. the proportion of tokens that go through it, which directly influences the computational intensity of the forward pass.
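Here is a toy PyTorch sketch of that routing idea (heavily simplified: the real MoD also handles causality at sampling time and uses the router more carefully):

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """Toy Mixture-of-Depths wrapper: only the top-k tokens go through the wrapped block, the rest skip it."""

    def __init__(self, block: nn.Module, d_model: int, capacity: float = 0.125):
        super().__init__()
        self.block = block                   # e.g. a standard Transformer block
        self.router = nn.Linear(d_model, 1)  # one routing score per token
        self.capacity = capacity             # fraction of tokens allowed through the block

    def forward(self, x):                    # x: (batch, seq_len, d_model)
        scores = self.router(x).squeeze(-1)                     # (batch, seq_len)
        k = max(1, int(self.capacity * x.size(1)))
        top = scores.topk(k, dim=1).indices                     # positions of the routed tokens
        idx = top.unsqueeze(-1).expand(-1, -1, x.size(-1))      # (batch, k, d_model)
        routed_in = torch.gather(x, 1, idx)
        # Weight the block output by the router score so the routing decision stays differentiable
        routed_out = self.block(routed_in) * torch.sigmoid(torch.gather(scores, 1, top)).unsqueeze(-1)
        # Residual add at the original positions; skipped tokens are returned unchanged
        return x.scatter(1, idx, routed_in + routed_out)
```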

This yields Mixture-of-Depths (MoD), with spectacular results.

โœจ ๐—ฅ๐—ฒ๐˜€๐˜‚๐—น๐˜๐˜€:
๐ŸŽš๏ธ ๐—–๐—ฎ๐—ฝ๐—ฎ๐—ฐ๐—ถ๐˜๐˜† ๐—ฐ๐—ฎ๐—ป ๐—ฏ๐—ฒ ๐˜๐˜‚๐—ป๐—ฒ๐—ฑ ๐—ฎ๐—น๐—น ๐˜๐—ต๐—ฒ ๐˜„๐—ฎ๐˜† ๐—ฑ๐—ผ๐˜„๐—ป ๐˜๐—ผ ๐Ÿญ๐Ÿฎ.๐Ÿฑ% for every second block: thus 87.5% of tokens just skip the block!
๐Ÿš€ For the same training time and performance, >๐Ÿฒ๐Ÿฌ% ๐—ถ๐—ป๐—ณ๐—ฒ๐—ฟ๐—ฒ๐—ป๐—ฐ๐—ฒ ๐˜€๐—ฝ๐—ฒ๐—ฒ๐—ฑ!
๐Ÿค ๐—–๐—ฎ๐—ป ๐—ฏ๐—ฒ ๐—ฐ๐—ผ๐—บ๐—ฏ๐—ถ๐—ป๐—ฒ๐—ฑ ๐˜„๐—ถ๐˜๐—ต ๐— ๐—ถ๐˜…๐˜๐˜‚๐—ฟ๐—ฒ-๐—ผ๐—ณ-๐—˜๐˜…๐—ฝ๐—ฒ๐—ฟ๐˜๐˜€ for further improvements.

๐Ÿ“„ ๐—ฃ๐—ฎ๐—ฝ๐—ฒ๐—ฟ ๐—ต๐—ฒ๐—ฟ๐—ฒ ๐Ÿ‘‰ Mixture-of-Depths: Dynamically allocating compute in transformer-based language models (2404.02258)
๐Ÿ“š I added it to my paper collection ๐Ÿ‘‰ m-ric/spinning-up-in-llms-659e698f9dd5a71bd3f579a7
  • 1 reply
ยท
posted an update 6 months ago
view post
Post
1875
๐Ÿ๐ŸŽ๐Ÿ๐Ÿ’, ๐ญ๐ก๐ž ๐ฒ๐ž๐š๐ซ ๐จ๐Ÿ ๐š๐ ๐ž๐ง๐ญ ๐ฐ๐จ๐ซ๐ค๐Ÿ๐ฅ๐จ๐ฐ๐ฌ ๐Ÿ”ง๐Ÿฆพ๐Ÿค–

I've just watched Andrew Ng's talk at Sequoia last week.
If you're interested in Agents, you should really watch it!

๐—ช๐—ต๐˜† ๐˜‚๐˜€๐—ฒ ๐—ฎ๐—ด๐—ฒ๐—ป๐˜ ๐˜„๐—ผ๐—ฟ๐—ธ๐—ณ๐—น๐—ผ๐˜„๐˜€?
The current LLM task solving workflow is not very intuitive:
We ask it โ€œwrite an essay all in one shot, without ever using backspace.โ€

Why not allow the LLM a process more similar to what we would do?
- โ€œWrite an essay outlineโ€
- "Do you need web research?"
- โ€œWrite a first draftโ€
- โ€œConsider improvementsโ€
โ€ฆ

This is called an Agentic workflow. Existing ones bring a huge performance boost. With HumanEval: GPT-4 zero-shot gets 67% score, agentic with either one of tool use or reflection goes over 90%, and the combination of the two scores even higher!

๐—”๐—ด๐—ฒ๐—ป๐˜๐—ถ๐—ฐ ๐—ฟ๐—ฒ๐—ฎ๐˜€๐—ผ๐—ป๐—ถ๐—ป๐—ด ๐—ฑ๐—ฒ๐˜€๐—ถ๐—ด๐—ป ๐—ฝ๐—ฎ๐˜๐˜๐—ฒ๐—ฟ๐—ป๐˜€
On the following two points, the tech is robust:

โš™๏ธ ๐—ฅ๐—ฒ๐—ณ๐—น๐—ฒ๐˜…๐—ถ๐—ผ๐—ป: For instance: add a critic step after the writing step
๐Ÿ› ๏ธ ๐—ง๐—ผ๐—ผ๐—น ๐˜‚๐˜€๐—ฒ: extends the capabilities of the LLM by allowing it to call tools, like search or calculator

The next two will be needed to go further, but the tech for them is more emerging and not reliable yet:
๐Ÿ—บ๏ธ ๐—ฃ๐—น๐—ฎ๐—ป๐—ป๐—ถ๐—ป๐—ด forward to decompose task into subtasks. This allows great behaviours like an AI Agent re-routing after a failure
๐Ÿ ๐— ๐˜‚๐—น๐˜๐—ถ-๐—ฎ๐—ด๐—ฒ๐—ป๐˜ ๐—ฐ๐—ผ๐—น๐—น๐—ฎ๐—ฏ๐—ผ๐—ฟ๐—ฎ๐˜๐—ถ๐—ผ๐—ป: Program a flock of agents with tasks.
Improving the two above points will unlock huge performance boosts!
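To make the Reflexion pattern above concrete, here is a toy critique-and-revise loop (model name, prompts and number of rounds are my own choices, not Andrew Ng's code):

```python
from huggingface_hub import InferenceClient

client = InferenceClient("mistralai/Mixtral-8x7B-Instruct-v0.1")

def llm(prompt: str) -> str:
    return client.text_generation(prompt, max_new_tokens=800)

task = "Write a Python function that checks whether a string is a palindrome."
draft = llm(f"Solve this task:\n{task}")

for _ in range(2):  # a couple of critique / revise rounds
    critique = llm(f"Task: {task}\nDraft answer:\n{draft}\nList concrete flaws and possible improvements.")
    draft = llm(f"Task: {task}\nPrevious draft:\n{draft}\nCritique:\n{critique}\nWrite an improved answer.")

print(draft)
```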

Andrew Ng says research agents are already part of his workflow!

๐—–๐—น๐—ผ๐˜€๐—ถ๐—ป๐—ด ๐˜๐—ต๐—ผ๐˜‚๐—ด๐—ต๐˜๐˜€
Andrew speculates that through agentic workflows, maybe generating many tokens fast from a small LLM will give better results than slower throughput from a powerful LLM like GPT-5.

๐ŸŽฌ Watch the talk here ๐Ÿ‘‰ https://www.youtube.com/watch?v=sal78ACtGTc
๐Ÿ“š I've added his recommended reads to m-ric/agents-65ba776fbd9e29f771c07d4e
  • 1 reply
ยท
posted an update 6 months ago
view post
Post
1793
๐“๐ก๐ž ๐ซ๐ž๐ญ๐ฎ๐ซ๐ง ๐จ๐Ÿ ๐ญ๐ก๐ž ๐‘๐๐๐ฌ โš” ๐๐ž๐ฐ ๐Œ๐š๐ฆ๐›๐š-๐›๐š๐ฌ๐ž๐ ๐š๐ซ๐œ๐ก๐ข๐ญ๐ž๐œ๐ญ๐ฎ๐ซ๐ž "๐‰๐š๐ฆ๐›๐š"

Since the release of BERT by Google in 2018, the Transformer architecture has taken over machine learning thanks to its attention mechanism, which gives it the ability to focus on the important parts of the input. But attention computation is quadratic in the input length.

๐Ÿ’ซ The Mamba paper, published in December 2023, announced the return of the RNNs: it has no attention, but integrates a selection mechanism, which should be able to reproduce the โ€œfocusโ€ ability of attention, in an architecture for which the compute requirements ๐—ด๐—ฟ๐—ผ๐˜„ ๐—ผ๐—ป๐—น๐˜† ๐—น๐—ถ๐—ป๐—ฒ๐—ฎ๐—ฟ๐—น๐˜† ๐—ถ๐—ป ๐—ถ๐—ป๐—ฝ๐˜‚๐˜ ๐—น๐—ฒ๐—ป๐—ด๐˜๐—ต!
๐Ÿค” Would this work? We had yet to see a large Mamba model recovering the performance of Attention-based Transformers.

๐Ÿ’ฅ But now it's done! A (Mamba + Transformers) hybrid just beat Transformers!

The AI21 Labs team just released Jamba.
They insert a few Transformer layers to inject some attention in a big pile of Mamba layers, thus getting the best of both worlds.

๐™๐™‡;๐˜ฟ๐™:
๐Ÿ—๏ธ ๐—ก๐—ฒ๐˜„ ๐— ๐—ผ๐—˜ ๐—ฎ๐—ฟ๐—ฐ๐—ต๐—ถ๐˜๐—ฒ๐—ฐ๐˜๐˜‚๐—ฟ๐—ฒ: 4 Jamba blocks, each of these being 7 Mamba layers for 1 Transformer.
๐Ÿ‹๏ธ ๐Ÿฑ๐Ÿฎ๐—• ๐—ฝ๐—ฎ๐—ฟ๐—ฎ๐—บ๐—ฒ๐˜๐—ฒ๐—ฟ๐˜€, ๐Ÿญ๐Ÿฎ๐—• ๐—ฎ๐—ฐ๐˜๐—ถ๐˜ƒ๐—ฒ ๐—ฎ๐˜ ๐—ถ๐—ป๐—ณ๐—ฒ๐—ฟ๐—ฒ๐—ป๐—ฐ๐—ฒ: This reduction is enabled by Mixture of Experts, and similar to Mixtral (47B parameters - 13B active).
๐ŸŽ๏ธ ๐—ฆ๐—ฝ๐—ฒ๐—ฒ๐—ฑ: ๐˜…๐Ÿฏ ๐˜๐—ต๐—ฟ๐—ผ๐˜‚๐—ด๐—ต๐—ฝ๐˜‚๐˜. Jamba is much faster than similar-sized Transformer models on long contexts.
๐Ÿ“ ๐—–๐—ผ๐—ป๐˜๐—ฒ๐˜…๐˜ ๐—น๐—ฒ๐—ป๐—ด๐˜๐—ต: ๐Ÿญ๐Ÿฐ๐Ÿฌ๐—ž ๐˜๐—ผ๐—ธ๐—ฒ๐—ป๐˜€ on a single 80GB A100!
๐Ÿ’ช ๐—ฃ๐—ฒ๐—ฟ๐—ณ๐—ผ๐—ฟ๐—บ๐—ฎ๐—ป๐—ฐ๐—ฒ: ๐˜€๐˜๐—ฎ๐˜๐—ฒ-๐—ผ๐—ณ-๐˜๐—ต๐—ฒ-๐—ฎ๐—ฟ๐˜ ๐—ณ๐—ผ๐—ฟ ๐˜๐—ต๐—ถ๐˜€ ๐˜€๐—ถ๐˜‡๐—ฒ. The small injection of attention seems sufficient since Jamba beats the open-source reference Mixtral-8x7B on many benchmarks!

Try it here ๐Ÿ‘‰ ai21labs/Jamba-v0.1
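If you'd rather run it locally, here is a rough sketch of loading it with transformers (assuming a transformers version with Jamba support, the accelerate package for device_map, and enough GPU memory):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai21labs/Jamba-v0.1")
model = AutoModelForCausalLM.from_pretrained(
    "ai21labs/Jamba-v0.1", torch_dtype="auto", device_map="auto"
)

inputs = tokenizer("The best thing about hybrid Mamba-Transformer models is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```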
posted an update 6 months ago
view post
Post
1697
๐—›๐—ผ๐˜„ ๐—ฑ๐—ผ๐—ฒ๐˜€ ๐—ฏ๐—ฒ๐—ฎ๐—บ ๐˜€๐—ฒ๐—ฎ๐—ฟ๐—ฐ๐—ต ๐—ฑ๐—ฒ๐—ฐ๐—ผ๐—ฑ๐—ถ๐—ป๐—ด ๐˜„๐—ผ๐—ฟ๐—ธ? โžก๏ธ ๐™‰๐™š๐™ฌ ๐™ซ๐™ž๐™จ๐™ช๐™–๐™ก๐™ž๐™ฏ๐™–๐™ฉ๐™ž๐™ค๐™ฃ ๐™ฉ๐™ค๐™ค๐™ก! ๐Ÿ‘€

In Decoder-type LLMs like GPT-4 or Mistral Large, the output is generated one token (=word part) at a time. That's why they're nicknamed "stochastic parrots": the "thinking" process only happens one step at a time, so it can seem really myopic.

๐’๐จ ๐ก๐จ๐ฐ ๐ข๐ฌ ๐ญ๐ก๐ž ๐ง๐ž๐ฑ๐ญ ๐ญ๐จ๐ค๐ž๐ง ๐ฌ๐ž๐ฅ๐ž๐œ๐ญ๐ž๐?

๐Ÿ“Š Given its input sentence like "๐˜ž๐˜ฉ๐˜ข๐˜ต ๐˜ช๐˜ด ๐˜ต๐˜ฉ๐˜ฆ 7๐˜ต๐˜ฉ ๐˜๐˜ช๐˜ฃ๐˜ฐ๐˜ฏ๐˜ข๐˜ค๐˜ค๐˜ช ๐˜ฏ๐˜ถ๐˜ฎ๐˜ฃ๐˜ฆ๐˜ณ? ๐˜›๐˜ฉ๐˜ฆ 7๐˜ต๐˜ฉ ๐˜๐˜ช๐˜ฃ๐˜ฐ๐˜ฏ๐˜ข๐˜ค๐˜ค๐˜ช ๐˜ฏ๐˜ถ๐˜ฎ๐˜ฃ๐˜ฆ๐˜ณ", the Decoder LLM generates, for each token in its vocabulary, a score that represents this token's probability of coming next.
For instance: "๐™ž๐™จ" gets score 0.56, and "๐™˜๐™–๐™ฃ" gets score 0.35.

๐Ÿค‘ ๐†๐ซ๐ž๐ž๐๐ฒ ๐๐ž๐œ๐จ๐๐ข๐ง๐  is the naive option where you simply take the next most probable token at each step. But this creates paths that maximize very short-term rewards, thus may overlook better paths for the long term (like this time when you played FIFA all evening and arrived unprepared to your school exam on the next day).
In our example, the next highest score token might be "๐™ž๐™จ", but this will strongly bias the LLM towards giving an hasty response. On the opposite, starting with "๐™˜๐™–๐™ฃ" could have been completed with "๐˜ฃ๐˜ฆ ๐˜ฐ๐˜ฃ๐˜ต๐˜ข๐˜ช๐˜ฏ๐˜ฆ๐˜ฅ ๐˜ง๐˜ณ๐˜ฐ๐˜ฎ ๐˜ค๐˜ฐ๐˜ฎ๐˜ฑ๐˜ถ๐˜ต๐˜ช๐˜ฏ๐˜จ ๐˜ฑ๐˜ณ๐˜ฆ๐˜ท๐˜ช๐˜ฐ๐˜ถ๐˜ด ๐˜๐˜ช๐˜ฃ๐˜ฐ๐˜ฏ๐˜ข๐˜ค๐˜ค๐˜ช ๐˜ฏ๐˜ถ๐˜ฎ๐˜ฃ๐˜ฆ๐˜ณ๐˜ด ๐˜ง๐˜ช๐˜ณ๐˜ด๐˜ต", which steers the LLM towards a correct reasoning!

๐Ÿ—บ๏ธ ๐๐ž๐š๐ฆ ๐ฌ๐ž๐š๐ซ๐œ๐ก improves on greedy decoding by generating at each step several paths - called beams - instead of one. This allows the generation to explore a much larger space, thus find better completions. In our example, both the "๐™ž๐™จ" and the "๐™˜๐™–๐™ฃ" completion could be tested. โœ…
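Here is a hedged sketch of how to compare the two strategies with transformers' generate() (gpt2 is just a small stand-in model for illustration):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("What is the 7th Fibonacci number? The 7th Fibonacci number", return_tensors="pt")

greedy = model.generate(**inputs, max_new_tokens=30, do_sample=False)              # one path, best token at each step
beams = model.generate(**inputs, max_new_tokens=30, do_sample=False, num_beams=4)  # keeps 4 candidate paths alive

print("Greedy:     ", tokenizer.decode(greedy[0], skip_special_tokens=True))
print("Beam search:", tokenizer.decode(beams[0], skip_special_tokens=True))
```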

๐Ÿ‘‰ I've created a tool to let you visualize it, thank you @joaogante for your great help!
๐™๐™ง๐™ฎ ๐™ž๐™ฉ ๐™๐™š๐™ง๐™š: m-ric/beam_search_visualizer
posted an update 6 months ago
view post
Post
2033
๐—จ๐˜€๐—ถ๐—ป๐—ด ๐—Ÿ๐—Ÿ๐— -๐—ฎ๐˜€-๐—ฎ-๐—ท๐˜‚๐—ฑ๐—ด๐—ฒ ๐Ÿง‘โ€โš–๏ธ ๐—ณ๐—ผ๐—ฟ ๐—ฎ๐—ป ๐—ฎ๐˜‚๐˜๐—ผ๐—บ๐—ฎ๐˜๐—ฒ๐—ฑ ๐—ฎ๐—ป๐—ฑ ๐˜ƒ๐—ฒ๐—ฟ๐˜€๐—ฎ๐˜๐—ถ๐—น๐—ฒ ๐—ฒ๐˜ƒ๐—ฎ๐—น๐˜‚๐—ฎ๐˜๐—ถ๐—ผ๐—ป

Evaluating LLM outputs is often hard, since many tasks require open-ended answers for which no deterministic metrics work: for instance, when asking a model to summarize a text, there could be hundreds of correct ways to do it. The most versatile way to grade these outputs is then human evaluation, but it is very time-consuming, thus costly.

๐Ÿค” Then ๐˜„๐—ต๐˜† ๐—ป๐—ผ๐˜ ๐—ฎ๐˜€๐—ธ ๐—ฎ๐—ป๐—ผ๐˜๐—ต๐—ฒ๐—ฟ ๐—Ÿ๐—Ÿ๐—  ๐˜๐—ผ ๐—ฑ๐—ผ ๐˜๐—ต๐—ฒ ๐—ฒ๐˜ƒ๐—ฎ๐—น๐˜‚๐—ฎ๐˜๐—ถ๐—ผ๐—ป, by providing it relevant rating criteria? ๐Ÿ‘‰ This is the idea behind LLM-as-a-judge.
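A toy version of the idea looks like this (model choice, rubric and scale are my assumptions; the notebook below covers the actual tricks):

```python
from huggingface_hub import InferenceClient

judge = InferenceClient("mistralai/Mixtral-8x7B-Instruct-v0.1")

JUDGE_PROMPT = """You will be given a question and an answer.
Rate the answer from 1 (very bad) to 4 (excellent) for factual correctness and helpfulness, and justify your rating.
Question: {question}
Answer: {answer}
Reply exactly in the format: 'Rationale: <...> Score: <1-4>'"""

def judge_answer(question: str, answer: str) -> str:
    verdict = judge.text_generation(JUDGE_PROMPT.format(question=question, answer=answer), max_new_tokens=300)
    return verdict.split("Score:")[-1].strip()  # crude extraction of the grade

print(judge_answer("What causes tides?", "Mostly the Moon's gravity, with a smaller contribution from the Sun."))
```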

โš™๏ธ To implement a LLM judge correctly, you need a few tricks.
โœ… So ๐—œ'๐˜ƒ๐—ฒ ๐—ท๐˜‚๐˜€๐˜ ๐—ฝ๐˜‚๐—ฏ๐—น๐—ถ๐˜€๐—ต๐—ฒ๐—ฑ ๐—ฎ ๐—ป๐—ฒ๐˜„ ๐—ป๐—ผ๐˜๐—ฒ๐—ฏ๐—ผ๐—ผ๐—ธ ๐˜€๐—ต๐—ผ๐˜„๐—ถ๐—ป๐—ด ๐—ต๐—ผ๐˜„ ๐˜๐—ผ ๐—ถ๐—บ๐—ฝ๐—น๐—ฒ๐—บ๐—ฒ๐—ป๐˜ ๐—ถ๐˜ ๐—ฝ๐—ฟ๐—ผ๐—ฝ๐—ฒ๐—ฟ๐—น๐˜† ๐—ถ๐—ป ๐—ผ๐˜‚๐—ฟ ๐—›๐˜‚๐—ด๐—ด๐—ถ๐—ป๐—ด ๐—™๐—ฎ๐—ฐ๐—ฒ ๐—–๐—ผ๐—ผ๐—ธ๐—ฏ๐—ผ๐—ผ๐—ธ! (you can run it instantly in Google Colab)
โžก๏ธ ๐—Ÿ๐—Ÿ๐— -๐—ฎ๐˜€-๐—ฎ-๐—ท๐˜‚๐—ฑ๐—ด๐—ฒ ๐—ฐ๐—ผ๐—ผ๐—ธ๐—ฏ๐—ผ๐—ผ๐—ธ: https://huggingface.co/learn/cookbook/llm_judge

The Cookbook is a great collection of notebooks demonstrating recipes (thus the "cookbook") for common LLM usages. I recommend you to go take a look!
โžก๏ธ ๐—”๐—น๐—น ๐—ฐ๐—ผ๐—ผ๐—ธ๐—ฏ๐—ผ๐—ผ๐—ธ๐˜€: https://huggingface.co/learn/cookbook/index

Thank you @MariaK for your support!
  • 2 replies
ยท
posted an update 6 months ago
view post
Post
Interesting paper: ๐†๐š๐‹๐จ๐ซ๐ž: ๐ญ๐ซ๐š๐ข๐ง ๐Ÿ•๐ ๐ฆ๐จ๐๐ž๐ฅ๐ฌ ๐จ๐ง ๐œ๐จ๐ง๐ฌ๐ฎ๐ฆ๐ž๐ซ-๐ ๐ซ๐š๐๐ž ๐†๐๐”๐ฌ ๐Ÿ’ช
It's now possible to ๐™›๐™ช๐™ก๐™ก๐™ฎ ๐™ฅ๐™ง๐™š-๐™ฉ๐™ง๐™–๐™ž๐™ฃ a 7B model on a consumer-grade GPU of 24Gb RAM, without any performance loss!

The memory usage of training models has always been an acute issue. For instance, full pre-training of a 7B model used to eat ~50GB of GPU RAM!

The common workarounds to reduce memory load are:
- split the model across multiple GPUs ("sharding")
- quantize models: encode the weights with fewer bits

Another technique is to ๐™ฅ๐™ง๐™ค๐™Ÿ๐™š๐™˜๐™ฉ ๐™ฉ๐™๐™š ๐™ฌ๐™š๐™ž๐™œ๐™๐™ฉ ๐™ข๐™–๐™ฉ๐™ง๐™ž๐™ญ ๐™ฉ๐™ค ๐™ก๐™ค๐™ฌ๐™š๐™ง-๐™ง๐™–๐™ฃ๐™  ๐™จ๐™ฅ๐™–๐™˜๐™š๐™จ, (since sometimes the weights do not really vary on all dimensions): this can save a lot of space!
This low-rank projection can be done on adapters to preserve the original weights (go check out LoRA), but it still generally hurts the performance too much for pre-training.

โžก๏ธ Enter the authors of ๐˜Ž๐˜ข๐˜“๐˜ฐ๐˜ณ๐˜ฆ: ๐˜”๐˜ฆ๐˜ฎ๐˜ฐ๐˜ณ๐˜บ-๐˜Œ๐˜ง๐˜ง๐˜ช๐˜ค๐˜ช๐˜ฆ๐˜ฏ๐˜ต ๐˜“๐˜“๐˜” ๐˜›๐˜ณ๐˜ข๐˜ช๐˜ฏ๐˜ช๐˜ฏ๐˜จ ๐˜ฃ๐˜บ ๐˜Ž๐˜ณ๐˜ข๐˜ฅ๐˜ช๐˜ฆ๐˜ฏ๐˜ต ๐˜“๐˜ฐ๐˜ธ-๐˜™๐˜ข๐˜ฏ๐˜ฌ ๐˜—๐˜ณ๐˜ฐ๐˜ซ๐˜ฆ๐˜ค๐˜ต๐˜ช๐˜ฐ๐˜ฏ. They gather (and prove) interesting insights:
โ›” The weight matrix does not reliably converge to lower ranks during training.
โœ… But the gradient matrix does!

Based on these insights, ๐˜๐—ต๐—ฒ๐˜† ๐—ฏ๐˜‚๐—ถ๐—น๐—ฑ ๐—š๐—ฎ๐—Ÿ๐—ผ๐—ฟ๐—ฒ, that projects the gradient to lower ranks.
๐Ÿ—บ๏ธ ๐—š๐—ฟ๐—ฒ๐—ฎ๐˜ ๐—ถ๐—ฑ๐—ฒ๐—ฎ: to leave the optimization free to explore more space, they periodically re-build the low-rank projection throughout the training (a nice illustration is in the paper).
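Here is a toy illustration of the gradient projection step (my simplification with plain SGD, not the authors' implementation: the real memory saving comes from keeping the optimizer states, e.g. Adam moments, in the small rank-r space):

```python
import torch

def galore_sgd_step(weight, grad, P=None, lr=1e-3, rank=64, refresh=False):
    if refresh or P is None:
        # Rebuild the projection from the top-r left singular vectors of the current gradient
        U, _, _ = torch.linalg.svd(grad, full_matrices=False)
        P = U[:, :rank]                 # (m, r)
    low_rank_grad = P.T @ grad          # (r, n): this small matrix is what the optimizer would store
    update = P @ low_rank_grad          # back-project to full size (m, n)
    weight.data -= lr * update
    return P                            # re-use P for the next steps, until the next refresh
```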

๐Ÿค This method can even be combined with previous ones such as 8-bit Adam (quantizing the optimizer states to 8-bit).

โžก๏ธ ๐‘๐ž๐ฌ๐ฎ๐ฅ๐ญ๐ฌ:
๐Ÿ“‰ Of course, huge reduction in memory footprint allowing the training on consumer-grade GPU (cf figure).
๐Ÿ’ช No reduction in performance: this scales well up to 7B parameters (and was independently confirmed since) โ‡’ this is essential, it confirms that the method is viable!

Read the full paper here: GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection (2403.03507)
posted an update 7 months ago
view post
Post
๐Ÿ“š๐Ÿ”Ž If you're building RAG applications, you should check this out:

โš™๏ธ I've built a new space to let you visualize the chunks you get with different text splitting methods!

โžก๏ธ Visualize your chunks here:
m-ric/chunk_visualizer
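If you prefer to compare splitters in code, here is a rough sketch with LangChain's splitters (parameters are arbitrary, and the input file is hypothetical):

```python
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter

text = open("my_document.txt").read()  # hypothetical input file

naive_chunks = CharacterTextSplitter(separator="\n", chunk_size=512, chunk_overlap=0).split_text(text)
recursive_chunks = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64).split_text(text)

print(f"{len(naive_chunks)} naive chunks vs {len(recursive_chunks)} recursive chunks")
print(recursive_chunks[0][:200])
```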
  • 2 replies
ยท