
Kuldeep Singh Sidhu

singhsidhukuldeep

AI & ML interests

Seeking contributors for a completely open-source Data Science platform! singhsidhukuldeep.github.io

Organizations

singhsidhukuldeep's activity

posted an update 1 day ago
view post
Post
297
OpenAI's latest model, "o1", has demonstrated remarkable performance on the Norway Mensa IQ test, scoring an estimated IQ of 120.

Everyone should think before answering!

Key findings:

• o1 correctly answered 25 out of 35 IQ questions, surpassing average human performance
• The model excelled at pattern recognition and logical reasoning tasks
• Performance was validated on both public and private test sets to rule out training data bias

Technical details:

• o1 utilizes advanced natural language processing and visual reasoning capabilities
• The model likely employs transformer architecture with billions of parameters
• Improved few-shot learning allows o1 to tackle novel problem types

Implications:

• This represents a significant leap in AI reasoning abilities
• We may see AIs surpassing 140 IQ by 2026 if the trend continues
• Raises important questions about the nature of intelligence and cognition
  • 1 reply
posted an update 4 days ago
view post
Post
453
Researchers from Tencent have developed DepthCrafter, a novel method for generating temporally consistent long depth sequences for open-world videos using video diffusion models.

It leverages a pre-trained image-to-video diffusion model (SVD) as the foundation and uses a 3-stage training strategy on paired video-depth datasets:
1. Train on a large realistic dataset (1-25 frames)
2. Fine-tune temporal layers on realistic data (1-110 frames)
3. Fine-tune spatial layers on synthetic data (45 frames)

It adapts SVD's conditioning mechanism for frame-by-frame video input and employs latent diffusion in VAE space for efficiency.
Sprinkle some intelligent inference strategy for extremely long videos:
- Segment-wise processing (up to 110 frames)
- Noise initialization to anchor depth distributions
- Latent interpolation for seamless stitching
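
To make the stitching idea concrete, here is a rough sketch of segment-wise processing with latent interpolation (my own illustration in plain NumPy, not the authors' code; the segment length, overlap, and blending scheme are assumptions):

import numpy as np

def stitch_depth_latents(segment_latents, overlap):
    # Toy illustration: blend the overlapping frames of consecutive segments
    # with a linear ramp so the stitched depth sequence stays seamless.
    stitched = segment_latents[0]
    for seg in segment_latents[1:]:
        ramp = np.linspace(0.0, 1.0, overlap).reshape(-1, *([1] * (seg.ndim - 1)))
        blended = (1.0 - ramp) * stitched[-overlap:] + ramp * seg[:overlap]
        stitched = np.concatenate([stitched[:-overlap], blended, seg[overlap:]], axis=0)
    return stitched

# e.g. three 110-frame segments of 4x64x64 latents sharing a 25-frame overlap
segments = [np.random.randn(110, 4, 64, 64) for _ in range(3)]
full_sequence = stitch_depth_latents(segments, overlap=25)
print(full_sequence.shape)  # (280, 4, 64, 64)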

And outperforms SOTA methods on multiple datasets (Sintel, ScanNet, KITTI, Bonn).

Read here: https://depthcrafter.github.io
posted an update 5 days ago
view post
Post
717
If you're passionate about the latest in AI, self-driving technology, and humanoid robotics, you need to catch this episode featuring Andrej Karpathy, in which he discusses OpenAI, Tesla, and education. It's 44 minutes long, but you might have to slow it down given how fast he speaks!

Key Insights:

1. Self-Driving Cars as a Bridge to AGI:
Andrej explores the parallels between self-driving technology and Artificial General Intelligence (AGI), suggesting that in some respects, AGI has already been achieved within the realm of self-driving. Tesla's approach, which emphasizes software over expensive hardware like LIDAR, exemplifies this.

2. Tesla vs. Waymo: The Battle of Approaches:
Tesla relies on vision-based systems with minimal sensors, leveraging advanced neural networks for decision-making. This contrasts sharply with Waymo's sensor-heavy vehicles, highlighting a broader software versus hardware challenge that could define the future of scalable autonomous driving.

3. End-to-End Deep Learning:
Andrej highlights the transition from manually programmed systems to fully end-to-end deep learning models that "eat through the stack." At Tesla, this shift has significantly reduced reliance on C++ code, making neural networks the driving force in software and hardware integration.

4. Humanoid Robotics - More Than Just a Dream:
The shift from Tesla's automotive neural networks to humanoid robots like Optimus is nearly seamless. By using the same sensors and computational platforms, Tesla is redefining what a robotics company can achieve at scale, bridging the gap between vehicle AI and human-like robotics.

And...

5. The Power of Transformers in AI

6. Synthetic Data: The Future of AI Training

7. AI for Education - A Revolutionary Approach

The full (very fast-talking) conversation is here: https://youtu.be/hM_h0UA7upI
posted an update 6 days ago
view post
Post
1034
1 hour with OpenAI's o1, and here are my thoughts...

My observations so far:

- Slower response times: o1 can take 10+ seconds to answer some questions, as it spends more time "thinking" through problems. In my case, it took over 50 seconds.

- Less likely to admit ignorance: The models are reported to be less likely to admit when they don't know the answer to a question.

- Higher pricing: o1-preview is significantly more expensive than GPT-4o, costing 3x more for input tokens and 4x more for output tokens in the API. With more thinking and more tokens, this could require houses to be mortgaged!

- Do we need this?: While it's better than GPT-4o for complex reasoning, on many common business tasks, its performance is just equivalent.

- Not a big deal: No comparisons to Anthropic or Google DeepMind Gemini are mentioned or included.

- This model tries to think and iterate over the response on its own! Think of it as an inbuilt CoT on steroids! Would love a technical review paper on the training process.

A must-read paper: https://cdn.openai.com/o1-system-card.pdf
posted an update 8 days ago
view post
Post
931
Reflection-Llama-3.1-70B burst onto the scene, surprising everyone! It claimed to outperform others with its novel Reflection-Tuning technique, promising not just to match but to surpass the likes of Claude 3.5 and GPT-4o, leveraging its 70 billion parameters to redefine what open-source could achieve.

And now, everything is crumbling!

The model's performance metrics, especially its 99.2% accuracy on the high-school math dataset GSM8K, have raised eyebrows. It looked like a valedictorian on paper, but based on the open weights, it hardly performs like one.

When loaded in Transformers, the released checkpoint behaves like Llama 3, not Llama 3.1.

While the weights were released publicly, they do not line up with the claimed results. The tuning has been restarted, and the author says updated weights will be uploaded soon!

And the big one: the black-boxed API that was shared behaves nothing like the open weights. Even more, when pushed hard, the API endpoint claims to be an LLM by Anthropic!

But you might ask, didn't this model beat Anthropic's Claude 3.5? On the reported numbers, yes, it did.

So, did Claude 3.5 beat Claude 3.5? Not quite: the official benchmarks are zero-shot, while the claimed results appear to have been obtained with CoT/few-shot prompting!

And to top it all off, the idea of reflecting on one's own output is not new. But I don't think that's a big deal.

I took some time to look through everything, and once tested, this model looks to be worse than Llama 3.1 70B.

I still believe the Reflection-Tuning technique is promising. These are the papers discussing its efficacy:
- "Think Before You Speak: Training Language Models With Pause Tokens"
- "Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning"

PS: Matt Shumer/@mattshumer_ (Twitter Handle) (Reflection-Llama-3.1-70B creator) is a great researcher. Let's wait for his updated weights!

Great YT video: https://youtu.be/Xtr_Ll_A9ms

Clem Delangue of Hugging Face, can you please help here if possible? This would be the pinnacle of open source!
  • 1 reply
posted an update 9 days ago
view post
Post
1181
Remember when @Google launched MediaPipe in an effort to create efficient on-device pipelines?

They've just unlocked the ability to run 7B+ parameter language models directly in your browser. This is a game-changer for on-device AI!

Yes, they are streaming 8.6 GB model files!

Currently, they have Gemma 2B/7B running, but imagine Dynamic LoRA, multimodal support, quantization, and you never leaving Chrome!

This is a significant technical advancement, especially in Memory Optimization:

- Redesigned the model-loading code to work around WebAssembly's 4 GB memory limit.
- Implemented asynchronous loading of transformer stack layers (28 for Gemma 1.1 7B).
- Reduced peak WebAssembly memory usage to less than 1% of previous requirements.

Cross-Platform Compatibility
- Compiled the C++ codebase to WebAssembly for broad browser support.
- Utilized the WebGPU API for native GPU acceleration in browsers.

Here's why this matters:

1. Privacy: No need to send data to remote servers.
2. Cost-Efficiency: Eliminates server expenses.
3. Offline Capabilities: Use powerful AI without an internet connection.

Blog: https://research.google/blog/unlocking-7b-language-models-in-your-browser-a-deep-dive-with-google-ai-edges-mediapipe/
posted an update 10 days ago
view post
Post
3435
This is an absolutely mind-boggling experiment!

@GuangyuRobert (Twitter Handle) from MIT has created Project Sid, which simulates over 1,000 autonomous AI agents collaborating in a Minecraft environment, operating for extended periods without human intervention. This simulation demonstrates unprecedented levels of agent interaction, decision-making, and societal development.

Agents operate independently for hours or days, showcasing advanced decision-making algorithms and goal-oriented behavior.

The simulation produced complex, emergent phenomena, including:
- Economic systems with currency (gems) and trading
- Cultural development and religious practices
- Agents even understood bribing. Priests were moving the most gems to bribe people into following them!
- Governmental structures and democratic processes

Project Sid addresses fundamental challenges in AI research:
- Coherence: Maintaining consistent agent behavior over extended periods.
- Multi-agent Collaboration: Enabling effective communication and coordination among numerous AI entities.
- Long-term Progression: Developing agents capable of learning and evolving over time.

While Minecraft serves as the initial testbed, the underlying AI architecture is designed to be game-agnostic, suggesting potential applications in various digital environments and real-world simulations.

Imagine a policy being debated by the government and how it might affect society; Sid can simulate its impact!

Even if this remains just a game experiment, the project successfully manages 1,000+ agents simultaneously, a feat that requires robust distributed computing and efficient agent architecture.
posted an update 12 days ago
view post
Post
1796
Google's Chain-of-Thought (CoT) is one of the most effective ways to improve LLMs' reasoning.

Researchers have now developed a novel approach called Strategic Chain-of-Thought (SCoT) to enhance the reasoning capabilities of large language models even further.

SCoT uses a two-stage process within a single prompt:
- Strategy Elicitation: The model first identifies and determines an effective problem-solving strategy for the given task. This becomes the strategic knowledge that guides the reasoning process.
- Strategy Application: The model then applies the identified strategic knowledge to solve the problem and generate the final answer.

Essentially, SCoT integrates strategic knowledge to guide reasoning without relying on external knowledge sources or multiple queries.

According to the research, SCoT showed significant improvements over standard CoT across various datasets, including a 21.05% increase on the GSM8K math dataset and a 24.13% increase on the Tracking_Objects spatial reasoning task.

Changes in the Prompt Structure:
The SCoT prompt typically consists of five components:
- Role: Defines the expert role the model should assume.
- Workflow: Outlines the steps for strategy identification and application.
- Rules: Specifies guidelines for generating answers.
- Initialization: Sets up the task.
- Task Input: Provides the specific problem to solve.

Strategy Generation:
The model is prompted to generate strategic knowledge relevant to the problem domain. For example, in mathematics, it might favor elegant solutions like using arithmetic series formulas over brute-force calculations.

Guided Reasoning:
Using the elicited strategy, the model then generates a chain-of-thought reasoning path. This approach aims to produce more stable and higher-quality outputs compared to standard chain-of-thought methods.
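
To make that prompt structure concrete, here is a hedged sketch of what a five-part SCoT-style prompt might look like (the wording and the sample problem are my own illustration, not taken from the paper):

# Illustrative SCoT-style prompt template (my own wording, not the paper's).
SCOT_PROMPT = """Role: You are an expert mathematician.

Workflow:
1. First, identify an effective problem-solving strategy for the task below (the strategic knowledge).
2. Then, apply that strategy step by step to solve the problem.

Rules:
- State the chosen strategy explicitly before solving.
- End your response with "Final answer: <answer>".

Initialization: A math word problem follows.

Task Input: {problem}
"""

problem = "What is the sum of the integers from 1 to 100?"
# A good elicited strategy here would be the arithmetic-series formula n*(n+1)/2
print(SCOT_PROMPT.format(problem=problem))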

Read the full paper: https://arxiv.org/abs/2409.03271
  • 1 reply
posted an update 13 days ago
view post
Post
760
Good folks at Epoch AI have just released their most comprehensive database yet, tracking over 800 state-of-the-art and historically notable AI models. This incredible resource provides key insights into the factors driving machine learning progress.

Since 2010, the training compute used to create AI models has been growing at a staggering rate of 4.1x per year. That means the computational power behind these models is doubling roughly every six months! And it's not just the compute that's increasing - the costs are too. Training compute costs for the largest models are doubling every nine months, with the most advanced models now costing hundreds of millions of dollars.
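
A quick back-of-the-envelope check on that doubling time:

import math

# 4.1x growth per year implies a doubling time of log(2) / log(4.1) years
doubling_time_years = math.log(2) / math.log(4.1)
print(f"{doubling_time_years:.2f} years, i.e. about {doubling_time_years * 12:.0f} months")  # ~0.49 years, ~6 months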

Interestingly, training compute has scaled up faster for language models compared to vision. While the largest vision and language models had similar compute requirements before 2020, language models have since rapidly outpaced vision models, driven by the success of transformer architectures. The size of datasets used to train language models is also doubling approximately every eight months.

Another fascinating trend is that the length of time spent training notable models is growing by about 1.2x per year. While longer training times could ease hardware constraints, there is a tradeoff to consider. For very long runs, waiting for algorithmic and hardware improvements might be more beneficial than simply extending training.

If this continues, by 2028 we will see training clusters costing on the order of 100 billion dollars and drawing 10 GW of power!

Link: https://epochai.org/data/notable-ai-models
posted an update 14 days ago
view post
Post
1606
Just wrapped up a deep dive into the latest lecture on building LLMs, such as ChatGPT, from @Stanford CS229 course. Here are my top takeaways:

๐Ÿ” Understanding the Components: LLMs like ChatGPT, Claude, and others are more than just neural networks; they are a complex blend of architecture, training loss, data evaluation, and systems. Knowing how these components work together is key to improving and scaling these models.

๐Ÿ“Š Scaling Matters: Performance improves predictably with more data, bigger models, and greater computational power. However, balancing these factors is crucial to avoid overfitting and resource waste.

๐Ÿ“ˆ Data is King: LLMs are trained on trillions of tokens scraped from the internet, but the quality of this data matters immensely. Rigorous filtering and deduplication processes are essential to maintaining data integrity.

๐Ÿ—๏ธ Pre-Training vs. Post-Training: While pre-training equips the model with general knowledge, post-training (like RLHF) fine-tunes it to follow human-like responses, reducing toxic outputs and improving alignment with human values.

๐ŸŒ Reinforcement Learning from Human Feedback (RLHF): This technique allows LLMs to maximize outputs that align with human preferences, making models more reliable and accurate.

๐Ÿ’ก Why It Matters: Understanding these processes not only helps us appreciate the complexity behind our everyday AI tools but also highlights the challenges and opportunities in the ever-evolving field of AI.

Whether youโ€™re in tech, data science, or just AI-curious, staying updated on these advancements is crucial. LLMs are not just transforming industries; theyโ€™re redefining the future of human-computer interaction!

I just realized this was almost 2 hours long...

Link: https://www.youtube.com/watch?v=9vM4p9NN0Ts
posted an update 18 days ago
view post
Post
845
Just tried LitServe from the good folks at @LightningAI !

Between llama.cpp and vLLM, there is a small gap where a few large models are not deployable!

That's where LitServe comes in!

LitServe is a high-throughput serving engine for AI models built on FastAPI.

Yes, built on FastAPI. That's where the advantage and the issue lie.

It's extremely flexible and supports multi-modality and a variety of models out of the box.

But in my testing, it lags far behind in speed compared to vLLM.

Also, no OpenAI API-compatible endpoint is available as of now.

But as we move to multi-modal models and agents, this serves as a good starting point. However, it's got to become faster...
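
For a sense of how little code a server takes, here is a minimal sketch along the lines of LitServe's documented LitAPI/LitServer interface (the toy "model" below is just a placeholder):

import litserve as ls

class EchoAPI(ls.LitAPI):
    def setup(self, device):
        # Load your real model here (e.g. a transformers pipeline) onto `device`
        self.model = lambda prompt: prompt[::-1]  # stand-in "model"

    def decode_request(self, request):
        return request["prompt"]

    def predict(self, prompt):
        return self.model(prompt)

    def encode_response(self, output):
        return {"output": output}

if __name__ == "__main__":
    server = ls.LitServer(EchoAPI(), accelerator="auto")
    server.run(port=8000)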

GitHub: https://github.com/Lightning-AI/LitServe
posted an update about 1 month ago
view post
Post
1691
Feeling thankful...

15th August, 2024; on India's 78th Independence Day

Crossed 100 followers on Hugging Face

Got LinkedIn Top Voice

AI has never been more exciting and I am here for it

@clem Can I be a Hugging Face fellow now?
posted an update about 1 month ago
view post
Post
2612
It took Google's original Transformer model from 2017 a whopping ~$930 to train.

This, in contrast to the $191 million Google spent on Gemini Ultra, sounds like a bargain!

Gemini Ultra required 50 billion petaFLOPS (one petaFLOP equals one quadrillion FLOPs).
Compare that to OpenAI's GPT-4, which required 21 billion petaFLOPS at a cost of $78 million.

2017: Original Transformer Model: $930 [@Google]
2018: BERT-Large: $3,288 [@Google]
2019: RoBERTa Large: $160K [@Meta]
2020: GPT-3 (175B): $4.32M [@OpenAI]
2023: Llama 2 70B: $3.93M [@Meta]
2023: GPT-4: $78.35M [@OpenAI]
Now, Gemini Ultra: $191.4M [@Google]

This forms an exponential curve! ๐Ÿคฏ

But, why? ๐Ÿค”
Compute, data, and expertise. All three come at a great cost! โš™๏ธ๐Ÿ“Š๐Ÿ’ก

Google recently made Gemini-1.5-Flash fine-tuning free, as it's almost impossible for regular businesses to justify an in-house trained foundational model! ๐Ÿ†“

This barrier of cost is going to result in fewer new foundational models/less competition and more fine-tunes! ๐Ÿ“‰๐Ÿ”„

Data [Stanford Universityโ€™s 2024 AI Index Report]: https://aiindex.stanford.edu/report/
Graphic: https://voronoiapp.com/technology/Googles-Gemini-Ultra-Cost-191M-to-Develop--1088

Many thanks to everyone spending tons of resources and open-sourcing the models! ๐Ÿค—
posted an update about 1 month ago
view post
Post
2125
AutoGen from @Microsoft is crazy! ๐Ÿš€ It's an open-source framework that allows LLM agents to chat with each other to solve your tasks. ๐Ÿค–๐Ÿ’ฌ

They use the Assistant-Agent and User-Proxy-Agent framework! ๐Ÿ› ๏ธ

As the name suggests, the Assistant-Agent does the work, and the User-Proxy-Agent behaves like a human, guiding the Assistant-Agent and double-checking its work! ๐Ÿง‘โ€๐Ÿ’ปโœ…

Both Assistant-Agent and User-Proxy-Agent can be the same or different LLMs. ๐Ÿค”๐Ÿ”„

AutoGen is an open-source programming framework for building AI agents and facilitating cooperation among multiple agents to solve tasks. ๐ŸŒŸ

This is truly amazing for building agentic AI quickly! ๐Ÿš€โœจ

GitHub: https://github.com/microsoft/autogen ๐Ÿ”—


from autogen import AssistantAgent, UserProxyAgent, config_list_from_json

# Load the LLM configuration (API keys, model names) from the OAI_CONFIG_LIST file
config_list = config_list_from_json(env_or_file="OAI_CONFIG_LIST")

# The assistant agent does the work (reasons about the task, writes code)
assistant = AssistantAgent("assistant", llm_config={"config_list": config_list})
# The user-proxy agent stands in for a human: it runs the assistant's code locally and reports back
user_proxy = UserProxyAgent("user_proxy", code_execution_config={"work_dir": "coding", "use_docker": False})

user_proxy.initiate_chat(assistant, message="Plot a chart of NVDA and TESLA stock price change YTD.")
# This initiates an automated chat between the two agents to solve the task

posted an update about 1 month ago
view post
Post
926
Remember when Claude 3.5 Sonnet by @AnthropicAI took the world by storm with Claude Artifacts? ๐ŸŒโœจ

Now we have LlamaCoder, an open-source Claude Artifacts app that can generate full React apps and components with Meta-Llama 3.1 405B. ๐Ÿ’ป 100% free and open source. ๐Ÿ†“

I like how Llama has now started becoming a placeholder denoting open-source work! ๐Ÿ”“
Originally, Llama was an acronym for Large Language Model Meta AI. ๐Ÿค–

GitHub: https://github.com/Nutlope/llamacoder
Demo (by togetherAI): https://llamacoder.together.ai
  • 2 replies
posted an update about 1 month ago
view post
Post
2756
What is the best LLM for RAG systems? ๐Ÿค”

In a business setting, it will be the one that gives the best performance at a great price! ๐Ÿ’ผ๐Ÿ’ฐ

And maybe it should be easy to fine-tune, cheap to fine-tune... FREE to fine-tune? ๐Ÿ˜ฒโœจ

That's @Google Gemini 1.5 Flash! ๐Ÿš€๐ŸŒŸ

It now supports fine-tuning, and the inference cost is the same as the base model! <coughs: LoRA adapters>

So the base model must be expensive? ๐Ÿ’ธ
For the base model, the input price is reduced by 78% to $0.075/1 million tokens and the output price by 71% to $0.3/1 million tokens. ๐Ÿ“‰๐Ÿ’ต

But is it any good? ๐Ÿคทโ€โ™‚๏ธ
On the LLM Hallucination Index, Gemini 1.5 Flash achieved great context adherence scores of 0.94, 1, and 0.92 across short, medium, and long contexts. ๐Ÿ“Š๐ŸŽฏ

Google has finally given a model that is free to tune and offers an excellent balance between performance and cost. โš–๏ธ๐Ÿ‘Œ

Happy tuning... ๐ŸŽถ๐Ÿ”ง

Gemini 1.5 Flash: https://developers.googleblog.com/en/gemini-15-flash-updates-google-ai-studio-gemini-api/ ๐Ÿ”—

LLM Hallucination Index: https://www.rungalileo.io/hallucinationindex ๐Ÿ”—
  • 1 reply
posted an update about 1 month ago
view post
Post
642
Remember when @mistralAI said large enough and casually dropped Mistral-Large-Instruct-2407? ๐Ÿคฏ๐Ÿš€

It's now on http://lmsys.org! It works amazingly well for instruction following, hard prompts, coding, and longer queries, with only 123 billion parameters.

It outperforms GPT4-Turbo and Claude 3 Opus on Coding, Hard Prompts, Math, and Longer Query categories. ๐Ÿ“ˆ๐Ÿ”ข

It also outperforms Llama 3.1 405B on Instruction Following while being 3x smaller. ๐ŸŽ๐Ÿ”

It also does exceedingly well on the Ai2 ZebraLogic logical-reasoning benchmark despite being much smaller than the other models.

Mistral is not here to take part but to take over! ๐Ÿ†๐ŸŒŸ

Model: https://mistral.ai/news/mistral-large-2407/
posted an update about 1 month ago
view post
Post
2175
๐Ÿ—“๏ธ Remember when last April, @Meta released Segment Anything Model (SAM) paper and it was too good to be true. ๐Ÿคฏ

They have now released Segment Anything Model 2 (SAM 2) and it's mind-blowingly great! ๐Ÿš€

SAM 2 is the first unified model for segmenting objects across images and videos. You can use a click, box, or mask as the input to select an object on any image or frame of video. ๐Ÿ–ผ๏ธ๐Ÿ“น

SAM consists of an image encoder to encode images and a prompt encoder to encode prompts; the outputs of these two are fed to a mask decoder that generates masks.

The biggest jump of SAM2 from SAM is using memory to have consistent masking across frames! They call it masklet prediction! ๐Ÿง 

They have also released the dataset, SA-V.
This dataset is truly huge, with 190.9K manual annotations and 451.7K automatic ones!

๐Ÿ“„ Paper: https://ai.meta.com/research/publications/sam-2-segment-anything-in-images-and-videos/

๐Ÿ“ Blog: https://ai.meta.com/sam2/

๐Ÿ”— Demo: https://sam2.metademolab.com/demo

๐Ÿ’พ Model Weights: https://github.com/facebookresearch/segment-anything-2/blob/main/checkpoints/download_ckpts.sh

๐Ÿ“ Dataset: https://ai.meta.com/datasets/segment-anything-video-downloads/
  • 1 reply
posted an update about 1 month ago
view post
Post
747
Looks like @Google is still not satisfied with Gemini 1.5 Pro! ๐Ÿ˜ฒ

Good folks at @GoogleDeepMind quietly updated the already good Gemini 1.5 Pro to Gemini-1.5-Pro-Experiment-0801 ๐Ÿš€

Unremarkable naming aside, the model itself outperforms GPT-4o, Claude-3.5, and Llama 3.1 on LMSYS and the Vision Leaderboard.

Gemini-1.5-Pro-Experiment-0801 is great at almost everything, multi-lingual tasks, Maths, understanding, and coding. ๐ŸŒ๐Ÿ“š๐Ÿ’ป

Although in my testing, I felt Claude-3.5 was slightly better at coding! ๐Ÿ‘จโ€๐Ÿ’ป๐Ÿค”

Also, I still cannot find an LLM that can solve the "Strawberry prompt"!

"""
How many R's are there in Strawberry?
Also, write Strawberry with all r's in brackets
"""

Try here: https://aistudio.google.com/app/prompts/new_chat
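
For reference, a quick Python check of what the correct answer should look like:

word = "Strawberry"
print(word.lower().count("r"))  # 3
print("".join(f"[{c}]" if c.lower() == "r" else c for c in word))  # St[r]awbe[r][r]y
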
  • 1 reply
posted an update about 2 months ago
view post
Post
2344
Hello, Hugging Face community,

To all the amazing people quantising LLMs to AWQ and GPTQ:

Can you please mention the perplexity you achieved, OR any other metric that measures the quantisation quality?

The GGUF community follows this really well!

And if it is not too much to ask, the script used for quantisation would be amazing!

Thanks for the quants for the GPU poor!
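
For anyone wondering what reporting that would involve, here is a rough sliding-window perplexity sketch with transformers (the model id, text file, and window sizes are placeholders, and the fixed-stride masking is an approximation):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-awq-or-gptq-model"  # placeholder: the quant you want to evaluate
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

text = open("held_out_text.txt").read()  # any held-out corpus, e.g. WikiText-2 test
input_ids = tok(text, return_tensors="pt").input_ids
max_length, stride = 2048, 512
nlls, counted = [], 0

for begin in range(0, input_ids.size(1) - 1, stride):
    end = min(begin + max_length, input_ids.size(1))
    window = input_ids[:, begin:end].to(model.device)
    targets = window.clone()
    targets[:, :-stride] = -100  # only score the trailing `stride` tokens of each window
    with torch.no_grad():
        loss = model(window, labels=targets).loss
    n = (targets != -100).sum().item()
    nlls.append(loss * n)
    counted += n
    if end == input_ids.size(1):
        break

print("perplexity:", torch.exp(torch.stack(nlls).sum() / counted).item())
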
  • 1 reply
replied to their post about 2 months ago
view reply

Why do you say that, what task are you using it for?

posted an update about 2 months ago
view post
Post
651
When @MistralAI drops a blog post labelled "Large Enough," it's going to get serious! ๐Ÿš€๐Ÿ’ก

- Mistral-Large-Instruct-2407, just call it Mistral Large 2, is a 123B-parameter Instruct model with a 128k context

- Multilingual in 11 languages: English, French, German, Spanish, Italian, Chinese, Japanese, Korean, Portuguese, Dutch, and Polish

- Also highly focused on programming, trained on 80+ coding languages such as Python, Java, C, C++, JavaScript, and Bash

- Supports native function calling and structured output

- Released under the Mistral Research License (non-commercial, research only)

- Open weights only; no data or code released

Definitely firing shots at @Meta Llama3.1: ๐ŸŽฏ๐Ÿ”ฅ
MMLU - 84.0% (ML2) vs 79.3% (L3.1-70B) vs 85.2% (L3.1-405B)
GSM8K - 93% (ML2) vs 95.5% (L3.1-70B-Ins) vs 96.8% (L3.1-405B-Ins)

Also, it's kinda chunky!
fp16/bf16: ~250 GB VRAM
fp8/int8: ~125 GB VRAM
int4: ~60 GB VRAM
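
Those numbers line up with a simple bytes-per-parameter estimate (weights only, ignoring KV cache and activations):

params = 123e9
for precision, bytes_per_param in [("fp16/bf16", 2), ("fp8/int8", 1), ("int4", 0.5)]:
    print(f"{precision}: ~{params * bytes_per_param / 1e9:.0f} GB")
# fp16/bf16: ~246 GB, fp8/int8: ~123 GB, int4: ~62 GB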

I tried quantising it to AWQ and GPTQ, but couldn't with 30 GB of VRAM.

Also calling out AWQ and GPTQ for not supporting multi-GPU quantisation!

Godsend @casperhansen has posted an AWQ-quantised INT4 model (68.68 GB) with a perplexity of 2.889: casperhansen/mistral-large-instruct-2407-awq

Looks like open AI is going to beat OpenAI! ๐Ÿ†๐Ÿค–

Blog post: https://mistral.ai/news/mistral-large-2407/

Models: mistralai/Mistral-Large-Instruct-2407
replied to their post about 2 months ago
posted an update about 2 months ago
view post
Post
1546
Yet another post hailing how good Meta Llama 3.1 is? ๐Ÿค” I guess not!

While Llama 3.1 is truly impressive, especially 405B (which gives GPT-4o a run for its money! ๐Ÿ’ช)

I was surprised to see that on the Open LLM Leaderboard, Llama 3.1 70B was not able to dethrone the current king Qwen2-72B! ๐Ÿ‘‘

Not only that, for a few benchmarks like MATH Lvl 5, it was completely lagging behind Qwen2-72B! ๐Ÿ“‰

Also, the benchmarks are completely off compared to the official numbers from Meta! ๐Ÿคฏ

Based on the responses, I still believe Llama 3.1 will perform better than Qwen2 on LMSYS Chatbot Arena. ๐Ÿค– But it still lags behind on too many benchmarks! ๐Ÿƒโ€โ™‚๏ธ

Open LLM Leaderboard: open-llm-leaderboard/open_llm_leaderboard ๐ŸŒ

Hopefully, this is just an Open LLM Leaderboard error! @open-llm-leaderboard SOS! ๐Ÿšจ
replied to their post about 2 months ago
posted an update about 2 months ago
view post
Post
2745
Meta Researchers: How many compute hours should we use to train Llama 3.1?
Mr. Zuck: Yes! ๐Ÿค–๐Ÿ’ช

Good folks at @AIatMeta did not just release the models but also published a 92-page detailed paper ๐Ÿ“„ on their findings and technical aspects of the models and their training process!

Generally, we just gobble up these weights and forget the compute infrastructure used to train these models. ๐Ÿ–ฅ๏ธ๐Ÿš€


Here are some interesting findings about the computing infrastructure of Llamas:

- Llama 1 and 2 models were trained on @Meta 's AI Research SuperCluster. Llama 3 was migrated to Meta's production clusters!

- That's 16,000 H100 GPUs, each featuring a 700W TDP and 80GB of HBM3, arranged in Meta's Grand Teton AI server platform.

- What about storing checkpoints? They used Tectonic, a distributed file system, with capacities reaching 240 PB and peak throughput of 7 TB/s.

- Meta's mad lads saved each GPU's model state, ranging from 1 MB to 4 GB per GPU, for recovery and debugging.


If this sounds big, well, they document the humungous challenges that come with it:

- In the 54-day training period, there were 466 job interruptions. ๐Ÿ•’๐Ÿ”„

- About 78% of unexpected interruptions were attributed to confirmed or suspected hardware issues. Mostly GPUs! ๐Ÿ’ฅ๐Ÿ–ฅ๏ธ

- Saving all checkpoints is cool until you do it for a 300B+ parameter model. The bursty nature of checkpoint writes, essential for state-saving during training, periodically saturated the storage fabric, impacting performance.

- With all this, effective training time (the time spent on useful training over the elapsed time) was higher than 90%.

I think this is the stuff that movies can be made on! ๐ŸŽฌ๐ŸŒŸ

Paper: https://ai.meta.com/research/publications/the-llama-3-herd-of-models/
posted an update about 2 months ago
view post
Post
700
๐Ÿš€ Good folks at @nvidia just dropped: "ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities" ๐Ÿง ๐Ÿ’ก

In the past few months, the open LLM community has made significant progress, releasing open models (Llama-3-70B-Instruct (@Meta AI), Qwen2-72B-Instruct (@AlibabaGroup), Nemotron-4-340B-Instruct (@nvidia), and Mixtral-8x22B-Instruct-v0.1 (@MistralAI)) that are on par with proprietary models!

But top models like GPT-4 are still outperforming them in certain domains! ๐Ÿ”๐Ÿ’ช

This has led to domain-focused open LLMs (DeepSeek-Coder-V2 for coding and math, ChatQA 1.5 for conversational QA and retrieval-augmented generation (RAG), and InternVL 1.5 for vision-language tasks).

The challenges that ChatQA 2 focuses on are context length and RAG!

These are the two capabilities essential for LLMs to process large volumes of information that cannot fit into a single prompt and are complementary to each other, depending on the downstream tasks and computational budgets. ๐Ÿงฉ๐Ÿ“Š

The solution is a detailed continued training recipe to extend the context window of Llama3-70B-base from 8K to 128K tokens, along with a three-stage instruction tuning process to enhance the model's instruction-following, RAG performance, and long-context understanding capabilities. ๐Ÿ”„๐Ÿ”ง

๐Ÿ“„ Paper: ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities (2407.14482)

The interesting thing to notice from the benchmarks was how good Qwen 2 is out of the box!
posted an update 2 months ago
view post
Post
452
Remember when you had a few hundred rows of data that could easily be opened in Excel?

Well, we are far from that with billion-parameter LLMs trained on trillions of tokens. ๐ŸŒ

@Microsoft wants to bridge that using "SpreadsheetLLM": Encoding Spreadsheets for Large Language Models. ๐Ÿค–๐Ÿ“ˆ

While it sounds simple, Spreadsheets, with their extensive two-dimensional grids, various layouts, and diverse formatting options, present notable challenges for large language models (LLMs). ๐Ÿšง

They initially propose a vanilla serialization approach that incorporates cell addresses, values, and formats. However, this approach is limited by LLMs' token constraints, making it impractical for most applications. โ›”
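
As a toy illustration (my own sketch, not the paper's code) of why that vanilla cell-by-cell serialization blows past token limits:

# Toy "vanilla" serialization: one line per cell with address, value, and format.
cells = {
    ("A", 1): ("Revenue", "bold"),
    ("B", 1): (1200.50, "currency"),
    ("A", 2): ("Costs", "bold"),
    ("B", 2): (830.00, "currency"),
}
serialized = "\n".join(f"{col}{row}|{value}|{fmt}" for (col, row), (value, fmt) in cells.items())
print(serialized)
# A modest 1,000-row x 50-column sheet already produces 50,000 such lines,
# far more than most LLM context windows can hold.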

The solution: SheetCompressor, an innovative encoding framework that compresses spreadsheets effectively for LLMs.

It comprises three modules:
1. Structural-anchor-based compression
2. Inverse index translation
3. Data-format-aware aggregation

It significantly improves performance on the spreadsheet table detection task, outperforming the vanilla approach by 25.6% in GPT-4's in-context learning setting.

Sounds exciting; sadly, no code, models, or datasets have been released.

Moreover, there is a lot of research on encoding 2D positional information, and this work has not been benchmarked against it!

Paper: SpreadsheetLLM: Encoding Spreadsheets for Large Language Models (2407.09025)
posted an update 2 months ago
view post
Post
586
Running billion-parameter models, we sometimes forget what it all boils down to!

Matrix multiplication.

There are plenty of tricks around memory management and caching to speed it up, but the naive way of doing matrix multiplication becomes even more fascinating the bigger these models get!
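
For the curious, the naive version is just the textbook triple loop (a toy sketch; real kernels block and tile this heavily):

# The textbook O(n^3) triple loop - conceptually what Q.K^T and the .V product
# in every attention layer reduce to, minus the blocking/caching tricks.
def naive_matmul(A, B):
    n, k = len(A), len(A[0])
    m = len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(naive_matmul(A, B))  # [[19.0, 22.0], [43.0, 50.0]]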

QKV for the win! ๐Ÿ†๐Ÿ”‘๐Ÿ“š

GitHub: https://github.com/wentasah/mmul-anim
Slides: https://cw.fel.cvut.cz/wiki/_media/courses/b4m36esw/esw09_2019.pdf ๐Ÿ“‘๐ŸŽ“
  • 1 reply
posted an update 2 months ago
view post
Post
571
Caffe 2 started it, TensorFlow brought it to the masses but PyTorch perfected it! โœจ

Good folks at PyTorch are just wizards ๐Ÿง™โ€โ™‚๏ธ. Along with producing one of the best Deep Learning libraries of all time, they just dropped the "PyTorch Documentary" ๐ŸŽฅ

A must-watch! It covers the beginning to now:
- Caffe 2 โ˜•
- Torch (with Lua) ๐Ÿ”ฅ
- Tensorflow ๐Ÿ”„
- PyTorch ๐Ÿ”ฅ
- Toffee IR (later became ONNX - Open Neural Network Exchange)

Full Official PyTorch Documentary: Powering the AI Revolution: https://youtu.be/rgP_LBtaUEc

Interesting quote: "PyTorch does not fight for the fastest performance but the easiest user experience!" ๐ŸŒŸ

That's what Python ๐Ÿ feels like...

You got to thank @Meta for open-sourcing it! ๐Ÿค
posted an update 3 months ago
view post
Post
836
Remember the recently released GLM-4 from Tsinghua University ๐ŸŽ‰

Now we have an open-source version of it continuously trained for multilingual code generation! ๐ŸŒ

It beats CodeLlama 70B (almost 7x size) and is competitive with DeepSeek Coder 33B and Qwen 2 ๐Ÿ’ช

Just like almost every other coding model, it has a 128K context ๐Ÿ“œ

Supports:
- Code completion ๐Ÿ–‹๏ธ
- Code generation ๐Ÿ› ๏ธ
- Code interpreter ๐Ÿ’ก
- Web search ๐Ÿ”
- Function call ๐Ÿ“ž
- Repository-level code Q&A ๐Ÿ—‚๏ธ

It benchmarks at 48.9 and 40.4 on the complete and instruct splits of BigCodeBench.

It still falls behind DeepSeek-Coder-V2. While CodeGeeX4 has fewer total parameters, DSC-V2 is an MoE model with only ~2B active parameters.

Good to see more efficient coding LLMs, but DeepSeek-Coder-V2 is just too good.

Code: https://github.com/THUDM/CodeGeeX4
Model weights: THUDM/codegeex4-all-9b
  • 1 reply
posted an update 3 months ago
view post
Post
3581
An Open-source and super-fast alternative to @OpenAI GPT4o is here! ๐Ÿš€

In November last year, Iliad announced a fully open-source-oriented AI lab called @kyutai_labs ๐Ÿงช

In this very short time they have released Moshi! An open speech-to-speech model ๐Ÿ—ฃ๏ธ... released publicly even before closed GPT4o (yes, you can try it right now!) ๐ŸŒ

Demo: https://www.moshi.chat/?queue_id=talktomoshi

This is what you expect an intelligent companion to do! ๐Ÿค– It is continuously generating responses and listening at the same time with sub 300 ms latency! โฑ๏ธ

This level of engagement is so new, it almost feels like I am under pressure to keep up with it! ๐Ÿ˜…

While it hallucinates like crazy, I think this is fundamentally what a true assistant will look like. And did I say they are going to open-source it?

The weights and a full technical report are promised to be coming soon!

This is the work of an incredible team of just 8 members!!
  • 1 reply
posted an update 3 months ago
view post
Post
579
๐ŸŒŸ It's been about a week since @Google dropped Gemma 2 and now Gemma 2 27B is the highest-ranked open-source LLM on LMSYS Chatbot Arena Leaderboard, beating Meta Llama 3 70B and Alibaba Qwen 2 72B! ๐Ÿš€๐Ÿ’ช

๐Ÿ” Here is what a week of Gemma looked like:

1๏ธโƒฃ First, it's a challenging model to run. The only reason I could find is soft-capping of logits within the attention for longer context optimizations. And no one was doing that before! ๐Ÿคฏ

2๏ธโƒฃ Next, the technical report mentions a 2B model... where is it? ๐Ÿค”

3๏ธโƒฃ Simple things like a context length of 8192 tokens, the Rotary Position Embeddings (RoPE), and the approximated GeGLU non-linearity are similar to earlier Gemma's. ๐Ÿ“๐Ÿ”„

4๏ธโƒฃ But a lot of new stuff is here like Local Sliding Window and Global Attention: they alternate between them at every layer... don't know why! ๐Ÿคทโ€โ™‚๏ธ and of course Logit soft-capping. ๐Ÿ’ก

5๏ธโƒฃ On-Policy Distillation of Language Models - Knowledge Distillation: Leverage a larger teacher model to train a smaller model (the 9B model). ๐ŸŽ“โžก๏ธ๐Ÿ“š

6๏ธโƒฃ Model Merging: Combined average models from experiments run with different hyperparameters. ๐Ÿงชโš—๏ธ

7๏ธโƒฃ And borrowed Grouped-Query Attention (GQA) from @Meta Llama-3. ๐Ÿ”„๐Ÿค

๐Ÿ“– One of the best articles on Gemma 2: https://huggingface.co/blog/gemma2

๐Ÿ“Š Technical report: https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf

๐Ÿ”— Models: google/gemma-2-release-667d6600fd5220e7b967f315
posted an update 3 months ago
view post
Post
1534
๐Ÿš€ Transformers are not here to take part but take over... and down goes real-time object detection! ๐Ÿ’ฅ

Enter the Real-Time DEtection TRansformer (RT-DETR), which, as the name suggests, is capable of real-time object detection.

The DEtection TRansformer (DETR) is not new (@Meta did it eons ago), but it had the issue of every other transformer: high computational cost.

RT-DETR brings an efficient hybrid encoder that expeditiously processes multi-scale features by decoupling intra-scale interaction and cross-scale fusion to improve speed.

The gist is that RT-DETR speeds up object detection by redesigning its encoder to process features more efficiently and by selecting higher-quality initial object queries.

It also allows adjusting the number of decoder layers to balance speed and accuracy for different real-time scenarios. โš–๏ธ

This makes RT-DETR faster and more accurate than previous YOLO models. ๐Ÿ†

How much better/faster?

RT-DETR-R50 achieved 53.1% AP on COCO and 108 FPS on a T4 GPU, while RT-DETR-R101 achieved 54.3% AP and 74 FPS, outperforming advanced YOLO models in both speed and accuracy. ๐Ÿš€โœจ

๐Ÿ“„ Paper: DETRs Beat YOLOs on Real-time Object Detection (2304.08069)

๐Ÿง  Models: https://huggingface.co/models?search=pekingu/rt-detr
posted an update 3 months ago
view post
Post
505
Remember when, at the beginning of the year, @Google gave an update on knowledge distillation, introducing a way of learning from self-generated mistakes?

It resulted in significant improvements across tasks:
- 2.1x in summarization
- 1.7x in translation
- 1.9x in reasoning tasks

๐Ÿš€ Well, it looks like Google wasn't messing around! According to the Gemma 2 tech report, knowledge distillation was used to pre-train the 9B model, while the 27B model was pre-trained from scratch.

๐Ÿ“ˆ For post-training, the Gemma 2 team generated completions from a stronger teacher model (unspecified in the report, but presumably Gemini Ultra), and then trained the student models on this synthetic data with SFT. This is quite common as seen in many open models, such as Zephyr and OpenHermes.

๐Ÿค” Sounds too good to be true? These models suffer from distribution mismatch between output sequences seen during training and those generated by the student during inference.

๐Ÿ“ฐ This is where the January 2024 paper "On-Policy Distillation of Language Models" comes in...

๐Ÿ” Gemma 2 team used โ€œon-policy distillation,โ€ where the student generates completions from the SFT prompts. These completions are then used to compute the KL divergence between the teacherโ€™s and studentโ€™s logits. By minimizing the KL divergence throughout training, the student learns to model the behavior of the teacher accurately while also minimizing the train-inference mismatch.

A gem of a blog by @huggingface uncovering everything Gemma 2: https://huggingface.co/blog/gemma2#knowledge-distillation

๐Ÿ“„ On-Policy Distillation of Language Models: On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes (2306.13649)
posted an update 3 months ago
view post
Post
540
Remember Will Smith eating Spaghetti? ๐Ÿ๐Ÿ˜†

AI has come a long way from generating hilariously low-quality videos to almost unrealistically realistic ones.

But most models like @OpenAI Sora, @Kling_ai , etc are not publicly available. ๐Ÿšซ๐Ÿ–ฅ๏ธ

But now we have @LumaLabsAI Dream Machine, which is publicly available for free! ๐ŸŽ‰๐Ÿ†“

Here is the dilemma, Sora and Kling posted some excellent examples of what the AI was capable of, and so did Luma AI. ๐ŸŒŸ๐Ÿค–

But in actual use, they leave so much to be desired. ๐Ÿ˜• Are we back to cherry-picking examples and leaking benchmarks in training data? ๐Ÿ’๐Ÿ“Š

Try Dream Machine ๐Ÿ‘‰ https://lumalabs.ai/dream-machine ๐ŸŒ
posted an update 3 months ago
view post
Post
1667
๐Ÿ–ฅ๏ธ Do you have 1TB+ VRAM?

๐ŸŽ‰ Well, good news for you!

๐Ÿ‘จโ€๐Ÿ”ฌ Good folks at @nvidia have released Nemotron 4 340B, the new open-source LLM king, rivalling GPT-4! ๐Ÿš€

๐Ÿ“Š 340B parameter models in 3 flavours: base, reward, and instruct models

๐ŸŽฏ It's a dense model, not MoE

๐Ÿ‘“ 4k context window

๐Ÿ“š 9T tokens training data, 2 phase training (8T pre-train + 1T continued pre-training)

๐ŸŒ Trained on 50+ languages and 40+ coding languages (70% training data is English, 15% multi-lingual, 15% code)

๐Ÿ“… June 2023 training data cut-off

Deploying it needs 8x H200 / 16x H100 / 16x A100 80GB for BF16 inference (about 8x H100 in int4)

๐Ÿ† Of course, it beats Llama 3 70B on MMLU (81.1), Arena Hard (54.2), and GSM8K (92.4)

But it is beaten on HumanEval and MT-Bench by Qwen 2, which is a 72B-parameter model

๐Ÿ”ง Used SFT, DPO, and RPO. RLHF via Nemo Aligner framework to align the model

๐Ÿ“Š 98% of alignment data was synthetically generated

๐Ÿ“„ Nvidia open licence with commercial use allowed

¯\_(ツ)_/¯
Glad to see more open models, but this is one confusing fellow!
A 340B-parameter model that narrowly beats 70B models and starts failing against 72B models? Sounds like a model for synthetic data generation! But then why only a 4k context?

๐Ÿ”— Models: nvidia/nemotron-4-340b-666b7ebaf1b3867caf2f1911

๐Ÿ“‘ Paper: https://research.nvidia.com/publication/2024-06_nemotron-4-340b
posted an update 3 months ago
view post
Post
968
๐Ÿ” Remember Tensorboard graph visualizer?

๐Ÿš€ @Google just released Model Explorer, a Tensorboard graph visualizer on steroids ๐Ÿ’ช.

๐Ÿ› ๏ธ Model Explorer is a graph visualization tool designed to improve understanding, debugging, and optimizing machine learning (ML) models, especially large ones.

๐ŸŽฏ It addresses challenges in traditional graph visualization tools by implementing a hierarchical layout and GPU-accelerated graph rendering, which enhances performance and usability.

๐ŸŒ The tool supports visualization of large-scale ML models by displaying hierarchical information, which simplifies understanding complex model architectures.

๐Ÿ”‘ Key features include layer-by-layer exploration ๐Ÿ”, side-by-side graph comparison for debugging conversion errors ๐Ÿ›, and per-node data overlays for identifying performance issues ๐Ÿ“ˆ.

๐Ÿ‘จโ€๐Ÿ’ป Originally developed for Google's internal use, Model Explorer is now available publicly as part of the Google AI Edge family of products and even runs directly in colab!

๐Ÿ”— Colab: https://github.com/google-ai-edge/model-explorer/blob/main/example_colabs/quick_start.ipynb

๐Ÿ“ฐ Blog: https://research.google/blog/model-explorer/
  • 1 reply
posted an update 3 months ago
view post
Post
2067
There are 2.2 billion active @Apple devices ๐Ÿ and all of them just got smarter thanks to Apple Intelligence (AI) ๐Ÿง 

Well, almost all devices... ๐Ÿค”

Your device needs:
- A17 Pro chip or later if it's an iPhone ๐Ÿ“ฑ,
- M1 chip or later if iPad ๐Ÿ“ฑ,
- M1 chip or later if Mac ๐Ÿ’ป.

All this aside, this is probably the largest deployment of on-device LLMs ๐ŸŒ.

Here is the technical goodness:
- AI will run ~3B LLM on device (Mac, iPhone, iPad) with grouped-query-attention, activation, and embedding quantization (Talaria bit rate selection) running on the neural engine ๐Ÿš€.
- Will be using fine-tuned LoRA Adapters for different tasks, claiming to outperform other 7B and 3B LLMs! ๐Ÿฅ‡
- On the iPhone 15 Pro: roughly 0.6 ms per prompt token time-to-first-token latency, with a generation rate of 30 tokens/second.
- No server model size or details ๐Ÿค.
- Will be dynamically loading, caching, and swapping LoRA adapters (think LoRA Land) ๐Ÿ”„.
- On-device model has 49K vocab size, while the server model goes 100K ๐Ÿ“š.
- Using rejection sampling fine-tuning and RLHF in post-processing ๐Ÿ“ˆ.
- A rejection sampling fine-tuning algorithm with teacher committee ๐ŸŽ“.
- And reinforcement learning from human feedback (RLHF) algorithm with mirror descent policy optimization and a leave-one-out advantage estimator ๐Ÿงฎ.
- Used synthetic data generation (from bigger models, does not mention which) for tasks like summaries ๐Ÿ“.
- 750 evaluation samples for each production use case to evaluate summarization (dataset not released) ๐Ÿ“Š.
- No mention of multilingual support ๐ŸŒ.
- Used Apple's AXLearn framework (JAX) and FSDP to train on TPUs and GPUs.
- 3B + Adapter outperforms Phi-3 mini, Gemma 7B, Mistral 7B on summarization ๐Ÿ†.
- 3B + Adapter achieves 78.7% on IFEval beating Phi-3 mini, Gemma 7B, Mistral 7B; Server Model matches GPT-4-Turbo and beats Mixtral 8x22B and GPT-3.5-turbo โœจ.

LoRA for the win! ๐ŸŽ‰

Blog: https://machinelearning.apple.com/research/introducing-apple-foundation-models
  • 1 reply
posted an update 3 months ago
view post
Post
3177
Here is a thought, instead of telling LLMs what to do, show them! ๐ŸŽญ

Language models are aligned to emulate the collective voice of many, resulting in outputs that align with no one in particular. ๐Ÿ—ฃ๏ธ๐ŸŒ

DITTO from Stanford University proposes that LLMs can be tuned with less than 10 samples! ๐Ÿคฏ

What's DITTO? Demonstration ITerated Task Optimization (definitely came up with the acronym first! ๐Ÿ˜‚)

Here is the step-by-step implementation: ๐Ÿ› ๏ธ

Initialization: Start with a reference language model (LM), a set of expert demonstrations, a sample size, and a frequency of sampling. ๐Ÿ

Supervised Fine-Tuning (SFT): Begin by fine-tuning the reference LM on the set of expert demonstrations to create an initial policy P0. ๐ŸŽš๏ธ

Iterative Comparison Sampling: For each iteration t:
- Sample multiple completions from the policy Pt for each demonstration to create a new dataset Dt.
- Construct a batch of comparisons where the demonstrations are ranked higher than all sampled model outputs from the current and previous iterations.

Policy Update:
- Update the policy Pt using a Direct Preference Optimization (DPO) algorithm, which incorporates feedback from the batch of comparisons.
- Increment the iteration and repeat the sampling and updating process until convergence.

Result: The final policy P after sufficient iterations aligns more closely with the expert demonstrations, effectively tuning the LM to reflect user-specific preferences and behaviors. ๐ŸŽฏ
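
In heavily simplified form, the loop looks roughly like this (a toy, self-contained sketch; `sft`, `sample`, and `dpo_update` are stand-ins for real training routines, and all names are mine, not the paper's):

from dataclasses import dataclass
import random

@dataclass
class Demo:
    prompt: str
    completion: str

def sft(model, demos):               # stand-in for supervised fine-tuning on the demonstrations
    return model + "+sft"

def sample(model, prompt, n):        # stand-in for sampling n completions from the current policy
    return [f"{prompt}::sample{random.randint(0, 999)}" for _ in range(n)]

def dpo_update(model, comparisons):  # stand-in for a DPO step on (prompt, chosen, rejected) triples
    return model + f"+dpo({len(comparisons)})"

def ditto(reference_lm, demos, n_samples=4, n_iterations=3):
    policy = sft(reference_lm, demos)                 # initial policy P0
    history = {d.prompt: [] for d in demos}           # sampled outputs across all iterations
    for _ in range(n_iterations):
        comparisons = []
        for d in demos:
            history[d.prompt] += sample(policy, d.prompt, n_samples)
            # Each expert demonstration is ranked above every model output seen so far
            comparisons += [(d.prompt, d.completion, rejected) for rejected in history[d.prompt]]
        policy = dpo_update(policy, comparisons)      # P_{t+1}
    return policy

print(ditto("reference-lm", [Demo("Write an email in my style", "Hi team, ...")]))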

DITTO outperforms few-shot prompting. ๐Ÿš€

Paper: Show, Don't Tell: Aligning Language Models with Demonstrated Feedback (2406.00888) ๐Ÿ“„
posted an update 3 months ago
view post
Post
1367
๐Ÿ“ˆ One of the biggest changes in Llama 3 was the training dataset, which grew by 7X over Llama 2 (2T to 15T tokens) ๐Ÿš€

While Meta did not open source the dataset, it sparked a thought... what would happen if everyone had access to a big, high-quality dataset? ๐Ÿค”

To address that, in April this year, @huggingface released FineWeb, a 15T token open-source dataset ๐ŸŒ

And now they are releasing FineWeb Technical Report and FineWeb Edu ๐Ÿ“š

๐Ÿ† 15T tokens in FineWeb outperforming other open datasets
๐ŸŽ“ 1.3T highest-quality educational dataset FineWeb-Edu
๐Ÿ“˜ 5.4T high-quality educational tokens in FineWeb-Edu-2

FineWeb Edu outperforms other datasets on MMLU, ARC, OpenBookQA ๐Ÿ“ˆ

ODC-By 1.0 license ๐Ÿ“œ

Report: HuggingFaceFW/blogpost-fineweb-v1
posted an update 3 months ago
view post
Post
1492
Every time a new model is released that is topping 10+ leaderboards on 50+ benchmarks... ๐Ÿš€

My brain goes... I will wait for the LMSYS Chatbot Arena results! ๐Ÿค”

User-facing evaluation, such as Chatbot Arena, provides reliable signals but is costly and slow. ๐Ÿข

Now we have MixEval, a new open benchmark with a 96% correlation to LMSYS Chatbot Arena and Human preferences. ๐ŸŽฏ

It comes with MixEval (4k samples) and MixEval Hard (1k samples) ๐Ÿ“Š

Can use GPT-3.5-Turbo or any other open-source models as Parser/Judge ๐Ÿค–

It takes less than 6% of the time and cost of MMLU ๐Ÿ’ธ

As expected:
In open models: Qwen2 72B >> Llama 3 70B >> Mixtral 8x7B ๐Ÿ”
In Closed Models: GPT-4o >> Claude 3 Opus >> Gemini Pro ๐Ÿ”’

Leaderboard: https://mixeval.github.io/ ๐Ÿ“ˆ
posted an update 4 months ago
view post
Post
1591
Remember when @Microsoft released Phi-3 models... ๐Ÿค”

Yup, the ones that had ๐Ÿฆ™Llama 3 8B beat on MMLU using 3.8B parameters! ๐Ÿ†

Now they are on the LMSYS Chatbot Arena Leaderboard! ๐Ÿ“Š๐Ÿ“ˆ

Medium(14B) ranks near GPT-3.5-Turbo-0613, but behind Llama 3 8B. ๐Ÿ“‰

Phi-3 Small(7B) is close to Llama-2-70B, and Mistral fine-tunes. ๐Ÿ“Š

What about the Phi-3 Mini(3.8B), that was giving Llama 3 8B a run for its money on MMLU? It gets an arena score of 1037 (#73) against 1153 (#22) of Llama 3 8B ๐Ÿคผ

Looks like there is a struggle here between perplexity and inherent knowledge! ๐Ÿค”

And Microsoft picked knowledge with high perplexity ๐Ÿง 

Now I am even more intrigued: what is @Meta feeding its ๐Ÿฆ™ Llamas?๐ŸŒพ

๐Ÿ† Leaderboard: https://chat.lmsys.org/?leaderboard
posted an update 4 months ago
view post
Post
879
"Hold your pixels" ๐Ÿšฆ... SD3 is here ๐ŸŒŸ

๐Ÿš€ Performance Enhancements: Stable Diffusion 3 surpasses other text-to-image models like DALLยทE 3 in typography and prompt adherence.

๐Ÿ—๏ธ New Architecture: Introduces the Multimodal Diffusion Transformer (MMDiT) that separately processes image and language data, enhancing text understanding and spelling.

โšก Efficiency Improvements: Features a rectified flow formulation for more efficient image generation, fitting within the memory constraints of common GPUs.

๐Ÿ“ˆ Scalability: Demonstrates scaling capabilities with models ranging up to 8 billion parameters, showing improvements in model performance without saturation.

๐Ÿ”ง Flexible Text Encoders: Offers a flexible approach to text encoding, maintaining performance even when the largest T5 text encoder is removed for less memory-intensive operations.

While they discuss experiments on 2B and 8B parameter models, no word on open weights ๐Ÿค

Paper: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (2403.03206)
@StabilityAI
posted an update 4 months ago
view post
Post
704
๐Ÿ”ฅ 77.2% on MMLU with 3.7B parameters ๐Ÿš€

... 3.7B active parameters, 40B in total parameters ๐Ÿ“Š

7.4 GFlops forward computation per token, 1/19 of Llama3-70B ๐Ÿ“‰

Exciting enough? ๐Ÿ˜ฒ

That's Yuan2-M32 for you, released by IEIT-Yuan.
A new 40B Mixture of Experts using a new Attention Router mechanism ๐Ÿง 

32 experts with 2 active in generation โœŒ๏ธ

8,192 context length ๐Ÿ“

Trained on 2T tokens, using 9.25% of the compute required by the dense models ๐Ÿ› ๏ธ.

Yuan 2.0-M32 employs fine-tuning techniques to adjust to longer sequence lengths, utilizing a modified base value in the Rotary Position Embedding to maintain performance over extended contexts ๐Ÿ”„.

Open-source - Apache 2.0 ๐Ÿ“œ

Vocabulary size of 135,040 ๐Ÿ—ฃ๏ธ

Outperforms Mixtral 8x7B (47B total parameters, 12.9B active parameters) on all benchmarks and almost gives Llama 3 70B a run for its money.

Models: https://huggingface.co/IEITYuan ๐ŸŒ
Paper: Yuan 2.0-M32: Mixture of Experts with Attention Router (2405.17976) ๐Ÿ“„
posted an update 4 months ago
view post
Post
1511
Remember Gemini, GPT-4o, all being true multimodal models ๐ŸŒŸ.

Now we have a paper ๐Ÿ“„ describing an architecture that might achieve that!

Uni-MoE: a native multimodal, Unified Mixture of Experts (MoE) architecture ๐Ÿ—๏ธ.

Uni-MoE integrates various modalities (text ๐Ÿ“, image ๐Ÿ–ผ๏ธ, audio ๐ŸŽต, video ๐Ÿ“น, speech ๐Ÿ—ฃ๏ธ) using modality-specific encoders and connectors for a cohesive multimodal understanding.

Training Strategy:
1๏ธโƒฃ Training cross-modality alignment with diverse connectors ๐Ÿ”„.
2๏ธโƒฃ Training modality-specific experts using cross-modality instruction data ๐Ÿ“Š.
3๏ธโƒฃTuning the Uni-MoE framework with Low-Rank Adaptation (LoRA) on mixed multimodal data ๐Ÿ”ง.

Technical Details:

Modality-Specific Encoders: CLIP for images ๐Ÿ–ผ๏ธ, Whisper for speech ๐Ÿ—ฃ๏ธ, BEATs for audio ๐ŸŽต.

MoE-Based Blocks: Shared self-attention layers, feed-forward networks (FFN) based experts, and sparse routers for token-level expertise allocation ๐Ÿš€.

Efficient Training: Utilizes LoRA for fine-tuning pre-trained experts and self-attention layers ๐Ÿ› ๏ธ.
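
As a generic illustration of the token-level sparse routing such MoE blocks rely on (the standard top-k router pattern, not Uni-MoE's actual code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    # Standard token-level top-k router: each token is sent to k experts,
    # weighted by the softmaxed router scores (illustrative only).
    def __init__(self, hidden_dim, num_experts, k=2):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts)
        self.k = k

    def forward(self, x):                          # x: (tokens, hidden_dim)
        scores = self.gate(x)                      # (tokens, num_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)   # normalize over the chosen experts
        return weights, topk_idx

router = TopKRouter(hidden_dim=512, num_experts=8, k=2)
tokens = torch.randn(10, 512)
weights, expert_ids = router(tokens)
print(expert_ids[0], weights[0])  # the two experts (and mixing weights) for the first token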

Uni-MoE outperforms traditional dense models on benchmarks like A-OKVQA, OK-VQA, VQAv2, MMBench, RACE-Audio, and English High School Listening Test ๐Ÿ†.

The code is open-sourced as well: https://github.com/HITsz-TMG/UMOE-Scaling-Unified-Multimodal-LLMs/tree/master/Uni_MoE_v2

Paper: Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts (2405.11273)
posted an update 4 months ago
view post
Post
1877
๐Ÿฆ… Falcon has landed... again!
And now it not just reads but sees as well ๐Ÿ“–๐Ÿ‘€

Here is a summary of the Falcon-11B-VLM model:

Model Type: Causal decoder-only model ๐Ÿ”„.

Parameters: 11 billion ๐ŸŒŒ.

Vision Integration: Uses the pretrained CLIP ViT-L/14 vision encoder with the recently released Falcon2-11B chat-finetuned model and trained with image-text data ๐Ÿ–ผ๏ธ๐Ÿ“š.

Training: Pretrained on over 5,000 billion tokens from RefinedWeb with curated corpora ๐Ÿ“Š.

Dynamic Encoding: Enhances perception of fine-grained details in images ๐Ÿ”.

Training Hardware: 16 A100 80GB GPUs with ZeRO and Flash-Attention 2 ๐Ÿ–ฅ๏ธ.

Tokenizer: Falcon-7B/11B tokenizer ๐Ÿงฉ.

Languages Supported: ๐ŸŒ Primarily English, with capabilities in German ๐Ÿ‡ฉ๐Ÿ‡ช, Spanish ๐Ÿ‡ช๐Ÿ‡ธ, French ๐Ÿ‡ซ๐Ÿ‡ท, Italian ๐Ÿ‡ฎ๐Ÿ‡น, Dutch ๐Ÿ‡ณ๐Ÿ‡ฑ, Romanian ๐Ÿ‡ท๐Ÿ‡ด, Czech ๐Ÿ‡จ๐Ÿ‡ฟ, Swedish ๐Ÿ‡ธ๐Ÿ‡ช, and more. ๐Ÿ—ฃ๏ธ๐ŸŒ.

License: Open Source - TII Falcon License 2.0, based on Apache 2.0 ๐Ÿ“œ.

Model: tiiuae/falcon-11B-vlm
posted an update 4 months ago
view post
Post
1476
You are happy that @mistralai is releasing a new model ๐Ÿ˜Š

You become even more happy to see it's a completely new coding model ๐Ÿ˜„

Then you become sad because the model is licensed under the MNPL.

Before we talk about MNLP, here is the gist of the model:

๐Ÿท๏ธName: Codestral (Code + Mistral ๐Ÿ˜‚)

๐Ÿš€ 22B parameters

๐ŸŒ Supports 80 programming languages (including Python, Java, C, C++, bash, swift, and more)

๐Ÿ† Outperforms Llama 3 70B and Code Llama 70B on HumanEval and MBPP

๐Ÿ† Outperforms DeepSeek Coder 33B on HumanEval

๐Ÿ“œ 32K context window (longer than Llama 3, DeepSeek, or Code Llama)

๐Ÿค– Supports both code assistant and code completion use cases

More details: https://mistral.ai/news/codestral/

mistralai/Codestral-22B-v0.1

Now what's the MNPL? It's the non-commercial Mistral AI Non-Production License that Codestral is released under! More here: https://mistral.ai/news/mistral-ai-non-production-license-mnpl/

Don't be sad... ๐Ÿ˜ƒ There is another model that's open source and actually gives better performance on HumanEval: Bin12345/AutoCoder
posted an update 4 months ago
view post
Post
1052
Remember stacking in ensemble ML? ๐Ÿค”

What happens if you do the reverse of that but with LLMs? ๐Ÿคฏ

Basically, MoE created by merging multiple models (instead of being pre-trained like Mixtral)? ๐Ÿง 

Frankenstein MoE! (not an official name) ๐ŸงŸโ€โ™‚๏ธ

That's the new Kraken architecture! ๐Ÿ™

It uses a sequence classification model to route inputs to the most suitable language model based on the input's characteristics. ๐Ÿšฆ

Yup, multiple full-fledged LLMs are loaded into memory, and then a classification layer decides who gets to generate an output! ๐ŸŽฐ
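
Conceptually, the routing step looks something like this (a hedged sketch of the idea, not Kraken's actual code; the model names are placeholders):

from transformers import pipeline

# A sequence classifier decides which of several pre-loaded expert LLMs answers the prompt.
router = pipeline("text-classification", model="router-model-placeholder")
experts = {
    "code": pipeline("text-generation", model="code-expert-placeholder"),
    "math": pipeline("text-generation", model="math-expert-placeholder"),
    "chat": pipeline("text-generation", model="chat-expert-placeholder"),
}

def generate(prompt: str) -> str:
    label = router(prompt)[0]["label"]            # e.g. "code", "math", "chat"
    expert = experts.get(label, experts["chat"])  # fall back to the general model
    return expert(prompt, max_new_tokens=256)[0]["generated_text"]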

Tell me you have too many GPUs without telling me you have too many GPUs! ๐Ÿ–ฅ๏ธ๐Ÿ”ฅ

Jokes aside, this is extremely fascinating research, but I don't understand why this can't just be one big model with multiple LoRA adapters that are selected on the fly?

Model: cognitivecomputations/Kraken
Github: https://github.com/cognitivecomputations/kraken
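
To make the routing idea concrete, here is a toy sketch (not the Kraken code): a zero-shot classifier inspects the prompt and dispatches it to one of several pre-loaded generators. Tiny models stand in for the full-fledged LLMs Kraken would actually hold in memory.

from transformers import pipeline

# Small stand-ins for the expert LLMs; Kraken would load full-size models here.
experts = {
    "programming question": pipeline("text-generation", model="distilgpt2"),
    "general chat": pipeline("text-generation", model="gpt2"),
}
router = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def route_and_generate(prompt: str) -> str:
    label = router(prompt, candidate_labels=list(experts))["labels"][0]   # best-scoring expert
    return experts[label](prompt, max_new_tokens=40)[0]["generated_text"]

print(route_and_generate("Write a Python function that reverses a string."))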
  • 3 replies
ยท
posted an update 4 months ago
view post
Post
326
Mistral 7B might be one of the most popular open-source LLMs out there, with a total of over 3.4 million downloads on @huggingface Hub ๐Ÿš€, and now we have the next version...

@MistralAI Mistral-7B-v0.3 (base) ๐Ÿ“ˆ and Mistral-7B-Instruct-v0.3 ๐Ÿ› ๏ธ

- 7.3 billion parameters ๐Ÿง 
- Apache 2.0 license ๐Ÿ“œ
- Extended vocabulary of 32,768 ๐Ÿ“–
- Supports new v3 Tokenizer and function calling ๐Ÿค–
- Also, it's completely uncensored ๐Ÿ†“

In conclusion, Mistral-7B-v0.3 is an uncensored Mistral-7B-v0.2 with an extended vocabulary ๐ŸŽ‰.

They have also released mistral_inference, although I'm not sure what the advantage of using it is; vLLM is still my go-to way of deploying Mistral-7B locally! ๐ŸŒ
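
For reference, a minimal local-deployment sketch with vLLM (the prompt and sampling settings are illustrative):

from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.3")
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain the difference between a tokenizer and an embedding layer."], params)
print(outputs[0].outputs[0].text)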

Models:
mistralai/Mistral-7B-v0.3
mistralai/Mistral-7B-Instruct-v0.3
posted an update 4 months ago
view post
Post
1247
When was the last time you looked for a non-English LLM, only to be saddened there is NO good option? ๐Ÿ˜Ÿ

For instance, @Meta 's poster child Llama 3 has less than 5% non-English tokens! ๐ŸŒ

@cohere is here to change that... and change they will. ๐Ÿ’ช

Introducing Aya (this time the name actually makes sense) ๐ŸŒŸ

Aya is an open-weight (CC-BY-NC) model, that comes in 3 flavours:

1๏ธโƒฃ Aya 101: 13B parameters, mT5-xxl based model that supports 101 languages ๐ŸŒ

2๏ธโƒฃ Aya 23: 8B and 3๏ธโƒฃ35B parameter models, supports 23 languages ๐Ÿ“Š

23 languages covered are: Arabic ๐Ÿ‡ธ๐Ÿ‡ฆ, Chinese (simplified & traditional) ๐Ÿ‡จ๐Ÿ‡ณ, Czech ๐Ÿ‡จ๐Ÿ‡ฟ, Dutch ๐Ÿ‡ณ๐Ÿ‡ฑ, English ๐Ÿ‡ฌ๐Ÿ‡ง, French ๐Ÿ‡ซ๐Ÿ‡ท, German ๐Ÿ‡ฉ๐Ÿ‡ช, Greek ๐Ÿ‡ฌ๐Ÿ‡ท, Hebrew ๐Ÿ‡ฎ๐Ÿ‡ฑ, Hindi ๐Ÿ‡ฎ๐Ÿ‡ณ, Indonesian ๐Ÿ‡ฎ๐Ÿ‡ฉ, Italian ๐Ÿ‡ฎ๐Ÿ‡น, Japanese ๐Ÿ‡ฏ๐Ÿ‡ต, Korean ๐Ÿ‡ฐ๐Ÿ‡ท, Persian ๐Ÿ‡ฎ๐Ÿ‡ท, Polish ๐Ÿ‡ต๐Ÿ‡ฑ, Portuguese ๐Ÿ‡ต๐Ÿ‡น, Romanian ๐Ÿ‡ท๐Ÿ‡ด, Russian ๐Ÿ‡ท๐Ÿ‡บ, Spanish ๐Ÿ‡ช๐Ÿ‡ธ, Turkish ๐Ÿ‡น๐Ÿ‡ท, Ukrainian ๐Ÿ‡บ๐Ÿ‡ฆ, and Vietnamese ๐Ÿ‡ป๐Ÿ‡ณ.

Not only models: they have also open-sourced the data. The Aya Collection stands as the most extensive assembly of multilingual instruction fine-tuning datasets to date, featuring 513 million prompts and completions across 114 languages. ๐Ÿ“š

These annotations were provided by people across the globe. Not gonna lie, I almost shed a tear reading this... ๐Ÿ˜ข

From Cohere: The word Aya is derived from the Twi language meaning โ€œfernโ€ - a symbol of endurance and resourcefulness. Aya embodies our dedication to advancing multilingual AI. ๐ŸŒฟ

Looks like the good folks at Cohere did not sleep after the success of command r & command r plus! ๐Ÿ˜ดโžก๏ธ๐Ÿš€

Models: Aya-101-13B: CohereForAI/aya-101
Aya-23-8B: CohereForAI/aya-23-8B
Aya-23-35B: CohereForAI/aya-23-35B

Dataset: CohereForAI/aya-datasets-660415741bd4852f01c81c77

Blog: https://cohere.com/research/aya
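
A quick multilingual-generation sketch for Aya-23-8B, assuming the checkpoint loads through the standard causal-LM classes and ships a chat template (the prompt is illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CohereForAI/aya-23-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Explique la photosynthèse en deux phrases."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=80)
print(tokenizer.decode(output[0], skip_special_tokens=True))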
posted an update 4 months ago
view post
Post
1893
Good folks at @Meta have introduced Chameleon ๐ŸฆŽ (who names these things? ๐Ÿคทโ€โ™‚๏ธ)

Chameleon is an AI model that can work with multiple types of data, like text and images, all at once. ๐Ÿ–ผ๏ธ๐Ÿ“

Before you start searching, as of this post, the model/code have not been open-sourced nor is there any commitment to open-source... sorry! ๐Ÿšซ๐Ÿ”“

Still, here is the technical stuff:

๐Ÿ‘‰ Challenges with Current Systems:

๐Ÿ“‰ Fragmentation: Current multimodal models are often specialized for either text or image tasks, lacking unified approaches.

๐Ÿ“Š Scalability: Existing systems struggle with scaling to handle complex, mixed-modal tasks without significant performance degradation.

๐Ÿ”„ Alignment: Aligning textual and visual modalities remains a technical challenge, often requiring separate processing pipelines.

๐Ÿ‘‰ Objective:

๐ŸŽฏ Unified Modeling: Develop a single model capable of handling various multimodal tasks (text generation, image generation, image captioning, visual question answering) seamlessly.

๐Ÿ‘‰ How It's Done ๐Ÿ“˜

Early-Fusion Architecture ๐Ÿง : Utilizes an early-fusion token-based approach to integrate text and image data from the beginning.

Stable Training ๐Ÿ’ช: Implements a tailored alignment recipe and specific architectural parameterization to ensure stability in mixed-modal settings.

Broad Evaluation ๐Ÿ“Š: Assesses the model across various tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed-modal generation.

๐Ÿ‘‰ Results: (Fun fact: they mention LLaVA-1.5 as a comparison but never actually share those results)

๐Ÿ† Performance: Chameleon achieves state-of-the-art results in image captioning and outperforms models like Llama-2 in text-only tasks.

โš–๏ธ Competitiveness: It shows competitive performance with models such as Mixtral 8x7B and Gemini-Pro.

๐Ÿ‘ฉโ€โš–๏ธ Human Judgments: Matches or exceeds the performance of larger models, including Gemini Pro and GPT-4V

Paper: Chameleon: Mixed-Modal Early-Fusion Foundation Models (2405.09818)
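
Here is a toy illustration of the early-fusion idea as I read it (not Meta's code): the image is quantized into discrete codes, those codes are shifted past the text vocabulary, and the resulting single token sequence is consumed by one decoder over a shared embedding table.

import torch

TEXT_VOCAB = 32_000
IMAGE_CODEBOOK = 8_192                                   # assumed size of a VQ image tokenizer

text_ids = torch.tensor([12, 873, 4051])                 # made-up ids for "describe this image"
image_codes = torch.randint(0, IMAGE_CODEBOOK, (256,))   # stand-in for a VQ-tokenized image
image_ids = image_codes + TEXT_VOCAB                     # shift image tokens past the text vocab

fused = torch.cat([text_ids, image_ids])                 # one sequence, both modalities
shared_embedding = torch.nn.Embedding(TEXT_VOCAB + IMAGE_CODEBOOK, 512)
hidden = shared_embedding(fused)                         # fed to a single decoder-only transformer
print(hidden.shape)                                      # torch.Size([259, 512])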
posted an update 4 months ago
view post
Post
1170
Tired of writing Pandas code? ๐Ÿ˜ฉ

If you are using VS Code, now you can use Data Wrangler from @Microsoft ! ๐Ÿš€

It will convert your Pandas DataFrame to a rich and interactive user interface to view and analyze your data ๐Ÿ“Š, show insightful column statistics and visualizations ๐Ÿ“ˆ, and automatically generate Pandas code as you clean and transform the data. ๐Ÿงน๐Ÿ”„

Supports everything you can think of...

- Data view ๐Ÿ‘€
- Data cleaning ๐Ÿงผ
- Data filtering ๐Ÿ”
- Data summary/statistics ๐Ÿ“Š
- Data transformation ๐Ÿ”„
- Data missing values treatment โ“
- Adding new fields โž•

Everything in a simple open-source extension ๐ŸŒŸ

https://marketplace.visualstudio.com/items?itemName=ms-toolsai.datawrangler

PS: I love Pandas ๐Ÿผ, never tired of it... still, this is cool! ๐Ÿ˜Ž
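
For flavour, this is the kind of cleaning code such a tool writes for you as you click through the UI (made up here for illustration, not actual Data Wrangler output):

import pandas as pd

df = pd.DataFrame({"price": ["10", "12", None, "9"], "city": [" NYC", "sf", "NYC ", None]})

df["price"] = pd.to_numeric(df["price"], errors="coerce")   # cast strings to numbers
df["price"] = df["price"].fillna(df["price"].mean())        # impute missing prices
df["city"] = df["city"].str.strip().str.upper()             # normalize city names
df = df.dropna(subset=["city"])                             # drop rows with no city
print(df)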
posted an update 4 months ago
view post
Post
1031
You are happy that @Meta has open-sourced Llama 3 ๐Ÿ˜ƒ...

So you jump on @HuggingFace Hub to download the new shiny Llama 3 model only to see a few quintillion Llama 3's! ๐Ÿฆ™โœจ

Which one should you use? ๐Ÿค”

Not all Llamas are created equal! ๐Ÿฆ™โš–๏ธ

An absolutely crazy comparison experiment by Wolfram Ravenwolf ( @Wolfram ) might answer your question! ๐Ÿงช๐Ÿง™โ€โ™‚๏ธ

- Comprehensive assessment of Llama 3 Instruct 70B and 8B models. ๐Ÿ“Š
- Tested 20 versions across HF, GGUF, and EXL2 formats. ๐Ÿ”„
- Methodology: Tested translation and cross-language understanding using German data-protection training exams, with deterministic generation settings to minimize random factors. ๐ŸŒ๐Ÿ“
- Best performance from EXL2 4.5bpw quant, scoring perfect in all tests. ๐Ÿ†โœ…
- GGUF 8-bit to 4-bit quants also performed exceptionally. ๐ŸŒŸ
- Llama 3 8B unquantized is best in its size class but not as good as 70B quants. ๐Ÿ“๐Ÿ”
- 1-bit quantizations showed significant quality drops. โš ๏ธโฌ‡๏ธ

Best models:
- turboderp/Llama-3-70B-Instruct-exl2
- casperhansen/llama-3-70b-instruct-awq

Blog: https://huggingface.co/blog/wolfram/llm-comparison-test-llama-3
posted an update 4 months ago
view post
Post
2081
๐ŸŽญ You picked an LLM for your work but then you find out it hallucinates! ๐Ÿค–

๐Ÿค” Your first thought might be to fine-tune it on more training data.... but should you? ๐Ÿ› ๏ธ

๐Ÿ“œ This is what @Google is exploring in the paper "Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?" ๐Ÿ•ต๏ธโ€โ™‚๏ธ

๐Ÿ“˜ When LLMs undergo supervised fine-tuning with new factual knowledge not present in their initial training data, there is a risk they might "hallucinate" or produce factually incorrect information. ๐Ÿšจ

๐Ÿ” The paper investigates how fine-tuning LLMs with new facts influences their ability to leverage pre-existing knowledge and the extent to which they generate errors. ๐Ÿ“Š

โš™๏ธTechnical Setup:

๐Ÿ”ง Approach: They introduce a system named SliCK (this stands for Sampling-based Categorization of Knowledge, don't even bother understanding how) to categorize knowledge into four levels (HighlyKnown, MaybeKnown, WeaklyKnown, and Unknown) based on how well the model's generated responses agree with known facts; a toy sketch follows after the setup. ๐Ÿ—‚๏ธ

๐Ÿ“ Experimental Setup: The study uses a controlled setup focusing on closed-book QA, adjusting the proportion of fine-tuning examples that introduce new facts versus those that do not. ๐Ÿงช

๐Ÿ‘‰ Here is the gist of the findings:

๐Ÿšธ LLMs struggle to integrate new factual knowledge during fine-tuning, and such examples are learned slower than those consistent with the model's pre-existing knowledge. ๐Ÿข

๐Ÿ“ˆ As LLMs learn from examples containing new knowledge, their propensity to hallucinate increases. ๐Ÿ‘ป

โฑ๏ธ Early stopping during training can mitigate the risks of hallucinations by minimizing exposure to unlearned new facts. ๐Ÿ›‘

๐Ÿง  Training LLMs mostly with known examples leads to better utilization of pre-existing knowledge, whereas examples introducing new knowledge increase the risk of generating incorrect information. ๐Ÿ—๏ธ

๐Ÿ“„ Paper: Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? (2405.05904) ๐Ÿ“š
  • 2 replies
ยท
posted an update 4 months ago
view post
Post
1776
How many times have you said Pandas is slow and still kept on using it? ๐Ÿผ๐Ÿ’จ

Get ready to say Pandas can be fast but it's expensive ๐Ÿ˜‚

๐Ÿ™Œ Original Limitations:

๐Ÿ’ป CPU-Bound Processing: Traditional pandas operations are CPU-bound (mostly single-threaded๐Ÿ˜ฐ), leading to slower processing of large datasets.

๐Ÿง  Memory Constraints: Handling large datasets in memory-intensive operations can lead to inefficiencies and limitations.

๐Ÿ‘‰ Achievements with @nvidia RAPIDS cuDF:

๐Ÿš€ GPU Acceleration: RAPIDS cuDF leverages GPU computing. Users switch to GPU-accelerated operations without modifying existing pandas code.

๐Ÿ”„ Unified Workflows: Seamlessly integrates GPU and CPU operations, falling back to CPU when necessary.

๐Ÿ“ˆ Optimized Performance: By exploiting the massive parallelism of GPUs, cuDF achieves up to a 150x speedup in data processing, demonstrated through benchmarks like DuckDB.

๐Ÿ˜…New Limitations:

๐ŸŽฎ GPU Availability: Requires a GPU (not everything should need a GPU)

๐Ÿ”„ Library Compatibility: Still in its early stages; not all pandas functionality has been ported yet

๐Ÿข Data Transfer Overhead: Moving data between CPU and GPU can introduce latency if not managed efficiently. As some operations still run on the CPU.

๐Ÿค” User Adoption: We already had vectorization support in Pandas; people just didn't use it because it was harder to write. We already had Dask for parallelization. It's not that solutions didn't exist

Blog: https://developer.nvidia.com/blog/rapids-cudf-accelerates-pandas-nearly-150x-with-zero-code-changes/

For Jupyter Notebooks:

# Load the cuDF accelerator before importing pandas; supported operations then run on the GPU
%load_ext cudf.pandas
import pandas as pd


For python scripts:

# Run an unmodified pandas script with the cuDF accelerator enabled
python -m cudf.pandas script.py


posted an update 4 months ago
view post
Post
1315
๐ŸŽ‰ A new LLM is launched! ๐Ÿš€
After checking if it's open-source or not, ๐Ÿค”
you rush to see the benchmarks... ๐Ÿƒโ€โ™‚๏ธ๐Ÿ’จ

Which benchmark does everyone check first? ๐Ÿ”

MMLU (Massive Multitask Language Understanding)? ๐Ÿ“š

Benchmarks like MMLU are reaching saturation... most of the time the performance does not translate to real-world use cases! ๐ŸŒโ—

Meet MMLU-Pro, released by TIGER-Lab on @huggingface ! ๐Ÿฏ๐ŸŒ

๐Ÿงช 12,217 questions across biology, business, chemistry, computer science, economics, engineering, health, history, law, mathematics, philosophy, physics, and psychology carefully validated by humans ๐Ÿง‘โ€๐Ÿ”ฌ

๐Ÿ”Ÿ Moves to 10 options per question instead of 4; the extra options make the evaluation more realistic and reduce random guessing ๐ŸŽฏ

๐Ÿ“Š 56% of questions come from MMLU, 34% from STEM websites, and the rest from TheoremQA and SciBench ๐Ÿ“ˆ

๐Ÿค– LLMs with weak chain-of-thought reasoning tend to score lower, indicating the benchmark is more challenging and more representative of real-world expectations ๐Ÿง ๐Ÿ’ก

Any guess who tops it and who bombs it? ๐Ÿค”๐Ÿ“‰๐Ÿ“ˆ

GPT-4o drops by 17% (from 0.887 to 0.7149) ๐Ÿ“‰
Llama-3-70B drops by 27% (from 0.820 to 0.5541) ๐Ÿ“‰

๐Ÿ”— TIGER-Lab/MMLU-Pro
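
To poke at it yourself, here is a quick look via the datasets library; the split and column names are assumptions based on the dataset card, so adjust if they differ:

from datasets import load_dataset

ds = load_dataset("TIGER-Lab/MMLU-Pro", split="test")
example = ds[0]
print(example["question"])
print(example["options"])   # up to 10 answer options
print(example["answer"])    # gold option
print(example["category"])  # subject area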
  • 2 replies
ยท
posted an update 4 months ago
view post
Post
923
After saying AI only 120 times, we have the next family of Gemma models!

๐Ÿ†“Gemma Model Family:
๐Ÿฆ™ Gemma 2: 27B parameter model, comparable to Llama-3-70B.

๐Ÿ‘๏ธ PaliGemma: PaliGemma is a new vision-language model family from Google, combining SigLIP as an image encoder and Gemma-2B as a text decoder. ๐ŸŒŸ๐Ÿ–ผ๏ธ๐Ÿ“

๐Ÿ“ฝ๏ธMoren technical Details of PaliGemma:

๐Ÿ‘‰Model Types: Pretrained (PT) ๐Ÿ‹๏ธโ€โ™‚๏ธ, Mix (general-purpose) ๐Ÿ”„, and Fine-tuned (FT) ๐ŸŽฏ.
๐Ÿ‘‰Resolutions and Precisions: 224x224, 448x448, 896x896; bfloat16, float16, float32. ๐Ÿ“๐Ÿ“
๐Ÿ‘‰Tasks: Image captioning ๐Ÿž๏ธ๐Ÿ–Š๏ธ, visual question answering (VQA) โ“๐Ÿค”, detection ๐Ÿ”, referring expression segmentation โœ‚๏ธ, document understanding ๐Ÿ“„.
๐Ÿ‘‰Architecture: SigLIP-So400m for images ๐Ÿ–ผ๏ธ, Gemma-2B for text ๐Ÿ“.
๐Ÿ‘‰Inference: Via the PaliGemmaForConditionalGeneration class in transformers (see the sketch after this list) ๐Ÿค–.
๐Ÿ‘‰Fine-tuning: Available using both big_vision and transformers frameworks ๐Ÿ”ง๐Ÿ› ๏ธ.
๐Ÿ‘‰Memory Considerations: Higher resolution models require more memory ๐Ÿง ๐Ÿ’พ.
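
As promised, a minimal inference sketch via PaliGemmaForConditionalGeneration; the checkpoint name, the "caption en" prompt format, and the image URL are assumptions for illustration.

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"      # assumed name of the 224px Mix checkpoint
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open(requests.get("https://example.com/cat.png", stream=True).raw)  # placeholder URL
inputs = processor(text="caption en", images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))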


โœจ Other things โœจ (Why is everyone using โœจ for AI? ๐Ÿคจ)

๐Ÿ†’ Gemini Model Family:

๐ŸŒŸ Gemini 1.5 Pro: 2M-token context support, quality improvements in translation, coding, reasoning.
โšก Gemini Flash: Optimized for speed, 1M-token context.
๐Ÿš€ Gemini Ultra, Pro, Flash, Nano: Various performance and efficiency models.
๐Ÿ’Ž Gemini Gems: Custom GPTs.
๐ŸŽ™๏ธ Gemini Live (Project Astra): Two-way voice conversation.
๐Ÿ“š LearnLM: Models fine-tuned for learning.


๐Ÿ†•Other Launches:

๐Ÿค– Veo: DeepMind's answer to Sora.
๐Ÿ–ผ๏ธ Imagen 3: Improved photorealistic image generation.
๐ŸŽต Music AI Sandbox: YouTube x DeepMind collaboration.
๐Ÿ” SynthID watermarking: Extends to text, images, audio, video.
โš™๏ธ Trillium (TPUv6): New TPU.

๐Ÿ’ฒ@Google Product Suite Integrations: Enhancements in Workspace, Email, Docs, Sheets, Photos, Search, Android, and Lens.

google/gemma-release-65d5efbccdbb8c4202ec078b

google/paligemma-release-6643a9ffbf57de2ae0448dda
posted an update 4 months ago
view post
Post
3617
Is GPT-4o everything you expected? ๐Ÿค”

@OpenAI has gone omni (GPT-4"o" ๐ŸŒ), a multimodal LLM, it accepts as input any combination of text, audio, and image and generates any combination of text, audio, and image outputs. ๐ŸŽค๐Ÿ“ธโœ๏ธ

1๏ธโƒฃ Based on the examples seen:
Inputs possible are Text โœ๏ธ, Text + Image ๐Ÿ“๐Ÿ–ผ๏ธ, Text + Audio ๐Ÿ“๐ŸŽง, Text + Video ๐Ÿ“๐ŸŽฅ, Audio ๐ŸŽง
and outputs possible are Image ๐Ÿ–ผ๏ธ, Image + Text ๐Ÿ–ผ๏ธ๐Ÿ“, Text ๐Ÿ“, Audio ๐ŸŽง

2๏ธโƒฃ 88.7% on MMLU ๐Ÿ†; 90.2% on HumanEval (best in class) ๐Ÿฅ‡

3๏ธโƒฃ Up to 50% cheaper ๐Ÿ’ธ and 2x faster โšก than GPT-4 Turbo

4๏ธโƒฃ GPT-4o will be available in the free tier of ChatGPT ๐ŸŽ‰

5๏ธโƒฃ Near real-time audio with 320ms on average, similar to human conversation ๐Ÿ—ฃ๏ธ**

6๏ธโƒฃ New tokenizer with a 200k token vocabulary ๐Ÿ“š (previously 100k vocabulary) leading to 1.1x - 4.4x fewer tokens needed across 20 languages ๐ŸŒ

7๏ธโƒฃ Tokenizer compression and more efficient across non-English languages (3-5 times fewer tokens for major Indian languages ๐Ÿ‡ฎ๐Ÿ‡ณ)

๐Ÿ‘Open questions:
- What is the context length? โ“
- Why does GPT-4 still exist, if GPT-4o is better, faster, and cheaper? ๐Ÿคจ

Blog: https://openai.com/index/hello-gpt-4o/ ๐ŸŒ
Available today: https://chatgpt.com/ ๐Ÿš€

I just wanted it to be cheaper, and more accessible! ๐Ÿ™Œ

Still not open source, but a price reduction is welcome! ๐Ÿ’ฐ

Also, something fun happened, for the first 10-15 mins all search engines were correcting GPT-4o to GPT-4 ๐Ÿ˜‚

Also, also, GPT-4o is the model which was powering the GPT2 chatbot in the LMsys arena (ELO 1310 vs. 1253 for GPT-4 Turbo) ๐Ÿ…
ยท
posted an update 4 months ago
view post
Post
1456
You are all happy ๐Ÿ˜Š that @meta-llama released Llama 3.

Then you are sad ๐Ÿ˜” that it only has a context length of 8k.

Then you are happy ๐Ÿ˜„ that you can just scale Llama 3 with PoSE to 96k without training, only needing to modify max_position_embeddings and rope_theta.
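
A hedged sketch of that no-training stretch with transformers (the target length and rope_theta value are illustrative; the PoSE-style recipe needs the right value, and quality is not guaranteed):

from transformers import AutoConfig, AutoModelForCausalLM

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
config = AutoConfig.from_pretrained(model_id)
config.max_position_embeddings = 96 * 1024   # target window; was 8192
config.rope_theta = 4_000_000.0              # illustrative value, tune per the PoSE recipe

model = AutoModelForCausalLM.from_pretrained(model_id, config=config, torch_dtype="auto")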

But then you are sad ๐Ÿ˜ข it only improves the model's long-context retrieval performance (i.e., finding needles) while hardly improving its long-context utilization capability (doing QA and summarization).

But then you are happy ๐Ÿ˜ that the
@GradientsTechnologies community has released the long-context Llama-3-8B-Instruct-262K with long context (262k-1M+).

Now we have another paper "Extending Llama-3's Context Ten-Fold Overnight" ๐Ÿ“œ.

The context length of Llama-3-8B-Instruct is extended from 8K to 80K using QLoRA fine-tuningโš™๏ธ.

The training cycle is highly efficient, taking "only" ๐Ÿ˜‚ 8 hours on a single 8xA800 (80G) GPU machine.

The model also preserves its original capability over short contexts.

The dramatic context extension is mainly attributed to merely 3.5K synthetic training samples generated by GPT-4.๐Ÿ“Š

The paper suggests that the context length could be extended far beyond 80K with more computation resources (๐Ÿ˜… GPU-poor).

The team plans to publicly release all resources, including data, model, data generation pipeline, and training code, to facilitate future research from the โค๏ธ community.

Paper: https://arxiv.org/abs/2404.19553

This is where we are... until next time... ๐ŸŒŸ

Extending Llama-3's Context Ten-Fold Overnight (2404.19553)