Tim Dolan

macadeliccc

Recent Activity

updated a dataset 17 days ago
macadeliccc/US-SupremeCourt-RAG
liked a model 17 days ago
microsoft/OmniParser

macadeliccc's activity

posted an update 3 months ago
My tool-calling playground repo has been updated again to include image generation with flux1-schnell or flux1-dev. This functionality is similar to invoking DALL·E 3 via the @ mention in ChatGPT. Once the function is selected, the model will either extract or improve your prompt (depending on how you ask).

I have also included 2 notebooks that cover different ways to access Flux for your specific use case. The first method covers how to access Flux via LitServe from Lightning AI. LitServe is a bare-bones inference engine with a focus on modularity rather than raw performance. LitServe supports text-generation models as well as image generation, which is great for some use cases, but it does not provide the caching mechanisms of a dedicated image-generation solution.
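
For reference, a minimal LitServe endpoint for flux1-schnell could look roughly like the sketch below. This is an illustration rather than the exact code from the notebook; the model id, step count, and base64 response format are my assumptions.

import base64, io
import torch
import litserve as ls
from diffusers import FluxPipeline

class FluxAPI(ls.LitAPI):
    def setup(self, device):
        # Load flux1-schnell once per worker (model id assumed)
        self.pipe = FluxPipeline.from_pretrained(
            "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
        ).to(device)

    def decode_request(self, request):
        return request["prompt"]

    def predict(self, prompt):
        # schnell is distilled for few-step generation
        return self.pipe(prompt, num_inference_steps=4, guidance_scale=0.0).images[0]

    def encode_response(self, image):
        buf = io.BytesIO()
        image.save(buf, format="PNG")
        return {"image": base64.b64encode(buf.getvalue()).decode()}

if __name__ == "__main__":
    ls.LitServer(FluxAPI(), accelerator="gpu").run(port=8000)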

Since dedicated caching mechanisms are so crucial to performance, I also included an example of how to integrate SwarmUI/ComfyUI, a more dedicated infrastructure that may already be running as part of your tech stack. The result is a Llama-3.1 model capable of driving specific ComfyUI JSON workflows and many different settings, as sketched below.
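
As a rough sketch, handing an LLM-extracted prompt to a running ComfyUI instance can be as simple as posting an API-format workflow JSON to its /prompt endpoint. The workflow file name and the node id used for the prompt text are placeholders specific to your own workflow export:

import json
import requests

COMFY_URL = "http://127.0.0.1:8188"  # default ComfyUI address

def queue_comfy_workflow(prompt: str, workflow_path: str = "flux_workflow_api.json") -> str:
    # Load a workflow exported from ComfyUI in API format (file name is a placeholder)
    with open(workflow_path) as f:
        workflow = json.load(f)
    # Node "6" is assumed to be the positive text-encode node in this particular workflow
    workflow["6"]["inputs"]["text"] = prompt
    resp = requests.post(f"{COMFY_URL}/prompt", json={"prompt": workflow})
    resp.raise_for_status()
    return resp.json()["prompt_id"]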

Lastly, I tested the response times for each over a small batch of requests to simulate a speed test.

It quickly becomes clear how efficient caching can greatly reduce generation time, even in a scenario where another model is called. An average 4.5-second response time is not bad at all when you consider that an 8B model is calling a 12B-parameter model for a secondary generation.

Repo: https://github.com/tdolan21/tool-calling-playground
LitServe: https://github.com/Lightning-AI/LitServe
SwarmUI: https://github.com/mcmonkeyprojects/SwarmUI
posted an update 3 months ago
Automated web scraping with Playwright is becoming easier by the day. Now, using Ollama tool calling, it's possible to perform very high-accuracy web scraping (in some cases 100% accurate) just by asking an LLM to scrape the content for you.

This can be completed in a multi-step process similar to Cohere's platform. If you have tried the Cohere playground with web scraping, this will feel very similar. In my experience, the Llama 3.1 version is much better due to the larger context window. Both tools are great, but the difference is that the Ollama + Playwright version is completely controlled by you.

All you need to do is wrap your scraper in a function:

async def query_web_scraper(url: str) -> dict:
    # WebScraper is the Playwright-based scraper class defined in the playground repo
    scraper = WebScraper(headless=False)
    return await scraper.query_page_content(url)


and then make your request:

# First API call: Send the query and function description to the model
response = ollama.chat(
    model=model,
    messages=messages,
    tools=[
        {
            'type': 'function',
            'function': {
                'name': 'query_web_scraper',
                'description': 'Scrapes the content of a web page and returns the structured JSON object with titles, articles, and associated links.',
                'parameters': {
                    'type': 'object',
                    'properties': {
                        'url': {
                            'type': 'string',
                            'description': 'The URL of the web page to scrape.',
                        },
                    },
                    'required': ['url'],
                },
            },
        },
    ]
)
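
From there, the second step is roughly the following sketch: execute the tool call the model requested, append the scraped content as a tool message, and ask the model for its final answer. Dict-style access and the 'tool' role follow the ollama-python examples and may vary by client version:

import json

tool_calls = response['message'].get('tool_calls') or []
for tool in tool_calls:
    if tool['function']['name'] == 'query_web_scraper':
        # query_web_scraper is async, so await it (top-level await works in a notebook)
        scraped = await query_web_scraper(**tool['function']['arguments'])
        messages.append(response['message'])
        messages.append({'role': 'tool', 'content': json.dumps(scraped)})

# Second API call: the model now answers using the scraped content
final_response = ollama.chat(model=model, messages=messages)
print(final_response['message']['content'])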


To learn more:
GitHub w/ Playground: https://github.com/tdolan21/tool-calling-playground/blob/main/notebooks/ollama-playwright-web-scraping.ipynb
Complete Guide: https://medium.com/@tdolan21/building-an-llm-powered-web-scraper-with-ollama-and-playwright-6274d5d938b5

posted an update 3 months ago
Save money on your compute bill by using LMCache to share prefix KV cache between two different vLLM instances. By deploying the LMCache backend along with your vLLM containers, you can share a prefix KV cache between two different containers and models. It is very simple to integrate into your existing stack.

Step 1: Pull the Docker image
docker pull apostacyh/vllm:lmcache-0.1.0

Step 2: Start vLLM + LMCache
model=mistralai/Mistral-7B-Instruct-v0.2    # Replace with your model name
sudo docker run --runtime nvidia --gpus '"device=0"' \
    -v <Huggingface cache dir on your local machine>:/root/.cache/huggingface \
    -p 8000:8000 \
    --env "HF_TOKEN=<Your huggingface access token>" \
    --ipc=host \
    --network=host \
    apostacyh/vllm:lmcache-0.1.0 \
    --model $model --gpu-memory-utilization 0.6 --port 8000 \
    --lmcache-config-file /lmcache/LMCache/examples/example-local.yaml

You can add another vLLM instance, as long as it's on a separate GPU, by simply deploying another container:

# The second vLLM instance listens at port 8001
model=mistralai/Mistral-7B-Instruct-v0.2    # Replace with your model name
sudo docker run --runtime nvidia --gpus '"device=1"' \
    -v <Huggingface cache dir on your local machine>:/root/.cache/huggingface \
    -p 8001:8001 \
    --env "HF_TOKEN=<Your huggingface token>" \
    --ipc=host \
    --network=host \
    apostacyh/vllm:lmcache-0.1.0 \
    --model $model --gpu-memory-utilization 0.7 --port 8001 \
    --lmcache-config-file /lmcache/LMCache/examples/example.yaml

This method supports local, remote, or hybrid backends, so whichever vLLM deployment method you are already using should work with the LMCache container (excluding BentoML).
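
As a quick sanity check, you can hit both OpenAI-compatible endpoints with the same long shared prefix; with LMCache in place, the second instance should reuse the prefix KV rather than recompute it. This is only a sketch (the document file is a placeholder and timings depend on your LMCache config):

from openai import OpenAI

shared_prefix = open("long_context_document.txt").read()  # placeholder long prefix
question = "Summarize the key points."

for port in (8000, 8001):
    client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="EMPTY")
    completion = client.chat.completions.create(
        model="mistralai/Mistral-7B-Instruct-v0.2",
        messages=[
            {"role": "system", "content": shared_prefix},
            {"role": "user", "content": question},
        ],
    )
    print(port, completion.choices[0].message.content[:200])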

LMCache: https://github.com/LMCache/LMCache/tree/dev
vLLM: https://github.com/vllm-project/vllm
posted an update 7 months ago
Fine-tune Phi-3 using a Samantha-themed dataset and the Hugging Face SFT trainer!

In this Colab, we apply a simple supervised fine-tune to Phi-3 using the ShareGPT format.

# Convert ShareGPT-style conversations into single training strings
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = []
    # Map ShareGPT roles to plain role headers; EOS_TOKEN comes from the tokenizer
    mapper = {"system": "system\n", "human": "\nuser\n", "gpt": "\nassistant\n"}
    end_mapper = {"system": "", "human": "", "gpt": ""}
    for convo in convos:
        text = "".join(f"{mapper[(turn := x['from'])]} {x['value']}\n{end_mapper[turn]}" for x in convo)
        texts.append(f"{text}{EOS_TOKEN}")
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)
print(dataset['text'][8])
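
From there, the SFT setup itself is short. A minimal sketch is below; the hyperparameters are illustrative and the exact argument names (dataset_text_field, max_seq_length) have shifted across trl versions:

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_id = "microsoft/Phi-3-mini-4k-instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
EOS_TOKEN = tokenizer.eos_token  # used by formatting_prompts_func above

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="phi-3-opus-samantha",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-5,
        logging_steps=10,
    ),
)
trainer.train()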

Opus Samantha consists of 1,848 samples with the Samantha personality. The dataset covers a wide variety of topics such as logical reasoning, mathematics, legal questions, and roleplay.

This notebook serves as a viable option for fine-tuning Phi-3 until Unsloth supports Phi-3, which should be very soon. When that happens, check out AutoSloth for SFT, DPO, and Langfuse-format RAG fine-tuning on free-tier Colab hardware.

Resources:
Dataset: macadeliccc/opus_samantha
Colab: https://colab.research.google.com/drive/1e8LILflDQ2Me52hwS7uIfuJ9DxE2oQzM?usp=sharing
AutoSloth: https://colab.research.google.com/drive/1Zo0sVEb2lqdsUm9dy2PTzGySxdF9CNkc#scrollTo=bpimlPXVz-CZ
reacted to fblgit's post with ❤️ 9 months ago
Over the past week, I've been putting Claude through its paces, focusing primarily on productivity tasks (you know, the good old BAU – Business As Usual).

1. Python/Torch/Transformers/AI/ML
Right off the bat, I threw some complex AI/ML tasks at Claude, and I must say, it handled them with finesse. It even caught a few things that GPT missed! However, let's not get too carried away – we're not quite at the auto-code level just yet.

2. Brainstorming
This is where Claude falls a bit short. It seems to be more grounded than its competitors, which might not be ideal for generating novel ideas. If you're looking for a brainstorming partner, you might want to look elsewhere.

3. Attention
Despite the claims of super-large attention in the paper, Claude's "forgetting" mechanism seems to be more pronounced. It tends to miss entire chunks of information rather than just specific details like GPT does.

4. Following / Tasks
I hit a roadblock when Claude couldn't generate a LaTeX document. It's not the best at following complex, multi-step tasks.

5. Hallucinations
Oh boy, does Claude hallucinate! And when it does, it's on a whole new level of nonsense. The hallucinations seem to align with its grounded nature, making them even more convincing within the context of the prompt.

6. Sycophancy
Claude is quite the people-pleaser. I've found that using an adversarial brainstorming approach is more beneficial and time-efficient, as it forces me to highlight Claude's mistakes rather than letting it focus on being a sweet, pleasant minion.

7. Interface / UI
There's definitely room for improvement here. Basic features like stepping back on a prompt and stopping generation with the ESC key are missing. These are essential for extracting and composing content effectively.

Despite these limitations, I firmly believe that Claude is currently the #1
posted an update 9 months ago
Quantize 7B-parameter models in 60 seconds using Half-Quadratic Quantization (HQQ).

This game-changing technique allows for rapid quantization of models like Llama-2-70B in under 5 minutes, outperforming traditional methods by 50x in speed and offering high-quality compression without calibration data.

Mobius Labs' innovative approach not only significantly reduces memory requirements but also enables the use of large models on consumer-grade GPUs, paving the way for more accessible and efficient machine-learning research.

Mobius Labs' method utilizes a robust optimization formulation to determine the optimal quantization parameters, specifically targeting the minimization of the error between the original and dequantized weights. This involves a loss function that promotes sparsity through a non-convex ℓp norm with p < 1, making the problem challenging yet solvable through a half-quadratic solver.

This solver simplifies the problem by introducing an extra variable and dividing the optimization into manageable sub-problems. Their implementation cleverly fixes the scale parameter to simplify calculations and focuses on optimizing the zero-point, utilizing closed-form solutions for each sub-problem to bypass the need for gradient calculations.
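
Schematically, my reading of the blog post is the following formulation, with $\varphi$ the sparsity-promoting penalty and $Q_z$ the quantizer with zero-point $z$ (notation mine; see the blog post for the exact derivation):

$$\min_{z}\ \varphi\big(W - Q_z^{-1}(Q_z(W))\big),\qquad \varphi(x)=\lVert x\rVert_p^p,\ p<1$$

The half-quadratic split introduces an auxiliary variable $W_e$,

$$\min_{W_e,\,z}\ \varphi(W_e)+\tfrac{\beta}{2}\,\big\lVert W_e-\big(W-Q_z^{-1}(Q_z(W))\big)\big\rVert_2^2,$$

and alternates a generalized soft-thresholding update for $W_e$ with a closed-form update for the zero-point $z$, so no gradients are required.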

Check out the Colab demo, where you can quantize models (text generation and multimodal) for use with the vLLM or timm backends as well as Transformers!
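
In code, quantizing a model with hqq roughly follows the repo's README pattern. This is a sketch; the exact API may differ between hqq versions:

from transformers import AutoTokenizer
from hqq.engine.hf import HQQModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig

model_id = "meta-llama/Llama-2-7b-hf"
model = HQQModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit weights, group size 64; no calibration data needed
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
model.quantize_model(quant_config=quant_config)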

AutoHQQ: πŸ‘‰ https://colab.research.google.com/drive/1cG_5R_u9q53Uond7F0JEdliwvoeeaXVN?usp=sharing
Code: https://github.com/mobiusml/hqq
HQQ Blog post: https://mobiusml.github.io/hqq_blog/

Edit: Here is an example of how powerful HQQ can be: macadeliccc/Nous-Hermes-2-Mixtral-8x7B-DPO-HQQ

Citations:

@misc{badri2023hqq,
  title  = {Half-Quadratic Quantization of Large Machine Learning Models},
  url    = {https://mobiusml.github.io/hqq_blog/},
  author = {Hicham Badri and Appu Shaji},
  month  = {November},
  year   = {2023}
}
reacted to Tonic's post with ❤️👍 9 months ago
LAST CHANCE TO TRY 🌟STARCODER2

After today it's gone !

actually not - just joking ! it's <3 open source !

just trying to get folks' attention to my featured "Spaces of the Week" :

Tonic/starcoder2

drop a like for your boy and join us next week for making fine tunes !
replied to their post 9 months ago
posted an update 9 months ago
Create synthetic instruction datasets using open-source LLMs and Bonito 🐟!

With Bonito, you can generate synthetic datasets for a wide variety of supported tasks.

The Bonito model introduces a novel approach for conditional task generation, transforming unannotated text into task-specific training datasets to facilitate zero-shot adaptation of large language models on specialized data.

This methodology not only improves the adaptability of LLMs to new domains but also showcases the effectiveness of synthetic instruction tuning datasets in achieving substantial performance gains.
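
Usage is short. The sketch below follows the Bonito README pattern; the example dataset and argument names are from my recollection of the README and may differ slightly by version:

from bonito import Bonito
from vllm import SamplingParams
from datasets import load_dataset

bonito = Bonito("BatsResearch/bonito-v1")

# Unannotated text to convert into task-specific training data (swap in your own corpus)
unannotated = load_dataset("BatsResearch/bonito-experiment", "unannotated_contract_nli")["train"].select(range(10))

sampling_params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.5, n=1)
synthetic = bonito.generate_tasks(
    unannotated,
    context_col="input",
    task_type="nli",  # e.g. natural language inference
    sampling_params=sampling_params,
)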

AutoBonito🐟: https://colab.research.google.com/drive/1l9zh_VX0X4ylbzpGckCjH5yEflFsLW04?usp=sharing
Original Repo: https://github.com/BatsResearch/bonito?tab=readme-ov-file
Paper: Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation (2402.18334)
reacted to loubnabnl's post with 🤗❤️ 9 months ago
⭐ Today we’re releasing The Stack v2 & StarCoder2: a series of 3B, 7B & 15B code generation models trained on 3.3 to 4.5 trillion tokens of code:

- StarCoder2-15B matches or outperforms CodeLlama 34B, and approaches DeepSeek-33B on multiple benchmarks.
- StarCoder2-3B outperforms StarCoderBase-15B and similar sized models.
- The Stack v2 is a 4x larger dataset than The Stack v1, resulting in 900B unique code tokens 🚀
As always, we released everything from models and datasets to curation code. Enjoy!

🔗 StarCoder2 collection: bigcode/starcoder2-65de6da6e87db3383572be1a
🔗 Paper: https://drive.google.com/file/d/17iGn3c-sYNiLyRSY-A85QOzgzGnGiVI3/view
🔗 Blog post: https://huggingface.co/blog/starcoder2
🔗 Code Leaderboard: bigcode/bigcode-models-leaderboard
reacted to akhaliq's post with 👍❤️ 9 months ago
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits (2402.17764)

Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
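
For intuition, the paper describes an absmean quantizer that maps every weight to {-1, 0, 1}; here is a rough illustrative sketch (my own code, not the authors'):

import torch

def absmean_ternary(W: torch.Tensor, eps: float = 1e-5):
    gamma = W.abs().mean()                           # per-tensor scale
    Wq = (W / (gamma + eps)).round().clamp_(-1, 1)   # ternary {-1, 0, 1}
    return Wq, gamma                                 # dequantize as Wq * gamma

Wq, gamma = absmean_ternary(torch.randn(4, 4))
print(Wq)
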
reacted to FremyCompany's post with 👍 9 months ago
🔥 What's that biomedical model that got 170,763 downloads last month on HuggingFace?! Well, the paper is finally published! #BioLORD

📰 Read our article in the Journal of the American Medical Informatics Association:
https://academic.oup.com/jamia/advance-article/doi/10.1093/jamia/ocae029/7614965

πŸ“TLDR: BioLORD-2023 is a series of semantic language models for the biomedical domain, capable of representing clinical concepts and sentences in a semantic space aligned with human preferences. Our new multilingual version supports 50+ languages and is further finetuned on 7 European languages. These models were trained contrastively and through distillations, using a corpus unifying in the same latent space the concept names of biomedical concepts and their descriptions. For concepts which didn't have a description written by humans in UMLS, we use information contained in the SnomedCT knowledge graph and the capabilities of ChatGPT to generate synthetic data and improve our results.

🤗 Access our models from the HuggingFace hub, including the new 2023-C and 2023-S variants:
FremyCompany/BioLORD-2023
FremyCompany/BioLORD-2023-M
FremyCompany/BioLORD-2023-S
FremyCompany/BioLORD-2023-C
reacted to Xenova's post with ❤️ 9 months ago
Real-time object detection w/ 🤗 Transformers.js, running YOLOv9 locally in your browser! 🤯

Try it out yourself: Xenova/video-object-detection
(Model used + example code: Xenova/gelan-c_all)

This demo shows why on-device ML is so important:
1. Privacy - local inference means no user data is sent to the cloud
2. No server latency - empowers developers to build real-time applications
3. Lower costs - no need to pay for bandwidth and processing of streamed video

I can't wait to see what you build with it! 🔥
reacted to clefourrier's post with 🤝❤️ 9 months ago
First big community contribution to our evaluation suite, lighteval ⛅️

@Ali-C137 added 3 evaluation tasks in Arabic:
- ACVA, a benchmark about Arabic culture
- MMLU, translated
- Exams, translated
(datasets provided/translated by the AceGPT team)

Congrats to them!
https://github.com/huggingface/lighteval/pull/44
replied to their post 9 months ago
replied to vladbogo's post 9 months ago

Thank you! I will try it again and let you know if there are any issues.