GGML & GGUF
@gotzmann have you been using this? I've been using it for STEM-related tasks and it's been a pleasant surprise!
I've done thorough research on the latest 70B models (Upstage, Samantha, Nous Hermes, ...) with our own benchmark, and I should say that Synthia got the highest scores of all of them!
We are going to use it in our chat system instead of the Upstage Instruct model.
BTW, I've converted Synthia v1.2 into the Q4_K_M format (which is what we use), so it fits on a pair of 3090 / 4090 cards or a single A6000:
https://huggingface.co/gotzmann/Synthia-70B-v1.2-GGML
https://huggingface.co/gotzmann/Synthia-70B-v1.2-GGUF
Awesome!
Share your app too when you're ready. I can also help with some LinkedIn/Twitter re-sharing. :)
How do I use GGML or GGUF models? What's the best and fastest way to use them for inference? Do you have a suggested library? How many tokens/second can you achieve with those?
You can use them with llama.cpp, but I personally run them in the Oobabooga web UI. GGUF is basically GGML v2.0: it has extended metadata to store more information about the model, plus some additional improvements. GGML is unfortunately being deprecated pretty fast, and GGUF is the new format now. The GGML format was designed to run large AI models in CPU mode, so you can use DDR RAM instead of VRAM. I usually get 25-30 tokens per second.
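If you'd rather load a GGUF file from Python instead of going through a UI, the llama-cpp-python bindings are a common route. A minimal sketch, assuming the quantized file from the links above has already been downloaded locally (the filename and context size here are guesses):

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Load a local GGUF file; model_path and n_ctx are placeholders.
llm = Llama(model_path="synthia-70b-v1.2.Q4_K_M.gguf", n_ctx=4096)

out = llm("USER: What is GGUF?\nASSISTANT:", max_tokens=128)
print(out["choices"][0]["text"])
```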
Is anyone getting 70B inference to match GPT-3.5's generation time? My 70B models are served with Transformers text generation in 4-bit, and it's super slow!
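For context, a 4-bit Transformers setup like the one described above usually looks roughly like this (the model ID and generation settings are placeholders, not taken from this thread):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "migtissera/Synthia-70B-v1.2"  # placeholder model ID

# 4-bit quantization via bitsandbytes; this trades speed for fitting in VRAM.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available GPUs
)

inputs = tokenizer("USER: Hello\nASSISTANT:", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```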
Ideally, you need to run it on very large and powerful GPUs to get inference speed comparable to GPT-3.5. If you run it in CPU mode like I do, don't expect anything close to GPT-3.5 speed, but I'm pretty sure new optimizations are on the way (in CPU mode we can already partially offload GGML and GGUF models to GPUs to increase performance), so maybe we'll be able to squeeze out more speed in the near future. Or you can rent some powerful cloud GPUs and experiment with inference speed there; see the sketch below for the partial offload.
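To illustrate the partial GPU offload mentioned above, llama-cpp-python exposes it through `n_gpu_layers`. A sketch, where the layer count and filename are assumptions that depend on how much VRAM you actually have:

```python
from llama_cpp import Llama

# Keep some transformer layers on the GPU and the rest in system RAM.
# n_gpu_layers is a guess here; raise it until you run out of VRAM.
llm = Llama(
    model_path="synthia-70b-v1.2.Q4_K_M.gguf",  # placeholder filename
    n_gpu_layers=40,
    n_ctx=4096,
)

print(llm("USER: Hello\nASSISTANT:", max_tokens=64)["choices"][0]["text"])
```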
Also, you can try using this to run your AI models:
https://github.com/huggingface/text-generation-inference
But I haven't tried it yet.
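For anyone who wants to experiment with text-generation-inference, the `text-generation` Python client is one way to talk to the server once it's running. A minimal sketch, assuming a TGI server is already serving a model locally on port 8080 (endpoint and parameters are assumptions):

```python
# pip install text-generation
from text_generation import Client

# Assumes a local TGI server is already up and serving a model.
client = Client("http://127.0.0.1:8080")

response = client.generate(
    "Explain the difference between GGML and GGUF in one sentence.",
    max_new_tokens=128,
)
print(response.generated_text)
```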
Yeah that’s what I’m using. @gotzmann any thoughts?