What GPU is needed for this 70B one?
Is an RTX A6000 48GB enough for 70B?
It's enough for me:
```
Wed Jul 19 22:03:09 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03   CUDA Version: 12.2      |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:01:00.0  On |                  Off |
| 30%   44C    P8              32W / 300W |    805MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:02:00.0 Off |                  Off |
| 44%   76C    P2             298W / 300W |  34485MiB / 49140MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1262      G   /usr/lib/xorg/Xorg                          110MiB |
|    0   N/A  N/A      1880      G   /usr/lib/xorg/Xorg                          430MiB |
|    0   N/A  N/A      2009      G   /usr/bin/gnome-shell                         86MiB |
|    0   N/A  N/A      4149      G   ...8417883,14948046860862319246,262144      151MiB |
|    1   N/A  N/A      1262      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A      1880      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A     44687      C   python                                    34460MiB |
+---------------------------------------------------------------------------------------+
```
@alfredplpl Can you please share how you started it? Token lengths? Branch? I have the same setup but can't get it loaded...
Yeah, 4-bit uses around 36-38GB VRAM to load, plus context, so 48GB should be plenty.
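A rough back-of-the-envelope check of that number (my own estimate, not from the model card):

```python
# Rough VRAM estimate for a 4-bit 70B model; real usage also includes
# quantization scales/zeros, CUDA context, and the KV cache for your context length.
params = 70e9
bits_per_weight = 4
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weights_gb:.0f} GB of quantized weights")  # ~35 GB, before overhead and context
```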
@harpergrieve
Check the README again; I recently updated it to describe the various steps that are needed, e.g. updating Transformers and, if you use text-generation-webui or AutoGPTQ from Python code, making sure inject_fused_attention=False is set.
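For the "from Python code" case, a minimal AutoGPTQ loading sketch; the repo name is a placeholder, so check the README for the exact repo/branch you downloaded:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "TheBloke/Llama-2-70B-chat-GPTQ"  # placeholder: use the repo/branch you actually downloaded

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

# inject_fused_attention=False is the key setting for the 70B models,
# since the fused attention kernels don't handle its grouped-query attention.
model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    use_safetensors=True,
    device="cuda:0",
    inject_fused_attention=False,
)

input_ids = tokenizer("Tell me about AI", return_tensors="pt").input_ids.cuda()
output = model.generate(inputs=input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0]))
```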
@TheBloke Thanks for the reply. I'm using text-generation-inference and now getting the model.layers.0.self_attn.q_proj.weight error. I'll try using one of the other branches.
Did you update Transformers? And is that with Loader: AutoGPTQ?
Also try downloading the model again (same branch, i.e. main), just to double check the download is OK.
Earlier today I confirmed text-generation-webui works OK with AutoGPTQ + the main file, using "no inject fused attention" and with Transformers updated to the latest version - be aware that this has to be done inside the Python environment of text-generation-webui, else it won't see the changes.
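(For anyone unsure what "inside the Python environment" means here: activate whichever conda/venv environment text-generation-webui actually runs from and upgrade there, e.g. conda activate textgen followed by pip install --upgrade transformers - the environment name is just an example and depends on how you installed the webui.)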
Yep, just updated Transformers and it got me past the OOM error. Now getting that self_attn.q_proj.weight error on both main and gptq-4bit-32g-actorder_True. Can the inject_fused_attention=False flag be set through an env var like bits and groupsize?
Sorry, I misread what you said earlier. Text Generation Inference doesn't work and I don't know of a fix at this time.
@TheBloke Thanks for the help, and thanks for the models! I appreciate your work. I'll try to look into it and report back any findings if I do get it working...
I guess not even the gptq-3bit--1g-actorder_True will fit into a 24 GB GPU (e.g. RTX 3090)?
> Sorry, I misread what you said earlier. Text Generation Inference doesn't work and I don't know of a fix at this time.
FYI TGI should now work with this model, a PR was merged the other day
> I guess not even the gptq-3bit--1g-actorder_True will fit into a 24 GB GPU (e.g. RTX 3090)?
Yeah, I don't think it will. You will need 2 x 24GB GPUs, or 1 x 48GB GPU. Or an asymmetric setup like 1 x 24GB + 1 x 12GB.
But 1 x 24GB won't fit it, I'm afraid. Even the smallest file is 26GB.
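If anyone wants to try the 2 x 24GB route from Python, here's a minimal sketch of splitting the GPTQ model across both cards with AutoGPTQ; the repo name and memory caps are illustrative only:

```python
from auto_gptq import AutoGPTQForCausalLM

# device_map="auto" + max_memory lets accelerate shard the layers across both
# cards, leaving a little headroom on each 24GB GPU for the KV cache.
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-chat-GPTQ",      # illustrative repo name
    use_safetensors=True,
    device_map="auto",
    max_memory={0: "22GiB", 1: "22GiB"},
    inject_fused_attention=False,          # still needed for the 70B models
)
```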
Try the llama.cpp binary (latest commit). I was able to load the GGMLv3 with 24 GB VRAM and 40 GB additional RAM. Got 0.83 tokens/second on a 4090 and i9-9900K with the non-chat version. Oobabooga is not updated/merged yet.
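If you'd rather drive that from Python than the raw llama.cpp binary, the llama-cpp-python bindings expose the same partial GPU offload; this is only a sketch, and the file path and layer count are placeholders:

```python
from llama_cpp import Llama

# n_gpu_layers controls how many layers go into VRAM; the rest stay in system RAM,
# which is how a 70B GGML file can fit a 24GB card plus ~40GB of RAM.
llm = Llama(
    model_path="./llama-2-70b.ggmlv3.q4_K_M.bin",  # placeholder path to a GGMLv3 file
    n_gpu_layers=40,   # tune so it fits in your VRAM
    n_gqa=8,           # the 70B GGML models needed this grouped-query-attention setting at the time
    n_ctx=2048,
)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=48)
print(out["choices"][0]["text"])
```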
It's slow (0.8 - 0.9 tokens/s), but with ExLlama_HF I got it working on a 24GB 4090.
@Squeezitgirdle How did you do that? AFAIK Exllama does not support offloading to CPU RAM. Or is that supported using the HF variant?
How much slower are 2 separate GPUs compared to one large-VRAM GPU?
It depends on the GPU model, electrical PCIe slots and CPU, I think. If you have two full PCIe 16x slots (not available on consumer motherboards) with two RTX 3080s, it will depend only on the drivers and on multi-GPU support in the model's loader. Some versions of AutoGPTQ may be slow, or even no better than with one GPU.
I figured out that, for private hobby use, a 60-70B model isn't worth playing with, because the difference to a good 13B or 30B model is not that big. Sometimes you are just missing the small percentage of cases where a model does not answer in your language. In that case you can train it yourself by simply training on some books. Llama-2 7B may work for you with 12GB VRAM. You will need 20-30 GPU hours and a minimum of 50 MB of raw text files in high quality (no page numbers and other garbage). Today I did my first working LoRA merge, which lets me train in short runs with 1 MB text blocks. Training a 13B Llama 2 model with only a few MB of German text seems to work better than I hoped.

If you insist on inferencing with a 70B model, try pure llama.cpp. It is faster because of the lower prompt size, so as discussed above you may reach 0.8 tokens per second. Prompting with 4K of history, you may have to wait minutes to get a response while getting 0.02 tokens per second. And we are talking about a 4090 GPU. With full multi-GPU support and running under Linux, this should get much faster with two of these GPUs.
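For anyone curious what that kind of small-corpus LoRA run looks like in code, here is a rough QLoRA-style sketch with PEFT + Transformers; the base model, data file and hyperparameters are placeholders, not the poster's actual settings:

```python
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "meta-llama/Llama-2-7b-hf"          # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token

# Loading the base in 4-bit keeps a 7B model within roughly 12GB of VRAM for LoRA training.
model = AutoModelForCausalLM.from_pretrained(
    base,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True,
                                           bnb_4bit_compute_dtype=torch.float16),
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

# Plain-text corpus, e.g. cleaned book text; file name is a placeholder.
data = load_dataset("text", data_files={"train": "books_clean.txt"})["train"]
data = data.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
                remove_columns=["text"])

Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out", per_device_train_batch_size=1,
                           gradient_accumulation_steps=16, num_train_epochs=1,
                           learning_rate=2e-4, fp16=True, logging_steps=20),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```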
About Llama-2-70B-chat fp16: if I have 8 x A10 (24G), can I run it? Thanks!
The size of Llama 2 70B fp16 is around 130GB, so no, you can't run Llama 2 70B fp16 with 2 x 24GB. You need 2 x 80GB GPUs, or 4 x 48GB GPUs, or 6 x 24GB GPUs to run fp16.
But you can run Llama 2 70B 4-bit GPTQ on 2 x 24GB, and many people are doing this.
Hello, how much system RAM is needed for Llama 2 70B fp16?
I think you only need as much RAM as the size of one shard, which is only about 10GB. 64GB would be fine for example. Generally you won't find machines that have less RAM than VRAM anyway.
My GPUs are 16 x A10 (16 x 24G). I have asked many people to solve this problem, but failed.
URL: https://github.com/h2oai/h2ogpt/issues/692
Command: CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15" python generate.py --base_model=/data/model/llama2-70b-chat/ --prompt_type=llama2 --use_gpu_id=False --share=True
A bug appears when I use more than 10 GPUs:
https://user-images.githubusercontent.com/74184102/262883754-9f065f93-4e54-4708-8584-6b80ccf438ab.png
10 GPUs is OK, but more GPUs would be helpful!
When I use 10 or fewer GPUs it works, with this command: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9 python generate.py --base_model=/data/model/llama2-70b-chat/ --prompt_type=llama2 --use_gpu_id=False --share=True
But I need more GPUs because longer prompts need more GPU memory. Thanks!
> The size of Llama 2 70B fp16 is around 130GB, so no, you can't run Llama 2 70B fp16 with 2 x 24GB. You need 2 x 80GB GPUs, or 4 x 48GB GPUs, or 6 x 24GB GPUs to run fp16.
> But you can run Llama 2 70B 4-bit GPTQ on 2 x 24GB, and many people are doing this.
So a Mac Studio with M2 Ultra 196GB would run Llama 2 70B fp16?
Say you have a beefy setup with 4 x L40 GPUs or similar - do these need to be connected with NVLink to get good performance, or is it enough for them to just reside in the same physical box for Llama 70B?
I am running on Windows Server with a Xeon processor and 4 Tesla GPUs with 64 GB each. Only one user is able to interact with it at a time. The following error appears when another user asks a question or sends a prompt while the first one is still processing. Please advise.
Error Encountered
Error occurred during text generation: {"detail":{"msg":"Server is busy; please try again later.","type":"service_unavailable"}}
> Sorry, I misread what you said earlier. Text Generation Inference doesn't work and I don't know of a fix at this time.
> FYI TGI should now work with this model, a PR was merged the other day
It's October and it still does not work. The self_attn.q_proj.weight error still appears while loading 70B chat GPTQ on Text Generation Inference. @Bloke, is there anything I am missing? I am using the latest TGI Docker version and the required CUDA configs as well.
Hi, I have 2 GPUs, of which one is NVIDIA. I want to run Llama 2 7B-chat using only the NVIDIA GPU (Debian Linux system).
I normally run Llama 2 with these commands (from this guide: https://lachieslifestyle.com/2023/07/29/how-to-install-llama-2/#preparing-to-install-l-la-ma-2)
#conda activate TextGen2
#cd text-generation-webui
#python server.py
Could you suggest how to do this?
Thanks :)
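(If the second GPU isn't an NVIDIA card, the CUDA build of PyTorch will only see the NVIDIA one anyway, so the commands above should already use it. If both were NVIDIA, you could pin the webui to one card the same way the h2ogpt command earlier in this thread does, e.g. CUDA_VISIBLE_DEVICES=0 python server.py - treat the device index as an example and check nvidia-smi for the right one.)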
> @Squeezitgirdle How did you do that? AFAIK Exllama does not support offloading to CPU RAM. Or is that supported using the HF variant?
Sorry, I'm just now responding.
I have absolutely no idea. I did it once using LM Studio, but that's it. I haven't been able to do it again after updating LM Studio.