Unfortunately I can't run this model in text-generation-webui.
Tell me, what am I doing wrong? I did everything according to the instructions:
- add the following to update_windows.bat and run it:
pip install auto-gptq
pip install einops
- add 'trust_remote_code': shared.args.trust_remote_code, in AutoGPTQ_loader.py (shown below)
- add --trust-remote-code (instead of --trust_remote_code) and --autogptq in webui.py
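For reference, the params dict in my AutoGPTQ_loader.py now looks roughly like this (the trust_remote_code line is the one I added; the other keys are approximate, written from memory):

params = {
    'model_basename': pt_path.stem,
    'device': "cuda:0" if not shared.args.cpu else "cpu",
    'use_triton': shared.args.triton,
    'use_safetensors': use_safetensors,
    'trust_remote_code': shared.args.trust_remote_code,  # the line I added
}
model = AutoGPTQForCausalLM.from_quantized(path_to_model, **params)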
but I get an error:
Traceback (most recent call last):
  File "D:\LLaMA\oobabooga_windows\text-generation-webui\server.py", line 71, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "D:\LLaMA\oobabooga_windows\text-generation-webui\modules\models.py", line 95, in load_model
    output = load_func(model_name)
  File "D:\LLaMA\oobabooga_windows\text-generation-webui\modules\models.py", line 297, in AutoGPTQ_loader
    return modules.AutoGPTQ_loader.load_quantized(model_name)
  File "D:\LLaMA\oobabooga_windows\text-generation-webui\modules\AutoGPTQ_loader.py", line 43, in load_quantized
    model = AutoGPTQForCausalLM.from_quantized(path_to_model, **params)
  File "D:\LLaMA\oobabooga_windows\installer_files\env\lib\site-packages\auto_gptq\modeling\auto.py", line 62, in from_quantized
    model_type = check_and_get_model_type(save_dir)
  File "D:\LLaMA\oobabooga_windows\installer_files\env\lib\site-packages\auto_gptq\modeling\_utils.py", line 124, in check_and_get_model_type
    raise TypeError(f"{config.model_type} isn't supported yet.")
TypeError: RefinedWeb isn't supported yet.
Parameters on the GPTQ tab:
wbits 4, groupsize 64, model_type llama
You need to update AutoGPTQ with:
git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
pip install .  # this step requires the CUDA toolkit to be installed
I will make this clearer in the README!
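To confirm the source build replaced any previously pip-installed copy, something like this should show the version you just built (the exact version string may vary):

pip show auto-gptq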
Three more things to note:
- The GPTQ parameters don't have any effect for AutoGPTQ models
- This 40B model requires more than 24GB VRAM, so you will have to use CPU offloading (see the sketch after this list)
- It's slow as hell at the moment! Even with enough VRAM (e.g. on a 48GB card), I was getting less than 1 token/s.
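For the CPU offloading point, here is a rough sketch of what that looks like when calling AutoGPTQ directly, assuming your build supports the max_memory argument (it is passed through to accelerate); the path and memory limits below are placeholders:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# local download of TheBloke/falcon-40b-instruct-GPTQ
quantized_model_dir = "/path/to/falcon-40b-instruct-GPTQ"

# cap GPU 0 at ~23GiB and spill the remaining layers to system RAM
max_memory = {0: "23GiB", "cpu": "64GiB"}

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=False)
model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    use_safetensors=True,
    trust_remote_code=True,  # needed for the Falcon/RefinedWeb custom code
    max_memory=max_memory,
    # you may also need model_basename= set to the .safetensors file name in the repo
)

Layers offloaded to the CPU will of course make generation even slower.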
It's working! Thanks!
Output generated in 24.27 seconds (0.45 tokens/s, 11 tokens, context 48, seed 466795515)
Output generated in 43.11 seconds (0.49 tokens/s, 21 tokens, context 48, seed 532631384)
Output generated in 40.49 seconds (0.57 tokens/s, 23 tokens, context 41, seed 1349492009)
Output generated in 334.20 seconds (0.33 tokens/s, 109 tokens, context 48, seed 1693397338)
😭
Yup :) It is slow as hell atm. I've flagged it with qwopqwop and PanQiWei of AutoGPTQ so hopefully they can investigate if it's anything on the AutoGPTQ side.
But my feeling is that it may have just as much to do with the custom code for loading the Falcon model, or some combination of that code with AutoGPTQ.
How do I enable CPU offloading? Is there any possibility of running this model on a 4090 with 24GB?
How can I run it in Google Colab? I encounter the error below. Could you please help to resolve the issue? Running on an A100 GPU instance.
  File "/opt/conda/lib/python3.7/site-packages/auto_gptq/modeling/_base.py", line 182
    if (pos_ids := kwargs.get("position_ids", None)) is not None:
SyntaxError: invalid syntax
Code used:

! BUILD_CUDA_EXT=0 pip install auto-gptq

import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Download the model from HF and store it locally, then reference its location here:
quantized_model_dir = "/TheBloke/falcon-40b-instruct-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=False)
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir,
                                           device="cuda:0",
                                           use_triton=False,
                                           use_safetensors=True,
                                           torch_dtype=torch.float32,
                                           trust_remote_code=True)
The model fully loads on my 3090 in WSL2 with text-gen-webui, using AutoGPTQ.
Compiling AutoGPTQ 0.3.0.dev0 from source still does not mark the module as +cuXXX.
Yeah he still hasn't fixed that. But it does compile the module; or should.
There's a simple test as to whether you have the CUDA extension installed:
$ python -c 'import torch ; import autogptq_cuda'
$
If that returns no output, it should be OK.
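If you'd rather have a check that prints something on success instead of staying silent, a variant along these lines should work (the torch.cuda.is_available() call is just an extra sanity check):

$ python -c "import torch, autogptq_cuda; print('autogptq_cuda imported OK; torch sees CUDA:', torch.cuda.is_available())"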