Expected scalar type Float but found Half when using Text Gen WebUI with Vicuna & monkey-patch

#11 opened by mbecuwe

I am trying to fine-tune a Vicuna model using text-generation-webui.
I followed these installation steps, as shown in the documentation:

# Install miniconda
curl -sL "https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh" > "Miniconda3.sh"
bash Miniconda3.sh

# Create conda env
conda create -n textgen python=3.10.9
conda activate textgen

# Install torch
pip3 install torch torchvision torchaudio

# Install text generation webui
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt

# install nvcc
conda install -c conda-forge cudatoolkit-dev

# Install GPTQ for LLaMa
sudo apt install build-essential
mkdir repositories
cd repositories
git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda
cd GPTQ-for-LLaMa
python setup_cuda.py install

# Install monkey patch
cd ..
git clone https://github.com/johnsmith0031/alpaca_lora_4bit
pip install git+https://github.com/sterlind/GPTQ-for-LLaMa.git@eaa9955 # Won't work unless I revert to this specific commit

# Download model
cd ..
python download-model.py TheBloke/stable-vicuna-13B-GPTQ

# Run server with monkey patch
python server.py --model TheBloke_stable-vicuna-13B-GPTQ --wbits 4 --groupsize 128 --model_type Llama --share --api --listen --auto-devices --monkey-patch --no-stream
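
(For reference, a quick Python sanity check I run first to confirm torch was built with CUDA and actually sees the card; nothing project-specific assumed here:)

import torch

print("torch:", torch.__version__)
print("CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))  # should report the Tesla P100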

When I try to generate from prompts in the interface, I get the following error:

Traceback (most recent call last):
  File "/home/jupyter/text-generation-webui/modules/callbacks.py", line 73, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "/home/jupyter/text-generation-webui/modules/text_generation.py", line 277, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/home/jupyter/text-generation-webui/repositories/alpaca_lora_4bit/amp_wrapper.py", line 18, in autocast_generate
    return self.model.non_autocast_generate(*args, **kwargs)
  File "/opt/conda/envs/textgen/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 1565, in generate
    return self.sample(
  File "/opt/conda/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 2612, in sample
    outputs = self(
  File "/opt/conda/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 688, in forward
    outputs = self.model(
  File "/opt/conda/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 578, in forward
    layer_outputs = decoder_layer(
  File "/opt/conda/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 293, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/opt/conda/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 197, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/opt/conda/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/jupyter/text-generation-webui/repositories/alpaca_lora_4bit/autograd_4bit.py", line 133, in forward
    out = matmul4bit_with_backend(x, self.qweight, self.scales,
  File "/home/jupyter/text-generation-webui/repositories/alpaca_lora_4bit/autograd_4bit.py", line 89, in matmul4bit_with_backend
    return mm4b.matmul4bit(x, qweight, scales, qzeros, g_idx)
  File "/home/jupyter/text-generation-webui/repositories/alpaca_lora_4bit/matmul_utils_4bit.py", line 131, in matmul4bit
    output = _matmul4bit_v2(x, qweight, scales, zeros, g_idx)
  File "/home/jupyter/text-generation-webui/repositories/alpaca_lora_4bit/matmul_utils_4bit.py", line 70, in _matmul4bit_v2
    quant_cuda.vecquant4matmul_faster(x, qweight, y, scales, zeros, g_idx, x.shape[-1] // 2)
RuntimeError: expected scalar type Float but found Half
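
If I read the traceback right, this is a dtype mismatch: the hidden states reach quant_cuda.vecquant4matmul_faster in float16 (Half) while that kernel path expects float32 activations. A minimal sketch of the same class of failure in plain PyTorch (the exact error wording varies by torch version, and the cast at the end is only an illustration, not a confirmed fix):

import torch

x = torch.randn(1, 8, dtype=torch.float16)  # Half activations, as under the monkey patch
w = torch.randn(8, 8, dtype=torch.float32)  # Float, what the kernel path expects

try:
    torch.mm(x, w)  # mixed-dtype matmul fails, much like the custom kernel call
except RuntimeError as e:
    print(e)  # e.g. "expected scalar type Float but found Half" (wording varies)

# Hypothetical workaround direction: make the dtypes agree before the call,
# e.g. cast with x.float() ahead of vecquant4matmul_faster; this may cost
# memory/speed and is not a confirmed fix for alpaca_lora_4bit.
print(torch.mm(x.float(), w).dtype)  # torch.float32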

Text generation works without the monkey patch, but then I cannot fine-tune the model on my dataset.
All my tests are on an NVIDIA P100 GPU.
It would be a great help if you could help me fix this!

Sorry, I have no experience with the monkey patch or with fine-tuning GPTQ models.

AutoGPTQ is adding PEFT support soon (it's currently in a PR, so you could try it), which should be much better once it works.
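
Purely as a sketch of the direction that would take: load the quantized checkpoint with AutoGPTQ's from_quantized and define a standard PEFT LoRA config. The adapter-attachment helper itself is hypothetical here, since the PR isn't merged and I haven't confirmed its interface.

from auto_gptq import AutoGPTQForCausalLM
from peft import LoraConfig, TaskType

# Load the 4-bit GPTQ checkpoint (existing AutoGPTQ loading API)
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/stable-vicuna-13B-GPTQ",
    use_safetensors=True,
    device="cuda:0",
)

# Standard PEFT LoRA config; module names follow the LLaMA attention layout
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)

# Hypothetical: whatever helper the PEFT PR ends up exposing, e.g. something
# like peft_model = get_gptq_peft_model(model, lora_config), then fine-tune
# peft_model with a normal training loop or Trainer.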

Try asking on the GitHub repo where you got the monkey patch code - is it Alpaca Lora 4bit?
