Could you provide a demo Python file?

#11
by carlosbdw - opened

I tried guanaco-7b_ggml and guanaco-13b_ggml, and they are amazing models.
Now I want to try the 33B_GPTQ and 65B_GPTQ, but I can't find a way to run them successfully. It would be great if an official demo could be provided, thanks!

First install AutoGPTQ:

pip install auto-gptq

Note that this only works automatically with CUDA toolkit 11.7 or 11.8.

If you have another version, or if the above command doesn't install the CUDA extension (if you get a warning about "CUDA extension not installed"), then install from source:

git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
pip install .
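
To check that the CUDA extension actually built, here is a minimal sketch (the extension module name autogptq_cuda is an assumption and may differ between AutoGPTQ versions):

import torch

# Report whether PyTorch can see a GPU and which CUDA version it was built against
print(torch.cuda.is_available(), torch.version.cuda)

# The GPTQ kernels are compiled into a separate extension module; if this import
# fails you will see the "CUDA extension not installed" warning at model load time.
try:
    import autogptq_cuda  # noqa: F401  (module name may vary per AutoGPTQ version)
    print("AutoGPTQ CUDA extension found")
except ImportError:
    print("CUDA extension not installed - reinstall AutoGPTQ from source")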

Then here is some sample code:

from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import argparse

parser = argparse.ArgumentParser(description='Simple AutoGPTQ example')
parser.add_argument('model_name_or_path', type=str, help='Model folder or repo')
parser.add_argument('--model_basename', type=str, help='Model file basename if model is not named gptq_model-Xb-Ygr')
parser.add_argument('--use_slow', action="store_true", help='Use slow tokenizer')
parser.add_argument('--use_safetensors', action="store_true", help='Load the model weights from a .safetensors file')
parser.add_argument('--use_triton', action="store_true", help='Use Triton for inference?')
parser.add_argument('--bits', type=int, default=4, help='Specify GPTQ bits. Only needed if no quantize_config.json is provided')
parser.add_argument('--group_size', type=int, default=128, help='Specify GPTQ group_size. Only needed if no quantize_config.json is provided')
parser.add_argument('--desc_act', action="store_true", help='Specify GPTQ desc_act. Only needed if no quantize_config.json is provided')

args = parser.parse_args()

quantized_model_dir = args.model_name_or_path

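# Load the tokenizer from the same folder/repo as the model (fast tokenizer unless --use_slow is passed)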
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=not args.use_slow)

# Use the model's quantize_config.json if present; otherwise build one from the CLI arguments
try:
    quantize_config = BaseQuantizeConfig.from_pretrained(quantized_model_dir)
except Exception:
    quantize_config = BaseQuantizeConfig(
            bits=args.bits,
            group_size=args.group_size,
            desc_act=args.desc_act
        )

model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir,
        use_safetensors=args.use_safetensors,
        model_basename=args.model_basename,
        device="cuda:0",
        use_triton=args.use_triton,
        quantize_config=quantize_config)

# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
logging.set_verbosity(logging.CRITICAL)

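# Guanaco expects the "### Human: ... ### Assistant:" prompt format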
prompt = "Tell me about AI"
prompt_template=f'''### Human: {prompt}
### Assistant:'''

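# Option 1: generate with a transformers text-generation pipeline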
print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

print(pipe(prompt_template)[0]['generated_text'])

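# Option 2: call model.generate() directly on the tokenized prompt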
print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

Then run:

python simple_autogptq.py TheBloke/guanaco-65B-GPTQ --model_basename Guanaco-65B-GPTQ-4bit.act-order --use_safetensors

Or if you already downloaded the model locally, you can specify the local path like:

python simple_autogptq.py /path/to/models/TheBloke_guanaco-65B-GPTQ --model_basename Guanaco-65B-GPTQ-4bit.act-order --use_safetensors
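
If you want to pre-download the model, one option is huggingface_hub's snapshot_download (a minimal sketch; the local path is just an example):

from huggingface_hub import snapshot_download

# Download all files from the repo into a local folder you can then pass to the script
snapshot_download(repo_id="TheBloke/guanaco-65B-GPTQ",
                  local_dir="/path/to/models/TheBloke_guanaco-65B-GPTQ")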

With two RTX 3090 GPUs, inference takes forever (it never finishes)! Why? I am running the code above.
