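This is an HQQ all 4-bit (group-size=64) quantized version of the Llama3.1-8B-Instruct model. We provide two versions:
- Calibration-free version: mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq
- Calibrated version: mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib
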
Models | fp16 | HQQ 4-bit/gs-64 | AWQ 4-bit | GPTQ 4-bit |
---|---|---|---|---|
Bitrate (Linear layers) | 16 | 4.5 | 4.25 | 4.25 |
VRAM (GB) | 15.7 | 6.1 | 6.3 | 5.7 |
Decoding* - short seq (tokens/sec) | 53 | 125 | 67 | 3.7 |
Decoding* - long seq (tokens/sec) | 50 | 97 | 65 | 21 |
*: measured on an RTX 3090
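The bitrate figures in the table can be sanity-checked with a little arithmetic. The sketch below assumes that each quantization group stores one 16-bit scale and one 16-bit zero-point, and that the AWQ and GPTQ models use a group size of 128. Neither assumption is stated in this card; they are chosen because they reproduce the 4.5 and 4.25 values. The function name `effective_bits` is hypothetical.

```python
def effective_bits(nbits, group_size, scale_bits=16, zero_bits=16):
    """Average bits per weight, counting per-group metadata.

    Assumption: every group of `group_size` weights carries one scale
    and one zero-point, each stored at the given bit width.
    """
    return nbits + (scale_bits + zero_bits) / group_size


# 4-bit weights with group size 64 (the HQQ column)
print(effective_bits(4, 64))   # 4.5

# 4-bit weights with an assumed group size of 128 (AWQ / GPTQ columns)
print(effective_bits(4, 128))  # 4.25
```

A smaller group size costs more metadata per weight, which is why the gs-64 model has a slightly higher bitrate than a 4-bit model with larger groups.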
Models | fp16 | HQQ 4-bit/gs-64 (no calib) | HQQ 4-bit/gs-64 (calib) | AWQ 4-bit | GPTQ 4-bit |
---|---|---|---|---|---|
ARC (25-shot) | 60.49 | 60.32 | 60.92 | 57.85 | 61.18 |
HellaSwag (10-shot) | 80.16 | 79.21 | 79.52 | 79.28 | 77.82 |
MMLU (5-shot) | 68.98 | 67.07 | 67.74 | 67.14 | 67.93 |
TruthfulQA-MC2 | 54.03 | 53.89 | 54.11 | 51.87 | 53.58 |
Winogrande (5-shot) | 77.98 | 76.24 | 76.48 | 76.4 | 76.64 |
GSM8K (5-shot) | 75.44 | 71.27 | 75.36 | 73.47 | 72.25 |
Average | 69.51 | 68.00 | 69.02 | 67.67 | 68.23 |
Relative performance | 100% | 97.83% | 99.3% | 97.35% | 98.16% |
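The "Average" and "Relative performance" rows can be recomputed from the per-task scores. The sketch below copies the numbers from the table and assumes that relative performance is the ratio of the rounded averages, which is what matches the published percentages.

```python
# Per-task scores copied from the table, in row order:
# ARC, HellaSwag, MMLU, TruthfulQA-MC2, Winogrande, GSM8K
SCORES = {
    "fp16":            [60.49, 80.16, 68.98, 54.03, 77.98, 75.44],
    "HQQ (no calib)":  [60.32, 79.21, 67.07, 53.89, 76.24, 71.27],
    "HQQ (calib)":     [60.92, 79.52, 67.74, 54.11, 76.48, 75.36],
    "AWQ":             [57.85, 79.28, 67.14, 51.87, 76.40, 73.47],
    "GPTQ":            [61.18, 77.82, 67.93, 53.58, 76.64, 72.25],
}

# Mean over the six tasks, rounded to two decimals as in the table
averages = {name: round(sum(vals) / len(vals), 2) for name, vals in SCORES.items()}

# Assumption: relative performance = rounded average / rounded fp16 average
relative = {name: round(100 * avg / averages["fp16"], 2) for name, avg in averages.items()}

for name in SCORES:
    print(f"{name:15s} average={averages[name]:.2f} relative={relative[name]:.2f}%")
```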
You can reproduce the results above with lm-eval 0.4.3: pip install lm-eval==0.4.3
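As a rough guide, the few-shot settings from the table map onto lm-eval task names as sketched below. The task names (`arc_challenge` for ARC in particular), the 0-shot setting for TruthfulQA-MC2, the batch size, and the `YOUR_MODEL_ID` placeholder are assumptions, not taken from this card. The HQQ checkpoints may also need to be loaded in Python and wrapped for lm-eval rather than passed through `--model hf`, so treat the generated commands as an illustration of the task settings only.

```python
import shlex

# Task name -> number of few-shot examples (shot counts from the table;
# task names and the TruthfulQA setting are assumptions)
TASKS = {
    "arc_challenge": 25,
    "hellaswag": 10,
    "mmlu": 5,
    "truthfulqa_mc2": 0,
    "winogrande": 5,
    "gsm8k": 5,
}


def build_command(task, num_fewshot, model_args):
    """Build an lm_eval command line for one task as a shell-safe string."""
    return shlex.join([
        "lm_eval",
        "--model", "hf",
        "--model_args", model_args,
        "--tasks", task,
        "--num_fewshot", str(num_fewshot),
        "--batch_size", "1",
    ])


for task, shots in TASKS.items():
    # YOUR_MODEL_ID is a placeholder, not a real repository name
    print(build_command(task, shots, "pretrained=YOUR_MODEL_ID"))
```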
First, install the dependencies:
pip install git+https://github.com/mobiusml/hqq.git #master branch fix
pip install bitblas #if you use the bitblas backend
Also, make sure you use at least torch 2.4.0 or a nightly build, with CUDA 12.1 or newer.
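A quick way to confirm these requirements is to compare the installed versions against the minimums. The helper names `parse_version` and `meets_minimum` below are hypothetical; the sketch only reads `torch.__version__` and `torch.version.cuda`, and skips the check if torch is not installed.

```python
import re


def parse_version(text):
    """Return the leading numeric components of a version string as ints.

    Local suffixes such as '+cu121' and tags such as 'dev20240701' are ignored.
    """
    core = text.split("+", 1)[0]
    parts = []
    for piece in core.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(found, minimum):
    """True if version string `found` is at least `minimum`."""
    a, b = parse_version(found), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '2.4' compares equal to '2.4.0'
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


try:
    import torch
except ImportError:
    torch = None

if torch is not None:
    print("torch", torch.__version__, "ok:", meets_minimum(torch.__version__, "2.4.0"))
    if torch.version.cuda is not None:  # None on CPU-only builds
        print("CUDA", torch.version.cuda, "ok:", meets_minimum(torch.version.cuda, "12.1"))
```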
Then you can use the sample code below:
import torch
from transformers import AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel
from hqq.utils.patching import *
from hqq.core.quantize import *
from hqq.utils.generation_hf import HFGenerator
#Load the model
###################################################
#model_id = 'mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq' #no calib version
model_id = 'mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib' #calibrated version
compute_dtype = torch.bfloat16 #bfloat16 for torchao_int4, float16 for bitblas
cache_dir = '.'
model = AutoHQQHFModel.from_quantized(model_id, cache_dir=cache_dir, compute_dtype=compute_dtype)
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir)
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, quant_scale=False, quant_zero=False, axis=1)
patch_linearlayers(model, patch_add_quant_config, quant_config)
#Use optimized inference kernels
###################################################
HQQLinear.set_backend(HQQBackend.PYTORCH)
#prepare_for_inference(model) #default backend
prepare_for_inference(model, backend="torchao_int4")
#prepare_for_inference(model, backend="bitblas") #takes a while to init...
#Generate
###################################################
#For longer context, make sure to allocate enough cache via the cache_size= parameter
gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile="partial").warmup() #Warm-up takes a while
gen.generate("Write an essay about large language models", print_tokens=True)
gen.generate("Tell me a funny joke!", print_tokens=True)
gen.generate("How to make a yummy chocolate cake?", print_tokens=True)
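For intuition about what `nbits=4, group_size=64` in the quantization config means, here is a minimal sketch of group-wise affine quantization in plain Python. It uses simple min-max round-to-nearest per group, which is not HQQ's actual solver, and the function names are hypothetical. It only illustrates the storage layout: 4-bit integer codes plus one scale and one zero-point per group of 64 weights.

```python
import random


def quantize_groups(weights, nbits=4, group_size=64):
    """Quantize a flat list of floats group by group.

    Returns a list of (codes, scale, zero) tuples, one per group, where
    each code is an integer in [0, 2**nbits - 1].
    """
    qmax = 2 ** nbits - 1
    groups = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / qmax or 1.0  # avoid a zero scale for constant groups
        codes = [round((w - lo) / scale) for w in group]
        groups.append((codes, scale, lo))
    return groups


def dequantize_groups(groups):
    """Reconstruct approximate float weights from the quantized groups."""
    restored = []
    for codes, scale, zero in groups:
        restored.extend(code * scale + zero for code in codes)
    return restored


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]
groups = quantize_groups(weights)
restored = dequantize_groups(groups)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"{len(groups)} groups, max abs error {max_error:.6f}")
```

With round-to-nearest, the reconstruction error of each weight is at most half of its group's scale, so smaller groups (which tend to have a narrower value range) give lower error at the cost of more metadata.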