AQLM
Try AQLM on Google Colab!
Additive Quantization of Language Models (AQLM) is a compression method for large language models. It quantizes multiple weights together, taking advantage of interdependencies between them, and represents groups of 8-16 weights as a sum of multiple vector codes.
Inference support for AQLM is provided by the aqlm library. Make sure to install it to run the models (note that aqlm works only with python>=3.10):
pip install aqlm[gpu,cpu]
The library provides efficient kernels for both GPU and CPU inference and training.
Instructions on how to quantize models yourself, as well as all the relevant code, can be found in the corresponding GitHub repository. To run AQLM models, simply load a model that has been quantized with AQLM:
from transformers import AutoTokenizer, AutoModelForCausalLM
quantized_model = AutoModelForCausalLM.from_pretrained(
    "ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf")
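Once loaded, the quantized model behaves like any other transformers model. A minimal generation sketch using the model and tokenizer loaded above (the prompt is arbitrary):

```python
# Tokenize a prompt and move it to the device the model was placed on.
inputs = tokenizer("The capital of France is", return_tensors="pt").to(quantized_model.device)

# Generate a short continuation and decode it back to text.
outputs = quantized_model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```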
PEFT
Starting with aqlm version 1.0.2, AQLM supports Parameter-Efficient Fine-Tuning in the form of LoRA integrated into the PEFT library.
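As a minimal sketch of what this looks like, the quantized model loaded above can be wrapped with a LoRA adapter via PEFT. The LoRA hyperparameters and target module names below are illustrative assumptions, not values prescribed by AQLM; adjust them to your model architecture and task:

```python
from peft import LoraConfig, get_peft_model

# Illustrative LoRA configuration; q_proj/v_proj are typical attention
# projection names, but the right targets depend on the model.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Attach trainable LoRA adapters on top of the frozen, quantized weights.
peft_model = get_peft_model(quantized_model, lora_config)
peft_model.print_trainable_parameters()
```

The `peft_model` can then be fine-tuned with the usual training utilities (e.g. the transformers Trainer), with only the LoRA parameters being updated.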
AQLM configurations
AQLM quantization setups vary mainly in the number of codebooks used and the codebook size in bits. The most popular setups, along with the inference kernels they support, are:
| Kernel | Number of codebooks | Codebook size, bits | Notation | Accuracy | Speedup | Fast GPU inference | Fast CPU inference |
|---|---|---|---|---|---|---|---|
| Triton | K | N | KxN | - | Up to ~0.7x | ✅ | ❌ |
| CUDA | 1 | 16 | 1x16 | Best | Up to ~1.3x | ✅ | ❌ |
| CUDA | 2 | 8 | 2x8 | OK | Up to ~3.0x | ✅ | ❌ |
| Numba | K | 8 | Kx8 | Good | Up to ~4.0x | ❌ | ✅ |
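As a rough illustration of how the notation relates to the bit-width in model names: the index bits per group divided by the group size give the approximate bits per weight. The sketch below assumes a group size of 8 weights (consistent with the 2-bit 1x16 model used above) and ignores the storage cost of the codebooks and scales:

```python
# Approximate bits-per-weight for a 1x16 setup over groups of 8 weights.
num_codebooks = 1
codebook_size_bits = 16
group_size = 8  # assumed number of weights quantized together

bits_per_weight = num_codebooks * codebook_size_bits / group_size
print(bits_per_weight)  # 2.0 -> the "2Bit" in Mixtral-8x7b-AQLM-2Bit-1x16-hf
```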