Transformers

You are viewing v4.46.0 version. A newer version v4.46.3 is available.


AQLM

Try AQLM on Google Colab!

Additive Quantization of Language Models (AQLM) is a compression method for large language models. It quantizes multiple weights together and takes advantage of the interdependencies between them. AQLM represents each group of 8-16 weights as a sum of multiple vector codes.
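The idea of "a sum of multiple vector codes" can be illustrated with a small, library-free sketch. It is not the aqlm implementation: the codebooks here are random placeholders (real codebooks are learned during quantization), per-group scales and other overheads are ignored, and the names decode_group and bits_per_weight are hypothetical helpers introduced only for this illustration.

```python
import random

GROUP_SIZE = 8      # weights quantized together (the text above gives 8-16)
NUM_CODEBOOKS = 2   # "K": how many codebooks contribute to each group
CODEBOOK_BITS = 8   # "N": each codebook holds 2**N candidate vectors


def make_codebooks(num_codebooks, codebook_bits, group_size, seed=0):
    """Build placeholder codebooks; real ones are learned, not random."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 0.02) for _ in range(group_size)]
         for _ in range(2 ** codebook_bits)]
        for _ in range(num_codebooks)
    ]


def decode_group(codes, codebooks):
    """Reconstruct one weight group as the sum of one vector per codebook."""
    group = [0.0] * len(codebooks[0][0])
    for codebook, code in zip(codebooks, codes):
        for i, value in enumerate(codebook[code]):
            group[i] += value
    return group


def bits_per_weight(num_codebooks, codebook_bits, group_size):
    """Index storage cost per weight, ignoring codebooks and scales."""
    return num_codebooks * codebook_bits / group_size


codebooks = make_codebooks(NUM_CODEBOOKS, CODEBOOK_BITS, GROUP_SIZE)
codes = [17, 203]  # one stored index per codebook for this group
weights = decode_group(codes, codebooks)

print(len(weights))               # 8
print(bits_per_weight(1, 16, 8))  # 2.0
print(bits_per_weight(2, 8, 8))   # 2.0
```

Under the assumption of a group size of 8, both a single 16-bit codebook and two 8-bit codebooks cost 2 bits of index storage per weight, which is why only the codes, not the full-precision weights, need to be stored per group.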

Inference support for AQLM is provided by the aqlm library. Install it before running the models (note that aqlm requires Python >= 3.10):

pip install aqlm[gpu,cpu]

The library provides efficient kernels for both GPU and CPU inference and training.
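The two prerequisites above (Python >= 3.10 and an installed aqlm package) can be checked with the standard library before attempting to load a model. This is a minimal sketch; the function name aqlm_environment_report is a hypothetical helper, not part of aqlm or Transformers.

```python
import importlib.util
import sys


def aqlm_environment_report():
    """Report whether the interpreter and the aqlm package meet the requirements."""
    return {
        # aqlm is documented to work only with Python >= 3.10.
        "python_ok": sys.version_info >= (3, 10),
        # find_spec returns None when the top-level package is not installed.
        "aqlm_installed": importlib.util.find_spec("aqlm") is not None,
    }


report = aqlm_environment_report()
print(report)
```

If either value is False, upgrading Python or running the pip command above is the likely fix.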

Instructions for quantizing models yourself, along with all the relevant code, can be found in the corresponding GitHub repository. To run an AQLM model, simply load a checkpoint that has already been quantized with AQLM:

from transformers import AutoTokenizer, AutoModelForCausalLM

quantized_model = AutoModelForCausalLM.from_pretrained(
    "ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf")
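Once loaded, the quantized model is used like any other causal language model. The sketch below wraps the standard tokenizer call, generate, and decode steps in a helper; the name generate_text and the default of 32 new tokens are illustrative choices, not something prescribed by AQLM.

```python
def generate_text(model, tokenizer, prompt, max_new_tokens=32):
    """Generate a continuation of `prompt` with an already loaded model."""
    # Tokenize and move the input tensors to the device that holds the model.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Generate up to `max_new_tokens` tokens after the prompt.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode the first (and only) sequence back into text.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With the objects created above, a call would look like generate_text(quantized_model, tokenizer, "Hello my name is").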

PEFT

Starting with aqlm version 1.0.2, AQLM supports parameter-efficient fine-tuning in the form of LoRA, integrated into the PEFT library.
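A minimal sketch of attaching LoRA adapters to an AQLM model with PEFT might look as follows. The hyperparameters and the target module names (q_proj, k_proj, v_proj, o_proj) are assumptions chosen for illustration and should be adapted to the architecture being tuned; add_lora_adapters is a hypothetical helper, and peft is imported lazily so the snippet can be read or imported without it installed.

```python
# Illustrative LoRA settings; these values are assumptions, not recommendations.
LORA_KWARGS = {
    "r": 8,                  # rank of the low-rank update matrices
    "lora_alpha": 16,        # scaling factor applied to the update
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",
}


def add_lora_adapters(model, lora_kwargs=None):
    """Wrap an already loaded (quantized) model with trainable LoRA adapters."""
    # Imported here so that peft is only required when the helper is called.
    from peft import LoraConfig, get_peft_model

    config = LoraConfig(**(lora_kwargs or LORA_KWARGS))
    # The base weights stay frozen; only the adapter weights are trainable.
    return get_peft_model(model, config)
```

Calling add_lora_adapters on a model loaded as shown earlier returns a PEFT model whose print_trainable_parameters() method reports how many parameters will be updated.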

AQLM configurations

AQLM quantization setups differ mainly in the number of codebooks used and in the codebook size in bits. The most popular setups, along with the inference kernels that support them, are:

Kernel	Number of codebooks	Codebook size, bits	Notation	Accuracy	Speedup	Fast GPU inference	Fast CPU inference
Triton	K	N	KxN	-	Up to ~0.7x	✅	❌
CUDA	1	16	1x16	Best	Up to ~1.3x	✅	❌
CUDA	2	8	2x8	OK	Up to ~3.0x	✅	❌
Numba	K	8	Kx8	Good	Up to ~4.0x	❌	✅
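The table can be read as a simple lookup from a configuration and target device to a kernel. The sketch below encodes only what the table states; it is not the dispatch logic of the aqlm library, and the names notation and pick_kernel are hypothetical.

```python
def notation(num_codebooks, codebook_bits):
    """Format a setup in the KxN notation used in the table."""
    return f"{num_codebooks}x{codebook_bits}"


def pick_kernel(num_codebooks, codebook_bits, device):
    """Return the fastest kernel the table lists for this setup, or None."""
    if device == "gpu":
        # Dedicated CUDA kernels exist for the 1x16 and 2x8 setups.
        if (num_codebooks, codebook_bits) in {(1, 16), (2, 8)}:
            return "CUDA"
        # The Triton kernel covers arbitrary KxN setups on GPU.
        return "Triton"
    if device == "cpu":
        # The Numba kernel covers Kx8 setups; nothing else is listed for CPU.
        return "Numba" if codebook_bits == 8 else None
    raise ValueError(f"unknown device: {device!r}")


print(notation(1, 16), pick_kernel(1, 16, "gpu"))  # 1x16 CUDA
print(notation(4, 8), pick_kernel(4, 8, "cpu"))    # 4x8 Numba
print(notation(3, 12), pick_kernel(3, 12, "gpu"))  # 3x12 Triton
```

A None result for a CPU setup means the table lists no fast CPU kernel for that configuration.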
