Optimum-quanto
Try optimum-quanto + transformers with this notebook!
The 🤗 optimum-quanto library is a versatile PyTorch quantization toolkit. The quantization method it uses is linear quantization. Quanto provides several unique features such as:
- weights quantization (float8, int8, int4, int2)
- activation quantization (float8, int8)
- modality agnostic (e.g. CV, LLM)
- device agnostic (e.g. CUDA, XPU, MPS, CPU)
- compatibility with torch.compile
- easy to add a custom kernel for a specific device
- supports quantization aware training
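These weight and activation targets map to dtype objects exposed by optimum-quanto itself (qint2, qint4, qint8, qfloat8). Once the libraries below are installed, a minimal weight-only sketch of the standalone optimum-quanto API (quantize/freeze) looks like this; the model name is just an example:

from transformers import AutoModelForCausalLM
from optimum.quanto import quantize, freeze, qint4

# Example model; any model containing torch.nn.Linear layers works
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# Quantize the weights of all eligible layers to int4
# (qint2, qint8 and qfloat8 can be passed the same way)
quantize(model, weights=qint4)

# Replace the original float weights with their quantized counterparts
freeze(model)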
Before you begin, make sure the following libraries are installed:
pip install optimum-quanto accelerate transformers
Now you can quantize a model by passing a QuantoConfig object to the from_pretrained() method. This works for any model in any modality, as long as it contains torch.nn.Linear layers.
The integration with transformers only supports weight quantization. For more complex use cases such as activation quantization, calibration, and quantization aware training, you should use the optimum-quanto library instead.
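For instance, activation quantization with a short calibration pass can be done with the standalone optimum-quanto API; a minimal sketch, where the calibration texts are only placeholders:

from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.quanto import quantize, freeze, Calibration, qint8

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Quantize both weights and activations to int8
quantize(model, weights=qint8, activations=qint8)

# Record activation scales by running a few representative samples under Calibration
with Calibration():
    for text in ["Hello, my dog is cute", "The capital of France is"]:
        model(**tokenizer(text, return_tensors="pt"))

# Replace the original float weights with their quantized counterparts
freeze(model)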
from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig
model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
quantization_config = QuantoConfig(weights="int8")
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0", quantization_config=quantization_config)
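The quantized model can then be used like any other transformers model, for example for a quick generation check (the prompt is arbitrary):

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt").to(quantized_model.device)
outputs = quantized_model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))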
Note that serialization is not yet supported with transformers, but it is coming soon! If you want to save the model, you can use the optimum-quanto library instead.
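If you quantize with the standalone optimum-quanto API (as in the sketches above), its README describes saving the model as a state dict together with a quantization map; a hedged sketch of that approach:

import json
from safetensors.torch import save_file
from optimum.quanto import quantization_map

# Save the quantized weights and the per-module quantization map needed to reload them
# Note: models with tied/shared weights may need safetensors.torch.save_model instead
save_file(model.state_dict(), "model.safetensors")
with open("quantization_map.json", "w") as f:
    json.dump(quantization_map(model), f)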
The optimum-quanto library uses a linear quantization algorithm. Even though this is a basic quantization technique, it gives very good results! Have a look at the following benchmark (llama-2-7b on the perplexity metric). You can find more benchmarks here.
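For intuition, per-tensor symmetric linear quantization simply rescales the float range onto an integer grid and rounds. The following is a generic illustration in plain PyTorch, not quanto's actual implementation (which also supports per-axis scales and other dtypes):

import torch

def linear_quantize_int8(w: torch.Tensor):
    # Map the float range [-max|w|, max|w|] onto the int8 range [-127, 127]
    scale = w.abs().max() / 127
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def linear_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover an approximation of the original float weights
    return q.to(torch.float32) * scale

w = torch.randn(16, 16)
q, scale = linear_quantize_int8(w)
print((w - linear_dequantize(q, scale)).abs().max())  # rounding error is at most scale / 2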
The library is versatile enough to be compatible with most PTQ optimization algorithms. The plan is to integrate the most popular ones (AWQ, SmoothQuant) as seamlessly as possible in the future.