BitNet
BitNet replaces the traditional Linear layers in Multi-Head Attention and Feed-Forward Networks with specialized BitLinear layers that use ternary precision (or binary, in the older version). The BitLinear layers introduced here quantize the weights to ternary precision (values -1, 0, and 1) and quantize the activations to 8-bit precision.
During training, we start by quantizing the weights to ternary values using symmetric per-tensor quantization. First, we compute the average of the absolute values of the weight matrix and use this as the scale. We then divide the weights by the scale, round the values, clamp them between -1 and 1, and finally rescale them so that training continues in full precision.
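The steps above can be sketched as follows. This is a minimal NumPy illustration of absmean ternary quantization, not the library's actual implementation; the function name and the `eps` guard are mine:

```python
import numpy as np

def quantize_weights_ternary(W, eps=1e-5):
    # per-tensor scale: average of the absolute values of the weight matrix
    scale = np.abs(W).mean() + eps
    # divide by the scale, round, and clamp to the ternary set {-1, 0, 1}
    W_q = np.clip(np.round(W / scale), -1, 1)
    # rescale so the rest of the forward pass continues in full precision
    return W_q * scale, W_q, scale
```

In actual quantization-aware training this would be wrapped with a straight-through estimator so gradients flow to the full-precision weights.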
Activations are then quantized to a specified bit-width (e.g., 8-bit) using absmax quantization (symmetric per-channel quantization). This involves scaling the activations into the range [-128, 127]. The quantization formula is:

x_q = clamp(round(x × 127 / max(|x|)), -128, 127)
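As a concrete sketch of this absmax step (again in NumPy for illustration; the function name is mine and the real code operates on PyTorch tensors):

```python
import numpy as np

def quantize_activations_absmax(x, bits=8, eps=1e-5):
    qmax = 2 ** (bits - 1) - 1   # 127 for 8-bit
    qmin = -(2 ** (bits - 1))    # -128 for 8-bit
    # per-channel scale from the maximum absolute value along the last axis
    scale = np.abs(x).max(axis=-1, keepdims=True) + eps
    # scale into the integer range, round, and clamp
    x_q = np.clip(np.round(x * qmax / scale), qmin, qmax)
    return x_q, scale
```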
To learn more about how we trained and fine-tuned BitNet models, check out the blog post here
Load a BitNet Model from the Hub
BitNet models can’t be quantized on the fly; they need to be pre-trained or fine-tuned with the quantization applied (it’s a quantization-aware training technique). Once trained, these models are already quantized and available as packed versions on the Hub.
A quantized model can be loaded as follows:
from transformers import AutoModelForCausalLM
path = "/path/to/model"
model = AutoModelForCausalLM.from_pretrained(path, device_map="auto")
Pre-training / Fine-tuning a BitNet Model
If you’re looking to pre-train or fine-tune your own 1.58-bit model using Nanotron, check out this PR; everything you need to get started is there!
For fine-tuning, you’ll need to convert the model from Hugging Face format to Nanotron format (which has some differences). You can find the conversion steps in this PR.
Kernels
In our initial version, we chose to use @torch.compile to unpack the weights and perform the forward pass. It’s very straightforward to implement and delivers significant speed improvements. We plan to integrate additional optimized kernels in future versions.
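Since each ternary value needs only 2 bits, packed checkpoints can store four weights per byte. Here is a minimal NumPy sketch of one plausible pack/unpack scheme; the actual packed layout used on the Hub may differ, and these helper names are mine:

```python
import numpy as np

def pack_ternary(w_q):
    # map {-1, 0, 1} -> {0, 1, 2} so each value fits in 2 bits
    # (assumes len(w_q) is a multiple of 4)
    codes = (w_q + 1).astype(np.uint8).reshape(-1, 4)
    # pack four 2-bit codes into one byte
    return codes[:, 0] | (codes[:, 1] << 2) | (codes[:, 2] << 4) | (codes[:, 3] << 6)

def unpack_ternary(packed):
    # extract the four 2-bit codes from each byte and map back to {-1, 0, 1}
    codes = np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1)
    return codes.reshape(-1).astype(np.int8) - 1
```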