Quantization Using Brevitas

BrevitasQuantizer

class optimum.amd.BrevitasQuantizer

( model: Module model_name_or_path: str )

Handles the Brevitas quantization process for models shared on huggingface.co/models.

from_pretrained

( model_name_or_path: str subfolder: str = '' revision: Optional = None cache_dir: Optional = None trust_remote_code: bool = False force_download: bool = False local_files_only: bool = False use_auth_token: Union = None device_map: Union = None **model_kwargs )

Parameters

model_name_or_path (Union[str, Path]) — Can be either the model id of a model repo on the Hugging Face Hub, or a path to a local directory containing a model.
subfolder (str, defaults to "") — In case the model files are located inside a subfolder of the model directory / repo on the Hugging Face Hub, you can specify the subfolder name here.
revision (Optional[str], optional, defaults to None) — Revision is the specific model version to use. It can be a branch name, a tag name, or a commit id.
cache_dir (Optional[str], optional) — Path to a directory in which downloaded pretrained model weights are cached, to be used when the standard cache should not be.
trust_remote_code (bool, defaults to False) — Allows the use of custom modeling code hosted in the model repository. Only set this option for repositories you trust and whose code you have read, because it executes arbitrary code from the model repository on your local machine.
force_download (bool, defaults to False) — Whether or not to force the (re-)download of the model weights and configuration files, overriding the cached versions if they exist.
local_files_only (Optional[bool], defaults to False) — Whether or not to only look at local files (i.e., do not try to download the model).
use_auth_token (Optional[str], defaults to None) — The token to use as HTTP bearer authorization for remote files. If True, will use the token generated when running transformers-cli login (stored in ~/.huggingface).

Loads the BrevitasQuantizer and model.

quantize

( quantization_config: BrevitasQuantizationConfig calibration_dataset: Optional = None )

Parameters

quantization_config (BrevitasQuantizationConfig) — Quantization configuration to use to quantize the model.
calibration_dataset (Optional[List[Dict]], defaults to None) — In case the quantization involves a calibration phase, this argument needs to be specified as a list of inputs to the model. Example: calibration_dataset = [{"input_ids": torch.tensor([[1, 2, 3, 4]])}, {"input_ids": torch.tensor([[6, 7, 3, 4]])}] which is a dataset for a model taking input_ids as an argument, and which has two samples.

Quantizes the model using Brevitas according to the quantization_config.
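
A minimal end-to-end sketch of `from_pretrained` followed by `quantize`, assuming the `optimum-amd`, `transformers` and `torch` packages are installed. The model id `facebook/opt-125m`, the calibration sentences and the helper name `quantize_with_brevitas` are placeholders chosen for illustration. Tokenizing raw sentences is only one possible way to build the calibration list, and the sketch has not been checked against a specific library release.

```python
# Illustrative calibration texts (placeholders, not shipped with the library).
CALIBRATION_TEXTS = [
    "Quantization reduces the memory footprint of a model.",
    "Calibration data should resemble the inputs seen at inference time.",
]

MODEL_ID = "facebook/opt-125m"  # placeholder model id, assumed for this sketch


def quantize_with_brevitas(model_id=MODEL_ID, texts=CALIBRATION_TEXTS):
    """Hypothetical helper: load a model, build calibration samples, quantize."""
    # Imported inside the function so the heavy dependencies are only needed
    # when the sketch is actually run.
    from optimum.amd import BrevitasQuantizationConfig, BrevitasQuantizer
    from transformers import AutoTokenizer

    # Static quantization involves a calibration phase. Activation
    # equalization is disabled here to keep the sketch small.
    qconfig = BrevitasQuantizationConfig(
        is_static=True,
        activations_equalization=None,
    )

    quantizer = BrevitasQuantizer.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # One dict of model inputs per sample, in the format described for
    # the `calibration_dataset` parameter.
    calibration_dataset = [
        {"input_ids": tokenizer(text, return_tensors="pt")["input_ids"]}
        for text in texts
    ]
    return quantizer.quantize(qconfig, calibration_dataset)
```

Calling `quantize_with_brevitas()` downloads the model and tokenizer on first use. A real calibration set would normally contain many more samples than the two shown here.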

BrevitasQuantizationConfig

class optimum.amd.BrevitasQuantizationConfig

( weights_bitwidth: int = 8 activations_bitwidth: Optional = 8 weights_only: bool = False weights_param_method: Literal = 'stats' weights_symmetric: bool = True scale_precision: Literal = 'float_scale' weights_quant_granularity: Literal = 'per_tensor' weights_group_size: Optional = None quantize_zero_point: bool = True activations_param_method: Optional = 'stats' is_static: bool = False activations_symmetric: Optional = False activations_quant_granularity: Optional = 'per_tensor' activations_group_size: Optional = None activations_equalization: Optional = 'cross_layer' apply_weight_equalization: bool = False apply_bias_correction: bool = False apply_gptq: bool = False gptq_act_order: Optional = None device: str = 'auto' layers_to_exclude: Optional = None gpu_device_map: Optional = None cpu_device_map: Optional = None )

Parameters

weights_bitwidth (int, defaults to 8) — Bitwidth of the weights quantization. For example, with weights_bitwidth=8, each weight value is quantized on 8 bits.
activations_bitwidth (Optional[int], defaults to 8) — Bitwidth of the activations quantization.
weights_only (bool, defaults to False) — If set to True, only weights are to be quantized, otherwise activations are quantized as well.
weights_param_method (str, defaults to "stats") — Strategy to use to estimate the quantization parameters (scale, zero-point) for the weights. Two strategies are available:
- "stats": Use min-max to estimate the range to quantize on.
- "mse": Use mean-square error between the unquantized weights and quantized weights to estimate the range to quantize on.
weights_symmetric (bool, defaults to True) — Whether to use symmetric quantization on the weights.
scale_precision (str, defaults to "float_scale") — Specifies the constraints on the scale. Can be either "float_scale" (arbitrary scales) or "power_of_two_scale" (scales constrained to be a power of 2).
weights_quant_granularity (str, defaults to "per_tensor") — The granularity of the quantization of the weights. This parameter can either be:
- "per_tensor": A single scale (and possibly zero-point) is used for one weight matrix.
- "per_channel": Each column (outer dimension) of the weight matrix has its own scale (and possibly zero-point).
- "per_group": Each column of the weight matrix may have several scales, grouped by weights_group_size.
weights_group_size (Optional[int], defaults to None) — Group size to use for the weights in case weights_quant_granularity="per_group". Defaults to 128 in this case, to None otherwise.
quantize_zero_point (bool, defaults to True) — When set to True, the unquantized value 0.0 is exactly representable as a quantized value: the zero point. When set to False, a quantization range [a, b] is exactly representable (no rounding on a and b), but the unquantized value zero is not exactly representable.
activations_param_method (Optional[str], defaults to "stats") — Strategy to use to estimate the quantization parameters (scale, zero-point) for the activations. Two strategies are available:
- "stats": Use min-max to estimate the range to quantize on.
- "mse": Use mean-square error between the unquantized activations and quantized activations to estimate the range to quantize on.
is_static (bool, defaults to False) — Whether to apply static quantization or dynamic quantization.
activations_symmetric (bool, defaults to False) — Whether to use symmetric quantization on the activations.
activations_quant_granularity (str, defaults to "per_tensor") — The granularity of the quantization of the activations. This parameter can either be "per_tensor", "per_row" or "per_group". In case static quantization is used (is_static=True), only "per_tensor" may be used.
activations_group_size (int, defaults to None) — Group size to use for the activations in case activations_quant_granularity="per_group". Defaults to 64 in this case, to None otherwise.
activations_equalization (Optional[str], defaults to "cross_layer") — Whether to apply activation equalization (SmoothQuant). Possible options are:
- None: No activation equalization.
- "layerwise": Apply SmoothQuant as described in https://arxiv.org/abs/2211.10438. The activation rescaling will be added as multiplication node, that is not fused within a preceding layer.
- "cross_layer": Apply SmoothQuant, and fuse the activation rescaling within a preceding layer when possible (example: nn.LayerNorm followed by nn.Linear). This is achieved through a graph capture of the model using torch.fx.
apply_weight_equalization (bool, defaults to False) — Applies weight equalization across layers, following https://arxiv.org/abs/1906.04721. This parameter is useful for models whose activation function is linear or piecewise-linear (like ReLU, used in the OPT model), and can reduce the quantization error of the weights by balancing scales across layers.
apply_bias_correction (bool, defaults to False) — Applies bias correction to compensate for changes in activation bias caused by quantization.
apply_gptq (bool, defaults to False) — Whether to apply GPTQ algorithm for quantizing the weights.
gptq_act_order (Optional[bool], defaults to None) — Whether to use activations reordering (act-order, also known as desc-act) when apply_gptq=True. If apply_gptq=True, defaults to False.
layers_to_exclude (Optional[List], defaults to None) — Specify the names of the layers that should not be quantized. This should only be the last part of the layer name. If the same name is repeated across multiple layers, they will all be excluded. If left to None, the last linear layer is automatically identified and excluded.
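
The difference between the "stats" and "mse" strategies can be illustrated with a small, self-contained sketch. It is not Brevitas code: the function names are hypothetical, and the grid search below is a simple stand-in for whatever search procedure the library actually uses. It assumes symmetric quantization with a single floating-point scale.

```python
def quantize_dequantize(values, scale, bitwidth=8):
    """Symmetric round-to-nearest quantization followed by dequantization."""
    qmax = 2 ** (bitwidth - 1) - 1   # e.g. 127 for 8 bits
    qmin = -qmax - 1                 # e.g. -128 for 8 bits
    out = []
    for v in values:
        q = max(qmin, min(qmax, round(v / scale)))  # round, then clip
        out.append(q * scale)
    return out


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def stats_scale(values, bitwidth=8):
    """Min-max style estimate: the largest magnitude maps to the largest integer."""
    qmax = 2 ** (bitwidth - 1) - 1
    return max(abs(v) for v in values) / qmax


def mse_scale(values, bitwidth=8, steps=100):
    """Grid search over shrunken ranges, keeping the scale with the lowest error."""
    base = stats_scale(values, bitwidth)
    candidates = [base * (i / steps) for i in range(1, steps + 1)]
    return min(
        candidates,
        key=lambda s: mse(values, quantize_dequantize(values, s, bitwidth)),
    )


# Toy weights: many small values plus a single outlier, quantized on 4 bits.
weights = [0.02 * i for i in range(-20, 21)] + [3.0]
s_stats = stats_scale(weights, bitwidth=4)
s_mse = mse_scale(weights, bitwidth=4)
err_stats = mse(weights, quantize_dequantize(weights, s_stats, 4))
err_mse = mse(weights, quantize_dequantize(weights, s_mse, 4))
print(f"stats: scale={s_stats:.4f} error={err_stats:.6f}")
print(f"mse:   scale={s_mse:.4f} error={err_mse:.6f}")
```

Because the min-max scale is one of the candidates, the error found by the search is never larger than the min-max error. The search can trade some clipping of outliers for finer resolution on the remaining values.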

BrevitasQuantizationConfig is the configuration class handling all the Brevitas quantization parameters.
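
The three `weights_quant_granularity` options described above differ in how many scales a weight matrix receives. The sketch below is plain Python rather than Brevitas code. It assumes a weight matrix stored as a list of rows with one row per output channel (the layout of `torch.nn.Linear.weight`), symmetric quantization, and a hypothetical helper named `weight_scales`.

```python
def absmax_scale(values, bitwidth=8):
    """Symmetric scale mapping the largest magnitude to the largest integer."""
    return max(abs(v) for v in values) / (2 ** (bitwidth - 1) - 1)


def weight_scales(weight, granularity, group_size=None, bitwidth=8):
    """Return the scales for `weight`, given as one row per output channel."""
    if granularity == "per_tensor":
        # A single scale shared by every value in the matrix.
        flat = [v for row in weight for v in row]
        return [absmax_scale(flat, bitwidth)]
    if granularity == "per_channel":
        # One scale per output channel.
        return [absmax_scale(row, bitwidth) for row in weight]
    if granularity == "per_group":
        # Several scales per output channel, one per group of `group_size` values.
        return [
            [
                absmax_scale(row[i:i + group_size], bitwidth)
                for i in range(0, len(row), group_size)
            ]
            for row in weight
        ]
    raise ValueError(f"unknown granularity: {granularity!r}")


# Toy 2x4 weight matrix: the first channel contains an outlier.
weight = [
    [0.10, -0.20, 4.00, 0.30],
    [0.05, 0.02, -0.01, 0.04],
]
per_tensor = weight_scales(weight, "per_tensor")
per_channel = weight_scales(weight, "per_channel")
per_group = weight_scales(weight, "per_group", group_size=2)

print(len(per_tensor), "scale for the whole tensor")
print(len(per_channel), "scales, one per channel")
print(sum(len(groups) for groups in per_group), "scales, one per group")
```

With "per_tensor", the outlier in the first channel also sets the scale used for the second channel. Finer granularities confine its effect to one channel or one group, at the cost of storing more scales.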
