metadata

license: other
license_name: mrl
license_link: https://mistral.ai/licenses/MRL-0.1.md
base_model: mistralai/Mistral-Large-Instruct-2407
language:
  - en
  - fr
  - de
  - es
  - it
  - pt
  - ru
  - zh
  - ja
pipeline_tag: text-generation
tags:
  - chat

Mistral-Large-Instruct-2407 FP8

This repository contains the quantized weights for Mistral-Large-Instruct-2407.

The weights have been converted to FP8 format, with FP8 weights, FP8 activations, and FP8 KV cache. You can use either vLLM or Aphrodite Engine to load this model.

Quantization Method

The library used is llm-compressor.

pip install llmcompressor

Then run this script:

from datasets import load_dataset
from transformers import AutoTokenizer
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot

MODEL_ID = "mistralai/Mistral-Large-Instruct-2407"
model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Select calibration dataset.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"  # Or use your own dataset
DATASET_SPLIT = "train_sft"

# You can increase the the number of samples to increase accuracy
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))


def process_and_tokenize(example):
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(
        text,
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(process_and_tokenize, remove_columns=ds.column_names)

# Configure the quantization algorithm and scheme.
# In this case, we:
#   * quantize the weights to fp8 with per-tensor scales
#   * quantize the activations to fp8 with per-tensor scales
#   * quantize the kv cache to fp8 with per-tensor scales
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    targets: ["Linear"]
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""

# Apply algorithms.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save to disk compressed.
SAVE_DIR = "./Mistral-Large-Instruct-2407-FP8"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)