---
license: other
license_name: mrl
license_link: https://mistral.ai/licenses/MRL-0.1.md
base_model: mistralai/Mistral-Large-Instruct-2407
language:
- en
- fr
- de
- es
- it
- pt
- ru
- zh
- ja
pipeline_tag: text-generation
tags:
- chat
---

# Mistral-Large-Instruct-2407 FP8

This repository contains the quantized weights for [Mistral-Large-Instruct-2407](https://huggingface.co/mistralai/Mistral-Large-Instruct-2407). The model has been converted to FP8 format, with FP8 weights, FP8 activations, and an FP8 KV cache.

You can use either [vLLM](https://github.com/vllm-project/vllm) or [Aphrodite Engine](https://github.com/PygmalionAI/aphrodite-engine) to load this model.

## Quantization Method

The library used is [llm-compressor](https://github.com/vllm-project/llm-compressor):

```console
pip install llmcompressor
```

Then run this script:

```py
from datasets import load_dataset
from transformers import AutoTokenizer

from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot

MODEL_ID = "mistralai/Mistral-Large-Instruct-2407"

model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Select calibration dataset.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"  # Or use your own dataset
DATASET_SPLIT = "train_sft"

# You can increase the number of samples to improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Load, shuffle, and tokenize the calibration data with the model's chat template.
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def process_and_tokenize(example):
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(
        text,
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(process_and_tokenize, remove_columns=ds.column_names)

# Configure the quantization algorithm and scheme.
# In this case, we:
#   * quantize the weights to fp8 with per-tensor scales
#   * quantize the activations to fp8 with per-tensor scales
#   * quantize the kv cache to fp8 with per-tensor scales
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    targets: ["Linear"]
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""

# Apply algorithms.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save to disk compressed.
SAVE_DIR = "./Mistral-Large-Instruct-2407-FP8"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```
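
## Loading with vLLM (sketch)

The snippet below is a minimal sketch of loading the FP8 checkpoint with vLLM. It assumes the local output directory produced by the script above; the `tensor_parallel_size`, `max_model_len`, sampling settings, and prompt are illustrative assumptions, so adjust them to your hardware and use case.

```py
from vllm import LLM, SamplingParams

# Assumed path: the directory written by the quantization script above
# (a Hugging Face repo ID works as well).
MODEL_PATH = "./Mistral-Large-Instruct-2407-FP8"

llm = LLM(
    model=MODEL_PATH,
    kv_cache_dtype="fp8",    # store the KV cache in FP8
    tensor_parallel_size=4,  # assumption: adjust to your GPU count
    max_model_len=8192,      # assumption: reduce if KV-cache memory is tight
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain FP8 quantization in one paragraph."], sampling_params)
print(outputs[0].outputs[0].text)
```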