Commit 49f5d7a by alpindale (parent: c5b4ad4)

Create README.md

Files changed (1): README.md (+119 lines, new file)
---
license: other
license_name: mrl
license_link: https://mistral.ai/licenses/MRL-0.1.md
base_model: mistralai/Mistral-Large-Instruct-2407
language:
- en
- fr
- de
- es
- it
- pt
- ru
- zh
- ja
pipeline_tag: text-generation
tags:
- chat
---

# Mistral-Large-Instruct-2407 FP8

This repository contains the quantized weights for [Mistral-Large-Instruct-2407](https://huggingface.co/mistralai/Mistral-Large-Instruct-2407).

The weights have been converted to FP8 format, with FP8 weights, FP8 activations, and an FP8 KV cache. You can use either [vLLM](https://github.com/vllm-project/vllm) or [Aphrodite Engine](https://github.com/PygmalionAI/aphrodite-engine) to load this model.
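
As a minimal sketch (not part of the original model card), loading this checkpoint offline with vLLM could look like the snippet below. The model path, `tensor_parallel_size`, and sampling settings are placeholder assumptions to adapt to your hardware; Aphrodite Engine exposes a very similar interface.

```py
from vllm import LLM, SamplingParams

# Placeholder path (or Hub id) for this FP8 checkpoint and an assumed multi-GPU node.
llm = LLM(
    model="./Mistral-Large-Instruct-2407-FP8",
    tensor_parallel_size=4,
    kv_cache_dtype="fp8",  # use the FP8 KV cache
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain FP8 quantization in one short paragraph."], params)
print(outputs[0].outputs[0].text)
```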

## Quantization Method

The library used is [llm-compressor](https://github.com/vllm-project/llm-compressor).

```console
pip install llmcompressor
```

Then run this script:

```py
from datasets import load_dataset
from transformers import AutoTokenizer

from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot

MODEL_ID = "mistralai/Mistral-Large-Instruct-2407"
model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Select the calibration dataset.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"  # Or use your own dataset
DATASET_SPLIT = "train_sft"

# You can increase the number of samples to improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))


def process_and_tokenize(example):
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(
        text,
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


ds = ds.map(process_and_tokenize, remove_columns=ds.column_names)

# Configure the quantization algorithm and scheme.
# In this case, we:
#   * quantize the weights to fp8 with per-tensor scales
#   * quantize the activations to fp8 with per-tensor scales
#   * quantize the kv cache to fp8 with per-tensor scales
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    targets: ["Linear"]
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""

# Apply the quantization algorithm.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save the compressed model to disk.
SAVE_DIR = "./Mistral-Large-Instruct-2407-FP8"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```
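
Optionally, you can spot-check that the quantized model still produces coherent text before relying on the saved checkpoint. The lines below are a hedged sketch that is not part of the original script; the prompt and `max_new_tokens` value are arbitrary.

```py
# Optional sanity check (not in the original recipe): run a short generation
# with the freshly quantized model. Prompt and length are placeholders.
input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids.to(model.device)
output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0]))
```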