Lin-K76 committed on
Commit
8e23885
1 Parent(s): ff4cae8

Create README.md

Files changed (1)
  1. README.md +319 -0
README.md ADDED
@@ -0,0 +1,319 @@
1
+ ---
2
+ tags:
3
+ - fp8
4
+ - vllm
5
+ license: llama3.1
6
+ license_link: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE
7
+ language:
8
+ - en
9
+ ---
10
+
11
+ # Meta-Llama-3.1-405B-FP8
12
+
13
+ ## Model Overview
14
+ - **Model Architecture:** Meta-Llama-3.1
15
+ - **Input:** Text
16
+ - **Output:** Text
17
+ - **Model Optimizations:**
18
+ - **Weight quantization:** FP8
19
+ - **Activation quantization:** FP8
20
+ - **Intended Use Cases:** Intended for commercial and research use in multiple languages. Like [Meta-Llama-3.1-8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B), this is a base model.
21
+ - **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
22
+ - **Release Date:** 7/23/2024
23
+ - **Version:** 1.0
24
+ - **License(s):** [llama3.1](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
25
+ - **Model Developers:** Neural Magic
26
+
27
+ Quantized version of [Meta-Llama-3.1-405B](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B).
28
+ It achieves an average score of 82.00 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), recovering 98.7% of the accuracy of the unquantized (dense) model.
29
+ <!-- whereas the unquantized model achieves 79.84. -->
30
+
31
+ ### Model Optimizations
32
+
33
+ This model was obtained by quantizing the weights and activations of [Meta-Llama-3.1-405B](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B) to FP8 data type, ready for inference with vLLM built from source.
34
+ This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
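As a rough check of the 50% figure, the sketch below estimates weight storage at 16 and 8 bits per parameter. It assumes a nominal 405e9 parameters and ignores the unquantized `lm_head`, the quantization scales, and runtime buffers, so the result is illustrative rather than a measured footprint. The helper name is hypothetical.

```python
def weight_storage_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 405e9  # nominal parameter count (assumption, not an exact figure)

fp16_gb = weight_storage_gb(NUM_PARAMS, 16)  # original 16-bit weights
fp8_gb = weight_storage_gb(NUM_PARAMS, 8)    # FP8-quantized weights
reduction = 1 - fp8_gb / fp16_gb

print(f"{fp16_gb:.0f} GB -> {fp8_gb:.0f} GB ({reduction:.0%} smaller)")
```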
35
+
36
+ Only the weights and activations of the linear operators within transformer blocks are quantized. Symmetric per-tensor quantization is applied, in which a single linear scale per tensor maps between the FP8 representation and the original values of the weights and activations.
37
+ [LLM Compressor](https://github.com/vllm-project/llm-compressor) is used for quantization with 512 sequences of UltraChat.
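To make the per-tensor scheme concrete, here is a minimal pure-Python simulation of symmetric per-tensor quantization. It is a sketch under stated assumptions, not the LLM Compressor implementation: it assumes the FP8 E4M3 format (3 mantissa bits, largest finite value 448), derives the scale from the tensor's absolute maximum, and uses hypothetical helper names.

```python
import math

E4M3_MAX = 448.0    # largest finite E4M3 value (assumed FP8 format)
E4M3_MIN_EXP = -6   # smallest normal exponent in E4M3
MANTISSA_BITS = 3


def round_to_e4m3(v: float) -> float:
    """Round a float to the nearest value on a simulated E4M3 grid."""
    if v == 0.0:
        return 0.0
    v = max(-E4M3_MAX, min(E4M3_MAX, v))      # saturate to the finite range
    _, e = math.frexp(v)                      # |v| = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, E4M3_MIN_EXP)            # floor(log2|v|), clamped for subnormals
    quantum = 2.0 ** (exp - MANTISSA_BITS)    # spacing of representable values
    return round(v / quantum) * quantum


def quantize_per_tensor(values):
    """Symmetric per-tensor quantization: one scale shared by all values."""
    amax = max(abs(v) for v in values)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Map simulated FP8 values back to the original range."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 2.0, 0.1]
q, scale = quantize_per_tensor(weights)
restored = dequantize(q, scale)
print(q, scale, restored)
```

Because the scale is shared by the whole tensor, small values such as `0.1` land on a coarser part of the grid than the largest value and pick up a small rounding error, while values that happen to fall on the grid are restored exactly.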
38
+
39
+ <!-- ## Deployment
40
+
41
+ ### Use with vLLM
42
+
43
+ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
44
+
45
+ ```python
+ from vllm import LLM, SamplingParams
+ from transformers import AutoTokenizer
+
+ model_id = "neuralmagic/Meta-Llama-3.1-405B-FP8"
+ number_gpus = 2
+
+ sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ messages = [
+     {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
+     {"role": "user", "content": "Who are you?"},
+ ]
+
+ prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
+
+ llm = LLM(model=model_id, tensor_parallel_size=number_gpus)
+
+ outputs = llm.generate(prompts, sampling_params)
+
+ generated_text = outputs[0].outputs[0].text
+ print(generated_text)
+ ```
70
+
71
+ vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details. -->
72
+
73
+ ## Creation
74
+
75
+ This model was created by applying [LLM Compressor with calibration samples from UltraChat](https://github.com/vllm-project/llm-compressor/blob/sa/big_model_support/examples/big_model_offloading/big_model_w8a8_calibrate.py), as presented in the code snippet below.
76
+
77
+ ```python
+ import torch
+ from datasets import load_dataset
+ from transformers import AutoTokenizer
+
+ from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
+ from llmcompressor.transformers.compression.helpers import (
+     calculate_offload_device_map,
+     custom_offload_device_map,
+ )
+
+ # Static, symmetric, per-tensor FP8 quantization of weights and input
+ # activations for every Linear layer except lm_head.
+ recipe = """
+ quant_stage:
+     quant_modifiers:
+         QuantizationModifier:
+             ignore: ["lm_head"]
+             config_groups:
+                 group_0:
+                     weights:
+                         num_bits: 8
+                         type: float
+                         strategy: tensor
+                         dynamic: false
+                         symmetric: true
+                     input_activations:
+                         num_bits: 8
+                         type: float
+                         strategy: tensor
+                         dynamic: false
+                         symmetric: true
+                     targets: ["Linear"]
+ """
+
+ model_stub = "meta-llama/Meta-Llama-3.1-405B"
+ model_name = model_stub.split("/")[-1]
+
+ # Spread the model across 8 GPUs, offloading whatever does not fit.
+ device_map = calculate_offload_device_map(
+     model_stub, reserve_for_hessians=False, num_gpus=8, torch_dtype=torch.float16
+ )
+
+ model = SparseAutoModelForCausalLM.from_pretrained(
+     model_stub, torch_dtype=torch.float16, device_map=device_map
+ )
+ tokenizer = AutoTokenizer.from_pretrained(model_stub)
+
+ output_dir = f"./{model_name}-FP8"
+
+ DATASET_ID = "HuggingFaceH4/ultrachat_200k"
+ DATASET_SPLIT = "train_sft"
+ NUM_CALIBRATION_SAMPLES = 512
+ MAX_SEQUENCE_LENGTH = 4096
+
+ # Draw a fixed random subset of UltraChat for calibration.
+ ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
+ ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
+
+ def preprocess(example):
+     # Render each conversation to plain text with the chat template.
+     return {
+         "text": tokenizer.apply_chat_template(
+             example["messages"],
+             tokenize=False,
+         )
+     }
+
+ ds = ds.map(preprocess)
+
+ def tokenize(sample):
+     # Tokenize without padding, truncating to the calibration length.
+     return tokenizer(
+         sample["text"],
+         padding=False,
+         max_length=MAX_SEQUENCE_LENGTH,
+         truncation=True,
+         add_special_tokens=False,
+     )
+
+ ds = ds.map(tokenize, remove_columns=ds.column_names)
+
+ # Calibrate and quantize in one shot, then save the compressed checkpoint.
+ oneshot(
+     model=model,
+     output_dir=output_dir,
+     dataset=ds,
+     recipe=recipe,
+     max_seq_length=MAX_SEQUENCE_LENGTH,
+     num_calibration_samples=NUM_CALIBRATION_SAMPLES,
+     save_compressed=True,
+ )
+ ```
163
+
164
+ ## Evaluation
165
+
166
+ The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande and TruthfulQA.
167
+ Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
168
+ This version of the lm-evaluation-harness includes a version of ARC-Challenge that matches the prompting style of [Meta-Llama-3.1-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-evals).
169
+ An asterisk indicates that some evaluations are still being collected.
170
+
171
+ ### Accuracy
172
+
173
+ #### Open LLM Leaderboard evaluation scores
174
+ <table>
175
+ <tr>
176
+ <td><strong>Benchmark</strong>
177
+ </td>
178
+ <td><strong>Meta-Llama-3.1-405B</strong>
179
+ </td>
180
+ <td><strong>Meta-Llama-3.1-405B-FP8 (this model)</strong>
181
+ </td>
182
+ <td><strong>Recovery</strong>
183
+ </td>
184
+ </tr>
185
+ <tr>
186
+ <td>MMLU (5-shot)
187
+ </td>
188
+ <td>*
189
+ </td>
190
+ <td>84.72
191
+ </td>
192
+ <td>*
193
+ </td>
194
+ </tr>
195
+ <tr>
196
+ <td>ARC Challenge (0-shot)
197
+ </td>
198
+ <td>95.99
199
+ </td>
200
+ <td>95.82
201
+ </td>
202
+ <td>99.82%
203
+ </td>
204
+ </tr>
205
+ <tr>
206
+ <td>GSM-8K (5-shot, strict-match)
207
+ </td>
208
+ <td>88.10
209
+ </td>
210
+ <td>87.94
211
+ </td>
212
+ <td>99.82%
213
+ </td>
214
+ </tr>
215
+ <tr>
216
+ <td>Hellaswag (10-shot)
217
+ </td>
218
+ <td>90.02
219
+ </td>
220
+ <td>89.14
221
+ </td>
222
+ <td>99.02%
223
+ </td>
224
+ </tr>
225
+ <tr>
226
+ <td>Winogrande (5-shot)
227
+ </td>
228
+ <td>87.61
229
+ </td>
230
+ <td>86.42
231
+ </td>
232
+ <td>98.64%
233
+ </td>
234
+ </tr>
235
+ <tr>
236
+ <td>TruthfulQA (0-shot)
237
+ </td>
238
+ <td>49.83
239
+ </td>
240
+ <td>47.93
241
+ </td>
242
+ <td>96.19%
243
+ </td>
244
+ </tr>
245
+ <tr>
246
+ <td><strong>Average</strong>
247
+ </td>
248
+ <td><strong>*</strong>
249
+ </td>
250
+ <td><strong>82.00</strong>
251
+ </td>
252
+ <td><strong>98.70%</strong>
253
+ </td>
254
+ </tr>
255
+ </table>
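The Recovery column can be reproduced from the scores in the table. The sketch below assumes recovery is the quantized score divided by the baseline score, and that the reported average recovery is the mean over the five benchmarks that already have a baseline. Both are inferred from the numbers rather than stated by the authors.

```python
# (baseline, quantized) scores copied from the table above.
# MMLU is left out of the recovery calculation: its baseline is still pending.
scores = {
    "ARC Challenge": (95.99, 95.82),
    "GSM-8K": (88.10, 87.94),
    "Hellaswag": (90.02, 89.14),
    "Winogrande": (87.61, 86.42),
    "TruthfulQA": (49.83, 47.93),
}
mmlu_quantized = 84.72


def recovery_pct(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    return 100.0 * quantized / baseline


recoveries = {name: recovery_pct(b, q) for name, (b, q) in scores.items()}
mean_recovery = sum(recoveries.values()) / len(recoveries)

# The average score of the quantized model covers all six benchmarks.
quantized_scores = [q for _, q in scores.values()] + [mmlu_quantized]
mean_quantized = sum(quantized_scores) / len(quantized_scores)

for name, value in recoveries.items():
    print(f"{name}: {value:.2f}%")
print(f"Mean recovery: {mean_recovery:.2f}%")
print(f"Mean quantized score: {mean_quantized:.1f}")
```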
256
+
257
+ ### Reproduction
258
+
259
+ The results were obtained using the following commands:
260
+
261
+ #### MMLU
262
+ ```
+ lm_eval \
+   --model vllm \
+   --model_args pretrained="neuralmagic/Meta-Llama-3.1-405B-FP8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=8 \
+   --tasks mmlu \
+   --num_fewshot 5 \
+   --batch_size auto
+ ```
270
+
271
+ #### ARC-Challenge
272
+ ```
+ lm_eval \
+   --model vllm \
+   --model_args pretrained="neuralmagic/Meta-Llama-3.1-405B-FP8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=8 \
+   --tasks arc_challenge_llama_3.1_instruct \
+   --num_fewshot 25 \
+   --batch_size auto
+ ```
280
+
281
+ #### GSM-8K
282
+ ```
+ lm_eval \
+   --model vllm \
+   --model_args pretrained="neuralmagic/Meta-Llama-3.1-405B-FP8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=8 \
+   --tasks gsm8k \
+   --num_fewshot 5 \
+   --batch_size auto
+ ```
290
+
291
+ #### Hellaswag
292
+ ```
+ lm_eval \
+   --model vllm \
+   --model_args pretrained="neuralmagic/Meta-Llama-3.1-405B-FP8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=8 \
+   --tasks hellaswag \
+   --num_fewshot 10 \
+   --batch_size auto
+ ```
300
+
301
+ #### Winogrande
302
+ ```
+ lm_eval \
+   --model vllm \
+   --model_args pretrained="neuralmagic/Meta-Llama-3.1-405B-FP8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=8 \
+   --tasks winogrande \
+   --num_fewshot 5 \
+   --batch_size auto
+ ```
310
+
311
+ #### TruthfulQA
312
+ ```
+ lm_eval \
+   --model vllm \
+   --model_args pretrained="neuralmagic/Meta-Llama-3.1-405B-FP8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=8 \
+   --tasks truthfulqa_mc \
+   --num_fewshot 0 \
+   --batch_size auto
+ ```