Lin-K76 committed
Commit 02ba52a
Parent: 4b76cb0

Update README.md

Files changed (1)
  1. README.md +47 -26
README.md CHANGED
@@ -33,7 +33,7 @@ base_model: meta-llama/Meta-Llama-3.1-70B-Instruct
 - **Model Developers:** Neural Magic
 
 Quantized version of [Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct).
-It achieves an average score of 83.14 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 83.61.
+It achieves an average score of 84.29 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 84.40.
 
 ### Model Optimizations
 
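For reference, the quantized checkpoint described in this card can be served with [vLLM](https://docs.vllm.ai/en/stable/), the same engine used for the evaluations below. A minimal sketch; the model id and tensor_parallel_size=2 are taken from the lm-eval commands in this card, while the prompt and sampling settings are illustrative assumptions, not values from the card:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Quantized checkpoint referenced by the evaluation commands in this card.
model_id = "neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# tensor_parallel_size=2 mirrors the lm-eval invocations below; adjust to your GPU count.
llm = LLM(model=model_id, tensor_parallel_size=2, max_model_len=4096)

# Build a chat-formatted prompt (the prompt text itself is just an example).
messages = [{"role": "user", "content": "Summarize FP8 quantization in one sentence."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], SamplingParams(temperature=0.6, top_p=0.9, max_tokens=128))
print(outputs[0].outputs[0].text)
```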
 
@@ -118,11 +118,11 @@ model_stub = "meta-llama/Meta-Llama-3.1-70B-Instruct"
 model_name = model_stub.split("/")[-1]
 
 device_map = calculate_offload_device_map(
-    model_stub, reserve_for_hessians=False, num_gpus=2, torch_dtype=torch.float16
+    model_stub, reserve_for_hessians=False, num_gpus=2, torch_dtype="auto"
 )
 
 model = SparseAutoModelForCausalLM.from_pretrained(
-    model_stub, torch_dtype=torch.float16, device_map=device_map
+    model_stub, torch_dtype="auto", device_map=device_map
 )
 tokenizer = AutoTokenizer.from_pretrained(model_stub)
 
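For context on the torch_dtype change in the hunk above: "auto" defers to the dtype recorded in the checkpoint's config (bfloat16 for the Llama 3.1 release) instead of forcing float16. A minimal sketch of checking that value, assuming access to the gated base model on the Hub:

```python
from transformers import AutoConfig

# Read the dtype recorded in the base checkpoint's config.json;
# torch_dtype="auto" resolves to this value when the model is loaded.
config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3.1-70B-Instruct")
print(config.torch_dtype)  # expected: torch.bfloat16
```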
 
@@ -172,7 +172,7 @@ oneshot(
 
 The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande and TruthfulQA.
 Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
-This version of the lm-evaluation-harness includes versions of ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals).
+This version of the lm-evaluation-harness includes versions of ARC-Challenge, GSM-8K, MMLU, and MMLU-cot that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals).
 
 ### Accuracy
 
@@ -191,71 +191,81 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge and
  <tr>
   <td>MMLU (5-shot)
   </td>
-  <td>82.21
+  <td>83.83
   </td>
-  <td>82.24
+  <td>83.73
   </td>
-  <td>100.0%
+  <td>99.88%
+  </td>
+ </tr>
+ <tr>
+  <td>MMLU-cot (0-shot)
+  </td>
+  <td>86.01
+  </td>
+  <td>85.44
+  </td>
+  <td>99.34%
   </td>
  </tr>
  <tr>
   <td>ARC Challenge (0-shot)
   </td>
-  <td>95.05
+  <td>93.26
   </td>
-  <td>94.54
+  <td>92.92
   </td>
-  <td>99.46%
+  <td>99.64%
   </td>
  </tr>
  <tr>
   <td>GSM-8K-cot (8-shot, strict-match)
   </td>
-  <td>93.18
+  <td>94.92
   </td>
-  <td>93.33
+  <td>94.54
   </td>
-  <td>100.1%
+  <td>99.60%
   </td>
  </tr>
  <tr>
   <td>Hellaswag (10-shot)
   </td>
-  <td>86.33
+  <td>86.75
   </td>
-  <td>85.67
+  <td>86.64
   </td>
-  <td>99.24%
+  <td>99.87%
   </td>
  </tr>
  <tr>
   <td>Winogrande (5-shot)
   </td>
-  <td>85.00
+  <td>85.32
   </td>
-  <td>85.79
+  <td>85.95
   </td>
-  <td>100.9%
+  <td>100.7%
   </td>
  </tr>
  <tr>
-  <td>TruthfulQA (0-shot)
+  <td>TruthfulQA (0-shot, mc2)
   </td>
-  <td>59.90
+  <td>60.68
   </td>
-  <td>57.24
+  <td>60.84
   </td>
-  <td>95.56%
+  <td>100.2%
   </td>
  </tr>
  <tr>
   <td><strong>Average</strong>
   </td>
-  <td><strong>83.61</strong>
+  <td><strong>84.40</strong>
   </td>
-  <td><strong>83.14</strong>
+  <td><strong>84.29</strong>
   </td>
-  <td><strong>99.43%</strong>
+  <td><strong>99.88%</strong>
   </td>
  </tr>
 </table>
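The third column of the table above is the recovery rate, which appears to be the quantized score divided by the unquantized score. A small worked example using the updated MMLU (5-shot) row:

```python
# Recovery as quantized / unquantized, using the updated MMLU (5-shot) row.
unquantized = 83.83
quantized = 83.73
recovery = 100 * quantized / unquantized
print(f"{recovery:.2f}%")  # 99.88%
```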
@@ -274,6 +284,17 @@ lm_eval \
   --batch_size auto
 ```
 
+#### MMLU-cot
+```
+lm_eval \
+  --model vllm \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=2 \
+  --tasks mmlu_cot_0shot_llama_3.1_instruct \
+  --apply_chat_template \
+  --num_fewshot 0 \
+  --batch_size auto
+```
+
 #### ARC-Challenge
 ```
 lm_eval \
 