---
language:
- en
pipeline_tag: text-generation
---

# Meta-Llama-3-70B-Instruct-quantized.w8a16

## Model Overview
- **Model Architecture:** Meta-Llama-3
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Quantized:** INT8 weights
- **Release Date:** 7/2/2024
- **Version:** 1.0
- **Model Developers:** Neural Magic

Quantized version of [Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct).
It achieves an average score of 77.90% on the OpenLLM benchmark (version 1), whereas the unquantized model achieves 79.18%.
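
The model can be deployed with [vLLM](https://docs.vllm.ai/en/stable/). The snippet below is a minimal, illustrative sketch; the Hugging Face repo ID and the `tensor_parallel_size=4` setting are assumptions to adjust for your environment.

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Assumed repo ID for this model card; replace it if the model lives elsewhere.
model_id = "neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w8a16"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# A 70B model typically needs several GPUs; tensor_parallel_size=4 is only an example.
llm = LLM(model=model_id, tensor_parallel_size=4)

messages = [{"role": "user", "content": "Explain INT8 weight quantization in one sentence."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], SamplingParams(temperature=0.6, max_tokens=128))
print(outputs[0].outputs[0].text)
```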

## Model Optimizations

This model was obtained by quantizing the weights of [Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) to INT8 data type.
Only the weights of the linear operators within transformer blocks are quantized. Symmetric per-channel quantization is applied: a single linear scale per output dimension maps the INT8 representation of each quantized weight to its floating-point counterpart.
[AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) is used for quantization.
This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
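
As an illustration of this mapping, the sketch below computes one symmetric INT8 scale per output channel of a weight matrix and reconstructs the floating-point weights from it. It is an explanatory example only; the released checkpoint was produced with AutoGPTQ, not with this code.

```python
import torch

def quantize_per_channel_symmetric(weight: torch.Tensor):
    """Symmetric per-channel INT8 quantization of a linear layer's weight.

    `weight` has shape (out_features, in_features); one scale is computed per
    output dimension (row), and dequantization is weight ≈ int8_weight * scale.
    """
    max_abs = weight.abs().amax(dim=1, keepdim=True)   # per-output-channel max
    scale = max_abs.clamp(min=1e-8) / 127.0             # symmetric INT8 range [-127, 127]
    int8_weight = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return int8_weight, scale

# Quantize a random weight matrix and inspect the reconstruction error.
w = torch.randn(8, 16)
w_int8, scale = quantize_per_channel_symmetric(w)
print((w - w_int8.float() * scale).abs().max())
```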

## Evaluation

The model was evaluated with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) using the [vLLM](https://docs.vllm.ai/en/stable/) engine.
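
A hedged sketch of such a run, using the harness's Python API with the vLLM backend, is shown below; the repo ID, GPU parallelism, and single-task selection are assumptions, and each benchmark in the table is run with its own few-shot setting (e.g. 5-shot for MMLU).

```python
import lm_eval

# Illustrative only: the exact engine arguments and batching may differ from the
# configuration used to produce the reported scores.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w8a16,"
        "tensor_parallel_size=4,dtype=auto,max_model_len=4096"
    ),
    tasks=["mmlu"],      # one of the Open LLM Leaderboard tasks reported below
    num_fewshot=5,       # matches the 5-shot MMLU setting in the table
    batch_size="auto",
)
print(results["results"]["mmlu"])
```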

## Accuracy

### Open LLM Leaderboard evaluation scores
Recovery is the ratio of the quantized model's score to that of the unquantized baseline; on average the quantized model recovers 77.90% / 79.18% ≈ 98.38% of the baseline accuracy.
|                      | [Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) | Meta-Llama-3-70B-Instruct-quantized.w8a16<br>(this model) |
| :------------------: | :----------------------: | :------------------------------------------------: |
| arc-c<br>25-shot     | 72.44%                    | 71.59%                                          |
| hellaswag<br>10-shot | 85.54%                    | 85.65%                                              |
| mmlu<br>5-shot       | 80.18%                    | 78.69%                                              |
| truthfulqa<br>0-shot | 62.92%                    | 61.94%                                              |
| winogrande<br>5-shot | 83.19%                    | 83.11%                                              |
| gsm8k<br>5-shot      | 90.83%                    | 86.43%                                              |
| **Average<br>Accuracy**  | **79.18%**                    |              **77.90%**                                     |
| **Recovery**             | **100%**                     |              **98.38%**                                     |