---
language:
- en
pipeline_tag: text-generation
---

# Meta-Llama-3-70B-Instruct-quantized.w8a16

## Model Overview
- **Model Architecture:** Meta-Llama-3
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Quantized:** INT8 weights
- **Release Date:** 7/2/2024
- **Version:** 1.0
- **Model Developers:** Neural Magic

Quantized version of [Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct). It achieves an average score of 77.90% on the OpenLLM benchmark (version 1), whereas the unquantized model achieves 79.18%.
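
The model can be deployed efficiently with [vLLM](https://docs.vllm.ai/en/stable/). Below is a minimal generation sketch, assuming this model is published as `neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w8a16` on the Hugging Face Hub; the example prompt and `tensor_parallel_size=2` are illustrative choices, not fixed requirements:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w8a16"

# Format a chat prompt with the Llama-3 chat template.
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [{"role": "user", "content": "Who are you?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# A 70B model generally spans multiple GPUs; tensor_parallel_size shards it across them.
llm = LLM(model=model_id, tensor_parallel_size=2)
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```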

## Model Optimizations

This model was obtained by quantizing the weights of [Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) to the INT8 data type. Only the weights of the linear operators within transformer blocks are quantized. Symmetric per-channel quantization is applied: a single linear scale per output dimension maps the INT8 and floating-point representations of the quantized weights. [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) is used for quantization. This optimization reduces the number of bits per parameter from 16 to 8, cutting disk size and GPU memory requirements by approximately 50%.
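
A rough sketch of how a quantization like this could be produced with AutoGPTQ is shown below; the calibration text and hyperparameters are illustrative assumptions rather than the exact recipe used for this checkpoint. `bits=8` selects INT8 weights, `group_size=-1` yields one scale per output channel (per-channel quantization), and `sym=True` makes it symmetric:

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

base_model = "meta-llama/Meta-Llama-3-70B-Instruct"

# INT8 weights, per-channel scales (group_size=-1), symmetric quantization.
quantize_config = BaseQuantizeConfig(bits=8, group_size=-1, sym=True, desc_act=False)

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoGPTQForCausalLM.from_pretrained(base_model, quantize_config)

# GPTQ calibrates on sample text; a real run would use a much larger calibration set.
examples = [tokenizer("Large language models generate text one token at a time.")]
model.quantize(examples)
model.save_quantized("Meta-Llama-3-70B-Instruct-quantized.w8a16")
```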

## Evaluation

The model was evaluated with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) using the [vLLM](https://docs.vllm.ai/en/stable/) engine.
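
The scores below can be reproduced through the harness's Python API, as in the sketch that follows (assuming lm-eval v0.4+); the task names, few-shot counts, and vLLM arguments are assumptions matching the table in the Accuracy section, not the exact command used for this card:

```python
import lm_eval

model_id = "neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w8a16"

# (task, num_fewshot) pairs mirroring the table below.
tasks = [
    ("arc_challenge", 25),
    ("hellaswag", 10),
    ("mmlu", 5),
    ("truthfulqa_mc2", 0),
    ("winogrande", 5),
    ("gsm8k", 5),
]

for task, shots in tasks:
    results = lm_eval.simple_evaluate(
        model="vllm",
        model_args=f"pretrained={model_id},tensor_parallel_size=2,dtype=auto",
        tasks=[task],
        num_fewshot=shots,
    )
    print(task, results["results"][task])
```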

## Accuracy

### Open LLM Leaderboard evaluation scores

| Benchmark | [Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) | Meta-Llama-3-70B-Instruct-quantized.w8a16 (this model) |
| :------------------: | :----------------------: | :----------------------------------------------------: |
| arc-c (25-shot) | 72.44% | 71.59% |
| hellaswag (10-shot) | 85.54% | 85.65% |
| mmlu (5-shot) | 80.18% | 78.69% |
| truthfulqa (0-shot) | 62.92% | 61.94% |
| winogrande (5-shot) | 83.19% | 83.11% |
| gsm8k (5-shot) | 90.83% | 86.43% |
| **Average Accuracy** | **79.18%** | **77.90%** |
| **Recovery** | **100%** | **98.38%** |