alexmarques committed on
Commit
50ec7e7
1 Parent(s): 101ef04

Create README.md

Files changed (1)
  1. README.md +45 -0
README.md ADDED
 
---
language:
- en
pipeline_tag: text-generation
---

# Meta-Llama-3-70B-Instruct-quantized.w8a16

## Model Overview
- **Model Architecture:** Meta-Llama-3
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Quantized:** INT8 weights
- **Release Date:** 7/2/2024
- **Version:** 1.0
- **Model Developers:** Neural Magic

Quantized version of [Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct).
It achieves an average score of 77.90% on the OpenLLM benchmark (version 1), whereas the unquantized model achieves 79.18%.

## Model Optimizations

This model was obtained by quantizing the weights of [Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) to the INT8 data type.
Only the weights of the linear operators within transformer blocks are quantized. Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps the INT8 and floating-point representations of the quantized weights.
[AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) is used for quantization.
This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
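
As a rough sketch (not the exact recipe used for this checkpoint), a weight-only INT8 GPTQ run of this kind could be set up with AutoGPTQ as shown below. The `bits=8`, `sym=True`, and per-channel `group_size=-1` settings mirror the description above; the calibration data, `desc_act` flag, and output path are illustrative assumptions.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

MODEL_ID = "meta-llama/Meta-Llama-3-70B-Instruct"
SAVE_DIR = "Meta-Llama-3-70B-Instruct-quantized.w8a16"  # assumed output path

# bits=8, sym=True, and group_size=-1 (one scale per output channel) mirror the
# card's description; desc_act and the calibration set are illustrative choices.
quantize_config = BaseQuantizeConfig(
    bits=8,
    group_size=-1,
    sym=True,
    desc_act=False,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoGPTQForCausalLM.from_pretrained(MODEL_ID, quantize_config)

# Tiny placeholder calibration set; a real run would use a few hundred
# representative samples.
calibration = [
    tokenizer(
        "Quantization reduces the memory footprint of large language models.",
        return_tensors="pt",
    )
]

model.quantize(calibration)     # runs the GPTQ algorithm layer by layer
model.save_quantized(SAVE_DIR)  # stores INT8 weights with per-channel scales
tokenizer.save_pretrained(SAVE_DIR)
```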

## Evaluation

The model was evaluated with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) using the [vLLM](https://docs.vllm.ai/en/stable/) engine.
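
As an illustration, such a run can be launched through the harness's Python API with vLLM as the backend. The repository id, tensor parallelism, and other vLLM arguments below are assumptions rather than the exact configuration used for the reported scores.

```python
from lm_eval import simple_evaluate

# Sketch of a single-task run; each Open LLM Leaderboard task would be evaluated
# with its own few-shot count (arc-c 25, hellaswag 10, mmlu 5, truthfulqa 0,
# winogrande 5, gsm8k 5). Repository id and vLLM arguments are assumptions.
results = simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w8a16,"
        "dtype=auto,tensor_parallel_size=4,gpu_memory_utilization=0.9"
    ),
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size="auto",
)
print(results["results"]["gsm8k"])
```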

## Accuracy

### Open LLM Leaderboard evaluation scores
| Benchmark | [Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) | Meta-Llama-3-70B-Instruct-quantized.w8a16<br>(this model) |
| :------------------: | :----------------------: | :------------------------------------------------: |
| arc-c<br>25-shot | 72.44% | 71.59% |
| hellaswag<br>10-shot | 85.54% | 85.65% |
| mmlu<br>5-shot | 80.18% | 78.69% |
| truthfulqa<br>0-shot | 62.92% | 61.94% |
| winogrande<br>5-shot | 83.19% | 83.11% |
| gsm8k<br>5-shot | 90.83% | 86.43% |
| **Average<br>Accuracy** | **79.18%** | **77.90%** |
| **Recovery** | **100%** | **98.38%** |
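
The Average row is the unweighted mean of the six benchmark scores, and Recovery is the quantized average divided by the unquantized average; the short check below reproduces the figures in the table.

```python
# Reproduce the Average and Recovery rows from the per-task scores above.
baseline  = [72.44, 85.54, 80.18, 62.92, 83.19, 90.83]  # Meta-Llama-3-70B-Instruct
quantized = [71.59, 85.65, 78.69, 61.94, 83.11, 86.43]  # this model

avg_baseline  = sum(baseline) / len(baseline)       # ~79.18
avg_quantized = sum(quantized) / len(quantized)     # ~77.90
recovery      = 100 * avg_quantized / avg_baseline  # ~98.38

print(f"{avg_baseline:.2f}% {avg_quantized:.2f}% {recovery:.2f}%")
```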