Commit 82d7ef8 (parent: 41bb75f) by alexmarques: Update README.md

README.md CHANGED
@@ -6,7 +6,7 @@ license: apache-2.0
 license_link: https://www.apache.org/licenses/LICENSE-2.0
 ---
 
-# Qwen2-
+# Qwen2-72B-Instruct-quantized.w8a8
 
 ## Model Overview
 - **Model Architecture:** Qwen2
@@ -15,7 +15,7 @@ license_link: https://www.apache.org/licenses/LICENSE-2.0
 - **Model Optimizations:**
   - **Activation quantization:** INT8
   - **Weight quantization:** INT8
-- **Intended Use Cases:** Intended for commercial and research use in English. Similarly to [Qwen2-
+- **Intended Use Cases:** Intended for commercial and research use in English. Similarly to [Qwen2-72B-Instruct](https://huggingface.co/Qwen/Qwen2-72B-Instruct), this model is intended for assistant-like chat.
 - **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
 - **Release Date:** 7/15/2024
 - **Version:** 1.0
@@ -27,7 +27,7 @@ It achieves an average score of 80.32 on the [OpenLLM](https://huggingface.co/sp
 
 ### Model Optimizations
 
-This model was obtained by quantizing the weights of [Qwen2-
+This model was obtained by quantizing the weights of [Qwen2-72B-Instruct](https://huggingface.co/Qwen/Qwen2-72B-Instruct) to the INT8 data type.
 This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
 Weight quantization also reduces disk size requirements by approximately 50%.
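The W8A8 idea the README describes (16-bit values mapped to 8-bit integers plus a scale, halving storage) can be sketched as below. This is an illustrative toy using symmetric per-tensor scaling with NumPy, not the actual recipe used to produce this checkpoint (real pipelines typically use per-channel scales and calibration data):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor INT8 quantization (illustrative sketch only).

    Returns the INT8 tensor and the float scale needed to dequantize.
    """
    scale = np.abs(weights).max() / 127.0  # largest magnitude maps to +/-127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor from INT8 values and the scale."""
    return q.astype(np.float32) * scale

# Hypothetical layer-sized weight matrix in FP16.
w = np.random.randn(4096, 4096).astype(np.float16)
q, scale = quantize_int8(w.astype(np.float32))

# INT8 storage uses 1 byte per value vs 2 for FP16: ~50% less memory.
print(w.nbytes // q.nbytes)  # 2
```

Each value now costs one byte instead of two, which is where the approximately 50% GPU memory and disk savings quoted above come from; the quantization error per value is bounded by half the scale.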