JustinLin610 committed on
Commit
d431257
2 Parent(s): a987369 cd19432

Merge branch 'main' of hf.co:Qwen/Qwen-7B-Chat-Int4

Files changed (1)
  1. README.md +87 -141
README.md CHANGED
@@ -8,7 +8,7 @@ pipeline_tag: text-generation
8
  inference: false
9
  ---
10
 
11
- # Qwen-7B-Chat
12
 
13
  <p align="center">
14
  <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/logo.jpg" width="400"/>
@@ -22,31 +22,34 @@ inference: false
22
 
23
  ## 介绍(Introduction)
24
 
25
- **通义千问-7B(Qwen-7B)**是阿里云研发的通义千问大模型系列的70亿参数规模的模型。Qwen-7B是基于Transformer的大语言模型, 在超大规模的预训练数据上进行训练得到。预训练数据类型多样,覆盖广泛,包括大量网络文本、专业书籍、代码等。同时,在Qwen-7B的基础上,我们使用对齐机制打造了基于大语言模型的AI助手Qwen-7B-Chat。本仓库为Qwen-7B-Chat的仓库。
26
 
27
  如果您想了解更多关于通义千问-7B开源模型的细节,我们建议您参阅[Github代码库](https://github.com/QwenLM/Qwen-7B)。
28
 
29
- **Qwen-7B** is the 7B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, code, etc. Additionally, based on the pretrained Qwen-7B, we release Qwen-7B-Chat, a large-model-based AI assistant, which is trained with alignment techniques. This repository is for Qwen-7B-Chat.
30
 
31
  For more details about the open-source model of Qwen-7B, please refer to the [Github](https://github.com/QwenLM/Qwen-7B) code repository.
32
 
33
  ## 要求(Requirements)
34
 
35
  * python 3.8及以上版本
36
- * pytorch 1.12及以上版本,推荐2.0及以上版本
37
  * 建议使用CUDA 11.4及以上(GPU用户、flash-attention用户等需考虑此选项)
38
  * python 3.8 and above
39
- * pytorch 1.12 and above, 2.0 and above are recommended
40
  * CUDA 11.4 and above are recommended (this is for GPU users, flash-attention users, etc.)
41
 
42
  ## 依赖项(Dependency)
43
 
44
- 运行Qwen-7B-Chat,请确保满足上述要求,再执行以下pip命令安装依赖库
45
 
46
- To run Qwen-7B-Chat, please make sure you meet the above requirements, and then execute the following pip commands to install the dependent libraries.
47
 
48
  ```bash
49
- pip install transformers==4.31.0 accelerate tiktoken einops
50
  ```
51
 
52
  另外,推荐安装`flash-attention`库,以实现更高的效率和更低的显存占用。
@@ -70,49 +73,94 @@ We show an example of multi-turn interaction with Qwen-7B-Chat in the following
70
 
71
 
72
  ```python
73
- from transformers import AutoModelForCausalLM, AutoTokenizer
74
- from transformers.generation import GenerationConfig
 
75
 
76
  # Note: The default behavior now has injection attack prevention off.
77
- tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
78
 
79
- # use bf16
80
- # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
81
- # use fp16
82
- # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
83
- # use cpu only
84
- # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
85
- # use auto mode, automatically select precision based on the device.
86
- model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True).eval()
87
 
88
  # Specify hyperparameters for generation
89
- model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True) # 可指定不同的生成长度、top_p等相关超参
90
-
91
- # 第一轮对话 1st dialogue turn
92
- response, history = model.chat(tokenizer, "你好", history=None)
93
  print(response)
94
  # 你好!很高兴为你提供帮助。
95
-
96
- # 第二轮对话 2nd dialogue turn
97
- response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
98
- print(response)
99
- # 这是一个关于一个年轻人奋斗创业最终取得成功的故事。
100
- # 故事的主人公叫李明,他来自一个普通的家庭,父母都是普通的工人。从小,李明就立下了一个目标:要成为一名成功的企业家。
101
- # 为了实现这个目标,李明勤奋学习,考上了大学。在大学期间,他积极参加各种创业比赛,获得了不少奖项。他还利用课余时间去实习,积累了宝贵的经验。
102
- # 毕业后,李明决定开始自己的创业之路。他开始寻找投资机会,但多次都被拒绝了。然而,他并没有放弃。他继续努力,不断改进自己的创业计划,并寻找新的投资机会。
103
- # 最终,李明成功地获得了一笔投资,开始了自己的创业之路。他成立了一家科技公司,专注于开发新型软件。在他的领导下,公司迅速发展起来,成为了一家成功的科技企业。
104
- # 李明的成功并不是偶然的。他勤奋、坚韧、勇于冒险,不断学习和改进自己。他的成功也证明了,只要努力奋斗,任何人都有可能取得成功。
105
-
106
- # 第三轮对话 3rd dialogue turn
107
- response, history = model.chat(tokenizer, "给这个故事起一个标题", history=history)
108
- print(response)
109
- # 《奋斗创业:一个年轻人的成功之路》
110
  ```
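The commented-out loading options above choose the precision explicitly, while auto mode selects one based on the device. A toy sketch of that kind of selection logic (illustrative only, not transformers' actual dispatch):

```python
# Toy sketch of the precision choice behind the loading options above.
# Illustrative only -- not the library's actual dispatch logic.
def pick_dtype(device: str, prefer_bf16: bool = True) -> str:
    if device == "cpu":
        return "float32"  # CPU-only path runs in full precision
    # On GPU, bf16 is preferred when requested; fp16 otherwise
    return "bfloat16" if prefer_bf16 else "float16"

print(pick_dtype("cpu"))                      # float32
print(pick_dtype("cuda"))                     # bfloat16
print(pick_dtype("cuda", prefer_bf16=False))  # float16
```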
111
 
112
  关于更多的使用说明,请参考我们的[Github repo](https://github.com/QwenLM/Qwen-7B)获取更多信息。
113
 
114
  Please refer to our [Github repo](https://github.com/QwenLM/Qwen-7B) for more information.
115
116
  ## Tokenizer
117
 
118
  > 注:作为术语的“tokenization”在中文中尚无共识的概念对应,本文档采用英文表达以利说明。
@@ -297,108 +345,6 @@ Qwen-7B-Chat also has the capability to be used as a [HuggingFace Agent](https:/
297
  | StarCoder-15.5B | 87.04 | 87.96 | 68.89 |
298
  | **Qwen-7B** | 90.74 | 92.59 | 74.07 |
299
 
300
- ## 量化(Quantization)
301
-
302
- 如希望使用更低精度的量化模型,如4比特和8比特的模型,我们提供了简单的示例来说明如何快速使用量化模型。在开始前,确保你已经安装了`bitsandbytes`。请注意,`bitsandbytes`的安装要求是:
303
-
304
- We provide examples to show how to load models in `NF4` and `Int8`. Before you start, make sure you have installed `bitsandbytes`. Note that the requirements for `bitsandbytes` are:
305
-
306
- ```
307
- **Requirements** Python >=3.8. Linux distribution (Ubuntu, MacOS, etc.) + CUDA > 10.0.
308
- ```
309
-
310
- Windows用户需安装特定版本的`bitsandbytes`,可选项包括[bitsandbytes-windows-webui](https://github.com/jllllll/bitsandbytes-windows-webui/releases/tag/wheels)。
311
-
312
- Windows users should install a dedicated build of `bitsandbytes`, such as [bitsandbytes-windows-webui](https://github.com/jllllll/bitsandbytes-windows-webui/releases/tag/wheels).
313
-
314
- 你只需要在`AutoModelForCausalLM.from_pretrained`中添加你的量化配置,即可使用量化模型。如下所示:
315
-
316
- Then you only need to add your quantization configuration to `AutoModelForCausalLM.from_pretrained`. See the example below:
317
-
318
- ```python
319
- from transformers import AutoModelForCausalLM, BitsAndBytesConfig
320
-
321
- # quantization configuration for NF4 (4 bits)
322
- quantization_config = BitsAndBytesConfig(
323
- load_in_4bit=True,
324
- bnb_4bit_quant_type='nf4',
325
- bnb_4bit_compute_dtype=torch.bfloat16
326
- )
327
-
328
- # quantization configuration for Int8 (8 bits) -- uncomment to use Int8 instead of NF4
329
- # quantization_config = BitsAndBytesConfig(load_in_8bit=True)
330
-
331
- model = AutoModelForCausalLM.from_pretrained(
332
- "Qwen/Qwen-7B-Chat",
333
- device_map="cuda:0",
334
- quantization_config=quantization_config,
335
- # max_memory=max_memory,  # optional: define max_memory first to cap per-device usage
336
- trust_remote_code=True,
337
- ).eval()
338
- ```
339
-
340
- 上述方法可以让我们将模型量化成`NF4`和`Int8`精度的模型进行读取,帮助我们节省显存开销。我们也提供了相关性能数据。我们发现尽管模型在效果上存在损失,但模型的显存开销大幅降低。
341
-
342
- With this method, you can load Qwen-7B-Chat in `NF4` and `Int8`, which saves memory usage. We provide related statistics of model performance below. We find that the quantization downgrades the effectiveness slightly but significantly reduces memory costs.
343
-
344
- | Precision | MMLU | GPU Memory for Loading Model |
345
- | ----------- | :------: | :---------------------------: |
346
- | BF16 | 56.7 | 16.38G |
347
- | Int8 | 52.8 | 10.44G |
348
- | NF4 | 48.9 | 7.79G |
349
-
350
- 注:表中显存占用的测试环境为A100-SXM4-80G单卡,PyTorch 2.0.1,CUDA 11.8,开启flash attention
351
-
352
- Note: The GPU memory usage profiling in the above table is performed on a single A100-SXM4-80G GPU, with PyTorch 2.0.1 and CUDA 11.8, and with flash attention enabled.
353
-
354
- ## 推理性能(Inference Efficiency)
355
-
356
- ### 推理速度(Inference Speed)
357
-
358
- 我们分别测试了BF16和量化条件下,模型生成2K tokens的平均推理速度,结果如下
359
-
360
- We measured the average inference speed of generating 2K tokens under BF16 precision and Int8 or NF4 quantization levels, respectively.
361
-
362
- | Quantization Level | Inference Speed with flash_attn (tokens/s) | Inference Speed w/o flash_attn (tokens/s) |
363
- | ------ | :---------------------------: | :---------------------------: |
364
- | BF16 (no quantization) | 30.06 | 27.55 |
365
- | Int8 (bnb) | 7.94 | 7.86 |
366
- | NF4 (bnb) | 21.43 | 20.37 |
367
-
368
- 具体的评测方式为:指定输入context长度为1,生成长度为2048;测试硬件为A100-SXM4-80G单卡,软件环境为PyTorch 2.0.1,CUDA版本11.8,计算生成该2048序列的平均速度。
369
-
370
- In detail, the setting of profiling is generating 2048 new tokens with 1 context token. The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.8. The inference speed is averaged over the generated 2048 tokens.
371
-
372
- ### 显存占用(GPU Memory Usage)
373
-
374
- 在BF16和不同量化条件下,我们分别测算了模型编码2048长度序列(并生成1个token),和生成8192长度序列(编码1个token作为context)的峰值显存占用。结果如下
375
-
376
- We also profile the peak GPU memory usage for encoding 2048 tokens as context (and generating single token) and generating 8192 tokens (with single token as context) under BF16 or Int8/NF4 quantization levels, respectively. The results are shown below.
377
-
378
- 打开flash attention时
379
-
380
- When using flash attention, the memory usage is:
381
-
382
- | Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
383
- | --- | :---: | :---: |
384
- | BF16 | 18.11GB | 23.52GB |
385
- | Int8 | 12.17GB | 17.60GB |
386
- | NF4 | 9.52GB | 14.93GB |
387
-
388
- 关闭flash attention时
389
-
390
- When not using flash attention, the memory usage is:
391
-
392
- | Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
393
- | --- | :---: | :---: |
394
- | BF16 | 18.11GB | 24.40GB |
395
- | Int8 | 12.18GB | 18.47GB |
396
- | NF4 | 9.52GB | 15.81GB |
397
-
398
-
399
- 以上测速和显存占用情况,均可通过该[评测脚本](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py)测算得到。
400
-
401
- The above speed and memory profiling are conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py).
402
 
403
  ## FAQ
404
 
@@ -412,10 +358,10 @@ If you meet problems, please refer to [FAQ](https://github.com/QwenLM/Qwen-7B/bl
412
 
413
  Our code and checkpoints are open for research purposes, and commercial use is allowed. Check [LICENSE](https://github.com/QwenLM/Qwen-7B/blob/main/LICENSE) for more details about the license. If you have requirements for commercial use, please fill out the [form](https://dashscope.console.aliyun.com/openModelApply/qianwen) to apply.
414
 
415
-
416
  ## 联系我们(Contact Us)
417
 
418
  如果你想给我们的研发团队和产品团队留言,请通过邮件([email protected])联系我们。
419
 
420
  If you would like to leave a message for either our research team or product team, feel free to send an email to [email protected].
421
 
 
 
8
  inference: false
9
  ---
10
 
11
+ # Qwen-7B-Chat-Int4
12
 
13
  <p align="center">
14
  <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/logo.jpg" width="400"/>
 
22
 
23
  ## 介绍(Introduction)
24
 
25
+ **通义千问-7B(Qwen-7B)**是阿里云研发的通义千问大模型系列的70亿参数规模的模型。Qwen-7B是基于Transformer的大语言模型, 在超大规模的预训练数据上进行训练得到。预训练数据类型多样,覆盖广泛,包括大量网络文本、专业书籍、代码等。同时,在Qwen-7B的基础上,我们使用对齐机制打造了基于大语言模型的AI助手Qwen-7B-Chat。本仓库为Qwen-7B-Chat的Int4量化模型的仓库。
26
 
27
  如果您想了解更多关于通义千问-7B开源模型的细节,我们建议您参阅[Github代码库](https://github.com/QwenLM/Qwen-7B)。
28
 
29
+ **Qwen-7B** is the 7B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, code, etc. Additionally, based on the pretrained Qwen-7B, we release Qwen-7B-Chat, a large-model-based AI assistant, which is trained with alignment techniques. This repository is for the Int4 quantized model of Qwen-7B-Chat.
30
 
31
  For more details about the open-source model of Qwen-7B, please refer to the [Github](https://github.com/QwenLM/Qwen-7B) code repository.
32
 
33
  ## 要求(Requirements)
34
 
35
  * python 3.8及以上版本
36
+ * pytorch 2.0及以上版本
37
  * 建议使用CUDA 11.4及以上(GPU用户、flash-attention用户等需考虑此选项)
38
  * python 3.8 and above
39
+ * pytorch 2.0 and above
40
  * CUDA 11.4 and above are recommended (this is for GPU users, flash-attention users, etc.)
41
 
42
  ## 依赖项(Dependency)
43
 
44
+ 运行Qwen-7B-Chat,请确保满足上述要求,再执行以下pip命令安装依赖库。同时需要通过源代码安装AutoGPTQ。
45
 
46
+ To run Qwen-7B-Chat, please make sure you meet the above requirements, and then execute the following pip commands to install the dependent libraries. Remember to install AutoGPTQ from source.
47
 
48
  ```bash
49
+ pip install -r requirements.txt
50
+
51
+ git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
52
+ pip install .
53
  ```
54
 
55
  另外,推荐安装`flash-attention`库,以实现更高的效率和更低的显存占用。
 
73
 
74
 
75
  ```python
76
+ from transformers import AutoTokenizer
77
+ from transformers import GenerationConfig
78
+ from auto_gptq import AutoGPTQForCausalLM
79
 
80
  # Note: The default behavior now has injection attack prevention off.
81
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True)
82
 
83
+ model = AutoGPTQForCausalLM.from_quantized("Qwen/Qwen-7B-Chat-Int4", device_map="auto", trust_remote_code=True, use_safetensors=True).eval()
84
 
85
  # Specify hyperparameters for generation
86
+ config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True) # 可指定不同的生成长度、top_p等相关超参
87
+ response, history = model.chat(tokenizer, "你好", history=None, generation_config=config)
88
  print(response)
89
  # 你好!很高兴为你提供帮助。
90
  ```
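`model.chat` returns the response together with an updated `history`, and passing that history back in is what makes the conversation multi-turn. A minimal sketch of the pattern with a stand-in model (the `EchoModel` class here is hypothetical, used only to show the data flow):

```python
# Sketch of the (query, response) history threading used by model.chat.
# EchoModel is a hypothetical stand-in so the flow can be shown without loading weights.
class EchoModel:
    def chat(self, tokenizer, query, history=None, generation_config=None):
        history = list(history or [])   # history is a list of (query, response) pairs
        response = f"echo: {query}"     # a real model would generate text here
        history.append((query, response))
        return response, history

model = EchoModel()
response, history = model.chat(None, "你好", history=None)
response, history = model.chat(None, "再讲一个故事", history=history)
print(len(history))  # 2 -- both turns are retained for the next call
```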
91
 
92
  关于更多的使用说明,请参考我们的[Github repo](https://github.com/QwenLM/Qwen-7B)获取更多信息。
93
 
94
  Please refer to our [Github repo](https://github.com/QwenLM/Qwen-7B) for more information.
95
 
96
+ ## 量化 (Quantization)
97
+
98
+ ### 用法 (Usage)
99
+
100
+ **请注意:我们更新量化方案为基于[AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)的量化,提供Qwen-7B-Chat的Int4量化模型[点击这里](https://huggingface.co/Qwen/Qwen-7B-Chat-Int4)。相比此前方案,该方案在模型评测效果几乎无损,且存储需求更低,推理速度更优。**
101
+
102
+ **Note: we provide a new solution based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), and release an Int4 quantized model for Qwen-7B-Chat [click here](https://huggingface.co/Qwen/Qwen-7B-Chat-Int4), which achieves nearly lossless model quality with lower memory cost and faster inference speed than the previous solution.**
103
+
104
+ 以下我们提供示例说明如何使用Int4量化模型。在开始使用前,请先保证满足AutoGPTQ的要求,并使用源代码安装(由于最新支持Qwen的代码未发布到PyPI):
105
+
106
+ Here we demonstrate how to use our provided quantized models for inference. Before you start, make sure you meet the requirements of AutoGPTQ and install it from source (the code supporting Qwen is not yet included in the latest PyPI release):
107
+
108
+ ```bash
109
+ git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
110
+ pip install .
111
+ ```
112
+
113
+ 随后便能轻松读取量化模型:
114
+
115
+ Then you can load the quantized model easily, as shown below:
116
+
117
+ ```python
118
+ from auto_gptq import AutoGPTQForCausalLM
119
+ model = AutoGPTQForCausalLM.from_quantized("Qwen/Qwen-7B-Chat-Int4", device_map="auto", trust_remote_code=True, use_safetensors=True).eval()
120
+ ```
121
+
122
+ 推理方法和基础用法类似,但注意需要从外部传入generation config:
123
+
124
+ To run inference, it is similar to the basic usage demonstrated above, but remember to pass in the generation configuration explicitly:
125
+
126
+ ```python
127
+ from transformers import GenerationConfig
128
+ config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True)
129
+ response, history = model.chat(tokenizer, "Hi", history=None, generation_config=config)
130
+ ```
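The generation config above carries sampling hyperparameters such as `top_p` and the generation length. As a toy illustration of what nucleus (top-p) sampling does, the filter below keeps the smallest set of highest-probability tokens whose cumulative mass reaches `top_p` (a pure-Python sketch, not the library's implementation):

```python
def top_p_filter(probs, top_p=0.75):
    # Keep the smallest set of tokens whose cumulative probability >= top_p.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for tok, p in ranked:
        kept.append(tok)
        total += p
        if total >= top_p:
            break
    return kept

probs = {"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}
print(top_p_filter(probs, top_p=0.75))  # ['a', 'b']
```

A higher `top_p` admits more low-probability tokens into the sampling pool, making generations more diverse; a lower `top_p` makes them more conservative.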
131
+
132
+ ### 推理速度 (Inference Speed)
133
+
134
+ 我们测算了BF16和Int4模型生成8192个token的平均推理速度。结果如下:
135
+
136
+ We measured the average inference speed of generating 8192 tokens under BF16 precision and Int4 quantization level, respectively.
137
+
138
+ | Quantization Level | Inference Speed (tokens/s) |
139
+ | ------------------ | :-------------------------:|
140
+ | BF16 | 22.59 |
141
+ | Int4 | 24.91 |
142
+
143
+ 具体而言,我们记录在长度为1的上下文的条件下生成8192个token的性能。评测运行于单张A100-SXM4-80G GPU,使用PyTorch 2.0.1和CUDA 11.4。推理速度是生成8192个token的速度均值。
144
+
145
+ In detail, the setting of profiling is generating 8192 new tokens with 1 context token. The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4. The inference speed is averaged over the generated 8192 tokens.
146
+
147
+ ### 显存使用 (GPU Memory Usage)
148
+
149
+ 我们还测算了BF16和Int4模型编码2048个token及生成8192个token的峰值显存占用情况。结果如下所示:
150
+
151
+ We also profile the peak GPU memory usage for encoding 2048 tokens as context (and generating single token) and generating 8192 tokens (with single token as context) under BF16 or Int4 quantization level, respectively. The results are shown below.
152
+
153
+ | Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
154
+ | ------------------ | :---------------------------------: | :-----------------------------------: |
155
+ | BF16 | 18.99GB | 24.40GB |
156
+ | Int4               | 10.20GB                             | 15.61GB                               |
157
+
158
+ 上述性能测算使用[此脚本](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py)完成。
159
+
160
+ The above speed and memory profiling are conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py).
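As a rough sanity check on the table above, assuming roughly 7.7B parameters for Qwen-7B (an approximation), the weights alone account for about 14.3 GiB in BF16 and about 3.6 GiB at 4 bits; the measured peaks are higher because they also include activations, the KV cache, and quantization scales/zeros:

```python
GIB = 1024 ** 3
params = 7.7e9  # approximate Qwen-7B parameter count (assumption)

bf16_weights = params * 2 / GIB    # BF16: 2 bytes per parameter
int4_weights = params * 0.5 / GIB  # Int4: 4 bits per parameter (scales/zeros ignored)

print(f"BF16 weights ~= {bf16_weights:.1f} GiB")  # ~14.3 GiB
print(f"Int4 weights ~= {int4_weights:.1f} GiB")  # ~3.6 GiB
```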
161
+
162
+
163
+
164
  ## Tokenizer
165
 
166
  > 注:作为术语的“tokenization”在中文中尚无共识的概念对应,本文档采用英文表达以利说明。
 
345
  | StarCoder-15.5B | 87.04 | 87.96 | 68.89 |
346
  | **Qwen-7B** | 90.74 | 92.59 | 74.07 |
347
 
348
 
349
  ## FAQ
350
 
 
358
 
359
  Our code and checkpoints are open for research purposes, and commercial use is allowed. Check [LICENSE](https://github.com/QwenLM/Qwen-7B/blob/main/LICENSE) for more details about the license. If you have requirements for commercial use, please fill out the [form](https://dashscope.console.aliyun.com/openModelApply/qianwen) to apply.
360
 
 
361
  ## 联系我们(Contact Us)
362
 
363
  如果你想给我们的研发团队和产品团队留言,请通过邮件([email protected])联系我们。
364
 
365
  If you would like to leave a message for either our research team or product team, feel free to send an email to [email protected].
366
 
367
+