JustinLin610 committed on
Commit
d431257
2 Parent(s): a987369 cd19432

Merge branch 'main' of hf.co:Qwen/Qwen-7B-Chat-Int4

Files changed (1)
  1. README.md +87 -141
README.md CHANGED
@@ -8,7 +8,7 @@ pipeline_tag: text-generation
8
  inference: false
9
  ---
10
 
11
- # Qwen-7B-Chat
12
 
13
  <p align="center">
14
  <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/logo.jpg" width="400"/>
@@ -22,31 +22,34 @@ inference: false
22
 
23
  ## 介绍(Introduction)
24
 
25
- **通义千问-7B(Qwen-7B)**是阿里云研发的通义千问大模型系列的70亿参数规模的模型。Qwen-7B是基于Transformer的大语言模型, 在超大规模的预训练数据上进行训练得到。预训练数据类型多样,覆盖广泛,包括大量网络文本、专业书籍、代码等。同时,在Qwen-7B的基础上,我们使用对齐机制打造了基于大语言模型的AI助手Qwen-7B-Chat。本仓库为Qwen-7B-Chat的仓库。
26
 
27
  如果您想了解更多关于通义千问-7B开源模型的细节,我们建议您参阅[Github代码库](https://github.com/QwenLM/Qwen-7B)。
28
 
29
- **Qwen-7B** is the 7B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, code, etc. Additionally, based on the pretrained Qwen-7B, we release Qwen-7B-Chat, a large-model-based AI assistant, which is trained with alignment techniques. This repository is for Qwen-7B-Chat.
30
 
31
  For more details about the open-source model of Qwen-7B, please refer to the [Github](https://github.com/QwenLM/Qwen-7B) code repository.
32
 
33
  ## 要求(Requirements)
34
 
35
  * python 3.8及以上版本
36
- * pytorch 1.12及以上版本,推荐2.0及以上版本
37
  * 建议使用CUDA 11.4及以上(GPU用户、flash-attention用户等需考虑此选项)
38
  * python 3.8 and above
39
- * pytorch 1.12 and above, 2.0 and above are recommended
40
  * CUDA 11.4 and above are recommended (this is for GPU users, flash-attention users, etc.)
41
 
42
  ## 依赖项(Dependency)
43
 
44
- 运行Qwen-7B-Chat,请确保满足上述要求,再执行以下pip命令安装依赖库
45
 
46
- To run Qwen-7B-Chat, please make sure you meet the above requirements, and then execute the following pip commands to install the dependent libraries.
47
 
48
  ```bash
49
- pip install transformers==4.31.0 accelerate tiktoken einops
50
  ```
51
 
52
  另外,推荐安装`flash-attention`库,以实现更高的效率和更低的显存占用。
@@ -70,49 +73,94 @@ We show an example of multi-turn interaction with Qwen-7B-Chat in the following
70
 
71
 
72
  ```python
73
- from transformers import AutoModelForCausalLM, AutoTokenizer
74
- from transformers.generation import GenerationConfig
 
75
 
76
  # Note: The default behavior now has injection attack prevention off.
77
- tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
78
 
79
- # use bf16
80
- # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
81
- # use fp16
82
- # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
83
- # use cpu only
84
- # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
85
- # use auto mode, automatically select precision based on the device.
86
- model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True).eval()
87
 
88
  # Specify hyperparameters for generation
89
- model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True) # 可指定不同的生成长度、top_p等相关超参
90
-
91
- # 第一轮对话 1st dialogue turn
92
- response, history = model.chat(tokenizer, "你好", history=None)
93
  print(response)
94
  # 你好!很高兴为你提供帮助。
95
-
96
- # 第二轮对话 2nd dialogue turn
97
- response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
98
- print(response)
99
- # 这是一个关于一个年轻人奋斗创业最终取得成功的故事。
100
- # 故事的主人公叫李明,他来自一个普通的家庭,父母都是普通的工人。从小,李明就立下了一个目标:要成为一名成功的企业家。
101
- # 为了实现这个目标,李明勤奋学习,考上了大学。在大学期间,他积极参加各种创业比赛,获得了不少奖项。他还利用课余时间去实习,积累了宝贵的经验。
102
- # 毕业后,李明决定开始自己的创业之路。他开始寻找投资机会,但多次都被拒绝了。然而,他并没有放弃。他继续努力,不断改进自己的创业计划,并寻找新的投资机会。
103
- # 最终,李明成功地获得了一笔投资,开始了自己的创业之路。他成立了一家科技公司,专注于开发新型软件。在他的领导下,公司迅速发展起来,成为了一家成功的科技企业。
104
- # 李明的成功并不是偶然的。他勤奋、坚韧、勇于冒险,不断学习和改进自己。他的成功也证明了,只要努力奋斗,任何人都有可能取得成功。
105
-
106
- # 第三轮对话 3rd dialogue turn
107
- response, history = model.chat(tokenizer, "给这个故事起一个标题", history=history)
108
- print(response)
109
- # 《奋斗创业:一个年轻人的成功之路》
110
  ```
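The commented-out loading options above choose the precision explicitly, while auto mode selects one based on the device. A toy sketch of that kind of selection logic (illustrative only, not transformers' actual dispatch):

```python
# Toy sketch of the precision choice behind the loading options above.
# Illustrative only -- not the library's actual dispatch logic.
def pick_dtype(device: str, prefer_bf16: bool = True) -> str:
    if device == "cpu":
        return "float32"  # CPU-only path runs in full precision
    # On GPU, bf16 is preferred when requested; fp16 otherwise
    return "bfloat16" if prefer_bf16 else "float16"

print(pick_dtype("cpu"))                      # float32
print(pick_dtype("cuda"))                     # bfloat16
print(pick_dtype("cuda", prefer_bf16=False))  # float16
```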
111
 
112
  关于更多的使用说明,请参考我们的[Github repo](https://github.com/QwenLM/Qwen-7B)获取更多信息。
113
 
114
  Please refer to our [Github repo](https://github.com/QwenLM/Qwen-7B) for more information.
115
116
  ## Tokenizer
117
 
118
  > 注:作为术语的“tokenization”在中文中尚无共识的概念对应,本文档采用英文表达以利说明。
@@ -297,108 +345,6 @@ Qwen-7B-Chat also has the capability to be used as a [HuggingFace Agent](https:/
297
  | StarCoder-15.5B | 87.04 | 87.96 | 68.89 |
298
  | **Qwen-7B** | 90.74 | 92.59 | 74.07 |
299
 
300
- ## 量化(Quantization)
301
-
302
- 如希望使用更低精度的量化模型,如4比特和8比特的模型,我们提供了简单的示例来说明如何快速使用量化模型。在开始前,确保你已经安装了`bitsandbytes`。请注意,`bitsandbytes`的安装要求是:
303
-
304
- We provide examples to show how to load models in `NF4` and `Int8`. Before you start, make sure you have installed `bitsandbytes`. Note that the requirements for `bitsandbytes` are:
305
-
306
- ```
307
- **Requirements** Python >=3.8. Linux distribution (Ubuntu, MacOS, etc.) + CUDA > 10.0.
308
- ```
309
-
310
- Windows用户需安装特定版本的`bitsandbytes`,可选项包括[bitsandbytes-windows-webui](https://github.com/jllllll/bitsandbytes-windows-webui/releases/tag/wheels)。
311
-
312
- Windows users should install a dedicated build of `bitsandbytes`, such as [bitsandbytes-windows-webui](https://github.com/jllllll/bitsandbytes-windows-webui/releases/tag/wheels).
313
-
314
- 你只需要在`AutoModelForCausalLM.from_pretrained`中添加你的量化配置,即可使用量化模型。如下所示:
315
-
316
- Then you only need to add your quantization configuration to `AutoModelForCausalLM.from_pretrained`. See the example below:
317
-
318
- ```python
319
- from transformers import AutoModelForCausalLM, BitsAndBytesConfig
320
-
321
- # quantization configuration for NF4 (4 bits)
322
- quantization_config = BitsAndBytesConfig(
323
- load_in_4bit=True,
324
- bnb_4bit_quant_type='nf4',
325
- bnb_4bit_compute_dtype=torch.bfloat16
326
- )
327
-
328
- # quantization configuration for Int8 (8 bits) -- uncomment to use Int8 instead of NF4
329
- # quantization_config = BitsAndBytesConfig(load_in_8bit=True)
330
-
331
- model = AutoModelForCausalLM.from_pretrained(
332
- "Qwen/Qwen-7B-Chat",
333
- device_map="cuda:0",
334
- quantization_config=quantization_config,
335
- # max_memory=max_memory,  # optional: define max_memory first to cap per-device usage
336
- trust_remote_code=True,
337
- ).eval()
338
- ```
339
-
340
- 上述方法可以让我们将模型量化成`NF4`和`Int8`精度的模型进行读取,帮助我们节省显存开销。我们也提供了相关性能数据。我们发现尽管模型在效果上存在损失,但模型的显存开销大幅降低。
341
-
342
- With this method, you can load Qwen-7B-Chat in `NF4` and `Int8`, which saves memory usage. We provide related statistics of model performance below. We find that the quantization downgrades the effectiveness slightly but significantly reduces memory costs.
343
-
344
- | Precision | MMLU | GPU Memory for Loading Model |
345
- | ----------- | :------: | :---------------------------: |
346
- | BF16 | 56.7 | 16.38G |
347
- | Int8 | 52.8 | 10.44G |
348
- | NF4 | 48.9 | 7.79G |
349
-
350
- 注:表中显存占用的测试环境为A100-SXM4-80G单卡,PyTorch 2.0.1,CUDA 11.8,开启flash attention
351
-
352
- Note: The GPU memory usage profiling in the above table is performed on a single A100-SXM4-80G GPU, with PyTorch 2.0.1 and CUDA 11.8, and with flash attention enabled.
353
-
354
- ## 推理性能(Inference Efficiency)
355
-
356
- ### 推理速度(Inference Speed)
357
-
358
- 我们分别测试了BF16和量化条件下,模型生成2K tokens的平均推理速度,结果如下
359
-
360
- We measured the average inference speed of generating 2K tokens under BF16 precision and Int8 or NF4 quantization levels, respectively.
361
-
362
- | Quantization Level | Inference Speed with flash_attn (tokens/s) | Inference Speed w/o flash_attn (tokens/s) |
363
- | ------ | :---------------------------: | :---------------------------: |
364
- | BF16 (no quantization) | 30.06 | 27.55 |
365
- | Int8 (bnb) | 7.94 | 7.86 |
366
- | NF4 (bnb) | 21.43 | 20.37 |
367
-
368
- 具体的评测方式为:指定输入context长度为1,生成长度为2048;测试硬件为A100-SXM4-80G单卡,软件环境为PyTorch 2.0.1,CUDA版本11.8,计算生成该2048序列的平均速度。
369
-
370
- In detail, the setting of profiling is generating 2048 new tokens with 1 context token. The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.8. The inference speed is averaged over the generated 2048 tokens.
371
-
372
- ### 显存占用(GPU Memory Usage)
373
-
374
- 在BF16和不同量化条件下,我们分别测算了模型编码2048长度序列(并生成1个token),和生成8192长度序列(编码1个token作为context)的峰值显存占用。结果如下
375
-
376
- We also profile the peak GPU memory usage for encoding 2048 tokens as context (and generating single token) and generating 8192 tokens (with single token as context) under BF16 or Int8/NF4 quantization levels, respectively. The results are shown below.
377
-
378
- 打开flash attention时
379
-
380
- When using flash attention, the memory usage is:
381
-
382
- | Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
383
- | --- | :---: | :---: |
384
- | BF16 | 18.11GB | 23.52GB |
385
- | Int8 | 12.17GB | 17.60GB |
386
- | NF4 | 9.52GB | 14.93GB |
387
-
388
- 关闭flash attention时
389
-
390
- When not using flash attention, the memory usage is:
391
-
392
- | Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
393
- | --- | :---: | :---: |
394
- | BF16 | 18.11GB | 24.40GB |
395
- | Int8 | 12.18GB | 18.47GB |
396
- | NF4 | 9.52GB | 15.81GB |
397
-
398
-
399
- 以上测速和显存占用情况,均可通过该[评测脚本](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py)测算得到。
400
-
401
- The above speed and memory profiling are conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py).
402
 
403
  ## FAQ
404
 
@@ -412,10 +358,10 @@ If you meet problems, please refer to [FAQ](https://github.com/QwenLM/Qwen-7B/bl
412
 
413
  Our code and checkpoints are open for research purposes, and commercial use is allowed. Check [LICENSE](https://github.com/QwenLM/Qwen-7B/blob/main/LICENSE) for more details about the license. If you have requirements for commercial use, please fill out the [form](https://dashscope.console.aliyun.com/openModelApply/qianwen) to apply.
414
 
415
-
416
  ## 联系我们(Contact Us)
417
 
418
  如果你想给我们的研发团队和产品团队留言,请通过邮件([email protected])联系我们。
419
 
420
  If you would like to leave a message for either our research team or product team, feel free to send an email to [email protected].
421
 
 
 
8
  inference: false
9
  ---
10
 
11
+ # Qwen-7B-Chat-Int4
12
 
13
  <p align="center">
14
  <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/logo.jpg" width="400"/>
 
22
 
23
  ## 介绍(Introduction)
24
 
25
+ **通义千问-7B(Qwen-7B)**是阿里云研发的通义千问大模型系列的70亿参数规模的模型。Qwen-7B是基于Transformer的大语言模型, 在超大规模的预训练数据上进行训练得到。预训练数据类型多样,覆盖广泛,包括大量网络文本、专业书籍、代码等。同时,在Qwen-7B的基础上,我们使用对齐机制打造了基于大语言模型的AI助手Qwen-7B-Chat。本仓库为Qwen-7B-Chat的Int4量化模型的仓库。
26
 
27
  如果您想了解更多关于通义千问-7B开源模型的细节,我们建议您参阅[Github代码库](https://github.com/QwenLM/Qwen-7B)。
28
 
29
+ **Qwen-7B** is the 7B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, code, etc. Additionally, based on the pretrained Qwen-7B, we release Qwen-7B-Chat, a large-model-based AI assistant, which is trained with alignment techniques. This repository is for the Int4 quantized model of Qwen-7B-Chat.
30
 
31
  For more details about the open-source model of Qwen-7B, please refer to the [Github](https://github.com/QwenLM/Qwen-7B) code repository.
32
 
33
  ## 要求(Requirements)
34
 
35
  * python 3.8及以上版本
36
+ * pytorch 2.0及以上版本
37
  * 建议使用CUDA 11.4及以上(GPU用户、flash-attention用户等需考虑此选项)
38
  * python 3.8 and above
39
+ * pytorch 2.0 and above
40
  * CUDA 11.4 and above are recommended (this is for GPU users, flash-attention users, etc.)
41
 
42
  ## 依赖项(Dependency)
43
 
44
+ 运行Qwen-7B-Chat,请确保满足上述要求,再执行以下pip命令安装依赖库。同时需要通过源代码安装AutoGPTQ。
45
 
46
+ To run Qwen-7B-Chat, please make sure you meet the above requirements, and then execute the following pip commands to install the dependent libraries. Remember to install AutoGPTQ from source.
47
 
48
  ```bash
49
+ pip install -r requirements.txt
50
+
51
+ git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
52
+ pip install .
53
  ```
54
 
55
  另外,推荐安装`flash-attention`库,以实现更高的效率和更低的显存占用。
 
73
 
74
 
75
  ```python
76
+ from transformers import AutoTokenizer
77
+ from transformers import GenerationConfig
78
+ from auto_gptq import AutoGPTQForCausalLM
79
 
80
  # Note: The default behavior now has injection attack prevention off.
81
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True)
82
 
83
+ model = AutoGPTQForCausalLM.from_quantized("Qwen/Qwen-7B-Chat-Int4", device_map="auto", trust_remote_code=True, use_safetensors=True).eval()
84
 
85
  # Specify hyperparameters for generation
86
+ config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True) # 可指定不同的生成长度、top_p等相关超参
87
+ response, history = model.chat(tokenizer, "你好", history=None, generation_config=config)
88
  print(response)
89
  # 你好!很高兴为你提供帮助。
90
  ```
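`model.chat` returns the response together with an updated `history`, and passing that history back in is what makes the conversation multi-turn. A minimal sketch of the pattern with a stand-in model (the `EchoModel` class here is hypothetical, used only to show the data flow):

```python
# Sketch of the (query, response) history threading used by model.chat.
# EchoModel is a hypothetical stand-in so the flow can be shown without loading weights.
class EchoModel:
    def chat(self, tokenizer, query, history=None, generation_config=None):
        history = list(history or [])   # history is a list of (query, response) pairs
        response = f"echo: {query}"     # a real model would generate text here
        history.append((query, response))
        return response, history

model = EchoModel()
response, history = model.chat(None, "你好", history=None)
response, history = model.chat(None, "再讲一个故事", history=history)
print(len(history))  # 2 -- both turns are retained for the next call
```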
91
 
92
  关于更多的使用说明,请参考我们的[Github repo](https://github.com/QwenLM/Qwen-7B)获取更多信息。
93
 
94
  Please refer to our [Github repo](https://github.com/QwenLM/Qwen-7B) for more information.
95
 
96
+ ## 量化 (Quantization)
97
+
98
+ ### 用法 (Usage)
99
+
100
+ **请注意:我们更新量化方案为基于[AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)的量化,提供Qwen-7B-Chat的Int4量化模型[点击这里](https://huggingface.co/Qwen/Qwen-7B-Chat-Int4)。相比此前方案,该方案在模型评测效果几乎无损,且存储需求更低,推理速度更优。**
101
+
102
+ **Note: we provide a new solution based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), and release an Int4 quantized model for Qwen-7B-Chat [click here](https://huggingface.co/Qwen/Qwen-7B-Chat-Int4), which achieves nearly lossless model quality with lower memory cost and faster inference speed than the previous solution.**
103
+
104
+ 以下我们提供示例说明如何使用Int4量化模型。在开始使用前,请先保证满足AutoGPTQ的要求,并使用源代码安装(由于最新支持Qwen的代码未发布到PyPI):
105
+
106
+ Here we demonstrate how to use our provided quantized models for inference. Before you start, make sure you meet the requirements of AutoGPTQ and install it from source (the code supporting Qwen is not yet included in the latest PyPI release):
107
+
108
+ ```bash
109
+ git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
110
+ pip install .
111
+ ```
112
+
113
+ 随后便能轻松读取量化模型:
114
+
115
+ Then you can load the quantized model easily, as shown below:
116
+
117
+ ```python
118
+ from auto_gptq import AutoGPTQForCausalLM
119
+ model = AutoGPTQForCausalLM.from_quantized("Qwen/Qwen-7B-Chat-Int4", device_map="auto", trust_remote_code=True, use_safetensors=True).eval()
120
+ ```
121
+
122
+ 推理方法和基础用法类似,但注意需要从外部传入generation config:
123
+
124
+ To run inference, it is similar to the basic usage demonstrated above, but remember to pass in the generation configuration explicitly:
125
+
126
+ ```python
127
+ from transformers import GenerationConfig
128
+ config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True)
129
+ response, history = model.chat(tokenizer, "Hi", history=None, generation_config=config)
130
+ ```
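The generation config above carries sampling hyperparameters such as `top_p` and the generation length. As a toy illustration of what nucleus (top-p) sampling does, the filter below keeps the smallest set of highest-probability tokens whose cumulative mass reaches `top_p` (a pure-Python sketch, not the library's implementation):

```python
def top_p_filter(probs, top_p=0.75):
    # Keep the smallest set of tokens whose cumulative probability >= top_p.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for tok, p in ranked:
        kept.append(tok)
        total += p
        if total >= top_p:
            break
    return kept

probs = {"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}
print(top_p_filter(probs, top_p=0.75))  # ['a', 'b']
```

A higher `top_p` admits more low-probability tokens into the sampling pool, making generations more diverse; a lower `top_p` makes them more conservative.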
131
+
132
+ ### 推理速度 (Inference Speed)
133
+
134
+ 我们测算了BF16和Int4模型生成8192个token的平均推理速度。结果如下:
135
+
136
+ We measured the average inference speed of generating 8192 tokens under BF16 precision and Int4 quantization level, respectively.
137
+
138
+ | Quantization Level | Inference Speed (tokens/s) |
139
+ | ------------------ | :-------------------------:|
140
+ | BF16 | 22.59 |
141
+ | Int4 | 24.91 |
142
+
143
+ 具体而言,我们记录在长度为1的上下文的条件下生成8192个token的性能。评测运行于单张A100-SXM4-80G GPU,使用PyTorch 2.0.1和CUDA 11.4。推理速度是生成8192个token的速度均值。
144
+
145
+ In detail, the setting of profiling is generating 8192 new tokens with 1 context token. The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4. The inference speed is averaged over the generated 8192 tokens.
146
+
147
+ ### 显存使用 (GPU Memory Usage)
148
+
149
+ 我们还测算了BF16和Int4模型编码2048个token及生成8192个token的峰值显存占用情况。结果如下所示:
150
+
151
+ We also profile the peak GPU memory usage for encoding 2048 tokens as context (and generating single token) and generating 8192 tokens (with single token as context) under BF16 or Int4 quantization level, respectively. The results are shown below.
152
+
153
+ | Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
154
+ | ------------------ | :---------------------------------: | :-----------------------------------: |
155
+ | BF16 | 18.99GB | 24.40GB |
156
+ | Int4               | 10.20GB                             | 15.61GB                               |
157
+
158
+ 上述性能测算使用[此脚本](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py)完成。
159
+
160
+ The above speed and memory profiling are conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py).
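As a rough sanity check on the table above, assuming roughly 7.7B parameters for Qwen-7B (an approximation), the weights alone account for about 14.3 GiB in BF16 and about 3.6 GiB at 4 bits; the measured peaks are higher because they also include activations, the KV cache, and quantization scales/zeros:

```python
GIB = 1024 ** 3
params = 7.7e9  # approximate Qwen-7B parameter count (assumption)

bf16_weights = params * 2 / GIB    # BF16: 2 bytes per parameter
int4_weights = params * 0.5 / GIB  # Int4: 4 bits per parameter (scales/zeros ignored)

print(f"BF16 weights ~= {bf16_weights:.1f} GiB")  # ~14.3 GiB
print(f"Int4 weights ~= {int4_weights:.1f} GiB")  # ~3.6 GiB
```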
161
+
162
+
163
+
164
  ## Tokenizer
165
 
166
  > 注:作为术语的“tokenization”在中文中尚无共识的概念对应,本文档采用英文表达以利说明。
 
345
  | StarCoder-15.5B | 87.04 | 87.96 | 68.89 |
346
  | **Qwen-7B** | 90.74 | 92.59 | 74.07 |
347
 
348
 
349
  ## FAQ
350
 
 
358
 
359
  Our code and checkpoints are open for research purposes, and commercial use is allowed. Check [LICENSE](https://github.com/QwenLM/Qwen-7B/blob/main/LICENSE) for more details about the license. If you have requirements for commercial use, please fill out the [form](https://dashscope.console.aliyun.com/openModelApply/qianwen) to apply.
360
 
 
361
  ## 联系我们(Contact Us)
362
 
363
  如果你想给我们的研发团队和产品团队留言,请通过邮件([email protected])联系我们。
364
 
365
  If you would like to leave a message for either our research team or product team, feel free to send an email to [email protected].
366
 
367
+