shunxing1234 committed
Commit dbe851b · Parent: 975bcc0
Update README.md

README.md CHANGED
@@ -2,206 +2,38 @@
license: other
---

The Aquila language model inherits the architectural design strengths of GPT-3 and LLaMA, replaces a set of underlying operators with more efficient implementations, and redesigns the tokenizer for Chinese-English bilingual support. It also upgrades the BMTrain parallel training method, achieving nearly 8 times the training efficiency of Megatron+DeepSpeed ZeRO-2 during Aquila's training. The Aquila language model is trained from scratch on high-quality Chinese and English corpora. Through data quality control and a variety of training optimizations, it achieves better performance than other open-source models with smaller datasets and shorter training times. It is also the first open-source large language model that combines Chinese and English knowledge, supports commercial licensing, and complies with domestic Chinese data regulations.

AquilaChat-7B is a conversational language model that supports Chinese-English dialogue. It is based on the Aquila-7B model and fine-tuned using SFT. The AquilaChat-7B model was developed by the Beijing Academy of Artificial Intelligence (BAAI).
The AquilaChat model is mainly intended to validate the capabilities of the base model. You may use, modify, and commercialize the model according to your needs, but you must comply with the laws and regulations of all countries, and you must provide the source of the Aquila series models and a copy of the Aquila series model license to any third-party users.
| Model | Status | Commercial use? | GPU |
| :---------------- | :------- | :-- |:-- |
| Aquila-7B | Released | ✅ | Nvidia-A100 |
| AquilaChat-7B | Released | ✅ | Nvidia-A100 |
| AquilaCode-7B-NV | Released | ✅ | Nvidia-A100 |
| AquilaCode-7B-TS | Released | ✅ | Tianshu-BI-V100 |
| Aquila-33B | **Coming soon** | ✅ | Nvidia-A100 |
| AquilaChat-33B | **Coming soon** | ✅ | Nvidia-A100 |
We used a series of more efficient low-level operators to assist with model training, including methods adapted from [flash-attention](https://github.com/HazyResearch/flash-attention), replacements for some intermediate calculations, and RMSNorm. On top of this, we applied [BMTrain](https://github.com/OpenBMB/BMTrain) for lightweight parallel training, which uses data parallelism, ZeRO (zero redundancy optimizer), optimizer offloading, checkpointing and operation fusion, and communication-computation overlap to optimize the training process.
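As an illustration of one of the operator swaps mentioned above, here is a minimal RMSNorm sketch in PyTorch. This is not the Aquila implementation, only a standard formulation of the technique; the class name and epsilon value are illustrative.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: rescales by 1/RMS(x) without mean-centering,
    which is cheaper than LayerNorm because no mean or bias terms are computed."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # compute the RMS over the hidden dimension in float32 for numerical stability
        inv_rms = x.float().pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return (x.float() * inv_rms).type_as(x) * self.weight

# usage: normalize a batch of hidden states of width 4096
h = torch.randn(2, 16, 4096)
print(RMSNorm(4096)(h).shape)  # torch.Size([2, 16, 4096])
```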
The tokenizer used in the Aquila model was trained from scratch by us and supports both English and Chinese. The table below compares its parameters with those of other tokenizers.

To obtain the average token counts, we sampled ten thousand examples each from English, Chinese, and code data, tokenized them with each tokenizer, and recorded the per-sample token counts in the table.
| Model | Vocab size | Note | Avg tokens (English) | Avg tokens (Chinese) | Avg tokens (code) |
| ----- | ---- | ----- | ---- | ----- | ---- |
| GPT2 | 50527 | bpe | 1717 | 1764 | 2323 |
| LLaMA | 32000 | sp(bpe) | 1805 | 1257 | 1970 |
| Aquila | 100000 | bpe | 1575 | 477 | 1679 |
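A rough sketch of how such averages can be computed with Hugging Face tokenizers follows. The checkpoint name and the tiny example corpora are placeholders, not the exact setup used for the table above (which used ten thousand samples per corpus).

```python
from transformers import AutoTokenizer

def avg_tokens(tokenizer, texts):
    """Average number of tokens the tokenizer produces per text sample."""
    return sum(len(tokenizer.encode(t)) for t in texts) / len(texts)

# hypothetical corpora; in practice, 10,000 samples each of English, Chinese, and code
corpora = {
    "English": ["The quick brown fox jumps over the lazy dog."],
    "Chinese": ["北京为什么是中国的首都?"],
    "code": ["def add(a, b):\n    return a + b\n"],
}

tok = AutoTokenizer.from_pretrained("gpt2")  # swap in the tokenizer being compared
for name, texts in corpora.items():
    print(name, avg_tokens(tok, texts))
```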
## Training data

We used a series of high-quality Chinese and English datasets to train and fine-tune our conversational language model, and we continue to update them iteratively.

## How to use

### 1. Inference

```python
import os
import torch
from flagai.auto_model.auto_loader import AutoLoader
from flagai.model.predictor.predictor import Predictor
from flagai.model.predictor.aquila import aquila_generate
from flagai.data.tokenizer import Tokenizer
import bminf

state_dict = "./checkpoints_in"
model_name = 'aquilachat-7b'

# load the model and tokenizer from ./checkpoints_in/aquilachat-7b
loader = AutoLoader(
    "lm",
    model_dir=state_dict,
    model_name=model_name,
    use_cache=True)

model = loader.get_model()
tokenizer = loader.get_tokenizer()
cache_dir = os.path.join(state_dict, model_name)
model.eval()
model.half()
model.cuda()

predictor = Predictor(model, tokenizer)

text = "北京为什么是中国的首都?"

def pack_obj(text):
    # wrap a single user query in the conversation format used by the SFT data
    obj = dict()
    obj['id'] = 'demo'

    obj['conversations'] = []
    human = dict()
    human['from'] = 'human'
    human['value'] = text
    obj['conversations'].append(human)
    # dummy bot
    bot = dict()
    bot['from'] = 'gpt'
    bot['value'] = ''
    obj['conversations'].append(bot)

    obj['instruction'] = ''

    return obj

def delete_last_bot_end_signal(convo_obj):
    conversations = convo_obj['conversations']
    assert len(conversations) > 0 and len(conversations) % 2 == 0
    assert conversations[0]['from'] == 'human'

    last_bot = conversations[len(conversations) - 1]
    assert last_bot['from'] == 'gpt'

    ## from _add_speaker_and_signal
    END_SIGNAL = "\n"
    len_end_signal = len(END_SIGNAL)
    len_last_bot_value = len(last_bot['value'])
    last_bot['value'] = last_bot['value'][:len_last_bot_value - len_end_signal]
    return

def convo_tokenize(convo_obj, tokenizer):
    chat_desc = convo_obj['chat_desc']
    instruction = convo_obj['instruction']
    conversations = convo_obj['conversations']

    # chat_desc
    example = tokenizer.encode_plus(f"{chat_desc}", None, max_length=None)['input_ids']
    EOS_TOKEN = example[-1]
    example = example[:-1]  # remove eos
    # instruction
    instruction = tokenizer.encode_plus(f"{instruction}", None, max_length=None)['input_ids']
    instruction = instruction[1:-1]  # remove bos & eos
    example += instruction

    for conversation in conversations:
        role = conversation['from']
        content = conversation['value']
        print(f"role {role}, raw content {content}")
        content = tokenizer.encode_plus(f"{content}", None, max_length=None)['input_ids']
        content = content[1:-1]  # remove bos & eos
        print(f"role {role}, content {content}")
        example += content
    return example

print('-' * 80)
print(f"text is {text}")

# build the prompt with the default conversation template
from examples.aquila.cyg_conversation import default_conversation

conv = default_conversation.copy()
conv.append_message(conv.roles[0], text)
conv.append_message(conv.roles[1], None)

tokens = tokenizer.encode_plus(f"{conv.get_prompt()}", None, max_length=None)['input_ids']
tokens = tokens[1:-1]

with torch.no_grad():
    out = aquila_generate(tokenizer, model, [text], max_gen_len=200, top_p=0.95, prompts_tokens=[tokens])
    print(f"pred is {out}")
```
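The script above imports `bminf` but never uses it. If GPU memory is tight, one option is to wrap the model with BMInf before generation. This is an assumption on our part, following the usage documented in the BMInf project rather than anything stated in this README; the memory limit is illustrative, and the snippet continues the script above (so it reuses its `torch`, `bminf`, and `model` objects).

```python
# optional: cap GPU memory usage via BMInf (usage as documented by OpenBMB/BMInf;
# not part of this README's original example, parameters are illustrative)
with torch.cuda.device(0):
    model = bminf.wrapper(model, quantization=False, memory_limit=20 << 30)  # ~20 GB
```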
### 2. Supervised Fine-tuning (SFT)

#### Step 1: Set up checkpoints

Create a new directory named `aquila-7b` inside `./checkpoints_in`. Place the fine-tuned checkpoint and all other files from the original `aquila-7b` model, including `config.json`, `mergex.txt`, `vocab.json`, and `special_tokens_map.json`, into this directory.
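A minimal sketch of that directory setup in Python. The source paths (`./aquila-7b` for the original model files and `./output/checkpoint.pt` for the fine-tuned checkpoint) are hypothetical; adjust them to your own layout, and keep the checkpoint under whatever file name your FlagAI loader expects.

```python
import os
import shutil

src_model_dir = "./aquila-7b"        # original aquila-7b files (hypothetical location)
src_ckpt = "./output/checkpoint.pt"  # fine-tuned checkpoint (hypothetical location)
dst = "./checkpoints_in/aquila-7b"

os.makedirs(dst, exist_ok=True)
shutil.copy(src_ckpt, dst)  # copied as-is; rename if your loader expects a specific name
for name in ["config.json", "mergex.txt", "vocab.json", "special_tokens_map.json"]:
    shutil.copy(os.path.join(src_model_dir, name), dst)
```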
#### Step 2: Modify parameters

* `cd /examples/aquila`
* Configure the `hostfile` file; see [here](../../../docs/TUTORIAL_8_ENVIRONMENT_SETUP.md) for reference
* Configure the `bmtrain_mgpu.sh` file: change `SCRIPT_FILE` to `aquila_sft.py`
* (Optional) Change parameters in `Aquila-sft.yaml`
| Parameter | Type | Description |
|--------------------------------|------------|-------------------------------------------------------|
| batch_size | int | Number of samples drawn from the dataset per training iteration. A larger batch size generally speeds up processing but uses more memory |
| gradient_accumulation_steps | int | Number of mini-batches over which gradients are accumulated before the model weights are updated. Mainly useful when GPU memory is limited: a small batch_size combined with gradient accumulation achieves the same effect as a larger batch size |
| lr | float | Step size at which the model updates its parameters. A learning rate that is too high may keep the model from converging; one that is too low may lead to long training times or getting stuck in a local optimum |
| warm_up | float | Ratio of the initial learning rate to the base learning rate |
| save_interval | int | Interval (in iterations) at which the model is saved. For long training runs, periodic saving prevents losing all progress to a sudden interruption or error |
| enable_sft_conversations_dataset_v3 | bool | Data preprocessing method |
| enable_sft_dataset_dir | str | Directory of the SFT dataset |
| enable_sft_dataset_file | str | Filename of the SFT dataset |
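The conversation format consumed through `enable_sft_dataset_dir`/`enable_sft_dataset_file` is not spelled out in this README. The sketch below infers a plausible JSONL sample from the `pack_obj` helper in the inference example above (fields `id`, `instruction`, and `conversations` with `from` set to `human`/`gpt`); treat the exact schema and file name as assumptions and check the FlagAI examples.

```python
import json

# one hypothetical SFT training sample, mirroring the structure built by pack_obj()
sample = {
    "id": "demo-0001",
    "instruction": "",
    "conversations": [
        {"from": "human", "value": "北京为什么是中国的首都?"},
        {"from": "gpt", "value": "因为北京长期以来是中国的政治和文化中心。"},
    ],
}

# write one JSON object per line into the dataset file named in Aquila-sft.yaml
with open("sft_samples.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```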
#### Step 3: Start SFT

```
bash dist_trigger_docker.sh hostfile Aquila-sft.yaml aquilachat-7b [experiment name]
```

The following information will then be printed. Note that `NODES_NUM` should equal the number of nodes, and `LOGFILE` is the log file for this run.

![Screenshot](./info.jpg)

Before training starts successfully, you should see output like the following (the exact parameters may differ):
![Screenshot](./info2.jpg)
license: other
---

# Aquila (悟道·天鹰)
The Aquila family of large language models is the first open-source LLM series that combines Chinese-English bilingual knowledge, supports a commercial license agreement, and meets domestic Chinese data-compliance requirements.

- 🌟 **Open-source license that allows commercial use.** The source code of the Aquila series is released under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0), and the model weights under the [BAAI Aquila Model License Agreement](../../../BAAI_Aquila_Model_License.pdf); users may use the models for commercial purposes as long as they comply with the license terms.
- ✍️ **Native Chinese and English knowledge.** The Aquila models are trained from scratch on high-quality Chinese and English corpora, with Chinese making up roughly 40% of the data, so the models accumulate native Chinese-world knowledge during pre-training rather than knowledge translated from English.
- 👮♀️ **Compliance with domestic data regulations.** The Chinese corpora of the Aquila models come from datasets accumulated by BAAI over many years, including Chinese web data from more than 10,000 sources (over 99% of them domestic) as well as high-quality Chinese literature and book data backed by authoritative domestic institutions. We continue to collect high-quality, diverse data and feed it into the subsequent training of the Aquila base models.
- 🎯 **Continuous iteration, continuous open release.** We will keep improving the training data, optimizing training methods, and raising model performance, growing a flourishing "model tree" on an ever-stronger base model and continuously releasing updated open-source versions.
More details about the Aquila models will be presented in the official technical report. Please follow the official channels for updates, including the [FlagAI GitHub repository](https://github.com/FlagAI-Open/FlagAI/), the [FlagAI Zhihu account](https://www.zhihu.com/people/95-22-20-18), the [FlagAI official technical discussion group](https://github.com/FlagAI-Open/FlagAI/blob/master/wechat-qrcode.jpg), and the WeChat official accounts of BAAI (Beijing Academy of Artificial Intelligence) and the BAAI Community.
| Model | Model type | Description | File path | Download weights | Status | Training GPUs |
| :---------------- | :------- | :-- |:-- | :-- | :-- | :-- |
| Aquila-7B | Base model, 7 billion parameters | The **Aquila base model** inherits the architectural design strengths of GPT-3 and LLaMA, replaces a set of lower-level operators with more efficient implementations, redesigns and reimplements a Chinese-English bilingual tokenizer, and upgrades the BMTrain parallel training method, achieving nearly 8x the training efficiency of Megatron+DeepSpeed ZeRO-2. | [./examples/Aquila/Aquila-pretrain](https://github.com/FlagAI-Open/FlagAI/tree/master/examples/Aquila/Aquila-pretrain) | [Download Aquila-7B](http://model.baai.ac.cn/model-detail/100098) | Released | Nvidia-A100 |
| Aquila-33B | Base model, 33 billion parameters | Same as above | —— | —— | **Coming soon** | Nvidia-A100 |
| AquilaChat-7B | SFT model, fine-tuned and reinforcement-trained on Aquila-7B | The **AquilaChat dialogue model** supports fluent text dialogue and a wide range of language generation tasks. By defining an extensible specification of special instructions, AquilaChat can call other models and tools and is easy to extend. <br><br>For example, it calls BAAI's open-source **[AltDiffusion](https://github.com/FlagAI-Open/FlagAI/tree/master/examples/AltDiffusion-m18) multilingual text-to-image model** for fluent text-to-image generation, and works with BAAI's **InstructFace multi-step controllable text-to-image model** to perform multi-step controllable editing of face images. | [./examples/Aquila/Aquila-chat](https://github.com/FlagAI-Open/FlagAI/tree/master/examples/Aquila/Aquila-chat) | [Download AquilaChat-7B](https://model.baai.ac.cn/model-detail/100101) | Released | Nvidia-A100 |
| AquilaChat-33B | SFT model, fine-tuned and reinforcement-trained on Aquila-33B | Same as above | —— | —— | **Coming soon** | Nvidia-A100 |
| AquilaCode-7B-NV | Base model for text-to-code generation, further pre-trained from Aquila-7B, trained on Nvidia chips | AquilaCode-7B achieves strong performance with a small dataset and small parameter count, and is currently the best-performing open-source code model with Chinese-English bilingual support. It is trained on code data that passed high-quality filtering and carries compliant open-source licenses. <br><br>AquilaCode-7B has been trained on both Nvidia and domestic chips. | [./examples/Aquila/Aquila-code](https://github.com/FlagAI-Open/FlagAI/tree/master/examples/Aquila/Aquila-code) | [Download AquilaCode-7B-NV](https://model.baai.ac.cn/model-detail/100102) | Released | Nvidia-A100 |
| AquilaCode-7B-TS | Base model for text-to-code generation, further pre-trained from Aquila-7B, trained on Tianshu chips | Same as above | [./examples/Aquila/Aquila-code](https://github.com/FlagAI-Open/FlagAI/tree/master/examples/Aquila/Aquila-code) | [Download AquilaCode-7B-TS](https://model.baai.ac.cn/model-detail/100099) | Released | Tianshu-BI-V100 |
The Aquila series will keep releasing improved open-source versions. To upgrade, first delete the old `model_pytorch.bin` in your model directory and then download the new weights; everything else stays the same. See the **[changelog](https://github.com/FlagAI-Open/FlagAI/blob/master/examples/Aquila/changelog_zh.md)** for details.

## Quick start with the AquilaChat-7B dialogue model
## How to use

### 1. Inference