Update README.md
README.md (CHANGED)
The Aquila language model inherits the architectural design strengths of GPT-3 and LLaMA, replaces a set of underlying operators with more efficient implementations, and redesigns the tokenizer for Chinese-English bilingual support. It upgrades the BMTrain parallel-training method, reaching nearly 8 times the training efficiency of Megatron+DeepSpeed ZeRO-2 during Aquila's training. The model is trained from scratch on high-quality Chinese and English corpora; through data-quality control and a range of training optimizations, it achieves better performance than other open-source models with smaller datasets and shorter training time. It is also the first large-scale open-source language model that supports Chinese-English bilingual knowledge, permits commercial use, and complies with domestic (Chinese) data regulations.

| 模型/Model | 状态/State | 能否商用/Commercial use? | 所用显卡/GPU |
| :---------------- | :------- | :-- | :-- |
| Aquila-7B | 已发布/Released | ✅ | Nvidia-A100 |
| AquilaChat-7B | 已发布/Released | ✅ | Nvidia-A100 |
| AquilaCode-7B-NV | 已发布/Released | ✅ | Nvidia-A100 |
| AquilaCode-7B-TS | 已发布/Released | ✅ | Tianshu-BI-V100 |
| Aquila-33B | **敬请期待/Coming soon** | ✅ | Nvidia-A100 |
| AquilaChat-33B | **敬请期待/Coming soon** | ✅ | Nvidia-A100 |

我们使用了一系列更高效的底层算子来辅助模型训练,其中包括参考[flash-attention](https://github.com/HazyResearch/flash-attention)的方法并替换了一些中间计算,同时还使用了RMSNorm。在此基础上,我们升级了[BMTrain](https://github.com/OpenBMB/BMTrain)技术进行轻量化的并行训练,该技术采用了数据并行、ZeRO(零冗余优化器)、优化器卸载、检查点和操作融合、通信-计算重叠等方法来优化模型训练过程。

We use a series of more efficient low-level operators to assist with model training, including methods adapted from [flash-attention](https://github.com/HazyResearch/flash-attention) with some intermediate computations replaced, as well as RMSNorm. On top of this, we upgraded [BMTrain](https://github.com/OpenBMB/BMTrain) for lightweight parallel training, which applies data parallelism, ZeRO (the zero-redundancy optimizer), optimizer offloading, checkpointing with operator fusion, and communication-computation overlap to optimize the training process.
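For reference, the snippet below is a minimal, self-contained sketch of RMSNorm, the normalization mentioned above; it is illustrative only and is not the implementation used in Aquila's training code.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescales activations by their RMS with a
    learned gain, but (unlike LayerNorm) subtracts no mean and adds no bias."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize over the hidden dimension, then apply the learned gain.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms
```

Dropping the mean-centering and bias makes the operator cheaper than LayerNorm while behaving similarly in practice, which is why it is a common substitution in LLaMA-style models.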
We used different tokenizers to extract ten thousand data samples each from English, Chinese, and code corpora and compared the average number of tokens per sample, as shown in the table below.

| 模型/Model | 词表大小/Vocab size | 说明/Note | 英文平均tokens量/Avg tokens (English) | 中文平均tokens量/Avg tokens (Chinese) | 代码平均tokens量/Avg tokens (code) |
| ----- | ---- | ----- | ---- | ----- | ---- |
| GPT2 | 50527 | bpe | 1717 | 1764 | 2323 |
| LLaMA | 32000 | sp(bpe) | 1805 | 1257 | 1970 |
| Aquila | 100000 | bpe | 1575 | 477 | 1679 |
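
A comparison of this kind can be reproduced with a short script along the following lines. It is a sketch that assumes the Hugging Face `transformers` tokenizers; the tokenizer IDs and the tiny sample lists are placeholders, and the Aquila tokenizer itself is not included here.

```python
# Sketch: average token counts per sample for different tokenizers.
# Tokenizer IDs and the tiny sample lists below are illustrative only.
from transformers import AutoTokenizer

samples = {
    "English": ["The quick brown fox jumps over the lazy dog."],
    "Chinese": ["悟道·天鹰语言大模型在中英文语料上从零开始训练。"],
    "Code": ["def add(a, b):\n    return a + b\n"],
}

tokenizers = {
    "GPT2": AutoTokenizer.from_pretrained("gpt2"),
    # "LLaMA": AutoTokenizer.from_pretrained("huggyllama/llama-7b"),  # example ID, may require access
}

for name, tok in tokenizers.items():
    for domain, texts in samples.items():
        avg = sum(len(tok.encode(t)) for t in texts) / len(texts)
        print(f"{name:8s} {domain:8s} avg tokens/sample: {avg:.1f}")
```

In the actual measurement, each list would hold the ten thousand samples per domain mentioned above.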
## 训练数据集/Training data
Aquila预训练使用了Pile、[RedPajama-Data-1T](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T)、[Wikipedia](https://huggingface.co/datasets/wikipedia)、[C4](https://huggingface.co/datasets/c4)、悟道中文数据集、电子书、专利、百科、论坛、GitHub数据等,详情可见下图。

The Aquila-7B model was pretrained on the Pile, [RedPajama-Data-1T](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T), [Wikipedia](https://huggingface.co/datasets/wikipedia), [C4](https://huggingface.co/datasets/c4), the Wudao Chinese corpus, e-books, patents, encyclopedias, forums, GitHub data, etc. Details are given in the figure below.

![Screenshot](./img/data_dist.png)
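
The linked corpora can be inspected directly from the Hugging Face Hub. The snippet below is a small illustration, not part of Aquila's training pipeline; it assumes the `datasets` library and the `20220301.en` configuration listed on the Wikipedia dataset card, which may change with library versions.

```python
# Illustration only: stream a few documents from one of the corpora cited above.
# Assumes the Hugging Face `datasets` library; the dataset ID and config name
# follow the linked dataset card and may change over time.
from datasets import load_dataset

wiki = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)

for i, doc in enumerate(wiki):
    # Each record carries an article title and its plain text.
    print(doc["title"], "-", len(doc["text"]), "characters")
    if i >= 2:
        break
```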
## 使用方式/How to use
### 1. 预训练/Pre-training