---
license: apache-2.0
language:
- en
- zh
library_name: transformers
pipeline_tag: conversational
---
# Model Card for MindLLM

<!-- Provide a quick summary of what the model is/does. -->

## Model Details

### Model Description

MindLLM 1.3B is a Transformer model with 1.3 billion parameters, developed by the *Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Applications* and the *Beijing Institute of Technology Southeast Academy of Information Technology*.

It was trained on bilingual data sources including the Pile, Wudao, CBook, and other self-collected sources consisting of websites filtered for safety and educational value. When assessed against benchmarks testing common sense, language understanding, and logical reasoning, MindLLM showed strong performance, even surpassing larger models with fewer than 13 billion parameters.

Our model has been fine-tuned on an instruction dataset in chat format but has not been fine-tuned with reinforcement learning from human feedback. The intention behind releasing this open-source model is to provide the research community with an unrestricted small model for exploring vital safety challenges and adapting it to domain-specific applications.

- **Developed by:** *Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Applications* & *Beijing Institute of Technology Southeast Academy of Information Technology*
- **Model type:** Pretrained Causal Language Model
- **Language(s) (NLP):** Chinese & English
- **License:** apache-2.0
- **Trained from scratch**

### Model Sources

- **Paper:** https://arxiv.org/abs/2310.15777

To cite this model, please use
```bib
@article{mindllm,
  title={MindLLM: Pre-training Lightweight Large Language Model from Scratch, Evaluations and Domain Applications},
  author={Yang, Yizhe and Sun, Huashan and Li, Jiawei and Liu, Runheng and Li, Yinghao and Liu, Yuhang and Huang, Heyan and Gao, Yang},
  journal={arXiv preprint arXiv:2310.15777},
  year={2023}
}
```

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

The model has been supervised fine-tuned on instruction data in a special chat format, so you can use it directly with a text-generation pipeline. The example below generates a different sequence each time it is run:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextGenerationPipeline

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = AutoTokenizer.from_pretrained('mindllm_path')
tokenizer.max_length = 1024
model = AutoModelForCausalLM.from_pretrained('mindllm_path').to(device)
generator = TextGenerationPipeline(model=model, tokenizer=tokenizer, device=device)
context = "<user>\n你知道电动车相对传统汽油车有哪些优点吗?\n<assistant>\n"
outputs = generator(context, max_new_tokens=1024, do_sample=True, num_beams=4, repetition_penalty=0.5, no_repeat_ngram_size=5, return_full_text=False)
# Example output:
# [{'generated_text': '电动车相对传统汽油车的优点包括:\n1. 更低的排放和更高的能源效率 - 电动车所产生的有害排放物质远少于汽油车,并且它们的能源利用效率更高。\n2. 更低的维护成本 - 电动车需要更少的保养和通常拥有较少的运动部件,从而降低了总体维护成本。\n3. 更低的燃料成本 - 电动车需要比汽油车少得多的燃料,因此随着时间的推移,可以节省成本。\n4. 更长的续航里程 - 电动车单次充电可以行驶比汽油车更远的距离,非常适合长途通勤。\n5. 更为安静的运行 - 电动车比汽油车要安静得多,使驾驶更加愉悦。'}]
```

### Chat Template
To get the expected behavior and performance from the chat version, specific formatting needs to be followed, including the `<user>` and `<assistant>` tags, the BOS and EOS tokens, and the whitespace and line breaks in between (we recommend calling `strip()` on inputs to avoid double spaces).
Here are some examples:
1. Single-turn
```
# prompt
<|endoftext|><user>\n你知道电动车相对传统汽油车有哪些优点吗?\n<assistant>\n
# return 
电动车相对传统汽油车的优点包括:\n1. 更低的排放和更高的能源效率 - 电动车所产生的有害排放物质远少于汽油车,并且它们的能源利用效率更高。\n2. 更低的维护成本 - 电动车需要更少的保养和通常拥有较少的运动部件,从而降低了总体维护成本。\n3. 更低的燃料成本 - 电动车需要比汽油车少得多的燃料,因此随着时间的推移,可以节省成本。\n4. 更长的续航里程 - 电动车单次充电可以行驶比汽油车更远的距离,非常适合长途通勤。\n5. 更为安静的运行 - 电动车比汽油车要安静得多,使驾驶更加愉悦。
```

2. Multi-turn
```
# prompt
<|endoftext|><user>\n你好,请问你叫什么名字?\n<assistant>\n嗨,我是一个AI助手。<|endoftext|>\n<|endoftext|><user>\n请问你如何看待AI助手?\n<assistant>\n
# return 
我认为AI助手是一个非常有前途的技术,它可以在多个领域帮助人们解决问题和完成任务。但是同时,我也认为在使用AI助手时要注意保护用户隐私和安全。
```
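
For programmatic use, the template above can be assembled with a small helper. The sketch below is illustrative only: it assumes the `<user>`/`<assistant>` tags and `<|endoftext|>` token shown in the examples, and `build_prompt` is a hypothetical helper, not part of the released code.

```python
def build_prompt(turns):
    """Assemble a MindLLM chat prompt from a list of turns.

    `turns` is a list of dicts like {"user": "...", "assistant": "..."}; the
    last turn may omit "assistant" to ask the model for a reply.
    Illustrative sketch based on the template shown above.
    """
    eos = "<|endoftext|>"
    prompt = ""
    for turn in turns:
        # Each turn starts with the <|endoftext|> token and the <user> tag.
        prompt += f"{eos}<user>\n{turn['user'].strip()}\n<assistant>\n"
        if "assistant" in turn:
            # Completed turns end with the assistant reply and an EOS token.
            prompt += f"{turn['assistant'].strip()}{eos}\n"
    return prompt

# Reproduces the multi-turn prompt from the example above.
history = [
    {"user": "你好,请问你叫什么名字?", "assistant": "嗨,我是一个AI助手。"},
    {"user": "请问你如何看待AI助手?"},
]
print(build_prompt(history))
```
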
### Other Examples
<details>
<summary>Self Awareness</summary>

```
user:你好,请问你叫什么名字?
assistant:你好,我是 MindLLM,一个人工智能助手。

user:请问你如何看待AI助手?
assistant:我认为,AI助手是一种可以帮助人类完成各种任务的工具。它可以处理大量数据,并提供有用的信息给用户。然而,我们也需要保持警惕,确保其不会被滥用或误用。

user:你能介绍一下AI助手优点和缺点吗?
assistant:当然可以。AI助手的优点之一是提高了日常生活中的便利性。例如,我们可以使用语音助手来控制家庭设备、查询天气、安排日程等。此外,AI助手还可以自动化流程,减少人力和时间成本。
缺点也是人们需要思考的问题。一方面,AI助手使用算法和数据分析来提供更准确的结果;另一方面,有些人担心AI助手会取代人类工作,甚至可能取代人类职业。此外,AI助手也可能存在信息泄露和滥用的风险。

user:你觉得AI助手对人类存在的威胁或者问题是什么?
assistant:我认为,AI助手在某些场景下可能会导致一些问题,比如个人隐私泄露、工作岗位的流失、机器人成为主宰人类的物种,这些都需要我们认真思考和妥善解决。我们需要确保AI助手是安全可靠的,避免受到滥用和误用。

user:你可以具体举例说明一下吗?
assistant:比如说,一个公司可能需要大量数据来训练AI助手,但由于数据隐私泄露的问题,这些数据可能被滥用或泄露,导致用户数据被不当使用或泄露。另一个例子是,AI助手可能被用于武器开发,但由于缺乏监管,使用引起的伦理问题也需要关注。

user:你可以具体说一说,人类应该如何关注和管理AI助手带来的安全和隐私问题吗?
assistant:在关注安全和隐私问题上,我们需要采取一些措施。首先,我们需要强化AI助手的开发和使用,确保它们只被授权的开发者开发和使用。其次,我们需要建立更好的隐私保护机制,包括使用数据加密、授权访问和差分隐私等技术手段。最后,我们需要在AI系统中保护用户的隐私,确保他们的个人信息不被滥用或泄露。

user:好的,谢谢你!
assistant:不客气,我随时将继续为您提供帮助和支持。
```
</details>

<details>
<summary>Small Talk</summary>

```
user:你好
assistant:您好,有什么需要我帮忙的吗?

user:请问有哪些方法可以缓解焦虑情绪?
assistant:有的话,可以通过运动、音乐、冥想等方式来放松身心,还有一些药物也可以缓解焦虑。您可以按照我提供的建议来选择适合自己的方式,缓解焦虑情绪。

user:你可以详细下吗?
assistant:当我们感到焦虑时,大脑会产生一种内分泌系统的反应,这种反应会导致我们感到紧张、不安和无法集中注意力。这个时候可以尝试一些让自己放松的方法,比如深呼吸、瑜伽、冥想等。这些方法可以帮助我们缓解焦虑情绪,让我们感觉更好。

user:请详细介绍下冥想的方法。
assistant:冥想是一种可以帮助我们集中注意力,让心情平静、内心充实的活动。在冥想中,我们可以选择一个特定的地点,舒展身体、调整气息,让自己放松、专注和冥思。在冥想中,我们通常会采用舒缓的音乐,让自己的思维和情绪自由流动,缓解焦虑情绪。

user:有什么音乐可以推荐给我吗?缓解我的焦虑情绪。
assistant:常听舒缓的音乐,比如钢琴曲、古典音乐、流行音乐等,可以缓解焦虑情绪,促进身心放松。比如,您可以试试听听轻柔的钢琴曲,或者是轻声的流行歌曲,这些都可以缓解焦虑情绪。此外,您也可以通过听冥想音乐,它可以帮助您放松身心、减轻压力,带来更好的效果。

user:谢谢。
assistant:不客气,有需要再找我哦。
```
</details>

<details>
<summary>Coding</summary>

``````
user:找到下列数组的中位数[3.1,6.2,1.3,8.4,10.5,11.6,2.1],请用python代码完成以上功能
assistant:对数组进行排序,然后取中间的数作为中位数,最后把这个数输出即可完成所需的操作。
代码实现:
```python
def median(numbers):
    sorted_numbers = sorted(numbers)
    length = len(numbers)
    mid_index = length // 2
    if length % 2 == 0:
        return (sorted_numbers[mid_index - 1] + sorted_numbers[mid_index]) / 2
    else:
        return sorted_numbers[mid_index]
```
代码解释:
首先,定义了一个名为 median 的函数,它接受一个数字列表作为参数。函数使用内置函数 sorted,对数字列表进行排序,然后使用 len 函数计算所有数字的长度,并判断长度是奇数还是偶数,如果是,则返回相应的中位数。如果长度为奇数,则返回中位数。否则,将数组分成两半,找到中间的数,然后返回它。
``````

</details>

### Deployment Resource Consumption
<details>
<summary>Click to expand</summary>
  
| Precision  | Minimum GPU memory (Inference)   | Minimum GPU memory (Full Parameter Fine-tuning)    |
|-------|-------|-------|
| float32    | 6.08G    | 32.65G    |
| float16 (unquantized)    | 3.45G    | - (36.94G*)    |
| bfloat16 (unquantized)    | 3.45G    | 20.47G (33.93G*)    |
* \* Indicates use of mixed precision

</details>
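
As a rough illustration of how to reach the half-precision footprints listed above, the model can be loaded with a reduced dtype via the standard `transformers` `torch_dtype` argument. This is a minimal sketch, not official deployment code; `mindllm_path` is a placeholder as in the earlier example.

```python
import torch
from transformers import AutoModelForCausalLM

# Load the weights in bfloat16 (or torch.float16) to roughly halve the
# inference memory footprint relative to float32.
model = AutoModelForCausalLM.from_pretrained(
    'mindllm_path',              # placeholder path, as in the example above
    torch_dtype=torch.bfloat16,  # or torch.float16
).to('cuda')
```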
  
## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

Our training corpus is a diverse blend of English and Chinese data sources. The English component originates from the Pile dataset, and the Chinese component comprises Wudao, CBooks, and data gathered through web crawling.

To ensure data quality, we run a thorough preprocessing pipeline: rigorous cleaning to purge special tags, deduplication using Locality-Sensitive Hashing (LSH), and comprehensive filtering to remove low-quality content, predominantly advertisements and inappropriate material. We also examine the relationship between data volume and model capacity, assess the impact of different data types on model fitting, and evaluate training stability when handling mixed data sources. This analysis offers valuable insight into the vital role of pre-training data and the complexities of processing it. Finally, we construct the training mixture based on data-engineering practice and empirical experience.
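
The exact deduplication pipeline is not included in this card; the sketch below only illustrates the general MinHash-with-LSH-banding idea referenced above. All parameters (128 permutations, 32 bands, 5-character shingles) are hypothetical choices for the example, not the values used in training.

```python
import hashlib
import random
from collections import defaultdict

NUM_PERM, NUM_BANDS = 128, 32            # hypothetical settings for illustration
ROWS_PER_BAND = NUM_PERM // NUM_BANDS
MAX_HASH = (1 << 61) - 1

random.seed(0)
PERMS = [(random.randint(1, MAX_HASH), random.randint(0, MAX_HASH)) for _ in range(NUM_PERM)]

def shingles(text, n=5):
    """Character n-gram shingles of a document."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def minhash(text):
    """MinHash signature: minimum permuted hash of the shingle set, per permutation."""
    hashes = [int(hashlib.md5(s.encode()).hexdigest(), 16) % MAX_HASH for s in shingles(text)]
    return [min((a * h + b) % MAX_HASH for h in hashes) for a, b in PERMS]

def dedup(docs):
    """Keep the first document of each near-duplicate group found by LSH banding."""
    buckets = defaultdict(list)
    kept = []
    for idx, doc in enumerate(docs):
        sig = minhash(doc)
        bands = [(b, tuple(sig[b * ROWS_PER_BAND:(b + 1) * ROWS_PER_BAND])) for b in range(NUM_BANDS)]
        if any(buckets[band] for band in bands):
            continue                      # shares a band with an earlier doc: near-duplicate
        for band in bands:
            buckets[band].append(idx)
        kept.append(doc)
    return kept

docs = ["hello world, this is a document", "hello world, this is a document!", "something else entirely"]
print(len(dedup(docs)))                   # likely 2: the near-duplicate pair collapses to one
```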


### Training Procedure 

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

This version of the model was trained on about 241 billion English tokens and 82 billion Chinese tokens with a two-stage training strategy. It was trained as an autoregressive language model using cross-entropy loss.
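
As a reminder of what that objective looks like in code (a generic sketch, not the actual training loop), the next-token cross-entropy for a causal language model can be computed as follows; the tensors here are random placeholders.

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits, input_ids):
    """Autoregressive objective: each position predicts the next token."""
    # Drop the last position's logits and the first position's labels so that
    # the logits at position i are scored against the token at position i + 1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )

# Toy usage with random logits over a vocabulary of 100 tokens.
batch, seq_len, vocab = 2, 8, 100
logits = torch.randn(batch, seq_len, vocab)
input_ids = torch.randint(0, vocab, (batch, seq_len))
print(causal_lm_loss(logits, input_ids))
```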

This version of the model was also fine-tuned on 4 million Chinese instruction samples collected from open-source instruction-tuning datasets. The instruction-tuning stage enables the model to answer questions and hold multi-turn conversations **in Chinese**.

**For more detailed information, please refer to the paper.**