---
license: apache-2.0
language:
- en
- zh
library_name: transformers
pipeline_tag: text-generation
---

# Model Card for MindLLM

## Model Details

### Model Description

MindLLM 1.3B is a Transformer model with 1.3 billion parameters developed by the *Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Applications* and the *Beijing Institute of Technology Southeast Academy of Information Technology*. It was trained on bilingual data sources, including the Pile, Wudao, CBooks, and other self-collected data consisting of websites filtered for safety and educational value. When assessed on benchmarks testing common sense, language understanding, and logical reasoning, MindLLM showed strong performance and even surpassed some models with fewer than 13 billion parameters.

The model has been fine-tuned on an instruction dataset in chat format, but it has not been fine-tuned with reinforcement learning from human feedback. The intention behind crafting this open-source model is to provide the research community with an unrestricted small model for exploring vital safety challenges and adapting to domain-specific applications.

- **Developed by:** *Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Applications* & *Beijing Institute of Technology Southeast Academy of Information Technology*
- **Model type:** Pretrained causal language model
- **Language(s) (NLP):** Chinese & English
- **License:** apache-2.0
- **Trained from scratch**

### Model Sources

- **Paper:** https://arxiv.org/abs/2310.15777

To cite this model, please use:

```bib
@article{mindllm,
  title={MindLLM: Pre-training Lightweight Large Language Model from Scratch, Evaluations and Domain Applications},
  author={Yang, Yizhe and Sun, Huashan and Li, Jiawei and Liu, Runheng and Li, Yinghao and Liu, Yuhang and Huang, Heyan and Gao, Yang},
  journal={arXiv preprint arXiv:2310.15777},
  year={2023}
}
```

## Uses

### Direct Use

The model has been supervised fine-tuned on instruction data in a special chat format, so you can use it directly with a pipeline for text generation. This example generates a different sequence each time it is run:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextGenerationPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained('mindllm_path')
tokenizer.max_length = 1024
model = AutoModelForCausalLM.from_pretrained('mindllm_path').to(device)
generator = TextGenerationPipeline(model=model, tokenizer=tokenizer, device=device)

context = ": 你知道电动车相对传统汽油车有哪些优点吗?"
outputs = generator(context, max_new_tokens=1024, do_sample=True, num_beams=4,
                    repetition_penalty=0.5, no_repeat_ngram_size=5, return_full_text=False)

[{'generated_text': '电动车相对传统汽油车的优点包括:\n1. 更低的排放和更高的能源效率 - 电动车所产生的有害排放物质远少于汽油车,并且它们的能源利用效率更高。\n2. 更低的维护成本 - 电动车需要更少的保养和通常拥有较少的运动部件,从而降低了总体维护成本。\n3. 更低的燃料成本 - 电动车需要比汽油车少得多的燃料,因此随着时间的推移,可以节省成本。\n4. 更长的续航里程 - 电动车单次充电可以行驶比汽油车更远的距离,非常适合长途通勤。\n5. 更为安静的运行 - 电动车比汽油车要安静得多,使驾驶更加愉悦。'}]
```
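If you prefer to call `generate` directly rather than going through the pipeline wrapper, the sketch below shows one way to do it. It reuses the placeholder path `'mindllm_path'` and the prompt format from the example above; the sampling parameters are illustrative defaults, not tuned values from the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained('mindllm_path')
model = AutoModelForCausalLM.from_pretrained('mindllm_path').to(device)

prompt = ": 你知道电动车相对传统汽油车有哪些优点吗?"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=256,      # illustrative value; adjust as needed
        do_sample=True,
        top_p=0.9,
        temperature=0.7,
        no_repeat_ngram_size=5,
    )

# Decode only the newly generated tokens, mirroring return_full_text=False above.
completion = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(completion)
```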
### Other Examples

<details>
<summary>Self Awareness</summary>

user: 你好,请问你叫什么名字?

assistant: 嗨,我是一个AI助手。

user: 请问你如何看待AI助手?

assistant: 我认为AI助手是一个非常有前途的技术,它可以在多个领域帮助人们解决问题和完成任务。但是同时,我也认为在使用AI助手时要注意保护用户隐私和安全。

user: 你能介绍一下AI助手优点和缺点吗?

assistant: 当然。AI助手的优点包括:能够快速学习,处理大量数据;能够根据用户的偏好进行个性化推荐;能够进行自动化任务等等。缺点包括:需要大量的数据来训练模型,且有时候数据质量可能不如人类等等。

user: 你觉得AI助手对人类存在的威胁或者问题是什么?

assistant: 我认为AI助手在未来可能会对人类存在的威胁是安全和隐私问题。因为人工智能技术的发展速度非常快,可能会带来一些安全和隐私方面的风险。这也需要人类更加关注和管理这些风险。

user: 你可以具体举例说明一下吗?

assistant: 比如说,一个在线购物平台的AI助手可以通过收集用户购买记录,来预测用户的需求,从而定制最合适的推荐。这样不仅可以减少欺诈问题,还可以提高用户的满意度。但对于一些需要实时查询销售信息的行业和场景,如何保证用户的隐私信息不被泄露,这是一个需要考虑的问题。

user: 你可以具体说一说,人类应该如何关注和管理AI助手带来的安全和隐私问题吗?

assistant: 当然,这可以通过加强用户教育,让用户了解AI助手的工作原理和安全措施,提供安全的使用环境等等来实现。同时,也可以对AI助手的用户数据进行保护和隐私评估,并制定相关的法律等等。我们也可以通过技术手段,如大数据分析和人工智能加密等技术,来帮助人们识别和关注AI助手带来的安全和隐私问题。

user: 好的,谢谢你!

assistant: 不用谢,我随时为您效劳!

</details>
<details>
<summary>Small Talk</summary>

user: 你好

assistant: 您好,有什么需要我帮忙的吗?

user: 请问有哪些方法可以缓解焦虑情绪?

assistant: 有的话,可以通过运动、音乐、冥想等方式来放松身心,还有一些药物也可以缓解焦虑。您可以按照我提供的建议来选择适合自己的方式,缓解焦虑情绪。

user: 你可以详细下吗?

assistant: 当我们感到焦虑时,大脑会产生一种内分泌系统的反应,这种反应会导致我们感到紧张、不安和无法集中注意力。这个时候可以尝试一些让自己放松的方法,比如深呼吸、瑜伽、冥想等。这些方法可以帮助我们缓解焦虑情绪,让我们感觉更好。

user: 请详细介绍下冥想的方法。

assistant: 冥想是一种可以帮助我们集中注意力,让心情平静、内心充实的活动。在冥想中,我们可以选择一个特定的地点,舒展身体、调整气息,让自己放松、专注和冥思。在冥想中,我们通常会采用舒缓的音乐,让自己的思维和情绪自由流动,缓解焦虑情绪。

user: 有什么音乐可以推荐给我吗?缓解我的焦虑情绪。

assistant: 常听舒缓的音乐,比如钢琴曲、古典音乐、流行音乐等,可以缓解焦虑情绪,促进身心放松。比如,您可以试试听听轻柔的钢琴曲,或者是轻声的流行歌曲,这些都可以缓解焦虑情绪。此外,您也可以通过听冥想音乐,它可以帮助您放松身心、减轻压力,带来更好的效果。

user: 谢谢。

assistant: 不客气,有需要再找我哦。

</details>
<details>
<summary>Coding</summary>

user: 找到下列数组的中位数[3.1,6.2,1.3,8.4,10.5,11.6,2.1],请用python代码完成以上功能

assistant: 对数组进行排序,然后取中间的数作为中位数,最后把这个数输出即可完成所需的操作。

代码实现:

```python
def median(numbers):
    sorted_numbers = sorted(numbers)
    length = len(numbers)
    mid_index = length // 2
    if length % 2 == 0:
        return (sorted_numbers[mid_index - 1] + sorted_numbers[mid_index]) / 2
    else:
        return sorted_numbers[mid_index]
```

代码解释:

首先,定义了一个名为 median 的函数,它接受一个数字列表作为参数。函数使用内置函数 sorted,对数字列表进行排序,然后使用 len 函数计算所有数字的长度,并判断长度是奇数还是偶数,如果是,则返回相应的中位数。如果长度为奇数,则返回中位数。否则,将数组分成两半,找到中间的数,然后返回它。

</details>
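For reference, the function generated in the coding transcript above can be checked directly against the array from the prompt. This is only a quick sanity check of the generated snippet, not part of the model output:

```python
# Quick sanity check of the median() function from the transcript above.
def median(numbers):
    sorted_numbers = sorted(numbers)
    length = len(numbers)
    mid_index = length // 2
    if length % 2 == 0:
        return (sorted_numbers[mid_index - 1] + sorted_numbers[mid_index]) / 2
    return sorted_numbers[mid_index]

print(median([3.1, 6.2, 1.3, 8.4, 10.5, 11.6, 2.1]))  # 6.2, the middle of the 7 sorted values
```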
## Training Details

### Training Data

Our training corpus is a diverse blend of English and Chinese data sources. The English component originates from the Pile dataset, and the Chinese component comprises data from Wudao, CBooks, and data gathered through web crawling. To ensure data quality, we run a thorough preprocessing pipeline: rigorous data cleaning to purge special tags, data deduplication using locality-sensitive hashing (LSH), and comprehensive filtering to eliminate low-quality content, which predominantly comes from advertisements or inappropriate material. A minimal illustrative sketch of the deduplication step appears at the end of this card.

We also examine the relationship between data volume and model capacity, assess the impact of different data types on model fitting, and evaluate training stability when handling mixed data sources. This analysis offers valuable insights into the vital role of pre-training data and the complexities of processing it. Based on this data engineering experience, we also apply careful data-mixture strategies when constructing the training data.

### Training Procedure

This version of the model was trained on about 241 billion English tokens and 82 billion Chinese tokens with a two-stage training strategy. It was trained as an autoregressive language model using a cross-entropy loss.

This version of the model was also fine-tuned on 4 million Chinese instruction samples collected from open-source instruction-tuning datasets. The instruction-tuning stage enables the model to answer questions and hold multi-turn conversations **in Chinese**.

**For more detailed information, please refer to the paper.**
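As a concrete illustration of the LSH-based deduplication mentioned in the Training Data section, here is a minimal sketch. It uses the `datasketch` library and character n-gram shingles purely for illustration; this card does not specify the actual implementation or parameters used, so all choices below are assumptions.

```python
# Minimal sketch of near-duplicate filtering with MinHash LSH (illustrative only).
from datasketch import MinHash, MinHashLSH

def shingle_minhash(text, n=5, num_perm=128):
    """Build a MinHash over character n-grams (works for Chinese and English)."""
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - n + 1, 1)):
        m.update(text[i:i + n].encode("utf-8"))
    return m

def deduplicate(documents, threshold=0.8, num_perm=128):
    """Keep only documents that are not near-duplicates of an earlier one."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for idx, text in enumerate(documents):
        mh = shingle_minhash(text, num_perm=num_perm)
        if lsh.query(mh):          # a similar document is already retained
            continue
        lsh.insert(str(idx), mh)   # index the newly kept document
        kept.append(text)
    return kept

corpus = [
    "电动车的优点包括更低的排放。",
    "电动车的优点包括更低的排放。",   # exact duplicate, filtered out
    "The Pile is an English corpus.",
]
print(len(deduplicate(corpus)))  # -> 2
```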