qq8933 committed on
Commit
b396c5e
1 Parent(s): a940489

Create README.md

Files changed (1)
  1. README.md +157 -0
README.md ADDED
@@ -0,0 +1,157 @@
+ ---
+ license: apache-2.0
+ pipeline_tag: text-generation
+ tags:
+ - chemistry
+ language:
+ - en
+ - zh
+ ---
+ # ChemLLM-2B: Mini LLM for Chemistry and Molecule Science
+
+
+ ChemLLM, the first open-source large language model for chemistry and molecular science, built on InternLM-2 with ❤
+ [![Paper page](https://huggingface.co/datasets/huggingface/badges/resolve/main/paper-page-sm.svg)](https://huggingface.co/papers/2402.06852)
+
+ <center><img src='https://cdn-uploads.huggingface.co/production/uploads/64bce15bafd1e46c5504ad38/wdFV6p3rTBCtskbeuVwNJ.png'></center>
+
+ ## News
+ - ChemLLM-1.5 released! Two versions are available: [AI4Chem/ChemLLM-7B-Chat-1.5-DPO](https://huggingface.co/AI4Chem/ChemLLM-7B-Chat-1.5-DPO) and [AI4Chem/ChemLLM-7B-Chat-1.5-SFT](https://huggingface.co/AI4Chem/ChemLLM-7B-Chat-1.5-SFT). [2024-4-2]
+ - ChemLLM-1.5 updated! Try it on the [Demo Site](https://chemllm.org/#/chat) or see the [API Reference](https://api.chemllm.org/docs). [2024-3-23]
+ - ChemLLM was featured by HuggingFace on the [“Daily Papers” page](https://huggingface.co/papers/2402.06852). [2024-2-13]
+ - ChemLLM arXiv preprint released: [ChemLLM: A Chemical Large Language Model](https://arxiv.org/abs/2402.06852). [2024-2-10]
+ - News report from [Shanghai AI Lab](https://mp.weixin.qq.com/s/u-i7lQxJzrytipek4a87fw). [2024-1-26]
+ - ChemLLM-7B-Chat ver 1.0 released. https://chemllm.org/ [2024-1-18]
+ - ChemLLM-7B-Chat ver 1.0 open-sourced. [2024-1-17]
+ - Chepybara ver 0.2 online demo released. https://chemllm.org/ [2023-12-9]
+
+ ## Usage
+ Try the [online demo](https://chemllm.org/) instantly, or run the model locally.
+
+ Install `transformers`, along with `torch` and `accelerate` (the latter is needed for `device_map="auto"`):
+ ```
+ pip install transformers torch accelerate
+ ```
+ Load `AI4Chem/CHEMLLM-2b-1_5` and run:
+ ```
+ from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
+ import torch
+
+ model_name_or_id = "AI4Chem/CHEMLLM-2b-1_5"
+
+ # trust_remote_code=True is required because the model ships custom modeling code.
+ model = AutoModelForCausalLM.from_pretrained(model_name_or_id, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True)
+ tokenizer = AutoTokenizer.from_pretrained(model_name_or_id, trust_remote_code=True)
+
+ prompt = "What is Molecule of Ibuprofen?"
+
+ # Move the inputs to whichever device the model was placed on.
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+ generation_config = GenerationConfig(
+     do_sample=True,
+     top_k=1,
+     temperature=0.9,
+     max_new_tokens=500,
+     repetition_penalty=1.5,
+     pad_token_id=tokenizer.eos_token_id
+ )
+
+ outputs = model.generate(**inputs, generation_config=generation_config)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
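
The generation settings above combine `do_sample=True` with `top_k=1`. Top-k filtering with k = 1 leaves only the highest-scoring token, so sampling becomes effectively greedy and `temperature` has no influence on the choice. The following is a minimal, library-free sketch of that behavior; the helper `sample_top_k` and the logits are made up for illustration and are not part of `transformers`.

```python
import math
import random


def sample_top_k(logits, top_k, temperature, rng):
    """Sample a token index after temperature scaling and top-k filtering."""
    scaled = [x / temperature for x in logits]
    # Keep only the indices of the top_k highest-scoring tokens.
    kept = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax weights over the kept candidates (subtract the max for stability).
    peak = max(scaled[i] for i in kept)
    weights = [math.exp(scaled[i] - peak) for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


rng = random.Random(0)
logits = [1.2, 3.4, 0.5, 2.9]  # made-up scores for four candidate tokens

# With top_k=1 every draw returns the argmax index, whatever the temperature.
picks = {sample_top_k(logits, top_k=1, temperature=0.9, rng=rng) for _ in range(100)}
print(picks)

# With top_k=3 the draws can vary among the three best tokens.
wider = {sample_top_k(logits, top_k=3, temperature=0.9, rng=rng) for _ in range(100)}
print(wider)
```

If more varied output is wanted, raising `top_k` is the setting to change; with `top_k=1`, adjusting `temperature` alone will not alter the result.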
+
+ ## System Prompt Best Practice
+ You can use the same dialogue templates and system prompt as [Agent Chepybara](https://chemllm.org/) to get better responses in local inference.
+ ### Dialogue Templates
+
+ For queries in ShareGPT format, such as
+ ```
+ {'instruction':"...","prompt":"...","answer":"...","history":[[q1,a1],[q2,a2]]}
+ ```
+ you can convert them into the InternLM2 dialogue format like this:
+ ```
+ def InternLM2_format(instruction,prompt,answer,history):
+     # `answer` is kept for compatibility with the record layout; it is not used here.
+     prefix_template=[
+         "<|system|>:",
+         "{}"
+     ]
+     prompt_template=[
+         "<|user|>:",
+         "{}\n",
+         "<|Bot|>:\n"
+     ]
+     # System block, then past (question, answer) turns, then the open user turn.
+     system = f'{prefix_template[0]}\n{prefix_template[-1].format(instruction)}\n'
+     history = "\n".join([f'{prompt_template[0]}\n{prompt_template[1].format(qa[0])}{prompt_template[-1]}{qa[1]}' for qa in history])
+     prompt = f'\n{prompt_template[0]}\n{prompt_template[1].format(prompt)}{prompt_template[-1]}'
+     return f"{system}{history}{prompt}"
+ ```
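
To make the resulting layout concrete, here is a small sketch that applies the same template to one record. The helper name `format_dialogue` is hypothetical (a compact restatement of the function above without the unused `answer` argument), and the record contents are invented for illustration.

```python
def format_dialogue(instruction, prompt, history):
    """Hypothetical helper: system block, past turns, then the open user turn."""
    system = f"<|system|>:\n{instruction}\n"
    turns = "\n".join(f"<|user|>:\n{q}\n<|Bot|>:\n{a}" for q, a in history)
    return f"{system}{turns}\n<|user|>:\n{prompt}\n<|Bot|>:\n"


# A made-up ShareGPT-style record with one earlier exchange.
record = {
    "instruction": "You are a helpful chemistry assistant.",
    "prompt": "What is its molar mass?",
    "answer": "",
    "history": [["What is the SMILES of water?", "O"]],
}

text = format_dialogue(record["instruction"], record["prompt"], record["history"])
print(text)
```

The string ends with an open `<|Bot|>:` turn, so the model's continuation is read as the assistant's reply to the last user message.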
+ Here is a good example of a system prompt:
+ ```
+ - Chepybara is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be Professional, Sophisticated, and Chemical-centric.
+ - For uncertain notions and data, Chepybara always assumes it with theoretical prediction and notices users then.
+ - Chepybara can accept SMILES (Simplified Molecular Input Line Entry System) string, and prefer output IUPAC names (International Union of Pure and Applied Chemistry nomenclature of organic chemistry), depict reactions in SMARTS (SMILES arbitrary target specification) string. Self-Referencing Embedded Strings (SELFIES) are also accepted.
+ - Chepybara always solves problems and thinks in step-by-step fashion, Output begin with *Let's think step by step*.
+ ```
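
When a system prompt like this is sent through the dialogue template, the decoded output of `model.generate` for a causal language model contains the prompt followed by the completion. A minimal sketch for keeping only the final reply is shown below. It assumes the `<|Bot|>:` marker from the template survives decoding as plain text, which has not been verified for this tokenizer; the helper name `extract_reply` and the sample string are made up.

```python
def extract_reply(decoded, marker="<|Bot|>:\n"):
    """Return the text after the last bot marker, or the whole string if absent."""
    _, found, tail = decoded.rpartition(marker)
    return tail.strip() if found else decoded.strip()


# Made-up decoded output: the prompt followed by an invented completion.
decoded = (
    "<|system|>:\n- Chepybara is a conversational language model ...\n"
    "\n<|user|>:\nName the molecule CCO.\n"
    "<|Bot|>:\nLet's think step by step. CCO is ethanol.\n"
)

reply = extract_reply(decoded)
print(reply)
```

Splitting on the last marker rather than the first keeps earlier turns from the history out of the extracted reply.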
+
+ ## Results
+ ### MMLU Highlights
+
+ | dataset                | ChatGLM3-6B | Qwen-7B | LLaMA-2-7B | Mistral-7B | InternLM2-7B-Chat | ChemLLM-7B-Chat |
+ | ---------------------- | ----------- | ------- | ---------- | ---------- | ----------------- | --------------- |
+ | college chemistry      | 43.0        | 39.0    | 27.0       | 40.0       | 43.0              | 47.0            |
+ | college mathematics    | 28.0        | 33.0    | 33.0       | 30.0       | 36.0              | 41.0            |
+ | college physics        | 32.4        | 35.3    | 25.5       | 34.3       | 41.2              | 48.0            |
+ | formal logic           | 35.7        | 43.7    | 24.6       | 40.5       | 34.9              | 47.6            |
+ | moral scenarios        | 26.4        | 35.0    | 24.1       | 39.9       | 38.6              | 44.3            |
+ | humanities average     | 62.7        | 62.5    | 51.7       | 64.5       | 66.5              | 68.6            |
+ | stem average           | 46.5        | 45.8    | 39.0       | 47.8       | 52.2              | 52.6            |
+ | social science average | 68.2        | 65.8    | 55.5       | 68.1       | 69.7              | 71.9            |
+ | other average          | 60.5        | 60.3    | 51.3       | 62.4       | 63.2              | 65.2            |
+ | mmlu                   | 58.0        | 57.1    | 48.2       | 59.2       | 61.7              | 63.2            |
+
+ *(OpenCompass)
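
As a reading aid for the table, the sketch below computes the point difference between ChemLLM-7B-Chat and its base model InternLM2-7B-Chat for a few rows. The numbers are copied from the table; the selection of rows and the variable names are choices made for this illustration only.

```python
# (InternLM2-7B-Chat, ChemLLM-7B-Chat) scores copied from the table above.
scores = {
    "college chemistry": (43.0, 47.0),
    "college mathematics": (36.0, 41.0),
    "college physics": (41.2, 48.0),
    "formal logic": (34.9, 47.6),
    "moral scenarios": (38.6, 44.3),
    "mmlu": (61.7, 63.2),
}

# Point gain of ChemLLM-7B-Chat over its base model, rounded to one decimal.
gains = {name: round(chem - base, 1) for name, (base, chem) in scores.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain}")

largest = max(gains, key=gains.get)
print("largest gain:", largest)
```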
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64bce15bafd1e46c5504ad38/dvqKoPi0il6vrnGcSZp9p.png)
+
+
+ ### Chemical Benchmark
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64bce15bafd1e46c5504ad38/qFl2h0fTXYTjQsDZXjSx8.png)
+ *(Scores judged by GPT-4 Turbo)
+
+ ### Professional Translation
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64bce15bafd1e46c5504ad38/kVDK3H8a0802HWYHtlHYP.png)
+
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64bce15bafd1e46c5504ad38/ERbod2Elccw-k_6tEYZjO.png)
+
+
+ You can try it [online](https://chemllm.org/).
+
+ ## Cite this work
+ ```
+ @misc{zhang2024chemllm,
+       title={ChemLLM: A Chemical Large Language Model},
+       author={Di Zhang and Wei Liu and Qian Tan and Jingdan Chen and Hang Yan and Yuliang Yan and Jiatong Li and Weiran Huang and Xiangyu Yue and Dongzhan Zhou and Shufei Zhang and Mao Su and Hansen Zhong and Yuqiang Li and Wanli Ouyang},
+       year={2024},
+       eprint={2402.06852},
+       archivePrefix={arXiv},
+       primaryClass={cs.AI}
+ }
+ ```
+
+ ## Disclaimer
+
+ LLMs may generate incorrect answers. Please proofread the output carefully and use it at your own risk.
+
+ ## Open Source License
+
+ The code is licensed under Apache-2.0. The model weights are fully open for academic research and also allow **free** commercial use. To apply for a commercial license, or for other questions and collaborations, please contact <[email protected]>.
+
+
+ ## Demo
+ [Agent Chepybara](https://chemllm.org/)
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64bce15bafd1e46c5504ad38/vsA5MJVP7-XmBp6uFs3tV.png)
+
+ ## Contact
+ AI4Physics Science, Shanghai AI Lab ([email protected])