dahara1 committed
Commit a21c7f8
1 Parent(s): f6cb5e4

Update README.md

Files changed (1): README.md (+120 -2)
README.md CHANGED
@@ -2,7 +2,7 @@
  tags:
  - npu
  - amd
- - llama3
+ - llama3.1
  - Ryzen AI
  ---

@@ -14,4 +14,122 @@ The following sample assumes that the setup on the above page has been completed

  This model has only been tested on RyzenAI for Windows 11. It does not work in Linux environments such as WSL.

- Sample script will be uploaded tomorrow.
+
+
+ ### Setup
+ In a Windows command prompt:
+ ```
+ conda activate ryzenai-transformers
+ <your_install_path>\RyzenAI-SW\example\transformers\setup.bat
+
+ pip install transformers==4.43.3
+ # Updating the Transformers library will break the Llama 2 sample.
+ # If you want to run Llama 2, revert with: pip install transformers==4.34.0
+ pip install tokenizers==0.19.1
+
+ git lfs install
+ git clone https://huggingface.co/dahara1/llama3.1-8b-Instruct-amd-npu
+ cd llama3.1-8b-Instruct-amd-npu
+ git lfs pull
+ cd ..
+ copy <your_install_path>\RyzenAI-SW\example\transformers\models\llama2\modeling_llama_amd.py .
+
+ # Set up the runtime. See https://ryzenai.docs.amd.com/en/latest/runtime_setup.html
+ set XLNX_VART_FIRMWARE=<your_install_path>\voe-4.0-win_amd64\1x4.xclbin
+ set NUM_OF_DPU_RUNNERS=1
+
+ # Save the sample script below as llama3.1-test.py (UTF-8 encoding).
+ python llama3.1-test.py
+ ```
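+
+ Before running the script, it can help to confirm that the runtime variables from the setup step are actually set. This is a minimal sanity-check sketch, not part of the official setup (the file-existence check is an assumption):
+ ```
+ import os
+
+ # Confirm the NPU runtime variables from the setup step are visible.
+ for var in ("XLNX_VART_FIRMWARE", "NUM_OF_DPU_RUNNERS"):
+     print(f"{var} = {os.environ.get(var)}")
+
+ # The firmware variable should point at an existing .xclbin file.
+ assert os.path.isfile(os.environ["XLNX_VART_FIRMWARE"]), "xclbin not found"
+ ```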
+
+ ### Sample Script
+
+ ```
+ import torch
+ import psutil
+ import transformers
+ from transformers import AutoTokenizer, set_seed
+ import qlinear
+ import logging
+
+ set_seed(123)
+ transformers.logging.set_verbosity_error()
+ logging.disable(logging.CRITICAL)
+
+ messages = [
+     {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
+ ]
+
+ message_list = [
+     "Who are you? ",
+     # Japanese: "What is the name of the ship you are on? Please reply using only Japanese, not English."
+     "あなたの乗っている船の名前は何ですか?英語ではなく全て日本語だけを使って返事をしてください",
+     # Chinese: "What is the most dangerous adventure you have experienced? Please answer everything in Chinese, not English."
+     "你经历过的最危险的冒险是什么?请用中文回答所有问题,不要用英文。",
+     # French: "How fast does your boat go? Please answer only in French, not in English."
+     "À quelle vitesse va votre bateau ? Veuillez répondre uniquement en français et non en anglais.",
+     # Korean: "What part of the ship do you like? Please answer entirely in Korean, without using English."
+     "당신은 그 배의 어디를 좋아합니까? 영어를 사용하지 않고 모두 한국어로 대답하십시오.",
+     # German: "What would your ship's name be in German? Please answer in German instead of English."
+     "Wie würde Ihr Schiffsname auf Deutsch lauten? Bitte antwortet alle auf Deutsch statt auf Englisch.",
+     # Taiwanese: "What is the most amazing treasure you have discovered? Please answer only in Taiwanese and Traditional Chinese, not English."
+     "您發現過的最令人驚奇的寶藏是什麼?請僅使用台語和繁體中文回答,不要使用英文。",
+ ]
+
+
+ if __name__ == "__main__":
+     # Pin the process to four CPU cores; the heavy matmuls run on the NPU.
+     p = psutil.Process()
+     p.cpu_affinity([0, 1, 2, 3])
+     torch.set_num_threads(4)
+
+     tokenizer = AutoTokenizer.from_pretrained("llama3.1-8b-Instruct-amd-npu")
+     ckpt = "llama3.1-8b-Instruct-amd-npu/llama3.1_8b_w_bit_4_awq_amd.pt"
+     terminators = [
+         tokenizer.eos_token_id,
+         tokenizer.convert_tokens_to_ids("<|eot_id|>")
+     ]
+     model = torch.load(ckpt)
+     model.eval()
+     model = model.to(torch.bfloat16)
+
+     # Quantize the AWQ linear layers and place them on the NPU (AIE).
+     for n, m in model.named_modules():
+         if isinstance(m, qlinear.QLinearPerGrp):
+             print(f"Preparing weights of layer : {n}")
+             m.device = "aie"
+             m.quantize_weights()
+
+     print("system: " + messages[0]['content'])
+
+     for i in range(len(message_list)):
+         messages.append({"role": "user", "content": message_list[i]})
+         print("user: " + message_list[i])
+
+         inputs = tokenizer.apply_chat_template(
+             messages,
+             add_generation_prompt=True,
+             return_tensors="pt",
+             return_dict=True
+         )
+
+         outputs = model.generate(inputs['input_ids'],
+                                  max_new_tokens=600,
+                                  eos_token_id=terminators,
+                                  attention_mask=inputs['attention_mask'],
+                                  do_sample=True,
+                                  temperature=0.6,
+                                  top_p=0.9)
+
+         response = outputs[0][inputs['input_ids'].shape[-1]:]
+         response_message = tokenizer.decode(response, skip_special_tokens=True)
+         print("assistant: " + response_message)
+         messages.append({"role": "assistant", "content": response_message})
+ ```
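+
+ For a rough tokens-per-second figure on the NPU, the `generate` call can be wrapped with a timer. A minimal sketch reusing `model`, `inputs`, `terminators`, and the generation arguments from the script above:
+ ```
+ import time
+
+ start = time.perf_counter()
+ outputs = model.generate(inputs['input_ids'],
+                          max_new_tokens=600,
+                          eos_token_id=terminators,
+                          attention_mask=inputs['attention_mask'],
+                          do_sample=True, temperature=0.6, top_p=0.9)
+ elapsed = time.perf_counter() - start
+
+ # Count only the newly generated tokens, not the prompt.
+ new_tokens = outputs.shape[-1] - inputs['input_ids'].shape[-1]
+ print(f"{new_tokens} tokens in {elapsed:.1f}s ({new_tokens / elapsed:.2f} tokens/s)")
+ ```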
+
+ ## Acknowledgements
+ - [amd/RyzenAI-SW](https://github.com/amd/RyzenAI-SW)
+ Sample code and drivers.
+ - [mit-han-lab/llm-awq](https://github.com/mit-han-lab/llm-awq)
+ Thanks for the AWQ quantization method.
+ - [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
+ [Built with Meta Llama 3](https://llama.meta.com/llama3/license/)