fix(readme): Add information about placeholder tokens.
- README.md +4 -0
- added_tokens.json +1 -1
README.md CHANGED
@@ -56,6 +56,10 @@ The current `transformers` version can be verified with: `pip list | grep transf
 
 Phi-3 Mini-4K-Instruct is also available in [HuggingChat](https://aka.ms/try-phi3-hf-chat).
 
+### Tokenizer
+
+Phi-3 Mini-4K-Instruct supports a vocabulary size of up to `32064` tokens. The [tokenizer files](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/blob/main/added_tokens.json) already provide placeholder tokens that can be used for downstream fine-tuning, but they can also be extended up to the model's vocabulary size.
+
 ### Chat Format
 
 Given the nature of the training data, the Phi-3 Mini-4K-Instruct model is best suited for prompts using the chat format as follows.
added_tokens.json CHANGED
@@ -9,5 +9,5 @@
   "<|end|>": 32007,
   "<|placeholder5|>": 32008,
   "<|placeholder6|>": 32009,
-  "<|user|>":
+  "<|user|>": 32010
 }
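The placeholder scheme the README addition describes can be sketched in plain Python using the token ids visible in the diff above. The mapping below copies only the entries shown here; the filtering logic is illustrative and not part of the repository:

```python
# Sketch of the placeholder-token idea: added_tokens.json assigns ids above
# the base vocabulary to special tokens, and reserves "placeholder" ids that
# a downstream fine-tune can repurpose for new special tokens without
# growing the embedding matrix, since every id stays below the model's
# vocabulary size of 32064. Entries are copied from the diff above.
added_tokens = {
    "<|end|>": 32007,
    "<|placeholder5|>": 32008,
    "<|placeholder6|>": 32009,
    "<|user|>": 32010,
}

VOCAB_SIZE = 32064  # model vocabulary size stated in the README addition

# Ids still free for downstream fine-tuning.
placeholders = {
    tok: idx for tok, idx in added_tokens.items()
    if tok.startswith("<|placeholder")
}

# All added ids must fit inside the model's embedding table.
assert all(idx < VOCAB_SIZE for idx in added_tokens.values())

print(sorted(placeholders.values()))  # prints [32008, 32009]
```

In practice a fine-tuning script would rename one of these placeholder entries to its new special token rather than appending a fresh id, keeping the checkpoint's embedding shape unchanged.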