gugarosa committed on
Commit 3c0c9df
Parent: 65be4e0

fix(readme): Adds information about placeholder tokens.

Files changed (2)
  1. README.md +4 -0
  2. added_tokens.json +1 -1
README.md CHANGED
@@ -56,6 +56,10 @@ The current `transformers` version can be verified with: `pip list | grep transf
 
 Phi-3 Mini-4K-Instruct is also available in [HuggingChat](https://aka.ms/try-phi3-hf-chat).
 
+### Tokenizer
+
+Phi-3 Mini-4K-Instruct supports a vocabulary size of up to `32064` tokens. The [tokenizer files](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/blob/main/added_tokens.json) already provide placeholder tokens that can be used for downstream fine-tuning, but they can also be extended up to the model's vocabulary size.
+
 ### Chat Format
 
 Given the nature of the training data, the Phi-3 Mini-4K-Instruct model is best suited for prompts using the chat format as follows.
added_tokens.json CHANGED
@@ -9,5 +9,5 @@
   "<|end|>": 32007,
   "<|placeholder5|>": 32008,
   "<|placeholder6|>": 32009,
-  "<|user|>": 320010
+  "<|user|>": 32010
 }
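
Review note: the `added_tokens.json` fix replaces an ID (`320010`) that lies far outside the `32064`-token vocabulary described in the README change. A range check along the following lines would flag that kind of typo. This is a minimal sketch, not part of the commit: the helper name `find_invalid_token_ids` is hypothetical, and the dictionaries contain only the four entries visible in the hunk, not the full file.

```python
# Vocabulary size as stated in the README change in this commit.
VOCAB_SIZE = 32064


def find_invalid_token_ids(added_tokens, vocab_size=VOCAB_SIZE):
    """Return the {token: id} entries whose ID is outside [0, vocab_size).

    Hypothetical helper for illustration; it is not part of the
    repository or of any library.
    """
    return {
        token: token_id
        for token, token_id in added_tokens.items()
        if not 0 <= token_id < vocab_size
    }


# Entries shown in the hunk, before the fix. The rest of the file is
# not visible in this diff and is deliberately left out.
before = {
    "<|end|>": 32007,
    "<|placeholder5|>": 32008,
    "<|placeholder6|>": 32009,
    "<|user|>": 320010,
}

# Same entries after the fix.
after = {**before, "<|user|>": 32010}

print(find_invalid_token_ids(before))  # {'<|user|>': 320010}
print(find_invalid_token_ids(after))   # {}
```

The check only covers the ID range. It does not detect duplicate IDs or collisions with the base vocabulary, which would need the full tokenizer files.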