gugarosa committed on
Commit f10fb29
1 Parent(s): 48788d7

fix(readme): Adds information about placeholder tokens.

Files changed (2)
  1. README.md +4 -0
  2. added_tokens.json +1 -1
README.md CHANGED
@@ -53,6 +53,10 @@ Phi-3 Mini-128K-Instruct has been integrated in the development version (4.40.0)

  The current `transformers` version can be verified with: `pip list | grep transformers`.

+ ### Tokenizer
+
+ Phi-3 Mini-128K-Instruct supports a vocabulary size of up to `32064` tokens. The [tokenizer files](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/blob/main/added_tokens.json) already provide placeholder tokens that can be used for downstream fine-tuning, but they can also be extended up to the model's vocabulary size.
+
  ### Chat Format

  Given the nature of the training data, the Phi-3 Mini-128K-Instruct model is best suited for prompts using the chat format as follows.
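
As a rough illustration of the headroom described in the new Tokenizer section, the sketch below counts how many token IDs remain unused below the stated vocabulary size of `32064`. The helper name `free_token_slots` is hypothetical, and the highest used ID (`32010`, for `<|user|>`) is taken from the corrected `added_tokens.json` in this commit. It assumes IDs are assigned contiguously starting at 0, which the diff does not confirm.

```python
VOCAB_SIZE = 32064  # vocabulary size stated in the README


def free_token_slots(highest_used_id: int, vocab_size: int = VOCAB_SIZE) -> int:
    """Return how many IDs lie above highest_used_id and below vocab_size.

    Hypothetical helper; assumes IDs 0..highest_used_id are all taken.
    """
    if not 0 <= highest_used_id < vocab_size:
        raise ValueError(
            f"token ID {highest_used_id} is outside the range [0, {vocab_size})"
        )
    return vocab_size - (highest_used_id + 1)


if __name__ == "__main__":
    # 32010 is the ID of "<|user|>" after this commit's fix.
    print(free_token_slots(32010))
```

Under those assumptions, any tokens added during fine-tuning beyond the existing placeholders would have to fit within that remaining range.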
added_tokens.json CHANGED
@@ -9,5 +9,5 @@
  "<|end|>": 32007,
  "<|placeholder5|>": 32008,
  "<|placeholder6|>": 32009,
- "<|user|>": 320010
+ "<|user|>": 32010
  }
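
The typo fixed in this hunk (`320010` instead of `32010`) put a token ID far outside the `32064`-entry vocabulary. A minimal sketch of a check that would flag it, assuming the file is a flat JSON object mapping token strings to integer IDs; `out_of_range_tokens` is a hypothetical helper and not part of `transformers`, and the data below is only the excerpt visible in this diff.

```python
import json

VOCAB_SIZE = 32064  # vocabulary size stated in the README


def out_of_range_tokens(added_tokens, vocab_size=VOCAB_SIZE):
    """Return the entries whose ID does not fall in [0, vocab_size)."""
    return {
        token: token_id
        for token, token_id in added_tokens.items()
        if not 0 <= token_id < vocab_size
    }


# Excerpt of the entries shown in this diff, before the fix.
before = json.loads(
    '{"<|end|>": 32007, "<|placeholder5|>": 32008,'
    ' "<|placeholder6|>": 32009, "<|user|>": 320010}'
)
# Same excerpt with the corrected ID for "<|user|>".
after = {**before, "<|user|>": 32010}

if __name__ == "__main__":
    print(out_of_range_tokens(before))
    print(out_of_range_tokens(after))
```

A check of this kind could run against the full `added_tokens.json` before publishing tokenizer files, though whether the repository uses any such validation is not stated here.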