Is it possible to extend tokens to models?
#8
by
badrabbitt
- opened
Can I extend the tokenizer vocabulary for this model so that it supports languages beyond English?
Hi there!
We use the same tokenizer as Falcon-7B/11B, which supports English (en), German (de), Spanish (es), French (fr), Italian (it), Dutch (nl), Polish (pl), Portuguese (pt), Czech (cs), Romanian (ro) and Swedish (sv).
For the languages listed above, you can simply continue pretraining to enable multilingual capabilities. For other languages, you may need to extend the vocabulary to cover the target language; Chinese-LLaMA is a simple example of this approach.
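To make the "extend the vocabulary" step concrete, here is a toy sketch in plain Python. It is not our training code: the helper name `extend_vocab`, the two-word vocabulary, and the mean-of-existing-rows initialization are all assumptions chosen for illustration. With Hugging Face `transformers`, the corresponding calls would be `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`, after which the new rows still have to be trained by continued pretraining.

```python
def extend_vocab(vocab, embeddings, new_tokens):
    """Return copies of vocab and embeddings extended with unseen tokens.

    Hypothetical helper for illustration only. Assumes token ids are
    contiguous (0..len(vocab)-1) and embeddings has one row per id.
    """
    vocab = dict(vocab)
    embeddings = [list(row) for row in embeddings]
    dim = len(embeddings[0])

    # Mean of the original rows: one simple way to initialize new tokens
    # (an assumption here, not necessarily what any given project does).
    mean_row = [sum(row[j] for row in embeddings) / len(embeddings)
                for j in range(dim)]

    for token in new_tokens:
        if token in vocab:
            continue  # already covered by the existing tokenizer
        vocab[token] = len(vocab)          # next free id
        embeddings.append(list(mean_row))  # new embedding row for that id
    return vocab, embeddings


# Toy data: a two-token vocabulary with 2-dimensional embeddings.
vocab = {"hello": 0, "world": 1}
embeddings = [[1.0, 0.0], [0.0, 1.0]]

new_vocab, new_embeddings = extend_vocab(
    vocab, embeddings, ["你好", "world", "世界"]
)
```

Only the tokens missing from the original vocabulary get new ids and rows; the existing ids and embeddings are left untouched, which is what lets you keep the pretrained weights and continue training from there.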
We will put more details in our technical report.
Stay tuned! :)
Thanks!