tiiuae/falcon-mamba-7b · Is it possible to extend tokens to models?

Aug 15

Can I extend this model's tokenizer vocabulary so that the model supports languages beyond English?

Technology Innovation Institute org Sep 1

Hi there!

Basically, we are using the same tokenizer as Falcon-7B/11B, which supports English (en), German (de), Spanish (es), French (fr), Italian (it), Dutch (nl), Polish (pl), Portuguese (pt), Czech (cs), Romanian (ro) and Swedish (sv).

For the above languages, you can simply continue pretraining on data in the target language to enable multilingual capabilities. For languages beyond that list, you may need to extend the vocabulary to cover the target language first. A simple example of this approach is Chinese-LLaMA.
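For anyone who wants to try the vocabulary-extension route, here is a minimal sketch, assuming the Hugging Face `transformers` API (`get_vocab`, `add_tokens`, `resize_token_embeddings`). The helper names `extend_vocabulary` and `load_and_extend` are hypothetical, and I have not verified this against falcon-mamba-7b specifically. Note that this is a simpler stand-in for what Chinese-LLaMA does: that project trains a separate tokenizer on target-language text and merges it into the original one, whereas this just appends a list of tokens you supply.

```python
def extend_vocabulary(tokenizer, model, new_tokens):
    """Add tokens missing from the tokenizer and grow the embeddings to match.

    Returns the number of tokens that were actually added.
    """
    vocab = tokenizer.get_vocab()  # maps token string -> token id
    # Drop duplicates (keeping order) and tokens the tokenizer already has.
    missing = [t for t in dict.fromkeys(new_tokens) if t not in vocab]
    num_added = tokenizer.add_tokens(missing) if missing else 0
    if num_added > 0:
        # New embedding rows are freshly initialized; they only become
        # useful after continued pretraining on target-language data.
        model.resize_token_embeddings(len(tokenizer))
    return num_added


def load_and_extend(model_id, new_tokens):
    """Hypothetical convenience wrapper: load a checkpoint, then extend it."""
    # Imported lazily so extend_vocabulary works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    extend_vocabulary(tokenizer, model, new_tokens)
    return tokenizer, model
```

Something like `load_and_extend("tiiuae/falcon-mamba-7b", my_tokens)` would be the intended usage, where `my_tokens` is your own list of target-language tokens. After that you still need to continue pretraining so the new embeddings learn something meaningful.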

We will put more details in our technical report.

Stay tuned! :)

badrabbitt

Sep 14

Thanksss