Supported Languages
#72, opened by vpkprasanna
Which languages does the LLM support?
Is there an easy way to find out from the tokenizer vocab file?
Hi @vpkprasanna,
You can get a rough picture of language coverage from the tokenizer:
- Load the tokenizer associated with the model.
- Retrieve the vocabulary, which contains the tokens, and inspect their structure to see whether the supported languages can be inferred.
- Filter these tokens for language-specific characters, such as Devanagari (Hindi) or Chinese characters, to check whether the vocabulary covers these languages.
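The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's exact code: the `SCRIPT_RANGES` table and the helper names `count_tokens_by_script` and `load_vocab` are my own assumptions, and the two Unicode ranges are a non-exhaustive choice. It also assumes the tokenizer stores tokens as readable text (as SentencePiece-based tokenizers do); byte-level BPE vocabularies would need decoding first. Token counts only show that the tokenizer can represent a script, which is a hint about language support rather than proof of it.

```python
from collections import Counter

# Illustrative Unicode code point ranges (assumed, non-exhaustive).
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),  # script used for Hindi
    "CJK": (0x4E00, 0x9FFF),         # CJK Unified Ideographs, used for Chinese
}


def count_tokens_by_script(tokens):
    """Count tokens that contain at least one character from each script."""
    counts = Counter()
    for token in tokens:
        for script, (low, high) in SCRIPT_RANGES.items():
            if any(low <= ord(ch) <= high for ch in token):
                counts[script] += 1
    return counts


def load_vocab(model_name="google/gemma-2b"):
    """Load the model's tokenizer and return its token strings.

    Needs the `transformers` package and network access; the model may also
    require accepting its license and logging in on the Hugging Face Hub.
    """
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return list(tokenizer.get_vocab().keys())
```

Calling `count_tokens_by_script(load_vocab())` then returns the number of vocabulary tokens per script; a count of zero for a script suggests the tokenizer falls back to bytes or unknown tokens for that language.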
You can refer to the IPython notebook below, where I use the google/gemma-2b model to check whether it supports Hindi and Chinese.
https://colab.research.google.com/gist/Gopi-Uppari/2600403197351f4a746b988f937adc4e/supported-languagesipynb.ipynb
Thank you.