
Bilingual Tokenizer

A portion of the data from the IndustryCorpus-Subset-zh-en dataset was used for training.

This dataset consists of Chinese and English bilingual text.
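The card does not spell out the training recipe, so the following is only a rough sketch of how a small bilingual tokenizer of this kind could be trained with the Hugging Face `tokenizers` library. The BPE algorithm, corpus file name, special tokens, and the 8K vocabulary size are illustrative assumptions, not the actual setup.

```python
# Hypothetical training sketch; algorithm, file names and special tokens are assumptions.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
# Byte-level pre-tokenization treats Chinese and English text uniformly.
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=8000,                         # e.g. the 8K variant
    special_tokens=["<unk>", "<s>", "</s>"],
)

# "corpus.txt" stands in for the training split of IndustryCorpus-Subset-zh-en.
tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("BilingualTokenizer-8K.json")
```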

10,000 samples were drawn from the portion not used for training to measure each tokenizer's compression rate.

Compression rate formula:

$$\text{Compression rate} = \frac{\text{length after tokenization}}{\text{character length of the original corpus}}$$
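A lower compression rate means fewer tokens per character, i.e. better compression. The measurement can be reproduced with a short script like the one below; the model name and sample texts are placeholders for the 10,000-sample test set.

```python
# Sketch of the compression-rate measurement defined by the formula above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")  # any tokenizer under test

def compression_rate(texts):
    token_len = sum(len(tokenizer.encode(t, add_special_tokens=False)) for t in texts)
    char_len = sum(len(t) for t in texts)
    return token_len / char_len  # tokens per character; lower is better

samples = ["今天天气不错。", "The weather is nice today."]  # stand-in for the 10k test samples
print(f"{compression_rate(samples):.2%}")
```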

Here are the test results:

| Model | Vocab Size | Compression Rate |
|---|---|---|
| deepseek-llm-7b-base | 100015 | 36.63% |
| deepseek-coder-33b-base | 32022 | 41.75% |
| gemma-2-27b | 256000 | 37.75% |
| glm-4-9b | 151343 | 34.26% |
| internlm2_5-7b-chat | 92550 | 35.15% |
| Llama-2-7b-hf | 32000 | 63.33% |
| Meta-Llama-3.1-8B | 128256 | 41.48% |
| Mistral-7B-Instruct-v0.3 | 32768 | 52.43% |
| Phi-3.5-mini-instruct | 32011 | 63.29% |
| Qwen2-7B-Instruct | 151646 | 35.91% |
| Yi-1.5-9B | 63992 | 36.86% |
| BilingualTokenizer-1K | 1000 | 75.61% |
| BilingualTokenizer-2K | 2000 | 62.26% |
| BilingualTokenizer-4K | 4000 | 52.81% |
| BilingualTokenizer-8K | 8000 | 45.92% |
| BilingualTokenizer-16K | 16000 | 40.94% |
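As the table shows, compression improves as vocabulary size grows: the 16K variant (40.94%) already edges out Meta-Llama-3.1-8B's 128,256-entry vocabulary (41.48%) on this bilingual corpus. If the variants are stored as subfolders of this repository (an assumption about the repo layout; adjust the subfolder name if it differs), one could be loaded like this:

```python
# Assumes each variant lives in a subfolder named after it; this is an
# assumption about the repo layout, not confirmed by the card.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Mxode/Bilingual-Tokenizer", subfolder="BilingualTokenizer-8K")
print(tok.tokenize("今天天气不错。The weather is nice today."))
```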