xu-song committed
Commit 988921c · 1 parent: 7d2062e

update compress rate

This view is limited to 50 files because it contains too many changes. See raw diff.
Files changed (50)
  1. README.md +174 -81
  2. app.py +2 -2
  3. stats/README.md +0 -0
  4. stats/compress_rate/amber.en.json +1 -0
  5. stats/compress_rate/amber.zh-Hans.json +1 -0
  6. stats/compress_rate/aya_101.en.json +1 -0
  7. stats/compress_rate/aya_101.zh-Hans.json +1 -0
  8. stats/compress_rate/baichuan.en.json +1 -0
  9. stats/compress_rate/baichuan.zh-Hans.json +1 -0
  10. stats/compress_rate/baichuan2.en.json +1 -0
  11. stats/compress_rate/baichuan2.zh-Hans.json +1 -0
  12. stats/compress_rate/bert_base_cased.en.json +1 -0
  13. stats/compress_rate/bert_base_cased.zh-Hans.json +1 -0
  14. stats/compress_rate/bert_base_chinese.en.json +1 -0
  15. stats/compress_rate/bert_base_chinese.zh-Hans.json +1 -0
  16. stats/compress_rate/bert_base_uncased.en.json +1 -0
  17. stats/compress_rate/bert_base_uncased.zh-Hans.json +1 -0
  18. stats/compress_rate/bloom.en.json +1 -0
  19. stats/compress_rate/bloom.zh-Hans.json +1 -0
  20. stats/compress_rate/byt5_small.en.json +1 -0
  21. stats/compress_rate/byt5_small.zh-Hans.json +1 -0
  22. stats/compress_rate/character_glm_6b.en.json +1 -0
  23. stats/compress_rate/character_glm_6b.zh-Hans.json +1 -0
  24. stats/compress_rate/chatglm2_6b.en.json +1 -0
  25. stats/compress_rate/chatglm2_6b.zh-Hans.json +1 -0
  26. stats/compress_rate/chatglm3_6b.en.json +1 -0
  27. stats/compress_rate/chatglm3_6b.zh-Hans.json +1 -0
  28. stats/compress_rate/chatglm_6b.en.json +1 -0
  29. stats/compress_rate/chatglm_6b.zh-Hans.json +1 -0
  30. stats/compress_rate/chatyuan_large_v2.en.json +1 -0
  31. stats/compress_rate/chatyuan_large_v2.zh-Hans.json +1 -0
  32. stats/compress_rate/chinese_llama.en.json +1 -0
  33. stats/compress_rate/chinese_llama.zh-Hans.json +1 -0
  34. stats/compress_rate/chinese_llama2.en.json +1 -0
  35. stats/compress_rate/chinese_llama2.zh-Hans.json +1 -0
  36. stats/compress_rate/code_davinci_002.en.json +1 -0
  37. stats/compress_rate/code_davinci_002.zh-Hans.json +1 -0
  38. stats/compress_rate/crystal_coder.en.json +1 -0
  39. stats/compress_rate/crystal_coder.zh-Hans.json +1 -0
  40. stats/compress_rate/dbrx_instruct.en.json +1 -0
  41. stats/compress_rate/dbrx_instruct.zh-Hans.json +1 -0
  42. stats/compress_rate/deepseek_coder_33b_instruct.en.json +1 -0
  43. stats/compress_rate/deepseek_coder_33b_instruct.zh-Hans.json +1 -0
  44. stats/compress_rate/deepseek_llm_7b_base.en.json +1 -0
  45. stats/compress_rate/deepseek_llm_7b_base.zh-Hans.json +1 -0
  46. stats/compress_rate/falcon_180b.en.json +1 -0
  47. stats/compress_rate/falcon_180b.zh-Hans.json +1 -0
  48. stats/compress_rate/falcon_7b.en.json +1 -0
  49. stats/compress_rate/falcon_7b.zh-Hans.json +1 -0
  50. stats/compress_rate/fastchat_t5_3b.en.json +1 -0
README.md CHANGED
@@ -14,9 +14,17 @@ pinned: false
## Compress Rate

- On the [cc-100](https://huggingface.co/datasets/cc100) dataset, we take 10,000 samples per language and measure the compression rate of each tokenizer. The compression metric is `g_bytes/b_tokens`.
- You can reproduce the results with the following script:
+ On the [cc-100](https://huggingface.co/datasets/cc100) dataset, we take 10,000 samples per language and measure the compression rate of each tokenizer.
+
+ > Compression rate example:
+ llama3 expands the vocabulary and achieves a higher compression ratio. For the same 1 TB of Simplified Chinese text, llama tokenization yields 0.56 trillion tokens, while llama3 needs only 0.31 trillion.
+
+ | tokenizer | vocab_size | t_bytes/t_tokens | t_tokens/t_bytes | n_chars/n_tokens |
+ |:----------|-----------:|-----------------:|-----------------:|-----------------:|
+ | llama | 32000 | 1.8 | 0.56 | 0.7 |
+ | llama3 | 128000 | 3.2 | 0.31 | 1.24 |
+
+ The results can be reproduced with the following script:
```sh
python utils/compress_rate_util.py
```
@@ -24,92 +32,177 @@ python utils/compress_rate_util.py

+ <details> <summary>English compression rate</summary>
+ Compression rate computed on the English dataset cc100-en
+
+ | tokenizer | vocab_size | g_bytes/b_tokens | b_tokens/g_bytes | t_bytes/t_tokens | t_tokens/t_bytes | n_chars/n_tokens |
+ |:----------|-----------:|-----------------:|-----------------:|-----------------:|-----------------:|-----------------:|
+ | amber | 32000 | 3.56 | 0.28 | 3.47 | 0.29 | 3.81 |
+ | aya_101 | 250100 | 3.3 | 0.3 | 3.22 | 0.31 | 3.53 |
+ | baichuan | 64000 | 3.74 | 0.27 | 3.65 | 0.27 | 4 |
+ | baichuan2 | 125696 | 3.89 | 0.26 | 3.8 | 0.26 | 4.17 |
+ | bert_base_cased | 28996 | 3.64 | 0.27 | 3.55 | 0.28 | 3.89 |
+ | bert_base_chinese | 21128 | 2.78 | 0.36 | 2.71 | 0.37 | 2.97 |
+ | bert_base_uncased | 30522 | 3.73 | 0.27 | 3.65 | 0.27 | 4 |
+ | bloom | 250680 | 4.07 | 0.25 | 3.97 | 0.25 | 4.36 |
+ | byt5_small | 256 | 0.92 | 1.08 | 0.9 | 1.11 | 0.99 |
+ | character_glm_6b | 64794 | 3.62 | 0.28 | 3.54 | 0.28 | 3.88 |
+ | chatglm2_6b | 64794 | 3.62 | 0.28 | 3.54 | 0.28 | 3.88 |
+ | chatglm3_6b | 64798 | 3.62 | 0.28 | 3.54 | 0.28 | 3.88 |
+ | chatglm_6b | 150344 | 3.68 | 0.27 | 3.59 | 0.28 | 3.94 |
+ | chatyuan_large_v2 | 32128 | 1.95 | 0.51 | 1.91 | 0.52 | 2.09 |
+ | chinese_llama | 49953 | 3.59 | 0.28 | 3.51 | 0.28 | 3.85 |
+ | chinese_llama2 | 55296 | 3.56 | 0.28 | 3.47 | 0.29 | 3.81 |
+ | code_davinci_002 | 50281 | 4.05 | 0.25 | 3.96 | 0.25 | 4.34 |
+ | crystal_coder | 32000 | 3.68 | 0.27 | 3.59 | 0.28 | 3.94 |
+ | dbrx_instruct | 100277 | 4.11 | 0.24 | 4.01 | 0.25 | 4.4 |
+ | deepseek_coder_33b_instruct | 32000 | 3.64 | 0.27 | 3.56 | 0.28 | 3.9 |
+ | deepseek_llm_7b_base | 100000 | 3.85 | 0.26 | 3.76 | 0.27 | 4.12 |
+ | falcon_180b | 65024 | 3.99 | 0.25 | 3.9 | 0.26 | 4.27 |
+ | falcon_7b | 65024 | 3.99 | 0.25 | 3.9 | 0.26 | 4.27 |
+ | fastchat_t5_3b | 32000 | 2.16 | 0.46 | 2.11 | 0.47 | 2.31 |
+ | flan_t5_base | 32100 | 3.61 | 0.28 | 3.53 | 0.28 | 3.87 |
+ | gemma_7b | 256000 | 3.91 | 0.26 | 3.82 | 0.26 | 4.18 |
+ | gpt2 | 50257 | 4.05 | 0.25 | 3.96 | 0.25 | 4.34 |
+ | gpt2_chinese | 21128 | 2.67 | 0.37 | 2.61 | 0.38 | 2.86 |
+ | gpt_35_turbo | 100277 | 4.11 | 0.24 | 4.01 | 0.25 | 4.4 |
+ | gpt_4 | 100277 | 4.11 | 0.24 | 4.01 | 0.25 | 4.4 |
+ | gpt_nexo_20b | 50254 | 4.04 | 0.25 | 3.94 | 0.25 | 4.32 |
+ | grok_1 | 131072 | 4.06 | 0.25 | 3.96 | 0.25 | 4.35 |
+ | internlm2_chat_7b | 92544 | 3.86 | 0.26 | 3.77 | 0.27 | 4.13 |
+ | internlm2_math_7b | 92544 | 3.86 | 0.26 | 3.77 | 0.27 | 4.13 |
+ | internlm_chat_7b | 103168 | 3.86 | 0.26 | 3.77 | 0.27 | 4.13 |
+ | internlm_xcomposer_7b | 103168 | 3.86 | 0.26 | 3.77 | 0.27 | 4.13 |
+ | jamba_v0_1 | 65536 | 3.82 | 0.26 | 3.73 | 0.27 | 4.09 |
+ | kplug | 10261 | 2.66 | 0.38 | 2.6 | 0.38 | 2.85 |
+ | llama | 32000 | 3.56 | 0.28 | 3.47 | 0.29 | 3.81 |
+ | llama2 | 32000 | 3.56 | 0.28 | 3.47 | 0.29 | 3.81 |
+ | llama3 | 128000 | 4.11 | 0.24 | 4.01 | 0.25 | 4.4 |
+ | mistral_7b | 32000 | 3.67 | 0.27 | 3.58 | 0.28 | 3.92 |
+ | mixtral_8_7b | 32000 | 3.67 | 0.27 | 3.58 | 0.28 | 3.92 |
+ | mobilebert_uncased | 30522 | 3.73 | 0.27 | 3.65 | 0.27 | 4 |
+ | moss | 106029 | 4.08 | 0.25 | 3.98 | 0.25 | 4.36 |
+ | mt5_large | 250100 | 3.3 | 0.3 | 3.22 | 0.31 | 3.53 |
+ | olmo_7b | 50280 | 4.04 | 0.25 | 3.94 | 0.25 | 4.32 |
+ | orion_14b_chat | 84608 | 3.94 | 0.25 | 3.85 | 0.26 | 4.22 |
+ | phi_1 | 50257 | 4.05 | 0.25 | 3.96 | 0.25 | 4.34 |
+ | phi_2 | 50257 | 4.05 | 0.25 | 3.96 | 0.25 | 4.34 |
+ | pko_t5_large | 50258 | 1.59 | 0.63 | 1.55 | 0.64 | 1.7 |
+ | prompt_clue | 32128 | 1.95 | 0.51 | 1.91 | 0.52 | 2.09 |
+ | qwen1_5_14b_chat | 151643 | 4.06 | 0.25 | 3.97 | 0.25 | 4.35 |
+ | qwen_1_8b_chat | 151851 | 4.06 | 0.25 | 3.97 | 0.25 | 4.35 |
+ | qwen_72b_chat | 151851 | 4.06 | 0.25 | 3.97 | 0.25 | 4.35 |
+ | qwen_7b_chat | 151851 | 4.06 | 0.25 | 3.97 | 0.25 | 4.35 |
+ | roberta_chinese_clue | 8021 | 1.8 | 0.56 | 1.75 | 0.57 | 1.92 |
+ | skywork_13b_base | 65519 | 3.56 | 0.28 | 3.47 | 0.29 | 3.81 |
+ | skywork_13b_math | 65519 | 3.56 | 0.28 | 3.47 | 0.29 | 3.81 |
+ | solar_10_7b | 32000 | 3.67 | 0.27 | 3.58 | 0.28 | 3.92 |
+ | starchat_alpha | 49152 | 3.63 | 0.28 | 3.54 | 0.28 | 3.88 |
+ | switch_c_2048 | 32100 | 3.61 | 0.28 | 3.53 | 0.28 | 3.87 |
+ | t5_base | 32100 | 3.61 | 0.28 | 3.53 | 0.28 | 3.87 |
+ | t5_large | 32100 | 3.61 | 0.28 | 3.53 | 0.28 | 3.87 |
+ | t5_small | 32100 | 3.61 | 0.28 | 3.53 | 0.28 | 3.87 |
+ | text_davinci_003 | 50281 | 4.05 | 0.25 | 3.96 | 0.25 | 4.34 |
+ | tigerbot_13b_chat_v2 | 60512 | 3.67 | 0.27 | 3.58 | 0.28 | 3.93 |
+ | tigerbot_70b_chat_v4_4k | 65107 | 3.65 | 0.27 | 3.57 | 0.28 | 3.91 |
+ | wizardcoder_15b_v1 | 49152 | 3.63 | 0.28 | 3.54 | 0.28 | 3.88 |
+ | wizardcoder_python_7b_v1 | 32000 | 3.56 | 0.28 | 3.47 | 0.29 | 3.81 |
+ | wizardlm_7b_v1 | 32000 | 3.56 | 0.28 | 3.47 | 0.29 | 3.81 |
+ | wizardmath_70b_v1 | 32000 | 3.56 | 0.28 | 3.47 | 0.29 | 3.81 |
+ | xlm_roberta | 250002 | 3.49 | 0.29 | 3.41 | 0.29 | 3.74 |
+ | yi_34b | 64000 | 3.87 | 0.26 | 3.78 | 0.26 | 4.15 |
+ | yi_6b | 64000 | 3.87 | 0.26 | 3.78 | 0.26 | 4.15 |
+ | yi_vl34b | 64000 | 3.88 | 0.26 | 3.79 | 0.26 | 4.16 |
+ | zephyr_7b_beta | 32000 | 3.67 | 0.27 | 3.58 | 0.28 | 3.92 |
+
+ </details>
+

<details> <summary>Simplified Chinese compression rate</summary>
Compression rate computed on the Simplified Chinese dataset cc100-zh-Hans

+ | tokenizer | vocab_size | g_bytes/b_tokens | b_tokens/g_bytes | t_bytes/t_tokens | t_tokens/t_bytes | n_chars/n_tokens |
+ |:----------|-----------:|-----------------:|-----------------:|-----------------:|-----------------:|-----------------:|
+ | amber | 32000 | 1.84 | 0.54 | 1.8 | 0.56 | 0.7 |
+ | aya_101 | 250100 | 3.89 | 0.26 | 3.79 | 0.26 | 1.47 |
+ | baichuan | 64000 | 3.92 | 0.26 | 3.82 | 0.26 | 1.48 |
+ | baichuan2 | 125696 | 4.53 | 0.22 | 4.42 | 0.23 | 1.71 |
+ | bert_base_cased | 28996 | 2.73 | 0.37 | 2.66 | 0.38 | 1.03 |
+ | bert_base_chinese | 21128 | 2.74 | 0.37 | 2.67 | 0.37 | 1.03 |
+ | bert_base_uncased | 30522 | 2.73 | 0.37 | 2.67 | 0.38 | 1.03 |
+ | bloom | 250680 | 4.28 | 0.23 | 4.18 | 0.24 | 1.62 |
+ | byt5_small | 256 | 0.93 | 1.08 | 0.91 | 1.1 | 0.35 |
+ | character_glm_6b | 64794 | 4.2 | 0.24 | 4.1 | 0.24 | 1.59 |
+ | chatglm2_6b | 64794 | 4.2 | 0.24 | 4.1 | 0.24 | 1.59 |
+ | chatglm3_6b | 64798 | 4.2 | 0.24 | 4.1 | 0.24 | 1.59 |
+ | chatglm_6b | 150344 | 4.65 | 0.22 | 4.54 | 0.22 | 1.76 |
+ | chatyuan_large_v2 | 32128 | 4.34 | 0.23 | 4.24 | 0.24 | 1.64 |
+ | chinese_llama | 49953 | 3.93 | 0.25 | 3.84 | 0.26 | 1.49 |
+ | chinese_llama2 | 55296 | 3.92 | 0.26 | 3.83 | 0.26 | 1.48 |
+ | code_davinci_002 | 50281 | 1.31 | 0.77 | 1.28 | 0.78 | 0.49 |
+ | crystal_coder | 32000 | 1.86 | 0.54 | 1.81 | 0.55 | 0.7 |
+ | dbrx_instruct | 100277 | 2.26 | 0.44 | 2.21 | 0.45 | 0.85 |
+ | deepseek_coder_33b_instruct | 32000 | 3.4 | 0.29 | 3.32 | 0.3 | 1.29 |
+ | deepseek_llm_7b_base | 100000 | 4.05 | 0.25 | 3.96 | 0.25 | 1.53 |
+ | falcon_180b | 65024 | 2.18 | 0.46 | 2.13 | 0.47 | 0.82 |
+ | falcon_7b | 65024 | 2.18 | 0.46 | 2.13 | 0.47 | 0.82 |
+ | fastchat_t5_3b | 32000 | 13.7 | 0.07 | 13.38 | 0.07 | 5.18 |
+ | flan_t5_base | 32100 | 14.13 | 0.07 | 13.8 | 0.07 | 5.34 |
+ | gemma_7b | 256000 | 3.82 | 0.26 | 3.73 | 0.27 | 1.44 |
+ | gpt2 | 50257 | 1.31 | 0.77 | 1.28 | 0.78 | 0.49 |
+ | gpt2_chinese | 21128 | 2.73 | 0.37 | 2.66 | 0.38 | 1.03 |
+ | gpt_35_turbo | 100277 | 2.26 | 0.44 | 2.21 | 0.45 | 0.85 |
+ | gpt_4 | 100277 | 2.26 | 0.44 | 2.21 | 0.45 | 0.85 |
+ | gpt_nexo_20b | 50254 | 2.01 | 0.5 | 1.96 | 0.51 | 0.76 |
+ | grok_1 | 131072 | 1.73 | 0.58 | 1.69 | 0.59 | 0.66 |
+ | internlm2_chat_7b | 92544 | 4.23 | 0.24 | 4.13 | 0.24 | 1.6 |
+ | internlm2_math_7b | 92544 | 4.23 | 0.24 | 4.13 | 0.24 | 1.6 |
+ | internlm_chat_7b | 103168 | 4.23 | 0.24 | 4.14 | 0.24 | 1.6 |
+ | internlm_xcomposer_7b | 103168 | 4.23 | 0.24 | 4.14 | 0.24 | 1.6 |
+ | jamba_v0_1 | 65536 | 2.3 | 0.44 | 2.24 | 0.45 | 0.87 |
+ | kplug | 10261 | 2.72 | 0.37 | 2.65 | 0.38 | 1.03 |
+ | llama | 32000 | 1.84 | 0.54 | 1.8 | 0.56 | 0.7 |
+ | llama2 | 32000 | 1.84 | 0.54 | 1.8 | 0.56 | 0.7 |
+ | llama3 | 128000 | 3.28 | 0.3 | 3.2 | 0.31 | 1.24 |
+ | mistral_7b | 32000 | 2.36 | 0.42 | 2.3 | 0.43 | 0.89 |
+ | mixtral_8_7b | 32000 | 2.36 | 0.42 | 2.3 | 0.43 | 0.89 |
+ | mobilebert_uncased | 30522 | 2.73 | 0.37 | 2.67 | 0.38 | 1.03 |
+ | moss | 106029 | 4.4 | 0.23 | 4.3 | 0.23 | 1.66 |
+ | mt5_large | 250100 | 3.89 | 0.26 | 3.79 | 0.26 | 1.47 |
+ | olmo_7b | 50280 | 2.01 | 0.5 | 1.96 | 0.51 | 0.76 |
+ | orion_14b_chat | 84608 | 4.63 | 0.22 | 4.52 | 0.22 | 1.75 |
+ | phi_1 | 50257 | 1.31 | 0.77 | 1.28 | 0.78 | 0.49 |
+ | phi_2 | 50257 | 1.31 | 0.77 | 1.28 | 0.78 | 0.49 |
+ | pko_t5_large | 50258 | 0.97 | 1.03 | 0.95 | 1.06 | 0.37 |
+ | prompt_clue | 32128 | 4.34 | 0.23 | 4.24 | 0.24 | 1.64 |
+ | qwen1_5_14b_chat | 151643 | 4.16 | 0.24 | 4.06 | 0.25 | 1.57 |
+ | qwen_1_8b_chat | 151851 | 4.16 | 0.24 | 4.06 | 0.25 | 1.57 |
+ | qwen_72b_chat | 151851 | 4.16 | 0.24 | 4.06 | 0.25 | 1.57 |
+ | qwen_7b_chat | 151851 | 4.16 | 0.24 | 4.06 | 0.25 | 1.57 |
+ | roberta_chinese_clue | 8021 | 2.7 | 0.37 | 2.64 | 0.38 | 1.02 |
+ | skywork_13b_base | 65519 | 3.69 | 0.27 | 3.61 | 0.28 | 1.4 |
+ | skywork_13b_math | 65519 | 3.69 | 0.27 | 3.61 | 0.28 | 1.4 |
+ | solar_10_7b | 32000 | 2.36 | 0.42 | 2.3 | 0.43 | 0.89 |
+ | starchat_alpha | 49152 | 2.78 | 0.36 | 2.72 | 0.37 | 1.05 |
+ | switch_c_2048 | 32100 | 14.13 | 0.07 | 13.8 | 0.07 | 5.34 |
+ | t5_base | 32100 | 14.13 | 0.07 | 13.8 | 0.07 | 5.34 |
+ | t5_large | 32100 | 14.13 | 0.07 | 13.8 | 0.07 | 5.34 |
+ | t5_small | 32100 | 14.13 | 0.07 | 13.8 | 0.07 | 5.34 |
+ | text_davinci_003 | 50281 | 1.31 | 0.77 | 1.28 | 0.78 | 0.49 |
+ | tigerbot_13b_chat_v2 | 60512 | 4.25 | 0.24 | 4.15 | 0.24 | 1.61 |
+ | tigerbot_70b_chat_v4_4k | 65107 | 4.25 | 0.24 | 4.15 | 0.24 | 1.61 |
+ | wizardcoder_15b_v1 | 49152 | 2.78 | 0.36 | 2.72 | 0.37 | 1.05 |
+ | wizardcoder_python_7b_v1 | 32000 | 1.84 | 0.54 | 1.8 | 0.56 | 0.7 |
+ | wizardlm_7b_v1 | 32000 | 1.84 | 0.54 | 1.8 | 0.56 | 0.7 |
+ | wizardmath_70b_v1 | 32000 | 1.84 | 0.54 | 1.8 | 0.56 | 0.7 |
+ | xlm_roberta | 250002 | 3.96 | 0.25 | 3.86 | 0.26 | 1.5 |
+ | yi_34b | 64000 | 4.17 | 0.24 | 4.07 | 0.25 | 1.58 |
+ | yi_6b | 64000 | 4.17 | 0.24 | 4.07 | 0.25 | 1.58 |
+ | yi_vl34b | 64000 | 4.11 | 0.24 | 4.02 | 0.25 | 1.56 |
+ | zephyr_7b_beta | 32000 | 2.36 | 0.42 | 2.3 | 0.43 | 0.89 |

</details>

- | tokenizer | vocab_size | g_bytes/b_tokens | t_bytes/t_tokens | b_tokens/g_bytes |
- |:----------|-----------:|-----------------:|-----------------:|-----------------:|
- | amber | 32000 | 1.84 | 1.8 | 0.54 |
- | aya_101 | 250100 | 3.89 | 3.79 | 0.26 |
- | baichuan | 64000 | 3.92 | 3.82 | 0.26 |
- | baichuan2 | 125696 | 4.53 | 4.42 | 0.22 |
- | bert_base_cased | 28996 | 2.73 | 2.66 | 0.37 |
- | bert_base_chinese | 21128 | 2.74 | 2.67 | 0.37 |
- | bert_base_uncased | 30522 | 2.73 | 2.67 | 0.37 |
- | bloom | 250680 | 4.28 | 4.18 | 0.23 |
- | byt5_small | 256 | 0.93 | 0.91 | 1.08 |
- | character_glm_6b | 64794 | 4.2 | 4.1 | 0.24 |
- | chatglm2_6b | 64794 | 4.2 | 4.1 | 0.24 |
- | chatglm3_6b | 64798 | 4.2 | 4.1 | 0.24 |
- | chatglm_6b | 150344 | 4.65 | 4.54 | 0.22 |
- | chatyuan_large_v2 | 32128 | 4.34 | 4.24 | 0.23 |
- | chinese_llama | 49953 | 3.93 | 3.84 | 0.25 |
- | chinese_llama2 | 55296 | 3.92 | 3.83 | 0.26 |
- | code_davinci_002 | 50281 | 1.31 | 1.28 | 0.77 |
- | crystal_coder | 32000 | 1.86 | 1.81 | 0.54 |
- | deepseek_coder_33b_instruct | 32000 | 3.4 | 3.32 | 0.29 |
- | deepseek_llm_7b_base | 100000 | 4.05 | 3.96 | 0.25 |
- | falcon_180b | 65024 | 2.18 | 2.13 | 0.46 |
- | falcon_7b | 65024 | 2.18 | 2.13 | 0.46 |
- | fastchat_t5_3b | 32000 | 13.7 | 13.38 | 0.07 |
- | flan_t5_base | 32100 | 14.13 | 13.8 | 0.07 |
- | gemma_7b | 256000 | 3.82 | 3.73 | 0.26 |
- | gpt2 | 50257 | 1.31 | 1.28 | 0.77 |
- | gpt2_chinese | 21128 | 2.73 | 2.66 | 0.37 |
- | gpt_35_turbo | 100277 | 2.26 | 2.21 | 0.44 |
- | gpt_4 | 100277 | 2.26 | 2.21 | 0.44 |
- | gpt_nexo_20b | 50254 | 2.01 | 1.96 | 0.5 |
- | internlm2_chat_7b | 92544 | 4.23 | 4.13 | 0.24 |
- | internlm2_math_7b | 92544 | 4.23 | 4.13 | 0.24 |
- | internlm_chat_7b | 103168 | 4.23 | 4.14 | 0.24 |
- | internlm_xcomposer_7b | 103168 | 4.23 | 4.14 | 0.24 |
- | kplug | 10261 | 2.72 | 2.65 | 0.37 |
- | llama | 32000 | 1.84 | 1.8 | 0.54 |
- | llama2 | 32000 | 1.84 | 1.8 | 0.54 |
- | mistral_7b | 32000 | 2.36 | 2.3 | 0.42 |
- | mixtral_8_7b | 32000 | 2.36 | 2.3 | 0.42 |
- | mobilebert_uncased | 30522 | 2.73 | 2.67 | 0.37 |
- | moss | 106029 | 4.4 | 4.3 | 0.23 |
- | mt5_large | 250100 | 3.89 | 3.79 | 0.26 |
- | olmo_7b | 50280 | 2.01 | 1.96 | 0.5 |
- | orion_14b_chat | 84608 | 4.63 | 4.52 | 0.22 |
- | phi_1 | 50257 | 1.31 | 1.28 | 0.77 |
- | phi_2 | 50257 | 1.31 | 1.28 | 0.77 |
- | pko_t5_large | 50258 | 0.97 | 0.95 | 1.03 |
- | prompt_clue | 32128 | 4.34 | 4.24 | 0.23 |
- | qwen1_5_14b_chat | 151643 | 4.16 | 4.06 | 0.24 |
- | qwen_1_8b_chat | 151851 | 4.16 | 4.06 | 0.24 |
- | qwen_72b_chat | 151851 | 4.16 | 4.06 | 0.24 |
- | qwen_7b_chat | 151851 | 4.16 | 4.06 | 0.24 |
- | roberta_chinese_clue | 8021 | 2.7 | 2.64 | 0.37 |
- | skywork_13b_base | 65519 | 3.69 | 3.61 | 0.27 |
- | skywork_13b_math | 65519 | 3.69 | 3.61 | 0.27 |
- | solar_10_7b | 32000 | 2.36 | 2.3 | 0.42 |
- | starchat_alpha | 49152 | 2.78 | 2.72 | 0.36 |
- | switch_c_2048 | 32100 | 14.13 | 13.8 | 0.07 |
- | t5_base | 32100 | 14.13 | 13.8 | 0.07 |
- | t5_large | 32100 | 14.13 | 13.8 | 0.07 |
- | t5_small | 32100 | 14.13 | 13.8 | 0.07 |
- | text_davinci_003 | 50281 | 1.31 | 1.28 | 0.77 |
- | tigerbot_13b_chat_v2 | 60512 | 4.25 | 4.15 | 0.24 |
- | tigerbot_70b_chat_v4_4k | 65107 | 4.25 | 4.15 | 0.24 |
- | wizardcoder_15b_v1 | 49152 | 2.78 | 2.72 | 0.36 |
- | wizardcoder_python_7b_v1 | 32000 | 1.84 | 1.8 | 0.54 |
- | wizardlm_7b_v1 | 32000 | 1.84 | 1.8 | 0.54 |
- | wizardmath_70b_v1 | 32000 | 1.84 | 1.8 | 0.54 |
- | xlm_roberta | 250002 | 3.96 | 3.86 | 0.25 |
- | yi_34b | 64000 | 4.17 | 4.07 | 0.24 |
- | yi_6b | 64000 | 4.17 | 4.07 | 0.24 |
- | yi_vl34b | 64000 | 4.11 | 4.02 | 0.24 |
- | zephyr_7b_beta | 32000 | 2.36 | 2.3 | 0.42 |
-
- **Conclusion**
- larger vocabulary sizes
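The unit names in the tables above can be derived from the raw counts stored in the `stats/compress_rate/*.json` files added below. A minimal sketch of the conversion, assuming `g_bytes`/`t_bytes` are binary GiB/TiB and `b_tokens`/`t_tokens` are decimal billions/trillions (this assumption reproduces the table values; `utils/compress_rate_util.py` remains the authoritative implementation):

```python
import json

def compress_rate(stats_path: str) -> dict:
    """Derive the README's compression-rate units from one stats file."""
    with open(stats_path) as f:
        s = json.load(f)  # keys: vocab_size, n_bytes, n_tokens, n_chars
    g_bytes = s["n_bytes"] / 1024**3    # corpus size in GiB
    t_bytes = s["n_bytes"] / 1024**4    # corpus size in TiB
    b_tokens = s["n_tokens"] / 1e9      # token count in billions
    t_tokens = s["n_tokens"] / 1e12     # token count in trillions
    return {
        "g_bytes/b_tokens": round(g_bytes / b_tokens, 2),
        "b_tokens/g_bytes": round(b_tokens / g_bytes, 2),
        "t_bytes/t_tokens": round(t_bytes / t_tokens, 2),
        "t_tokens/t_bytes": round(t_tokens / t_bytes, 2),
        "n_chars/n_tokens": round(s["n_chars"] / s["n_tokens"], 2),
    }

# amber on cc100-zh-Hans (n_bytes=2633047, n_tokens=1330093, n_chars=927311)
# yields g_bytes/b_tokens=1.84 and n_chars/n_tokens=0.7, matching the table.
print(compress_rate("stats/compress_rate/amber.zh-Hans.json"))
```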
 
app.py CHANGED
@@ -78,13 +78,13 @@ with gr.Blocks(css="css/style.css", title="Tokenizer Arena") as demo:
        gr.Markdown("Please select corpus and unit of compress rate, get more details at [github](https://github.com/xu-song/tokenizer-arena/). ")
        with gr.Row():
            compress_rate_corpus = gr.CheckboxGroup(
-               ["cc100-en", "cc100-zh-Hans", "cc100-es", "code"],
+               ["cc100-en", "cc100-zh-Hans", "cc100-es"],  # , "code"
                value=["cc100-en", "cc100-zh-Hans"],
                label="corpus",
                # info=""
            )
            compress_rate_unit = gr.Radio(
-               ["b_tokens/g_bytes", "g_bytes/b_tokens", "t_tokens/t_bytes", "t_bytes/t_tokens"],
+               ["b_tokens/g_bytes", "g_bytes/b_tokens", "t_tokens/t_bytes", "t_bytes/t_tokens", "n_chars/n_tokens"],
                value="b_tokens/g_bytes",
                label="unit",
            )
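For context, these two option lists feed Gradio's standard event flow. A hypothetical minimal wiring sketch, not the app's actual code (the `show_rate` callback and its `RATES` lookup are invented for illustration; the values come from the README table):

```python
import gradio as gr

# Invented lookup for illustration; the real app reads stats/compress_rate/*.json.
RATES = {
    ("cc100-zh-Hans", "llama"): {"b_tokens/g_bytes": 0.54, "n_chars/n_tokens": 0.7},
    ("cc100-zh-Hans", "llama3"): {"b_tokens/g_bytes": 0.3, "n_chars/n_tokens": 1.24},
}

def show_rate(corpora, unit):  # hypothetical callback
    lines = [f"{tok} @ {c}: {units[unit]} {unit}"
             for (c, tok), units in RATES.items()
             if c in corpora and unit in units]
    return "\n".join(lines) or "no data for this selection"

with gr.Blocks() as demo:
    corpus = gr.CheckboxGroup(["cc100-en", "cc100-zh-Hans", "cc100-es"],
                              value=["cc100-zh-Hans"], label="corpus")
    unit = gr.Radio(["b_tokens/g_bytes", "g_bytes/b_tokens", "t_tokens/t_bytes",
                     "t_bytes/t_tokens", "n_chars/n_tokens"],
                    value="b_tokens/g_bytes", label="unit")
    out = gr.Textbox(label="compress rate")
    corpus.change(show_rate, [corpus, unit], out)
    unit.change(show_rate, [corpus, unit], out)

demo.launch()
```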
stats/README.md ADDED
File without changes
stats/compress_rate/amber.en.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 32000, "n_bytes": 1124813, "n_tokens": 294627, "n_chars": 1121360}
stats/compress_rate/amber.zh-Hans.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 32000, "n_bytes": 2633047, "n_tokens": 1330093, "n_chars": 927311}
stats/compress_rate/aya_101.en.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 250100, "n_bytes": 1124813, "n_tokens": 317881, "n_chars": 1121360}
stats/compress_rate/aya_101.zh-Hans.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 250100, "n_bytes": 2633047, "n_tokens": 631182, "n_chars": 927311}
stats/compress_rate/baichuan.en.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 64000, "n_bytes": 1124813, "n_tokens": 280108, "n_chars": 1121360}
stats/compress_rate/baichuan.zh-Hans.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 64000, "n_bytes": 2633047, "n_tokens": 626117, "n_chars": 927311}
stats/compress_rate/baichuan2.en.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 125696, "n_bytes": 1124813, "n_tokens": 269011, "n_chars": 1121360}
stats/compress_rate/baichuan2.zh-Hans.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 125696, "n_bytes": 2633047, "n_tokens": 541464, "n_chars": 927311}
stats/compress_rate/bert_base_cased.en.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 28996, "n_bytes": 1124813, "n_tokens": 288022, "n_chars": 1121360}
stats/compress_rate/bert_base_cased.zh-Hans.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 28996, "n_bytes": 2633047, "n_tokens": 899709, "n_chars": 927311}
stats/compress_rate/bert_base_chinese.en.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 21128, "n_bytes": 1124813, "n_tokens": 377068, "n_chars": 1121360}
stats/compress_rate/bert_base_chinese.zh-Hans.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 21128, "n_bytes": 2633047, "n_tokens": 896599, "n_chars": 927311}
stats/compress_rate/bert_base_uncased.en.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 30522, "n_bytes": 1124813, "n_tokens": 280575, "n_chars": 1121360}
stats/compress_rate/bert_base_uncased.zh-Hans.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 30522, "n_bytes": 2633047, "n_tokens": 898554, "n_chars": 927311}
stats/compress_rate/bloom.en.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 250680, "n_bytes": 1124813, "n_tokens": 257405, "n_chars": 1121360}
stats/compress_rate/bloom.zh-Hans.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 250680, "n_bytes": 2633047, "n_tokens": 573008, "n_chars": 927311}
stats/compress_rate/byt5_small.en.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 256, "n_bytes": 1124813, "n_tokens": 1134813, "n_chars": 1121360}
stats/compress_rate/byt5_small.zh-Hans.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 256, "n_bytes": 2633047, "n_tokens": 2643047, "n_chars": 927311}
stats/compress_rate/character_glm_6b.en.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 64794, "n_bytes": 1124813, "n_tokens": 289347, "n_chars": 1121360}
stats/compress_rate/character_glm_6b.zh-Hans.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 64794, "n_bytes": 2633047, "n_tokens": 583646, "n_chars": 927311}
stats/compress_rate/chatglm2_6b.en.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 64794, "n_bytes": 1124813, "n_tokens": 289329, "n_chars": 1121360}
stats/compress_rate/chatglm2_6b.zh-Hans.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 64794, "n_bytes": 2633047, "n_tokens": 583646, "n_chars": 927311}
stats/compress_rate/chatglm3_6b.en.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 64798, "n_bytes": 1124813, "n_tokens": 289347, "n_chars": 1121360}
stats/compress_rate/chatglm3_6b.zh-Hans.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 64798, "n_bytes": 2633047, "n_tokens": 583646, "n_chars": 927311}
stats/compress_rate/chatglm_6b.en.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 150344, "n_bytes": 1124813, "n_tokens": 284761, "n_chars": 1121360}
stats/compress_rate/chatglm_6b.zh-Hans.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 150344, "n_bytes": 2633047, "n_tokens": 527384, "n_chars": 927311}
stats/compress_rate/chatyuan_large_v2.en.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 32128, "n_bytes": 1124813, "n_tokens": 536033, "n_chars": 1121360}
stats/compress_rate/chatyuan_large_v2.zh-Hans.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 32128, "n_bytes": 2633047, "n_tokens": 564905, "n_chars": 927311}
stats/compress_rate/chinese_llama.en.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 49953, "n_bytes": 1124813, "n_tokens": 291514, "n_chars": 1121360}
stats/compress_rate/chinese_llama.zh-Hans.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 49953, "n_bytes": 2633047, "n_tokens": 623219, "n_chars": 927311}
stats/compress_rate/chinese_llama2.en.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 55296, "n_bytes": 1124813, "n_tokens": 294627, "n_chars": 1121360}
stats/compress_rate/chinese_llama2.zh-Hans.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 55296, "n_bytes": 2633047, "n_tokens": 625766, "n_chars": 927311}
stats/compress_rate/code_davinci_002.en.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 50281, "n_bytes": 1124813, "n_tokens": 258403, "n_chars": 1121360}
stats/compress_rate/code_davinci_002.zh-Hans.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 50281, "n_bytes": 2633047, "n_tokens": 1876809, "n_chars": 927311}
stats/compress_rate/crystal_coder.en.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 32000, "n_bytes": 1124813, "n_tokens": 284627, "n_chars": 1121360}
stats/compress_rate/crystal_coder.zh-Hans.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 32000, "n_bytes": 2633047, "n_tokens": 1320093, "n_chars": 927311}
stats/compress_rate/dbrx_instruct.en.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 100277, "n_bytes": 1124813, "n_tokens": 254985, "n_chars": 1121360}
stats/compress_rate/dbrx_instruct.zh-Hans.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 100277, "n_bytes": 2633047, "n_tokens": 1084939, "n_chars": 927311}
stats/compress_rate/deepseek_coder_33b_instruct.en.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 32000, "n_bytes": 1124813, "n_tokens": 287408, "n_chars": 1121360}
stats/compress_rate/deepseek_coder_33b_instruct.zh-Hans.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 32000, "n_bytes": 2633047, "n_tokens": 720577, "n_chars": 927311}
stats/compress_rate/deepseek_llm_7b_base.en.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 100000, "n_bytes": 1124813, "n_tokens": 272324, "n_chars": 1121360}
stats/compress_rate/deepseek_llm_7b_base.zh-Hans.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 100000, "n_bytes": 2633047, "n_tokens": 605081, "n_chars": 927311}
stats/compress_rate/falcon_180b.en.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 65024, "n_bytes": 1124813, "n_tokens": 262509, "n_chars": 1121360}
stats/compress_rate/falcon_180b.zh-Hans.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 65024, "n_bytes": 2633047, "n_tokens": 1124681, "n_chars": 927311}
stats/compress_rate/falcon_7b.en.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 65024, "n_bytes": 1124813, "n_tokens": 262509, "n_chars": 1121360}
stats/compress_rate/falcon_7b.zh-Hans.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 65024, "n_bytes": 2633047, "n_tokens": 1124681, "n_chars": 927311}
stats/compress_rate/fastchat_t5_3b.en.json ADDED
@@ -0,0 +1 @@
+ {"vocab_size": 32000, "n_bytes": 1124813, "n_tokens": 484941, "n_chars": 1121360}