problem in tokenizing
Hi, thank you for the great work.
That said, I encountered some strange results when encoding text with the tokenizer.
Here is an example:
sys_message = "당신은 도움이 되고 정중하며 정직한 조수입니다. 안전을 유지하면서 항상 가능한 한 도움이 되는 답변을 해주세요. 귀하의 답변에는 유해하거나, 비윤리적이거나, 인종차별적이거나, 성차별적이거나, 독성이 있거나, 위험하거나 불법적인 콘텐츠가 포함되어서는 안 됩니다. 귀하의 응답은 사회적으로 편견이 없고 긍정적인 내용이어야 합니다."
tokenizer.decode(tokenizer(sys_message)['input_ids'])
'<s>당 신은 도 움이 되고 정중 하며 정직한 조 수입니다 . 안전을 유지 하면서 항상 가능한 한 도 움이 되는 답 변을 해주세요 . 귀 하의 답변 에는 유해 하거나 , 비 윤 리 적이 거나 , 인종차별 적이 거나 , 성 차별 적이 거나 , 독성이 있 거나 , 위험 하거나 불법 적인 콘텐츠가 포함 되어 서는 안 됩니다 . 귀 하의 응 답은 사회적으로 편 견이 없고 긍 정적인 내용 이어야 합니다 .'
Moreover, when using this model in conversation or text-generation pipelines, it is slower than comparable models with similar generation configs and parameter counts, such as beomi/llama-2-koen-13b.
Could this slowness be related to the tokenizer?
Thanks.
Hello Jeonghwan and Young Woo,
Thank you for pointing out the issue. Yes, I'm aware of it. The problem seems to be that all the tokens from the "added_tokens.json" file are treated as special tokens. I'm not sure whether this is intentional, since there is a separate "special_tokens_map.json" file for special tokens. This causes the tokenizer to insert a space after each token I added.
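If you want to reproduce the diagnosis quickly, here is a minimal sketch; the repo id is an assumption on my part, so adjust it to the checkpoint you are using:
from transformers import AutoTokenizer
# Repo id assumed for illustration.
tok = AutoTokenizer.from_pretrained("yanolja/KoSOLAR-10.7B-v0.1")
# Tokens coming from "added_tokens.json" are registered as added tokens;
# when they behave like special tokens, decode() re-inserts spaces around them.
added = tok.get_added_vocab()              # token string -> id
print(len(added))                          # how many tokens were added
print(tok.convert_ids_to_tokens([32000]))  # one of the added ids
text = "당신은 도움이 되고 정중하며 정직한 조수입니다."
print(tok.decode(tok(text)["input_ids"]))  # spurious spaces appear inside words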
Here's a workaround I've been using:
# This snippet assumes it runs inside an incremental detokenization loop that
# defines prev_tokens, new_tokens, new_text, output_tokens and prefix_offset.
if prev_tokens is not None:
    last = tokenizer.convert_tokens_to_ids(prev_tokens[-1:])
    if last[0] > 32000:  # the previous token is one of the newly added tokens
        next = new_tokens[-1]
        if next[0] == "▁":  # strip the spurious SentencePiece space marker
            suffix = ""
            if len(next) > 1:
                suffix = new_text[-(len(next) - 1):]
            new_text = new_text[:-len(next)] + suffix
            new_tokens[-1] = next[1:]
            if new_tokens[-1] == "":
                # the token was only the space marker, so drop it entirely
                new_tokens = new_tokens[:-1]
                output_tokens = output_tokens[:-1]
                prefix_offset -= 1
            else:
                output_tokens[-1] = new_tokens[-1]
It's a bit of a quick fix but it's working for now. I plan to address this issue by adding the tokens directly into the tokenizer. Sorry for any trouble this has caused.
Thanks,
Seungduk
It looks like my previous answer was wrong; I misunderstood how the tokenizer works. To fix this properly, I need to merge the added tokens into the original tokenizer model, which is not straightforward. Also, the slowness that Young Woo mentioned could be related to the large number of added tokens. Let me get back to you with a solution as soon as possible. Thank you for your understanding.
Hi Jeonghwan and Young Woo,
I've been investigating this issue for some time and realized that the 'merges' in the tokenizer configuration were the cause. Jaewon helped me solve this issue and also wrote a blog post about it here: https://seen-point-bd9.notion.site/Tokenizer-Expansion-ecb6d78211a54ba6b3cf8ebc0ec1d105
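The gist, as I understand it, is to fold the new tokens into the BPE model itself (vocab plus merge rules) instead of listing them as added tokens. Below is only a rough sketch of that idea, not the exact script from the post; the token and merge rule are made-up examples:
import json
# Assumes the standard tokenizer.json layout with a BPE model
# (model.vocab and model.merges; merges shown here in the "left right" string format).
with open("tokenizer.json", encoding="utf-8") as f:
    tok_data = json.load(f)
vocab = tok_data["model"]["vocab"]    # token string -> id
merges = tok_data["model"]["merges"]  # ordered merge rules (order = priority)
# Hypothetical example: both halves must already exist in the vocab
# for the new merge rule to be reachable during encoding.
new_token, left, right = "▁당신", "▁당", "신"
if new_token not in vocab:
    vocab[new_token] = max(vocab.values()) + 1
    merges.append(f"{left} {right}")  # appended rules get the lowest priority
with open("tokenizer_merged.json", "w", encoding="utf-8") as f:
    json.dump(tok_data, f, ensure_ascii=False)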
As you can see in the blog post, KoSOLAR v0.1's tokenizer does not work as intended: the encoded result is longer than it should be, although it is still shorter than what the original tokenizer produces.
<s> 당분간 주택 가격에 큰 조정이 일어나거나 하는 계기가 발생하지 않는 한 이들의 주택 복귀는 당분간 어려워 보인다는 것이 중론이다
# KoSOLAR v0.1 tokenizer
[1, 28705, 30287, 41768, 259, 34740, 259, 35790, 28705, 29148, 28705, 31694, 28705, 37585, 28705, 29015, 28705, 29415, 32633, 32400, 259, 32029, 259, 30106, 32453, 259, 46354, 32208, 259, 30104, 29175, 28705, 29282, 28705, 29015, 32173, 259, 34740, 259, 30357, 46682, 28705, 29175, 28705, 30287, 41768, 259, 29433, 30710, 31126, 28705, 29477, 33020, 28705, 29175, 28705, 38655, 259, 30027, 39265, 28705, 29043]
# KoSOLAR v0.2 tokenizer
[1, 32119, 41768, 34375, 42984, 32386, 32052, 33335, 33725, 32400, 32254, 39212, 32512, 32208, 32440, 32026, 35964, 34375, 34822, 29175, 32119, 41768, 38294, 39093, 32264, 32212, 32039, 46611, 32034]
As demonstrated, the revised tokenizer outputs a much shorter list of token IDs, most of which are newly added tokens (>= 32000). This also means that many embeddings in embed_tokens and lm_head were not sufficiently trained in KoSOLAR v0.1, because the corresponding token IDs were not generated frequently enough by the tokenizer. Therefore, if I simply replace the tokenizer, performance will degrade significantly; I confirmed this by running an eval with the new tokenizer. I had hoped that my mistake would only affect the decoding process, but it turned out to be the opposite: it was actually the encoding process that was affected.
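As a rough way to see this effect, you can measure how often each tokenizer emits the added IDs (>= 32000) on some Korean text; the repo id and local path below are placeholders:
from transformers import AutoTokenizer
# Placeholders: point these at the v0.1 tokenizer and the fixed one.
old_tok = AutoTokenizer.from_pretrained("yanolja/KoSOLAR-10.7B-v0.1")
new_tok = AutoTokenizer.from_pretrained("./tokenizer_fixed")
def added_token_ratio(tok, texts, boundary=32000):
    # fraction of emitted ids that correspond to the newly added tokens
    ids = [i for t in texts for i in tok(t)["input_ids"]]
    return sum(i >= boundary for i in ids) / max(len(ids), 1)
sample = ["당분간 주택 가격에 큰 조정이 일어나거나 하는 계기가 발생하지 않는 한 이들의 주택 복귀는 당분간 어려워 보인다는 것이 중론이다"]
print(added_token_ratio(old_tok, sample))  # low: added embeddings rarely hit
print(added_token_ratio(new_tok, sample))  # high: added embeddings dominate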
I will upload a new version with a fix, but it will take some time. I hope to upload the new version by January 12.
Thanks,
Seungduk