problem in tokenizing
Hi, thank you for the great work.
That said, I encountered some strange results when encoding text with the tokenizer.
Here is an example:
sys_message = "당신은 도움이 되고 정중하며 정직한 조수입니다. 안전을 유지하면서 항상 가능한 한 도움이 되는 답변을 해주세요. 귀하의 답변에는 유해하거나, 비윤리적이거나, 인종차별적이거나, 성차별적이거나, 독성이 있거나, 위험하거나 불법적인 콘텐츠가 포함되어서는 안 됩니다. 귀하의 응답은 사회적으로 편견이 없고 긍정적인 내용이어야 합니다."
tokenizer.decode(tokenizer(sys_message)['input_ids'])
'<s>당 신은 도 움이 되고 정중 하며 정직한 조 수입니다 . 안전을 유지 하면서 항상 가능한 한 도 움이 되는 답 변을 해주세요 . 귀 하의 답변 에는 유해 하거나 , 비 윤 리 적이 거나 , 인종차별 적이 거나 , 성 차별 적이 거나 , 독성이 있 거나 , 위험 하거나 불법 적인 콘텐츠가 포함 되어 서는 안 됩니다 . 귀 하의 응 답은 사회적으로 편 견이 없고 긍 정적인 내용 이어야 합니다 .'
Moreover, when using this model in conversation or text-generation pipelines, it is slower than comparable models with similar generation configs and parameter counts, such as beomi/llama-2-koen-13b.
Could this slowness be related to the tokenizer?
Thanks.
Hello Jeonghwan and Young Woo,
Thank you for pointing out the issue. Yes, I'm aware of it. The problem seems to be that all the tokens from the "added_tokens.json" file are treated as special tokens. I'm not sure whether this is intentional, since there is a separate "special_tokens_map.json" file for special tokens. This causes the tokenizer to insert a space after each token I added.
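If you want to reproduce the diagnosis quickly, here is a minimal sketch; the repo id is an assumption on my part, so adjust it to the checkpoint you are using:
from transformers import AutoTokenizer
# Repo id assumed for illustration.
tok = AutoTokenizer.from_pretrained("yanolja/KoSOLAR-10.7B-v0.1")
# Tokens coming from "added_tokens.json" are registered as added tokens;
# when they behave like special tokens, decode() re-inserts spaces around them.
added = tok.get_added_vocab()              # token string -> id
print(len(added))                          # how many tokens were added
print(tok.convert_ids_to_tokens([32000]))  # one of the added ids
text = "당신은 도움이 되고 정중하며 정직한 조수입니다."
print(tok.decode(tok(text)["input_ids"]))  # spurious spaces appear inside words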
Here's a workaround I've been using:
# This snippet assumes it runs inside an incremental detokenization loop that
# defines prev_tokens, new_tokens, new_text, output_tokens and prefix_offset.
if prev_tokens is not None:
    last = tokenizer.convert_tokens_to_ids(prev_tokens[-1:])
    if last[0] > 32000:  # the previous token is one of the newly added tokens
        next = new_tokens[-1]
        if next[0] == "▁":  # strip the spurious SentencePiece space marker
            suffix = ""
            if len(next) > 1:
                suffix = new_text[-(len(next) - 1):]
            new_text = new_text[:-len(next)] + suffix
            new_tokens[-1] = next[1:]
            if new_tokens[-1] == "":
                # the token was only the space marker, so drop it entirely
                new_tokens = new_tokens[:-1]
                output_tokens = output_tokens[:-1]
                prefix_offset -= 1
            else:
                output_tokens[-1] = new_tokens[-1]
It's a bit of a quick fix but it's working for now. I plan to address this issue by adding the tokens directly into the tokenizer. Sorry for any trouble this has caused.
Thanks,
Seungduk
It looks like my previous answer was wrong; I misunderstood how the tokenizer works. To fix this properly, I need to merge the added tokens into the original tokenizer model, which is not straightforward. Also, the slowness that Young Woo mentioned could be related to the large number of added tokens. Let me get back to you with a solution as soon as possible. Thank you for your understanding.
Hi Jeonghwan and Young Woo,
I've been investigating this issue for some time and realized that the 'merges' in the tokenizer configuration were the cause. Jaewon helped me solve this issue and also wrote a blog post about it here: https://seen-point-bd9.notion.site/Tokenizer-Expansion-ecb6d78211a54ba6b3cf8ebc0ec1d105
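The gist, as I understand it, is to fold the new tokens into the BPE model itself (vocab plus merge rules) instead of listing them as added tokens. Below is only a rough sketch of that idea, not the exact script from the post; the token and merge rule are made-up examples:
import json
# Assumes the standard tokenizer.json layout with a BPE model
# (model.vocab and model.merges; merges shown here in the "left right" string format).
with open("tokenizer.json", encoding="utf-8") as f:
    tok_data = json.load(f)
vocab = tok_data["model"]["vocab"]    # token string -> id
merges = tok_data["model"]["merges"]  # ordered merge rules (order = priority)
# Hypothetical example: both halves must already exist in the vocab
# for the new merge rule to be reachable during encoding.
new_token, left, right = "▁당신", "▁당", "신"
if new_token not in vocab:
    vocab[new_token] = max(vocab.values()) + 1
    merges.append(f"{left} {right}")  # appended rules get the lowest priority
with open("tokenizer_merged.json", "w", encoding="utf-8") as f:
    json.dump(tok_data, f, ensure_ascii=False)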
As you can see in the blog post, KoSOLAR v0.1's tokenizer does not work as intended: the encoded result is longer than it should be, although it is still shorter than what the original tokenizer produces.
<s> 당분간 주택 가격에 큰 조정이 일어나거나 하는 계기가 발생하지 않는 한 이들의 주택 복귀는 당분간 어려워 보인다는 것이 중론이다
# KoSOLAR v0.1 tokenizer
[1, 28705, 30287, 41768, 259, 34740, 259, 35790, 28705, 29148, 28705, 31694, 28705, 37585, 28705, 29015, 28705, 29415, 32633, 32400, 259, 32029, 259, 30106, 32453, 259, 46354, 32208, 259, 30104, 29175, 28705, 29282, 28705, 29015, 32173, 259, 34740, 259, 30357, 46682, 28705, 29175, 28705, 30287, 41768, 259, 29433, 30710, 31126, 28705, 29477, 33020, 28705, 29175, 28705, 38655, 259, 30027, 39265, 28705, 29043]
# KoSOLAR v0.2 tokenizer
[1, 32119, 41768, 34375, 42984, 32386, 32052, 33335, 33725, 32400, 32254, 39212, 32512, 32208, 32440, 32026, 35964, 34375, 34822, 29175, 32119, 41768, 38294, 39093, 32264, 32212, 32039, 46611, 32034]
As demonstrated, the revised tokenizer outputs a much shorter list of token IDs, most of which are newly added tokens (>= 32000). This also means that many embeddings in embed_tokens and lm_head were not sufficiently trained in KoSOLAR v0.1, because the corresponding token IDs were not generated frequently enough by the tokenizer. Therefore, if I simply replace the tokenizer, performance will degrade significantly; I confirmed this by running an eval with the new tokenizer. I had hoped that my mistake would only affect the decoding process, but it turned out to be the opposite: it was actually the encoding process that was affected.
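As a rough way to see this effect, you can measure how often each tokenizer emits the added IDs (>= 32000) on some Korean text; the repo id and local path below are placeholders:
from transformers import AutoTokenizer
# Placeholders: point these at the v0.1 tokenizer and the fixed one.
old_tok = AutoTokenizer.from_pretrained("yanolja/KoSOLAR-10.7B-v0.1")
new_tok = AutoTokenizer.from_pretrained("./tokenizer_fixed")
def added_token_ratio(tok, texts, boundary=32000):
    # fraction of emitted ids that correspond to the newly added tokens
    ids = [i for t in texts for i in tok(t)["input_ids"]]
    return sum(i >= boundary for i in ids) / max(len(ids), 1)
sample = ["당분간 주택 가격에 큰 조정이 일어나거나 하는 계기가 발생하지 않는 한 이들의 주택 복귀는 당분간 어려워 보인다는 것이 중론이다"]
print(added_token_ratio(old_tok, sample))  # low: added embeddings rarely hit
print(added_token_ratio(new_tok, sample))  # high: added embeddings dominate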
I will upload a new version with a fix, but it will take some time. I hope to upload the new version by January 12.
Thanks,
Seungduk