Question about vocabulary pruning

by annisamansa - opened May 20

May 20

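Hi, I have a question about vocabulary pruning. After pruning the vocabulary, initializing the tokenizer with
fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
raises this exception:
Exception: data did not match any variant of untagged enum ModelWrapper at line 2957 column 3
Initialization only succeeds when we empty merges. Have you run into a similar error? How do you do your vocabulary pruning? Thanks!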

annisamansa

May 20

edited May 20

Got it!
It is not enough for both tokens of a merge to be in the vocab; the token they merge into must be in the vocab as well.
# Keep a merge only if both of its parts and the merged token are in the pruned vocab.
for meg in old_merges:
    tokens = meg.strip().split()
    new_token = "".join((tokens[0], tokens[1]))
    if all(token in continuous_vocab for token in tokens) and new_token in continuous_vocab:
        new_tokenizer_data['model']['merges'].append(meg)
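
For anyone hitting the same error, here is a minimal, self-contained sketch of the whole pruning step built around the filter above. It is not the model authors' method, only one way it could be done. It assumes a tokenizer.json-style dict with a BPE "model" section holding "vocab" and "merges"; the helper name prune_bpe and the keep_tokens argument are hypothetical. Merges are handled both as "a b" strings (as in the snippet above) and as two-element lists, on the assumption that some tokenizer.json files store them that way.

```python
def prune_bpe(tokenizer_data, keep_tokens):
    """Return a pruned copy of a tokenizer.json-style dict (hypothetical helper)."""
    model = tokenizer_data["model"]
    old_vocab = model["vocab"]

    # Keep surviving tokens in their original id order, then renumber from 0
    # so the new ids are continuous.
    kept = sorted((t for t in old_vocab if t in keep_tokens), key=old_vocab.get)
    continuous_vocab = {token: new_id for new_id, token in enumerate(kept)}

    new_merges = []
    for merge in model["merges"]:
        # A merge is either an "a b" string or an ["a", "b"] pair.
        parts = merge.split(" ") if isinstance(merge, str) else list(merge)
        if len(parts) != 2:
            continue
        left, right = parts
        # Both parts AND the merged token must survive the pruning.
        if (left in continuous_vocab and right in continuous_vocab
                and left + right in continuous_vocab):
            new_merges.append(merge)

    new_model = dict(model, vocab=continuous_vocab, merges=new_merges)
    return dict(tokenizer_data, model=new_model)
```

The result can be written out with json.dump and loaded with from_file as in the original post. The sketch leaves every other field untouched, so anything else in the file that refers to token ids (for example an added_tokens list, if present) would still need to be updated to match the new ids.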
