Question about vocabulary pruning

#4
by annisamansa - opened

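Hi, I have a question about vocabulary pruning. After pruning the vocabulary, initializing the tokenizer with
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
raises this exception:
    Exception: data did not match any variant of untagged enum ModelWrapper at line 2957 column 3
Initialization only succeeds if we empty out merges. Have you run into a similar error? How did you do your vocabulary pruning? Thanks!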

Got it!
It is not enough for both tokens of a merge to be in the vocab; the merged token they produce has to be in the vocab as well:
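    for meg in old_merges:
        # Each merge entry is assumed to be a string of the form "left right".
        tokens = meg.strip().split()
        new_token = "".join((tokens[0], tokens[1]))
        # Keep the merge only if both parts and the merged token survived pruning.
        if all(token in continuous_vocab for token in tokens) and new_token in continuous_vocab:
            new_tokenizer_data['model']['merges'].append(meg)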
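
For anyone hitting the same error, here is a minimal, self-contained sketch of that filter applied to a whole tokenizer.json-style dict. The names `prune_bpe_model` and `kept_tokens` are hypothetical, and the toy `example` data is made up for illustration. It assumes the file has `model.vocab` and `model.merges` entries, that merges are stored either as "left right" strings or as [left, right] pairs, and that the pruned vocab is re-indexed with contiguous ids (my guess at what `continuous_vocab` means above).

```python
import copy


def prune_bpe_model(tokenizer_data, kept_tokens):
    """Return a copy of a tokenizer.json-style dict restricted to kept_tokens.

    Hypothetical helper: not part of the tokenizers library.
    """
    new_data = copy.deepcopy(tokenizer_data)
    old_vocab = tokenizer_data["model"]["vocab"]

    # Assumption: reassign contiguous ids, preserving the original id order.
    kept = sorted((t for t in old_vocab if t in kept_tokens), key=old_vocab.get)
    new_vocab = {token: idx for idx, token in enumerate(kept)}

    new_merges = []
    for merge in tokenizer_data["model"]["merges"]:
        # Merges may be "left right" strings or [left, right] pairs.
        left, right = merge.split() if isinstance(merge, str) else merge
        # Both parts AND the merged token must still be in the vocab.
        if left in new_vocab and right in new_vocab and left + right in new_vocab:
            new_merges.append(merge)

    new_data["model"]["vocab"] = new_vocab
    new_data["model"]["merges"] = new_merges
    return new_data


# Toy data, made up for illustration only.
example = {
    "model": {
        "type": "BPE",
        "vocab": {"a": 0, "b": 1, "c": 2, "ab": 3, "abc": 4},
        "merges": ["a b", "ab c"],
    }
}

pruned = prune_bpe_model(example, kept_tokens={"a", "b", "c", "ab"})
print(pruned["model"]["merges"])  # ['a b']
print(pruned["model"]["vocab"])   # {'a': 0, 'b': 1, 'c': 2, 'ab': 3}
```

Here "ab c" is dropped even though "ab" and "c" are both kept, because the merged token "abc" was pruned. The result can be written back out with `json.dump` before loading it with `from_file`. Note that re-indexing the vocab changes token ids, so any model embeddings would have to be remapped to match.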
