SP tokenizer missing mode tokens
#9 · opened by keremturgutlu
Simply load the model and run:

```python
import sentencepiece as spm

sp_model = spm.SentencePieceProcessor(model_file="spiece.model")

sp_model.piece_to_id("[NLG]")
sp_model.piece_to_id("[S2S]")
sp_model.piece_to_id("[NLU]")
```

All of them map to `<unk>`.
linking this here: https://github.com/google-research/google-research/issues/1100
It turns out these are not special tokens in the vocab but plain text, i.e. they get tokenized like any other prefix prompt. A bit wasteful, I guess :)
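For anyone who wants to verify this themselves, a minimal sketch (assuming the same `spiece.model` file as above; the exact pieces you get depend on the vocab): encoding one of the prefixes splits it into ordinary subword pieces instead of a single reserved token, and `piece_to_id` on the full string falls back to the unk id.

```python
import sentencepiece as spm

sp_model = spm.SentencePieceProcessor(model_file="spiece.model")

# The mode prefix is split into ordinary subword pieces,
# not mapped to one reserved token (example output, may differ):
print(sp_model.encode("[NLG]", out_type=str))  # e.g. ['▁[', 'N', 'L', 'G', ']']

# piece_to_id on the whole string returns the <unk> id:
print(sp_model.piece_to_id("[NLG]") == sp_model.unk_id())  # True
```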