Edwin Rijgersberg committed
Commit 93d00ba
1 parent: 0837ce3

Fix mixup of `<pad>` and `<s>` tokens in vocab

This model outputs many `<s>` tokens, including in the middle of words. You can observe this by running the model locally or by using the widget on this page.

It seems to be fixed by switching the vocab ids of `<s>` and `<pad>`.
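
One plausible explanation, sketched here under stated assumptions rather than confirmed against the model: CTC decoding collapses repeated frame ids and then drops the blank token, which for these tokenizers is `<pad>`. If the acoustic model emits id 0 as its blank (an assumption), the old vocab turns every blank into a literal `<s>`. The function `greedy_ctc_decode` and the `frames` list below are hypothetical, written only for this illustration; they are not the library's decoder.

```python
from itertools import groupby

# First few entries of the vocab before and after this commit.
OLD_VOCAB = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3, "|": 4, "A": 5, "B": 6}
NEW_VOCAB = {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3, "|": 4, "A": 5, "B": 6}


def greedy_ctc_decode(frame_ids, vocab, blank_token="<pad>"):
    """Collapse repeated frame ids, drop the blank token, and join the rest."""
    id_to_token = {idx: tok for tok, idx in vocab.items()}
    # Merge runs of identical ids into a single id.
    collapsed = [idx for idx, _ in groupby(frame_ids)]
    tokens = [id_to_token[idx] for idx in collapsed]
    # "|" is the word delimiter; the blank token is removed entirely.
    return "".join(" " if t == "|" else t for t in tokens if t != blank_token)


# Hypothetical per-frame argmax ids; id 0 is ASSUMED to be the model's blank.
frames = [5, 5, 0, 6, 0, 0, 4, 6, 0, 5]

print(greedy_ctc_decode(frames, OLD_VOCAB))  # A<s>B<s> B<s>A
print(greedy_ctc_decode(frames, NEW_VOCAB))  # AB BA
```

With the old mapping the blank frames survive as `<s>` inside words; with the new mapping they are dropped as intended.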

Other GroNLP models also seem to be affected, for example https://huggingface.co/GroNLP/wav2vec2-dutch-large-ft-cgn
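
For other affected models, the same swap could be applied with a small script. This is a minimal sketch: `swap_token_ids` and `fix_vocab_file` are hypothetical helper names, and the check assumes the file shows exactly the mixup seen here (`<s>` at id 0, `<pad>` at id 1). Whether another model really has that layout should be verified per model before rewriting anything.

```python
import json


def swap_token_ids(vocab, first="<s>", second="<pad>"):
    """Return a copy of vocab with the ids of two tokens exchanged."""
    swapped = dict(vocab)
    swapped[first], swapped[second] = vocab[second], vocab[first]
    # Keep entries ordered by id so the file stays easy to read and diff.
    return dict(sorted(swapped.items(), key=lambda item: item[1]))


def fix_vocab_file(path):
    """Rewrite vocab.json in place if it has `<s>` at 0 and `<pad>` at 1.

    Returns True if the file was changed, False if it was left alone.
    """
    with open(path, encoding="utf-8") as f:
        vocab = json.load(f)
    if vocab.get("<s>") != 0 or vocab.get("<pad>") != 1:
        return False  # not the layout this commit fixes; do nothing
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps characters such as "È" unescaped.
        json.dump(swap_token_ids(vocab), f, ensure_ascii=False)
    return True
```

Calling `fix_vocab_file("vocab.json")` on a local checkout would change only the two ids, which matches the one-line diff in this commit.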

Files changed (1)
  1. vocab.json +1 -1
vocab.json CHANGED
@@ -1 +1 @@
- {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3, "|": 4, "A": 5, "B": 6, "C": 7, "D": 8, "E": 9, "F": 10, "G": 11, "H": 12, "I": 13, "J": 14, "K": 15, "L": 16, "M": 17, "N": 18, "O": 19, "P": 20, "Q": 21, "R": 22, "S": 23, "T": 24, "U": 25, "V": 26, "W": 27, "X": 28, "Y": 29, "Z": 30, "È": 31, "É": 32, "Ë": 33, "?": 34, "'": 35, "-": 36}
 
+ {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3, "|": 4, "A": 5, "B": 6, "C": 7, "D": 8, "E": 9, "F": 10, "G": 11, "H": 12, "I": 13, "J": 14, "K": 15, "L": 16, "M": 17, "N": 18, "O": 19, "P": 20, "Q": 21, "R": 22, "S": 23, "T": 24, "U": 25, "V": 26, "W": 27, "X": 28, "Y": 29, "Z": 30, "È": 31, "É": 32, "Ë": 33, "?": 34, "'": 35, "-": 36}