Why is the tokenizer.json not the same as LLaMa-3.1-8B-Instruct?
@Crystalcareai Since you trained on top of LLaMa-3.1-8B-Instruct, I found it odd that your configuration files differ; specifically, Llama-3.1-SuperNova-Lite/tokenizer.json is missing pieces that are present in meta-llama/Meta-Llama-3.1-8B-Instruct/tokenizer.json. I copied the missing parts into my local copy of Llama-3.1-SuperNova-Lite/tokenizer.json; however, to contribute that change, Hugging Face wants me to download your whole Llama-3.1-SuperNova-Lite repo, make the edit, and push a PR. That feels like overkill, so I've included the diff below instead, in case the tokenizer.json is in fact supposed to match Meta-Llama-3.1-8B-Instruct/tokenizer.json. 🤔
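In case it's useful, here's how the difference shows up in practice: with only a ByteLevel post-processor, the fast tokenizer should never prepend <|begin_of_text|> (id 128000) when encoding, whereas Meta's TemplateProcessing step adds it. A quick sanity check with the tokenizers library (just a sketch; the paths are my local clones, adjust as needed):

from tokenizers import Tokenizer

supernova = Tokenizer.from_file("Llama-3.1-SuperNova-Lite/tokenizer.json")
meta = Tokenizer.from_file("Meta-Llama-3.1-8B-Instruct/tokenizer.json")

# Plain ByteLevel post-processor: no BOS token is inserted.
print(supernova.encode("Hello world").ids)

# Sequence(ByteLevel + TemplateProcessing): ids start with 128000.
print(meta.encode("Hello world").ids)

And here is the diff: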
a="/Users/jsarnecki/opt/Workspace/arcee-ai/Llama-3.1-SuperNova-Lite/tokenizer.json"
b="/Users/jsarnecki/opt/Workspace/meta-llama/Meta-Llama-3.1-8B-Instruct/tokenizer.json"
diff "$a" "$b"
2332,2335c2332,2394
< "type": "ByteLevel",
< "add_prefix_space": true,
< "trim_offsets": false,
< "use_regex": true
---
> "type": "Sequence",
> "processors": [
> {
> "type": "ByteLevel",
> "add_prefix_space": true,
> "trim_offsets": false,
> "use_regex": true
> },
> {
> "type": "TemplateProcessing",
> "single": [
> {
> "SpecialToken": {
> "id": "<|begin_of_text|>",
> "type_id": 0
> }
> },
> {
> "Sequence": {
> "id": "A",
> "type_id": 0
> }
> }
> ],
> "pair": [
> {
> "SpecialToken": {
> "id": "<|begin_of_text|>",
> "type_id": 0
> }
> },
> {
> "Sequence": {
> "id": "A",
> "type_id": 0
> }
> },
> {
> "SpecialToken": {
> "id": "<|begin_of_text|>",
> "type_id": 1
> }
> },
> {
> "Sequence": {
> "id": "B",
> "type_id": 1
> }
> }
> ],
> "special_tokens": {
> "<|begin_of_text|>": {
> "id": "<|begin_of_text|>",
> "ids": [
> 128000
> ],
> "tokens": [
> "<|begin_of_text|>"
> ]
> }
> }
> }
> ]
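If hand-editing the JSON is the blocker, the post-processor can also be rebuilt programmatically with the tokenizers API and re-saved. A minimal sketch, using the 128000 id straight from the diff above (illustrative only; I haven't tested it against your pipeline):

from tokenizers import Tokenizer, processors

tok = Tokenizer.from_file("Llama-3.1-SuperNova-Lite/tokenizer.json")

# Recreate Meta's post-processor: ByteLevel followed by a TemplateProcessing
# step that prepends <|begin_of_text|> (128000) to single and paired inputs.
tok.post_processor = processors.Sequence([
    processors.ByteLevel(trim_offsets=False),
    processors.TemplateProcessing(
        single="<|begin_of_text|>:0 $A:0",
        pair="<|begin_of_text|>:0 $A:0 <|begin_of_text|>:1 $B:1",
        special_tokens=[("<|begin_of_text|>", 128000)],
    ),
])
tok.save("Llama-3.1-SuperNova-Lite/tokenizer.json")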
I appreciate that, Joseph. It’s likely a peculiarity with how all of these different tools (distillation, new merge technique, etc.) handle saving the tokenizer between different variations of the model. We’ll look into it.