Why is the tokenizer.json not the same as LLaMa-3.1-8B-Instruct

#6
by Joseph717171 - opened

@Crystalcareai Since you guys trained on top of LLaMa-3.1-8B-Instruct, I found it odd that your tokenizer files differ: Llama-3.1-SuperNova-Lite/tokenizer.json is missing the post_processor entries that are present in meta-llama/Meta-Llama-3.1-8B-Instruct/tokenizer.json. I copied over what was missing locally; however, to submit that change, Hugging Face wants me to download your whole Llama-3.1-SuperNova-Lite repo, make the edit, and push a PR. That feels like overkill, so I've included the diff below so you guys can update tokenizer.json yourselves, if it is indeed supposed to match Meta-Llama-3.1-8B-Instruct/tokenizer.json. 🤔

a="/Users/jsarnecki/opt/Workspace/arcee-ai/Llama-3.1-SuperNova-Lite/tokenizer.json"
b="/Users/jsarnecki/opt/Workspace/meta-llama/Meta-Llama-3.1-8B-Instruct/tokenizer.json"

diff "$a" "$b"
2332,2335c2332,2394
<     "type": "ByteLevel",
<     "add_prefix_space": true,
<     "trim_offsets": false,
<     "use_regex": true
---
>     "type": "Sequence",
>     "processors": [
>       {
>         "type": "ByteLevel",
>         "add_prefix_space": true,
>         "trim_offsets": false,
>         "use_regex": true
>       },
>       {
>         "type": "TemplateProcessing",
>         "single": [
>           {
>             "SpecialToken": {
>               "id": "<|begin_of_text|>",
>               "type_id": 0
>             }
>           },
>           {
>             "Sequence": {
>               "id": "A",
>               "type_id": 0
>             }
>           }
>         ],
>         "pair": [
>           {
>             "SpecialToken": {
>               "id": "<|begin_of_text|>",
>               "type_id": 0
>             }
>           },
>           {
>             "Sequence": {
>               "id": "A",
>               "type_id": 0
>             }
>           },
>           {
>             "SpecialToken": {
>               "id": "<|begin_of_text|>",
>               "type_id": 1
>             }
>           },
>           {
>             "Sequence": {
>               "id": "B",
>               "type_id": 1
>             }
>           }
>         ],
>         "special_tokens": {
>           "<|begin_of_text|>": {
>             "id": "<|begin_of_text|>",
>             "ids": [
>               128000
>             ],
>             "tokens": [
>               "<|begin_of_text|>"
>             ]
>           }
>         }
>       }
>     ]
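For anyone who wants to apply the same change locally without a full repo PR, here is a minimal Python sketch of what the diff above does: it wraps the existing bare ByteLevel post_processor in the Sequence + TemplateProcessing structure from Meta-Llama-3.1-8B-Instruct, which is what makes `<|begin_of_text|>` (id 128000) get prepended automatically. The function name and the file paths in the commented usage are my own illustrations, not anything from the actual repos:

```python
import json

def patch_post_processor(tokenizer: dict) -> dict:
    """Wrap a bare ByteLevel post_processor in the Sequence +
    TemplateProcessing layout from Meta-Llama-3.1-8B-Instruct's
    tokenizer.json (mirrors the diff above)."""
    byte_level = tokenizer["post_processor"]
    if byte_level.get("type") != "ByteLevel":
        # Already patched (or an unexpected layout) -- leave it alone.
        return tokenizer
    bos = "<|begin_of_text|>"
    tokenizer["post_processor"] = {
        "type": "Sequence",
        "processors": [
            byte_level,  # keep the original ByteLevel step first
            {
                "type": "TemplateProcessing",
                "single": [
                    {"SpecialToken": {"id": bos, "type_id": 0}},
                    {"Sequence": {"id": "A", "type_id": 0}},
                ],
                "pair": [
                    {"SpecialToken": {"id": bos, "type_id": 0}},
                    {"Sequence": {"id": "A", "type_id": 0}},
                    {"SpecialToken": {"id": bos, "type_id": 1}},
                    {"Sequence": {"id": "B", "type_id": 1}},
                ],
                "special_tokens": {
                    bos: {"id": bos, "ids": [128000], "tokens": [bos]}
                },
            },
        ],
    }
    return tokenizer

# Hypothetical usage (example path, adjust to your checkout):
# with open("Llama-3.1-SuperNova-Lite/tokenizer.json") as f:
#     tok = json.load(f)
# tok = patch_post_processor(tok)
# with open("Llama-3.1-SuperNova-Lite/tokenizer.json", "w") as f:
#     json.dump(tok, f, indent=2)
```

This only rewrites the JSON in place; whether the repo is *supposed* to carry the TemplateProcessing block is exactly the question of this thread.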
Joseph717171 changed discussion title from Why is the tokenizer not the same as LLaMa-3.1-8B-Instruct to Why is the tokenizer.json not the same as LLaMa-3.1-8B-Instruct
Arcee AI org

I appreciate that, Joseph. It’s likely a peculiarity in how all of these different tools (distillation, the new merge technique, etc.) handle saving the tokenizer between different variations of the model. We’ll look into it.

Crystalcareai changed discussion status to closed
