Salesforce/xgen-7b-8k-inst · fim_tokens, what is its use?

Jul 2, 2023

Hello, I hope everything goes well.

https://huggingface.co/Salesforce/xgen-7b-8k-inst/blob/main/tokenization_xgen.py

fim_tokens = [
    "<fim_prefix>",
    "<fim_middle>",
    "<fim_suffix>",
    "<fim_pad>",
    "<filename>",
    "<gh_stars>",
    "<issue_start>",
    "<issue_comment>",
    "<issue_closed>",
    "<jupyter_start>",
    "<jupyter_text>",
    "<jupyter_code>",
    "<jupyter_output>",
    "<empty_output>",
    "<commit_before>",
    "<commit_msg>",
    "<commit_after>",
    "<reponame>",
]

Could you explain how these special tokens are used? Thanks.

rooa

Salesforce org Jul 2, 2023

The following tokens appear in StarCoderData, the code dataset we used to train the model:

            "<filename>",
            "<gh_stars>",
            "<issue_start>",
            "<issue_comment>",
            "<issue_closed>",
            "<jupyter_start>",
            "<jupyter_text>",
            "<jupyter_code>",
            "<jupyter_output>",
            "<empty_output>",
            "<commit_before>",
            "<commit_msg>",
            "<commit_after>",
            "<reponame>"

Please refer to the StarCoder paper for more details. You could, for example, prepend these special tokens to a prompt to condition generation and bias the model's predictions.
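As a rough illustration of conditioning with these tokens, here is a minimal sketch that prepends repository metadata to a code prefix. The helper name `build_conditioned_prompt` is hypothetical, the repository, file name, and star bucket are made-up values, and the layout (`<reponame>`, then `<filename>`, then `<gh_stars>`, then a newline and the code) is an assumption based on how the StarCoder paper describes its training data, not something confirmed for XGen in this thread.

```python
from typing import Optional


def build_conditioned_prompt(
    code_prefix: str,
    reponame: Optional[str] = None,
    filename: Optional[str] = None,
    gh_stars: Optional[str] = None,
) -> str:
    """Prepend StarCoderData-style metadata tokens to a code prefix.

    Hypothetical helper: the token order used here is an assumption
    taken from the StarCoder paper, not a documented XGen format.
    """
    parts = []
    if reponame is not None:
        parts.append(f"<reponame>{reponame}")
    if filename is not None:
        parts.append(f"<filename>{filename}")
    if gh_stars is not None:
        parts.append(f"<gh_stars>{gh_stars}")
    header = "".join(parts)
    # Only add the separating newline when some metadata is present.
    return f"{header}\n{code_prefix}" if header else code_prefix


prompt = build_conditioned_prompt(
    "def fibonacci(n):\n",
    reponame="example-user/example-repo",  # illustrative value
    filename="utils/math.py",              # illustrative value
    gh_stars="100-1000",                   # illustrative star bucket
)
print(prompt)
```

The resulting string is ordinary text and can be tokenized and passed to the model like any other prompt. How strongly the metadata steers the output would need to be checked empirically.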

The remaining tokens, listed below, are the special tokens StarCoder uses for fill-in-the-middle (FIM) training. We did not use them, so you can ignore them:

            "<fim_prefix>",
            "<fim_middle>",
            "<fim_suffix>",
            "<fim_pad>",

rooa changed discussion status to closed Jul 2, 2023