Are there special tokens that are ignored during loss computation?
I am wondering whether some special tokens are ignored from the processed inputs when computing the loss (given the input ids as labels). My input ids look something like the attached image. When forwarding all inputs from the processor to the model, the loss is very high. Without token 71011 (the speaker token), the loss is lower but still not good. From the source code of the processor, the speaker tokens are placeholders for the image. Given that images are represented by speaker tokens, it is somewhat counterintuitive that removing the speaker tokens from the loss computation reduces the loss.
Is there something I am missing, or something that is not documented, about how the loss is computed?
Hey @Nyandwi , the code available around Fuyu is indeed inference-centered, and we do not have an official training script. There is ongoing contributor work at https://github.com/huggingface/transformers/pull/26997 for instance, and the folks from OtterHD have independently retrained a Fuyu-like architecture on high-resolution data; you can take a look at https://github.com/Luodian/Otter/blob/main/docs/OtterHD.md ! If you have a code snippet, that can be helpful :)
Hi @Molbap . Thanks for the quick response and sharing those pointers.
I am also doing inference: given an image and a text prompt, I want to compute the loss, with the labels being the input_ids.
import torch
from transformers import FuyuProcessor, FuyuForCausalLM

model_id = "adept/fuyu-8b"
processor = FuyuProcessor.from_pretrained(model_id)
model = FuyuForCausalLM.from_pretrained(model_id, device_map="cuda:0", torch_dtype=torch.bfloat16)

inputs = processor(text=prompt, images=sample_im_1, return_tensors="pt").to("cuda:0", torch.bfloat16)
with torch.inference_mode():
    # reuse the input ids as labels to get a loss out of the forward pass
    outputs = model(**inputs, labels=inputs["input_ids"])
    loss = outputs.loss
The above gives me a high loss, but when I remove the speaker tokens the loss reduces significantly. However, the model is then likely not using the images, since the speaker token is the image placeholder according to the preprocessing script.
input_ids = inputs["input_ids"]
# mask the speaker (image placeholder) tokens so they are ignored in the loss
speaker_token = processor.tokenizer.convert_tokens_to_ids('|SPEAKER|')
input_labels = input_ids.masked_fill(input_ids == speaker_token, -100)
Also related: is prepare_inputs_for_generation() a hard requirement to apply to the model inputs (from the processor)?
The model is using the images: inside the model, the embeddings of your image patches from sample_im_1 are placed at the positions indicated by the placeholder image tokens. Check out the source code here in particular: https://github.com/huggingface/transformers/blob/235e5d4991e8a0984aa78db91087b49622c7740e/src/transformers/models/fuyu/modeling_fuyu.py#L289C1-L300C18. So it is expected that the losses differ. By removing the speaker tokens, you're actually disabling the image input.
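For intuition, here is a minimal sketch of that mechanism (not the actual transformers implementation): the text embeddings are computed first, and the positions occupied by |SPEAKER| placeholder tokens are then overwritten with the projected image patch embeddings.

import torch

def merge_image_embeddings(word_embeddings, patch_embeddings, input_ids, placeholder_id):
    # word_embeddings: (batch, seq_len, hidden) embeddings of the input ids
    # patch_embeddings: (batch, num_patches, hidden) projected image patches
    inputs_embeds = word_embeddings.clone()
    for b in range(input_ids.shape[0]):
        mask = input_ids[b] == placeholder_id          # positions of |SPEAKER| tokens
        inputs_embeds[b, mask] = patch_embeddings[b][: int(mask.sum())]
    return inputs_embeds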
I'm not sure I get your second question, it might be better suited for the forums https://discuss.huggingface.co/ ? prepare_inputs_for_generation is a GenerationMixin-related method; it is not linked to the call model(**inputs). It will be if you do model.generate(**inputs), for instance.
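In other words (a minimal illustration, reusing the inputs from your snippet above):

# plain forward pass: prepare_inputs_for_generation is never called
outputs = model(**inputs, labels=inputs["input_ids"])

# generate() calls prepare_inputs_for_generation internally at each decoding step
generated_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))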
Thanks for the support, @Molbap . I will keep inspecting where the unusually high loss (with all input ids) might be coming from. For the latter, that was my guess too, since using it for further input preprocessing didn't make any difference in my case.
Edit: can the absence of training support impact the inference of model(**inputs)? It's unlikely, but just to clarify. With the generate functionality, I get good responses out; it's a bit weird that the loss does not agree.
As you said, no, I don't think it would impact the output of a forward pass. :)
@Nyandwi you will want to mask out the |SPEAKER| tokens and the tokens related to your input/instructions with the ignore index. Otherwise, while finetuning this model, it will spend effort predicting the parts of the input that are actually just image patches and instructions (not helpful).
Hi @besiktas . Thanks for sharing that, really appreciated. That might explain why removing the speaker tokens reduces the loss substantially. Which part of the input instructions did you mean, or should I generally mask all special tokens? One of my input prompts is "Is the {caption} a good description of the image?". Do you mean to mask everything except {caption}? If you have a reason why we have to mask out those tokens, from the model's perspective or beyond, I would love to hear it. Thank you!
@Nyandwi I would maybe rephrase the instruction so that it is "Is the following caption a good description of the image: {caption}", in which case you just mask out the length of the tokens for "Is the following caption a good description of the image:" (meaning don't mask based on token values, and you may also want to include the boa token before the caption).
I believe I have seen it done both ways; conceptually, masking out the instruction is trying to get the model to provide an appropriate response given the caption.
If you don't mask them out, I would bet it still works in theory and eventually trains to what you aim for. But given that CrossEntropyLoss uses mean reduction (if you provide labels to the model's forward), as the model learns to complete the instruction, the gradient w.r.t. the caption/answer you actually want the model to output will be dampened compared to masking out the instructions.
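As a rough sketch (assuming you can count the number of image + instruction tokens for your prompt layout, here called prefix_len, which is not something the processor gives you directly):

import torch

IGNORE_INDEX = -100  # label value that CrossEntropyLoss ignores

def build_labels(input_ids, speaker_token_id, prefix_len):
    # copy the input ids and mask everything we don't want to supervise on
    labels = input_ids.clone()
    # 1) never compute loss on the image placeholder (|SPEAKER|) tokens
    labels[labels == speaker_token_id] = IGNORE_INDEX
    # 2) mask the instruction prefix by position (optionally up to and
    #    including the boa token), not by token value
    labels[:, :prefix_len] = IGNORE_INDEX
    return labels

speaker_id = processor.tokenizer.convert_tokens_to_ids("|SPEAKER|")
labels = build_labels(inputs["input_ids"], speaker_id, prefix_len=0)  # set prefix_len for your prompt
outputs = model(**inputs, labels=labels)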
@besiktas , thanks a lot, that really helps, and once again I appreciate your time. With that, is the instruction (or any other prefix) still useful in this regard given that we ultimately mask it out, or is it still used (since the input ids in the forward pass still contain all tokens, speaker & instructions) but simply not relevant for the loss computation? I like the instruction rephrase, that's brilliant.