blog: https://huggingface.co/blog/paligemma
Thanks for the model. I am following the steps from the blog and have completed some of them, but when I run the training cell it gives me the following error:
I share the 'Colab' link:
https://colab.research.google.com/drive/1eSJoBGOO0_oulB5gfXqkhtIqiLngKBwy?usp=sharing
I would also like to know whether it is necessary to set:
model.hidden_activation = "gelu_pytorch_tanh"
It asks me for this in a warning message.
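In case it does need to be set explicitly, I think something like this would work at load time (just a sketch, assuming the PaliGemma config keeps the Gemma text settings under text_config):

import torch
from transformers import AutoConfig, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-pt-224"
config = AutoConfig.from_pretrained(model_id)
# assumption: the nested Gemma text config exposes hidden_activation
config.text_config.hidden_activation = "gelu_pytorch_tanh"
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, config=config, torch_dtype=torch.bfloat16
)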
Is it also possible to implement flash-attn?
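For reference, this is how I would try to enable it at load time (a sketch; it needs the flash-attn package installed and a GPU that supports it):

import torch
from transformers import PaliGemmaForConditionalGeneration

model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-pt-224",
    torch_dtype=torch.bfloat16,               # flash attention needs fp16/bf16
    attn_implementation="flash_attention_2",  # raises an error if flash-attn is not installed
)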
I also wanted to know if it is compatible with this library:
from trl import SFTTrainer
thank you so much.
Hello, SFTTrainer is just a wrapper around the Trainer, so I think it should work, although it has some features on top, like NEFTune, which I don't know if they would work. You can ignore the Gemma warnings. As for the index error, let me check; I wrote that part and ran it a ton of times, so it shouldn't have happened 😅
In the meantime, let me give you my training script, which definitely works, while I figure out which line I missed when I moved it to the blog:
from datasets import load_dataset
from transformers import AutoTokenizer, PaliGemmaForConditionalGeneration, PaliGemmaProcessor
from transformers import TrainingArguments, Trainer
from PIL import Image
import torch
import os


def collate_fn(examples):
    texts = ["answer " + example["question"] + "\n" + example["multiple_choice_answer"] for example in examples]
    images = [example["image"].convert("RGB") for example in examples]
    tokens = processor(text=texts, images=images,
                       return_tensors="pt", padding="longest",
                       tokenize_newline_separately=False)
    labels = tokens["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100  # ignore padding in the loss
    labels[labels == 256000] = -100  # 256000 is the image token id, also ignored in the loss
    tokens["labels"] = labels
    tokens = tokens.to(DTYPE).to("cuda")
    return tokens
ds = load_dataset('HuggingFaceM4/VQAv2', split="train")
ds_remove = ["question_type", "answer_type", "answers", "image_id", "question_id"]
ds = ds.remove_columns(ds_remove)
model_id = "google/paligemma-3b-pt-224"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)
processor = PaliGemmaProcessor.from_pretrained(model_id)
print("initialized processor")
DTYPE = model.dtype
for param in model.vision_tower.parameters():
    param.requires_grad = False

# todo: try again with projector unfrozen
for param in model.multi_modal_projector.parameters():
    param.requires_grad = False
ds = ds.train_test_split(test_size=0.1)
train_ds = ds["train"]
val_ds = ds["test"]
args = TrainingArguments(
    num_train_epochs=2,
    remove_unused_columns=False,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    warmup_steps=2,
    learning_rate=2e-5,
    weight_decay=1e-6,
    adam_beta2=0.999,
    logging_steps=100,
    output_dir="./output10",
    optim="adamw_hf",
    save_strategy="steps",
    save_steps=1000,
    # optim="paged_adamw_8bit",
    push_to_hub=True,
    save_total_limit=1,
    bf16=True,
    report_to=["tensorboard"],
    dataloader_pin_memory=False
)
trainer = Trainer(
    model=model,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=collate_fn,
    args=args
)
print("initialized trainer")
print("Current device:", trainer.model.device)
trainer.train()
trainer.push_to_hub()
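After training, a quick sanity check could look like this (a sketch; image stands for any PIL image and the question is a placeholder, with the "answer " prefix mirroring the training prompt):

prompt = "answer What is in the image?"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(DTYPE).to("cuda")
with torch.inference_mode():
    generated = model.generate(**inputs, max_new_tokens=20)
# drop the prompt (and image) tokens before decoding
print(processor.decode(generated[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))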
@NickyNicky
In the blog post I forgot to pass remove_unused_columns=False, hence the error 🤦♀️ Unrelated, but when using a data collator we also need to pass dataloader_pin_memory=False (relevant if you load data from CPU to GPU).
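Concretely, the two flags in question (a minimal sketch of just those arguments):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./output10",
    remove_unused_columns=False,   # otherwise Trainer drops the raw "image"/"question" columns before the collator sees them
    dataloader_pin_memory=False,   # the collator already moves tensors to the GPU
)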
Don't worry, we all make mistakes. Thank you very much for the prompt response; I'm going to try the code.
I also have another question: wasn't this model trained with a template?
How does the model know where a response begins and ends without those tokens, or which tokens were used for this model?
Your code:
def collate_fn(examples):
    texts = ["answer " + example["question"] + "\n" + example['multiple_choice_answer'] for example in examples]
    images = [example["image"].convert("RGB") for example in examples]
    tokens = processor(text=texts, images=images,
                       return_tensors="pt", padding="longest",
                       tokenize_newline_separately=False)
    labels = tokens["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100
    labels[labels == 256000] = -100
    tokens["labels"] = labels
    tokens = tokens.to(DTYPE).to("cuda")
    return tokens
I added this code but I don't know if it's right.
device = "cuda"
image_token = processor.tokenizer.convert_tokens_to_ids("<image>")
def collate_fn(examples):
# texts = ["answer " + example["question"] + "\n" + example['multiple_choice_answer'] for example in examples]
# prompt= template.replace("{text_user}",example["question"]).replace("{text_user}",example['multiple_choice_answer'])
template= """<bos><start_of_turn>system\nyou are a useful AI.<end_of_turn>\n<start_of_turn>user\n{text_user}<end_of_turn>\n<start_of_turn>model\n{text_model}<end_of_turn><eos>"""
texts = [template.replace("{text_user}",example["question"]).replace("{text_model}",example['multiple_choice_answer']) for example in examples]
images = [example["image"].convert("RGB") for example in examples]
tokens = processor(text=texts, images=images,
return_tensors="pt", padding="longest",
tokenize_newline_separately=False)
labels = tokens["input_ids"].clone()
labels[labels == processor.tokenizer.pad_token_id] = -100
labels[labels == image_token] = -100
tokens["labels"] = labels
tokens = tokens.to(torch.bfloat16).to(device)
return tokens
The new code in collate_fn:
template= """<bos><start_of_turn>system\nyou are a useful AI.<end_of_turn>\n<start_of_turn>user\n{text_user}<end_of_turn>\n<start_of_turn>model\n{text_model}<end_of_turn><eos>"""
texts = [template.replace("{text_user}",example["question"]).replace("{text_model}",example['multiple_choice_answer']) for example in examples]
Hello, regarding SFTTrainer being just a wrapper around the Trainer with some extra features like NEFTune: it can be used without problems, including neftune_noise_alpha = 10, AdaLoRA, and LoftQ.
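For example, NEFTune is exposed directly on TrainingArguments in recent transformers versions, and LoRA-style adapters (AdaLoRA and LoftQ have analogous configs in peft) can be added before building the trainer; a sketch, where the target modules are an assumption on my side:

from peft import LoraConfig, get_peft_model
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption: the usual attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

args = TrainingArguments(
    output_dir="./output10",
    neftune_noise_alpha=10,        # NEFTune noise added to the embeddings during training
    remove_unused_columns=False,
    bf16=True,
)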
@NickyNicky this is not really a conversation/multi-turn type of model; it's a single-turn model, and the newline is what conditions the model to generate the response here. That's also why the newline tokenization flag is needed during fine-tuning but not at inference. An eos token could maybe be added, but not heavy chat templates.
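So, if you want the model to learn when to stop, one option (following the eos suggestion above; a sketch against the collate_fn from the script) is to append the tokenizer's eos token to the target:

texts = [
    "answer " + example["question"] + "\n" + example["multiple_choice_answer"] + processor.tokenizer.eos_token
    for example in examples
]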
thank you so much.
close.
Hello @merve ,
I ran your code and fine-tuned PaliGemma, but the output model is behaving strangely and replying with more questions. Here is the demo space: https://huggingface.co/spaces/taesiri/sample-paligemma-finetuned.
I am getting this warning when loading the model:
The tokenizer class you load from this checkpoint is 'LlamaTokenizer'.
The class this function is called from is 'GemmaTokenizerFast'.
Are we sure that the training dataset format, tokenizer, and other configurations are set correctly? How can I debug this? Many thanks. 🤗🤗
Regarding the note above that this is a single-turn model conditioned by the newline: can you please add documentation on this, and on how the tokens are managed for training without depending on the Trainer wrapper?
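For context, my current understanding of a single training step without the Trainer wrapper is roughly this (a sketch reusing the collate_fn above; please correct me if this is wrong):

import torch

optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=2e-5)

model.train()
batch = collate_fn([train_ds[0], train_ds[1]])  # tiny hand-built batch
outputs = model(**batch)    # labels are shifted internally; positions set to -100 are ignored by the loss
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()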
Hello, we have made a few changes, including API changes around preprocessing for fine-tuning; you can refer to this notebook: https://colab.research.google.com/drive/1x_OEphRK0H97DqqxEyiMewqsTiLD_Xmi?usp=sharing
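For reference, the updated preprocessing in that notebook looks roughly like this (a sketch, assuming a transformers version where the processor accepts a suffix argument and builds the labels itself):

def collate_fn(examples):
    prompts = ["answer " + example["question"] for example in examples]
    suffixes = [example["multiple_choice_answer"] for example in examples]
    images = [example["image"].convert("RGB") for example in examples]
    tokens = processor(text=prompts, suffix=suffixes, images=images,
                       return_tensors="pt", padding="longest")
    # tokens["labels"] already masks the prompt and image tokens, so no manual -100 masking is needed
    return tokens.to(torch.bfloat16).to("cuda")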