Hello,

I'm trying to train a model using the Trainer from the Transformers library. I am using a quantized model with fp16=True, but during training I get the following error: ValueError: Attempting to unscale FP16 gradients.

Here is my code:

import transformers
from torch.nn import CrossEntropyLoss
from transformers import AutoTokenizer
from datasets import load_dataset

# Define your model and tokenizer (these should already be defined in your code)

MODEL_NAME = "vilsonrodrigues/falcon-7b-instruct-sharded"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

# Load your data

data = load_dataset('csv', data_files='/content/Sumoquote Training Database.csv')

# Define your tokenizer function

def tokenize_and_format(examples):
    # Here, I'm assuming that the 'User' and 'Prompt' fields in your CSV contain the text you want to model.
    text = [f"{x} {y}" for x, y in zip(examples['User'], examples['Prompt'])]
    tokenized = tokenizer(text, truncation=True, padding='max_length')

    # Format the data for causal language modeling
    tokenized['labels'] = tokenized['input_ids'].copy()
    tokenized['input_ids'] = [ids[:-1] for ids in tokenized['input_ids']]
    tokenized['labels'] = [ids[1:] for ids in tokenized['labels']]

    return tokenized

# Apply the tokenizer function to your data

data = data.map(tokenize_and_format, batched=True)
data.set_format(type='torch', columns=['input_ids', 'labels'])

# Define the training arguments

training_args = transformers.TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    fp16=True,
    save_total_limit=3,
    logging_steps=1,
    output_dir="experiments",
    optim="adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
)

# Define the callback

class EnsureGradsAreFP32(transformers.TrainerCallback):
    def on_backward_end(self, args, state, control, **kwargs):
        if args.fp16:
            for param in model.parameters():
                if param.grad is not None:
                    param.grad.data = param.grad.data.float()

# Create the Trainer

trainer = transformers.Trainer(
    model=model,
    train_dataset=data['train'],  # Here, I've used the Dataset
    args=training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    callbacks=[EnsureGradsAreFP32()]
)

# Disable caching

model.config.use_cache = False

# Train the model

trainer.train()

Things I've tried:

- Disabling gradient accumulation.
- Changing the optimizer to "adamw_8bit".
- Making sure all gradients are in FP32 before calling optimizer.step().
- Disabling caching.

Despite these efforts, the problem persists. Any guidance would be greatly appreciated.
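
Since the error message refers to FP16 gradients, one check that may help narrow this down is looking at which dtypes the trainable parameters actually have. As far as I understand, PyTorch's gradient scaler raises this error when it finds gradients stored in FP16, which would be the case if the trainable weights themselves are FP16. Below is a minimal sketch of such a check, assuming only PyTorch. The names `trainable_dtype_counts` and `toy_model` are hypothetical, and the tiny `nn.Linear` is just a stand-in for the real model, which is not loaded in the code above.

```python
from collections import Counter

import torch
from torch import nn


def trainable_dtype_counts(model: nn.Module) -> Counter:
    """Count trainable parameter elements (requires_grad=True) per dtype."""
    counts = Counter()
    for param in model.parameters():
        if param.requires_grad:
            counts[param.dtype] += param.numel()
    return counts


# Hypothetical stand-in for the real model: a tiny layer cast to FP16.
toy_model = nn.Linear(4, 2).half()

# Any torch.float16 entry here means some trainable weights are FP16.
print(trainable_dtype_counts(toy_model))
```

Calling `trainable_dtype_counts(model)` on the actual model before `trainer.train()` should show whether any trainable parameters are in `torch.float16`.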

Unscale FP16 Gradients Help