Training the model from scratch is not easy, hence using a text corpus dataset to begin:
This must be nearly overfit! (So as many epochs as possible is the key.)
Then it needs to be trained for Q/A!
Then it may become reasonable!
The model should be trained on Q/A for a massive number of turns, perhaps. Even repetitive data is fine, so a massive corpus and one epoch and the model will begin to develop!
Then you can begin Alpaca !
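As a rough sketch of what stage 1 could look like (the corpus and tokenizer below are only placeholders, not recommendations, and it assumes the same SFTTrainer flow shown under MY TRAINING SETTINGS):

from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder tokenizer just to get an EOS token -- use your own base model's.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Stage 1: a small, focused raw-text corpus for near-overfit pretraining.
# "wikitext" is only a stand-in -- swap in whatever corpus you actually have.
corpus = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
corpus = corpus.filter(lambda x: len(x["text"].strip()) > 0)

# SFTTrainer reads a single "text" field; append EOS so documents end cleanly.
corpus = corpus.map(lambda x: {"text": x["text"] + tokenizer.eos_token})
# For this stage you would set num_train_epochs high (and drop max_steps) so it nearly overfits.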
So, for us self-made developers poor in resources!
We can do this on a small text corpus (not Shakespeare) with mass epochs,
and a Dolly dataset, or UltraChat! (Small, selective groups of 1-5k samples for many epochs; usually 5 makes it fit the data.) It is also better to use large batches, so the trick is the gradient_accumulation_steps setting: raising that number splits the step you choose across several accumulation passes, as explained after the settings below. A sketch of the data preparation follows.
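Here is a minimal sketch of how a small Dolly slice could be turned into the single "text" field the trainer below expects; the Alpaca-style prompt template is just an illustrative assumption, not anything official:

from datasets import load_dataset

# A small, selective slice (1-5k samples) of Dolly for the Q/A stage.
dolly = load_dataset("databricks/databricks-dolly-15k", split="train").select(range(2000))

def to_text(example):
    # Alpaca-style prompt -- purely illustrative, adjust to your own template.
    prompt = f"### Instruction:\n{example['instruction']}\n\n"
    if example["context"]:
        prompt += f"### Context:\n{example['context']}\n\n"
    prompt += f"### Response:\n{example['response']}"
    return {"text": prompt}

dataset = dolly.map(to_text, remove_columns=dolly.column_names)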
MY TRAINING SETTINGS
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
trainer = SFTTrainer(
model = model,
tokenizer = tokenizer,
train_dataset = dataset,
dataset_text_field = "text",
max_seq_length = max_seq_length,
dataset_num_proc = 2,
packing = True, # Can make training 5x faster for short sequences.
args = TrainingArguments(
per_device_train_batch_size = 10,
gradient_accumulation_steps = 10,
warmup_steps = 10,
max_steps = 5, # When set > 0, this overrides num_train_epochs.
num_train_epochs = 1,
learning_rate = 2e-4,
fp16 = not torch.cuda.is_bf16_supported(),
bf16 = torch.cuda.is_bf16_supported(),
logging_steps = 1,
optim = "adamw_8bit",
weight_decay = 0.01,
lr_scheduler_type = "linear",
seed = 4747,
output_dir = "outputs",
),
)
trainer_stats = trainer.train()
So here gradient_accumulation_steps = 10 is the number of mini-batches whose gradients are accumulated before a single optimizer step (not the number of devices)!
So essentially one optimizer step sees 100 samples: 10 batches of 10. On Google Colab the accumulation passes run one after another in the same memory space, so only 10 samples sit in memory at a time; once all 10 passes complete, they count as a single step. So you can push a much larger effective batch through a single step! If you play with these numbers you can find a balance for your max sequence length.
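To make the arithmetic explicit, here is a tiny sketch of the effective batch size implied by the settings above (assuming a single GPU, which is what free Colab gives you):

# Effective batch size per optimizer step.
per_device_train_batch_size = 10
gradient_accumulation_steps = 10
num_devices = 1  # a single Colab GPU

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
print(effective_batch_size)  # 100 samples contribute to every optimizer step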
ANOTHER Tip
learning_rate = 2e-5,
embedding_learning_rate = 5e-6,
You can also train the embeddings!! Important for new models, as well as for training new tasks! (I'm not sure if embedding_learning_rate is Unsloth-only.)
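As far as I can tell, embedding_learning_rate comes from Unsloth's UnslothTrainingArguments (used with UnslothTrainer for continued pretraining), so here is a minimal sketch under that assumption; the model name, LoRA rank, and hyperparameter values are placeholders, and the dataset is the "text"-field one from earlier. Note that "embed_tokens" and "lm_head" have to be among the trainable modules, otherwise the embedding learning rate has nothing to update.

from unsloth import FastLanguageModel, UnslothTrainer, UnslothTrainingArguments

# Placeholder base model -- swap in your own.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit",
    max_seq_length = 2048,
    load_in_4bit = True,
)

# Make the embeddings trainable alongside the usual LoRA targets.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "embed_tokens", "lm_head"],
)

trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,        # the "text"-field dataset from earlier
    dataset_text_field = "text",
    max_seq_length = 2048,
    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,
        num_train_epochs = 1,
        learning_rate = 2e-5,            # main / LoRA weights
        embedding_learning_rate = 5e-6,  # slower rate for embed_tokens / lm_head
        optim = "adamw_8bit",
        output_dir = "outputs",
    ),
)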
Oh yes, and packing!
Don't get me started!
If you are chunking your dataset, then you already know each field fits the correct size, but that means no packing!
If you are not chunking, then packing can be very useful for datasets with variable field sizes. It is effectively like compressing the dataset, reducing the total samples to a smaller dataset, so fewer steps per epoch!
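For comparison, here is a minimal sketch of the chunking route (the classic "group texts" pattern; the tokenizer, corpus, and block size are placeholders), which pre-cuts everything to a fixed length so packing is no longer needed:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer
block_size = 1024  # should match (or stay under) your max_seq_length

raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def tokenize(batch):
    return tokenizer(batch["text"])

def group_texts(batch):
    # Concatenate everything, then slice into fixed-size blocks.
    ids = sum(batch["input_ids"], [])
    total = (len(ids) // block_size) * block_size
    chunks = [ids[i : i + block_size] for i in range(0, total, block_size)]
    # Decode back to plain text so it still fits dataset_text_field = "text";
    # re-encoding can shift a token or two at block boundaries, this is only a sketch.
    return {"text": tokenizer.batch_decode(chunks)}

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)
chunked = tokenized.map(group_texts, batched=True, remove_columns=tokenized.column_names)
# Every sample is now roughly block_size tokens, so packing = True is unnecessary.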
ME:
Bro, I'm interested in seeing the training settings you use! I'd like to compare with my own settings and see if I'm on target!
I did train a model from scratch, but I think I should have been more dedicated to the task, as I got bored with no resources, bro!
So I rushed step 1, the most important step, and my training data for step 1 was too variable! It needs a single focus, similar to a corpus of Shakespeare. It was not great, but I did merge the model with other versions of itself and it got better!
Your previous article about LLMs was pretty well read, and I think there is demand for it.