Trained models are in models/.
Checkpoint names follow the pattern model_dimension-n_layers (e.g. 768-8; that one is not fully trained, but its loss is pretty flat).
Inside models/old/ are models trained on the non-cleaned dataset (with a tokenizer trained on that dataset). I think all of them are fully trained, but some are missing from my wandb.
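A small helper can recover the hyperparameters from a checkpoint directory name under the naming scheme above. This is a sketch: the helper name and the assumption that the separator is always a single hyphen (as in "768-8") are mine, not part of the repo.

```python
# Hypothetical helper: parse a checkpoint directory name like "768-8"
# into (model_dimension, n_layers), per the naming scheme in these notes.
def parse_model_name(name: str) -> tuple[int, int]:
    d_model_str, n_layers_str = name.split("-")
    return int(d_model_str), int(n_layers_str)

print(parse_model_name("768-8"))  # -> (768, 8)
```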
tok4096.model was trained on the cleaned dataset; tok4096_old.model on the non-cleaned one.
train_snakes.py is the training script (you need to change outdir, d_model, and n_layer). It initializes the Mamba model using the MambaLMHeadModel class.
model.py is where the MambaLMHeadModel class is defined.
Context length is 256.
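Since the context length is 256, any prompt fed to the model should be clipped to its most recent 256 tokens. A minimal sketch, assuming tokens are held as a plain list of ids (the function name and representation are assumptions for illustration):

```python
# Context window from the notes above.
CONTEXT_LEN = 256

def clip_to_context(token_ids: list[int], context_len: int = CONTEXT_LEN) -> list[int]:
    # Keep only the last `context_len` tokens; shorter inputs pass through.
    return token_ids[-context_len:]

print(len(clip_to_context(list(range(300)))))  # -> 256
```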