|
---
base_model:
- elinas/Llama-3-15B-Instruct-zeroed
library_name: transformers
tags:
- mergekit
- merge
- finetune
datasets:
- Chat-Error/Pure-dove-sharegpt
license: llama3
---
|
# Llama-3-15B-Instruct-zeroed-ft-v2 |
|
|
|
This is a QLoRA **finetune** of a merge of pre-trained language models created using [mergekit](https://github.com/cg123/mergekit). |
|
|
|
The model is based on a "zeroed" passthrough merge of [Llama-3-15B-Instruct-zeroed](https://huggingface.co/elinas/Llama-3-15B-Instruct-zeroed).
|
|
|
This was primarily an experiment to see how a passthrough merge responds to further finetuning of all LoRA modules.
|
|
|
The model was finetuned at a context length of **8192** tokens, and the context can likely be extended to 32k using RoPE scaling.
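
For inference past the finetuned 8192-token window, here is a minimal sketch of RoPE scaling with `transformers`. It is an illustration, not a shipped configuration: the repo id and the 4x dynamic scaling factor (8192 -> 32768) are assumptions.

```python
# Sketch only: dynamic RoPE scaling to stretch the context toward 32k.
# The repo id and the 4x factor are assumptions, not settings published with the model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "elinas/Llama-3-15B-Instruct-zeroed-ft-v2"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    rope_scaling={"type": "dynamic", "factor": 4.0},  # 8192 * 4 = 32768
)

prompt = "Summarize the conversation above in three sentences."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Dynamic NTK scaling tends to degrade less at shorter contexts than linear scaling, but quality beyond the finetuned 8192 tokens is not guaranteed.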
|
|
|
**v3 of the model will be trained on significantly more data, primarily human-focused, and is aimed at excelling at writing while maintaining logic, coherency, and continuity.**
|
|
|
**[GGUF Quants provided by @gelukuMLG](https://huggingface.co/gelukuMLG/Llama-3-15B-Instruct-ft-v2-GGUF)** |
|
|
|
## Datasets |
|
|
|
* [Chat-Error/Pure-dove-sharegpt](https://huggingface.co/datasets/Chat-Error/Pure-dove-sharegpt) |
|
|
|
A small, high-quality, curated dataset was used as a proof of concept to validate that the model could be stabilized after the original passthrough merge.
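
For reference, the dataset can be pulled straight from the Hub with the `datasets` library; the split name below is assumed to be the default `train` split.

```python
# Quick look at the ShareGPT-format dataset used for this finetune.
from datasets import load_dataset

ds = load_dataset("Chat-Error/Pure-dove-sharegpt", split="train")  # split name assumed
print(ds)     # number of rows and column names
print(ds[0])  # ShareGPT-style rows typically hold a multi-turn conversation list
```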
|
|
|
## Finetuning details |
|
This is a QLoRA model, and all of the LoRA modules were targeted this time to ensure sufficient training before moving on to larger datasets.

The first version of this model only targeted **o_proj** and **up_proj**.
|
```yaml
lora_target_modules:
- gate_proj
- down_proj
- up_proj
- q_proj
- v_proj
- k_proj
- o_proj
lora_modules_to_save:
- embed_tokens
- lm_head
```
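
For readers who prefer PEFT directly over an axolotl config, a rough QLoRA sketch with the same module targeting is shown below; the rank, alpha, and dropout values are placeholders, since the card only documents which modules were targeted and saved.

```python
# QLoRA sketch mirroring the lora_target_modules / lora_modules_to_save above.
# r, lora_alpha, and lora_dropout are placeholders, not the values used for this model.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_id = "elinas/Llama-3-15B-Instruct-zeroed"

# 4-bit NF4 quantization of the frozen base model, as is typical for QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    base_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=32,               # placeholder
    lora_alpha=64,      # placeholder
    lora_dropout=0.05,  # placeholder
    target_modules=[
        "gate_proj", "down_proj", "up_proj",
        "q_proj", "v_proj", "k_proj", "o_proj",
    ],
    modules_to_save=["embed_tokens", "lm_head"],  # trained in full, not adapted
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

With `modules_to_save`, the embedding and output head are trained as full copies rather than through low-rank adapters.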
|
|
|
The model is coherent even when training the "zeroed" layers alongside the additional layers, which was the recommendation of [Charles Goddard](https://huggingface.co/chargoddard) (mergekit developer) - thank you for sharing the merge method, and thanks to Toasty Pigeon for bringing it to my attention!
|
|
|
The following hyperparameters were used during training:

- learning_rate: 1e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 3
- total_train_batch_size: 3
- total_eval_batch_size: 3
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 25
- num_epochs: 1
|
|
|
The `paged_adamw_8bit` optimizer and DeepSpeed ZeRO-3 were used with an LR of `1e-5` and a cosine scheduler for 1 epoch on 3x RTX 3090s, taking about 4 hours in total.
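
As a rough illustration, the same schedule expressed as `transformers.TrainingArguments` might look like the sketch below; the output directory, DeepSpeed config path, and logging settings are placeholders (the actual run used axolotl with Unsloth).

```python
# Sketch of the schedule above in TrainingArguments form; paths are placeholders.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="llama-3-15b-zeroed-ft-v2",  # placeholder
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_steps=25,
    num_train_epochs=1,
    optim="paged_adamw_8bit",
    bf16=True,
    seed=42,
    deepspeed="deepspeed/zero3_bf16.json",  # placeholder path to a ZeRO-3 config
    logging_steps=10,                       # placeholder
    report_to="wandb",
)
```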
|
|
|
**Unsloth** was used for speed and memory savings. |
|
|
|
Sample packing and padding were disabled to significantly reduce VRAM consumption, at the cost of speed.
|
|
|
W&B Run Summary |
|
```
wandb: eval/loss 0.90895
wandb: eval/runtime 463.4688
wandb: eval/samples_per_second 0.833
wandb: eval/steps_per_second 0.278
wandb: total_flos 8270790524928.0
wandb: train/epoch 1.0
wandb: train/global_step 1157
wandb: train/grad_norm 7.3847
wandb: train/learning_rate 0.0
wandb: train/loss 0.8702
wandb: train_loss 0.87814
wandb: train_runtime 16425.2713
wandb: train_samples_per_second 0.211
wandb: train_steps_per_second 0.07
```
|
|
|
### Framework versions |
|
|
|
- PEFT 0.10.0
- Transformers 4.40.2
- Pytorch 2.3.0+cu121
- Datasets 2.19.1
- Tokenizers 0.19.1
|
|
|
## Model Evaluation |
|
|
|
TBD |
|
|
|
If you have any questions or comments on the model, feel free to open a discussion in the community tab. |
|
|
|
[<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl) |