Magistrate 3.2 3B
Continued pretraining of meta-llama/Llama-3.2-3B on roughly 250M tokens of legal data, none of it synthetic.
The model achieves the following results on the evaluation set:
- Loss: 0.6802
Instruct version is available here
See axolotl config
axolotl version: 0.4.1
base_model: meta-llama/Llama-3.2-3B
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
load_in_8bit: false
load_in_4bit: false
strict: false
datasets:
  - path: json
    data_files: "data/amendments_with_content_converted.json"
    type: completion
  - path: json
    data_files: "data/federal_rules_converted.json"
    type: completion
  - path: json
    data_files: "data/cornell_legal_encyclopedias_converted.json"
    type: completion
  - path: json
    data_files: "data/pocket_guide_for_judges_converted.json"
    type: completion
  - path: json
    data_files: "data/us_federal_code.json"
    type: completion
  - path: json
    data_files: "data/us_supreme_court_summaries_converted.json"
    type: completion
  - path: json
    data_files: "data/us_supreme_court_converted.json"
    type: completion
  - path: json
    data_files: "data/ucfr.json"
    type: completion
  - path: json
    data_files: "data/map-code-filtered.json"
    type: completion
dataset_prepared_path:
val_set_size: 0.05
output_dir: ./outputs/lora-out
sequence_len: 8192
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true
# adapter: lora
# lora_model_dir:
# lora_r: 128
# lora_alpha: 32
# lora_dropout: 0.05
# lora_target_linear: true
# lora_fan_in_fan_out:
# lora_modules_to_save:
# - embed_tokens
# - lm_head
unfrozen_parameters:
- ^lm_head.weight$
- ^model.embed_tokens.weight$
# mlp.down_proj layers
- model.layers.0.mlp.down_proj
- model.layers.1.mlp.down_proj
- model.layers.17.mlp.down_proj
- model.layers.19.mlp.down_proj
- model.layers.18.mlp.down_proj
- model.layers.5.mlp.down_proj
- model.layers.20.mlp.down_proj
- model.layers.2.mlp.down_proj
- model.layers.4.mlp.down_proj
- model.layers.6.mlp.down_proj
- model.layers.3.mlp.down_proj
- model.layers.16.mlp.down_proj
- model.layers.15.mlp.down_proj
- model.layers.13.mlp.down_proj
# mlp.gate_proj layers
- model.layers.0.mlp.gate_proj
- model.layers.1.mlp.gate_proj
- model.layers.2.mlp.gate_proj
- model.layers.3.mlp.gate_proj
- model.layers.22.mlp.gate_proj
- model.layers.21.mlp.gate_proj
- model.layers.20.mlp.gate_proj
- model.layers.23.mlp.gate_proj
- model.layers.19.mlp.gate_proj
- model.layers.4.mlp.gate_proj
- model.layers.18.mlp.gate_proj
- model.layers.17.mlp.gate_proj
- model.layers.5.mlp.gate_proj
- model.layers.24.mlp.gate_proj
# mlp.up_proj layers
- model.layers.4.mlp.up_proj
- model.layers.3.mlp.up_proj
- model.layers.5.mlp.up_proj
- model.layers.6.mlp.up_proj
- model.layers.7.mlp.up_proj
- model.layers.2.mlp.up_proj
- model.layers.8.mlp.up_proj
- model.layers.14.mlp.up_proj
- model.layers.13.mlp.up_proj
- model.layers.11.mlp.up_proj
- model.layers.9.mlp.up_proj
- model.layers.1.mlp.up_proj
- model.layers.15.mlp.up_proj
- model.layers.12.mlp.up_proj
# self_attn.k_proj layers
- model.layers.25.self_attn.k_proj
- model.layers.22.self_attn.k_proj
- model.layers.19.self_attn.k_proj
- model.layers.20.self_attn.k_proj
- model.layers.17.self_attn.k_proj
- model.layers.24.self_attn.k_proj
- model.layers.23.self_attn.k_proj
- model.layers.18.self_attn.k_proj
- model.layers.21.self_attn.k_proj
- model.layers.27.self_attn.k_proj
- model.layers.15.self_attn.k_proj
- model.layers.10.self_attn.k_proj
- model.layers.6.self_attn.k_proj
- model.layers.5.self_attn.k_proj
# self_attn.o_proj layers
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
optimizer: paged_adamw_32bit
# Gradient clipping max norm
max_grad_norm: 1.0
noisy_embedding_alpha: 0 # no noisy embedding to ensure maximal memorization
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
s2_attention:
warmup_steps: 690
evals_per_epoch: 2
eval_table_size:
eval_max_new_tokens: 128
saves_per_epoch: 1
debug:
deepspeed: deepspeed_configs/zero3.json
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  pad_token: <|end_of_text|>
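The unfrozen_parameters list in the config restricts training to parameters whose names match one of the listed patterns. The sketch below illustrates that selection in plain Python. It assumes each pattern is applied with `re.match` against the parameter name; that matching rule and the helper name `is_trainable` are assumptions for illustration, not axolotl's actual code. Only a handful of the patterns are reproduced, and the parameter names are hypothetical examples in the Llama naming scheme.

```python
import re

# A few patterns copied from the config; the full list is much longer.
UNFROZEN_PATTERNS = [
    r"^lm_head.weight$",
    r"^model.embed_tokens.weight$",
    r"model.layers.0.mlp.down_proj",
    r"model.layers.17.mlp.down_proj",
    r"model.layers.25.self_attn.k_proj",
]


def is_trainable(param_name, patterns=UNFROZEN_PATTERNS):
    """Return True if any pattern matches from the start of the name (assumed rule)."""
    return any(re.match(p, param_name) for p in patterns)


# Hypothetical parameter names, used only to exercise the matcher.
names = [
    "lm_head.weight",
    "model.layers.0.mlp.down_proj.weight",
    "model.layers.7.mlp.down_proj.weight",
    "model.layers.25.self_attn.k_proj.weight",
    "model.layers.25.self_attn.q_proj.weight",
]

trainable = [n for n in names if is_trainable(n)]
frozen = [n for n in names if not is_trainable(n)]
```

On a loaded model, the same predicate could be applied by iterating over `model.named_parameters()` and setting `requires_grad` from `is_trainable(name)`.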
Model description
This is a base model trained on US Supreme Court proceedings, US federal code and regulations.
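Because this is a base model, it is used for plain text continuation rather than chat. A minimal sketch of loading it with the Transformers library follows. The repo id is the one shown in this card's model tree; the helper name `complete`, the bfloat16 dtype, and the example prompt in the note below are assumptions for illustration.

```python
MODEL_ID = "macadeliccc/magistrate-3.2-3b-base"  # repo id as listed on this card


def complete(prompt: str, max_new_tokens: int = 128) -> str:
    """Continue `prompt` with the base model (downloads weights on first call)."""
    # Imported lazily so this module loads without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # bfloat16 is an assumption; choose a dtype your hardware supports.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # The decoded string includes the prompt followed by the continuation.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `complete("The Fourth Amendment protects")` would return the prompt followed by the model's continuation.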
Intended uses & limitations
This model is intended for research purposes. You are liable for all model outputs.
Training and evaluation data
The training data consists of US Supreme Court decisions, federal regulations, laws, and treaties.
Additional resources from institutions such as CLL were included to fill gaps in knowledge of industry jargon.
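Every dataset in the config is loaded with `type: completion`, which in axolotl reads raw text from a `text` field by default. The sketch below shows one way such a file could be produced and read back. The one-JSON-array-per-file layout, the helper names, and the placeholder documents are assumptions; the actual converted files are not described in this card.

```python
import json
import tempfile
from pathlib import Path


def write_completion_file(texts, path):
    """Write raw documents as a JSON array of {"text": ...} records."""
    records = [{"text": t} for t in texts]
    Path(path).write_text(json.dumps(records, ensure_ascii=False), encoding="utf-8")
    return records


def read_completion_file(path):
    """Read the records back and return only the raw text of each one."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    return [r["text"] for r in records]


# Placeholder documents; real entries would hold full legal texts.
docs = ["placeholder document one", "placeholder document two"]

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_converted.json"  # hypothetical file name
    write_completion_file(docs, demo_path)
    loaded = read_completion_file(demo_path)
```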
Training procedure
Fine-tuned with Spectrum, targeting the top 35%; the selected modules are listed under unfrozen_parameters in the config. Thanks to the Cognitive Computations team for their work on Spectrum.
Methodology based on Cohere's paper "To Code, or Not To Code? Exploring Impact of Code in Pre-training".
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0002
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- distributed_type: multi-GPU
- num_devices: 2
- gradient_accumulation_steps: 4
- total_train_batch_size: 16
- total_eval_batch_size: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 690
- num_epochs: 3
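The hyperparameters above can be sanity-checked with a little arithmetic. The sketch below recomputes the total train batch size, an upper bound on tokens per optimizer step (using sequence_len from the config), and a cosine learning-rate curve with linear warmup. The schedule formula is the common warmup-plus-cosine form and is an assumption about what the trainer used, as is taking the last logged step (6924) as the total step count.

```python
import math

MICRO_BATCH_SIZE = 2      # train_batch_size per device
GRAD_ACCUM_STEPS = 4      # gradient_accumulation_steps
NUM_DEVICES = 2           # num_devices
SEQUENCE_LEN = 8192       # sequence_len from the config
LEARNING_RATE = 2e-4      # learning_rate
WARMUP_STEPS = 690        # lr_scheduler_warmup_steps
TOTAL_STEPS = 6924        # assumption: last step logged in the results table


def effective_batch_size(micro, accum, devices):
    """Sequences consumed per optimizer step across all devices."""
    return micro * accum * devices


def max_tokens_per_step(batch, seq_len):
    """Upper bound on tokens per optimizer step (every sequence fully packed)."""
    return batch * seq_len


def lr_at(step, base_lr=LEARNING_RATE, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Linear warmup to base_lr, then cosine decay to zero (assumed schedule)."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


batch = effective_batch_size(MICRO_BATCH_SIZE, GRAD_ACCUM_STEPS, NUM_DEVICES)
tokens = max_tokens_per_step(batch, SEQUENCE_LEN)
```

Here `batch` evaluates to 16, matching total_train_batch_size, and `tokens` to 131072.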
Training results
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 1.3589 | 0.0004 | 1 | 1.5640 |
| 0.9936 | 0.4984 | 1154 | 0.9440 |
| 0.8384 | 0.9968 | 2308 | 0.8392 |
| 0.8226 | 1.4963 | 3462 | 0.7802 |
| 0.6568 | 1.9949 | 4616 | 0.7059 |
| 0.5163 | 2.4923 | 5770 | 0.6886 |
| 0.4920 | 2.9922 | 6924 | 0.6802 |
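The validation losses in the table can be converted to perplexity, which some readers find easier to compare. The sketch below assumes the reported loss is the mean per-token cross-entropy in nats, so that perplexity is `exp(loss)`; the card does not state this explicitly.

```python
import math

# (step, validation loss) pairs copied from the results table.
VALIDATION_LOSSES = [
    (1, 1.5640),
    (1154, 0.9440),
    (2308, 0.8392),
    (3462, 0.7802),
    (4616, 0.7059),
    (5770, 0.6886),
    (6924, 0.6802),
]


def perplexity(loss):
    """Perplexity for a mean cross-entropy loss given in nats (assumed unit)."""
    return math.exp(loss)


perplexities = [(step, perplexity(loss)) for step, loss in VALIDATION_LOSSES]
final_perplexity = perplexities[-1][1]  # roughly 1.97 for a loss of 0.6802
```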
Framework versions
- Transformers 4.45.0
- Pytorch 2.3.1+cu121
- Datasets 2.21.0
- Tokenizers 0.20.0
Model tree for macadeliccc/magistrate-3.2-3b-base
Base model
meta-llama/Llama-3.2-3B