
# llama3.1-8b-spaetzle-v51

This is only a quick experiment in merging Llama 3 and Llama 3.1 models, despite a number of differences in their tokenizer setups, among other things. It was also motivated by ongoing problems with Llama 3.1 (BOS handling, looping, etc.), especially in llama.cpp, which at the time of writing still lacked full RoPE scaling support. Performance is, of course, not yet satisfactory, which might have a number of causes.

As a further test, the GGUF quantizations were produced with an old llama.cpp binary (b2750).
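For reference, a conversion along these lines might look as follows. This is a minimal sketch, not the exact commands used: the llama.cpp checkout path, model directory, output file names, and quantization type are assumptions, and script and binary names differ between llama.cpp versions (b2750-era builds still shipped `quantize` rather than `llama-quantize`).

```python
import subprocess

# Assumed local paths, for illustration only.
LLAMA_CPP = "llama.cpp"                  # checkout built at tag b2750
MODEL_DIR = "llama3.1-8b-spaetzle-v51"   # merged HF model directory

# Convert the HF checkpoint to a 16-bit GGUF file.
subprocess.run(
    ["python", f"{LLAMA_CPP}/convert-hf-to-gguf.py", MODEL_DIR,
     "--outtype", "f16", "--outfile", "spaetzle-v51-f16.gguf"],
    check=True,
)

# Quantize; Q4_K_M is chosen here only as an example.
subprocess.run(
    [f"{LLAMA_CPP}/quantize", "spaetzle-v51-f16.gguf",
     "spaetzle-v51-Q4_K_M.gguf", "Q4_K_M"],
    check=True,
)
```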

## Summary Table

| Model | AGIEval | TruthfulQA | Bigbench |
|---|---|---|---|
| llama3.1-8b-spaetzle-v51 | 42.23 | 57.29 | 44.30 |
| llama3-8b-spaetzle-v39 | 43.43 | 60.00 | 45.89 |

## AGIEval Results

| Task | llama3.1-8b-spaetzle-v51 | llama3-8b-spaetzle-v39 |
|---|---|---|
| agieval_aqua_rat | 27.95 | 24.41 |
| agieval_logiqa_en | 38.10 | 37.94 |
| agieval_lsat_ar | 24.78 | 22.17 |
| agieval_lsat_lr | 42.94 | 45.29 |
| agieval_lsat_rc | 59.11 | 62.08 |
| agieval_sat_en | 68.45 | 71.36 |
| agieval_sat_en_without_passage | 38.35 | 44.17 |
| agieval_sat_math | 38.18 | 40.00 |
| Average | 42.23 | 43.43 |

## TruthfulQA Results

| Task | llama3.1-8b-spaetzle-v51 | llama3-8b-spaetzle-v39 |
|---|---|---|
| mc1 | 38.07 | 43.82 |
| mc2 | 57.29 | 60.00 |
| Average | 57.29 | 60.00 |

## Bigbench Results

| Task | llama3.1-8b-spaetzle-v51 | llama3-8b-spaetzle-v39 |
|---|---|---|
| bigbench_causal_judgement | 56.32 | 59.47 |
| bigbench_date_understanding | 69.65 | 70.73 |
| bigbench_disambiguation_qa | 31.40 | 34.88 |
| bigbench_geometric_shapes | 29.81 | 24.23 |
| bigbench_logical_deduction_five_objects | 30.20 | 36.20 |
| bigbench_logical_deduction_seven_objects | 23.00 | 24.00 |
| bigbench_logical_deduction_three_objects | 55.67 | 65.00 |
| bigbench_movie_recommendation | 33.00 | 36.20 |
| bigbench_navigate | 55.10 | 51.70 |
| bigbench_reasoning_about_colored_objects | 66.55 | 68.60 |
| bigbench_ruin_names | 52.23 | 51.12 |
| bigbench_salient_translation_error_detection | 25.55 | 28.96 |
| bigbench_snarks | 61.88 | 62.43 |
| bigbench_sports_understanding | 51.42 | 53.96 |
| bigbench_temporal_sequences | 59.30 | 53.60 |
| bigbench_tracking_shuffled_objects_five_objects | 23.28 | 22.32 |
| bigbench_tracking_shuffled_objects_seven_objects | 17.31 | 17.66 |
| bigbench_tracking_shuffled_objects_three_objects | 55.67 | 65.00 |
| Average | 44.30 | 45.89 |

(The GPT4All benchmark run broke, so those results are missing.)
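The averages in these tables are plain unweighted means over the per-task scores; for example, the Bigbench average for v51 can be reproduced as:

```python
# Per-task Bigbench scores for llama3.1-8b-spaetzle-v51, in table order.
v51_scores = [56.32, 69.65, 31.40, 29.81, 30.20, 23.00, 55.67, 33.00, 55.10,
              66.55, 52.23, 25.55, 61.88, 51.42, 59.30, 23.28, 17.31, 55.67]
print(f"{sum(v51_scores) / len(v51_scores):.2f}")  # 44.30
```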

## 🧩 Configuration

```yaml
models:
  - model: cstr/llama3-8b-spaetzle-v34
    # no parameters necessary for base model
  - model: sparsh35/Meta-Llama-3.1-8B-Instruct
    parameters:
      density: 0.65
      weight: 0.5
merge_method: dare_ties
base_model: cstr/llama3-8b-spaetzle-v34
parameters:
  int8_mask: true
dtype: bfloat16
random_seed: 0
tokenizer_source: base
```
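To reproduce the merge, the YAML above can be passed to mergekit. A minimal sketch, assuming mergekit is installed, the config is saved as `config.yaml`, and the output directory name is arbitrary:

```python
import subprocess

# Run the DARE-TIES merge defined in config.yaml (requires `pip install mergekit`).
# Drop --cuda to merge on CPU.
subprocess.run(
    ["mergekit-yaml", "config.yaml", "merged-spaetzle-v51", "--cuda"],
    check=True,
)
```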

## 💻 Usage

```python
!pip install -qU transformers accelerate

from transformers import AutoTokenizer
import transformers
import torch

model = "cstr/llama3.1-8b-spaetzle-v51"
messages = [{"role": "user", "content": "What is a large language model?"}]

# Build the prompt with the model's chat template.
tokenizer = AutoTokenizer.from_pretrained(model)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
```
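Given the BOS issues mentioned above, it can be worth verifying that the chat template does not prepend a second BOS token. A minimal sketch:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cstr/llama3.1-8b-spaetzle-v51")
ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hi"}], add_generation_prompt=True
)
# The prompt should start with exactly one BOS token.
print(ids[:3], tokenizer.bos_token_id)
assert ids[:2] != [tokenizer.bos_token_id] * 2, "duplicated BOS token"
```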