LoRA Adapter for LLaMA 65B 'pre-trained' on several datasets from the OpenAssistant project

This repo contains a low-rank adapter for LLaMA 65B, fitted on datasets that are part of the OpenAssistant project.

The model was trained with flash attention, gradient checkpointing, and DeepSpeed ZeRO stage 3 with parameter and optimizer offload, on 8 x A100 80GB GPUs.
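
As a rough illustration of that setup, the sketch below writes out a DeepSpeed-style configuration with ZeRO stage 3 and offload of parameters and optimizer state. The actual training config is not given in this README, so the CPU offload target, the `pin_memory` setting, and the per-GPU micro-batch size are assumptions. The helper `grad_accum_steps` is hypothetical and only shows how a global batch size of 128 could be split across 8 GPUs.

```python
import json

# Assumed values: the real training config is not published in this README.
NUM_GPUS = 8
GLOBAL_BATCH_SIZE = 128   # "Batch size: 128" under Model Details
MICRO_BATCH_PER_GPU = 2   # assumption, chosen only for illustration


def grad_accum_steps(global_batch, micro_batch, num_gpus):
    """Gradient accumulation steps needed to reach the global batch size."""
    per_step = micro_batch * num_gpus
    if global_batch % per_step != 0:
        raise ValueError("global batch must be divisible by micro_batch * num_gpus")
    return global_batch // per_step


ds_config = {
    "zero_optimization": {
        "stage": 3,
        # Offloading to CPU is an assumption; the README only says "offload".
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": MICRO_BATCH_PER_GPU,
    "gradient_accumulation_steps": grad_accum_steps(
        GLOBAL_BATCH_SIZE, MICRO_BATCH_PER_GPU, NUM_GPUS
    ),
}

print(json.dumps(ds_config, indent=2))
```

With these assumed numbers, each optimizer step accumulates gradients over 8 micro-steps (2 samples x 8 GPUs x 8 steps = 128).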

You will most likely need a DeepSpeed inference script to run the model; the BLOOM inference scripts are a useful reference.

Dataset Details

- alpaca_gpt4:
    val_split: 0.025
    max_val_set: 250
- vicuna:
    val_split: 0.025
    max_val_set: 250
- gpteacher_roleplay:
    val_split: 0.05
- wizardlm_70k:
    val_split: 0.05
    max_val_set: 500
- joke:
    val_split: 0.05
- oa_stackexchange:
    val_split: 0.05
    fraction: 0.1
    max_val_set: 1000
- tell_a_joke:
    val_split: 0.05
    max_val_set: 250
- webgpt:
    val_split: 0.05
    max_val_set: 250
- gpt4all:
    val_split: 0.01
    max_val_set: 1000
- code_alpaca:
    val_split: 0.05
    max_val_set: 250
- oig_file:
    source_url: https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl
    max_count: 10000
    min_length: 250
    val_split: 0.05
    max_val_set: 250
- minimath:
    val_split: 0.05
- humaneval_mbpp_codegen_qa:
    val_split: 0.05
- humaneval_mbpp_testgen_qa:
    val_split: 0.05
- grade_school_math_instructions:
    val_split: 0.05
- recipes:
    val_split: 0.05
- cmu_wiki_qa:
    val_split: 0.05
- oa_wiki_qa_bart_10000row:
    val_split: 0.05
    max_val_set: 250
- soda:
    fraction: 0.25
    max_val_set: 1000
- oa_leet10k:
    val_split: 0.05
    max_val_set: 250
- dolly15k:
    val_split: 0.05
    max_val_set: 300
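
To make the fields above concrete, here is a minimal sketch of how `fraction`, `val_split`, and `max_val_set` might interact. The semantics are an assumption (subsample by `fraction`, hold out a `val_split` share, then cap the validation set at `max_val_set`); the training code is the authority. The function `split_sizes` and the dataset size of 10,000 are hypothetical.

```python
def split_sizes(n_examples, val_split, max_val_set=None, fraction=1.0):
    """Return (train_size, val_size) under the assumed semantics.

    1. Keep only `fraction` of the dataset.
    2. Hold out a `val_split` share of what remains for validation.
    3. Cap the validation set at `max_val_set` examples, if given.
    """
    n_used = round(n_examples * fraction)
    n_held_out = round(n_used * val_split)
    n_val = n_held_out if max_val_set is None else min(n_held_out, max_val_set)
    n_train = n_used - n_held_out
    return n_train, n_val


# Hypothetical 10,000-example dataset with a capped validation set.
print(split_sizes(10_000, val_split=0.05, max_val_set=250))  # (9500, 250)

# Same dataset, but only 10% of it is used (as with oa_stackexchange's fraction: 0.1).
print(split_sizes(10_000, val_split=0.05, max_val_set=1000, fraction=0.1))  # (950, 50)
```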

Model Details

  • Developed as part of the OpenAssistant Project

  • Model type: PEFT Adapter for frozen LLaMA

  • Language: English

  • Epochs: 1

  • Batch size: 128

  • Max Length: 2048

  • Learning rate: 5e-5

  • LoRA r: 16

  • LoRA alpha: 32
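
For readers unfamiliar with the LoRA hyperparameters, the sketch below works through what r = 16 and alpha = 32 imply in the standard LoRA formulation: the low-rank update is scaled by alpha / r, and each adapted weight matrix gains r * (d_in + d_out) trainable parameters. Which modules were adapted is not stated here, so the square 8192 x 8192 projection (8192 being the LLaMA 65B hidden size) is only an illustrative assumption, and both helper functions are hypothetical.

```python
def lora_scaling(r, alpha):
    """Scaling factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / r


def lora_param_count(d_in, d_out, r):
    """Trainable parameters added to one weight matrix: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


HIDDEN_SIZE = 8192  # LLaMA 65B hidden size; the adapted modules are an assumption

print(lora_scaling(r=16, alpha=32))                        # 2.0
print(lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, r=16))    # 262144
```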


Two special tokens are used to mark the beginning of user and assistant turns: <|prompter|> and <|assistant|>. Each turn ends with the end-of-sequence token, which is </s> for the LLaMA tokenizer.

Input prompt example:

<|prompter|>What is a meme, and what's the history behind this word?</s><|assistant|>

The input ends with the <|assistant|> token to signal that the model should start generating the assistant reply.
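
The same format can be sketched for conversations with more than one turn. This assumes that earlier turns are simply concatenated, each closed by </s>, which matches the single-turn example above but is not spelled out in this README. The helper `build_prompt` is hypothetical.

```python
PROMPTER = "<|prompter|>"
ASSISTANT = "<|assistant|>"
EOS = "</s>"


def build_prompt(turns):
    """Build a prompt from a list of (role, text) pairs.

    role is "prompter" or "assistant". The last turn must come from the
    prompter, and the result ends with <|assistant|> so that the model
    generates the next assistant reply.
    """
    role_tokens = {"prompter": PROMPTER, "assistant": ASSISTANT}
    if not turns or turns[-1][0] != "prompter":
        raise ValueError("the conversation must end with a prompter turn")
    parts = [f"{role_tokens[role]}{text}{EOS}" for role, text in turns]
    return "".join(parts) + ASSISTANT


print(build_prompt([("prompter", "What is a meme?")]))
print(build_prompt([
    ("prompter", "What is a meme?"),
    ("assistant", "A meme is an idea that spreads between people."),
    ("prompter", "Who coined the word?"),
]))
```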

Example inference code (note that the extra token embeddings must be loaded along with the LoRA weights):

from pathlib import Path

import torch
import transformers
from huggingface_hub import hf_hub_download
from peft import PeftModel
from transformers import GenerationConfig

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16
repo_id = "jordiclive/Lora-llama-65B-pre-train-oasst"
base_model = "decapoda-research/llama-65b-hf"

# Model Loading
def add_embeddings(model, embed_path, tokenizer):
    old_embeddings = model.get_input_embeddings()
    old_num_tokens, old_embedding_dim = old_embeddings.weight.size()
    new_embeddings = torch.nn.Embedding(old_num_tokens, old_embedding_dim)
    new_embeddings.to(old_embeddings.weight.device, dtype=old_embeddings.weight.dtype)
    embed_weights = torch.load(embed_path, map_location=old_embeddings.weight.device)
    vocab_size = tokenizer.vocab_size
    new_embeddings.weight.data[:vocab_size, :] = old_embeddings.weight.data[:vocab_size, :]
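    new_embeddings.weight.data[vocab_size : vocab_size + embed_weights.shape[0], :] = embed_weights.to(
        new_embeddings.weight.dtype
    )
    # Swap the extended embedding matrix into the model.
    model.set_input_embeddings(new_embeddings)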

def load_peft_model(model, peft_model_path, tokenizer):
    embed_weights = hf_hub_download(peft_model_path, "extra_embeddings.pt")
    model.resize_token_embeddings(tokenizer.vocab_size + torch.load(embed_weights).shape[0])
    model.config.eos_token_id = tokenizer.eos_token_id
    model.config.bos_token_id = tokenizer.bos_token_id
    model.config.pad_token_id = tokenizer.pad_token_id
    model = PeftModel.from_pretrained(model, peft_model_path)
    model.eos_token_id = tokenizer.eos_token_id
    add_embeddings(model, embed_weights, tokenizer)
    return model

tokenizer = transformers.AutoTokenizer.from_pretrained(repo_id)

model = transformers.AutoModelForCausalLM.from_pretrained(
    base_model, torch_dtype=dtype, trust_remote_code=True,
)
model = load_peft_model(model, repo_id, tokenizer)

# Device configuration
model = model.to(device)
if dtype == torch.float16:
    model = model.half()

# Choose Generation parameters

generation_config = GenerationConfig()  # library defaults; pass temperature, top_p, top_k, num_beams, etc. to customize

def format_system_prompt(prompt, eos_token="</s>"):
    return "{}{}{}{}".format("<|prompter|>", prompt, eos_token, "<|assistant|>")

def generate(prompt, generation_config=generation_config, max_new_tokens=2048, device=device):
    prompt = format_system_prompt(prompt)  # OpenAssistant Prompt Format expected
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
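        generation_output = model.generate(
            input_ids=input_ids,
            generation_config=generation_config,
            return_dict_in_generate=True,  # required for .sequences below
            max_new_tokens=max_new_tokens,
        )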
    s = generation_output.sequences[0]
    output = tokenizer.decode(s)
    print("Text generated:")
    print(output)
    return output

generate("What is a meme, and what's the history behind this word?")
generate("What's the Earth total population")
generate("Write a story about future of AI development")
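
The string returned by generate() contains the prompt and special tokens as well as the reply. A minimal sketch of pulling out just the assistant's answer, assuming the decoded text follows the prompt format described earlier, could look like this. The helper `extract_reply` is hypothetical and works on plain strings only.

```python
def extract_reply(decoded, assistant_token="<|assistant|>", eos_token="</s>"):
    """Return the text after the last <|assistant|> marker, up to the next </s>."""
    _, marker, reply = decoded.rpartition(assistant_token)
    if not marker:
        # No assistant marker found: fall back to the whole string.
        reply = decoded
    # Drop the end-of-sequence token and anything generated after it.
    reply = reply.split(eos_token, 1)[0]
    return reply.strip()


decoded = "<s><|prompter|>What is a meme?</s><|assistant|> A meme is a shared idea.</s>"
print(extract_reply(decoded))  # A meme is a shared idea.
```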
