File size: 7,943 Bytes

---
license: apache-2.0
datasets:
- ai2_arc
- unalignment/spicy-3.1
- codeparrot/apps
- facebook/belebele
- boolq
- jondurbin/cinematika-v0.1
- drop
- lmsys/lmsys-chat-1m
- TIGER-Lab/MathInstruct
- cais/mmlu
- Muennighoff/natural-instructions
- openbookqa
- piqa
- Vezora/Tested-22k-Python-Alpaca
- cakiki/rosetta-code
- Open-Orca/SlimOrca
- spider
- squad_v2
- migtissera/Synthia-v1.3
- datasets/winogrande
- nvidia/HelpSteer
- Intel/orca_dpo_pairs
- Squish42/bluemoon-fandom-1-1-rp-cleaned
- LDJnr/Capybara
- JULIELab/EmoBank
- kingbri/PIPPA-shareGPT
---

# A bagel, with everything (except DPO)

![bagel](bagel.png)

## Overview

An experimental fine-tune of [mamba-2.8b-slimpj](https://hf.co/state-spaces/mamba-2.8b-slimpj) using [bagel](https://github.com/jondurbin/bagel)

Default recommended system prompt:
```
You are a helpful, unbiased, uncensored assistant.
```

Supports several prompt formats, but you can also use `tokenizer.apply_chat_template`

You probably want the [DPO version](https://hf.co/jondurbin/bagel-dpo-2.8b-v0.2) - it's much better.

## Example chat script

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

device = "cuda"
tokenizer = AutoTokenizer.from_pretrained("bagel-final-2.8b-v0.2")
model = MambaLMHeadModel.from_pretrained("bagel-final-2.8b-v0.2", device="cuda", dtype=torch.float32)

messages = [{"role": "system", "content": "You are a helpful, unbiased, uncensored assistant."}]
while True:
    user_message = input("[INST] ")
    messages.append({"role": "user", "content": user_message})
    input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda")
    out = model.generate(input_ids=input_ids, max_length=2000, temperature=0.9, top_p=0.7, eos_token_id=tokenizer.eos_token_id, repetition_penalty=1.07)
    decoded = tokenizer.batch_decode(out)[0].split("[/INST]")[-1].replace("</s>", "").strip()
    messages.append({"role": "assistant", "content": decoded})
    print("[/INST]", decoded)
```

## SFT data sources

*Yes, you will see benchmark names in the list, but this only uses the train splits, and a decontamination by cosine similarity is performed at the end as a sanity check*

- [ai2_arc](https://huggingface.co/datasets/ai2_arc)
  - Abstraction and reasoning dataset, useful in measuring "intelligence" to a certain extent.
- [airoboros](https://huggingface.co/datasets/unalignment/spicy-3.1)
  - Variety of categories of synthetic instructions generated by gpt-4.
- [apps](https://huggingface.co/datasets/codeparrot/apps)
  - Python coding dataset with 10k problems.
- [belebele](https://huggingface.co/datasets/facebook/belebele)
  - Multi-lingual reading comprehension dataset.
- [bluemoon](https://huggingface.co/datasets/Squish42/bluemoon-fandom-1-1-rp-cleaned)
  - Roleplay data scraped from Bluemoon, then cleaned and formatted as ShareGPT.
- [boolq](https://huggingface.co/datasets/boolq)
  - Corpus of yes/no questions (which can be surprisingly difficult for AI to answer apparently?)
- [capybara](https://huggingface.co/datasets/LDJnr/Capybara)
  - Multi-turn dataset used to create the capybara models.
- [cinematika](https://huggingface.co/datasets/jondurbin/cinematika-v0.1) (instruction and plain text)
  - RP-style data synthesized from movie scripts so the model isn't quite as boring as it otherwise would be.
- [drop](https://huggingface.co/datasets/drop)
  - More reading comprehension.
- [emobank](https://github.com/JULIELab/EmoBank)
  - Emotion annotations using the Valence-Arousal-Domninance scheme.
- [gutenberg](https://www.gutenberg.org/) (plain text)
  - Books/plain text, again to make the model less boring, only a handful of examples supported by [chapterize](https://github.com/JonathanReeve/chapterize)
- [lmsys_chat_1m](https://huggingface.co/datasets/lmsys/lmsys-chat-1m) (only gpt-4 items, also used for DPO)
  - Chats collected by the lmsys chat arena, containing a wide variety of chats with various models.
- [mathinstruct](https://huggingface.co/datasets/TIGER-Lab/MathInstruct)
  - Composite dataset with a variety of math-related tasks and problem/question formats.
- [mmlu](https://huggingface.co/datasets/cais/mmlu)
  - Massive Multitask Language Understanding - a wide variety of questions about various subject matters.
- [natural_instructions](https://huggingface.co/datasets/Muennighoff/natural-instructions)
  - Millions of instructions from 1600+ task categories (sampled down substantially, stratified by task type)
- [openbookqa](https://huggingface.co/datasets/openbookqa)
  - Question answering dataset.
- [pippa](https://huggingface.co/datasets/kingbri/PIPPA-shareGPT)
  - Deduped version of [PIPPA](https://huggingface.co/datasets/PygmalionAI/PIPPA) in ShareGPT format.
- [piqa](https://huggingface.co/datasets/piqa)
  - Phyiscal interaction question answering.
- [python_alpaca](https://huggingface.co/datasets/Vezora/Tested-22k-Python-Alpaca)
  - Python instruction response pairs, validated as functional.
- [rosetta_code](https://huggingface.co/datasets/cakiki/rosetta-code)
  - Code problems and solutions in a variety of programming languages taken from rosettacode.org.
- [slimorca](https://huggingface.co/datasets/Open-Orca/SlimOrca)
  - Collection of ~500k gpt-4 verified chats from OpenOrca.
- [spider](https://huggingface.co/datasets/spider)
  - SQL-targeted dataset.
- [squad_v2](https://huggingface.co/datasets/squad_v2)
  - Contextual question answering (RAG).
- [synthia](https://huggingface.co/datasets/migtissera/Synthia-v1.3)
  - GPT-4 generated data using advanced prompting from Migel Tissera.
- [winogrande](https://huggingface.co/datasets/winogrande)
  - Fill in the blank style prompts.

Only the train splits were used (if a split was provided), and an additional pass of decontamination is performed using approximate nearest neighbor search (via faiss).

## Prompt formatting

In sticking with the theme of the bagel, I didn't want to use a single prompt format, so I used 4 - vicuna, llama-2, alpaca, and chat-ml (sorta).
I also didn't want to randomly select a single prompt format for each item (hoping each instruction would generalize more when used in a variety of prompt formats), so each instruction is actually converted into every prompt format.

This means each epoch of our fine-tune is really basically 4 epochs.  So, for the fine-tunes, I would recommend only doing 1 epoch (or 0.75 epochs).  I am testing with a single epoch using a relatively low learning rate.

### Alpaca (sort of)

```
Below is an instruction that describes a task.  Write a response that appropriately completes the request.

### Instruction:
{system prompt, if provided}
{instruction}

### Response:
```

The main difference here is that because of the dataset formatting and variety of data sources, it would have been much to tedious to add an `### Input:` block, so the inputs are just in the instruction section.

### Vicuna

```
{system prompt, if provided, randomly defaulting to "A chat between a user and an unbiased, uncensored assistant."}
USER: {instruction}
ASSISTANT: 
```

### ChatML (sort of)

I don't really understand the point of having special tokens for `<|im_start|>` and `<|im_end|>`, because in practice they just act as BOS and EOS tokens (but, please correct me if I'm wrong).

So, instead of:
```text
{bos}<|im_start|>{role}
{text}
<|im_end|>{eos}
```

I just changed it to:
```text
{bos}{role}
{text}
{eos}
```

If you *really* want to use `<|im_start|>` and `<|im_end|>`, just update your `tokenizer_config.json` to use `<|im_start|>` instead of `<s>` and `<|im_end|>` instead of `</s>` and when tokenizing.  And if you still don't like what I've done to this chat-ml-ish format, feel free to cry into your pillow or fork the code and do a new fine-tune.

### Llama-2 chat

```
[INST] <<SYS>>
{system}
<</SYS>>

{instruction} [/INST]
```