---
license: cc-by-4.0
library_name: transformers
tags:
- supertrainer2000
- not-for-all-audiences
- writing
- roleplay
datasets:
- euclaise/TinyCoT
- euclaise/mathoverflow-accepted
- euclaise/reddit-instruct-curated
- euclaise/WritingPrompts_curated
- sablo/oasst2_curated
- euclaise/mathqa_programs
- BEE-spoke-data/coedit-reworded-deduped
- pszemraj/booksum-short
- euclaise/reddit-instruct
- euclaise/SciCoT
- euirim/goodwiki
- neulab/conala
- squad
- ropes
- euclaise/logician
- chargoddard/rpguild
- lemonilia/LimaRP
base_model:
- euclaise/Memphis-CoT-3B
language:
- en
---
Memphis-scribe 3B is a finetune of Memphis-CoT 3B, which is itself a finetune of StableLM 3B 4e1t, on additional creative data.

It is trained further on TinyCoT, but also on the following (a rough sketch of how such a subsampled mixture could be assembled appears after the list):
- 5000 comments from reddit-instruct-curated
- 20000 comments from writingprompts-curated
- 2000 examples of converting MathQA problems to Python snippets
- 2000 examples of shorter booksum cases (both chapter->summary and summary->chapter tasks)
- 2000 examples from mathoverflow-accepted comments with >10 upvotes
- 2000 examples from coedit-reworded-deduped
- 500 examples from SQuAD, for generating QA pairs given the context
- 500 examples from ROPES, for generating scenario+QA triplets given the context
- conala
- 500 examples from logician
- 500 examples from goodwiki, for generating article given the title and description
- 2000 examples from rpguild
- Curated subset of oasst2
- LimaRP
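The exact sampling code is not included in this card; the snippet below is only a minimal sketch of how a subsampled mixture like the one above could be assembled with the Hugging Face `datasets` library. The split names, seed, and subset sizes shown are assumptions for illustration.

```python
from datasets import load_dataset, concatenate_datasets

def take(name, n, split="train", seed=42):
    """Load a dataset and keep a random subset of up to n rows (assumed split/seed)."""
    ds = load_dataset(name, split=split).shuffle(seed=seed)
    return ds.select(range(min(n, len(ds))))

# A few representative components of the mixture described above.
reddit = take("euclaise/reddit-instruct-curated", 5000)
writing = take("euclaise/WritingPrompts_curated", 20000)
mathqa = take("euclaise/mathqa_programs", 2000)
booksum = take("pszemraj/booksum-short", 2000)

# Each component would still need to be mapped to a common prompt/response schema
# before the subsets can be concatenated and shuffled for training, e.g.:
# mixture = concatenate_datasets([reddit_mapped, writing_mapped, ...]).shuffle(seed=42)
```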
## Training procedure
I started from Memphis-CoT 3B, which used a novel iterative contrastive finetuning procedure to improve reasoning ability.
I first generated completions just as in each of the Memphis-CoT cycles.
Then, for each example in the dataset, I sampled a correct and an incorrect completion. I applied the same ranking loss over these completions (with a weight of 0.2), but applied the cross-entropy loss over the example tokens rather than the completion tokens.
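The actual loss implementation (from supertrainer2000) is not reproduced here; the sketch below shows one plausible way to combine a pairwise ranking term over correct/incorrect completions with a cross-entropy term over the example tokens, using the 0.2 ranking weight mentioned above. The margin ranking formulation, function names, and the `completion_mask` field are illustrative assumptions, not the training code that was used.

```python
import torch
import torch.nn.functional as F

RANK_WEIGHT = 0.2  # weight on the ranking term, as described above

def sequence_logprob(logits, labels, mask):
    """Mean log-probability of the labeled tokens; mask selects which tokens count."""
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return (token_logp * mask).sum(-1) / mask.sum(-1).clamp(min=1)

def combined_loss(model, example_batch, correct_batch, incorrect_batch):
    # Cross-entropy over the example tokens (not over the sampled completions).
    ex_logits = model(example_batch["input_ids"]).logits[:, :-1]
    ce = F.cross_entropy(
        ex_logits.reshape(-1, ex_logits.size(-1)),
        example_batch["input_ids"][:, 1:].reshape(-1),
    )

    # Pairwise ranking: prefer the correct completion over the incorrect one.
    good = sequence_logprob(
        model(correct_batch["input_ids"]).logits[:, :-1],
        correct_batch["input_ids"][:, 1:],
        correct_batch["completion_mask"][:, 1:],
    )
    bad = sequence_logprob(
        model(incorrect_batch["input_ids"]).logits[:, :-1],
        incorrect_batch["input_ids"][:, 1:],
        incorrect_batch["completion_mask"][:, 1:],
    )
    rank = F.relu(1.0 - (good - bad)).mean()  # simple margin ranking loss (margin = 1)

    return ce + RANK_WEIGHT * rank
```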
Finally, I averaged the resulting model with the Memphis-CoT model from before this additional training, again using spherical linear interpolation, this time with a weight of 0.8.
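For reference, a minimal sketch of spherical linear interpolation (SLERP) between two weight tensors is shown below. The merge would be applied per parameter across the two checkpoints; the helper and the usage comment are illustrative rather than the exact merge code that was used.

```python
import torch

def slerp(a: torch.Tensor, b: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between flattened tensors a and b, with weight t on b."""
    a_flat, b_flat = a.flatten().float(), b.flatten().float()
    a_dir = a_flat / (a_flat.norm() + eps)
    b_dir = b_flat / (b_flat.norm() + eps)
    omega = torch.arccos(torch.clamp(a_dir @ b_dir, -1.0, 1.0))
    so = torch.sin(omega)
    if so.abs() < eps:  # nearly parallel: fall back to linear interpolation
        out = (1.0 - t) * a_flat + t * b_flat
    else:
        out = (torch.sin((1.0 - t) * omega) / so) * a_flat + (torch.sin(t * omega) / so) * b_flat
    return out.reshape(a.shape).to(a.dtype)

# Illustrative per-parameter usage with the interpolation weight described above:
# merged[name] = slerp(memphis_cot_state[name], finetuned_state[name], t=0.8)
```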
## Prompt formats
```
### User:
[insert instruction here]
### Assistant:
[insert response here]
### User:
...
```
Alternatively:
```
### System:
[Insert system message here, focused on roleplay]
### User:
[insert instruction here]
### Assistant:
[insert response here]
### User:
...
```
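Below is a minimal example of using the first prompt format with the `transformers` generation API. The repository id, the exact whitespace between turns, and the sampling settings are assumptions for illustration, not tuned recommendations.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "euclaise/Memphis-scribe-3B"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Build a prompt in the "### User:" / "### Assistant:" format shown above.
prompt = "### User:\nWrite a short story about a lighthouse keeper.\n### Assistant:\n"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.8,  # illustrative sampling settings
)
# Print only the newly generated continuation.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```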
## Benchmarks
This model performs significantly worse than Memphis-CoT on benchmarks, despite being better suited to chat and creative writing tasks. This is an expected tradeoff, especially for small models.
| Model | GSM8K (5-shot) | AGIEval (English/Nous subset, acc_norm) | BIG Bench Hard (CoT, few-shot*) |
|---|---|---|---|
| StableLM 3B Base | 2.05% | 25.14% | 36.75% |
| Memphis-CoT 3B | 18.8% | 27.22% | 36.92% |
| Memphis-scribe 3B | 9.55% | 24.78% | |

*5-shot, as performed automatically by LM Evaluation Harness's `bbh_cot_fewshot` task, even with `num_fewshot=0`.