---
license: cc-by-4.0
library_name: transformers
tags:
- supertrainer2000
- not-for-all-audiences
- writing
- roleplay
datasets:
- euclaise/TinyCoT
- euclaise/mathoverflow-accepted
- euclaise/reddit-instruct-curated
- euclaise/WritingPrompts_curated
- sablo/oasst2_curated
- euclaise/mathqa_programs
- BEE-spoke-data/coedit-reworded-deduped
- pszemraj/booksum-short
- euclaise/reddit-instruct
- euclaise/SciCoT
- euirim/goodwiki
- neulab/conala
- squad
- ropes
- euclaise/logician
- chargoddard/rpguild
- lemonilia/LimaRP
base_model:
- euclaise/Memphis-CoT-3B
language:
- en
---
Memphis-scribe 3B is a finetune of Memphis-CoT 3B, which is itself a finetune of StableLM 3B 4e1t, on additional creative data.

It is trained further on TinyCoT, but also on the following (a rough sketch of how such a subsampled mixture could be assembled appears after the list):
- 5000 comments from reddit-instruct-curated
- 20000 comments from writingprompts-curated
- 2000 examples of converting MathQA problems to Python snippets
- 2000 examples of shorter booksum cases (both chapter->summary and summary->chapter tasks)
- 2000 examples from mathoverflow-accepted comments with >10 upvotes
- 2000 examples from coedit-reworded-deduped
- 500 examples from SQuAD, for generating QA pairs given the context
- 500 examples from ROPES, for generating scenario+QA triplets given the context
- conala
- 500 examples from logician
- 500 examples from goodwiki, for generating article given the title and description
- 2000 examples from rpguild
- Curated subset of oasst2
- LimaRP
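The exact sampling code is not included in this card; the snippet below is only a minimal sketch of how a subsampled mixture like the one above could be assembled with the Hugging Face `datasets` library. The split names, seed, and subset sizes shown are assumptions for illustration.

```python
from datasets import load_dataset, concatenate_datasets

def take(name, n, split="train", seed=42):
    """Load a dataset and keep a random subset of up to n rows (assumed split/seed)."""
    ds = load_dataset(name, split=split).shuffle(seed=seed)
    return ds.select(range(min(n, len(ds))))

# A few representative components of the mixture described above.
reddit = take("euclaise/reddit-instruct-curated", 5000)
writing = take("euclaise/WritingPrompts_curated", 20000)
mathqa = take("euclaise/mathqa_programs", 2000)
booksum = take("pszemraj/booksum-short", 2000)

# Each component would still need to be mapped to a common prompt/response schema
# before the subsets can be concatenated and shuffled for training, e.g.:
# mixture = concatenate_datasets([reddit_mapped, writing_mapped, ...]).shuffle(seed=42)
```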
## Training procedure
I started from Memphis-CoT 3B, which used a novel iterative contrastive finetuning procedure to improve reasoning ability.
I first generated completions just as in each of the Memphis-CoT cycles.
Then, for each example in the dataset, I sampled a correct and an incorrect completion. I applied the same ranking loss over these completions (with a weight of 0.2), but applied the cross-entropy loss over the example tokens rather than the completion tokens.
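The actual loss implementation (from supertrainer2000) is not reproduced here; the sketch below shows one plausible way to combine a pairwise ranking term over correct/incorrect completions with a cross-entropy term over the example tokens, using the 0.2 ranking weight mentioned above. The margin ranking formulation, function names, and the `completion_mask` field are illustrative assumptions, not the training code that was used.

```python
import torch
import torch.nn.functional as F

RANK_WEIGHT = 0.2  # weight on the ranking term, as described above

def sequence_logprob(logits, labels, mask):
    """Mean log-probability of the labeled tokens; mask selects which tokens count."""
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return (token_logp * mask).sum(-1) / mask.sum(-1).clamp(min=1)

def combined_loss(model, example_batch, correct_batch, incorrect_batch):
    # Cross-entropy over the example tokens (not over the sampled completions).
    ex_logits = model(example_batch["input_ids"]).logits[:, :-1]
    ce = F.cross_entropy(
        ex_logits.reshape(-1, ex_logits.size(-1)),
        example_batch["input_ids"][:, 1:].reshape(-1),
    )

    # Pairwise ranking: prefer the correct completion over the incorrect one.
    good = sequence_logprob(
        model(correct_batch["input_ids"]).logits[:, :-1],
        correct_batch["input_ids"][:, 1:],
        correct_batch["completion_mask"][:, 1:],
    )
    bad = sequence_logprob(
        model(incorrect_batch["input_ids"]).logits[:, :-1],
        incorrect_batch["input_ids"][:, 1:],
        incorrect_batch["completion_mask"][:, 1:],
    )
    rank = F.relu(1.0 - (good - bad)).mean()  # simple margin ranking loss (margin = 1)

    return ce + RANK_WEIGHT * rank
```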
Finally, I averaged the resulting model with the Memphis-CoT model from before this additional training, again using spherical linear interpolation, this time with a weight of 0.8.
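For reference, a minimal sketch of spherical linear interpolation (SLERP) between two weight tensors is shown below. The merge would be applied per parameter across the two checkpoints; the helper and the usage comment are illustrative rather than the exact merge code that was used.

```python
import torch

def slerp(a: torch.Tensor, b: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between flattened tensors a and b, with weight t on b."""
    a_flat, b_flat = a.flatten().float(), b.flatten().float()
    a_dir = a_flat / (a_flat.norm() + eps)
    b_dir = b_flat / (b_flat.norm() + eps)
    omega = torch.arccos(torch.clamp(a_dir @ b_dir, -1.0, 1.0))
    so = torch.sin(omega)
    if so.abs() < eps:  # nearly parallel: fall back to linear interpolation
        out = (1.0 - t) * a_flat + t * b_flat
    else:
        out = (torch.sin((1.0 - t) * omega) / so) * a_flat + (torch.sin(t * omega) / so) * b_flat
    return out.reshape(a.shape).to(a.dtype)

# Illustrative per-parameter usage with the interpolation weight described above:
# merged[name] = slerp(memphis_cot_state[name], finetuned_state[name], t=0.8)
```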
## Prompt formats
```
### User:
[insert instruction here]
### Assistant:
[insert response here]
### User:
...
```
Alternatively:
```
### System:
[Insert system message here, focused on roleplay]
### User:
[insert instruction here]
### Assistant:
[insert response here]
### User:
...
```
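Below is a minimal example of using the first prompt format with the `transformers` generation API. The repository id, the exact whitespace between turns, and the sampling settings are assumptions for illustration, not tuned recommendations.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "euclaise/Memphis-scribe-3B"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Build a prompt in the "### User:" / "### Assistant:" format shown above.
prompt = "### User:\nWrite a short story about a lighthouse keeper.\n### Assistant:\n"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.8,  # illustrative sampling settings
)
# Print only the newly generated continuation.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```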
## Benchmarks
This model performs significantly worse than Memphis-CoT on benchmarks, despite being better suited to chat and creative writing tasks. This is an expected tradeoff, especially for small models.
| Model | GSM8K (5-shot) | AGIEval (English/Nous subset, acc_norm) | BIG Bench Hard (CoT, few-shot*) |
|---|---|---|---|
| StableLM 3B Base | 2.05% | 25.14% | 36.75% |
| Memphis-CoT 3B | 18.8% | 27.22% | 36.92% |
| Memphis-scribe 3B | 9.55% | 24.78% | |

*5-shot, as performed automatically by LM Evaluation Harness's `bbh_cot_fewshot` task, even with `num_fewshot=0`.