Memphis-scribe-3B / README.md
metadata
license: cc-by-4.0
library_name: transformers
tags:
  - supertrainer2000
  - not-for-all-audiences
  - writing
  - roleplay
datasets:
  - euclaise/TinyCoT
  - euclaise/mathoverflow-accepted
  - euclaise/reddit-instruct-curated
  - euclaise/WritingPrompts_curated
  - sablo/oasst2_curated
  - euclaise/mathqa_programs
  - BEE-spoke-data/coedit-reworded-deduped
  - pszemraj/booksum-short
  - euclaise/reddit-instruct
  - euclaise/SciCoT
  - euirim/goodwiki
  - neulab/conala
  - squad
  - ropes
  - euclaise/logician
  - chargoddard/rpguild
  - lemonilia/LimaRP
base_model:
  - euclaise/Memphis-CoT-3B
language:
  - en


Memphis-scribe 3B is a finetune of Memphis-CoT 3B (itself a finetune of StableLM 3B 4e1t) on more creative data.

It is trained further on TinyCoT, but also on

Training procedure

I started from Memphis-CoT 3B, which used a novel iterative contrastive finetuning procedure to improve reasoning ability.

I first generated completions just as in each of the Memphis-CoT cycles.

Then, for each example in the dataset, I sampled one correct and one incorrect completion. I applied the same ranking loss over these completions (with a weight of 0.2), but applied the cross-entropy loss over the example tokens instead of the completion tokens.
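As a rough illustration of how the two terms might combine, here is a minimal sketch. The exact ranking loss is not spelled out in this card, so the pairwise hinge below is only a stand-in, and the additive combination, the `margin` value, and all function names are assumptions for illustration.

```python
def cross_entropy(token_logprobs):
    """Mean negative log-likelihood over the given tokens."""
    return -sum(token_logprobs) / len(token_logprobs)


def pairwise_ranking_loss(good_logprob, bad_logprob, margin=1.0):
    """Stand-in hinge loss (assumption, not the card's actual ranking loss).

    It is zero once the correct completion's log-probability exceeds the
    incorrect one's by at least `margin`.
    """
    return max(0.0, margin - (good_logprob - bad_logprob))


def combined_loss(example_token_logprobs, good_logprob, bad_logprob, rank_weight=0.2):
    """Cross-entropy over the example tokens plus a weighted ranking term.

    rank_weight=0.2 mirrors the weight quoted above; adding the two terms
    like this is an assumption.
    """
    ce = cross_entropy(example_token_logprobs)
    rank = pairwise_ranking_loss(good_logprob, bad_logprob)
    return ce + rank_weight * rank
```

In a real training loop the log-probabilities would come from the model's forward pass; plain floats are used here only to keep the arithmetic visible.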

Finally, I averaged the resulting model with the Memphis-CoT model from before the additional training, again using spherical linear interpolation, this time with a weight of 0.8.
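For readers unfamiliar with spherical linear interpolation, a minimal sketch on flat lists of floats follows. It is not the merging code used for this model: the function name, the per-vector treatment, and the convention that `t` is the weight on the second argument are all assumptions, and the text above does not say which checkpoint the 0.8 applies to.

```python
import math


def slerp(a, b, t, eps=1e-8):
    """Spherically interpolate between weight vectors a and b.

    t=0 returns a and t=1 returns b. Falls back to plain linear
    interpolation when the vectors are (anti)parallel or have zero norm,
    where the spherical formula is numerically undefined.
    """
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    linear = [(1 - t) * x + t * y for x, y in zip(a, b)]
    if norm_a < eps or norm_b < eps:
        return linear
    # Angle between the two vectors, with the cosine clamped for safety.
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))
    sin_omega = math.sin(omega)
    if abs(sin_omega) < eps:
        return linear
    wa = math.sin((1 - t) * omega) / sin_omega
    wb = math.sin(t * omega) / sin_omega
    return [wa * x + wb * y for x, y in zip(a, b)]


# Hypothetical usage: merge one flattened parameter tensor with weight 0.8.
merged = slerp([1.0, 0.0], [0.0, 1.0], 0.8)
```

A full merge would apply this to each parameter tensor of the two checkpoints in turn.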

Prompt formats

### User:
[insert instruction here]
### Assistant:
[insert response here]
### User:
...

Alternatively:

### System:
[Insert system message here, focused on roleplay]
### User:
[insert instruction here]
### Assistant:
[insert response here]
### User:
...
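
The formats above can be assembled programmatically. The helper below is a small sketch: `build_prompt` is a hypothetical name, and the exact whitespace (a single newline after each header and after each message) is an assumption based on how the templates are laid out here.

```python
def build_prompt(turns, system=None):
    """Build a prompt string in the User/Assistant format.

    turns: list of (user_message, assistant_message) pairs. Use None as
    the assistant message of the final pair to leave the prompt open for
    the model to complete.
    system: optional system message, emitted first if given.
    """
    parts = []
    if system is not None:
        parts.append(f"### System:\n{system}\n")
    for user, assistant in turns:
        parts.append(f"### User:\n{user}\n### Assistant:\n")
        if assistant is not None:
            parts.append(f"{assistant}\n")
    return "".join(parts)


# Example: a single-turn prompt with a roleplay-focused system message.
prompt = build_prompt(
    [("Describe the tavern as we walk in.", None)],
    system="You are the narrator of a fantasy adventure.",
)
```

The resulting string ends with `### Assistant:` followed by a newline, so generation continues with the model's reply.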

Benchmarks

This model performs significantly worse than Memphis-CoT on benchmarks, despite being better suited to chat and creative writing tasks. This is an expected tradeoff, especially for small models.

| Model | GSM8K (5-shot) | AGIEval (English/Nous subset, acc_norm) | BIG Bench Hard (CoT, few-shot\*) |
|---|---|---|---|
| StableLM 3B Base | 2.05% | 25.14% | 36.75% |
| Memphis-CoT 3B | 18.8% | 27.22% | 36.92% |
| Memphis-scribe 3B | 9.55% | 24.78% | |

\*5-shot, as performed automatically by LM Evaluation Harness bbh_cot_fewshot even with num_fewshot=0