---
license: cc-by-4.0
library_name: transformers
tags:
- supertrainer2000
- not-for-all-audiences
- writing
- roleplay
datasets:
- euclaise/TinyCoT
- euclaise/mathoverflow-accepted
- euclaise/reddit-instruct-curated
- euclaise/WritingPrompts_curated
- sablo/oasst2_curated
- euclaise/mathqa_programs
- BEE-spoke-data/coedit-reworded-deduped
- pszemraj/booksum-short
- euclaise/reddit-instruct
- euclaise/SciCoT
- euirim/goodwiki
- neulab/conala
- squad
- ropes
- euclaise/logician
- chargoddard/rpguild
- lemonilia/LimaRP
base_model:
- euclaise/Memphis-CoT-3B
language:
- en
---

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64137e2150358a805203cbac/wEaKoLeJzidUdTWwQmA6k.png)

Memphis-scribe 3B is a finetune of [Memphis-CoT 3B](https://huggingface.co/euclaise/Memphis-CoT-3B) on more creative data; Memphis-CoT is itself a finetune of [StableLM 3B 4e1t](https://huggingface.co/stabilityai/stablelm-3b-4e1t/).

It is trained further on TinyCoT, along with:

- 5000 comments from [reddit-instruct-curated](https://hf.co/euclaise/reddit-instruct-curated)
- 20000 comments from [writingprompts-curated](https://hf.co/euclaise/writingprompts-curated)
- 2000 examples of [converting MathQA problems to Python snippets](https://hf.co/euclaise/mathqa_programs)
- 2000 [shorter booksum examples (both chapter->summary and summary->chapter tasks)](https://huggingface.co/datasets/pszemraj/booksum-short)
- 2000 examples from [mathoverflow-accepted](https://hf.co/euclaise/mathoverflow-accepted), limited to comments with >10 upvotes
- 2000 examples from [coedit-reworded-deduped](https://huggingface.co/datasets/BEE-spoke-data/coedit-reworded-deduped)
- 500 examples from [SQuAD](https://huggingface.co/datasets/squad), for generating QA pairs given a context
- 500 examples from [ROPES](https://huggingface.co/datasets/ropes), for generating scenario+QA triplets given a context
- [conala](https://huggingface.co/datasets/neulab/conala)
- 500 examples from [logician](https://huggingface.co/datasets/euclaise/logician)
- 500 examples from [goodwiki](https://huggingface.co/datasets/euirim/goodwiki), for generating an article given its title and description
- 2000 examples from [rpguild](https://huggingface.co/datasets/chargoddard/rpguild)
- a [curated subset of oasst2](https://huggingface.co/datasets/sablo/oasst2_curated)
- [LimaRP](https://huggingface.co/datasets/lemonilia/LimaRP)

## Training procedure

I started from [Memphis-CoT 3B](https://huggingface.co/euclaise/Memphis-CoT-3B), which used a novel iterative contrastive finetuning procedure to improve reasoning ability.

I first generated completions just as in each of the Memphis-CoT cycles. Then, for each example in the dataset, I sampled one correct and one incorrect completion. I applied the same ranking loss over these completions (with a weight of 0.2), but applied the cross-entropy loss over the example tokens rather than the completion tokens.

Finally, I averaged the resulting model with the Memphis-CoT weights from before this additional training, again using spherical linear interpolation, this time with a weight of 0.8 (a sketch of this merge step appears below, after the prompt formats).

## Prompt formats

```
### User:
[insert instruction here]

### Assistant:
[insert response here]

### User:
...
```

Alternatively:

```
### System:
[Insert system message here, focused on roleplay]

### User:
[insert instruction here]

### Assistant:
[insert response here]

### User:
...
```
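For illustration, here is a minimal sketch of using the first format with 🤗 Transformers. The instruction text and the sampling parameters are assumptions for demonstration, not recommendations from this card:

```python
# Minimal generation sketch using the "### User: / ### Assistant:" format.
# The prompt content and sampling parameters below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "euclaise/Memphis-scribe-3B"
tokenizer = AutoTokenizer.from_pretrained(repo)
# Older transformers versions may need trust_remote_code=True for StableLM-based models
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16)

prompt = "### User:\nWrite a two-sentence story about a lighthouse keeper.\n\n### Assistant:\n"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.8,  # assumption: a common default for creative tasks
)
# Strip the prompt tokens before decoding, so only the response is printed
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```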
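Returning to the training procedure above: the final merge step can be sketched as per-tensor spherical linear interpolation (slerp) between the pre-finetuning Memphis-CoT weights and the further-trained weights. This is a hypothetical reconstruction, not the actual training code; the state-dict names are invented, and reading the 0.8 as the interpolation factor toward the finetuned model is an assumption (the direction is not specified above):

```python
# Illustrative per-tensor slerp, treating each weight tensor as a flat vector.
import torch

def slerp(a: torch.Tensor, b: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    """Spherically interpolate from tensor a (t=0) to tensor b (t=1)."""
    a_f, b_f = a.flatten().float(), b.flatten().float()
    a_n = a_f / (a_f.norm() + eps)
    b_n = b_f / (b_f.norm() + eps)
    # Angle between the two weight vectors
    omega = torch.arccos(torch.clamp(torch.dot(a_n, b_n), -1.0, 1.0))
    if omega.item() < eps:
        out = torch.lerp(a_f, b_f, t)  # nearly parallel: fall back to linear interpolation
    else:
        so = torch.sin(omega)
        out = (torch.sin((1.0 - t) * omega) / so) * a_f + (torch.sin(t * omega) / so) * b_f
    return out.reshape(a.shape).to(a.dtype)

# Hypothetical usage over two state dicts (names assumed):
# merged = {k: slerp(memphis_cot_state[k], finetuned_state[k], t=0.8)
#           for k in finetuned_state}
```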
## Benchmarks

This model performs significantly worse than Memphis-CoT on benchmarks, despite being better suited to chat and creative writing tasks. This is an expected tradeoff, especially for small models.

| Model | GSM8K (5-shot) | AGIEval (English/Nous subset, acc_norm) | BIG Bench Hard (CoT, few-shot*) |
|:------|:---------------|:----------------------------------------|:--------------------------------|
| [StableLM 3B Base](https://hf.co/stabilityai/stablelm-3b-4e1t) | 2.05% | 25.14% | 36.75% |
| [Memphis-CoT 3B](https://hf.co/euclaise/Memphis-CoT-3B) | 18.8% | 27.22% | 36.92% |
| [Memphis-scribe 3B](https://hf.co/euclaise/Memphis-scribe-3B) | 9.55% | 24.78% | |

\*5-shot, as performed automatically by the LM Evaluation Harness's `bbh_cot_fewshot` task, even with `num_fewshot=0`.
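These numbers can in principle be re-run with EleutherAI's LM Evaluation Harness. Below is a sketch using its Python API; the exact task names and API details vary across harness versions, so treat this as an assumption rather than the exact command used for this card:

```python
# Sketch of re-running the GSM8K evaluation with lm-evaluation-harness
# (pip install lm-eval). Task names may differ across harness versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=euclaise/Memphis-scribe-3B",
    tasks=["gsm8k"],  # BBH uses its own task group, e.g. bbh_cot_fewshot
    num_fewshot=5,    # GSM8K is reported 5-shot above
)
print(results["results"])
```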