MT-Bench Results
I was curious, so I gave it a spin with MT-Bench (HF transformers). It's on par with gpt-3.5 (but so is mlabonne/OmniBeagle-7B, so take these results for what they are).
########## First turn ##########
gpt-4 1 8.956250
Senku-70B-Full 1 8.387500
omnibeagle14-7b 1 8.325000
claude-v1 1 8.150000
orion-14b-chat 1 8.143750
gpt-3.5-turbo 1 8.075000
########## Second turn ##########
gpt-4 2 9.025000
claude-instant-v1 2 8.012658
gpt-3.5-turbo 2 7.812500
claude-v1 2 7.650000
Senku-70B-Full 2 7.600000
omnibeagle14-7b 2 7.587500
########## Average ##########
gpt-4 8.990625
Senku-70B-Full 7.993750
omnibeagle14-7b 7.956250
gpt-3.5-turbo 7.943750
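These per-turn and overall numbers are just the GPT-4 judge scores grouped by model and turn. A minimal sketch of recomputing them from FastChat's judgment file follows; the path and the model/turn/score field names are my assumptions about the llm_judge output format, so adjust to your run:

```python
import json
from collections import defaultdict

# Assumed output path and field names from fastchat/llm_judge single-answer grading.
JUDGMENT_FILE = "data/mt_bench/model_judgment/gpt-4_single.jsonl"

scores = defaultdict(list)  # (model, turn) -> list of judge scores
with open(JUDGMENT_FILE) as f:
    for line in f:
        rec = json.loads(line)
        if rec["score"] >= 0:  # skip records the judge failed to score (marked -1, as I recall)
            scores[(rec["model"], rec["turn"])].append(rec["score"])

def mean(xs):
    return sum(xs) / len(xs)

models = sorted({m for m, _ in scores})
for turn in (1, 2):
    print(f"########## Turn {turn} ##########")
    for m in models:
        print(f"{m}\t{mean(scores[(m, turn)]):.6f}")

print("########## Average ##########")
for m in models:
    both = scores[(m, 1)] + scores[(m, 2)]
    print(f"{m}\t{mean(both):.6f}")
```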
And here's the spider plot. The major outlier is that OmniBeagle-7B scores higher than Senku-70B or gpt-3.5 on "reasoning", which seems rather unlikely:
Still, it looks like MT-Bench is not too far out of line with EQ-Bench, and as a SlimOrca-only tune, it seems to point to MiQu still having a lot of untapped potential.
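(If anyone wants to redraw that kind of spider plot from the per-category averages, here's a rough matplotlib sketch; the category names are MT-Bench's eight, but the numbers below are placeholders, not results from this run:)

```python
# Rough sketch of a radar/spider plot over the eight MT-Bench categories.
import numpy as np
import matplotlib.pyplot as plt

categories = ["writing", "roleplay", "reasoning", "math",
              "coding", "extraction", "stem", "humanities"]
scores = {
    "model-a (placeholder)": [8.1, 8.3, 5.2, 4.0, 5.5, 7.9, 9.0, 9.4],
    "model-b (placeholder)": [7.5, 7.8, 6.0, 3.5, 5.0, 8.2, 8.8, 9.1],
}

angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False).tolist()
angles += angles[:1]  # repeat the first angle to close the polygon

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for name, vals in scores.items():
    closed = vals + vals[:1]
    ax.plot(angles, closed, label=name)
    ax.fill(angles, closed, alpha=0.1)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(categories)
ax.set_ylim(0, 10)
ax.legend(loc="upper right", bbox_to_anchor=(1.3, 1.1))
plt.tight_layout()
plt.show()
```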
Very interesting, although it seems something odd is happening with reasoning (or prompting in general). I may need to do a bit more tweaking on that. I imagine the 7B model that is ahead of GPT 3.5 is probably overfit somehow.
https://huggingface.co/ShinojiResearch/Senku-70B-Full/discussions/3
I have a fair number of MT-Bench runs and will upload the .jsonl outputs soon so you can review them if you want. I used llama2 formatting btw (the chat_template enforces a format that breaks). I include a system prompt of "You are a helpful assistant."
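Roughly what I mean by llama2 formatting, hand-rolled rather than taken from the repo's chat_template (a sketch; exact BOS/whitespace handling will depend on the tokenizer):

```python
SYSTEM = "You are a helpful assistant."

def llama2_prompt(turns):
    """turns: list of (user, assistant-or-None) pairs for one MT-Bench conversation."""
    prompt = ""
    for i, (user, assistant) in enumerate(turns):
        if i == 0:
            # the system prompt is folded into the first user turn, llama-2 style
            user = f"<<SYS>>\n{SYSTEM}\n<</SYS>>\n\n{user}"
        prompt += f"<s>[INST] {user} [/INST]"
        if assistant is not None:
            prompt += f" {assistant} </s>"
    return prompt

# Second-turn prompt: first exchange already in context, second question open.
print(llama2_prompt([("Write a short poem about benchmarks.", "Sure! ..."),
                     ("Now rewrite it as a haiku.", None)]))
```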
I'd agree that the 7Bs are likely overfitting (the chances that a 7B is actually smarter than a 70B are, I'd say, about 0.00%). While mlabonne's merges primarily target the Nous Suite, there are others that purposely train for MT-Bench scores (which, of course, makes it rather useless for comparison).
I do think it should be possible to improve "reasoning" style responses fairly easily while making sure that MT-Bench questions remain firmly out of distribution (so that it still remains a useful yardstick).
@leonardlin great job as always Leonard!
I think the MT-Bench results show that there is more untapped potential here in terms of how humans perceive the model.
Relying solely on MT-Bench of course wouldn't be helpful, but if it falls short compared to some other model, it does mean there are areas for improvement in being a better conversationalist.
Exactly. I am actively training another version that I think will fix the prompting issues (the original axolotl config for Senku V1 used chatml, which I think conflicts with the mistral format). Some other people are also working on similar finetunes.
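For anyone following along, the conflict is roughly between these two wrappings (illustrative strings only): a model tuned on one and prompted with the other sees delimiters it never learned to stop on.

```python
# Illustration only: the two prompt conventions that appear to be in conflict.
system, user = "You are a helpful assistant.", "What is 2 + 2?"

chatml = (
    f"<|im_start|>system\n{system}<|im_end|>\n"
    f"<|im_start|>user\n{user}<|im_end|>\n"
    f"<|im_start|>assistant\n"
)

mistral = f"<s>[INST] {user} [/INST]"  # Mistral's template has no separate system slot

print(chatml)
print(mistral)
```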
can't wait!