Eval results
It's interesting to know how this model performs compared to others in terms of CoT and world-knowledge use (mainly due to the expanded FF layers).
BBH:

| Model | BBH |
|---|---|
| chargoddard/llama2-22b | 37.48 |
| vicuna-13B v1.3 | 35.78 |
| WizardLM-13B-V1.1 | 39.59 |
| llama-v1-13b | 36.52 |
Still running MMLU, but the subtask scores so far do seem similar to llama-v2-13b.
Updated MMLU scores:
| Model | MMLU |
|---|---|
| WizardLM-13B-V1.1 | 49.95 |
| vicuna-13B v1.3 | 52.1 |
| llama-v1-13b | 46.2 |
| chargoddard/llama2-22b | 53.60 |
| llama-v2-13b | 55.75 |
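In case anyone wants to reproduce or extend these comparisons, here's a rough sketch using EleutherAI's lm-evaluation-harness Python API. The task name, few-shot count, batch size, and result-dict layout below are assumptions for illustration (they vary across harness versions) and aren't necessarily the exact settings behind the numbers above:

```python
# Rough sketch: scoring a model on MMLU with lm-evaluation-harness.
# Task name, few-shot count, batch size, and result-dict layout are
# illustrative assumptions; adjust for your harness version.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",  # Hugging Face causal-LM backend
    model_args="pretrained=chargoddard/llama2-22b,dtype=bfloat16",
    tasks=["mmlu"],  # MMLU task group; BBH lives under a separate task name
    num_fewshot=5,   # MMLU is conventionally reported 5-shot
    batch_size=8,
)

# Per-subtask scores and the aggregate land under results["results"].
for task, metrics in results["results"].items():
    print(task, metrics)
```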
Thanks for running these! It's great to have actual benchmark scores. I'd call this a win - the fact that the score is only slightly below llama-v2-13b's is very promising, given that the amount of rehabilitation training done to this model was fairly minimal. I'm hopeful that it will shine with some actual training.