Alignment Issues
The standard LLM benchmarks are handy, but severely limited since they need to be easy to grade objectively (e.g. multiple-choice tests).
I put LLMs through a set of tricky and varied tests, including largely overlooked areas like pop culture, and grade them on degree of correctness. For example: which character did Meg Ryan play in Joe Versus the Volcano? She actually played 3 different characters, so an LLM gets 1 point for a right name, and up to 3 points if it returns all 3.
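For concreteness, here's a minimal sketch of that kind of partial-credit grading. The function name and substring-matching logic are illustrative assumptions, not the actual test harness (for the record, Meg Ryan's three roles were DeDe, Angelica, and Patricia):

```python
# Minimal sketch (hypothetical, not the actual harness) of partial-credit
# grading: award 1 point per expected answer found in the model's response.
def partial_credit(response: str, expected: list[str]) -> int:
    """Count how many expected answers appear in the response (case-insensitive)."""
    text = response.lower()
    return sum(1 for answer in expected if answer.lower() in text)

# Example with the Joe Versus the Volcano question:
expected = ["DeDe", "Angelica", "Patricia"]
print(partial_credit("Meg Ryan played DeDe and Patricia.", expected))  # -> 2 points
```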
Anyway, your AYT LLM performed notably better than any other Llama 2 13b in my testing, or Mistral 7b for that matter. However, this AYB version started failing randomly and miserably, and the failures seem to be caused primarily by misfiring alignment.
For example, questions about characters from movies or shows were treated as questions about real people simply because the prompt contained a celebrity's name as a hint for finding the correct answer. And sometimes it wasn't even the prompt: the LLM's own response would produce something mid-sentence that triggered an alignment reaction.
In short, AYT went from being by far the best Llama 2 13b LLM in my testing to falling below the pack with AYB.
Thanks for sharing your analysis. We are developing models with many issues in mind, but I think this leaderboard isn't the best way to measure an LLM's performance in practical terms.
So we're trying to build several models that meet both the leaderboard's standards and ordinary users' expectations.