TruthfulQA score look odd, no ?
There is really a big jump for this score vs all other which do not improve so much. Any idea ?
@vince62s A TruthfulQA bump of 10 points isn't that unusual.
However, due to the nature of the TruthfulQA test gaining any more than about 5 points over the foundational model means one of two things. Either the addition of test contamination or an increase in truth denialism. That is, avoiding telling falsehoods like the earth is flat by being overly cynical, and consequently, also denying countless true things are true, such as Tom Hanks wasn't in the movie Forrest Gump. In this case, I think it's primarily the later.
I tested all the top Phi-2's including this, super, DPO... and I suggest you use the DPO version.
Phi-2 is just too small, and frankly stupid, to handle the amount of data they used. To their credit they filtered the data as much as possible (e.g. de-duped), but it wasn't enough. It not only keeps saying true things are false to avoid saying a few false things are true (boosting TruthfulQA), but it also keeps rambling on and on, repeats itself, goes off on weird tangents, start moralizing (e.g. respect celebrity privacy) and so on. In short, orange2 keeps going off the rails.
I hope someone takes the approach Einstein v4 took with Mistral and applies it to Phi-2. Namely, a diverse group of small unaligned training data sets for 1.5 epochs, just to nudge Phi-2 in the right direction without having the fine-tuning take over and ramble on, enact truth denialism, go on tangents, moralize... like orange2 does.
Good question @vince62s and very interesting comments @Phil337.
The phi-2-orange range of finetunes are an ongoing experiment to see how much a 2.7b model can generalize. We know it's too small to encode as much factual information as larger models and that it does work well when finetuned on a very specific task. But what are the limits as a general model? The different versions of phi-2-orange all use the same broad set of training data, but try out different training approaches in each version in order to compare and contrast.
For this v2, it is indeed very chatty without any prompting. For the TruthfulQA result, the jump was quite high - maybe attributable to the tendency for lengthy output? We know that prompting is very important for the model, and looking at the TruthfulQA eval source it shows 6 Q&A example pairs in the prompt before each eval question (even though it's normally listed as a zero-shot task).
The prompt could also be a factor in the experience you're seeing with this and other phi-2 finetunes. For example, taking your Tom Hanks question, without any prompt, phi-2-orange will indeed ramble on and start hallucinating - even attributing Robert Redford as an actor in Forrest Gump (and it's sequel?!?) in an output I just ran.
With a system prompt in ChatML format (e.g. asking the model to be short and direct), it gives a more reasonable output:
<|im_start|>system
You are a helpful assistant. Be short and direct in your answers.<|im_end|>
<|im_start|>user
Was Tom Hanks in the movie Forrest Gump? If so, who did he play and give details of the plot.<|im_end|>
<|im_start|>assistant
gives the output:
Yes, Tom Hanks starred in the movie Forrest Gump. He played the role of Forrest Gump, a kind-hearted, simple-minded
man with a low IQ, who unwittingly becomes involved in significant historical events while forming a deep bond with
his childhood friend Jenny. The plot revolves around Forrest's life journey, filled with love, friendship, and unexpected
adventures, as he meets various people and experiences pivotal moments in history. Tom Hanks' performance in this
film earned him an Academy Award for Best Actor.<|im_end|>
Future versions will potentially look at other model structures alongside different training approaches, such as lora-based MoE, to see if it can better generalize / be more succinct / etc. but still remain within a small model size. Feedback always appreciated on those when released!