Is Meditron-7b better than LLaMA2-7b? #9
opened by sean0042
Hello,
In general, it's very difficult for us to analyze the results given the limited information provided about the evaluation settings:
1. Did you fine-tune the models, or are you using zero-shot prompting with the base models?
2. What kind of inference mode are you using?
3. How are you parsing the answers?
4. Are you using in-context learning?
5. If 4 is true, are you running multiple runs with in-context examples sampled under different random seeds? For example, PubMedQA has very large variance (15–50) across different in-context examples.
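To make the seed-variance point above concrete, here is a minimal, hypothetical sketch (toy data and a toy "model", not our actual evaluation harness): the few-shot prompt is rebuilt from examples sampled under each seed, and accuracy is reported as a mean with its spread across seeds.

```python
import random
import statistics

def sample_few_shot(pool, k, seed):
    """Sample k in-context examples from the training pool under a fixed seed."""
    rng = random.Random(seed)
    return rng.sample(pool, k)

def evaluate(model_fn, pool, test_set, k, seeds):
    """Run one full evaluation per seed; return the per-seed accuracies."""
    accuracies = []
    for seed in seeds:
        shots = sample_few_shot(pool, k, seed)
        correct = sum(model_fn(shots, question) == answer
                      for question, answer in test_set)
        accuracies.append(correct / len(test_set))
    return accuracies

if __name__ == "__main__":
    # Synthetic QA pool and test set, purely for illustration.
    pool = [(f"q{i}", f"a{i % 4}") for i in range(100)]
    test_set = [(f"t{i}", f"a{i % 4}") for i in range(50)]

    # Toy "model" whose answer depends on which shots it saw,
    # mimicking the prompt sensitivity described above.
    def toy_model(shots, question):
        return f"a{(len(question) + sum(len(q) for q, _ in shots)) % 4}"

    accs = evaluate(toy_model, pool, test_set, k=3, seeds=range(5))
    print(f"accuracy: {statistics.mean(accs):.3f} "
          f"+/- {statistics.pstdev(accs):.3f}")
```

Reporting only a single seed hides this spread, which is why results from one in-context sample are hard to compare across models.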
We refer you to the in-context learning results reported in our paper:
As you can see, Meditron-7B underperforms Llama-2-7b on MedQA-5 and MedMCQA, while the two models perform similarly on MMLU-Medical and MedQA-4. It is only after fine-tuning on these datasets that we observe a large performance gain.