Evaluating the same model on different splits of the same dataset creates ambiguous evaluation results

#17
by MoritzLaurer - opened

I've just auto-evaluated one of my models and it worked well; it returned essentially the same results as my private evaluation.
Issue: I evaluated on ANLI, which has three different test splits (test_r1, test_r2, test_r3). In the pull request, the evaluation on test_r3 is labeled only "anli", and the evaluation on test_r2 is also labeled just "anli". This means it is now unclear which split the numbers refer to; see here: https://huggingface.co/MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli
=> Could you update the auto-generated pull request so that it includes the split in the dataset title?
(The same issue will occur with multi_nli, which also has two different test splits.)
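For context, here is a minimal sketch of how the model card's `model-index` YAML could disambiguate the splits, assuming the `split` field of the dataset spec is used and the split is also appended to the display name. The metric values below are placeholders, not the actual PR contents:

```yaml
# Illustrative model-index metadata with explicit splits.
# Accuracy values are placeholders, not real results.
model-index:
  - name: DeBERTa-v3-base-mnli-fever-anli
    results:
      - task:
          type: text-classification
          name: Natural Language Inference
        dataset:
          type: anli
          name: ANLI (test_r2)  # split included in the display name
          split: test_r2        # machine-readable split field
        metrics:
          - type: accuracy
            value: 0.0          # placeholder
      - task:
          type: text-classification
          name: Natural Language Inference
        dataset:
          type: anli
          name: ANLI (test_r3)
          split: test_r3
        metrics:
          - type: accuracy
            value: 0.0          # placeholder
```

With separate `split` fields, two results on the same dataset no longer collapse into a single ambiguous "anli" entry on the model page.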

Evaluation on the Hub

Great suggestion @MoritzLaurer - we'll certainly add it!
