Request for reproduction
Thanks for sharing the dataset and pre-trained models!
I am trying to reproduce the results using the released parameters of SteamSHP-flan-t5-large. I followed the instructions in the model card and the description of the SHP dataset, but I cannot reach the reported test accuracy. For example, on the askscience subreddit I observe a test accuracy of 64.50%, while the reported one is 69.26%. I think several (unreleased) details are required for exact reproduction. For example, you mentioned that 18621 out of 20753 test examples are used for evaluation, and that only 5 examples per unique post_id are used for training.
Could you provide those details, including the exact subset indices of the train/test sets?
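To make the question concrete, here is a minimal sketch of one way a "5 examples per unique post_id" cap could be applied. It is only an assumption for illustration: the helper name `cap_per_post` is hypothetical, and keeping the first five examples in input order is a guess, since the actual selection rule (random, score-based, or otherwise) is exactly the unreleased detail being asked about.

```python
from collections import defaultdict


def cap_per_post(examples, max_per_post=5):
    """Keep at most `max_per_post` examples per unique post_id, in input order.

    Assumption: first-come selection. The real rule may differ.
    """
    kept = []
    counts = defaultdict(int)
    for ex in examples:
        pid = ex["post_id"]
        if counts[pid] < max_per_post:
            kept.append(ex)
            counts[pid] += 1
    return kept


# Toy data: 7 examples for post "a", 1 example for post "b".
toy = [{"post_id": "a", "idx": i} for i in range(7)] + [{"post_id": "b", "idx": 0}]
subset = cap_per_post(toy)
print(len(subset))
```

A different selection order would yield a different training subset, which is why the exact indices matter for reproduction.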
Thanks for your interest, @jpark1! Here's the code for training SteamSHP models: https://github.com/kawine/steamshp
- First run 'python format.py' to generate the train/validation/test files across all domains in the data directory. There should be roughly 125K training examples and 18621 test examples across all domains (Anthropic included).
- Then modify 'finetune_t5_parallel.sh' as needed to train a flan-t5-large model.
- Use preference_model_dist.eval to evaluate the model in the standard preference-model way. Use preference_model_dist.eval_regress to evaluate its reward-model-style usage.
You should be able to get the reported accuracies.
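When comparing against the reported per-subreddit numbers, it can help to tally accuracy per domain from your own predictions. The snippet below is a rough sketch and not part of the linked repository: the function `accuracy_by_domain` and the record keys `domain`, `label`, and `prediction` are assumed names chosen for illustration, and the toy records are made up.

```python
from collections import defaultdict


def accuracy_by_domain(records):
    """Return {domain: fraction of records where prediction == label}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["domain"]] += 1
        correct[r["domain"]] += int(r["prediction"] == r["label"])
    return {d: correct[d] / total[d] for d in total}


# Toy records (hypothetical): 3 of 4 correct in one domain, 1 of 2 in another.
records = [
    {"domain": "askscience_test", "label": 1, "prediction": 1},
    {"domain": "askscience_test", "label": 0, "prediction": 0},
    {"domain": "askscience_test", "label": 1, "prediction": 0},
    {"domain": "askscience_test", "label": 0, "prediction": 0},
    {"domain": "askhr_test", "label": 1, "prediction": 1},
    {"domain": "askhr_test", "label": 0, "prediction": 1},
]
per_domain = accuracy_by_domain(records)
for domain, acc in sorted(per_domain.items()):
    print(f"{domain}: {acc:.2%}")
```

Also check that the number of evaluated examples matches the 18621 mentioned above, since a different test subset alone could shift the accuracy.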