Add Reports Based on "Llemma: An Open Language Model For Mathematics"
What are you reporting:
- Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
- Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)
Evaluation dataset(s):
hendrycks/competition_math
gsm8k
Contaminated model(s):
EleutherAI/llemma_7b
EleutherAI/llemma_34b
Contaminated corpora:
EleutherAI/proof-pile-2
Contaminated split(s):
hendrycks/competition_math: 7.72 (%) of test split
gsm8k: 0.15 (%) of test split
Briefly describe your method to detect data contamination
- Data-based approach
- Model-based approach
Description of your method, 3-4 sentences. Evidence of data contamination (Read below):
Data-based approaches
According to Section 3.5 of Azerbayev et al. (2024), the authors inspect whether any 30-gram in a test sequence (either an input problem or an output solution) occurs in any document of the pre-training corpus Proof-Pile-2, which they use to train the LLEMMA models. Based on the exact numbers reported in the left part of Table 6, we can estimate the worst case, which assumes that the matched input problems and the matched output solutions belong to different test instances. Under that assumption, the percentage of the MATH test split contaminated would be (348 + 34 + 3 + 1) / 5000 = 386 / 5000 = 7.72 (%), and the percentage of the GSM8k test split contaminated would be (2 + 0 + 0 + 0) / 1319 = 2 / 1319 = 0.15 (%).
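
For readers who want to retrace the estimate, here is a minimal sketch of the two steps described above: an n-gram overlap check and the worst-case percentage. It is not the authors' implementation. The function names (`ngrams`, `is_contaminated`, `worst_case_pct`) are hypothetical, and whitespace tokenization is an assumption made for illustration only. The hit counts and test-set sizes are the ones quoted above from Table 6.

```python
def ngrams(tokens, n=30):
    """Return the set of all contiguous n-grams in a list of tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(test_text, corpus_docs, n=30):
    """True if any n-gram of test_text also occurs in some corpus document.

    Assumption: tokens are obtained by whitespace splitting; the paper's
    actual tokenization may differ.
    """
    test_grams = ngrams(test_text.split(), n)
    return any(test_grams & ngrams(doc.split(), n) for doc in corpus_docs)


def worst_case_pct(hit_counts, test_size):
    """Worst-case contamination rate in percent.

    Sums all hit counts, i.e. assumes every hit falls on a distinct
    test instance, and divides by the size of the test split.
    """
    return 100 * sum(hit_counts) / test_size


# Hit counts and test-split sizes as quoted in this report.
math_pct = worst_case_pct([348, 34, 3, 1], 5000)
gsm8k_pct = worst_case_pct([2, 0, 0, 0], 1319)

print(f"MATH:  {math_pct:.2f}%")   # MATH:  7.72%
print(f"GSM8k: {gsm8k_pct:.2f}%")  # GSM8k: 0.15%
```

Because the counts are summed without deduplication, the result is an upper bound: if a problem and its solution both match, that test instance is counted twice.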
Citation
URL:
https://openreview.net/pdf?id=4WnqRR915j
Citation:
@inproceedings{
azerbayev2024llemma,
title={Llemma: An Open Language Model for Mathematics},
author={Zhangir Azerbayev and Hailey Schoelkopf and Keiran Paster and Marco Dos Santos and Stephen Marcus McAleer and Albert Q. Jiang and Jia Deng and Stella Biderman and Sean Welleck},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=4WnqRR915j}
}
Important! If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.
1.
- Full name: Wei-Lin Chen
- Institution: National Taiwan University, University of Virginia
- Email: [email protected]
2.
- Full name: Yu-Min Tseng
- Institution: National Taiwan University
- Email: [email protected]