Likely FLORES contamination for Claude 3 Opus

#29

What are you reporting:

  • Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
  • Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)

Evaluation dataset(s): Name(s) of the evaluation dataset(s). If available in the HuggingFace Hub please write the path (e.g. uonlp/CulturaX), otherwise provide a link to a paper, GitHub or dataset-card.

  • facebook/flores

Contaminated model(s): Name of the model(s) (if any) that have been contaminated with the evaluation dataset. If available in the HuggingFace Hub please list the corresponding paths (e.g. allenai/OLMo-7B).

  • Claude 3 Opus

Contaminated corpora: Name of the corpora used to pretrain models (if any) that have been contaminated with the evaluation dataset. If available in the HuggingFace Hub please write the path (e.g. CohereForAI/aya_dataset).

Contaminated split(s): If the dataset has Train, Development and/or Test splits please report the contaminated split(s). You can report a percentage of the dataset contaminated; if the entire dataset is compromised, report 100%.
Unclear.

You may also report instances where there is no contamination. In such cases, follow the previous instructions but report a contamination level of 0%.

Briefly describe your method to detect data contamination

  • Data-based approach
  • Model-based approach

From https://arxiv.org/abs/2404.13813:

Claude shows signs of data contamination on the FLORES-200 dataset in both translation directions. It is likely that Claude has seen the FLORES data during its training, but it remains unclear whether this measurably affects Claude's performance on the benchmark. We investigate this question by comparing results on FLORES versus BBC News. Because the FLORES and BBC datasets may vary in difficulty and quality, we cannot directly compare the raw chrF++ scores of each model across the datasets. However, we expect that a model has no dataset contamination relative to another model if the relative performance between the two models is similar on the dataset in question and on unseen data. In Figure 1, we visualize this difference. In the xxx->eng direction, we observe that Google and NLLB have very similar relative performance across the FLORES and BBC datasets, indicating little-to-no contamination of either dataset for either model. However, Claude's performance on FLORES increases substantially relative to either Google or NLLB when compared with BBC, which suggests that Claude has overfit the FLORES dataset, with its performance overstated by 1-2 percentage points. This analysis calls into question the validity of evaluating Claude on FLORES.

TL;DR: Compared to NLLB, which is not trained on FLORES data, Claude shows much worse relative performance on the (newly introduced, unseen) BBC MT test dataset, indicating that Claude has overfit FLORES.
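
For illustration, here is a minimal Python sketch of the relative-performance check described above. The chrF++ numbers are placeholders, not values from the paper, and the system labels are only for illustration; the idea is that an uncontaminated model's advantage over a FLORES-free baseline such as NLLB should be roughly the same on FLORES and on the unseen BBC News test set.

# Sketch of the relative-performance contamination check (placeholder scores).
# chrF++ is corpus-level on a 0-100 scale; real scores would come from e.g. sacrebleu.
chrf_scores = {
    ("claude-3-opus", "flores"): 55.0,  # hypothetical
    ("claude-3-opus", "bbc"): 50.5,     # hypothetical
    ("nllb-200", "flores"): 52.0,       # hypothetical
    ("nllb-200", "bbc"): 49.5,          # hypothetical
}

def relative_gain(system: str, baseline: str) -> float:
    """Extra advantage of `system` over `baseline` on FLORES compared to unseen BBC data.

    A value near 0 suggests no contamination relative to the baseline;
    a clearly positive value suggests the system has overfit FLORES.
    """
    flores_gap = chrf_scores[(system, "flores")] - chrf_scores[(baseline, "flores")]
    bbc_gap = chrf_scores[(system, "bbc")] - chrf_scores[(baseline, "bbc")]
    return flores_gap - bbc_gap

print(f"FLORES advantage not explained by unseen-data performance: "
      f"{relative_gain('claude-3-opus', 'nllb-200'):.1f} chrF++ points")

Under these placeholder numbers the unexplained gap is 2.0 points, which is the kind of 1-2 point inflation the paper attributes to FLORES contamination.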

Citation

Is there a paper that reports the data contamination or describes the method used to detect data contamination?

URL: https://arxiv.org/abs/2404.13813
Citation:

@article{enis2024llm,
  title={From LLM to NMT: Advancing Low-Resource Machine Translation with Claude},
  author={Enis, Maxim and Hopkins, Mark},
  journal={arXiv preprint arXiv:2404.13813},
  year={2024}
}

Important! If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.

Workshop on Data Contamination org

Hi @davidstap !

Thank you for your contribution. I will merge it :D

Iker changed pull request status to merged
