|
GUIDELINES = """ |
|
# Contribution Guidelines |
|
|
|
The Data Contamination Database is a community-driven project and we welcome contributions from everyone. This effort is part of [The 1st Workshop on Data Contamination (CONDA)](https://conda-workshop.github.io/) that will be held at ACL 2024. Please check the workshop website for more information. |
|
|
|
|
|
We are organizing a community effort on centralized data contamination evidence collection. While the problem of data contamination is prevalent and serious, the breadth and depth of this contamination are still largely unknown. The concrete evidence of contamination is scattered across papers, blog posts, and social media, and it is suspected that the true scope of data contamination in NLP is significantly larger than reported. With this shared task we aim to provide a structured, centralized platform for contamination evidence collection to help the community understand the extent of the problem and to help researchers avoid repeating the same mistakes. |
|
|
|
If you wish to contribute to the project by reporting a data contamination case, please open a pull request in the [✋Community Tab](https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Database/discussions). Your [pull request](https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Database/discussions?new_pr=true) should edit the [contamination_report.csv](https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Database/blob/main/contamination_report.csv) file and add a new row with the details of the contamination case, or evidence of lack of contamination. Please edit the following template with the details of the contamination case. Pull Requests that do not follow the template won't be accepted. |
|
|
|
As a companion to the contamination evidence platform, we will produce a paper that will provide a summary and overview of the evidence collected in the shared task. The participants who contribute to the shared task will be listed as co-authors in the paper. If you have any questions, please contact us at [email protected] or open a discussion in the space itself. |
|
|
|
# Template for reporting data contamination |
|
|
|
```markdown |
|
## What are you reporting: |
|
- [ ] Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile) |
|
- [ ] Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI) |
|
|
|
**Evaluation dataset(s)**: Name(s) of the evaluation dataset(s). If available in the HuggingFace Hub please write the path (e.g. `uonlp/CulturaX`), otherwise provide a link to a paper, GitHub repository, or dataset card.
|
|
|
**Contaminated model(s)**: Name of the model(s) (if any) that have been contaminated with the evaluation dataset. If available in the HuggingFace Hub please list the corresponding paths (e.g. `allenai/OLMo-7B`). |
|
|
|
**Contaminated corpora**: Name of the corpora used to pretrain models (if any) that have been contaminated with the evaluation dataset. If available in the HuggingFace Hub please write the path (e.g. `CohereForAI/aya_dataset`).
|
|
|
**Contaminated split(s)**: If the dataset has Train, Development and/or Test splits please report the contaminated split(s). You can report a percentage of the dataset contaminated; if the entire dataset is compromised, report 100%. |
|
|
|
> You may also report instances where there is no contamination. In such cases, follow the previous instructions but report a contamination level of 0%. |
|
|
|
## Briefly describe your method to detect data contamination |
|
|
|
- [ ] Data-based approach |
|
- [ ] Model-based approach |
|
|
|
Describe your method in 3-4 sentences and provide evidence of data contamination (read below):
|
|
|
#### Data-based approaches |
|
Data-based approaches identify evidence of data contamination in a pre-training corpus by directly examining the dataset for instances of the evaluation data. This method involves algorithmically searching through a large pre-training dataset to find occurrences of the evaluation data. You should provide evidence of data contamination in the form: "dataset X appears in line N of corpus Y," "dataset X appears N times in corpus Y," or "N examples from dataset X appear in corpus Y." |
|
|
|
#### Model-based approaches |
|
|
|
Model-based approaches, on the other hand, utilize heuristic algorithms to infer the presence of data contamination in a pre-trained model. These methods do not directly analyze the data but instead assess the model's behavior to predict data contamination. Examples include prompting the model to reproduce elements of an evaluation dataset to demonstrate memorization (e.g. https://hitz-zentroa.github.io/lm-contamination/blog/) or using perplexity measures to estimate data contamination. You should provide evidence of data contamination in the form of evaluation results of the algorithm from research papers, screenshots of model outputs that demonstrate memorization of a pre-training dataset, or any other form of evaluation that substantiates the method's effectiveness in detecting data contamination. You may also provide a confidence score for your predictions.
|
|
|
## Citation |
|
|
|
Is there a paper that reports the data contamination or describes the method used to detect data contamination? |
|
|
|
URL: `https://aclanthology.org/2023.findings-emnlp.722/` |
|
Citation: `@inproceedings{...` |
|
|
|
|
|
*Important!* If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request. |
|
- Full name: |
|
- Institution: |
|
- Email: |
|
``` |
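To illustrate the kind of data-based check described in the template, here is a minimal sketch in Python. The normalization rule, the substring-matching strategy, and the toy strings are all illustrative assumptions; real contamination searches over large corpora use more scalable matching (e.g. n-gram indexing), but the reported evidence takes the same form ("N examples from dataset X appear in corpus Y").

```python
import re

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting
    # differences do not hide a verbatim match.
    return re.sub(r"\s+", " ", text.lower()).strip()

def count_contaminated(eval_examples, corpus_lines):
    # Count how many evaluation examples occur verbatim
    # (after normalization) anywhere in the corpus.
    corpus = [normalize(line) for line in corpus_lines]
    hits = 0
    for example in eval_examples:
        needle = normalize(example)
        if any(needle in line for line in corpus):
            hits += 1
    return hits

# Toy data: one of the two evaluation examples appears in the corpus,
# so the report would read "1 example from dataset X appears in corpus Y".
eval_examples = ["The cat sat on the mat.", "An unseen question?"]
corpus_lines = ["...some web text... The cat sat   on the mat. ...more..."]
print(count_contaminated(eval_examples, corpus_lines))  # 1
```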
|
--- |
|
|
|
### How to update the contamination_report.csv file |
|
|
|
The [contamination_report.csv](https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Database/blob/main/contamination_report.csv) file is a CSV file that uses `;` as the field delimiter. You will need to update the following columns:
|
- **Evaluation Dataset**: Name of the evaluation dataset that has (not) been compromised. If available in the HuggingFace Hub please write the path (e.g. `uonlp/CulturaX`), otherwise provide the name of the dataset.
|
- **Subset**: Many HuggingFace datasets contain several subsets or configurations within a single dataset. Use this field to specify the particular subset being reported, for example the `qnli` subset of `glue`.
|
- **Contaminated Source**: Name of the model that has been trained on the evaluation dataset, or name of the pre-training corpus that contains the evaluation dataset. If available in the HuggingFace Hub please write the path (e.g. `allenai/OLMo-7B`), otherwise provide the name of the model/dataset.
|
- **Train split**: Percentage of the train split contaminated. 0 means no contamination. 100 means that the dataset has been fully compromised. If the dataset doesn't have splits, treat the full dataset as a single train or test split.
|
- **Development split**: Percentage of the development split contaminated. 0 means no contamination. 100 means that the dataset has been fully compromised. |
|
- **Test split**: Percentage of the test split contaminated. 0 means no contamination. 100 means that the dataset has been fully compromised. If the dataset doesn't have splits, treat the full dataset as a single train or test split.
|
- **Approach**: data-based or model-based approach. See above for more information. |
|
- **Reference**: If there is a paper or any other resource describing how you detected this contamination case, provide the URL.
|
- **PR Link**: Leave it blank, we will update it after you create the Pull Request. |
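As a sanity check before editing the file, a row with the `;` delimiter can be produced with Python's standard `csv` module. The values below are illustrative placeholders only, not a real contamination report; the dataset and model paths are reused from the examples above.

```python
import csv
import io

# One row following the column order described above. Values are
# placeholders for illustration, not a real contamination case.
row = [
    "glue",                       # Evaluation Dataset
    "qnli",                       # Subset
    "allenai/OLMo-7B",            # Contaminated Source (example path only)
    "100",                        # Train split (%)
    "0",                          # Development split (%)
    "0",                          # Test split (%)
    "data-based",                 # Approach
    "https://example.org/paper",  # Reference (placeholder URL)
    "",                           # PR Link (left blank)
]

buf = io.StringIO()
csv.writer(buf, delimiter=";").writerow(row)
print(buf.getvalue().strip())
# glue;qnli;allenai/OLMo-7B;100;0;0;data-based;https://example.org/paper;
```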
|
""".strip() |
|
|
|
|
|
PANEL_MARKDOWN = """ |
|
# Data Contamination Database |
|
The Data Contamination Database is a community-driven project and we welcome contributions from everyone. This effort is part of [The 1st Workshop on Data Contamination (CONDA)](https://conda-workshop.github.io/) that will be held at ACL 2024. Please check the workshop website for more information. |
|
|
|
We are organizing a community effort on centralized data contamination evidence collection. While the problem of data contamination is prevalent and serious, the breadth and depth of this contamination are still largely unknown. The concrete evidence of contamination is scattered across papers, blog posts, and social media, and it is suspected that the true scope of data contamination in NLP is significantly larger than reported. With this shared task we aim to provide a structured, centralized platform for contamination evidence collection to help the community understand the extent of the problem and to help researchers avoid repeating the same mistakes. |
|
|
|
If you wish to contribute to the project by reporting a data contamination case, please read the Contribution Guidelines tab. |
|
|
|
Here is a description of each column in the table below: |
|
|
|
- **Evaluation Dataset:** Name of the evaluation dataset that has (not) been compromised. |
|
- **Contaminated Source:** Name of the model that has been trained with the evaluation dataset or name of the pre-training corpora that contains the evaluation dataset. |
|
- **Train Split:** Percentage of the train split contaminated. 0 means no contamination; 100 means that the dataset has been fully compromised. |
|
- **Development Split:** Percentage of the development split contaminated. 0 means no contamination; 100 means that the dataset has been fully compromised. |
|
- **Test Split:** Percentage of the test split contaminated. 0 means no contamination; 100 means that the dataset has been fully compromised. |
|
- **Approach:** Data-based or model-based approach. Data-based approaches search in publicly available data instances of evaluation benchmarks. Model-based approaches attempt to detect data contamination in already pre-trained models. |
|
- **Reference:** Paper or any other resource describing how this contamination case has been detected. |
|
- **PR Link:** Link to the PR in which the contamination case was described. |
|
|
|
""".strip() |
|
|