EarlyBERTs

Random Seed 0 | Steps 10 – 40,000

🐤 EarlyBERTs reproduces the MultiBERTs (Sellam et al., 2022) and introduces more granular checkpoints covering the initial and critical learning phases. In "Subspace Chronicles" (Müller-Eberstein et al., 2023), we use these checkpoints to study the models' early learning dynamics.

This suite builds on MultiBERTs and the underlying BERT architecture, covering seeds 0–4, for which intermediate checkpoints were originally released. For each seed, we provide 31 additional checkpoints at steps 10, 100, 200, ..., 1,000, 2,000, ..., 20,000, 40,000, which are stored as separate model revisions (e.g., revision=step11000).
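The checkpoint schedule above can be enumerated programmatically. The following is a minimal sketch, not part of the released code: the helper names `earlyberts_steps` and `revision_name` are hypothetical, and the 1,000-step spacing between 2,000 and 20,000 is inferred from the stated total of 31 checkpoints and the example revision `step11000`.

```python
def earlyberts_steps():
    """Return the checkpoint steps described in this card, in increasing order."""
    steps = [10]                              # earliest checkpoint
    steps += list(range(100, 1001, 100))      # 100, 200, ..., 1,000
    steps += list(range(2000, 20001, 1000))   # 2,000, 3,000, ..., 20,000 (assumed spacing)
    steps.append(40000)                       # last checkpoint
    return steps


def revision_name(step):
    """Map a training step to its revision branch name, e.g. 11000 -> 'step11000'."""
    if step not in earlyberts_steps():
        raise ValueError(f"no checkpoint listed for step {step}")
    return f"step{step}"


if __name__ == "__main__":
    print(len(earlyberts_steps()))   # number of checkpoints per seed
    print(revision_name(11000))
```

Checking a requested step against this list before loading avoids a failed download for steps that fall between checkpoints, such as 1,500.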

Model Details

Model Developers

Max Müller-Eberstein as part of the NLPnorth research unit at the IT University of Copenhagen, Denmark.

Variations

EarlyBERTs cover seeds 0–4 (in respective repositories) and steps 10–40,000 (in respective model revision branches).

Input

Text only.

Output

Text and/or embeddings of the input.

Additionally, the CLS-classification head is trained on next sentence prediction as in Devlin et al. (2019).

Model Architecture

EarlyBERTs are based on the original BERT architecture (Devlin et al., 2019) and load the respective MultiBERTs seed at step 0 as initialization.

Research Paper

Subspace Chronicles: How Linguistic Information Emerges, Shifts and Interacts during Language Model Training (Müller-Eberstein et al., 2023).

Training

Data

As neither the original BERT nor the MultiBERTs pre-training data is publicly available, we gather a corresponding corpus using fully public versions of the English Wikipedia and BookCorpus. Scripts to re-create the exact data ordering, sentence pairing and subword masking can be found in the project repository.

Hyperparameters

We replicate the exact training hyperparameters used for MultiBERTs and document them in our research paper. Code to reproduce our training procedure can be found in the project repository.

Usage

Loading the intermediate checkpoint for a specific seed and step follows the standard Hugging Face transformers API:

from transformers import AutoTokenizer, AutoModel

seed, step = 0, 7000

tokenizer = AutoTokenizer.from_pretrained(f'personads/earlyberts-seed{seed}')
model = AutoModel.from_pretrained(f'personads/earlyberts-seed{seed}', revision=f'step{step}')
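Since the CLS classification head is described above as trained on next sentence prediction, the pre-training heads can in principle be loaded alongside the encoder. The sketch below uses `BertForPreTraining` from transformers for this. It assumes that the uploaded checkpoints include the head weights; if they do not, transformers initializes the heads randomly and prints a warning. The helper names `checkpoint_id`, `load_with_heads` and `next_sentence_logits` are hypothetical.

```python
def checkpoint_id(seed, step):
    """Return (repository, revision) for a seed and step, following this card's naming."""
    return f"personads/earlyberts-seed{seed}", f"step{step}"


def load_with_heads(seed=0, step=7000):
    """Load tokenizer and model including the MLM and NSP heads (downloads from the Hub)."""
    # Imported lazily so that defining these helpers does not require transformers.
    from transformers import AutoTokenizer, BertForPreTraining

    repo, revision = checkpoint_id(seed, step)
    tokenizer = AutoTokenizer.from_pretrained(repo)
    model = BertForPreTraining.from_pretrained(repo, revision=revision)
    model.eval()  # disable dropout for inference
    return tokenizer, model


def next_sentence_logits(tokenizer, model, first, second):
    """Return the NSP head's logits for a sentence pair, shape (1, 2)."""
    import torch

    inputs = tokenizer(first, second, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.seq_relationship_logits


if __name__ == "__main__":
    tokenizer, model = load_with_heads(seed=0, step=7000)
    print(next_sentence_logits(tokenizer, model, "The cat sat down.", "It fell asleep."))
```

Very early checkpoints (e.g., step 10) are close to the random initialization, so their head outputs are not expected to be meaningful.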

Citation

If you find these models useful, please cite our paper as well as the original MultiBERTs work:

@inproceedings{muller-eberstein-etal-2023-subspace,
    title = "Subspace Chronicles: How Linguistic Information Emerges, Shifts and Interacts during Language Model Training",
    author = {M{\"u}ller-Eberstein, Max  and
      van der Goot, Rob  and
      Plank, Barbara  and
      Titov, Ivan},
    editor = "Bouamor, Houda  and
      Pino, Juan  and
      Bali, Kalika",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-emnlp.879",
    doi = "10.18653/v1/2023.findings-emnlp.879",
    pages = "13190--13208"
}

@inproceedings{
  sellam2022the,
  title={The Multi{BERT}s: {BERT} Reproductions for Robustness Analysis},
  author={Thibault Sellam and Steve Yadlowsky and Ian Tenney and Jason Wei and Naomi Saphra and Alexander D'Amour and Tal Linzen and Jasmijn Bastings and Iulia Raluca Turc and Jacob Eisenstein and Dipanjan Das and Ellie Pavlick},
  booktitle={International Conference on Learning Representations},
  year={2022},
  url={https://openreview.net/forum?id=K0E_F0gFDgA}
}
