Regarding the training data and replicability

#59
by siarez - opened

Are the checkpoints here from Google, trained with Google's data (which was never shared)? Or do the checkpoints actually come from training on the Wikipedia and BookCorpus datasets that are publicly available on HuggingFace?
In other words, will I be able to replicate this checkpoint by training on https://huggingface.co/datasets/wikipedia and https://huggingface.co/datasets/bookcorpus?

siarez changed discussion title from Regarding the training data to Regarding the training data and replicability
BERT community org

This is the original checkpoint released by Google in their paper.

Under Section 3.1 of the paper we see the following:

[Screenshot of the pre-training data description from Section 3.1 of the BERT paper: BooksCorpus (800M words) and English Wikipedia (2,500M words).]

@lysandre Yes, but there is also this excerpt (image below) which suggests the Wikipedia and BookCorpus datasets on HuggingFace are not identical to the ones the Google team used; hence the question about replicability.
[Screenshot of the excerpt suggesting the publicly available Wikipedia and BookCorpus differ from the versions Google used.]

BERT community org

@siarez Yes, you are right. This model is not fully replicable from https://huggingface.co/datasets/wikipedia and https://huggingface.co/datasets/bookcorpus, since those are not the datasets as pre-processed by Google.
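
If you want to approximate (but not exactly reproduce) the original pre-training corpus with the public datasets, a minimal sketch using the `datasets` library might look like the following. The Wikipedia dump config `20220301.en` is an assumption (pick whichever dump suits you), and none of this reproduces Google's original pre-processing:

```python
from datasets import load_dataset, concatenate_datasets

# Public stand-ins for Google's (unreleased, pre-processed) corpora:
# an English Wikipedia dump and BookCorpus. The "20220301.en" config
# is an assumption; other dump dates are available.
wiki = load_dataset("wikipedia", "20220301.en", split="train")
books = load_dataset("bookcorpus", split="train")

# Keep only the raw text column so the two corpora can be concatenated.
wiki = wiki.remove_columns([c for c in wiki.column_names if c != "text"])

corpus = concatenate_datasets([wiki, books])
print(corpus)
```

From there you would still need your own document cleaning, sentence segmentation, WordPiece tokenization, and masked-LM / next-sentence-pair construction to match the paper's pre-training setup, so the resulting checkpoint would differ from the one released here.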
