Regarding the training data and replicability

#59
by siarez - opened

Are the checkpoints here from Google, trained with Google's data (which was never shared)? Or do the checkpoints actually come from training on the Wikipedia and BookCorpus datasets that are publicly available on HuggingFace?
In other words, will I be able to replicate this checkpoint by training on https://huggingface.co/datasets/wikipedia and https://huggingface.co/datasets/bookcorpus?

siarez changed discussion title from Regarding the training data to Regarding the training data and replicability
BERT community org

This is the original checkpoint released by Google in their paper.

Under Section 3.1 of the paper we see the following:

[Screenshot of the pre-training data description from Section 3.1 of the BERT paper: BooksCorpus (800M words) and English Wikipedia (2,500M words).]

@lysandre Yes, but there is also this excerpt (image below) which suggests the Wikipedia and BookCorpus datasets on HuggingFace are not identical to the ones the Google team used; hence the question about replicability.
[Screenshot of the excerpt suggesting the publicly available Wikipedia and BookCorpus differ from the versions Google used.]

BERT community org

@siarez Yes, you are right. This model is not fully replicable from https://huggingface.co/datasets/wikipedia and https://huggingface.co/datasets/bookcorpus, since those are not the datasets as pre-processed by Google.
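
If you want to approximate (but not exactly reproduce) the original pre-training corpus with the public datasets, a minimal sketch using the `datasets` library might look like the following. The Wikipedia dump config `20220301.en` is an assumption (pick whichever dump suits you), and none of this reproduces Google's original pre-processing:

```python
from datasets import load_dataset, concatenate_datasets

# Public stand-ins for Google's (unreleased, pre-processed) corpora:
# an English Wikipedia dump and BookCorpus. The "20220301.en" config
# is an assumption; other dump dates are available.
wiki = load_dataset("wikipedia", "20220301.en", split="train")
books = load_dataset("bookcorpus", split="train")

# Keep only the raw text column so the two corpora can be concatenated.
wiki = wiki.remove_columns([c for c in wiki.column_names if c != "text"])

corpus = concatenate_datasets([wiki, books])
print(corpus)
```

From there you would still need your own document cleaning, sentence segmentation, WordPiece tokenization, and masked-LM / next-sentence-pair construction to match the paper's pre-training setup, so the resulting checkpoint would differ from the one released here.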
