microsoft/phi-1_5 · Plans to release the training dataset?

Oct 6, 2023

At the time of writing, all community efforts to create synthetic datasets like the one used for Phi-1.5 fall short, either in the quality of the synthetic generations or in the sheer size of the synthetic corpus.
Releasing the data used to train Phi-1.5 would be greatly beneficial for further research into the impact of synthetic datasets on large language models.
I would love to hear from one of the authors of the Phi-1.5 technical report on whether the community can expect the dataset, or a subset of it, to be released under any license or usage conditions.

gugarosa

Microsoft org Oct 30, 2023

Hello @monology!

Unfortunately, we are not able to release the dataset at the moment. However, there are some amazing attempts to create public versions, such as https://huggingface.co/datasets/nampdn-ai/tiny-textbooks and https://huggingface.co/datasets/emrgnt-cmplxty/sciphi-textbooks-are-all-you-need.

gugarosa changed discussion status to closed Oct 30, 2023

RaphaelKalandadze

Dec 29, 2023

Any updates on that topic?

Adhishtanaka

Feb 13 (edited Feb 13)

Any updates on this topic?