Can you explain the purpose of merged_all.json?
To me the axolotl config already looks like it includes all relevant data sources. After looking at previous Einstein models I suspect that the merged_all.json still contains data from those, in addition to being merged with all other datasets. But Is it still relevant? Wouldn't it be more efficient to exclude it from the training process?
merged_all.json is merged data of many alpaca format datasets. The other datasets in the data folder is mainly in sharegpt format. So merged_all.json doesn't contain any of the other data that's in the data folder.
Oh ok. Thanks for the info. Does it simply contain all the other datasets mentioned in the README datasets list but not the axolotl config?
Yes, you got it right!
Note that I filtered some of them :)
Cool, Thanks for the info and thank you for this new version of Einstein :)
@nlpguy , if you are more interested in the datasets I use, you can have a look at this link:
https://huggingface.co/datasets/Weyaxi/sci-datasets/tree/main
It may be slightly outdated for 1-2 datasets, but that's the main repository I use.