Dolly 2 dataset
Some datasets, like Alpaca, are for research use only. It would be good to have Ravens that can be used commercially, too.
The Dolly 2 dataset has a clean license, I suppose.
https://github.com/databrickslabs/dolly/tree/master/data
There are more FOSS instruction-tuning datasets, I suppose.
GPT-3 and GPT-4 might give great training data, but using their output can restrict the license and the applications of your model.
I +1 the idea! Fine-tuning on their dataset might lead to great results without potentially poisoning the license.
Here is the link to the article:
https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
Will add to v10 :)
I'm confused. gpt4allv2, based on GPT-J, has an Apache 2.0 license after tuning on OpenAI API output.
Either they made a mistake, or it is no problem at all to fine-tune FOSS models on OpenAI API output.
Don't risk it. You're highly strategic. Play it safe.