Please document pretraining datasets #49
opened by markding
It is surprisingly hard to find which datasets were used in pretraining. Could you provide more details? The dataset statement on the Cohere site looks like it applies to GPT models that are now several years old; it offers no details either: https://docs.cohere.com/docs/data-statement
At https://opening-up-chatgpt.github.io/ we're tracking degrees of openness for instruction-tuned LLMs that are made openly available in some form (in the case of Command R+, the model weights are made available). FWIW, Command R+ entered the tracker in the bottom 5 (out of the >30 models currently tracked), just below Llama 2.
Providing this kind of detail will soon be required under the EU AI Act.