Disclosure of training data needed
#1 opened by markding
What exactly did the "4.5T tokens of high-quality training data during the training phase" consist of? Knowing this is important both for interpreting evaluation results and for understanding potential legal aspects of deployment.
Names of specific datasets and identification of languages would be very useful.
Thank you for your work!
ZekeWang changed discussion status to closed
Closed without even a comment?