CharGen's potential, your labour, and T5
Hi: I write comments and then let social anxiety ensure I never read the replies, so don't expect one, but:
First of all: thank you. This is really cool and novel. You're doing something actually interesting. You're programming with your own code: that's what a dataset is, right? A program for a learner.
"800 hours to grade these cards, or just over 2 months. That was not exciting."
Thank you! You worked hard. I think you might want to look at what you did as creating two datasets here, or a dataset plus its metadata. You've not just got a refined dataset: you've got a MUCH LARGER corpus of bad data alongside the "correct answers" generated entirely by copy-editing (attention heads LOVE copying! In fact you may well have noticed them repeating themselves!).
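To make the "two datasets" point concrete, here's a minimal sketch of pairing them up. The folder names and the matched-by-filename assumption are entirely made up, since I have no idea how your files are actually laid out:

```python
# Hypothetical layout: raw cards in one folder, your copy-edited versions in
# another, matched by filename. Pair them into a "bad -> good" parallel corpus.
import json
from pathlib import Path

raw_dir = Path("cards_raw")      # made-up name: the unedited originals
clean_dir = Path("cards_clean")  # made-up name: your copy-edited versions

pairs = []
for raw_path in sorted(raw_dir.glob("*.txt")):
    clean_path = clean_dir / raw_path.name
    if not clean_path.exists():
        continue  # a card you rejected outright; could instead be kept as a negative example
    pairs.append({
        "input": raw_path.read_text(encoding="utf-8"),
        "target": clean_path.read_text(encoding="utf-8"),
    })

with open("bad_to_good_pairs.jsonl", "w", encoding="utf-8") as f:
    for pair in pairs:
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```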
I think your good dataset may well lead to better CharGens. But it doesn't seem proportionate to the work: most of what you did was modify the information such that it's not present in CharGen's training!
Maybe the reason you feel it was a huge effort that wasn't exciting is that you've not withdrawn your winnings yet.
Dataset cleaning (or censoring, or reductionism if you prefer) is to T5 what (compelling, fascinating, meaningful) bullshit is to llama2. Examples of useful T5s are far more abundant and interpretable than trying to figure out wtf your average mergebrained llamaslerping finecoomer is up to. (You can go on arXiv to learn more about a lot of the T5s on HF. arXiv! Actual* science!)
You can't release any dimension of your dataset, whether the bad whole or just the good refined bit. How about a bad-to-good transformer trained in your ways? It might offer you the most utility of all (given you know exactly what it'd be trained on). On average, it would take a card in and basically reject it, returning nothing, or perhaps a classifier output, instead of bothering to go generative.
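Rough sketch of what I mean, nothing more than the stock Hugging Face seq2seq recipe (the checkpoint, prompt prefix, lengths and hyperparameters are all guesses):

```python
# Fine-tune a T5 on the bad -> good pairs so it learns your editing policy.
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "google/flan-t5-base"  # guess: any T5-family checkpoint would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

ds = load_dataset("json", data_files="bad_to_good_pairs.jsonl", split="train")

def preprocess(batch):
    # Guessed task prefix; the raw card is the input, your edit is the target.
    enc = tokenizer(["clean this character card: " + t for t in batch["input"]],
                    truncation=True, max_length=512)
    labels = tokenizer(text_target=batch["target"], truncation=True, max_length=512)
    enc["labels"] = labels["input_ids"]
    return enc

ds = ds.map(preprocess, batched=True, remove_columns=ds.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="bad2good-t5",
        per_device_train_batch_size=4,
        num_train_epochs=3,
        learning_rate=3e-4,
    ),
    train_dataset=ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```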
Your options are quite broad; see the right-hand side.
FLAN-T5 may well even be capable of this with very little or no training. T5 is coming to llama.cpp, and Aphrodite too I think. Both lightning fast. Food for thought. Think embeddings. Think prefixes. Think systems of smaller T5s. IDK. I couldn't train my own neural network to tie my shoelaces.
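To show the little/no-training angle, FLAN-T5 can at least be poked at zero-shot. The prompt wording, filename and generation settings below are my own guesses, not anything proven to work:

```python
# Zero-shot probe of FLAN-T5 as a card cleaner, no training at all.
from transformers import pipeline

cleaner = pipeline("text2text-generation", model="google/flan-t5-large")
card = open("some_card.txt", encoding="utf-8").read()  # made-up filename
prompt = ("Rewrite this character card concisely, removing filler, "
          "repetition and contradictions:\n\n" + card)
print(cleaner(prompt, max_new_tokens=512)[0]["generated_text"])
```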
I might be being rude in assuming you don't already know all of this. Hard to judge someone's internal knowledge by the content of their Hugging Face profile, you know?
Thanks again. Sorry if you do reply and I don't read it :s
This comment was inspired by my conceptual failure with my first attempt with CharGen: I gave it my best and longest writing, but it's trained to expand, speculate and confabulate, not paraphrase and distill. So sure enough (because I'm a better writer than the average coomer*) it just copied and reordered things. But I know what it's useful for and I will definitely get good use out of it.
*citations omitted
You're correct in your assumption that CharGen is made to expand and speculate, driven by your prompt. That's how Llama models work, anyway :)
It was made to take a short prompt (~100-200 tokens) and expand it into a character. If you feed it a large amount of beautifully written text, it will consider its "job" already done and will most likely just rearrange bits and pieces of it. Remember that it can't write better than you, unless you write super-badly (I sure do!).
I'm going to look into T5-based models, thank you for the suggestion! T5 models were used for grammar correction when making CharGen, and they were nothing short of amazing.
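For anyone curious, trying such a model is only a couple of lines; the checkpoint below is just one public grammar-correction T5 from the Hub, not necessarily the exact one used here:

```python
# One public grammar-correction T5 (it expects inputs prefixed with "grammar: ").
from transformers import pipeline

fixer = pipeline("text2text-generation", model="vennify/t5-base-grammar-correction")
print(fixer("grammar: she dont has no time for to write good",
            max_new_tokens=64)[0]["generated_text"])
```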
Regarding your concerns about my mental health: first, thank you. I am pretty happy with how CharGen was welcomed by the community, and I have absolutely no regrets about the time spent making it. Yeah, the immediate experience of filtering the dataset was not exciting, but the outcome and feedback absolutely are. I would totally do it again (I'm already working on some CharGen-related things)!