Amazing model, and this (the original) is better than laser or dpo versions

#11 opened by froggeric

I normally do not use 7b models, apart from benchmarking, as I find them inferior to bigger models. However, I have been proved wrong with this one. The results are truly amazing for this size. I find it excels at understanding and following instructions, keeps track of the context well, and writes beautifully with plenty of detail.

I have started running it through my benchmark, which consists of 24 prompts (half SFW, half NSFW) covering a wide variety of tasks and topics, not focused on RP. I am about 60% of the way through, but according to my preliminary results it ranks 4th out of 49 models ranging from 7b to 20b (a rough sketch of the scoring loop follows the ranking):

  1. PsyMedRP-v1-20B-GGUF (q4_ks)
  2. SnowyRP-FinalV1-L2-13B-GGUF (q4_ks)
  3. LLaMA2-13B-Estopia-GGUF (q4_ks)
  4. WestLake-7B-v2-GGUF mistral (q8_0)
  5. solarized-18B-dpo-GGUF (q4_k)
  6. SOLAR-10.7B-Instruct-v1.0-uncensored-GGUF (q6_k)
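
For context, the scoring loop looks roughly like this. This is a simplified sketch, not my actual harness: the `generate()` helper and the `prompts.json` file are placeholders standing in for however you call the local GGUF model and store the 24 prompts; the point is only to show manual, prompt-by-prompt scoring.

```python
import json

def generate(prompt: str) -> str:
    """Placeholder: swap in a call to the local GGUF model being tested."""
    raise NotImplementedError

def run_benchmark(prompt_file: str = "prompts.json") -> float:
    # prompts.json: [{"id": "sfw-01", "prompt": "..."}, ...] (24 entries)
    with open(prompt_file) as f:
        prompts = json.load(f)

    scores = []
    for item in prompts:
        response = generate(item["prompt"])
        print(f"\n--- {item['id']} ---\n{response}\n")
        # Manual evaluation: read the response and score it by hand.
        scores.append(float(input("Score 0-10: ")))

    # Overall average used to rank the model against the others.
    return sum(scores) / len(scores)

if __name__ == "__main__":
    print(f"Average score: {run_benchmark():.2f}")
```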

Out of interest, I am also benchmarking a few variants fine-tuned by others; the original is definitely better:

WestLake-7B-v2-laser-GGUF: I was surprised to find that this LASER tuning is actually worse than the original. The responses are generally less detailed, and I have not come across a single test where it performed better. It is still fairly close though, within 10% of the original's overall score.

WestLake-7B-v2-laser-truthy-dpo-GGUF: a big drop in quality (-25% benchmark score), and I was often disappointed by the poor responses.

I cannot wait to see what you have in store for v3, and I would love to see the same training applied to llama2 13b and solar 10.7b.

@froggeric I really appreciate your feedback, your interest in the model, and the custom benchmark details.
Apparently it also does best in its size class on EQ evals.
As for the variants, they are probably over-fitting to squeeze out a few extra points.
With everyone racing to the top of the leaderboard, HF benchmark scores don't seem to equate to usefulness for some models.

What would you like to see improved in v3?

Yes, I find most benchmarks do not reflect a model's usefulness well. This is why I had to make my own, where I manually evaluate the results, and if a model fails dramatically, I try a few changes to give it a chance.

In terms of what could be improved with v3, I have noticed two shortcomings of v2 in my testing:

Censorship / guardrails

v2's censorship still kicks in on some subjects. It is possible to bypass it with prompt engineering and other tricks, but even then it has a hard time staying uncensored over longer conversations. I have found models trained on the following dataset for uncensoring to give good results:
unalignment/toxic-dpo-v0.2
Another option is one of the ToxicQA datasets from NobodyExistsOnTheInternet, but I do not have much experience with them. The only model I tested that used one did not give great results (though that could be due to something else).
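
For illustration, DPO training on that dataset with trl would look roughly like the sketch below. This is not the WestLake recipe, just the general idea: the model id, hyperparameters, and exact argument names are assumptions on my part, and the DPOTrainer arguments differ a bit between trl versions.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "senseable/WestLake-7B-v2"   # example base model id

model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# The dataset provides prompt/chosen/rejected pairs, the standard DPO format.
dataset = load_dataset("unalignment/toxic-dpo-v0.2", split="train")

args = DPOConfig(
    output_dir="westlake-7b-v2-toxic-dpo",
    beta=0.1,                       # strength of the preference penalty
    per_device_train_batch_size=1,  # illustrative hyperparameters only
    gradient_accumulation_steps=8,
    learning_rate=5e-6,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,                    # reference model is created internally
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,     # `tokenizer=` in older trl releases
)
trainer.train()
```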

GPTisms

They are still there, unfortunately. However, just a couple of days ago I came across a model that implements a way to get rid of them. If I can find the details again, I will let you know.
edit: it was TeeZee/Kyllene-34B-v1.1

Great suggestions, thanks.
