Very good model
After trying it for a bit, I'd say it gives LZLV a run for its money. I would have given up on it quickly after a lot of weirdness with chat formatting and speaking out of turn, but the provided preset works quite well for keeping it in check. Thank you for providing it in a convenient format.
Some things I've noticed and tried to tweak the preset for, though:
-Failure to add asterisks for story descriptions: the preset only specifies asterisks for internal dialogue, not for narration of what's happening in the story. This may cause the model to leave them off, especially at the start of a reply.
-Generating multiple replies in a row: the model will produce a full reply without issues but then proceed to generate a second reply from the character, including the {{char}}: prefix. I think this is fixed by telling the model to produce only a single reply (rough sketch of what I added below this list).
-Copying the character card verbatim, or very nearly so, in replies
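For the first two points, this is roughly what I added, paraphrased as a sketch rather than the exact preset wording (the field names and format will depend on your frontend):

```python
# Rough sketch of my tweaks, not the exact preset text.

# Extra line appended to the system prompt, covering the asterisk and single-reply issues:
extra_instruction = (
    "Wrap all narration and actions in *asterisks*, not only internal thoughts. "
    "Write a single reply for {{char}} and then stop."
)

# Custom stopping strings so a stray second turn gets cut off even if the
# model ignores the instruction:
stopping_strings = ["\n{{char}}:", "\n{{user}}:"]
```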
I'm glad you're enjoying the model. It's funny you should mention LZLV, because I'm close to releasing a new merge that uses that model in the blend instead of Xwin, and it's really good. Expect something within the next week.
I haven't spent much time using the 70B version of this model, but I've gotten better results from the 103B version by playing around with temp and min-p. I didn't understand how SillyTavern measures min-p when I made my initial recommendation. It's inverted from what I put, so try a min-p setting of 0.8 instead of 0.2 and see if you get better results. You may also want to lower the temp a little, but the higher you go with min-p, the higher you can push the temperature and still get good results. I'll update the model card to reflect some of these changes. Hopefully that resolves some of the problems you've had.
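In case it helps, here's a rough sketch of what the min-p filter itself does, as I understand it; the part I originally got backwards was how the slider value maps onto this, so treat the exact mapping in your frontend as version-dependent:

```python
import numpy as np

def min_p_filter(probs: np.ndarray, min_p: float) -> np.ndarray:
    """Sketch of min-p sampling: keep only tokens whose probability is at
    least min_p times the probability of the most likely token, then
    renormalize. How the UI slider maps onto this value may vary."""
    threshold = min_p * probs.max()
    filtered = np.where(probs >= threshold, probs, 0.0)
    return filtered / filtered.sum()

# Example: with min_p = 0.8 only tokens close to the top token survive,
# which is why you can push the temperature higher without the output
# falling apart.
probs = np.array([0.50, 0.30, 0.15, 0.05])
print(min_p_filter(probs, 0.8))  # -> [1. 0. 0. 0.]
```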
All that being said, this model is fundamentally not perfect. You'll probably always have some issues with it around the edges. Stay tuned for its successor.
I've tried the updated gen preset but have been getting very repetitive results. It will repeat the same phrase almost verbatim several times over several paragraphs.
How are you loading the model and what context length are you using? I've only ever tested the 70B model using my own exllama2 quant at 4.85bpw. It's possible that other quantizations make this problem worse. If you can tell me your exact model version and sampler settings, I'll give it a try on my end and see what's up with it.
I'm using llamacpp_HF with the q3_K_M GGUF quant.
I tried TheBloke's q3_K_M GGUF quant using the llamacpp_HF loader and it worked well for me at 4096 context, at least in a quick test scenario where there was already context for the model to draw upon. (In other words, I ran it in the middle of a chat I already had going.) Did you encounter problems near the start of a chat, where there isn't much context to guide the model? I recommend experimenting with the sampler settings and the system prompt, and providing additional contextual guidance to the model; as best I can tell, the issues you reported should be at least partially solvable that way.
Also, what are you using for the tokenizer? I don't mess around with llamacpp very often, and I noticed that the llamacpp_HF loader presents you with two options: 1) place copies of the model's tokenizer files into the same directory as the .gguf file, or 2) use a generic tokenizer provided by oobabooga. Since I have the full model on my local drive, I chose option 1 and copied over the four tokenizer files. If you're using option 2, that could be another factor.
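For what it's worth, option 1 roughly amounts to the following; the filenames below are the ones a Llama-2-style repo typically ships and the paths are placeholders, so adjust both for your setup:

```python
import shutil
from pathlib import Path

# Placeholder paths; point these at your own directories.
hf_model_dir = Path("/models/original-hf-model")  # full HF model with its tokenizer files
gguf_dir = Path("/models/gguf-quant")             # directory containing the .gguf file

# Typical tokenizer files for a Llama-2-style model; your repo may differ slightly.
for name in ["tokenizer.model", "tokenizer.json",
             "tokenizer_config.json", "special_tokens_map.json"]:
    src = hf_model_dir / name
    if src.exists():
        shutil.copy2(src, gguf_dir / name)
```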
Just throwing out ideas. If you make a breakthrough, I'd be curious to know what solved the problems for you.
I'm using the tokenizer included in your repo, and the loader picks those files up. After more testing, it appears that the repetition issue I described was a fluke, but it has led me to find that certain scenarios cause the model to have a single, almost certain output, with only minor variations between regens, and it's very resistant to changes in generation settings. I think this is what I observed with the highly repetitive outputs before: that situation caused the model to produce an almost identical reply regardless of settings, and that reply just happened to be a bad one with heavy repetition. The model still has issues with generating user dialogue despite my best efforts to dissuade it with the system prompt and negative CFG, but neither of those seems to have much influence on "stuck" outputs. I'll keep messing with it.
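For reference, this is my mental model of what the negative CFG is doing (a sketch of standard classifier-free guidance, not the backend's actual implementation), which is why I expected it to have more pull on those stuck outputs than it does:

```python
import numpy as np

def cfg_logits(cond: np.ndarray, uncond: np.ndarray, scale: float) -> np.ndarray:
    """Sketch of classifier-free guidance: push the logits away from the
    negative-prompt pass (uncond) and toward the positive-prompt pass (cond)
    by `scale`. scale = 1.0 reduces to the ordinary conditional logits."""
    return uncond + scale * (cond - uncond)
```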
I'm glad the repetition issue seems to be limited. Now that you describe it like that, I remember seeing that behavior myself a few times. It seems to be the model's knee-jerk reaction when it doesn't know how to continue the chat from that point. I seem to recall I was able to break it out of that state by providing it with a system message clarifying what I wanted it to do next.
As for the tendency to write text as you, the user, I feel you there. I would say that's a known weakness of the model. You can mostly suppress it with prompting and sampler settings, but you'll probably still encounter it sometimes.
Thanks for sharing your experiences with the model. This feedback helps me identify which issues are affecting people, and I take that into consideration when tuning future merges.