High Precision quant of 🚀Reflection-Llama-3.1-70B🚀

This quant retains 99.96% of the float16 perplexity score at a 50GB file size, whereas fp8 (not tested on this model) is known to retain 97-98.8%.

Only posting one quant because these are really annoying to make and I haven't automated the process yet. It takes 30+ iterations of models, as I have to recompile llama.cpp at every build/test step until the per-weight quantization configs with the lowest perplexity loss are found. The end result saves 5GB of space vs regular q6_k.

🐧 To download faster on Linux: sudo apt install -y aria2
🍎 On Mac: brew install aria2

These commands will download up to 9x faster by using 9 connections per file. Feel free to paste them in all at once or one at a time.

aria2c -x 9 -o reflection-70b-precisequant-6bpw-00001-of-00002.gguf https://huggingface.co/nisten/Reflection-70b-PreciseQuant-6bpw-gguf/resolve/main/reflection-70b-precisequant-6bpw-00001-of-00002.gguf

aria2c -x 9 -o reflection-70b-precisequant-6bpw-00002-of-00002.gguf https://huggingface.co/nisten/Reflection-70b-PreciseQuant-6bpw-gguf/resolve/main/reflection-70b-precisequant-6bpw-00002-of-00002.gguf
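The two commands above differ only in the shard number, so they can be generated rather than typed. The following is a minimal sketch, assuming the `-NNNNN-of-NNNNN.gguf` naming seen in the links; the helper names `shard_names` and `aria2_commands` are hypothetical, and the script only prints the commands without running them.

```python
from __future__ import annotations

# Repo URL and file stem are copied from the download links above.
BASE_URL = "https://huggingface.co/nisten/Reflection-70b-PreciseQuant-6bpw-gguf/resolve/main"
STEM = "reflection-70b-precisequant-6bpw"


def shard_names(stem: str, total: int) -> list[str]:
    """Return split-GGUF file names such as stem-00001-of-00002.gguf."""
    return [f"{stem}-{i:05d}-of-{total:05d}.gguf" for i in range(1, total + 1)]


def aria2_commands(stem: str, total: int, connections: int = 9) -> list[str]:
    """Build one aria2c command line per shard (nothing is executed here)."""
    return [
        f"aria2c -x {connections} -o {name} {BASE_URL}/{name}"
        for name in shard_names(stem, total)
    ]


if __name__ == "__main__":
    for cmd in aria2_commands(STEM, 2):
        print(cmd)
```

Both files need to end up in the same directory, since llama.cpp is pointed at the first shard only.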

Prompt file with correct template

🐧 Make a file called reflectionprompt.txt, copy and paste this in, and change as needed:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a world-class AI system, capable of complex reasoning and reflection. Reason through the query inside <thinking> tags, and then provide your final response inside <output> tags. If you detect that you made a mistake in your reasoning at any point, correct yourself inside <reflection> tags.<|eot_id|><|start_header_id|>user<|end_header_id|>
{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
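If you would rather generate the prompt file than edit it by hand, here is a rough sketch that assembles one system turn, one user turn, and an open assistant header from the same special tokens. The function names are hypothetical. The `sep` parameter is an assumption added for flexibility: it defaults to a single newline after each header, while Meta's published Llama 3 format uses a blank line there (`sep="\n\n"`).

```python
from pathlib import Path

# System prompt text taken from the template above.
SYSTEM_PROMPT = (
    "You are a world-class AI system, capable of complex reasoning and reflection. "
    "Reason through the query inside <thinking> tags, and then provide your final "
    "response inside <output> tags. If you detect that you made a mistake in your "
    "reasoning at any point, correct yourself inside <reflection> tags."
)


def build_prompt(system: str, user: str, sep: str = "\n") -> str:
    """Assemble a single-turn prompt ending with an open assistant header."""
    return (
        "<|begin_of_text|>"
        f"<|start_header_id|>system<|end_header_id|>{sep}{system}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>{sep}{user}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>"
    )


def write_prompt_file(path: str, user: str) -> None:
    """Write the assembled prompt to disk, e.g. reflectionprompt.txt."""
    Path(path).write_text(build_prompt(SYSTEM_PROMPT, user), encoding="utf-8")


if __name__ == "__main__":
    write_prompt_file("reflectionprompt.txt", "How many r's are in strawberry?")
```

Each turn is closed by exactly one `<|eot_id|>`, and the file ends on the assistant header so the model writes the reply.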

To run the model in a command-line terminal with multiline input, find the location of the first (00001) gguf file, then run:

./llama-cli -m reflection-70b-precisequant-6bpw-00001-of-00002.gguf -f reflectionprompt.txt --prompt-cache random.cache --keep -1 -fa -cnv -c 32000 -co -e -mli --temp 0 -ngl 99
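For scripting, the same invocation can be built as an argument list with one flag per line. This is only a sketch: the function name `llama_cli_command` is hypothetical, and the per-flag comments are my reading of the `llama-cli --help` text, which may differ between llama.cpp versions. It passes each flag once and does not launch anything.

```python
from __future__ import annotations

import shlex


def llama_cli_command(model: str, prompt_file: str,
                      gpu_layers: int = 99, ctx: int = 32000) -> list[str]:
    """Return the llama-cli argument list used above, one flag per entry."""
    return [
        "./llama-cli",
        "-m", model,                       # path to the first (00001) shard
        "-f", prompt_file,                 # read the initial prompt from a file
        "-ngl", str(gpu_layers),           # number of layers to offload to the GPU
        "--prompt-cache", "random.cache",  # cache the evaluated prompt on disk
        "--keep", "-1",                    # keep the whole initial prompt when context fills
        "-fa",                             # flash attention
        "-cnv",                            # conversation mode
        "-c", str(ctx),                    # context size in tokens
        "-co",                             # colored output
        "-e",                              # process escape sequences in the prompt
        "-mli",                            # multiline input
        "--temp", "0",                     # temperature 0
    ]


if __name__ == "__main__":
    cmd = llama_cli_command(
        "reflection-70b-precisequant-6bpw-00001-of-00002.gguf",
        "reflectionprompt.txt",
    )
    print(shlex.join(cmd))  # copy-pasteable shell command
```

Lowering `gpu_layers` is the knob to try first if the model does not fit in VRAM.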

Perplexity benchmarks: as you can see, the accuracy of the quant is 5.2416/5.2468 = 99.96% +-0.02%

Float16 - 143GB - perplexity: calculating perplexity over 64 chunks, n_ctx=512, batch_size=2048, n_seq=4
16.92 seconds per pass - ETA 4.50 minutes
[1]4.0486,[2]4.6471,[3]3.9394,[4]3.4698,[5]3.2290,[6]3.0391,[7]3.1640,[8]3.1819,[9]3.2073,[10]3.3374,[11]3.5247,[12]3.7371,[13]3.9944,[14]4.0065,[15]4.1234,[16]4.1503,[17]4.2893,[18]4.4968,[19]4.4347,[20]4.4439,[21]4.5403,[22]4.4419,[23]4.2888,[24]4.2224,[25]4.1259,[26]4.0495,[27]4.0324,[28]4.0221,[29]4.0838,[30]4.1170,[31]4.1588,[32]4.1664,[33]4.2095,[34]4.2723,[35]4.3194,[36]4.4006,[37]4.4192,[38]4.4598,[39]4.4861,[40]4.5294,[41]4.5674,[42]4.5571,[43]4.6098,[44]4.6025,[45]4.7148,[46]4.7590,[47]4.7303,[48]4.6854,[49]4.6778,[50]4.7118,[51]4.7762,[52]4.7682,[53]4.8604,[54]4.8778,[55]4.9023,[56]4.9398,[57]4.9594,[58]4.9813,[59]4.9653,[60]5.0095,[61]5.0626,[62]5.1179,[63]5.1774,[64]5.2416,
Final estimate: PPL = 5.2416 +/- 0.09238

6bpw - 50GB - perplexity: calculating perplexity over 64 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 23.59 seconds per pass - ETA 6.28 minutes
[1]4.0767,[2]4.6657,[3]3.9513,[4]3.4823,[5]3.2487,[6]3.0724,[7]3.1902,[8]3.2125,[9]3.2384,[10]3.3744,[11]3.5567,[12]3.7686,[13]4.0223,[14]4.0309,[15]4.1456,[16]4.1740,[17]4.3123,[18]4.5194,[19]4.4535,[20]4.4623,[21]4.5580,[22]4.4580,[23]4.3051,[24]4.2390,[25]4.1393,[26]4.0586,[27]4.0414,[28]4.0307,[29]4.0909,[30]4.1243,[31]4.1653,[32]4.1725,[33]4.2153,[34]4.2791,[35]4.3258,[36]4.4072,[37]4.4263,[38]4.4676,[39]4.4944,[40]4.5377,[41]4.5755,[42]4.5648,[43]4.6176,[44]4.6105,[45]4.7227,[46]4.7669,[47]4.7393,[48]4.6918,[49]4.6836,[50]4.7175,[51]4.7818,[52]4.7738,[53]4.8659,[54]4.8834,[55]4.9086,[56]4.9452,[57]4.9649,[58]4.9874,[59]4.9718,[60]5.0159,[61]5.0686,[62]5.1238,[63]5.1833,[64]5.2468,
Final estimate: PPL = 5.2468 +/- 0.09258
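To reproduce the comparison from the two logs, the sketch below pulls the final estimate out of a llama.cpp perplexity log and divides the float16 value by the quant value. The function names are hypothetical. Evaluated directly, 5.2416/5.2468 comes to roughly 99.90%, so the 99.96% figure quoted above may rest on a different rounding or calculation.

```python
from __future__ import annotations

import re

# Matches the summary line printed at the end of a perplexity run.
FINAL_RE = re.compile(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)")


def parse_final_estimate(log: str) -> tuple[float, float]:
    """Return (perplexity, stated error) from a perplexity log."""
    match = FINAL_RE.search(log)
    if match is None:
        raise ValueError("no 'Final estimate' line found in log")
    return float(match.group(1)), float(match.group(2))


def retention(base_ppl: float, quant_ppl: float) -> float:
    """Ratio of baseline to quantized perplexity (1.0 means no loss)."""
    return base_ppl / quant_ppl


if __name__ == "__main__":
    fp16, _ = parse_final_estimate("Final estimate: PPL = 5.2416 +/- 0.09238")
    q6, _ = parse_final_estimate("Final estimate: PPL = 5.2468 +/- 0.09258")
    print(f"retention: {retention(fp16, q6) * 100:.2f}%")       # 99.90%
    print(f"PPL increase: {(q6 - fp16) / fp16 * 100:.3f}%")
```

Both runs use the same 64 chunks and n_ctx=512, which is what makes the two final estimates comparable.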
