The first GGUF that works with long context on llama.cpp!
All the previous attempts tended to get stuck outputting repeating strings of characters and gibberish. This one is holding up with reasonable outputs beyond ~100k tokens.
If anyone else is too impatient to wait for PR#8650 to land, something like this will get you going quickly.
```bash
# update your existing llama.cpp git repo
cd llama.cpp
git pull

# fetch the pending PR#8650 patches
git remote add jmorganca git@github.com:jmorganca/llama.cpp.git
git fetch jmorganca

# apply the patches and rebuild
git rebase jmorganca/master
make clean && time GGML_CUDA=1 make -j$(nproc)

# run the server with the desired context length, e.g. 100k
./llama-server \
  --model "../models/qwp4w3hyb/Meta-Llama-3.1-8B-Instruct-iMat-GGUF/meta-llama-3.1-8b-instruct-imat-Q8_0.gguf" \
  --n-gpu-layers 33 \
  --ctx-size 102400 \
  --cache-type-k f16 \
  --cache-type-v f16 \
  --threads 4 \
  --flash-attn \
  --mlock \
  --n-predict -1 \
  --host 127.0.0.1 \
  --port 8080
```
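Once the server is up, a quick sanity check from another terminal looks something like this (a minimal sketch using llama-server's built-in HTTP endpoints; the prompt is just a placeholder):

```bash
# confirm the server is up and the model finished loading
curl -s http://127.0.0.1:8080/health

# fire a small completion request to confirm generation works
curl -s http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Briefly explain what a KV cache is.", "n_predict": 64, "temperature": 0.0}'
```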
For my llama-cpp-api-client config, keep the temperature low and use a repeat penalty, e.g.:
```json
{
  "temperature": 0.0,
  "top_k": 40,
  "top_p": 0.95,
  "min_p": 0.05,
  "repeat_penalty": 1.1,
  "n_predict": -1,
  "seed": -1
}
```
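If you'd rather skip the client library entirely, the same sampling settings can be passed per request straight to the server's /completion endpoint. A hedged sketch (the prompt is a placeholder, and cache_prompt is optional but helps with repeated long-context calls):

```bash
# same sampling parameters as the client config above, passed in the request body
curl -s http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "Summarize the following document: ...",
        "temperature": 0.0,
        "top_k": 40,
        "top_p": 0.95,
        "min_p": 0.05,
        "repeat_penalty": 1.1,
        "n_predict": -1,
        "seed": -1,
        "cache_prompt": true
      }'
```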
On my 3090 Ti with 24 GB VRAM and the above settings, I'm getting a prompt eval speed of ~1467 tok/sec and a generation speed of ~27 tok/sec with an input of ~80k tokens.
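If you're wondering how a ~100k-token f16 KV cache fits next to the Q8_0 weights in 24 GB, here's a rough back-of-the-envelope sketch (assuming Llama-3.1-8B's 32 layers, 8 KV heads, and head dim 128; these are my assumptions, not figures reported by llama.cpp):

```bash
# KV cache bytes per token = 2 (K+V) * layers * kv_heads * head_dim * 2 bytes (f16)
#                          = 2 * 32 * 8 * 128 * 2 = 131072 bytes (~128 KiB/token)
echo "KV cache @ 102400 ctx: ~$(( 2 * 32 * 8 * 128 * 2 * 102400 / 1024 / 1024 / 1024 )) GiB"
# -> roughly 12-13 GiB for the cache, plus ~8.5 GiB for the Q8_0 weights,
#    which leaves some headroom on a 24 GB card
```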
Cheers and thanks for all the quants!
Thanks for the documentation for people unfamiliar with git foo. I'll add a link from the README.md :)
Are you sure about "repeat_penalty": 1.1?
I tested it heavily and "repeat_penalty": 1.0 seems to work far better with Llama 3.1 8B.
@mirek190 Hey, yes I see you did lots of testing over in PR#8650! Looks like the Llama-3.1 fixes were just added in PR#8676, gonna try that soon.
As for repeat_penalty, a value of 1.0 would essentially disable the feature. In my very brief testing I don't personally see a huge difference, since 1.1 is fairly modest and only affects the previous 64 tokens by default. But you bring up a good point: for some types of generation, e.g. coding/programming languages, using 1.0 would probably be best, as there are typically many repeating characters.
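If anyone wants to compare the two per request instead of picking one global value, both the penalty and its window can be set in the request body (a sketch; repeat_last_n is the window size, 64 by default, as I understand llama-server's completion options):

```bash
# repeat_penalty 1.0 disables the penalty; 1.1 penalizes tokens seen in the
# last repeat_last_n tokens (64 by default)
curl -s http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a Python function that parses a CSV file.", "temperature": 0.0, "repeat_penalty": 1.0, "repeat_last_n": 64, "n_predict": 256}'
```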
Use whatever you feel is best for your use case and thanks for sharing your findings!