Not compatible with llama.cpp

#2 opened by AmazingTurtle

I'm running the latest llama.cpp (11f3ca06b8c66b0427aab0a472479da22553b472) built with LLAMA_CUBLAS=1 and get the following error when loading the model:

$ ./main -m ./models/llama-2-7b-32k-ggml/LLaMA-2-7B-32K.ggmlv3.q5_1.bin
main: build = 928 (11f3ca0)
main: seed  = 1690751016
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6
llama.cpp: loading model from ./models/llama-2-7b-32k-ggml/LLaMA-2-7B-32K.ggmlv3.q5_1.bin
error loading model: unexpectedly reached end of file
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './models/llama-2-7b-32k-ggml/LLaMA-2-7B-32K.ggmlv3.q5_1.bin'
main: error: unable to load model

I have confirmed that the downloaded GGML model's sha256sum is identical to the remote one.

Hi, could you please try running it with the parameters mentioned in the linked discussion?
https://huggingface.co/s3nh/LLaMA-2-7B-32K-GGML/discussions/1#64c6666aa684146b1c02389b
For the conversion I was using the ggml v3 format.
Thanks.
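For reference, the usual llama.cpp conversion and quantization flow that produces these ggml v3 files looks roughly like the sketch below; the input path, output filenames and q5_1 type are illustrative assumptions, not the exact commands used for this repo.

python convert.py /path/to/LLaMA-2-7B-32K --outtype f16 --outfile LLaMA-2-7B-32K.ggmlv3.f16.bin   # HF checkpoint -> fp16 ggml (ggjt v3)
./quantize LLaMA-2-7B-32K.ggmlv3.f16.bin LLaMA-2-7B-32K.ggmlv3.q5_1.bin q5_1   # fp16 ggml -> q5_1 quant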

Apparently llama.cpp does not support ggmlv3. I tried the RoPE frequency parameters etc., but had no success.

$ ./main -t 8 -ngl 0 -m ./models/llama-2-7b-32k-ggml/LLaMA-2-7B-32K.ggmlv3.q5_1.bin --grammar-file ./grammars/doc-describe.gbnf -n 1024 -f prompts/classify.txt -c 32768 --temp 0 --top-k 1024 --top-p 0.9 --color -b 1024  --rope-freq-scale 0.0625 --rope-freq-base 30000 --keep 1
main: warning: changing RoPE frequency base to 30000 (default 10000.0)
main: warning: scaling RoPE frequency by 0,0625 (default 1.0)
main: warning: base model only supports context sizes no greater than 2048 tokens (32768 specified)
main: build = 928 (11f3ca0)
main: seed  = 1690796156
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6
llama.cpp: loading model from ./models/llama-2-7b-32k-ggml/LLaMA-2-7B-32K.ggmlv3.q5_1.bin
error loading model: unexpectedly reached end of file
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './models/llama-2-7b-32k-ggml/LLaMA-2-7B-32K.ggmlv3.q5_1.bin'
main: error: unable to load model

Seems so, yes.

main.exe -t 1 -n 65536 -c 32768 -m d:\Files\LLAMA2-7B-32k\LLaMA-2-7B-32K.ggmlv3.q8_0.bin -p "Once upon a time,"
main: warning: base model only supports context sizes no greater than 2048 tokens (32768 specified)

I am running this with llama.cpp on a MacBook M2 successfully.

It looks like the context size is still only 2048 for llama.cpp...

Actually, it supports 4096 already; you just need to tweak the RoPE parameters depending on the context size of the model.
That being said, the model doesn't seem to be compatible with llama.cpp or llama-cpp-python.
Are there any other specific parameters needed for this model?
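For reference, the linear-scaling rule of thumb in llama.cpp at this point is rope-freq-scale = training context / requested context (0.25 for a 2048-token base model pushed to 8192 tokens, 0.0625 for 32768); the right values for a long-context fine-tune like this one may differ, as discussed above. A minimal sketch with an illustrative context size, reusing the model path from earlier in the thread:

./main -m ./models/llama-2-7b-32k-ggml/LLaMA-2-7B-32K.ggmlv3.q5_1.bin -c 8192 --rope-freq-scale 0.25 -n 256 -p "Once upon a time,"   # 0.25 = 2048 / 8192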

This is my output when I try to load the model:

llama.cpp: loading model from models/LLaMA-2-7B-32K.ggmlv3.q4_1.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 32768
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_head_kv = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 1.0e-06
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 0.125
llama_model_load_internal: ftype = 3 (mostly Q4_1)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: mem required = 6359.77 MB (+ 16384.00 MB per state)
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc

It is able to run, but it simply gave the following output:

Instruction: Write a story about llamas\n### Response:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

----- more information -----
./main -t 10 -m "LLaMA-2-7B-32K.ggmlv3.q8_0.bin" --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story about llamas\n### Response:"
main: warning: base model only supports context sizes no greater than 2048 tokens (4096 specified)
main: build = 977 (b19edd5)
main: seed = 1691930243
llama.cpp: loading model from /home/xujun/Downloads/LLaMA-2-7B-32K.ggmlv3.q8_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 4096
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_head_kv = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 7 (mostly Q8_0)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: mem required = 6798.46 MB (+ 2048.00 MB per state)
llama_new_context_with_model: kv self size = 2048.00 MB
llama_new_context_with_model: compute buffer total size = 281.35 MB

system_info: n_threads = 10 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.700000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = -1, n_keep = 0

It works with some non-default parameters, as stated in https://huggingface.co/s3nh/LLaMA-2-7B-32K-GGML/discussions/1
