Issue with Mixtral-8x7B-Instruct-v0.1-GGUF Model: 'blk.0.ffn_gate.weight' Tensor Not Found
Hi,
I've encountered an issue while using the Mixtral-8x7B-Instruct-v0.1-GGUF model. Below are the details of my execution command and the error I received.
Command Used:
Using llamafile
llamafile -m mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf --color -c 32768 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "<s>[INST] {prompt} [/INST]"
Using llama.cpp:
~/Tools/llama.cpp/main -m mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf --color -c 32768 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "<s>[INST] {prompt} [/INST]"
Error Received:
Both gave this error message
error: create_tensor: tensor 'blk.0.ffn_gate.weight' not found
This error occurs at the conclusion of the process. I'm uncertain if this is an issue with the llamafile
command or a problem inherent to the Mixtral model itself.
Could you provide guidance on how to resolve this issue? Any help would be greatly appreciated.
Thanks,
LW
Hi
@littleworth
!
For llama.cpp, perhaps you forgot to check out the mixtral
branch? Then git checkout mixtral
would fix it.
$ git clone https://github.com/ggerganov/llama.cpp
$ # Mixtral support is not merged to `main` yet - see PR 4406
$ # https://github.com/ggerganov/llama.cpp/pull/4406
$ git checkout mixtral
$ make LLAMA_CUBLAS=1
I can't speak for llamafile
, but I guess it does not support Mixtral at the moment - the issue seems to be tracked here.
Hi @littleworth !
For llama.cpp, perhaps you forgot to check out themixtral
branch? Thengit checkout mixtral
would fix it.
$ git clone https://github.com/ggerganov/llama.cpp $ # Mixtral support is not merged to `main` yet - see PR 4406 $ # https://github.com/ggerganov/llama.cpp/pull/4406 $ git checkout mixtral $ make LLAMA_CUBLAS=1
I can't speak for
llamafile
, but I guess it does not support Mixtral at the moment - the issue seems to be tracked here.
Hi,
Thanks for your reply. After following your instruction, I got this error:
$ make LLAMA_CUBLAS=1
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: x86_64
I UNAME_M: x86_64
I CFLAGS: -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -march=native -mtune=native
I CXXFLAGS: -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native
I NVCCFLAGS: --forward-unknown-to-host-compiler -use_fast_math -arch=native -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-pedantic -Xcompiler "-Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native "
I LDFLAGS: -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC: cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
I CXX: g++ (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -march=native -mtune=native -c ggml.c -o ggml.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native -c llama.cpp -o llama.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native -c common/common.cpp -o common.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native -c common/sampling.cpp -o sampling.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native -c common/grammar-parser.cpp -o grammar-parser.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native -c common/build-info.cpp -o build-info.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native -c common/console.cpp -o console.o
nvcc --forward-unknown-to-host-compiler -use_fast_math -arch=native -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-pedantic -Xcompiler "-Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native " -c ggml-cuda.cu -o ggml-cuda.o
nvcc fatal : Value 'native' is not defined for option 'gpu-architecture'
make: *** [Makefile:457: ggml-cuda.o] Error 1
Any idea how to solve?
LW
Cuda must be 12.2
Check version installed
On Windows 10 x64, switching to the mixtral
branch worked for me.
After running cmake
, recompiled in Visual Studio 2022 Community, and success:
main -m models/mixtral-8x7b-v0.1.Q5_K_M.gguf -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt
...
llm_load_tensors: mem required = 30735.89 MiB
llama_new_context_with_model: compute buffer total size = 117.85 MiB
...
system_info: n_threads = 28 / 56 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
== Running in interactive mode. ==
User:Meaning of life.
Bob: The meaning of life is to be happy and find joy in each day.
To get the .gguf
file, I ran LM Studio, downloaded the model, then symlinked under Windows from the model file to the models
dir in this git repo.