Failed to load model (with the latest version)

#3
by omnibookxp - opened

I just tried to use "Meta-Llama-3.1-8B-Instruct-Q8_0.gguf" with LM Studio 0.2.28

Failed to load model

Error message:
"llama.cpp error: 'done_getting_tensors: wrong number of tensors; expected 292, got 291'"

Diagnostics info:
{
  "memory": {
    "ram_capacity": "32.00 GB",
    "ram_unused": "10.00 GB"
  },
  "gpu": {
    "gpu_names": [
      "Apple Silicon"
    ],
    "vram_recommended_capacity": "21.33 GB",
    "vram_unused": "9.10 GB"
  },
  "os": {
    "platform": "darwin",
    "version": "14.5"
  },
  "app": {
    "version": "0.2.28",
    "downloadsDir": "/Users/maxm1/.cache/lm-studio/models"
  },
  "model": {}
}

Same here with Ollama.

ollama run Meta-Llama-3.1-8B-Instruct-Q8_0:latest
Error: llama runner process has terminated: error loading model: done_getting_tensors: wrong number of tensors; expected 292, got 291
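
For anyone who wants to check a downloaded file against that error before loading it, here is a minimal sketch using the `gguf` Python package (pip install gguf); the file path is just an example, and treat the exact reader calls as an assumption rather than a reference:

```python
# Sketch: count the tensors in a GGUF file so you can compare against the
# number in the llama.cpp error message, and list any RoPE-related metadata
# keys (the thread mentions both a tensor-count mismatch and RoPE fixes).
from gguf import GGUFReader  # pip install gguf

path = "Meta-Llama-3.1-8B-Instruct-Q8_0.gguf"  # example path, adjust to your download

reader = GGUFReader(path)
print(f"{path}: {len(reader.tensors)} tensors")
print("rope-related metadata:", [k for k in reader.fields if "rope" in k])
```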

@omnibookxp

LM Studio just got updated to 0.2.29, which adds support for Llama 3.1 with the RoPE fixes. Go grab it :D

https://lmstudio.ai/

The error is fixed in the new version, LM Studio 0.2.29.

In my case (using the latest llama-server) the VRAM requirement for Q8_0.gguf was unexpectedly large when the model was started with the full 128k-token context window. For an 8-bit quant I expected the VRAM requirement to be similar to the GGUF file size (which was the case with Llama 3.0), but this model version required about 4 times more VRAM (32911 MiB / 81920 MiB on an otherwise empty A100 80GB GPU). It looks like a bug rather than a feature that increasing the context window 16x takes up 4x more VRAM; for other models like Qwen 2 the jump in memory usage with similar increases in context window wasn't that dramatic (double-digit, not triple-digit percentage increases). Reducing the context window to 8k tokens brings VRAM use back to the levels of the previous model version (Llama 3.0 8B: 9683 MiB / 81920 MiB for the 8-bit quant).

This is a feature, not a bug: 128k context is an insane amount and needs a TON of memory to allocate. In fact, I would expect a much larger than 4x increase in VRAM when the context goes up 16x.
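
To put rough numbers on that: the KV cache grows linearly with context length, and at 128k it dwarfs the weights of an 8B model. Here is a back-of-the-envelope sketch, assuming the usual Llama 3 8B shape (32 layers, 8 KV heads of dimension 128) and an fp16 cache; llama.cpp's actual buffers will differ somewhat:

```python
# Rough KV-cache size: 2 (K and V) * layers * kv_heads * head_dim
# * context_length * bytes_per_element.
# Assumed Llama 3 8B shape: 32 layers, 8 KV heads (GQA), head_dim 128, fp16 cache.
def kv_cache_gib(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem / 1024**3

for n_ctx in (8_192, 131_072):
    print(f"{n_ctx:>7} tokens -> ~{kv_cache_gib(n_ctx):.0f} GiB KV cache")
```

That comes out to roughly 1 GiB of cache at 8k versus 16 GiB at 128k, on top of the ~8.5 GiB of Q8_0 weights, which is in the same ballpark as the 9683 MiB and 32911 MiB figures above once llama.cpp's compute buffers are added.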

I updated to the latest version, but the issue still persists.

Which issue? Latest version of what?

I am having the same issue with llama-cpp-python. I tried updating the latter, as suggested on some forums, but no improvement.
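
If anyone else hits this with llama-cpp-python, a sketch of two things worth trying: forcing a clean reinstall so the bundled llama.cpp is actually updated, and loading with a smaller context window; the path and values below are only examples, not a reference:

```python
# Upgrade first so the bundled llama.cpp includes the Llama 3.1 changes:
#   pip install --upgrade --force-reinstall --no-cache-dir llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",  # example path
    n_ctx=8192,       # avoid allocating the full 128k KV cache
    n_gpu_layers=-1,  # offload all layers when VRAM allows
)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=32)
print(out["choices"][0]["text"])
```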

> Which issue? Latest version of what?

LM Studio. It's fixed now. Thanks.
