https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Instruct

#105
by nicoboss - opened

DeepSeek-Coder-V2-Instruct is an MoE model with 236B parameters, allegedly comparable to GPT4-Turbo in code-specific tasks and exceeding it in standard benchmark evaluations. This model might slightly exceed 400 GiB, but let's try the imatrix calculation anyway - I temporarily increased the memory limit to 450 GiB.

It still would not fit. Without tricks, it would just stream continuously from disk, which would take many days.

Anyway, it's in the queue, but I currently have a lock in the script that refuses ggufs > 400GB. I would propose to quant to Q8. Alternatively, I could try to lock most of the file in memory, or try to load a dozen GB or so into VRAM (which requires trial and error because there is no option for that), all of which require modifications to the process.

Wow, crazy. How many GiB is it? I increased the limit to 500 GiB and made the host use less than 3 GiB. No idea if this is enough, but more isn't possible. If it is still not enough, let's just run imatrix on the largest quantized GGUF that fits. I don't think it would make that much of a difference.

The raw tensors alone are 472 GB (not GiB), plus whatever overhead imatrix needs (a few more GB at least), so 500 GiB is more than enough. We could try with 460 GiB (not GB), but I would feel more comfortable with 480 GiB.
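
(For scale: 472 GB ≈ 472/1.074 ≈ 440 GiB, so with a few GB of overhead on top the total working set should land somewhere around 440-460 GiB.)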

Also, I would normally use a Q8_0 - I am just waiting for a model that needs it so I can code something for it :) (There is one other 470GB one in the queue at the moment). But you made it too easy this time...

Let's set it to 500 GiB for now and see how it goes. llama.cpp uses mmap instead of malloc, so I hope it will cache all of it despite using almost all memory - it should, as min_free_kbytes is set to 91875 on the host. Feel free to use this opportunity to imatrix other large models as well. Going beyond 400 GiB is not that hard: I just need to turn off everything else and set zfs_arc_max to a low value. I might soon automate those steps in a script similar to freeResources which you could run to temporarily increase the limit yourself when needed.
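
Conceptually such a script would just do something like this (rough sketch only, not the real thing - the ARC cap and the service name are placeholders):
# Hypothetical sketch: shrink the ZFS ARC and stop non-essential RAM users
# so a >400 GiB imatrix run fits into host memory.
echo $((8 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_max   # cap the ARC at 8 GiB (runtime-tunable; may shrink only gradually)
systemctl stop some-heavy-service.service || true                           # placeholder for the big memory consumers
free -h                                                                     # check what is actually available now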

Yesterday you ran T-800-8B while all the GPUs were in use, which led to it running out of GPU memory. When it got retried an hour later, it failed due to a lack of training input on stdin. I recommend fixing this if you haven't already, or just run /root/freeResources.sh before running imatrix to ensure nothing else is using the GPU.

We hopefully won't need to wait much longer for a model that definitely needs to be quantized prior to imatrix calculation. Rumors say Llama-3-405B might get released in July. There is also Nemotron-4-340B-Instruct, which got released a few days ago, but that one first needs to be converted to safetensors and then made llama.cpp compatible.

mmap does not mean it won't load all data into memory. What mmap means (in this context) is that, for each chunk, Linux will stream the data from disk into memory, and when it runs out of memory, it will faithfully forget the oldest data (the beginning of the file), so when the next chunk is used, it will again read everything from disk. Not using mmap would mean it would crash instead. There is no meaningful caching going on. With "lock into memory" I mean to mlock the beginning of the file into memory (say, the first 450GB), and then only the remaining data will need to be loaded from disk on each chunk.
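
For completeness, a minimal sketch of that partial-lock idea, assuming vmtouch is available - I believe -l mmaps and mlocks the pages and -p restricts it to a byte range, but verify against the man page before relying on it; the path is a placeholder:
MODEL=/tmp/DeepSeek-Coder-V2-Instruct.gguf   # placeholder path
vmtouch -l -p 0-450G "$MODEL"                # lock the first ~450GB; stays running to hold the lock
# needs a high enough memlock limit (ulimit -l), i.e. effectively run as root.
# llama.cpp mmaps the same file, so reads into the locked prefix hit the page
# cache and only the tail beyond 450GB has to stream from disk on every chunk.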

Ah, I was not able to properly test the retry when the cards are in use. My problem is that I have no meaningful rationale for when to run freeResources.sh - I mean, why would I ever run it - I don't know why the memory is in use, and it's your machine. I can solve one side of the problem (when imatrix crashes due to OOM), but it would be cool if I could solve the other side (i.e. you being able to stop the imatrix process when you want to use the resources). So we'd probably need the reverse of freeResources - something that kills imatrix in a way that lets me retry later (would not be hard to provide). In my experience, no matter how generous you are, these things work best when you can use your computer whenever you wish, i.e. when you come into contact with it as little as possible :-)

PS: I wanted to mail you about these issues, but while we are at it... :) Since you commented that electricity is not such a big issue, I adopted a policy where I run small models that somebody is explicitly waiting on at night (or rather, immediately), while the rest waits till 7am. I hope that is OK with you? In other news, the solution we came up with feels absolutely great here - I no longer have to worry as much about whether a model should have i-quants or not (and consequently, my boxes are actually building a queue sometimes), because no manual scheduling from my side is required anymore. And I haven't even mentioned the fact that 99.9% of the models can get their imatrix done at maximum fidelity. While doing the imatrix calculations on my box was not a big deal in practice, it was quite the stress factor for me.

PPS: Regarding mmap, I once had a very busy webserver (around 1998), and it ran into a similar issue - once there were about a thousand simultaneous connections, performance broke down, because the server would go through all connections, read 256kb, then send it to the client. Linux would then read ahead another 256kb, and eventually memory got exhausted and Linux would throw away the read-ahead data, effectively halving performance because all the read-ahead was wasted. My report caused Linux to gain configurable read-ahead, and this solved it for me (by disabling read-ahead completely). Things get suboptimal when memory gets tight.

Please always run freeResources.sh before starting the imatrix calculation. All it does is tell text generation web UI to use CUDA_VISIBLE_DEVICES=1,2 so GPU 0 is no longer used. I couldn't care less, as most normal-sized models still fit into GPU memory and the others I can just partially run in RAM. freeResources.sh will not do anything if the GPU is used for anything important. This is why I created a script notifying the other LXC container instead of telling you to hard-reset the GPU.
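
Conceptually it looks roughly like this (simplified sketch, not the literal script - in particular the notification mechanism and the flag path are just illustrative):
# If GPU 0 runs anything other than text generation web UI, leave it alone.
if nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader -i 0 \
    | grep -qv text-generation; then
    echo "GPU 0 is busy with something important - not touching it."
    exit 0
fi
# Otherwise ask the webui container to restart on GPUs 1 and 2 only, e.g. via
# a flag file on a mount both containers can see (placeholder path).
touch /spool/shared/webui-use-gpu-1-2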

Having some inverse control would be great. You could check if the file /tmp/pause exists, similar to how you check for /tmp/ignoretime. If it exists, finish the current imatrix task but avoid starting any new ones until /tmp/pause is deleted. /tmp is mounted to /spool/subvol-108-disk-0/tmp on the host, so adding/removing files there is quite convenient. I don't think I will use this often, but it will be useful in case I ever need to reserve all GPUs for benchmarks or finetuning.
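
On your side something along these lines would be enough (sketch only - the last line stands in for whatever starts the next queued job):
while [ -e /tmp/pause ]; do
    sleep 300                  # paused: re-check every 5 minutes
done
start_next_imatrix_job         # placeholder for the real job start
From my side, touch /tmp/pause (or touching /spool/subvol-108-disk-0/tmp/pause on the host) pauses new jobs, and rm resumes them.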

Don't worry about the SSD. You should be able to write around 2000 TB to it before it wears out, so it will last for years - way longer than I'm willing to wait before replacing it. Its error correction is so poor that it experiences random bit rot for files not read for more than half a year, so it can only really be used for your exact use case anyway.

While I have not yet needed to reboot the host since we created this setup, make sure everything would still work after a reboot, which will inevitably happen in around 5 weeks, if not earlier.

I like your policy. Running smaller models during nighttime is absolutely no problem for me. Feel free to also run large ones if they are urgent. I really like the current schedule.

Our assumption that imatrix is PCIe bandwidth bottlenecked might be wrong. I measured the PCIe bandwidth utilization 15 minutes into you running: /root/imatrix -ofreq 10 -t 1 -ngl 0 -mg 0 -m /tmp/Venus-103b-v1.0.gguf -o /tmp/Venus-103b-v1.0.imatrix~ -f /dev/fd/0
nvidia-smi dmon -s t
gpu   rxpci   txpci
Idx    MB/s    MB/s
  0      23       3
  0      24       7
  0      15       2
  0      23       3
  0      23       5
If it is PCIe bandwidth bottlenecked I would expect this to show significantly more.

I'm extremely satisfied with the solution we came up with. It just does its thing without causing any issues. Everything works and doesn't impact me in any way. Using the PC while it's doing imatrix calculations feels identical (no lag, no audio crackles, etc.). Even the internet feels identical despite downloading the GGUF files at full speed.

Currently working on a quant here https://huggingface.co/legraphista/DeepSeek-Coder-V2-Instruct-IMat-GGUF

But be aware, I had to requant from Q8. Unfortunately I didn't have enough storage to go from HF to bf16 using temp files (they filled up 1.5TB).

@legraphista I wouldn't worry a second about using a Q8_0; imatrices are such a blunt tool. I would worry more about reducing precision going from f16 to bf16.

@nicoboss Running freeResources now, and I am honoring /tmp/pause (basically: while test -e /tmp/pause; do sleep 300; done). Also, the DeepSeek models are through, and the 470GB model collapsed to 235GB after converting it to GGUF. I get the suspicion that convert-hf-to-gguf no longer chooses the output vector types very intelligently and just converts everything to f16 (or bf16 if lucky).
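
One way to verify that suspicion would be to dump the tensor types of the resulting gguf, e.g. with the dump script that ships in llama.cpp's gguf-py/scripts (the exact script name has changed between versions, so treat this as a sketch; the model path is a placeholder):
python3 llama.cpp/gguf-py/scripts/gguf_dump.py /path/to/model.gguf | less
# prints the metadata plus every tensor with its type - if essentially
# everything shows up as F16/BF16, the suspicion above is confirmed.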

Ah, and I measured memory usage, and it was only ever slightly above 450GiB, and surprisingly not during the actual calculations, where it was at 442GiB - quite a bit less than the tensor data would suggest.

And regarding the PCIe bottleneck: if the units were GB/s, it would be almost exactly PCIe-bottlenecked, and if it really is MB/s, your measurements are wrong, because at that speed, the DeepSeek imatrix calculation would take two months instead of the ~120 minutes it actually took. The numbers are pretty much off by a factor of 1000.
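
(Quick sanity check on that factor: ~120 minutes × 1000 ≈ 120,000 minutes ≈ 83 days, i.e. between two and three months - so the observed runtime is only consistent with the dmon numbers if they are really GB/s.)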

mradermacher changed discussion status to closed

Ah, and regarding reboots, I have no clue how to handle those gracefully, but don't worry, a failed job per month doesn't bother me. It's three commands (cat to check what might have happened, rm to delete the log/status files, and push to restart).

@legraphista I wouldn't worry a second about using a Q8_0; imatrices are such a blunt tool. I would worry more about reducing precision going from f16 to bf16.

no precision loss, the model is already bfloat16

@legraphista ah, good. i was confused by the model name (hf is normally f16).

@mradermacher Most of the models I've seen lately are all bf16, I'm curious if there's a tool to easily compute the std between f16 and bf16 weights.

also, sorry, I meant HF as in hugging face format, not half float 😅

@legraphista uh, that never occurred to me. :-)
