VRAM consumption when using GPU (CUDA)

#37 opened by Sunjay353

I noticed that the VRAM usage increases by around the model size when loading the model, which is expected. However, it then increases again by roughly twice the model size during inference, so total VRAM consumption is approximately three times the model size. Furthermore, this additional memory is not released after inference finishes; it is only freed when the model is unloaded. Is this normal and expected behavior?
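For reference, here is a minimal sketch of how such measurements might be taken, assuming a PyTorch backend with the `transformers` library; `model_id` is a placeholder, not a name from this thread. `memory_allocated` counts live tensors, while `memory_reserved` is what PyTorch's caching allocator holds, which is roughly what `nvidia-smi` reports as used.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def vram(tag: str) -> None:
    # allocated = memory held by live tensors; reserved = memory held by
    # PyTorch's caching allocator (roughly what nvidia-smi shows as used).
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    print(f"{tag}: allocated={allocated:.2f} GB, reserved={reserved:.2f} GB")

model_id = "your-model-id"  # placeholder, substitute the actual model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).cuda()
vram("after load")       # roughly the size of the weights

inputs = tokenizer("Hello, world", return_tensors="pt").to("cuda")
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=256)
vram("after inference")  # reserved typically stays elevated here
```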

Yes, it's normal and expected. Transformers consume memory roughly proportional to the square of the number of tokens in the sequence (because of self-attention), so inference can need substantially more VRAM than the weights alone.
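As for the memory that is not released after inference: assuming a PyTorch backend, the activation and KV-cache tensors are freed once generation returns, but PyTorch's CUDA caching allocator keeps those pages reserved for reuse, so monitoring tools still report them as used until the process exits or the cache is cleared. A minimal sketch of returning that cached memory without unloading the model:

```python
import gc
import torch

# The cached blocks are idle after generation; clearing the cache hands
# them back to the driver while the model weights stay on the GPU.
gc.collect()              # drop lingering Python references to old tensors
torch.cuda.empty_cache()  # release unused cached blocks to the driver

print(f"reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")
```

Note that clearing the cache is usually unnecessary; PyTorch will reuse the reserved memory for the next inference call, and repeatedly emptying the cache can slow things down.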
