How to estimate memory usage?
I would like to use sentence-transformers on a low-end, CPU-only machine to load pre-trained models, such as paraphrase-multilingual-MiniLM-L12-v2, and compute sentence embeddings.
How can I estimate the memory usage? Is there any guideline describing the minimum system requirements for loading pre-trained models?
Hello!
For embedding models, the memory requirement during inference fairly closely matches the size of the weight file (assuming the model is loaded in fp32, the default for loading, and was saved in fp32, the default for saving). For example:
from sentence_transformers import SentenceTransformer
import torch

# Load the model onto the GPU and report the peak memory that PyTorch allocated for it
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2", device="cuda")
print(f"{torch.cuda.max_memory_allocated() / 1024**3:.3f}GB in use after loading model")
0.438GB in use after loading model
I recognize that this is on GPU and not on CPU, but the memory usage for the model itself should be the same between them. So, you can look at the size of the model's weight file on its model page to estimate the memory you'll need.
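If you'd like to measure this directly on a CPU-only machine, here is a rough sketch. It assumes psutil is installed (that's my addition, not something sentence-transformers requires) and reads the process's resident memory before and after loading, plus a back-of-the-envelope estimate from the parameter count:

from sentence_transformers import SentenceTransformer
import psutil

process = psutil.Process()
before = process.memory_info().rss  # resident memory (bytes) before loading

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2", device="cpu")

after = process.memory_info().rss  # resident memory (bytes) after loading
print(f"~{(after - before) / 1024**3:.3f}GB extra in use after loading model")

# Rough estimate from the parameter count: fp32 uses 4 bytes per parameter
weight_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"~{weight_bytes / 1024**3:.3f}GB of model weights")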
You can also load a model on Google Colab with just CPU to see if it works well there. Those are fairly low-end machines as far as I know. E.g.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
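From there, computing a sentence embedding is a one-liner; a minimal example (the sentence and the printed shape are just illustrative):

embedding = model.encode("This is a test sentence.")
print(embedding.shape)  # should be (384,) for this model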
- Tom Aarsen