intfloat/multilingual-e5-large · Memory leak(memory increasing slowly in each inference in cpu)

Nov 3, 2023

Hello,
I used transformers library of intfloat/multilingual-e5-large this model. I used the same code that is shared in its model card. I dockerized it and started to use. It increases the memory in each inference and then it exceeding my memory limit after a while.

Here is my code,

import torch.nn.functional as F
from torch import Tensor, no_grad, cuda, device
from transformers import AutoTokenizer, AutoModel
import gc

class Model():

def __init__(self, path='resources/intfloat_multilingual-e5-large'):
    self.tokenizer = AutoTokenizer.from_pretrained(path)
    self.model = AutoModel.from_pretrained(path)
    dvc = device('cpu')
    self.model.to(dvc)
    self.model.eval()

def average_pool(self, last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

def inference(self, texts): 
    with no_grad():
        batch_dict = self.tokenizer(texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
        #print(batch_dict)
        #print(self.model.config)
        outputs = self.model(**batch_dict)
        embeddings = self.average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
        del outputs
        embeddings = F.normalize(embeddings, p=2, dim=1)
        embeddings = embeddings.numpy().tolist()
        gc.collect()
        cuda.empty_cache()
    return embeddings

model = Model()

Here is my docker stats. Its initially uses around 2.6gb ram in memory. But in each iteration it increases slowly.
Please let me know if I can clear the cache of the memory or in any way I can stop this memory leak.

Thanks

intfloat

Owner Nov 4, 2023

This looks strange, python and pytorch should do GC automatically. Is it possible that you store too many embedding vectors that cause the OOM issue?

skapadia-zalando

16 days ago

I can confirm this issue - my input_texts are only about 100MB in memory but getting embeddings for them takes reaches 200GB (which is where my server crashes)