Getting error: No operator found for `memory_efficient_attention_forward`

#11
by iabhishekofficial - opened

I'm getting this error:
NotImplementedError: No operator found for memory_efficient_attention_forward with inputs:
query : shape=(1, 14074, 16, 64) (torch.float32)
key : shape=(1, 14074, 16, 64) (torch.float32)
value : shape=(1, 14074, 16, 64) (torch.float32)
attn_bias : <class 'xformers.ops.fmha.attn_bias.BlockDiagonalMask'>
p : 0.0
ckF is not supported because:
device=cpu (supported: {'cuda'})
dtype=torch.float32 (supported: {torch.bfloat16, torch.float16})
operator wasn't built - see python -m xformers.info for more info
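
The message itself explains the failure: the only candidate kernel (ckF) requires a CUDA device and float16/bfloat16 tensors, while the inputs here are float32 tensors on CPU. A minimal sketch (nothing model-specific is assumed) reproduces the mismatch:

import torch

# The traceback says ckF supports only device='cuda' and
# dtype in {float16, bfloat16}; on a CPU-only machine neither holds.
print(torch.cuda.is_available())   # False without a CUDA GPU
x = torch.randn(1, 14074, 16, 64)  # same query shape as in the traceback
print(x.device, x.dtype)           # cpu torch.float32 -> no matching operator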

Why does a tiny 400M model even need xformers?!
It is a small embedding model, but it uses xformers, a library that only works on CUDA!
I can only offer workarounds, nothing that solves it directly:

  1. Use Google Colab.
  2. Fake a GPU and CUDA support (I am lazy; I would rather switch models, sorry for that).
  3. Build or compile xformers yourself for CPU. For the same reason, I give up.

So, good luck to you, iabhishekofficial.

StellaEncoder org

@eric0591
If I could go back in time, I definitely wouldn't have used it. I never expected it to be this much trouble, and it has genuinely soured me on it.
When I was selecting the architecture, I only cared about quality and forgot to consider ease of use. Please wait for the v6 release.

Has anyone solved this issue?

@infgrad
Thanks for your quick and responsible reply.
Looking forward to your next update! Good luck to you!
@mohit0928
To solve this, either modify the source code of this model (see the sketch below) or re-compile xformers, the package this model depends on.
Unfortunately, neither is easy work!
Let us wait for the author's next update.
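
For anyone who wants to try the first route, here is a hedged sketch of what that edit could look like, assuming (as later posts in this thread suggest) that the remote modeling code reads `use_memory_efficient_attention` and `unpad_inputs` from the cloned model's config.json; the clone path is hypothetical and this is not an author-confirmed fix:

import json

# Hypothetical path to a local clone of the model repository
cfg_path = "stella_en_400M_v5/config.json"

# Turn off the two xformers-dependent features; the key names are an
# assumption based on this thread, not an author-confirmed patch.
with open(cfg_path) as f:
    cfg = json.load(f)
cfg["use_memory_efficient_attention"] = False
cfg["unpad_inputs"] = False
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)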

StellaEncoder org

@eric0591 @iabhishekofficial @mohit0928

hi, I have solved this problem. The new code has been updated and you can see it on the model card.

The `memory_efficient_attention` and `unpad_inputs` features make inference difficult on some platforms; you can disable them.

Here is an example:

from sentence_transformers import SentenceTransformer

# This model supports two prompts: "s2p_query" and "s2s_query" for sentence-to-passage and sentence-to-sentence tasks, respectively.
# They are defined in `config_sentence_transformers.json`
query_prompt_name = "s2p_query"
queries = [
    "What are some ways to reduce stress?",
    "What are the benefits of drinking green tea?",
]
# docs do not need any prompts
docs = [
    "There are many effective ways to reduce stress. Some common techniques include deep breathing, meditation, and physical activity. Engaging in hobbies, spending time in nature, and connecting with loved ones can also help alleviate stress. Additionally, setting boundaries, practicing self-care, and learning to say no can prevent stress from building up.",
    "Green tea has been consumed for centuries and is known for its potential health benefits. It contains antioxidants that may help protect the body against damage caused by free radicals. Regular consumption of green tea has been associated with improved heart health, enhanced cognitive function, and a reduced risk of certain types of cancer. The polyphenols in green tea may also have anti-inflammatory and weight loss properties.",
]

# !The default dimension is 1024, if you need other dimensions, please clone the model and modify `modules.json` to replace `2_Dense_1024` with another dimension, e.g. `2_Dense_256` or `2_Dense_8192` !
# on gpu
model = SentenceTransformer("dunzhang/stella_en_400M_v5", trust_remote_code=True).cuda()
# You can also use this model without the `use_memory_efficient_attention` and `unpad_inputs` features; it can then run on CPU.
# model = SentenceTransformer(
#     "dunzhang/stella_en_400M_v5",
#     trust_remote_code=True,
#     device="cpu",
#     config_kwargs={"use_memory_efficient_attention": False, "unpad_inputs": False}
# )
query_embeddings = model.encode(queries, prompt_name=query_prompt_name)
doc_embeddings = model.encode(docs)
print(query_embeddings.shape, doc_embeddings.shape)
# (2, 1024) (2, 1024)

similarities = model.similarity(query_embeddings, doc_embeddings)
print(similarities)
# tensor([[0.8398, 0.2990],
#         [0.3282, 0.8095]])

Waiting for your feedback.

There are 5 modifications that need to be made; let me update the transformers code for you.


import os
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.preprocessing import normalize

query_prompt = "Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: "
queries = [
    "What are some ways to reduce stress?",
    "What are the benefits of drinking green tea?",
]
queries = [query_prompt + query for query in queries]
# docs do not need any prompts
docs = [
    "There are many effective ways to reduce stress. Some common techniques include deep breathing, meditation, and physical activity. Engaging in hobbies, spending time in nature, and connecting with loved ones can also help alleviate stress. Additionally, setting boundaries, practicing self-care, and learning to say no can prevent stress from building up.",
    "Green tea has been consumed for centuries and is known for its potential health benefits. It contains antioxidants that may help protect the body against damage caused by free radicals. Regular consumption of green tea has been associated with improved heart health, enhanced cognitive function, and a reduced risk of certain types of cancer. The polyphenols in green tea may also have anti-inflammatory and weight loss properties.",
]

# The path of your model after cloning it
model_dir = "{Your MODEL_PATH}"

vector_dim = 1024
vector_linear_directory = f"2_Dense_{vector_dim}"
model = AutoModel.from_pretrained(model_dir, trust_remote_code=True).cuda().eval()
# You can also use this model without the `use_memory_efficient_attention` and `unpad_inputs` features and ***delete the cuda() calls***; it can then run on CPU.
# model = AutoModel.from_pretrained(model_dir, trust_remote_code=True, use_memory_efficient_attention=False, unpad_inputs=False).eval()
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
vector_linear = torch.nn.Linear(in_features=model.config.hidden_size, out_features=vector_dim)
vector_linear_dict = {
    k.replace("linear.", ""): v for k, v in
    torch.load(os.path.join(model_dir, f"{vector_linear_directory}/pytorch_model.bin")).items()
}
# To run on CPU, you must add the parameter map_location=torch.device('cpu'):
'''
vector_linear_dict = {
    k.replace("linear.", ""): v for k, v in
    torch.load(os.path.join(model_dir, f"{vector_linear_directory}/pytorch_model.bin"), map_location=torch.device('cpu')).items()
}
'''
vector_linear.load_state_dict(vector_linear_dict)
vector_linear.cuda()  # To run on CPU, remove this line

# Embed the queries
with torch.no_grad():
    input_data = tokenizer(queries, padding="longest", truncation=True, max_length=512, return_tensors="pt")
    input_data = {k: v.cuda() for k, v in input_data.items()}  # To run on CPU, remove this line
    attention_mask = input_data["attention_mask"]
    last_hidden_state = model(**input_data)[0]
    last_hidden = last_hidden_state.masked_fill(~attention_mask[..., None].bool(), 0.0)
    query_vectors = last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
    query_vectors = normalize(vector_linear(query_vectors).cpu().numpy())

# Embed the documents
with torch.no_grad():
    input_data = tokenizer(docs, padding="longest", truncation=True, max_length=512, return_tensors="pt")
    input_data = {k: v.cuda() for k, v in input_data.items()}  # To run on CPU, remove this line
    attention_mask = input_data["attention_mask"]
    last_hidden_state = model(**input_data)[0]
    last_hidden = last_hidden_state.masked_fill(~attention_mask[..., None].bool(), 0.0)
    docs_vectors = last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
    docs_vectors = normalize(vector_linear(docs_vectors).cpu().numpy())

print(query_vectors.shape, docs_vectors.shape)
# (2, 1024) (2, 1024)

similarities = query_vectors @ docs_vectors.T
print(similarities)
# [[0.8397531  0.29900077]
#  [0.32818374 0.80954516]]

I tried using the Stella en_400M_v5 embedding model through the HuggingFaceEmbedding function on a MacBook M2 Pro, but I cannot get it to work because of the xformers dependency.

To begin with, installing xformers on Apple Silicon is a real pain (you have to install llvm and fiddle a bit with compilation options to make it work, see here and there), but it is feasible.

Then, when trying to use Stella, xformers crashes with an error like NotImplementedError: No operator found for memory_efficient_attention_forward ... device=mps (supported: {'cuda'}) ..., which is similar to this issue as well as to the discussion above, and is due to the fact that only MPS, not CUDA, is available on Apple Silicon (from what I understood).
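
For context, here is a minimal sketch (assuming a recent PyTorch build with MPS support) of what PyTorch reports on Apple Silicon, which is exactly the combination the xformers kernels reject:

import torch

# On an M-series Mac the Metal backend is available but CUDA is not,
# so any kernel that lists 'cuda' as its only supported device will fail.
print(torch.backends.mps.is_available())  # True on Apple Silicon
print(torch.cuda.is_available())          # False -> device=mps is rejected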

I had a quick look at this model's modeling.py file, and lines 446-453 seem to be the culprit:

        if use_memory_efficient_attention is None:
            use_memory_efficient_attention = self.config.use_memory_efficient_attention
        self.use_memory_efficient_attention = use_memory_efficient_attention
        self.memory_efficient_attention = None if xops is None else xops.memory_efficient_attention
        if self.use_memory_efficient_attention:
            assert self.memory_efficient_attention is not None, 'please install xformers'
        if self.config.unpad_inputs:
            assert self.config.use_memory_efficient_attention, 'unpad only with xformers'

Would there be a way to change this to not use xformers?
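
One possibility, sketched here as an assumption rather than a confirmed fix, is to flip the two config flags before the model is built, so the asserts above never trigger:

from transformers import AutoConfig, AutoModel

# Load the config first, disable the flags that gate the xformers path,
# then hand the modified config to from_pretrained (hedged sketch only).
model_dir = "dunzhang/stella_en_400M_v5"
config = AutoConfig.from_pretrained(model_dir, trust_remote_code=True)
config.use_memory_efficient_attention = False
config.unpad_inputs = False
model = AutoModel.from_pretrained(model_dir, config=config, trust_remote_code=True).eval()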

I'm in the same camp. I was able to install xformers without much trouble, but Stella doesn't seem to honor the kwargs passed in config_kwargs={"use_memory_efficient_attention": False, "unpad_inputs": False}: I get the same error as without them, or it just hangs when I try to pass device="mps".
