Run on cpu

#2
by sharjeel103 - opened

It is using flash attention, which is not supported by the Colab free GPU. Is there any way to run this on CPU?

Unfortunately, the modeling code does not have a fallback attention implementation.

        attn_output = flash_attn_func(
            q=query_states.transpose(1, 2).to(torch.bfloat16),
            k=key_states.transpose(1, 2).to(torch.bfloat16),
            v=value_states.transpose(1, 2).to(torch.bfloat16),
            causal=True)

Maybe you can manually replace it with standard attention in the modeling file (https://huggingface.co/AhmadMustafa/MobiLLama-Urdu-Article-Generation/blob/main/modelling_mobillama.py) and see if it works; a rough sketch of what that swap could look like is below.
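
As a minimal sketch (untested, and assuming query_states, key_states and value_states use the usual (batch, num_heads, seq_len, head_dim) layout that the .transpose(1, 2) calls above imply), the flash_attn_func call could be replaced with PyTorch's built-in scaled dot-product attention, which also runs on CPU:

    import torch.nn.functional as F

    # Hypothetical CPU-friendly replacement for the flash_attn_func call above.
    # Assumes query/key/value_states are (batch, num_heads, seq_len, head_dim).
    attn_output = F.scaled_dot_product_attention(
        query_states,
        key_states,
        value_states,
        is_causal=True,  # same causal masking as causal=True in flash_attn_func
    )
    # flash_attn_func returns (batch, seq_len, num_heads, head_dim), so transpose
    # back to that layout to keep the rest of the forward pass unchanged.
    attn_output = attn_output.transpose(1, 2)

scaled_dot_product_attention needs PyTorch >= 2.0, and on CPU you could also drop the explicit .to(torch.bfloat16) casts, since float32 is usually faster there.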

I pasted the standard attention code, but it seems that changing only that part will not work. Is there any other way to run this? I am looking for an Urdu text model to run on my Raspberry Pi 5, and if this model does not run on CPU it will not work on the Pi.
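
One possible reason (an assumption on my side, not verified here) is that modelling_mobillama.py also imports flash_attn_func at module level, so loading the model still fails with an ImportError on a machine without flash-attn, even after the call itself is swapped out. A minimal sketch of guarding that import:

    # Hypothetical guard for the module-level import in modelling_mobillama.py;
    # assumes the file currently does `from flash_attn import flash_attn_func`
    # unconditionally near the top.
    try:
        from flash_attn import flash_attn_func
        HAS_FLASH_ATTN = True
    except ImportError:
        flash_attn_func = None
        HAS_FLASH_ATTN = False

The attention forward could then take the FlashAttention path only when HAS_FLASH_ATTN is True and fall back to standard attention otherwise.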

I would encourage you to reach out to the original authors of MobiLlama (https://huggingface.co/MBZUAI/MobiLlama-05B). Once you figure out how to run that model on CPU, mine should work the same way, since my model is just built on top of it.

sharjeel103 changed discussion status to closed
