Run on cpu
It uses flash attention, which is not supported by the Colab free GPU. Is there any way to run this on CPU?
Unfortunately, they do not provide a fallback attention implementation; the modeling code calls flash attention directly:
```python
attn_output = flash_attn_func(
    q=query_states.transpose(1, 2).to(torch.bfloat16),
    k=key_states.transpose(1, 2).to(torch.bfloat16),
    v=value_states.transpose(1, 2).to(torch.bfloat16),
    causal=True)
```
Maybe you can manually replace it with standard attention in the modeling file (https://huggingface.co/AhmadMustafa/MobiLLama-Urdu-Article-Generation/blob/main/modelling_mobillama.py) and see if it works; see the sketch below.
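For example, assuming `query_states`, `key_states`, and `value_states` are `(batch, num_heads, seq_len, head_dim)` tensors (the layout the original code transposes before calling `flash_attn_func`), PyTorch's built-in scaled dot-product attention might serve as a starting point for a CPU-friendly replacement. This is only a sketch I have not tested on this model:

```python
import torch
import torch.nn.functional as F

# Untested sketch: replace the flash_attn_func call with PyTorch's native
# scaled dot-product attention, which runs on CPU. No .to(torch.bfloat16)
# cast is needed; keep the model's dtype (float32 on CPU).
attn_output = F.scaled_dot_product_attention(
    query_states,      # (batch, num_heads, seq_len, head_dim)
    key_states,        # if the model uses grouped-query attention, key/value
    value_states,      # heads may need to be repeated to num_heads first
    is_causal=True,    # same causal masking as causal=True in flash_attn_func
)

# flash_attn_func returns (batch, seq_len, num_heads, head_dim), so transpose
# back so the rest of the forward pass sees the shape it expects.
attn_output = attn_output.transpose(1, 2)
```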
I pasted in the standard attention code, but it seems that changing only that part is not enough. Is there any other way to run this? I am looking for an Urdu text model to run on my Raspberry Pi 5, and if this model does not run on CPU it will not work on the Pi either.
I would encourage you to reach out to the original authors of MobiLlama (https://huggingface.co/MBZUAI/MobiLlama-05B); my model is just built on top of theirs, so once you figure out how to run the base model on CPU, mine should follow the same path.
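As a starting point, this is roughly how you might try loading the base model on CPU (untested; the repo's custom modeling code may still import flash_attn at module level, in which case the attention replacement above would still be needed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MBZUAI/MobiLlama-05B"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.float32,  # keep float32 on CPU; bfloat16 is slow or unsupported on many CPUs
).to("cpu")

inputs = tokenizer("Test prompt", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```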