Run on cpu

#2
by sharjeel103 - opened

It is using flash attention, which is not supported by the Colab free GPU. Is there any way to run this on CPU?

Unfortunately, the modeling code does not have a fallback attention implementation.

        attn_output = flash_attn_func(
            q=query_states.transpose(1, 2).to(torch.bfloat16),
            k=key_states.transpose(1, 2).to(torch.bfloat16),
            v=value_states.transpose(1, 2).to(torch.bfloat16),
            causal=True)

Maybe you can manually replace it with standard attention in the modeling file (https://huggingface.co/AhmadMustafa/MobiLLama-Urdu-Article-Generation/blob/main/modelling_mobillama.py) and see if it works; a rough sketch of what that swap could look like is below.
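
As a minimal sketch (untested, and assuming query_states, key_states and value_states use the usual (batch, num_heads, seq_len, head_dim) layout that the .transpose(1, 2) calls above imply), the flash_attn_func call could be replaced with PyTorch's built-in scaled dot-product attention, which also runs on CPU:

    import torch.nn.functional as F

    # Hypothetical CPU-friendly replacement for the flash_attn_func call above.
    # Assumes query/key/value_states are (batch, num_heads, seq_len, head_dim).
    attn_output = F.scaled_dot_product_attention(
        query_states,
        key_states,
        value_states,
        is_causal=True,  # same causal masking as causal=True in flash_attn_func
    )
    # flash_attn_func returns (batch, seq_len, num_heads, head_dim), so transpose
    # back to that layout to keep the rest of the forward pass unchanged.
    attn_output = attn_output.transpose(1, 2)

scaled_dot_product_attention needs PyTorch >= 2.0, and on CPU you could also drop the explicit .to(torch.bfloat16) casts, since float32 is usually faster there.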

I pasted the standard attention code, but it seems that changing only that part will not work. Is there any other way to run this? I am looking for an Urdu text model to run on my Raspberry Pi 5, and if this model does not run on CPU it will not work on the Pi.
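
One possible reason (an assumption on my side, not verified here) is that modelling_mobillama.py also imports flash_attn_func at module level, so loading the model still fails with an ImportError on a machine without flash-attn, even after the call itself is swapped out. A minimal sketch of guarding that import:

    # Hypothetical guard for the module-level import in modelling_mobillama.py;
    # assumes the file currently does `from flash_attn import flash_attn_func`
    # unconditionally near the top.
    try:
        from flash_attn import flash_attn_func
        HAS_FLASH_ATTN = True
    except ImportError:
        flash_attn_func = None
        HAS_FLASH_ATTN = False

The attention forward could then take the FlashAttention path only when HAS_FLASH_ATTN is True and fall back to standard attention otherwise.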

I would encourage you to reach out to the original authors of MobiLlama (https://huggingface.co/MBZUAI/MobiLlama-05B). Once you figure out how to run that model on CPU, mine should work the same way, since my model is just built on top of it.

sharjeel103 changed discussion status to closed
