compatible with Llama

#29
by cArlIcon - opened
No description provided.
richardllin changed pull request status to open
richardllin changed pull request status to merged

Yi-34B's generation became 10x slower on 4xA10 GPUs after replacing YiForCausalLM with LlamaForCausalLM.
Any idea why?

Hi @rodrigo-nogueira, we're not sure what the root cause is, but would you like to give Flash Attention a try by loading the model with use_flash_attention_2=True?

More context can be found here:
https://huggingface.co/docs/transformers/v4.35.2/en/perf_infer_gpu_one#Flash-Attention-2
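
For anyone finding this thread later, here is a minimal sketch of what the suggestion above looks like in code, assuming transformers v4.35.x with flash-attn installed; the model id, dtype, and prompt are illustrative and not taken from this thread:

```python
# Minimal sketch: load the model with Flash Attention 2 enabled
# (transformers v4.35.x; later versions prefer attn_implementation="flash_attention_2").
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "01-ai/Yi-34B"  # illustrative model path

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # FA2 requires fp16/bf16 weights
    device_map="auto",            # shard across the available GPUs
    use_flash_attention_2=True,   # enable Flash Attention 2
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```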

Thank you very much; it is much faster now.
