How to quantize and accelerate this model
I tried FasterTransformer and failed with it. Any ideas?
Yeah, any ideas to accelerate the model? Translation in Google Colab (CPU and GPU) is extremely slow, longer than I can translate manually.
Hi @gembird
Currently the only way to accelerate inference on CPU & GPU is to use the BetterTransformer API for the encoder part of MBart. I believe this will not speed up translation by much, though, since most of the bottleneck is on the decoder side.
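For reference, a minimal sketch of what that conversion could look like (this assumes the optimum package is installed, which backs the to_bettertransformer() helper; the checkpoint name is taken from the example below):

from transformers import MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
# swap the supported (encoder) attention layers for fused BetterTransformer kernels
model = model.to_bettertransformer()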
On GPU, if you are running your translations with batch_size=1, you can try quantization and the fast kernels from bitsandbytes by making sure you load your model with bnb_4bit_compute_dtype=torch.float16:
import torch
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast, BitsAndBytesConfig

# load the model in 4-bit with float16 compute for the matmul kernels
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

article_hi = "संयुक्त राष्ट्र के प्रमुख का कहना है कि सीरिया में कोई सैन्य समाधान नहीं है"
article_ar = "الأمين العام للأمم المتحدة يقول إنه لا يوجد حل عسكري في سوريا."

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt", quantization_config=quantization_config)
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

# translate Hindi to French
tokenizer.src_lang = "hi_IN"
encoded_hi = tokenizer(article_hi, return_tensors="pt").to(model.device)  # move inputs to the model's device
generated_tokens = model.generate(
    **encoded_hi,
    forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"]
)
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "Le chef de l 'ONU affirme qu 'il n 'y a pas de solution militaire dans la Syrie."

# translate Arabic to English
tokenizer.src_lang = "ar_AR"
encoded_ar = tokenizer(article_ar, return_tensors="pt").to(model.device)
generated_tokens = model.generate(
    **encoded_ar,
    forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"]
)
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
More generally, we are currently migrating the attention layers in transformers core to use torch.nn.functional.scaled_dot_product_attention, which should lead to much faster inference. Please have a look at https://github.com/huggingface/transformers/pull/26572 for further details; you will be able to test that feature directly once support has been added for most architectures, including MBart.
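Once that support lands, enabling it should be a matter of requesting the SDPA implementation at load time. A minimal sketch, assuming MBart support has been merged and you are on a recent transformers release with PyTorch 2.x:

import torch
from transformers import MBartForConditionalGeneration

# request the scaled_dot_product_attention ("sdpa") backend when loading the model
model = MBartForConditionalGeneration.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt",
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
)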
Hi @ybelkada, I gave the quantization example you provided a shot and I'm getting a weird result. GPU inference without quantization works fine, but when I add the quantization config I now get something like ['okay', 'okay'] when I run inference on a sample sentence. It seems to be just random tokens, so I'm wondering if there's an issue with the quantization configuration.