How to improve inference runtime performance?
I've attempted several methods, including https://betterprogramming.pub/speed-up-llm-inference-83653aa24c47 and https://huggingface.co/docs/optimum/bettertransformer/tutorials/convert, but it seems like BetterTransformer doesn't support MPT-7B yet. I was wondering whether anyone here has had success with these, or has additional suggestions for improving inference speed. Thanks!
import torch
import transformers
from transformers import AutoModelForCausalLM

name = 'mosaicml/mpt-7b-instruct'
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.init_device = 'cuda:6'  # initialize weights directly on the target GPU
config.max_seq_len = 512

model = AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
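The tokenizer and stopping_criteria used below are set up roughly along these lines (a sketch; stopping on the tokenizer's EOS token is an assumption, so adjust it to your prompt format):

from transformers import AutoTokenizer, StoppingCriteria, StoppingCriteriaList

tokenizer = AutoTokenizer.from_pretrained(name)

# Stop generation as soon as an end-of-text token is produced.
stop_token_ids = [tokenizer.eos_token_id]

class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids, scores, **kwargs) -> bool:
        return input_ids[0, -1].item() in stop_token_ids

stopping_criteria = StoppingCriteriaList([StopOnTokens()])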
generate_text = transformers.pipeline(
    task='text-generation',
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,
    use_fast=True,  # has no effect here since a tokenizer object is passed in
    stopping_criteria=stopping_criteria,
    temperature=0.5,
    top_p=0,
    top_k=0,
    max_new_tokens=1250,
    repetition_penalty=1.0,
)
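For reference, the conversion from the second link boils down to something like the snippet below (a sketch of the Optimum BetterTransformer API), which is what fails for MPT-7B since it isn't a supported architecture yet:

from optimum.bettertransformer import BetterTransformer

# Try to swap in BetterTransformer attention kernels; currently errors out for MPT.
model = BetterTransformer.transform(model)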
Yeah, I'm facing the same issue. Generation in Google Colab on a 15 GB GPU runs at only about 1 token per second, which is really slow. I'm using 4-bit quantisation.
I think it may partly be because I'm not using Triton:
config.attn_config['attn_impl'] = 'triton'
However, using Triton fails when I try it - see here.
BTW, here is the config that is giving me 1 tok/s:
# Load the model in 4-bit so it fits in a free Google Colab runtime (CPU + T4 GPU)
import torch
import transformers
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
config = transformers.AutoConfig.from_pretrained(model_id, trust_remote_code=True)
config.init_device = 'cuda:0'  # Unclear whether this really helps or how it interacts with device_map.
config.max_seq_len = 1024
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    quantization_config=bnb_config,
    device_map='auto',  # 'auto' for inference; device_map={"": 0} for training
    trust_remote_code=True,
    cache_dir=cache_dir,
)
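A minimal way to check the tokens/sec figure with this setup is something like the sketch below (the tokenizer loading and the prompt are placeholders, not part of the config above):

import time
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Explain quantisation in one paragraph.", return_tensors="pt").to(model.device)

start = time.time()
output = model.generate(**inputs, max_new_tokens=100)
elapsed = time.time() - start

new_tokens = output.shape[1] - inputs['input_ids'].shape[1]
print(f"{new_tokens / elapsed:.2f} tokens/sec")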
On the other hand, a back-of-the-envelope estimate: a T4 has about 8 TFLOPS of compute, and we need roughly 1,000 prompt tokens x 7B params x 2 (for multiplication + addition) x ~1/2 (for the quantisation benefit) = 7T floating-point operations per token of output. So maybe 1 tok/s is about right? I'd be interested in whether Triton helps more (quantisation down to 4-bit should give a 4x improvement, not 2x like above).
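Making that arithmetic explicit (same assumptions as above, i.e. treating the full prompt as reprocessed for every generated token):

t4_flops = 8e12                # ~8 TFLOPS of compute on a T4 (assumption from above)
prompt_tokens = 1_000
params = 7e9
flops_per_output_token = prompt_tokens * params * 2 * 0.5  # x2 for multiply+add, x1/2 for quantisation
print(flops_per_output_token)              # 7e12, i.e. ~7 TFLOPs per output token
print(t4_flops / flops_per_output_token)   # ~1.14 tokens per second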
Do you know why my model goes crazy after setting config.attn_config['attn_impl'] = 'triton'? My output turns into gibberish like: ��� exceedsельельителельителしているしている性しているしているしているしているâしているしている性
@sam-mosaic any tips here? Appreciate it, Ronan