Time complexity

#8
by wiccanmind - opened

First of all, thank you for your valuable contribution with this amazing model.
I'm currently running into an issue with the difference in inference latency between two ways of loading the model. I load it with the from_pretrained method as shown below on two Titan RTX 24GB GPUs:
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.bfloat16,
    # load_in_8bit=True,
    device_map="auto",
)
The problem arises when I enable the load_in_8bit parameter (on just one GPU): with the same generation configuration, inference takes roughly twice as long as with bf16 (the 8-bit load is sketched after the numbers below). To be more specific:

  • Original bf16 Time Cost: 10.947212934494019 s
  • load_in_8bit Time Cost: 26.15874981880188 s

Of course, in both scenarios the quality of the generated output appears to be similar. Could you please help me understand where I might be mistaken?
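
For reference, here is a minimal sketch of the 8-bit load: the same call as above, but with load_in_8bit enabled instead of bf16 and the whole model placed on a single GPU (the exact device map here is an assumption):

from transformers import AutoModelForCausalLM

# 8-bit weights via bitsandbytes, with everything mapped onto cuda:0
model_8bit = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    load_in_8bit=True,
    device_map={"": 0},
)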

LLM.int8() is substantially slower than bf16 at this model scale. If you really want to prioritize inference speed with low VRAM usage, you could try 4-bit quantization, which lies somewhere in between in terms of inference speed.
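
For example, here is a minimal sketch of a 4-bit load via bitsandbytes NF4 quantization (BASE_MODEL as in your snippet above; the specific settings are just one reasonable choice, not the only one):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 weights; matrix multiplications are still computed in bf16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
)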

I have tried running inference with 4-bit quantization (this model), but the time cost is even higher (75 s) with the same generation configuration and prompt, and the model only uses around 11 GB of GPU memory. I'm loading it with AutoGPTQForCausalLM as shown below, following the instructions from this:
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/OpenOrcaxOpenChat-Preview2-13B-GPTQ",
    use_safetensors=True,
    trust_remote_code=False,
    device="cuda:0",
    use_triton=False,
    quantize_config=None,
)
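
For reference, the generation and timing are done roughly like this (the prompt and max_new_tokens below are placeholders, not my exact configuration):

import time
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/OpenOrcaxOpenChat-Preview2-13B-GPTQ")
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda:0")

# Synchronize before and after generate so the measured time covers the full GPU work
torch.cuda.synchronize()
start = time.time()
output = model.generate(**inputs, max_new_tokens=512)
torch.cuda.synchronize()
print(f"Time Cost: {time.time() - start} s")
print(tokenizer.decode(output[0], skip_special_tokens=True))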
Could this be an issue with the library, the model itself, or the GPU architecture? I haven't found any similar questions; could you possibly suggest a solution? Thank you so much.
