Why is inference so slow?
#17
opened by hanswang73
Environment:
- NVIDIA A40 (48 GB GPU memory, 80 GB CPU memory)
- CUDA 11.8
- transformers == 4.31.0
- 8-bit quantization
- TextIteratorStreamer for inference

Generation speed is only about 1 token per second.