Why do you set `use_cache=False`? Removing it will speed up generation

#8
by borzunov - opened

Hi,

I wonder why you set `use_cache=False` in config.json?

As far as I understand, this gives identical results to `use_cache=True` for autoregressive models, but it runs the O(n^3) generation algorithm instead of the O(n^2) one (i.e., it re-runs the whole prefix to generate every new token). I think you can significantly speed up generation for this model by removing this line from the config.
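For illustration, here is a minimal sketch (the model ID below is a placeholder, not this repo) showing that `generate()` falls back to the `use_cache` value from config.json, and that it can be overridden per call without editing the config:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model ID for illustration; substitute the actual repo name.
model_id = "your-org/your-llama-model"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

print(model.config.use_cache)  # False, if config.json sets "use_cache": false

inputs = tokenizer("Hello, my name is", return_tensors="pt")

# With use_cache=False, generate() recomputes attention over the whole prefix
# for every new token. Passing use_cache=True at call time re-enables the
# KV cache without touching config.json.
outputs = model.generate(**inputs, max_new_tokens=32, use_cache=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```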

borzunov changed discussion title from Why do you set `use_cache=False`? to Why do you set `use_cache=False`? Removing it will speed up generation

It is needed during the training process. In my opinion, you can change it to `True` for inference.
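A minimal sketch of that train/inference split, assuming the cache was disabled for training (e.g. because gradient checkpointing is incompatible with the KV cache) and using a placeholder model ID:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("your-org/your-llama-model")

# Training: gradient checkpointing and the KV cache don't mix,
# so use_cache is typically forced to False here.
model.gradient_checkpointing_enable()
model.config.use_cache = False
# ... training loop ...

# Inference: switch the cache back on so generate() reuses past key/values.
model.config.use_cache = True
model.eval()
```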
