Optimum-neuron-cache for inference?
I know this feature is designed for use with training, but it seems like the same process could be used for inference.
As it is, if I want to use a pre-compiled model, I need to create a "model" on Hugging Face for every compilation option and core count that I could want.
For example, meta-llama/Llama-2-7b-hf is the main model, but we have compiled versions like aws-neuron/Llama-2-7b-hf-neuron-budget and aws-neuron/Llama-2-7b-hf-neuron-throughput: all the same model, just compiled with a different batch size and number of cores.
It sure would be sweet if we could just reference the original model with the compilation arguments we want (something like the sketch below) and have it grab the precompiled artifacts if they already exist. Otherwise, we are going to end up with a lot of different "models" covering compilation options for Llama-2 versions, CodeLlama versions, Mistral versions...
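Purely hypothetical sketch of what I mean, with the compilation arguments (`batch_size`, `sequence_length`, `num_cores`, `auto_cast_type`) borrowed from the existing export API:

```python
from optimum.neuron import NeuronModelForCausalLM

# Hypothetical usage: reference the original model id plus the compilation
# options, and let the library fetch matching precompiled artifacts from a
# shared cache instead of needing a dedicated "-neuron-budget" or
# "-neuron-throughput" repo per configuration.
model = NeuronModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    export=True,           # ideally: compile only if no cached artifacts match
    batch_size=1,
    sequence_length=2048,
    num_cores=2,
    auto_cast_type="fp16",
)
```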
Apparently, life is sweet. Thank you.
https://huggingface.co/docs/optimum-neuron/pr_429/en/guides/cache_system
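Per that guide, the compilation cache lives in a Hub repo, so a rough sketch for inference could look like the following. Treat it as illustrative: the `CUSTOM_CACHE_REPO` variable and the repo name `my-org/optimum-neuron-cache` are assumptions based on the guide and may differ between versions.

```python
import os

# Optionally point optimum-neuron at a custom/private cache repo instead of
# the default public one (variable name per the cache guide; may vary).
os.environ["CUSTOM_CACHE_REPO"] = "my-org/optimum-neuron-cache"

from optimum.neuron import NeuronModelForCausalLM

# On export, the cache is checked for artifacts compiled with the same
# configuration; on a hit they are downloaded instead of recompiling,
# on a miss the model is compiled locally.
model = NeuronModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    export=True,
    batch_size=1,
    sequence_length=2048,
    num_cores=2,
    auto_cast_type="fp16",
)
model.save_pretrained("llama-2-7b-neuron")  # keep a local copy of the compiled model
```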
You're welcome. This was indeed a much-needed feature for inference as well.