Inference Performance Optimization for Large Language Models on CPUs
Abstract
Large language models (LLMs) have shown exceptional performance and vast potential across diverse tasks. However, deploying LLMs with high performance in resource-constrained environments has drawn significant attention in industry. When GPU hardware is limited, CPUs offer an alternative, and optimizing inference performance is necessary to reduce cost and relieve hardware constraints. In this paper, we introduce an easily deployable inference performance optimization solution for accelerating LLMs on CPUs. In this solution, we implement an effective way to reduce the KV cache size while preserving precision. We propose a distributed inference optimization approach and implement it on top of the oneAPI Collective Communications Library (oneCCL). Furthermore, we propose general optimization approaches for LLMs on CPUs and apply tailored optimizations to the most commonly used models. The code is open-sourced at https://github.com/intel/xFasterTransformer.
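To make the KV-cache claim concrete, below is a minimal, self-contained C++ sketch of one common way to shrink the KV cache: per-row INT8 quantization with a single float scale per row, which cuts KV memory to roughly 1/4 of FP32 (or 1/2 of BF16) at a small accuracy cost. The function names and toy dimensions here are illustrative assumptions, not xFasterTransformer's actual implementation, whose exact scheme may differ.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Quantize one row of `dim` floats into int8 plus a single float scale.
static float quantize_row(const float* src, int8_t* dst, int dim) {
    float max_abs = 0.0f;
    for (int i = 0; i < dim; ++i) max_abs = std::max(max_abs, std::fabs(src[i]));
    float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    for (int i = 0; i < dim; ++i)
        dst[i] = static_cast<int8_t>(std::lround(src[i] / scale));
    return scale;
}

// Dequantize back to float when the cached row is consumed by attention.
static void dequantize_row(const int8_t* src, float scale, float* dst, int dim) {
    for (int i = 0; i < dim; ++i) dst[i] = src[i] * scale;
}

int main() {
    const int head_dim = 8;  // toy head dimension for illustration
    std::vector<float> k = {0.12f, -1.7f, 0.9f, 3.2f, -0.05f, 2.1f, -2.8f, 0.6f};
    std::vector<int8_t> k_q(head_dim);
    std::vector<float> k_deq(head_dim);

    float scale = quantize_row(k.data(), k_q.data(), head_dim);
    dequantize_row(k_q.data(), scale, k_deq.data(), head_dim);

    for (int i = 0; i < head_dim; ++i)
        std::printf("orig=% .3f  deq=% .3f\n", k[i], k_deq[i]);
    return 0;
}
```

Quantizing at cache-write time and dequantizing when attention reads the cache keeps each stored value at one byte plus a shared per-row scale, which is what lets longer contexts and larger batches fit in CPU memory.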
Community
Would these performance gains be useful on a single CPU with a batch size of 1, or would that see insignificant gains compared to a multi-CPU, high-batch-count setup? Cheers
Yes, you are right: this solution benefits both a single CPU and CPU server clusters. If your model is small, you can leverage just one socket; if your model is big, like 70B, and you have a good network connection, you can use the whole CPU cluster to run inference across multiple servers. BTW, CPU servers have a large memory capacity, so large batches are supported too.
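For readers curious what "inference across multiple servers" looks like in code, here is a minimal host-side oneCCL allreduce sketch following the standard oneCCL CPU sample pattern (MPI is used only to bootstrap ranks and exchange the KVS address). It illustrates the tensor-parallel pattern where each rank holds a weight shard and the partial outputs are summed; it is not code taken from xFasterTransformer, and the build/run commands and buffer sizes are assumptions.

```cpp
// Build (assumption): mpicxx -std=c++17 allreduce_sketch.cpp -lccl -o allreduce_sketch
// Run   (assumption): mpirun -n 2 ./allreduce_sketch
#include <iostream>
#include <vector>

#include <mpi.h>
#include "oneapi/ccl.hpp"

int main(int argc, char* argv[]) {
    ccl::init();

    MPI_Init(&argc, &argv);
    int size = 0, rank = 0;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Exchange the main KVS address so all ranks join the same oneCCL communicator.
    ccl::shared_ptr_class<ccl::kvs> kvs;
    ccl::kvs::address_type main_addr;
    if (rank == 0) {
        kvs = ccl::create_main_kvs();
        main_addr = kvs->get_address();
        MPI_Bcast(main_addr.data(), main_addr.size(), MPI_BYTE, 0, MPI_COMM_WORLD);
    } else {
        MPI_Bcast(main_addr.data(), main_addr.size(), MPI_BYTE, 0, MPI_COMM_WORLD);
        kvs = ccl::create_kvs(main_addr);
    }
    auto comm = ccl::create_communicator(size, rank, kvs);

    // Each rank pretends to hold the partial projection output of its weight shard.
    const size_t count = 4096;  // e.g. one token's hidden-state slice
    std::vector<float> partial(count, static_cast<float>(rank) + 1.0f);
    std::vector<float> summed(count, 0.0f);

    // Sum the partial results across ranks: the core tensor-parallel collective.
    ccl::allreduce(partial.data(), summed.data(), count, ccl::reduction::sum, comm).wait();

    if (rank == 0)
        std::cout << "allreduce done, summed[0] = " << summed[0] << std::endl;

    MPI_Finalize();
    return 0;
}
```

In a real tensor-parallel run this allreduce happens after every attention and MLP block, so the interconnect bandwidth between sockets or servers directly bounds how well the cluster scales.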
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Inference Optimization of Foundation Models on AI Accelerators (2024)
- InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management (2024)
- The Solution for the AIGC Inference Performance Optimization Competition (2024)
- New Solutions on LLM Acceleration, Optimization, and Application (2024)
- PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference (2024)