KV Cache for compress_kv or key-value states

#1
by House-99 - opened

From the tech report, at inference time the KV cache should store compress_kv. But in modeling_deepseek.py, I notice that the key states and value states are still cached just like in Llama-2. Is something wrong here?
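For reference, the gap being pointed out can be sketched roughly as follows. This is a minimal, self-contained PyTorch illustration with made-up dimensions and hypothetical module names (kv_down, kv_up_k, kv_up_v), not the actual modeling_deepseek.py code: a Llama-2-style attention layer caches the full per-head key/value states, whereas the MLA formulation in the tech report only needs to cache the low-rank latent compressed_kv (plus a small decoupled RoPE key, omitted here) and can up-project it at attention time.

```python
import torch
import torch.nn as nn

# Illustrative dimensions only, not the real DeepSeek-V2 config.
hidden_size  = 1024
num_heads    = 8
head_dim     = 128
kv_lora_rank = 64        # rank of the compressed KV latent
bsz, seq_len = 1, 16

x = torch.randn(bsz, seq_len, hidden_size)

# --- (a) Llama-2-style cache: store full per-head K/V states -------------
k_proj = nn.Linear(hidden_size, num_heads * head_dim, bias=False)
v_proj = nn.Linear(hidden_size, num_heads * head_dim, bias=False)
key_states   = k_proj(x).view(bsz, seq_len, num_heads, head_dim)
value_states = v_proj(x).view(bsz, seq_len, num_heads, head_dim)
# Cached per token: 2 * num_heads * head_dim = 2048 values.
full_cache = (key_states, value_states)

# --- (b) MLA-style cache (sketch): store only the compressed latent ------
kv_down = nn.Linear(hidden_size, kv_lora_rank, bias=False)           # down-projection
kv_up_k = nn.Linear(kv_lora_rank, num_heads * head_dim, bias=False)  # K up-projection
kv_up_v = nn.Linear(kv_lora_rank, num_heads * head_dim, bias=False)  # V up-projection

compressed_kv = kv_down(x)            # (bsz, seq_len, kv_lora_rank)
# Cached per token: kv_lora_rank = 64 values (plus the decoupled RoPE key
# in the real architecture, omitted here for brevity).
latent_cache = compressed_kv

# At attention time, K/V can be reconstructed from the cached latent:
k_rec = kv_up_k(latent_cache).view(bsz, seq_len, num_heads, head_dim)
v_rec = kv_up_v(latent_cache).view(bsz, seq_len, num_heads, head_dim)

print(full_cache[0].numel() + full_cache[1].numel())  # full K/V cache size
print(latent_cache.numel())                           # compressed latent cache size
```

With these toy numbers the per-token cache shrinks from 2048 to 64 values; the released modeling code follows pattern (a), which is what the question is about.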


Does modeling_deepseek.py merely contain executable code to accompany the open-source weights, without an actual implementation of the compressed KV cache? I've also noticed that the DeepseekV2MoE implementation lacks support for expert parallelism (EP) during training.

DeepSeek org

Thank you for your interest in our work. We are aware of the challenges in implementing KV compression on top of the current open-source code and are actively working on it. The Hugging Face code is not as efficient as we would like, so we are developing a new open-source implementation based on vLLM for better performance. The vLLM code, including KV compression, will be released once it is ready.

Did DeepSeek-V2 open-source their training implementation? I haven't found a link yet.

Apologies for the confusion; I meant the implementation in modeling_deepseek.py.

@msr2000 Thank you for your contribution to the open-source community.
I am working on a project where we'd like to use DeepSeek-V2 as the base model, and inference speed has become a bottleneck for us.
Do you have any ETA for open-sourcing the efficient vLLM inference code?
