DeepSeek V2 is a big deal, and not only because of its significant improvements to both key components of the Transformer: the attention layer and the FFN layer.
It has also completely disrupted the Chinese LLM market, forcing competitors to drop their prices to 1% of what they were.
---
There are two key components in Transformer architecture: the self-attention layer, which captures relationships between tokens in context, and the Feed-Forward Network (FFN) layer, which stores knowledge.
DeepSeek V2 introduces optimizations to both:
The attention layer normally uses a KV cache to avoid repeated computation, but the cache consumes significant GPU memory, which limits the number of concurrent requests. DeepSeek V2 introduces Multi-head Latent Attention (MLA), which caches only a small latent representation per token, resulting in substantial memory savings.
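To get a feel for where the savings come from, here is a back-of-the-envelope comparison between caching full per-head keys and values versus caching a single small latent vector per token. The layer count, head count, and latent size below are illustrative assumptions for the sketch, not DeepSeek V2's exact configuration.

```python
# Rough KV-cache memory comparison: standard multi-head attention vs.
# an MLA-style latent cache. All dimensions are assumed for illustration.

BYTES_PER_VALUE = 2        # fp16/bf16
N_LAYERS = 60              # assumed layer count
N_HEADS = 128              # assumed number of attention heads
HEAD_DIM = 128             # assumed per-head dimension
LATENT_DIM = 512           # assumed size of the cached latent per token

def standard_kv_cache_bytes(seq_len: int) -> int:
    """Cache full K and V for every head at every layer."""
    per_token = N_LAYERS * N_HEADS * HEAD_DIM * 2  # K and V
    return seq_len * per_token * BYTES_PER_VALUE

def latent_cache_bytes(seq_len: int) -> int:
    """Cache only one small latent vector per token per layer;
    K and V are reconstructed from it at attention time."""
    per_token = N_LAYERS * LATENT_DIM
    return seq_len * per_token * BYTES_PER_VALUE

if __name__ == "__main__":
    tokens = 32_000
    full = standard_kv_cache_bytes(tokens)
    latent = latent_cache_bytes(tokens)
    print(f"standard KV cache: {full / 1e9:.1f} GB")
    print(f"latent cache:      {latent / 1e9:.1f} GB")
    print(f"reduction:         {full / latent:.0f}x")
```

With these made-up numbers the latent cache is tens of times smaller, which is exactly what frees up GPU memory for more concurrent requests.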
For the FFN layer, DeepSeek V2 uses 162 experts instead of the usual 8 in Mixtral. Segmenting experts at this finer granularity allows higher specialization and more accurate knowledge acquisition, while activating only a small subset of experts for each token keeps processing efficient.
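The sketch below shows the basic routing idea: many small experts, with only the top-k activated per token. The expert count, top-k, and dimensions are assumptions for illustration; DeepSeek V2's actual design differs in detail (for example, it also keeps a few always-active shared experts alongside the routed ones).

```python
# Minimal sketch of fine-grained mixture-of-experts routing:
# many small experts, only a handful activated per token.
import numpy as np

N_EXPERTS = 160      # assumed number of routed experts
TOP_K = 6            # assumed experts activated per token
D_MODEL = 1024       # assumed hidden size
D_EXPERT = 128       # assumed per-expert FFN width (finer-grained, so smaller)

rng = np.random.default_rng(0)
router_w = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.02
experts_w1 = rng.standard_normal((N_EXPERTS, D_MODEL, D_EXPERT)) * 0.02
experts_w2 = rng.standard_normal((N_EXPERTS, D_EXPERT, D_MODEL)) * 0.02

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route a single token vector x (shape [D_MODEL]) to its top-k experts."""
    logits = x @ router_w                          # affinity score per expert
    top = np.argsort(logits)[-TOP_K:]              # pick the k highest-scoring experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                           # softmax over the selected experts
    out = np.zeros_like(x)
    for gate, e in zip(gates, top):
        h = np.maximum(x @ experts_w1[e], 0.0)     # small expert FFN (ReLU)
        out += gate * (h @ experts_w2[e])
    return out

token = rng.standard_normal(D_MODEL)
y = moe_forward(token)
print(y.shape)  # (1024,) -- only 6 of 160 experts did any work for this token
```

Only the selected experts' weights participate in the forward pass, so compute per token stays roughly constant even as the total parameter count grows with the number of experts.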
It disrupted the market by dropping API prices to $0.14 per 1M tokens. This dramatic reduction forced competitors like GLM, Ernie, and Qwen to follow suit, lowering their prices to 1% of their original offerings. Users can now access these APIs at 1/35th the cost of ChatGPT-4o.