[PaperReading] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (MLA Part)

Starting from Multi-head Latent Attention (MLA), this post surveys the mainstream attention frameworks used in LLMs.


Overview

3W1H: What challenge? Why this? What intuition? How to do?

  • What challenge - the huge spatial and temporal cost of attention at inference time, caused by the KV cache.
  • Why this - other solutions (e.g., MQA and GQA) have to sacrifice model performance to reduce the KV cache.
  • What intuition - key and value vectors can be jointly compressed into a low-rank latent vector.
  • How to do - a down-projection that compresses keys and values into a shared latent vector, plus up-projections that recover them for attention.

Figure: The architecture of DeepSeek-V2 (from the paper).

Figure: MLA vs. other attention frameworks (from the paper).

Attention Mechanisms

Preliminary: Traditional MHA

Let $d$ be the embedding dimension (for each token), $n_h$ be the number of attention heads, $d_h$ be the dimension per head, and $\mathbf{h}_t \in \mathbb{R}^{d}$ be the attention input of the $t$-th token at an attention layer. To produce the query, key, and value vectors in MHA:

$$\mathbf{q}_t = W^{Q}\mathbf{h}_t, \quad \mathbf{k}_t = W^{K}\mathbf{h}_t, \quad \mathbf{v}_t = W^{V}\mathbf{h}_t,$$

where $W^{Q}, W^{K}, W^{V} \in \mathbb{R}^{d_h n_h \times d}$ are the projection matrices. Then $\mathbf{q}_t$, $\mathbf{k}_t$, and $\mathbf{v}_t$ are sliced into $n_h$ heads for the multi-head computation:

$$\mathbf{o}_{t,i} = \sum_{j=1}^{t} \mathrm{Softmax}_j\!\left(\frac{\mathbf{q}_{t,i}^{\top}\mathbf{k}_{j,i}}{\sqrt{d_h}}\right)\mathbf{v}_{j,i}.$$

Let the concatenated output of the heads be $[\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; \ldots; \mathbf{o}_{t,n_h}]$. The output of the current attention layer is

$$\mathbf{u}_t = W^{O}\,[\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; \ldots; \mathbf{o}_{t,n_h}].$$

During inference, $\mathbf{k}_t$ and $\mathbf{v}_t$ of every past token must be cached, i.e., $2 n_h d_h$ elements per token per layer.
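
To make the shapes concrete, here is a minimal sketch of MHA decoding with an explicit KV cache. PyTorch is assumed; the toy dimensions, weight initialization, and function name `mha_step` are illustrative, not the paper's implementation.

```python
# Minimal MHA decoding step with an explicit KV cache (toy sizes, illustrative names).
import math
import torch

d, n_h, d_h = 512, 8, 64                       # embedding dim, #heads, dim per head
W_Q = torch.randn(n_h * d_h, d) / math.sqrt(d)
W_K = torch.randn(n_h * d_h, d) / math.sqrt(d)
W_V = torch.randn(n_h * d_h, d) / math.sqrt(d)
W_O = torch.randn(d, n_h * d_h) / math.sqrt(n_h * d_h)

k_cache, v_cache = [], []                      # k_t and v_t of every past token are cached

def mha_step(h_t: torch.Tensor) -> torch.Tensor:
    """One decoding step for the t-th token; h_t has shape (d,)."""
    q = (W_Q @ h_t).view(n_h, d_h)
    k_cache.append((W_K @ h_t).view(n_h, d_h))
    v_cache.append((W_V @ h_t).view(n_h, d_h))
    K = torch.stack(k_cache)                   # (t, n_h, d_h): grows with sequence length
    V = torch.stack(v_cache)
    attn = torch.softmax((K * q).sum(-1) / math.sqrt(d_h), dim=0)  # (t, n_h)
    o = (attn.unsqueeze(-1) * V).sum(0)        # (n_h, d_h), one output per head
    return W_O @ o.reshape(n_h * d_h)          # u_t = W^O [o_{t,1}; ...; o_{t,n_h}]

u_t = mha_step(torch.randn(d))                 # cache: 2 * n_h * d_h floats per token per layer
```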

The Core of MLA: Low-Rank Key-Value Joint Compression

In MLA, the key and value information saved in $\mathbf{k}_t$ and $\mathbf{v}_t$ is jointly compressed into a latent vector $\mathbf{c}_t^{KV}$:

$$\mathbf{c}_t^{KV} = W^{DKV}\mathbf{h}_t, \quad \mathbf{k}_t^{C} = W^{UK}\mathbf{c}_t^{KV}, \quad \mathbf{v}_t^{C} = W^{UV}\mathbf{c}_t^{KV},$$

where $W^{DKV} \in \mathbb{R}^{d_c \times d}$ is the down-projection matrix and $W^{UK}, W^{UV} \in \mathbb{R}^{d_h n_h \times d_c}$ are the up-projection matrices for keys and values. Let $d_c \ll d_h n_h$. Then only $\mathbf{c}_t^{KV}$ needs to be cached, and the information is compressed. For example, with DeepSeek-V2's settings $n_h = d_h = 128$ and $d_c = 4 d_h = 512$, traditional MHA caches $2 n_h d_h = 32768$ values per token per layer, but MLA caches only $d_c = 512$.
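
Below is a matching sketch of the joint compression path, with the same caveats (toy dimensions, illustrative names; the paper's decoupled RoPE branch is omitted for brevity):

```python
# Low-rank KV joint compression: cache only c_t^KV instead of full k_t and v_t.
import math
import torch

d, n_h, d_h, d_c = 512, 8, 64, 32              # d_c << n_h * d_h
W_DKV = torch.randn(d_c, d) / math.sqrt(d)                # down-projection
W_UK  = torch.randn(n_h * d_h, d_c) / math.sqrt(d_c)      # up-projection for keys
W_UV  = torch.randn(n_h * d_h, d_c) / math.sqrt(d_c)      # up-projection for values

c_cache = []                                   # only the latent vectors are cached

def mla_compress_step(h_t: torch.Tensor):
    c_kv = W_DKV @ h_t                         # c_t^KV, shape (d_c,)
    c_cache.append(c_kv)
    k_t = (W_UK @ c_kv).view(n_h, d_h)         # k_t^C recovered at attention time
    v_t = (W_UV @ c_kv).view(n_h, d_h)         # v_t^C recovered at attention time
    return k_t, v_t

k_t, v_t = mla_compress_step(torch.randn(d))
print(c_cache[0].numel(), "cached floats vs", 2 * n_h * d_h, "for MHA")  # 32 vs 1024
```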

A similar compression policy is also applied to the query vectors in the training of DeepSeek-V2; even though it cannot reduce the KV cache, it reduces the activation memory during training:

$$\mathbf{c}_t^{Q} = W^{DQ}\mathbf{h}_t, \quad \mathbf{q}_t^{C} = W^{UQ}\mathbf{c}_t^{Q},$$

where $\mathbf{c}_t^{Q} \in \mathbb{R}^{d_c'}$ and $d_c' \ll d_h n_h$.

In this way, $\mathbf{c}_t^{Q}$ is the compressed representation of the token embedding $\mathbf{h}_t$.
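
For completeness, a sketch of the analogous query compression; again the names and dimensions are illustrative:

```python
# Query compression: reduces activation memory during training, not the KV cache.
import math
import torch

d, n_h, d_h, d_c_q = 512, 8, 64, 32
W_DQ = torch.randn(d_c_q, d) / math.sqrt(d)
W_UQ = torch.randn(n_h * d_h, d_c_q) / math.sqrt(d_c_q)

h_t = torch.randn(d)
c_q = W_DQ @ h_t                               # c_t^Q: compressed latent, never cached
q_t = (W_UQ @ c_q).view(n_h, d_h)              # q_t^C: recovered multi-head query
```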

Finally, we present a comparison between popular attention structures.

Table. A comparison between various attention structures ($n_g$: number of GQA key-value groups; $l$: number of layers; $d_h^R$: per-head dimension of MLA's decoupled RoPE key).

| Property | MHA | GQA | MQA | MLA |
| --- | --- | --- | --- | --- |
| #Key to #query per head | 1 to 1 | 1 to a group | 1 to all | 1 to 1 |
| How to reduce KV cache | / | Reduce the number of KV heads | Keep a single KV head | Compress KV into a latent vector |
| KV cache per token | $2 n_h d_h l$ | $2 n_g d_h l$ | $2 d_h l$ | $(d_c + d_h^R) l$ |
| Capability | Strong | Moderate | Weak | Stronger |
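
Plugging concrete numbers into the table's formulas gives a feel for the gap. The sketch below assumes DeepSeek-V2-like settings ($n_h = d_h = 128$, $l = 60$, $d_c = 4 d_h$, $d_h^R = d_h / 2$); the GQA group count $n_g = 8$ is purely illustrative.

```python
# Per-token KV-cache elements, following the table's formulas.
n_h, d_h, l = 128, 128, 60     # heads, dim per head, layers (DeepSeek-V2-like)
n_g = 8                        # GQA groups (illustrative)
d_c, d_h_R = 4 * d_h, d_h // 2 # MLA latent dim and decoupled RoPE key dim

print("MHA:", 2 * n_h * d_h * l)   # 1966080
print("GQA:", 2 * n_g * d_h * l)   # 122880
print("MQA:", 2 * d_h * l)         # 15360
print("MLA:", (d_c + d_h_R) * l)   # 34560
```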

One More Thing: Traditional Methods for Low-Rank Approximation of Matrices

Here I list some traditional numerical methods for low-rank approximation of matrices.

  • Singular Value Decomposition (SVD): The fundamental approach that decomposes a matrix as $A = U \Sigma V^{\top}$, where $U$ and $V$ are orthogonal matrices and $\Sigma$ contains the singular values. Truncating to the $k$ largest singular values gives the optimal rank-$k$ approximation (see the sketch after this list).
  • QR Decomposition with Column Pivoting: Factors $A$ into $A = QR$, where $Q$ is orthogonal and $R$ is upper triangular. The pivoting strategy identifies dominant columns for the approximation.
  • Randomized SVD: Uses random sampling to identify a subspace capturing the range of $A$, then projects the matrix onto this subspace to compute a smaller SVD.
  • CUR Decomposition: Approximates $A \approx CUR$, where $C$ consists of selected columns, $R$ consists of selected rows, and $U$ is typically computed as a pseudoinverse.
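
As a quick illustration of the first method, here is a minimal numpy sketch of truncated SVD on a random matrix (the sizes and target rank are arbitrary):

```python
# Truncated SVD: the optimal rank-k approximation in the Frobenius norm.
import numpy as np

A = np.random.randn(100, 64)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 8
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # keep only the k largest singular values

err = np.linalg.norm(A - A_k) / np.linalg.norm(A)
print(f"relative Frobenius error at rank {k}: {err:.3f}")
```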

Why can't we apply these methods to MLA? I suppose the reason is that they cannot be integrated with end-to-end data fitting: traditional low-rank approximation vs. MLA is like SVD vs. an autoencoder. The latter compresses information in order to maximize the task objective, as the contrast sketch below illustrates.
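
A toy contrast, with illustrative names and a reconstruction loss standing in for the real task objective: MLA's down- and up-projections are ordinary trainable layers, so the compression is shaped by gradient descent rather than fixed algebra.

```python
# Learned low-rank compression (autoencoder-like), unlike a one-shot SVD factorization.
import torch
import torch.nn as nn

d, d_c = 512, 32
down = nn.Linear(d, d_c, bias=False)           # analogue of W^DKV
up   = nn.Linear(d_c, d, bias=False)           # analogue of W^UK / W^UV
opt  = torch.optim.SGD(list(down.parameters()) + list(up.parameters()), lr=1e-2)

h = torch.randn(64, d)                         # a batch of token embeddings
for _ in range(10):
    loss = ((up(down(h)) - h) ** 2).mean()     # toy stand-in for the task loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```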

Reference

  1. DeepSeek-AI. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv preprint arXiv:2405.04434, 2024. https://arxiv.org/abs/2405.04434