Starting from Multi-head Latent Attention (MLA), this post walks through the mainstream attention frameworks used in LLMs.
Overview
3W1H: What challenge? Why this? What intuition? How to do?
- What challenge - the huge memory and latency cost of attention at inference time caused by the KV cache.
- Why this - existing solutions such as MQA and GQA reduce the KV cache only at the cost of model quality.
- What intuition - the key and value vectors can be jointly compressed into a low-rank latent vector.
- How to do - compress the keys and values into a shared latent vector with a down-projection, and recover them with up-projections.
MLA vs. Other Attention Frameworks
Preliminary: Traditional MHA
Let $d$ be the embedding dimension, $n_h$ the number of attention heads, $d_h$ the dimension per head, and $\mathbf{h}_t \in \mathbb{R}^{d}$ the input hidden state of the $t$-th token. Standard MHA produces the queries, keys, and values through three projection matrices $W^{Q}, W^{K}, W^{V} \in \mathbb{R}^{d_h n_h \times d}$:

$$\mathbf{q}_t = W^{Q}\mathbf{h}_t, \qquad \mathbf{k}_t = W^{K}\mathbf{h}_t, \qquad \mathbf{v}_t = W^{V}\mathbf{h}_t.$$

Then $\mathbf{q}_t, \mathbf{k}_t, \mathbf{v}_t$ are split into $n_h$ heads, attention is computed within each head, and the head outputs are concatenated and projected back by $W^{O} \in \mathbb{R}^{d \times d_h n_h}$:

$$\mathbf{o}_{t,i} = \sum_{j=1}^{t} \operatorname{Softmax}_j\!\left(\frac{\mathbf{q}_{t,i}^{\top}\mathbf{k}_{j,i}}{\sqrt{d_h}}\right)\mathbf{v}_{j,i}, \qquad \mathbf{u}_t = W^{O}\,[\mathbf{o}_{t,1};\mathbf{o}_{t,2};\dots;\mathbf{o}_{t,n_h}].$$

During inference, all keys and values must be cached to avoid recomputation, so MHA stores $2 n_h d_h l$ elements per token for a model with $l$ layers.
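The following is a minimal numpy sketch of one MHA decoding step with an explicit KV cache; the helper `mha_step` and all dimensions are illustrative, not DeepSeek-V2's actual configuration.

```python
# Minimal sketch of MHA decoding with an explicit per-token KV cache.
# All names and sizes here are illustrative.
import numpy as np

d, n_h, d_h = 512, 8, 64                       # model dim, #heads, per-head dim
rng = np.random.default_rng(0)
W_Q = rng.standard_normal((n_h * d_h, d)) / np.sqrt(d)
W_K = rng.standard_normal((n_h * d_h, d)) / np.sqrt(d)
W_V = rng.standard_normal((n_h * d_h, d)) / np.sqrt(d)
W_O = rng.standard_normal((d, n_h * d_h)) / np.sqrt(n_h * d_h)

k_cache, v_cache = [], []                      # 2 * n_h * d_h floats stored per token

def mha_step(h_t):
    """One decoding step for the hidden state h_t of shape (d,)."""
    q = (W_Q @ h_t).reshape(n_h, d_h)
    k_cache.append((W_K @ h_t).reshape(n_h, d_h))
    v_cache.append((W_V @ h_t).reshape(n_h, d_h))
    K, V = np.stack(k_cache), np.stack(v_cache)            # (t, n_h, d_h)
    scores = np.einsum('hd,thd->ht', q, K) / np.sqrt(d_h)
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)                 # softmax over cached tokens
    o = np.einsum('ht,thd->hd', attn, V)                    # per-head outputs
    return W_O @ o.reshape(-1)                              # u_t of shape (d,)

u_t = mha_step(rng.standard_normal(d))
print(u_t.shape, len(k_cache))                 # (512,) 1
```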
The Core of MLA: Low-Rank Key-Value Joint Compression
In MLA, the key and value information is saved in a single compressed latent vector per token. The hidden state $\mathbf{h}_t \in \mathbb{R}^{d}$ is first down-projected into $\mathbf{c}_t^{KV} \in \mathbb{R}^{d_c}$ with $d_c \ll d_h n_h$, and the keys and values are then recovered through up-projections:

$$\mathbf{c}_t^{KV} = W^{DKV}\mathbf{h}_t, \qquad \mathbf{k}_t^{C} = W^{UK}\mathbf{c}_t^{KV}, \qquad \mathbf{v}_t^{C} = W^{UV}\mathbf{c}_t^{KV},$$

where $W^{DKV} \in \mathbb{R}^{d_c \times d}$ is the down-projection matrix and $W^{UK}, W^{UV} \in \mathbb{R}^{d_h n_h \times d_c}$ are the up-projection matrices for keys and values. During inference, only $\mathbf{c}_t^{KV}$ needs to be cached.
A similar policy for compressing the query vectors is also applied in the training of DeepSeek-V2, even though it cannot reduce the KV cache (it lowers the activation memory during training):

$$\mathbf{c}_t^{Q} = W^{DQ}\mathbf{h}_t, \qquad \mathbf{q}_t^{C} = W^{UQ}\mathbf{c}_t^{Q},$$

where $\mathbf{c}_t^{Q} \in \mathbb{R}^{d_c'}$ is the compressed latent vector for queries and $d_c' \ll d_h n_h$.
In this way, the per-token KV cache shrinks from $2 n_h d_h l$ elements to $d_c l$ (plus a small decoupled RoPE key of dimension $d_h^R$ in the full DeepSeek-V2 design). Moreover, $W^{UK}$ can be absorbed into $W^{UQ}$ and $W^{UV}$ into $W^{O}$, so the full keys and values never need to be explicitly materialized during inference.
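Here is a minimal numpy sketch of the joint KV compression above (the decoupled RoPE key and the query path are omitted; the matrix names follow the paper's notation, but the sizes are illustrative).

```python
# Minimal sketch of MLA's low-rank joint KV compression.
# Only the latent vector c_t^{KV} is cached; keys/values are rebuilt on the fly.
import numpy as np

d, n_h, d_h, d_c = 512, 8, 64, 256             # d_c << n_h * d_h
rng = np.random.default_rng(0)
W_DKV = rng.standard_normal((d_c, d)) / np.sqrt(d)            # down-projection
W_UK  = rng.standard_normal((n_h * d_h, d_c)) / np.sqrt(d_c)  # up-projection (keys)
W_UV  = rng.standard_normal((n_h * d_h, d_c)) / np.sqrt(d_c)  # up-projection (values)

c_cache = []                                   # only d_c floats stored per token

def mla_kv_step(h_t):
    """Cache the latent c_t^{KV} and reconstruct K, V for all cached tokens."""
    c_cache.append(W_DKV @ h_t)                # c_t^{KV}, shape (d_c,)
    C = np.stack(c_cache)                      # (t, d_c)
    K = (C @ W_UK.T).reshape(-1, n_h, d_h)     # k_j^{C} for every cached token
    V = (C @ W_UV.T).reshape(-1, n_h, d_h)     # v_j^{C} for every cached token
    return K, V

K, V = mla_kv_step(rng.standard_normal(d))
print(K.shape, c_cache[0].shape)               # (1, 8, 64) (256,)
```

In an actual deployment, $W^{UK}$ would be folded into the query projection and $W^{UV}$ into $W^{O}$, so $K$ and $V$ are never explicitly reconstructed as above; the explicit reconstruction here is only for readability.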
Finally, we present a comparison between popular attention structures.
Property | MHA | GQA | MQA | MLA |
---|---|---|---|---|
#Key to #Query per head | 1 to 1 | 1 to groups | 1 to all | 1 to 1 |
How to reduce KV cache | / | Reduce #KV | Largely reduce #KV | Compress KV |
KV cache per token | $2 n_h d_h l$ | $2 n_g d_h l$ | $2 d_h l$ | $(d_c + d_h^R)\,l \approx \frac{9}{2} d_h l$ |
Capability | Strong | Moderate | Weak | Stronger |
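To make the "KV cache per token" row concrete, here is a back-of-the-envelope count of cached elements per token for a single layer, assuming $n_h = 8$, $d_h = 64$, a GQA group count of $n_g = 2$, and DeepSeek-V2's choices $d_c = 4 d_h$ and $d_h^R = d_h / 2$ (the other numbers are illustrative).

```python
# Per-layer KV cache elements per token under the illustrative sizes above.
n_h, d_h, n_g = 8, 64, 2
d_c, d_h_rope = 4 * d_h, d_h // 2

cache_elems = {
    "MHA": 2 * n_h * d_h,        # all key and value heads
    "GQA": 2 * n_g * d_h,        # one KV pair per group of query heads
    "MQA": 2 * d_h,              # a single KV pair shared by all heads
    "MLA": d_c + d_h_rope,       # compressed latent plus the decoupled RoPE key
}
for name, n in cache_elems.items():
    print(f"{name}: {n} elements per token per layer")
# MHA: 1024, GQA: 256, MQA: 128, MLA: 288 (= 4.5 * d_h)
```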
One More Thing: Traditional Methods for Low-Rank Approximation of Matrices
Here I list some traditional numerical methods for low-rank approximation of matrices.
- Singular Value Decomposition (SVD): The fundamental approach that decomposes a matrix $A$ into $U \Sigma V^{\top}$, where $U$ and $V$ are orthogonal matrices and $\Sigma$ contains the singular values. Truncating to the $k$ largest singular values gives the optimal rank-$k$ approximation (a small numerical sketch follows this list).
- QR Decomposition with Column Pivoting: Factors $A$ into $A = QR$, where $Q$ is orthogonal and $R$ is upper triangular. The pivoting strategy identifies dominant columns for the approximation.
- Randomized SVD: Uses random sampling to identify a subspace capturing the range of $A$, then projects the matrix onto this subspace to compute a smaller SVD.
- CUR Decomposition: Approximates $A \approx CUR$, where $C$ consists of selected columns, $R$ consists of selected rows, and $U$ is typically computed as a pseudoinverse.
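As a point of comparison, here is a small numpy sketch of the first method, truncated SVD: keeping the $r$ largest singular values yields the best rank-$r$ approximation in the Frobenius norm (Eckart-Young). The matrix sizes are arbitrary.

```python
# Rank-r approximation of a random matrix via truncated SVD.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 80))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = 10
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]     # best rank-r approximation of A

rel_err = np.linalg.norm(A - A_r, 'fro') / np.linalg.norm(A, 'fro')
print(f"rank-{r} relative Frobenius error: {rel_err:.3f}")
```

Note that this compresses a fixed matrix by minimizing reconstruction error, which is exactly the contrast with MLA's learned projections discussed next.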
Why can't we apply these methods to MLA? I suppose the reason is that they cannot be integrated into end-to-end training: traditional methods vs. MLA is like SVD vs. an autoencoder. The latter compresses information in order to maximize the task objective, rather than to minimize the reconstruction error of a fixed matrix.
Reference
- DeepSeek-AI. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv preprint, 2024. https://arxiv.org/abs/2405.04434