[PaperReading] MoBA: Mixture of Block Attention For Long-context LLMs

Reading notes on the new paper from Moonshot AI's Kimi team in the LLM foundations area. This post covers the methodology only.

Overview

3W1H: What challenge? Why this? What intuition? How to do?

  • What challenge: the shortcomings of the standard attention mechanism in long-context inference (its cost grows quadratically with context length).
  • Why this: draw inspiration from Mixture of Experts (MoE).
  • Intuition: language context consists of a series of ‘blocks’. The model could and should focus on only a subset of these blocks when performing inference over a very long context.
  • How to do: a gating mechanism autonomously selects, for each query, which blocks of keys/values to attend to.

Methodology

Illustration

  • Traditional attention mechanism: a single query token $q$ attends to all key and value tokens $K, V$:

$$
\mathrm{Attn}(q, K, V) = \mathrm{Softmax}\left(qK^\top\right)V
$$

  • MoBA attention mechanism: divide $K$ and $V$ into unions of ‘blocks’, then select only a subset of blocks for computation:

$$
\mathrm{MoBA}(q, K, V) = \mathrm{Softmax}\left(qK[I]^\top\right)V[I]
$$

Here $I \subseteq [N]$ is the set of selected keys and values, represented as the list of row indices of the selected components. An example is presented in Figure 1 of the paper.
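To make the contrast concrete, here is a minimal PyTorch sketch of standard single-query attention versus the same computation restricted to a selected index set $I$. This is my own sketch, not the authors' implementation; the function names are assumptions, and the $1/\sqrt{d}$ scaling is omitted to mirror the formulas above.

```python
import torch


def attn_single_query(q, K, V):
    """Standard attention for one query: Softmax(q K^T) V.

    q: (d,), K and V: (N, d). The 1/sqrt(d) scaling is omitted
    to mirror the formula above.
    """
    scores = K @ q                        # (N,) one score per key token
    weights = torch.softmax(scores, dim=-1)
    return weights @ V                    # (d,)


def moba_single_query(q, K, V, selected_idx):
    """Same formula, but restricted to the rows in the selected set I."""
    return attn_single_query(q, K[selected_idx], V[selected_idx])
```

How `selected_idx` is chosen is exactly what the block partitioning and gating mechanism below specify.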

Details of block partitioning

Let $n$ be the number of blocks. For convenience, assume the context length $N$ is divisible by the number of blocks $n$. Let $B = N / n$ denote the size of each block. Then the index range of the $i$-th block is

$$
I_i = \left[(i-1) \times B + 1,\ i \times B\right].
$$

Then MoBA applies top-$k$ selection on blocks to enable the attention to focus on only a subset of blocks rather than the entire context, i.e.,

$$
I = \bigcup_{g_i > 0} I_i.
$$

Here $g_i \in \{0, 1\}$ is the output of the gating mechanism, deciding whether the $i$-th block is selected for computation.
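As a toy illustration of the partitioning and the resulting index set (the sizes and the selected blocks below are arbitrary choices of mine, not values from the paper):

```python
import torch

N, n = 16, 4            # context length and number of blocks (toy values)
B = N // n              # block size, assuming n divides N

# I_i = [(i - 1) * B + 1, i * B] in the 1-indexed notation above;
# stored 0-indexed here for tensor indexing.
block_ranges = [torch.arange((i - 1) * B, i * B) for i in range(1, n + 1)]

# Suppose the gate picked blocks 1 and 3 (g_1 = g_3 = 1, the rest 0):
g = torch.tensor([1, 0, 1, 0])
I = torch.cat([block_ranges[i] for i in range(n) if g[i] > 0])
print(I)  # tensor([ 0,  1,  2,  3,  8,  9, 10, 11])
```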

Details of gating mechanism

To obtain $g_i$, the gating value of the $i$-th block, MoBA first computes an affinity score $s_i$ measuring the relevance between the query $q$ and the $i$-th block, and then applies top-$k$ gating among all blocks (lijt: this could be done via sorting or online updating). Specifically,

$$
s_i = \left\langle q,\ \operatorname{mean\_pool}\left(K[I_i]\right) \right\rangle.
$$

Here $\langle \cdot, \cdot \rangle$ is the relevance operator, e.g., the inner product (the choice in the paper), cosine similarity, or the correlation coefficient.

Then we get $g_i$ using:

$$
g_i = \begin{cases} 1, & s_i \in \operatorname{Topk}\left(\{\, s_j \mid j \in [n] \,\},\ k\right) \\ 0, & \text{otherwise.} \end{cases}
$$
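Putting the gating and the block-restricted attention together, below is a self-contained sketch for a single query, written directly from the formulas above. It is my own simplification: it skips causal masking and the paper's rule of always attending to the query's current block, and all function and variable names are assumptions.

```python
import torch


def moba_attention_single_query(q, K, V, n_blocks, top_k):
    """MoBA for one query token, following the formulas above.

    q: (d,), K and V: (N, d), with N divisible by n_blocks.
    Causal masking and the "always attend to the current block" rule
    from the paper are omitted to keep the sketch short.
    """
    N, d = K.shape
    B = N // n_blocks

    # View the keys as n blocks of size B: (n, B, d)
    K_blocks = K.view(n_blocks, B, d)

    # Affinity scores: s_i = <q, mean_pool(K[I_i])>
    block_means = K_blocks.mean(dim=1)           # (n, d)
    s = block_means @ q                          # (n,)

    # Top-k gating: g_i = 1 iff s_i is among the k largest scores
    top_idx = torch.topk(s, k=top_k).indices     # (k,)
    g = torch.zeros(n_blocks, dtype=torch.bool)
    g[top_idx] = True

    # I = union of the selected blocks' row indices
    selected = torch.cat(
        [torch.arange(i * B, (i + 1) * B) for i in range(n_blocks) if g[i]]
    )

    # Standard attention restricted to the selected rows
    weights = torch.softmax(K[selected] @ q, dim=-1)
    return weights @ V[selected]


# Toy usage
torch.manual_seed(0)
q = torch.randn(8)
K, V = torch.randn(32, 8), torch.randn(32, 8)
out = moba_attention_single_query(q, K, V, n_blocks=4, top_k=2)
print(out.shape)  # torch.Size([8])
```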

Summary

Overall, I suppose MoBA is intrinsically a hierarchical computation method that cuts away unnecessary work in long-context attention calculation. Let’s make a comparison:

  • Traditional: query token → key tokens (a tree of depth 2)
  • MoBA: query token → key blocks → key tokens (a tree of depth 3)

During this process, the mean pooling within a key block condenses the information of its key tokens, thereby avoiding repeated computation over relatively uninformative tokens.
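A back-of-the-envelope count of query-key score computations makes the saving visible (the sizes below are hypothetical, not taken from the paper's experiments):

```python
# Hypothetical sizes: 128K context, blocks of 4K tokens, top-3 blocks selected
N, B, k = 131_072, 4_096, 3
n = N // B                  # 32 blocks

dense_scores = N            # depth-2 tree: q scored against every key token
moba_scores = n + k * B     # depth-3 tree: q vs. n block summaries,
                            # then vs. the keys inside the k chosen blocks
print(dense_scores, moba_scores)  # 131072 12320
```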

References

  • Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, Zhiqi Huang, Huan Yuan, Suting Xu, Xinran Xu, Guokun Lai, Yanru Chen, Huabin Zheng, Junjie Yan, Jianlin Su, Yuxin Wu, Neo Y. Zhang, Zhilin Yang, Xinyu Zhou, Mingxing Zhang, Jiezhong Qiu. MoBA: Mixture of Block Attention for Long-Context LLMs. arXiv preprint arXiv:2502.13189, 2025. https://arxiv.org/pdf/2502.13189