Article source: LatePost
Image source: Generated by AI
On February 18, Kimi and DeepSeek released new work on the same day: MoBA and NSA, respectively. Both are improvements to the attention mechanism.
Today, Andrew Lu, one of MoBA's main developers, posted an answer on Zhihu describing three setbacks in the development process, which he jokingly calls his "three trips to the Cliff of Reflection." His Zhihu bio reads "New LLM Trainer."
One comment under the answer reads: "From open-source papers and open-source code, it has now evolved into an open-source chain of thought."
The attention mechanism matters because it is the core mechanism of today's large language models (LLMs). Back in June 2017, the Transformer paper by eight co-authors that launched the LLM revolution was titled "Attention Is All You Need"; it has been cited more than 153,000 times to date.
The attention mechanism allows AI models, like humans, to know what to "focus on" and what to "ignore" when processing information, grasping its most critical parts.
The attention mechanism is at work in both the training and inference (use) phases of a large model. Roughly, when an input such as "I like to eat apples" comes in, the model computes the relationship between each token in the sentence and every other token, and from that understands the semantics and other information.
As the context that large models must handle grows longer, the full attention mechanism used by the standard Transformer consumes an intolerable amount of compute, because it calculates importance scores between all pairs of input tokens and then weights them to find the most important ones. Its computational cost grows quadratically (non-linearly) with text length. As the abstract of the MoBA paper puts it:
"The quadratic growth in computational complexity inherent in the traditional attention mechanism brings prohibitive computational overhead."
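For readers who want to see where that quadratic cost comes from, here is a minimal, self-contained sketch of standard (full) scaled dot-product attention for a single head. The tensor names and toy sizes are illustrative assumptions, not taken from any of the papers discussed.

    # Minimal full (dense) attention sketch: the score matrix is n x n,
    # so compute and memory grow quadratically with sequence length n.
    import torch
    import torch.nn.functional as F

    def full_attention(q, k, v):
        # q, k, v: [n, d] for a single head
        d = q.shape[-1]
        scores = q @ k.T / d**0.5            # [n, n]  <- the quadratic term
        weights = F.softmax(scores, dim=-1)  # every token attends to every token
        return weights @ v                   # [n, d]

    n, d = 8, 16                             # toy sizes
    out = full_attention(torch.randn(n, d), torch.randn(n, d), torch.randn(n, d))
    print(out.shape)   # torch.Size([8, 16])
    print(n * n)       # number of attention scores: grows as n^2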
At the same time, researchers want large models to handle ever-longer context: multi-turn dialogue, complex reasoning, memory… these capabilities that AGI is assumed to need all demand longer and longer context.
How to optimize the attention mechanism so that it consumes less compute and memory without sacrificing model performance has become an important topic in large-model research.
This is the technical background against which several companies have turned their attention to "attention."
In addition to DeepSeek's NSA and Kimi's MoBA, in mid-January this year MiniMax, another Chinese large-model startup, deployed a new attention mechanism at scale in its first open-source model, MiniMax-01. MiniMax founder Yan Junjie told us at the time that this was one of MiniMax-01's most important innovations.
Liu Zhiyuan, co-founder of ModelBest and associate professor in the Department of Computer Science at Tsinghua University, published InfLLM in 2024, which likewise involves an improvement based on sparse attention. The paper is cited in the NSA paper.
Among these results, NSA, MoBA, and InfLLM all belong to the family of sparse attention mechanisms, while MiniMax-01's attempt mainly explores another direction: linear attention.
Cao Shijie, one of the authors of SeerAttention and a senior researcher at Microsoft Research, told us that, broadly speaking, linear attention makes a more radical change to the standard attention mechanism: it tries to directly eliminate the quadratic (non-linear) explosion of computation as the text grows longer, at the possible cost of losing the ability to capture complex long-range dependencies. Sparse attention instead exploits the inherent sparsity of attention to look for a more robust optimization.
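As a rough illustration of the "more radical" linear-attention route Cao Shijie describes, the sketch below replaces the softmax score matrix with a simple non-negative feature map so the computation can be reordered to avoid any n×n matrix. The specific feature map (elu + 1) is just one common choice, assumed here for illustration; it is not MiniMax-01's actual design.

    # Toy linear attention: replace softmax(QK^T)V with phi(Q)(phi(K)^T V),
    # which avoids forming the n x n score matrix (cost becomes O(n * d^2)).
    import torch
    import torch.nn.functional as F

    def linear_attention(q, k, v, eps=1e-6):
        # q, k, v: [n, d]; phi(x) = elu(x) + 1 is one common non-negative feature map
        phi_q = F.elu(q) + 1
        phi_k = F.elu(k) + 1
        kv = phi_k.T @ v                                        # [d, d] -- no n x n matrix
        normalizer = phi_q @ phi_k.sum(dim=0, keepdim=True).T   # [n, 1]
        return (phi_q @ kv) / (normalizer + eps)

    n, d = 8, 16
    out = linear_attention(torch.randn(n, d), torch.randn(n, d), torch.randn(n, d))
    print(out.shape)  # torch.Size([8, 16])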
We also recommend Cao Shijie's highly upvoted Zhihu answer on the attention mechanism: www.zhihu.com/people/cao-shi-jie-67/answers
(He answered the question: "What is worth paying attention to in DeepSeek's new paper on the NSA attention mechanism, which Liang Wenfeng took part in? What impact will it have?")
Fu Tianyu, a Ph.D. student in Tsinghua University's NICS-EFC lab and a co-author of MoA (Mixture of Sparse Attention), commented on the broader direction of sparse attention: "Both NSA and MoBA introduce dynamic attention, dynamically selecting the KV cache blocks on which fine-grained attention needs to be computed; compared with sparse attention mechanisms that use static methods, this improves model performance. Both methods also introduce sparse attention during training, not just during inference, which further improves model performance."
(Note: a KV cache block is a cache that stores previously computed Keys and Values. In attention computations, a Key identifies information such as a piece of data's features or position, so that it can be matched and associated with other data when attention weights are computed; the Value corresponding to a Key usually contains the actual content to be processed, such as the semantic vector of a word or phrase.)
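To make the note above concrete, here is a tiny sketch of what a KV cache does during decoding: keys and values of already-processed tokens are stored once and reused for every new token instead of being recomputed. The function and variable names are illustrative assumptions.

    # Toy KV cache: during decoding, append each new token's K and V once,
    # then reuse the whole cache to compute attention for the next token.
    import torch
    import torch.nn.functional as F

    d = 16
    k_cache, v_cache = [], []

    def decode_step(q_new, k_new, v_new):
        # q_new, k_new, v_new: [1, d] for the newly generated token
        k_cache.append(k_new)
        v_cache.append(v_new)
        K = torch.cat(k_cache, dim=0)          # [t, d] -- all past keys
        V = torch.cat(v_cache, dim=0)          # [t, d] -- all past values
        weights = F.softmax(q_new @ K.T / d**0.5, dim=-1)
        return weights @ V                     # [1, d]

    for _ in range(5):                         # generate 5 toy tokens
        out = decode_step(torch.randn(1, d), torch.randn(1, d), torch.randn(1, d))
    print(out.shape, len(k_cache))             # torch.Size([1, 16]) 5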
Along with the detailed MoBA technical paper, Moonshot AI (Dark Side of the Moon) also released MoBA's engineering code on GitHub. This code has been running in production on its own product, Kimi, for more than a year.
* What follows is Andrew Lu's account on Zhihu, reprinted with the author's permission. The original text contains many AI terms; the explanations in parentheses are editors' notes. Original post link: www.zhihu.com/people/deer-andrew
Andrew Lu's development self-report
At the invitation of Teacher Zhang (Zhang Mingxing, assistant professor at Tsinghua University), I am here to recount the ups and downs of building MoBA, which I jokingly call my "three trips to the Cliff of Reflection (Siguo Cliff)." (Andrew Lu was answering the question: "How do you evaluate Kimi's open-source sparse attention framework MoBA? What are its highlights compared with DeepSeek's NSA?")
The beginning of MoBA
The MoBA project started very early. At the end of May 2023, not long after Dark Side of the Moon (Moonshot AI) was founded, Tim (Zhou Xinyu, co-founder of Dark Side of the Moon) pulled me into a small room on my first day and started working on long-context training with Teacher Qiu (Qiu Jiezhong of Zhejiang University / Zhijiang Laboratory, who initiated the MoBA idea) and Dylan (a Dark Side of the Moon researcher). First of all, I want to thank Tim for his patience and teaching: he placed high hopes in an LLM novice and was willing to cultivate one. Among the big names building the various online models and model-related technologies, many, like me, basically started with LLMs from scratch.
At the time, the overall level in the industry was not high; everyone was pre-training at 4K (the model can handle roughly 4,000 tokens of input and output, i.e. a few thousand Chinese characters). The project was initially called "16K on 16B", meaning supporting a 16K context on a 16B-parameter (16 billion) model. Of course, by August this quickly became the requirement to support pre-training at 128K. This was also MoBA's first design requirement: to quickly train, from scratch, a model supporting a 128K context length, with no need for Continue Training at that point.
An interesting question arises here. In May and June of 2023, the industry generally believed that for long-context training, end-to-end training on long text (training the model directly on long text) worked better than training a shorter model first and then finding a way to extend its length. This perception only changed in the second half of 2023, when long-context Llama (a large model from Meta that supports long-text processing) appeared. We also did our own rigorous verification: in fact, short-text training plus length activation gives better token efficiency (each token contributes more effective information, so the model can complete higher-quality tasks with fewer tokens). So the first feature in MoBA's design naturally became a relic of its time.
During this period, MoBA's structural design was also more "radical." Compared with today's extremely simplified result, the initially proposed MoBA was a serial, two-layer attention scheme based on cross attention (an attention mechanism that handles the relationship between two different pieces of text data). The gate (which controls how the input data allocates weights among the expert networks) itself was a parameter-free structure (no parameters, so no training data is needed for it), but in order to better learn from historical tokens, we added a cross-machine cross attention, with corresponding parameters, to every Transformer layer (so as to better remember historical information). The MoBA design at this stage already combined the idea of Context Parallel, which later became widely known (the complete context sequence is stored across different nodes and gathered together when computation is needed): we tiled the entire context sequence across data-parallel nodes, treated the context on each data-parallel node as an expert in an MoE (Mixture of Experts), sent the tokens that needed attention to the corresponding experts for cross attention, and communicated the results back. We integrated fastmoe, an early MoE training framework, into Megatron-LM, Nvidia's now-ubiquitous large-model training framework, to support communication between experts.
We call this idea MoBA v0.5.
(Editor's note: MoBA is inspired by the now-mainstream MoE structure. In MoE, a large model activates only some experts' parameters at a time rather than all of them, saving compute; MoBA's core idea is to "look only at the most relevant context at a time, rather than all of it, thereby saving compute and cost.")
By early August 2023, the main model's pre-training had already consumed a large number of tokens, and redoing it would be costly. MoBA, which significantly changed the model structure and added extra parameters, thus entered the Cliff of Reflection for the first time.
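As a minimal sketch of the MoE-style routing the editor's note refers to (illustrative only; the sizes, the linear gate, and the softmax-then-top-k recipe are assumptions, not MoBA's or any production model's configuration):

    # Toy MoE routing: the gate scores all experts, but only the top-k experts
    # actually run for each token -- the rest are skipped, saving compute.
    import torch

    n_tokens, d, n_experts, top_k = 4, 16, 8, 2
    x = torch.randn(n_tokens, d)
    gate = torch.nn.Linear(d, n_experts)
    experts = torch.nn.ModuleList([torch.nn.Linear(d, d) for _ in range(n_experts)])

    scores = torch.softmax(gate(x), dim=-1)              # [n_tokens, n_experts]
    topk_scores, topk_idx = scores.topk(top_k, dim=-1)   # pick k experts per token

    out = torch.zeros_like(x)
    for t in range(n_tokens):
        for s, e in zip(topk_scores[t], topk_idx[t]):
            out[t] += s * experts[int(e)](x[t])          # only selected experts run
    print(out.shape)   # torch.Size([4, 16])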
A very simple schematic of MoBA v0.5
Editor’s note:
History Tokens-In scenarios such as natural language processing, represents a collection of previously processed text units.
Gate-In neural networks, structures used to control the flow of information
Input–Data or information received by the model
V (Value)-In the attention mechanism, it contains the data content to actually process or pay attention to, such as semantic vectors
K (Key tag)-In the attention mechanism, an identification tag used to identify information such as data characteristics or location so that it can be matched and associated with other data
Q (Query)-In the attention mechanism, a vector used to retrieve relevant information from key-value pairs
Cross Attention-An attention mechanism that focuses on input from different sources, such as associating input with historical information
Self Attention-An attention mechanism in which the model pays attention to its own input and captures dependencies within the input
First Trip to the Cliff of Reflection
"Entering the Cliff of Reflection" is of course a joke: it means stopping to look for improvements and to understand the new structure more deeply. This first stay was short; I went in quickly and came out quickly. Tim, the idea king of Dark Side of the Moon, came up with a new improvement: changing MoBA from a serial two-layer attention scheme to a parallel single-layer attention scheme. MoBA would no longer add extra model parameters, but instead use the existing attention parameters to learn all the information in a sequence at once, so that Continue Training could proceed while changing the existing structure as little as possible.
We call this idea MoBA v1.
MoBA v1 was in effect the product of sparse attention combined with Context Parallel. At a time when Context Parallel was not yet popular, MoBA v1 delivered extremely high end-to-end acceleration. But after verifying that it worked at both 3B and 7B, we hit a wall at larger model scales: very large loss spikes (training anomalies) appeared during training. The way our first version merged the block attention outputs (the results produced after the attention module processes the data) was too crude, just a simple accumulation, which made it impossible to debug against Full Attention at all. Debugging without ground truth (a reference answer, here the Full Attention result) was extremely difficult, and we exhausted all the stabilization methods available at the time without solving it. Because of these problems when training larger models, MoBA went back to the Cliff of Reflection.
A very simple schematic of MoBA v1
Editor’s note:
Self Attention to History-An attention mechanism in which the model attends to historical tokens, capturing the dependence between the current input and historical information
Share weights-Different parts of the neural network use the same weight parameters to reduce the number of parameters and improve model generalization
FFN (Feed-Forward Neural Network)-A basic neural network structure where data flows in a single direction from the input layer through the hidden layer to the output layer
Weighted Sum-The operation of summing multiple values according to their respective weights
Second Trip to the Cliff of Reflection
My second stay on the Cliff of Reflection was relatively long: it began in September 2023, and by the time I left it was already early 2024. But being on the Cliff of Reflection did not mean being abandoned; it let me experience the second hallmark of working at Dark Side of the Moon: saturation-style rescue (all hands on deck).
Besides Tim and Teacher Qiu, who had been contributing heavily all along, Su Shen (Su Jianlin, a Dark Side of the Moon researcher), Jingyuan Liu (a Dark Side of the Moon researcher) and other big names all joined the heated discussion and began taking MoBA apart to fix it. The first thing fixed was the crude Weighted Sum superposition. After we tried various ways of multiplying and adding a gate matrix into the sum, Tim dug Online Softmax out of the pile of old papers (computing softmax incrementally, processing the data piece by piece rather than needing to see it all at once) and said this should work. One of its biggest benefits is that with Online Softmax, by reducing the sparsity to 0 (selecting all blocks) we can strictly debug against a mathematically equivalent Full Attention, which solved most of the implementation problems. However, splitting the context across data-parallel nodes still caused load-imbalance problems: after a data sample is tiled across data-parallel ranks, the first few tokens on the first rank get attended to by a huge number of subsequent Qs (in the attention computation), which leads to terrible balance and drags down the acceleration. This phenomenon also has a better-known name: Attention Sink.
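The difference between crude accumulation and the Online Softmax fix can be shown with a small numerical sketch: each block's partial attention result is merged using a running maximum and running normalizer, and with sparsity reduced to zero (all blocks processed) the result matches full attention exactly. This is a generic online-softmax sketch under toy assumptions, not the production MoBA kernel.

    # Online-softmax merge of per-block attention: keep a running max and running
    # normalizer so partial results from blocks can be combined exactly.
    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    d, n, block = 16, 12, 4
    q = torch.randn(1, d)
    k = torch.randn(n, d)
    v = torch.randn(n, d)

    # Reference: full attention for a single query.
    ref = F.softmax(q @ k.T / d**0.5, dim=-1) @ v

    # Online merge, processing one block of K/V at a time.
    m = torch.tensor(float("-inf"))   # running max of scores
    z = torch.tensor(0.0)             # running softmax normalizer
    acc = torch.zeros(1, d)           # running (unnormalized) output
    for start in range(0, n, block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T / d**0.5).squeeze(0)    # scores for this block
        m_new = torch.maximum(m, s.max())
        scale = torch.exp(m - m_new)          # rescale what was accumulated so far
        z = z * scale + torch.exp(s - m_new).sum()
        acc = acc * scale + torch.exp(s - m_new) @ vb
        m = m_new
    out = acc / z
    print(torch.allclose(out, ref, atol=1e-5))  # True: matches full attention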
At this point Teacher Zhang visited, and after hearing our ideas he proposed a new one: separate the Context Parallel capability from MoBA. Context Parallel is Context Parallel, MoBA is MoBA; MoBA should go back to being sparse attention itself rather than a distributed sparse-attention training framework. As long as GPU memory allows, all of the context can be processed on a single machine with MoBA accelerating the computation, while Context Parallel handles organizing and transferring context across machines. So we re-implemented MoBA v2, which is essentially the same as the MoBA everyone sees today.
Current MoBA design
Editor’s note:
MoBA Gating-a specific gating mechanism in MoBA
RoPE (Rotary Position Embedding)-a technique to add position information to a sequence
Partition to blocks-Divide data into different blocks
Mean Pooling-An operation that downsamples data in deep learning to calculate the average of the data within an area
MatMul (Matrix Multiplication)-A mathematical operation that computes the product of two matrices
TopK Gating-A gating mechanism that selects the top K most important elements
Selected Block Index-Represents the number of the selected block
Index Select-Select the corresponding element from the data based on the index
Varlen Flash-Attention-A computationally efficient attention implementation for variable-length sequences
Attention Output-The calculated output of the attention mechanism
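To make the pipeline in the figure above concrete, here is a condensed, single-head sketch following the published MoBA recipe (mean-pool the keys of each block, score blocks against the query, keep the top-k blocks, then attend only within the selected tokens). The shapes, block size, and the omission of causal masking and the mandatory current block are simplifying assumptions; this is not the open-source kernel.

    # Simplified MoBA-style gating for one query token (single head, no batching):
    # 1) partition K/V into blocks, 2) mean-pool K per block, 3) score blocks
    # against q, 4) keep top-k blocks, 5) run attention only on the kept K/V.
    import torch
    import torch.nn.functional as F

    def moba_attention_one_query(q, k, v, block_size=4, top_k=2):
        n, d = k.shape
        n_blocks = (n + block_size - 1) // block_size

        # Mean-pooled key per block acts as that block's "representative".
        block_keys = torch.stack([
            k[i * block_size:(i + 1) * block_size].mean(dim=0) for i in range(n_blocks)
        ])                                               # [n_blocks, d]

        gate_scores = (q @ block_keys.T).squeeze(0)      # [n_blocks]
        top_k = min(top_k, n_blocks)
        selected = gate_scores.topk(top_k).indices       # chosen block ids

        # Gather only the tokens in the selected blocks (the index-select step).
        token_idx = torch.cat([
            torch.arange(i * block_size, min((i + 1) * block_size, n))
            for i in selected.tolist()
        ])
        k_sel, v_sel = k[token_idx], v[token_idx]

        weights = F.softmax(q @ k_sel.T / d**0.5, dim=-1)
        return weights @ v_sel                           # [1, d]

    n, d = 16, 8
    out = moba_attention_one_query(torch.randn(1, d), torch.randn(n, d), torch.randn(n, d))
    print(out.shape)   # torch.Size([1, 8])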
MoBA v2 trained stably, matched Full Attention exactly on short text, its scaling law looked very reliable, and it extended much more smoothly to the online models, so we put in more resources. After a round of debugging that cost the infra team n strands of hair, the pretrained model, length-activated with MoBA, came back all green on the needle-in-a-haystack test (the standard test of a large model's ability to handle long text). At that point we felt very good and started preparing to go online.
But then came what we least expected. In the SFT stage (supervised fine-tuning: further training the pre-trained model on specific tasks to improve its performance on them), part of the data carries a very sparse loss mask (so that only 1% or fewer of the tokens have a training gradient; a loss mask is the technique of selecting which parts of the output participate in the loss computed between the model's predictions and the reference answers). As a result, MoBA performed well on most SFT tasks, but on long summarization-type tasks, the sparser the loss mask, the lower the learning efficiency it showed. MoBA's launch was paused mid-way, and it entered the Cliff of Reflection for the third time.
Third Trip to the Cliff of Reflection
My third stay on the Cliff of Reflection was the most nerve-racking. By then the whole project carried huge sunk costs; the company had spent a great deal of compute and people's time. If it failed in the end-to-end long-text application scenario, the earlier research would be close to wasted. Fortunately, thanks to MoBA's nice mathematical properties, in the new round of saturation-rescue ablation experiments (ablation: studying the impact on model performance by removing certain parts of the model or changing certain settings) we found that performance was very good with the loss mask removed and unsatisfactory with the loss mask on. We then realized the problem was that tokens carrying a gradient (a gradient is the value used in machine learning to decide the direction and step size for updating model parameters) were too sparse in the SFT stage, which caused the low learning efficiency. So we switched the last few layers to Full Attention, increasing the density of gradient-carrying tokens during backpropagation and improving learning efficiency on those tasks. Later experiments showed that this switch does not significantly affect the effectiveness of the sparse attention, and at a length of 1M (one million tokens) the metrics are on par with a Full Attention model of the same architecture. MoBA came back from the Cliff of Reflection once more and was successfully launched to serve users.
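A tiny sketch of the loss-mask effect described above: with a sparse mask, only a small fraction of token positions contribute to the loss, and hence only those positions carry a gradient back through the model. The numbers (1,000 tokens, 1% mask) are made up for illustration.

    # Sparse loss mask: only masked-in positions contribute loss (and gradients).
    import torch
    import torch.nn.functional as F

    n_tokens, vocab = 1000, 32
    logits = torch.randn(n_tokens, vocab, requires_grad=True)
    targets = torch.randint(0, vocab, (n_tokens,))

    # e.g. a summary-style SFT sample: ~1% of positions are trained on.
    loss_mask = torch.zeros(n_tokens)
    loss_mask[::100] = 1.0

    per_token_loss = F.cross_entropy(logits, targets, reduction="none")
    loss = (per_token_loss * loss_mask).sum() / loss_mask.sum()
    loss.backward()

    rows_with_grad = (logits.grad.abs().sum(dim=-1) > 0).sum().item()
    print(rows_with_grad, "of", n_tokens, "token positions carry gradient")  # 10 of 1000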
Finally, I would like to thank all the great minds for their help, and to thank the company for its strong support and its huge number of GPUs. What we have open-sourced is the code we use in production: a sparse attention structure that has been validated over a long time. Driven by practical needs, various extra designs were cut away, leaving a minimalist structure that is nonetheless effective enough. I hope MoBA, and the chain of thought (CoT) behind it, can bring some help and value to everyone.
FAQ
By the way, let me use this space to answer some questions that have come up frequently over the past two days. I have mostly been troubling Teacher Zhang and Su Shen to act as customer service and answer them, which I feel bad about, so I have pulled out a few common questions to answer here together.
1. Is MoBA ineffective for Decoding (the text-generation process in the model's inference stage)?
MoBA is effective for Decoding: very effective with MHA (Multi-Head Attention), less so with GQA (Grouped Query Attention), and worst with MQA (Multi-Query Attention). The reason is simple. With MHA, each Q head has its own corresponding KV cache, so MoBA's gate can ideally be amortized during prefill (the compute stage when the input is processed for the first time): we compute and store a representative token for each block (data block), and since this token never changes afterwards, essentially all IO (input/output operations) comes only from the KV cache after index select (selecting data by index). In this case MoBA's sparsity determines how much IO is saved.
But with GQA and MQA, a group of Q heads shares the same KV cache, so if each Q head can freely choose the blocks it cares about, the IO savings from sparsity are likely to be eaten up. Consider this scenario: an MQA with 16 Q heads where MoBA happens to split the sequence into exactly 16 blocks. In the worst case, each Q head is interested in a different block (blocks 1 through 16), and the IO-saving advantage is wiped out. The more Q heads that can freely choose KV blocks, the worse the effect.
Given this phenomenon of Q heads freely choosing KV blocks, the natural improvement is to merge their choices: if all heads choose the same block, isn't the IO optimization a pure win? Yes, but in our actual tests, especially on pre-trained models that have already cost a great deal, each Q head has its own "taste", and forcibly merging their choices works worse than training that way from scratch.
2. MoBA forces selection of the current block (self attention) by default; will self's neighboring blocks also be selected?
They are not forced to be selected. This is a known point that raises some doubts; in the end we chose to trust SGD (Stochastic Gradient Descent). The current MoBA gate implementation is very straightforward, and interested readers can easily modify the gate so that certain chunks are always selected, but in our own tests the benefit of this change was marginal.
3. Does MoBA have a Triton implementation (Triton is a framework for writing high-performance GPU code, developed by OpenAI)?
We implemented a version that improves end-to-end performance by 10%+, but the cost of continuously maintaining it and keeping up with the main branch is relatively high, so after several iterations we put further optimization on hold.
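The IO argument in this answer can be made concrete with a little counting: with MHA each Q head reads only its own selected blocks, but with GQA/MQA a group of Q heads shares one KV cache, so the blocks that must be read are the union of all heads' choices; in the worst case that union covers everything and the sparsity saving disappears. The numbers below are illustrative, not measurements.

    # Counting KV blocks that must be read per shared KV cache.
    # MHA: each Q head has its own KV cache -> IO ~ top_k blocks per head.
    # MQA: 16 Q heads share one KV cache -> IO ~ union of all heads' selections.
    n_blocks, n_q_heads, top_k = 16, 16, 1

    # Worst case: every Q head picks a different block.
    worst_case_selected = {h % n_blocks for h in range(n_q_heads)}   # 16 distinct blocks
    # Best case: every Q head picks the same block.
    best_case_selected = {0 for _ in range(n_q_heads)}               # 1 block

    print("MHA-like IO per KV cache:", top_k, "of", n_blocks, "blocks")
    print("MQA worst case:", len(worst_case_selected), "of", n_blocks, "blocks read")  # 16
    print("MQA best case:", len(best_case_selected), "of", n_blocks, "blocks read")    # 1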
* Project addresses of the works mentioned at the beginning of this article (each GitHub page includes a link to the technical paper; DeepSeek has not yet published an NSA GitHub page):
MoBA GitHub page: github.com/MoonshotAI/MoBA
NSA Technical Paper: arxiv.org/abs/2502.11089
MiniMax-01 GitHub page: github.com/MiniMax-AI/MiniMax-01
InfLLM GitHub page: github.com/thunlp/InfLLM?tab=readme-ov-file
SeerAttention GitHub page: github.com/microsoft/SeerAttention