The Evolution of Attention: From Transformers to Mamba and the O(n) Future
Transformers rule AI, but at a quadratic cost. Explore the race to break the O(n^2) barrier—from FlashAttention's engineering wizardry to Mamba's architectural revolution. A deep dive for engineers into the future of long-context AI.
In 2017, Vaswani et al. declared "Attention Is All You Need." They were right, but they omitted a crucial fine-print warning: Attention is all you need, but it will cost you a fortune in GPU memory.
The core issue lies in how standard attention works. To understand the context of a sequence, every token must "attend" to every other token. If you have a sequence of N tokens, the model calculates an attention matrix of size N × N.
The villain of our story is the QKᵀ term in the canonical formula:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
Think of it like a dinner party. In a Recurrent Neural Network (RNN), guests say hello to the person next to them. It’s efficient (O(n)), but by the time the message reaches the end of the table, "Pass the salt" has become "The reactor is melting."
In a Transformer, everyone stands up and toasts everyone else individually. It ensures perfect communication, but if you double the guest list, the number of toasts doesn't double—it quadruples. This is the bottleneck. For a 2,000-word essay, it’s manageable. For a 2-million-token codebase, it’s a computational supernova.
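To make the blow-up concrete, here is a quick back-of-envelope calculation of what materializing the N × N score matrix costs in fp16 (2 bytes per entry), per attention head. The helper name is ours and the numbers are purely illustrative:

```python
# Memory needed to store the full (N x N) attention score matrix in fp16.
def attn_matrix_bytes(n_tokens, bytes_per_entry=2):
    return n_tokens * n_tokens * bytes_per_entry

for n in (2_000, 2_000_000):
    gib = attn_matrix_bytes(n) / 2**30
    print(f"{n:>9} tokens -> {gib:,.2f} GiB per head")
# 2,000 tokens fits easily (~0.01 GiB); 2,000,000 tokens needs ~7.3 TiB per head
```

Doubling N quadruples the bytes, which is exactly the "double the guests, quadruple the toasts" problem.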
The Engineering Fix: FlashAttention & FlashAttention-2
Or: "How to organize the seating chart without changing the guest list."
Before researchers could fix the math, engineers fixed the memory.
FlashAttention-1: The Memory Hack
The primary bottleneck in attention isn't actually the multiplication; it's the memory bandwidth. Moving that massive matrix from the GPU’s slow High Bandwidth Memory (HBM) to its fast on-chip SRAM is agonizingly slow.
Enter FlashAttention (Dao et al., 2022). It doesn’t change the math—it still computes the exact attention. However, it uses a clever "tiling" algorithm to fuse operations, ensuring the massive matrix is never explicitly materialized in the slow memory.
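The tiling idea can be sketched in a few lines of NumPy: walk over the keys in blocks, keep running softmax statistics, and never build the full N × N matrix. This is a toy illustration of the online-softmax trick, not the fused CUDA kernel; the function name and block size are our own choices:

```python
import numpy as np

def flash_attention_sketch(Q, K, V, block=128):
    """Exact attention computed block-by-block over the keys, never
    materializing the full (N, N) score matrix (toy NumPy sketch)."""
    N, D = Q.shape
    out = np.zeros((N, D))
    row_max = np.full((N, 1), -np.inf)   # running max, for numerical stability
    row_sum = np.zeros((N, 1))           # running softmax denominator
    for start in range(0, N, block):
        S = Q @ K[start:start + block].T             # (N, block) tile only
        new_max = np.maximum(row_max, S.max(axis=1, keepdims=True))
        scale = np.exp(row_max - new_max)            # rescale old accumulators
        P = np.exp(S - new_max)
        row_sum = row_sum * scale + P.sum(axis=1, keepdims=True)
        out = out * scale + P @ V[start:start + block]
        row_max = new_max
    return out / row_sum
```

Each iteration touches only an (N × block) tile, so peak memory scales with the block size instead of with N², while the final result matches vanilla attention exactly.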
FlashAttention-2: The Parallelism Hack
In 2023, the sequel arrived. If V1 was about memory, V2 was about pure parallelism.
The original version had a flaw: it parallelized computation over the "batch" and "heads," but not the sequence length. This meant that for very long sequences (exactly when you need FlashAttention), GPU threads weren't fully utilized.
FlashAttention-2 fixes this by parallelizing across the sequence length dimension and optimizing "warp" (thread group) communication. The result is a 2x speedup, effectively setting the speed limit for modern production LLMs like GPT-4 and Llama 3.
The Approximations: "Cheating" for Speed
If you can't calculate everything faster, calculate less. This led to two families of approximations.
Sparse Attention (Sliding Windows)
Used in models like Mistral 7B and Longformer, this approach gives each token "tunnel vision." A token only attends to its nearest neighbors (e.g., the last 4,096 tokens).
It turns the global dinner party into a series of intimate conversations. It’s fast (O(n · w) for a window of w tokens), but it loses global context. If the answer to your question was in Chapter 1, and you're reading Chapter 10, the model has effectively forgotten it exists.
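The mechanism is just a mask. A minimal sketch (the function name and toy sizes are ours): token i may attend only to the previous w tokens, so each row has at most w allowed positions:

```python
import numpy as np

def sliding_window_mask(n, window):
    """Causal sliding-window mask: token i attends to tokens j with
    i - window < j <= i (toy sketch of the Mistral/Longformer idea)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

print(sliding_window_mask(5, 2).astype(int))
```

Everything outside the band is set to minus infinity before the softmax, so those pairs are never computed at all in an optimized kernel.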
Linear Attention
This family attempts to reorder the math. By replacing the non-linear softmax with a kernel function φ, we can invoke the associative property of matrix multiplication.
Here is the "magic trick" in Python pseudo-code:
```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

phi = lambda x: np.maximum(x, 0) + 1e-6   # simple positive kernel feature map
N, D = 1024, 64
Q, K, V = np.random.randn(3, N, D)

# Standard Attention (Quadratic)
# We MUST compute the (N x N) matrix first because of Softmax
Attn_Matrix = softmax(Q @ K.T)       # Shape: (N, N) -- The Bottleneck!
Output = Attn_Matrix @ V             # Shape: (N, D)

# Linear Attention (Linear): reorder with the kernel phi (normalization omitted)
# We compute phi(K).T @ V first, creating a "Global Context" matrix
Global_Context = phi(K).T @ V        # Shape: (D, D) -- Small and fixed size!
Output = phi(Q) @ Global_Context     # Shape: (N, D)
```
We avoid the N × N matrix entirely. The catch? A "precision tax": often a 3-5% accuracy drop, because the kernel approximation isn't perfect.
The Precursor: The "S4" Era
Or: "The math that ran so Mamba could walk."
Before we talk about Mamba, we must talk about its grandfather: S4 (Structured State Space Sequence model).
For years, researchers tried to revive RNNs. The dream was a model that had a "state" (memory) like an RNN but could train in parallel like a Transformer. The breakthrough came with the HiPPO Matrix.
Gu et al. (2020) mathematically proved there is an optimal way to compress history using orthogonal polynomials (HiPPO). S4 used this to "memorize" extremely long sequences. However, S4 was a Linear Time Invariant (LTI) system. The rules of the model were fixed. It treated the word "the" with the exact same importance as the word "nuclear." It was polite, but it wasn't smart.
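The LTI property is easy to see in code. In the sketch below (a toy diagonal-state version, not S4's actual HiPPO parameterization), the transition parameters are fixed for every token:

```python
import numpy as np

def lti_ssm_scan(x, A_bar, B_bar, C):
    """Discretized Linear Time Invariant state-space model:
        h_t = A_bar * h_{t-1} + B_bar * x_t,    y_t = C . h_t
    A_bar, B_bar, C are FIXED for every token -- the S4 property.
    Toy diagonal-state sketch, not S4's HiPPO parameterization."""
    h = np.zeros_like(A_bar)
    ys = []
    for x_t in x:
        h = A_bar * h + B_bar * x_t
        ys.append(C @ h)
    return np.array(ys)

# An input impulse decays at the same fixed rate regardless of content:
y = lti_ssm_scan(np.array([1.0, 0.0, 0.0]),
                 A_bar=np.array([0.5]), B_bar=np.array([1.0]), C=np.array([1.0]))
# y is [1.0, 0.5, 0.25]: geometric decay at the fixed rate 0.5
```

Notice there is no branch anywhere on the value of `x_t`: "the" and "nuclear" are compressed with exactly the same decay rate.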
The Challenger: Mamba
Or: "S4 stops being polite and starts getting selective."
This leads us to Mamba (Gu & Dao, 2023). Mamba is essentially S4, but it breaks the "Invariant" rule.
Mamba introduces Selectivity. It makes the state transition parameters (the "decay," governed by the step size Δ) data-dependent.
- In S4: The decay rate is fixed.
- In Mamba: The decay rate is calculated per token.
- If the model sees noise (e.g., whitespace), it increases the decay: "Forget this immediately."
- If it sees a key fact, it decreases the decay: "Hold onto this forever."
This allows Mamba to combine the massive context of S4 with the high-fidelity reasoning of a Transformer. And thanks to a hardware-aware Parallel Scan algorithm (a trick for running recurrent math in parallel on the GPU), it trains just as fast as a Transformer.
Figure: The Selective State Space mechanism in action.
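The contrast with the S4-style recurrence is that the step size, and therefore the decay, is now computed from each token. A toy sketch, written as a sequential loop for clarity (the real kernel uses the parallel scan, and `delta_fn` stands in for the learned projection):

```python
import numpy as np

def selective_ssm_scan(x, A, B, C, delta_fn):
    """Mamba-style selectivity, toy version: the step size delta (and hence
    the effective decay exp(delta * A)) is recomputed for EVERY token from
    the input itself. delta_fn stands in for the learned projection."""
    h = np.zeros_like(A)
    ys = []
    for x_t in x:
        delta = delta_fn(x_t)              # data-dependent step size
        A_bar = np.exp(delta * A)          # per-token decay: big delta -> forget
        h = A_bar * h + delta * B * x_t    # discretized state update
        ys.append(C @ h)
    return np.array(ys)
```

With A negative, a large Δ drives `A_bar` toward 0 ("forget this immediately"), while a small Δ keeps `A_bar` near 1 ("hold onto this"), which is exactly the selectivity described above.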
The Future: Hybrid Architectures
Or: "Why choose when you can have both?"
We are now entering the era of the Hybrid Model. Engineers realized that linear efficiency is great for the "heavy lifting" of processing massive text, but attention is still unbeaten for precise, high-fidelity recall.
- Jamba and Samba: Interleave Mamba layers (for bulk processing) with attention layers. Jamba alternates Mamba with full-attention (and mixture-of-experts) blocks, while Samba pairs Mamba with Sliding Window Attention for local precision.
- Kimi Linear: A new architecture from Moonshot AI that mixes a highly efficient linear attention (Kimi Delta Attention) with periodic full-attention layers to ensure no detail is lost.
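At heart, a hybrid is just a layer schedule that mixes the two block types. A hypothetical sketch (the 1-in-8 ratio is illustrative only, not any model's published configuration):

```python
def hybrid_schedule(n_layers, attn_every=8):
    """Hypothetical hybrid stack: mostly linear-time SSM blocks, with a
    periodic attention block for precise recall. Illustrative only."""
    return ["attention" if (i + 1) % attn_every == 0 else "ssm"
            for i in range(n_layers)]

print(hybrid_schedule(16))
```

The SSM layers carry the cheap O(n) bulk of the context, and the occasional attention layer restores exact token-to-token recall.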
Conclusion: A New Fellowship
For seven years, the industry treated Vaswani’s paper like a prophecy, searching for a single perfect architecture to handle every task. But as 2025 approaches, it is clear that there is no "One Attention to rule them all."
Instead, we have a powerful ecosystem of specialized tools: FlashAttention provides the raw training density, Mamba delivers linear scalability for massive contexts, and Hybrid Architectures offer the best of both worlds.
For the AI Solution Engineer, the job is no longer just "scaling up." It is about selecting the right mechanism for the specific trade-off at hand. The era of the quadratic bottleneck is ending; the era of the hybrid architecture has begun.