SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space
Abstract
SSA, a unified training framework for sparse attention in LLMs, achieves state-of-the-art performance by aligning sparse attention with full attention, improving long-context processing and extrapolation.
The quadratic complexity of full attention limits efficient long-context processing in large language models (LLMs). Sparse attention mitigates this cost by restricting each query to attend to a subset of previous tokens; however, training-free approaches often lead to severe performance degradation. Native sparse-attention methods (e.g., NSA, MoBA) alleviate this issue, yet exhibit a critical paradox: they produce lower attention sparsity than full-attention models, despite aiming to approximate full attention, which may constrain their effectiveness. We attribute this paradox to gradient update deficiency: low-ranked key-value pairs excluded during sparse training receive neither forward contribution nor backward gradients, and thus never learn proper suppression. To overcome this limitation, we propose SSA (Sparse Sparse Attention), a unified training framework that considers both sparse and full attention and enforces bidirectional alignment at every layer. This design preserves gradient flow to all tokens while explicitly encouraging sparse-attention outputs to align with their full-attention counterparts, thereby promoting stronger sparsity. As a result, SSA achieves state-of-the-art performance under both sparse and full attention inference across multiple commonsense benchmarks. Furthermore, SSA enables models to adapt smoothly to varying sparsity budgets; performance improves consistently as more tokens are allowed to attend, supporting flexible compute-performance trade-offs at inference time. Finally, we show that native sparse-attention training surprisingly improves long-context extrapolation by mitigating the over-allocation of attention values in sink areas, with SSA demonstrating the strongest extrapolation capability.
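For intuition, the per-layer alignment described above can be pictured with a minimal PyTorch sketch. This is not the paper's implementation: the function name `alignment_loss`, the MSE objective, and the uniform per-layer weighting are illustrative assumptions, and "bidirectional" is rendered here simply by keeping both branches attached to the computation graph so gradients reach the sparse and the full attention paths alike.

```python
import torch
import torch.nn.functional as F

def alignment_loss(sparse_outs, full_outs, weight=1.0):
    """Encourage per-layer sparse-attention outputs to match their
    full-attention counterparts. Neither branch is detached, so gradients
    flow to both the sparse and the full attention computations."""
    loss = 0.0
    for s, f in zip(sparse_outs, full_outs):
        loss = loss + F.mse_loss(s, f)   # one alignment term per layer
    return weight * loss / len(sparse_outs)

# Hypothetical usage: align the outputs of 4 layers for a [batch, seq, dim] model.
sparse_outs = [torch.randn(2, 128, 512, requires_grad=True) for _ in range(4)]
full_outs   = [torch.randn(2, 128, 512, requires_grad=True) for _ in range(4)]
alignment_loss(sparse_outs, full_outs).backward()
```

In practice this auxiliary term would be added to the language-modeling loss; the exact weighting and distance function are design choices the paper itself specifies.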
Community
We observe that sparse-attention–trained models such as NSA and MoBA often produce less sparse attention distributions than models trained with full attention. This contradicts the original motivation for using sparse attention as a proxy for full attention during training, and suggests that insufficient sparsity may limit the performance of existing sparse-attention approaches. To address this issue, we introduce SSA (Sparse Sparse Attention), a training framework designed to explicitly encourage sparser attention distributions in sparse-attention–trained models. SSA yields consistent improvements across a range of commonsense reasoning benchmarks under both sparse and full-attention inference, adapts robustly to different levels of enforced sparsity at test time, and further improves length extrapolation.
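As a concrete illustration of how attention sparsity might be quantified when making such comparisons, the sketch below computes the mean attention mass captured by each query's top-k keys. The metric, the function name `topk_mass`, and the tensor layout are assumptions for illustration, not necessarily the measurement used in the paper.

```python
import torch

def topk_mass(attn: torch.Tensor, k: int) -> torch.Tensor:
    """attn: softmax attention weights of shape [batch, heads, q_len, k_len].
    Returns the mean fraction of attention mass held by each query's top-k
    keys; values near 1 indicate a concentrated (sparse) distribution."""
    topk = attn.topk(k, dim=-1).values   # largest k weights per query
    return topk.sum(dim=-1).mean()

# Example on random attention maps.
attn = torch.softmax(torch.randn(1, 8, 64, 64), dim=-1)
print(topk_mass(attn, k=8).item())
```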
