arxiv:2502.08363

Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding

Published on Feb 12, 2025
Authors:

Abstract

AI-generated summary

Top-Theta Attention is a training-free method for sparsifying transformer attention during inference, achieving significant reductions in V-cache usage and in the number of attention elements with minimal accuracy loss.

We present Top-Theta (Top-theta) Attention, a training-free method for sparsifying transformer attention during inference. Our key insight is that static, per-head thresholds can be calibrated to retain a desired constant number of significant elements per attention row. This approach enables content-based sparsity without retraining and remains robust across data domains. We further introduce compensation techniques that preserve accuracy under aggressive sparsification, establishing attention thresholding as a practical and principled alternative to top-k attention. Extensive evaluation on natural language processing tasks shows that Top-theta achieves a 3-10x reduction in V-cache usage and up to 10x fewer attention elements during inference, while degrading accuracy by no more than 1%.
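The core mechanism described in the abstract, calibrating one static threshold per attention head so that roughly a constant number of significant elements survives in each attention row, can be sketched in a few lines. The PyTorch snippet below is a minimal illustration under our own assumptions, not the authors' implementation: the calibration rule (averaging the k-th largest probability per row), the function names, and the row renormalization used as a stand-in for the paper's compensation techniques are all illustrative.

```python
# Minimal sketch of threshold-based attention sparsification (illustrative, not the paper's code).
import torch

def calibrate_thresholds(attn_probs: torch.Tensor, k: int) -> torch.Tensor:
    """Pick one static threshold per head from calibration data.

    attn_probs: [batch, heads, q_len, kv_len] softmax attention probabilities.
    Assumed rule: average the k-th largest probability of each row, so that
    applying the threshold later keeps roughly k elements per row.
    Returns: [heads] tensor of thresholds.
    """
    kth_largest = attn_probs.topk(k, dim=-1).values[..., -1]  # [batch, heads, q_len]
    return kth_largest.mean(dim=(0, 2))                       # [heads]

def threshold_attention(query, key, value, thresholds):
    """Apply per-head thresholds at inference time.

    query, key, value: [batch, heads, seq, head_dim]; thresholds: [heads].
    """
    d = query.shape[-1]
    scores = query @ key.transpose(-2, -1) / d ** 0.5
    probs = torch.softmax(scores, dim=-1)
    keep = probs >= thresholds.view(1, -1, 1, 1)   # content-based sparsity mask
    sparse = probs * keep
    # Renormalize the surviving mass per row; a simple stand-in for the
    # compensation techniques mentioned in the abstract.
    sparse = sparse / sparse.sum(dim=-1, keepdim=True).clamp_min(1e-9)
    return sparse @ value
```

The sparsity mask determines which value-cache rows each query row actually needs to read; skipping the dropped rows is where reductions in V-cache traffic of the kind reported in the abstract would come from.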
