stereoplegic's Collections: Attention
• Efficient Memory Management for Large Language Model Serving with PagedAttention (arXiv:2309.06180)
• LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models (arXiv:2308.16137)
• Scaling Transformer to 1M tokens and beyond with RMT (arXiv:2304.11062)
• DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models (arXiv:2309.14509)
• SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills (arXiv:2308.16369)
• LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models (arXiv:2309.12307)
• PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training (arXiv:2309.10400)
• Efficient Streaming Language Models with Attention Sinks (arXiv:2309.17453)
• Replacing softmax with ReLU in Vision Transformers (arXiv:2309.08586)
• Adapting Language Models to Compress Contexts (arXiv:2305.14788)
• In-context Autoencoder for Context Compression in a Large Language Model (arXiv:2307.06945)
• Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture (arXiv:2310.12109)
• Linformer: Self-Attention with Linear Complexity (arXiv:2006.04768)
• GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (arXiv:2305.13245)
• BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model (arXiv:2309.11568)
• The Closeness of In-Context Learning and Weight Shifting for Softmax Regression (arXiv:2304.13276)
• S^3: Increasing GPU Utilization during Generative Inference for Higher Throughput (arXiv:2306.06000)
• Self-slimmed Vision Transformer (arXiv:2111.12624)
• Robustifying Token Attention for Vision Transformers (arXiv:2303.11126)
• Combiner: Full Attention Transformer with Sparse Computation Cost (arXiv:2107.05768)
• A Unified View of Long-Sequence Models towards Modeling Million-Scale Dependencies (arXiv:2302.06218)
• Attention Bottlenecks for Multimodal Fusion (arXiv:2107.00135)
• Blockwise Self-Attention for Long Document Understanding (arXiv:1911.02972)
• LSG Attention: Extrapolation of pretrained Transformers to long sequences (arXiv:2210.15497)
• Cure the headache of Transformers via Collinear Constrained Attention (arXiv:2309.08646)
• VSA: Learning Varied-Size Window Attention in Vision Transformers (arXiv:2204.08446)
• Bird-Eye Transformers for Text Generation Models (arXiv:2210.03985)
• Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers (arXiv:2303.13755)
• TRAMS: Training-free Memory Selection for Long-range Language Modeling (arXiv:2310.15494)
• Pit One Against Many: Leveraging Attention-head Embeddings for Parameter-efficient Multi-head Attention (arXiv:2310.07911)
• Memoria: Resolving Fateful Forgetting Problem through Human-Inspired Memory Architecture (arXiv:2310.03052)
• FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (arXiv:2205.14135)
• Only 5% Attention Is All You Need: Efficient Long-range Document-level Neural Machine Translation (arXiv:2309.14174)
• Attention Is Not All You Need Anymore (arXiv:2308.07661)
• Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth (arXiv:2103.03404)
• Semantics-aware Attention Improves Neural Machine Translation (arXiv:2110.06920)
• Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers (arXiv:2211.11315)
• Attention Is All You Need (arXiv:1706.03762)
• Ultra-Long Sequence Distributed Transformer (arXiv:2311.02382)
• Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs (arXiv:2311.02262)
• Attention or Convolution: Transformer Encoders in Audio Language Models for Inference Efficiency (arXiv:2311.02772)
• MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers (arXiv:2012.15828)
• MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers (arXiv:2002.10957)
• ConvFormer: Parameter Reduction in Transformer Models for 3D Human Pose Estimation by Leveraging Dynamic Multi-Headed Convolutional Attention (arXiv:2304.02147)
• GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling (arXiv:2311.01927)
• Improving Transformers with Probabilistic Attention Keys (arXiv:2110.08678)
• Wide Attention Is The Way Forward For Transformers? (arXiv:2210.00640)
• A Practical Survey on Faster and Lighter Transformers (arXiv:2103.14636)
• Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing (arXiv:2306.12929)
• Scaling TransNormer to 175 Billion Parameters (arXiv:2307.14995)
• ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformer (arXiv:2306.06446)
• Are Sixteen Heads Really Better than One? (arXiv:1905.10650)
• FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores (arXiv:2311.05908)
• Hiformer: Heterogeneous Feature Interactions Learning with Transformers for Recommender Systems (arXiv:2311.05884)
• Exemplar-free Continual Learning of Vision Transformers via Gated Class-Attention and Cascaded Feature Drift Compensation (arXiv:2211.12292)
• Latency Adjustable Transformer Encoder for Language Understanding (arXiv:2201.03327)
• AxFormer: Accuracy-driven Approximation of Transformers for Faster, Smaller and more Accurate NLP Models (arXiv:2010.03688)
• Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers (arXiv:2305.17328)
• Human Guided Exploitation of Interpretable Attention Patterns in Summarization and Topic Segmentation (arXiv:2112.05364)
• Alleviating the Inequality of Attention Heads for Neural Machine Translation (arXiv:2009.09672)
• CAT-probing: A Metric-based Approach to Interpret How Pre-trained Models for Programming Language Attend Code Structure (arXiv:2210.04633)
• Are We Falling in a Middle-Intelligence Trap? An Analysis and Mitigation of the Reversal Curse (arXiv:2311.07468)
• Shifting Attention to Relevance: Towards the Uncertainty Estimation of Large Language Models (arXiv:2307.01379)
• The Information Pathways Hypothesis: Transformers are Dynamic Self-Ensembles (arXiv:2306.01705)
• Relaxed Attention for Transformer Models (arXiv:2209.09735)
• System 2 Attention (is something you might need too) (arXiv:2311.11829)
• Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers (arXiv:2311.10642)
• Superiority of Softmax: Unveiling the Performance Edge Over Linear Attention (arXiv:2310.11685)
• Attention Sorting Combats Recency Bias In Long Context Language Models (arXiv:2310.01427)
• Gated recurrent neural networks discover attention (arXiv:2309.01775)
• Your Transformer May Not be as Powerful as You Expect (arXiv:2205.13401)
• Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition (arXiv:2209.15176)
• Low Rank Factorization for Compact Multi-Head Self-Attention (arXiv:1912.00835)
• Linear Self-Attention Approximation via Trainable Feedforward Kernel (arXiv:2211.04076)
• Low-Rank Bottleneck in Multi-head Attention Models (arXiv:2002.07028)
• EfficientFormer: Vision Transformers at MobileNet Speed (arXiv:2206.01191)
• Transformer in Transformer (arXiv:2103.00112)
• COMCAT: Towards Efficient Compression and Customization of Attention-Based Vision Models (arXiv:2305.17235)
• CoLT5: Faster Long-Range Transformers with Conditional Computation (arXiv:2303.09752)
• Fourier Transformer: Fast Long Range Modeling by Removing Sequence Redundancy with FFT Operator (arXiv:2305.15099)
• SparQ Attention: Bandwidth-Efficient LLM Inference (arXiv:2312.04985)
• SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention (arXiv:2312.07987)
• Efficient Monotonic Multihead Attention (arXiv:2312.04515)
• SCCA: Shifted Cross Chunk Attention for long contextual semantic expansion (arXiv:2312.07305)
• Zebra: Extending Context Window with Layerwise Grouped Local-Global Attention (arXiv:2312.08618)
• Is Model Attention Aligned with Human Attention? An Empirical Study on Large Language Models for Code Generation (arXiv:2306.01220)
• Mixture of Attention Heads: Selecting Attention Heads Per Token (arXiv:2210.05144)
• LKCA: Large Kernel Convolutional Attention (arXiv:2401.05738)
• HyperAttention: Long-context Attention in Near-Linear Time (arXiv:2310.05869)
• Rethinking Attention with Performers (arXiv:2009.14794)
• Attention Lens: A Tool for Mechanistically Interpreting the Attention Head Information Retrieval Mechanism (arXiv:2310.16270)
• Softmax-free Linear Transformers (arXiv:2207.03341)
• Gated Linear Attention Transformers with Hardware-Efficient Training (arXiv:2312.06635)
• Pixelated Butterfly: Simple and Efficient Sparse training for Neural Network Models (arXiv:2112.00029)
• Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks (arXiv:2402.04248)
• A Quantitative Review on Language Model Efficiency Research (arXiv:2306.01768)
• Agent Attention: On the Integration of Softmax and Linear Attention (arXiv:2312.08874)
• FLatten Transformer: Vision Transformer using Focused Linear Attention (arXiv:2308.00442)
• Linear Transformers with Learnable Kernel Functions are Better In-Context Models (arXiv:2402.10644)
• Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models (arXiv:2402.19427)
• Simple linear attention language models balance the recall-throughput tradeoff (arXiv:2402.18668)
• Linear Transformers are Versatile In-Context Learners (arXiv:2402.14180)
• Attention Approximates Sparse Distributed Memory (arXiv:2111.05498)
• Multi-Scale Self-Attention for Text Classification (arXiv:1912.00544)
• Scattered Mixture-of-Experts Implementation (arXiv:2403.08245)
• Factorization Vision Transformer: Modeling Long Range Dependency with Local Window Cost (arXiv:2312.08614)
• JetMoE: Reaching Llama2 Performance with 0.1M Dollars (arXiv:2404.07413)
• SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization (arXiv:2405.11582)
• Yuan 2.0-M32: Mixture of Experts with Attention Router (arXiv:2405.17976)
• LongHeads: Multi-Head Attention is Secretly a Long Context Processor (arXiv:2402.10685)
• Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling (arXiv:2406.07522)
• SinkLoRA: Enhanced Efficiency and Chat Capabilities for Long-Context Large Language Models (arXiv:2406.05678)
• Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention (arXiv:2405.17381)
• The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry (arXiv:2402.04347)
• GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression (arXiv:2407.12077)
• SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention (arXiv:2406.15486)
• RazorAttention: Efficient KV Cache Compression Through Retrieval Heads (arXiv:2407.15891)
• Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters (arXiv:2408.04093)
• Theory, Analysis, and Best Practices for Sigmoid Self-Attention (arXiv:2409.04431)
• Weighted Grouped Query Attention in Transformers (arXiv:2407.10855)
• On the Benefits of Rank in Attention Layers (arXiv:2407.16153)
• Beyond KV Caching: Shared Attention for Efficient LLMs (arXiv:2407.12866)
• Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention (arXiv:2408.08454)
• Efficient LLM Training and Serving with Heterogeneous Context Sharding among Attention Heads (arXiv:2407.17678)
• Post-Training Sparse Attention with Double Sparsity (arXiv:2408.07092)
• Palu: Compressing KV-Cache with Low-Rank Projection (arXiv:2407.21118)
• Inference-Friendly Models With MixAttention (arXiv:2409.15012)
• PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation (arXiv:2411.08307)
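The common reference point for this collection is the scaled dot-product attention of "Attention Is All You Need" (arXiv:1706.03762), which most of the papers above accelerate, sparsify, compress, or replace. As a minimal NumPy sketch for orientation (no batching, masking, multiple heads, or dropout; the function names are my own, not from any listed paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d_k), K: (m, d_k), V: (m, d_v) -> (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n, m) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)  # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                            # attention-weighted values

# Usage: 4 query tokens attending over 6 key/value tokens.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 16))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 16)
```

A large cluster of the entries (Linformer, Performers, Softmax-free Linear Transformers, FLatten Transformer, Gated Linear Attention, The Hedgehog & the Porcupine, Superiority of Softmax) revolves around replacing that row-wise softmax with a kernel feature map so the (n, m) score matrix is never materialized. A generic sketch of the kernelized linear-attention trick, assuming the same shapes as above (the ReLU feature map is an illustrative stand-in, not any single paper's choice):

```python
def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """softmax(QK^T / sqrt(d_k)) V is approximated by
    phi(Q) (phi(K)^T V) / (phi(Q) sum_j phi(k_j)): the key/value summary
    is built once, so cost is O((n+m) * d_k * d_v) rather than O(n * m)."""
    Qp, Kp = phi(Q), phi(K)          # elementwise positive feature maps
    KV = Kp.T @ V                    # (d_k, d_v) summary of all keys/values
    Z = Qp @ Kp.sum(axis=0)          # (n,) per-query normalizer
    return (Qp @ KV) / Z[:, None]    # (n, d_v)
```

The sketches differ in how the normalizer is computed, which is exactly the design axis the softmax-versus-linear papers in this collection argue over.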