new architecture
Blending Is All You Need: Cheaper, Better Alternative to Trillion-Parameters LLM
arXiv:2401.02994 • 52
MambaByte: Token-free Selective State Space Model
arXiv:2401.13660 • 60
Repeat After Me: Transformers are Better than State Space Models at Copying
arXiv:2402.01032 • 24
BlackMamba: Mixture of Experts for State-Space Models
arXiv:2402.01771 • 25
Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks
arXiv:2402.04248 • 32
KAN: Kolmogorov-Arnold Networks
arXiv:2404.19756 • 116
Zamba: A Compact 7B SSM Hybrid Model
arXiv:2405.16712 • 25
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
arXiv:2405.21060 • 68
Block Transformer: Global-to-Local Language Modeling for Fast Inference
arXiv:2406.02657 • 41
Breaking the Attention Bottleneck
arXiv:2406.10906 • 4
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
arXiv:2407.04620 • 34
Jamba-1.5: Hybrid Transformer-Mamba Models at Scale
arXiv:2408.12570 • 32
A Comprehensive Survey of Mamba Architectures for Medical Image Analysis: Classification, Segmentation, Restoration and Beyond
arXiv:2410.02362 • 18
arXiv:2410.05258 • 180
GPT or BERT: why not both?
arXiv:2410.24159 • 14
Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA
arXiv:2410.20672 • 6
SambaMixer: State of Health Prediction of Li-ion Batteries using Mamba State Space Models
arXiv:2411.00233 • 7
Hymba: A Hybrid-head Architecture for Small Language Models
arXiv:2411.13676 • 47
Gated Delta Networks: Improving Mamba2 with Delta Rule
arXiv:2412.06464 • 15
Byte Latent Transformer: Patches Scale Better Than Tokens
arXiv:2412.09871 • 108
RWKV-7 "Goose" with Expressive Dynamic State Evolution
arXiv:2503.14456 • 153
Deep Residual Echo State Networks: exploring residual orthogonal connections in untrained Recurrent Neural Networks
arXiv:2508.21172 • 2
Gated Associative Memory: A Parallel O(N) Architecture for Efficient Sequence Modeling
arXiv:2509.00605 • 43
Less is More: Recursive Reasoning with Tiny Networks
arXiv:2510.04871 • 509
arXiv:2601.00417 • 34
Nested Learning: The Illusion of Deep Learning Architectures
arXiv:2512.24695 • 44
Recursive Language Models
arXiv:2512.24601 • 90