Forgetting Transformer: Softmax Attention with a Forget Gate • Paper • 2503.02130 • Published Mar 3, 2025
L^2M: Mutual Information Scaling Law for Long-Context Language Modeling • Paper • 2503.04725 • Published Mar 6, 2025
I-Con: A Unifying Framework for Representation Learning • Paper • 2504.16929 • Published Apr 23, 2025
Hybrid Architectures for Language Models: Systematic Analysis and Design Insights • Paper • 2510.04800 • Published Oct 6, 2025