# MLPMemory-gpt2-large

## Model Description
This checkpoint contains the GPT2-large MLP Memory trained on WikiText-103, as described in our paper.
- Paper: MLP Memory: A Retriever-Pretrained Memory for Large Language Models
- GitHub: https://github.com/Rubin-Wei/MLPMemory
- Model Size: 774M parameters
- Base Model: Rubin-Wei/gpt2-large-finetuned-wikitext103
- Training Dataset: WikiText-103
## Overview
MLP Memory introduces a retriever-pretrained parametric memory designed to bridge retrieval-augmented generation (RAG) and traditional parametric fine-tuning. Instead of explicitly retrieving documents during inference, the model internalizes retrieval behavior by pretraining an MLP to imitate kNN retrievers across the full pretraining corpus. When combined with the base model, the resulting architecture demonstrates scaling characteristics that surpass conventional decoder-only language models, highlighting retrieval-inspired parametric augmentation as a new scaling frontier for large language models. It provides the following (a minimal inference sketch follows this list):
- End-to-End Differentiable: Fully parameterized; supports gradient flow for joint optimization with the base model.
- Highly Compressed Knowledge: Condenses multi-terabyte retrieval stores into a compact 1B-parameter MLP (~4 GB).
- Efficient Inference: Eliminates retrieval latency with constant-speed decoding independent of corpus size.
- Long-Term Memory: Encodes durable corpus-level knowledge beyond short context limits.
- Scalable: Exhibits stronger scaling behavior than standard decoder-only architectures.
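As a rough illustration of how such a memory can be used at decode time, the sketch below interpolates the base model's next-token distribution with a distribution produced by an MLP head over the final hidden state, in the spirit of kNN-LM-style interpolation. The `MLPMemory` class, its layer sizes, and the interpolation weight `lam` are illustrative assumptions; the released checkpoint format and the exact combination rule are defined in the GitHub repository.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical MLP memory wrapper: maps the base model's hidden state at a
# position to a next-token distribution (the paper's MLP is trained to imitate
# a kNN retriever). The class name, layer sizes, and forward signature are
# assumptions for this sketch, not the repository's implementation.
class MLPMemory(torch.nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int, inner_size: int = 4096):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(hidden_size, inner_size),
            torch.nn.GELU(),
            torch.nn.Linear(inner_size, vocab_size),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.net(hidden_states)  # logits over the vocabulary

tokenizer = AutoTokenizer.from_pretrained("Rubin-Wei/gpt2-large-finetuned-wikitext103")
base = AutoModelForCausalLM.from_pretrained("Rubin-Wei/gpt2-large-finetuned-wikitext103")
memory = MLPMemory(base.config.n_embd, base.config.vocab_size)  # load the released MLP Memory weights here

inputs = tokenizer("The history of Valkyria Chronicles", return_tensors="pt")
with torch.no_grad():
    out = base(**inputs, output_hidden_states=True)
    p_lm = F.softmax(out.logits[:, -1], dim=-1)                      # base LM next-token distribution
    p_mem = F.softmax(memory(out.hidden_states[-1][:, -1]), dim=-1)  # memory's next-token distribution

lam = 0.25  # interpolation weight (assumed; tune on validation perplexity)
p_next = lam * p_mem + (1.0 - lam) * p_lm
print(tokenizer.decode(p_next.argmax(dim=-1)))
```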
## Performance on WikiText-103
| Model | #Params | PPL |
|---|---|---|
| GPT2-large-vanilla | 774M | 15.80 |
| GPT2-large-finetuned | 774M | 10.42 |
| GPT2-xl-finetuned | 1.5B | 10.16 |
| GPT2-large-finetuned + MLPMem | 774M + 774M | 9.58 |
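The perplexities above are the exponential of the mean per-token negative log-likelihood on the WikiText-103 test set. The snippet below is a minimal sketch of evaluating the finetuned base model on its own, using non-overlapping 1024-token windows for simplicity; the dataset config name and window size are assumptions, and the paper's evaluation protocol (e.g., sliding-window context) may differ.

```python
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Rubin-Wei/gpt2-large-finetuned-wikitext103"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

# Tokenize the WikiText-103 test split as one long stream.
test = load_dataset("wikitext", "wikitext-103-raw-v1", split="test")
ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids

window = 1024  # GPT-2's maximum context length
total_nll, n_tokens = 0.0, 0
with torch.no_grad():
    for start in range(0, ids.size(1) - 1, window):
        chunk = ids[:, start:start + window]
        out = model(chunk, labels=chunk)  # HF shifts labels internally
        total_nll += out.loss.item() * (chunk.size(1) - 1)
        n_tokens += chunk.size(1) - 1

print(f"perplexity: {math.exp(total_nll / n_tokens):.2f}")
```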
## Training Details
- Training Data: WikiText-103
- Training Objective: Hybrid of KL-divergence and language-modeling losses (see the loss sketch after this list)
- Supervision Signal: kNN distributions constructed with GPT2-xl; the finetuned version of GPT2-xl is recommended here
- Hyperparameters:
  - Learning rate: 5e-4
  - Optimizer: AdamW
  - Alpha (loss balance): 0.5
  - Training epochs: 50
  - LR Scheduler: Linear
  - Warmup Steps: 2000
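As a rough sketch of the hybrid objective listed above, the function below mixes a KL term, which pushes the memory's distribution toward the teacher kNN distribution, with the standard language-modeling cross-entropy, weighted by the alpha value from the hyperparameters. The function name, tensor shapes, and the direction of the KL term are illustrative assumptions rather than the repository's exact implementation.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(mem_logits: torch.Tensor,  # (batch, seq, vocab) logits from the MLP memory
                knn_probs: torch.Tensor,   # (batch, seq, vocab) kNN distributions from the GPT2-xl datastore
                target_ids: torch.Tensor,  # (batch, seq) gold next tokens
                alpha: float = 0.5) -> torch.Tensor:
    log_p_mem = F.log_softmax(mem_logits, dim=-1)
    # KL(kNN || memory) per position, averaged over batch and sequence.
    kl = F.kl_div(log_p_mem, knn_probs, reduction="none").sum(-1).mean()
    # Standard language-modeling cross-entropy on the gold tokens.
    ce = F.cross_entropy(mem_logits.flatten(0, 1), target_ids.flatten())
    return alpha * kl + (1.0 - alpha) * ce
```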
## Citation
@inproceedings{Wei2025MLPMA,
title={MLP Memory: A Retriever-Pretrained Memory for Large Language Models},
author={Rubin Wei and Jiaqi Cao and Jiarui Wang and Jushi Kai and Qipeng Guo and Bowen Zhou and Zhouhan Lin},
year={2025},
url={https://api.semanticscholar.org/CorpusID:281658735}
}
## Contact
For questions and support: [email protected]