# MLPMemory-gpt2-large

## Model Description
This checkpoint contains the GPT2-large MLP Memory trained on WikiText-103, as described in our paper.
- Paper: MLP Memory: A Retriever-Pretrained Memory for Large Language Models
- GitHub: https://github.com/Rubin-Wei/MLPMemory
- Model Size: 774M parameters
- Base Model: Rubin-Wei/gpt2-large-finetuned-wikitext103
- Training Dataset: WikiText-103
## Overview
MLP Memory introduces a retriever-pretrained parametric memory designed to bridge retrieval-augmented generation (RAG) and traditional parametric fine-tuning. Instead of explicitly retrieving documents during inference, the model internalizes retrieval behavior by pretraining an MLP to imitate kNN retrievers across the full pretraining corpus. When combined with the base model, the resulting architecture demonstrates scaling characteristics that surpass conventional decoder-only language models, highlighting retrieval-inspired parametric augmentation as a new scaling frontier for large language models. It provides the following (a minimal inference sketch follows this list):
- End-to-End Differentiable: Fully parameterized; supports gradient flow for joint optimization with the base model.
- Highly Compressed Knowledge: Condenses multi-terabyte retrieval stores into a compact 1B-parameter MLP (~4 GB).
- Efficient Inference: Eliminates retrieval latency with constant-speed decoding independent of corpus size.
- Long-Term Memory: Encodes durable corpus-level knowledge beyond short context limits.
- Scalable: Exhibits stronger scaling behavior than standard decoder-only architectures.
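As a rough illustration of how such a memory can be used at decode time, the sketch below interpolates the base model's next-token distribution with a distribution produced by an MLP head over the final hidden state, in the spirit of kNN-LM-style interpolation. The `MLPMemory` class, its layer sizes, and the interpolation weight `lam` are illustrative assumptions; the released checkpoint format and the exact combination rule are defined in the GitHub repository.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical MLP memory wrapper: maps the base model's hidden state at a
# position to a next-token distribution (the paper's MLP is trained to imitate
# a kNN retriever). The class name, layer sizes, and forward signature are
# assumptions for this sketch, not the repository's implementation.
class MLPMemory(torch.nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int, inner_size: int = 4096):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(hidden_size, inner_size),
            torch.nn.GELU(),
            torch.nn.Linear(inner_size, vocab_size),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.net(hidden_states)  # logits over the vocabulary

tokenizer = AutoTokenizer.from_pretrained("Rubin-Wei/gpt2-large-finetuned-wikitext103")
base = AutoModelForCausalLM.from_pretrained("Rubin-Wei/gpt2-large-finetuned-wikitext103")
memory = MLPMemory(base.config.n_embd, base.config.vocab_size)  # load the released MLP Memory weights here

inputs = tokenizer("The history of Valkyria Chronicles", return_tensors="pt")
with torch.no_grad():
    out = base(**inputs, output_hidden_states=True)
    p_lm = F.softmax(out.logits[:, -1], dim=-1)                      # base LM next-token distribution
    p_mem = F.softmax(memory(out.hidden_states[-1][:, -1]), dim=-1)  # memory's next-token distribution

lam = 0.25  # interpolation weight (assumed; tune on validation perplexity)
p_next = lam * p_mem + (1.0 - lam) * p_lm
print(tokenizer.decode(p_next.argmax(dim=-1)))
```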
## Performance on WikiText-103
| Model | #Params | PPL |
|---|---|---|
| GPT2-large-vanilla | 774M | 15.80 |
| GPT2-large-finetuned | 774M | 10.42 |
| GPT2-xl-finetuned | 1.5B | 10.16 |
| GPT2-large-finetuned + MLPMem | 774M + 774M | 9.58 |
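The perplexities above are the exponential of the mean per-token negative log-likelihood on the WikiText-103 test set. The snippet below is a minimal sketch of evaluating the finetuned base model on its own, using non-overlapping 1024-token windows for simplicity; the dataset config name and window size are assumptions, and the paper's evaluation protocol (e.g., sliding-window context) may differ.

```python
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Rubin-Wei/gpt2-large-finetuned-wikitext103"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

# Tokenize the WikiText-103 test split as one long stream.
test = load_dataset("wikitext", "wikitext-103-raw-v1", split="test")
ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids

window = 1024  # GPT-2's maximum context length
total_nll, n_tokens = 0.0, 0
with torch.no_grad():
    for start in range(0, ids.size(1) - 1, window):
        chunk = ids[:, start:start + window]
        out = model(chunk, labels=chunk)  # HF shifts labels internally
        total_nll += out.loss.item() * (chunk.size(1) - 1)
        n_tokens += chunk.size(1) - 1

print(f"perplexity: {math.exp(total_nll / n_tokens):.2f}")
```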
## Training Details
- Training Data: WikiText-103
- Training Objective: Hybrid of KL-divergence and language-modeling losses (see the loss sketch after this list)
- Supervision Signal: kNN distributions constructed with GPT2-xl; the finetuned version of GPT2-xl is recommended here
- Hyperparameters:
  - Learning rate: 5e-4
  - Optimizer: AdamW
  - Alpha (loss balance): 0.5
  - Training epochs: 50
  - LR Scheduler: Linear
  - Warmup Steps: 2000
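As a rough sketch of the hybrid objective listed above, the function below mixes a KL term, which pushes the memory's distribution toward the teacher kNN distribution, with the standard language-modeling cross-entropy, weighted by the alpha value from the hyperparameters. The function name, tensor shapes, and the direction of the KL term are illustrative assumptions rather than the repository's exact implementation.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(mem_logits: torch.Tensor,  # (batch, seq, vocab) logits from the MLP memory
                knn_probs: torch.Tensor,   # (batch, seq, vocab) kNN distributions from the GPT2-xl datastore
                target_ids: torch.Tensor,  # (batch, seq) gold next tokens
                alpha: float = 0.5) -> torch.Tensor:
    log_p_mem = F.log_softmax(mem_logits, dim=-1)
    # KL(kNN || memory) per position, averaged over batch and sequence.
    kl = F.kl_div(log_p_mem, knn_probs, reduction="none").sum(-1).mean()
    # Standard language-modeling cross-entropy on the gold tokens.
    ce = F.cross_entropy(mem_logits.flatten(0, 1), target_ids.flatten())
    return alpha * kl + (1.0 - alpha) * ce
```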
## Citation
@inproceedings{Wei2025MLPMA,
title={MLP Memory: A Retriever-Pretrained Memory for Large Language Models},
author={Rubin Wei and Jiaqi Cao and Jiarui Wang and Jushi Kai and Qipeng Guo and Bowen Zhou and Zhouhan Lin},
year={2025},
url={https://api.semanticscholar.org/CorpusID:281658735}
}
## Contact
For questions and support: [email protected]