Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels
Abstract
A scalable data engine converts large-scale pre-training documents into diverse question-answer pairs for reinforcement learning, significantly improving model performance and efficiency.
Large Language Models (LLMs) have achieved remarkable success through imitation learning on vast text corpora, but this paradigm creates a training-generation gap and limits robust reasoning. Reinforcement learning (RL) offers a more data-efficient solution capable of bridging this gap, yet its application has been constrained by a critical data bottleneck: existing RL datasets are orders of magnitude smaller and less diverse than web-scale pre-training corpora. To address this, we introduce the Webscale-RL pipeline, a scalable data engine that systematically converts large-scale pre-training documents into millions of diverse, verifiable question-answer pairs for RL. Using this pipeline, we construct the Webscale-RL dataset, containing 1.2 million examples across more than 9 domains. Our experiments show that the model trained on this dataset significantly outperforms continual pre-training and strong data-refinement baselines across a suite of benchmarks. Notably, RL training with our dataset proves substantially more efficient, achieving the performance of continual pre-training with up to 100× fewer tokens. Our work presents a viable path toward scaling RL to pre-training levels, enabling more capable and efficient language models.
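To make the core idea of the pipeline concrete, here is a minimal, hypothetical sketch in Python of the document-to-QA conversion step. This is not the released implementation: the prompt wording, the `llm` callable, and the exact-match verifiability filter are all illustrative assumptions; see the GitHub repository linked below for the actual pipeline.

```python
# Hypothetical sketch of converting one pretraining document into a verifiable
# QA pair (illustrative only, not the authors' released code).
# `llm` is any callable mapping a prompt string to a model completion string.
from dataclasses import dataclass
from typing import Callable, Optional

QA_PROMPT = """Read the document below and write ONE question whose answer is a short,
verifiable fact stated in the document, followed by that answer.

Document:
{doc}

Format:
Question: ...
Answer: ..."""


@dataclass
class QAPair:
    question: str
    answer: str
    source_doc: str


def document_to_qa(doc: str, llm: Callable[[str], str]) -> Optional[QAPair]:
    """Turn a pretraining document into a verifiable QA pair, or None on failure."""
    completion = llm(QA_PROMPT.format(doc=doc))
    question, answer = None, None
    for line in completion.splitlines():
        if line.startswith("Question:"):
            question = line.removeprefix("Question:").strip()
        elif line.startswith("Answer:"):
            answer = line.removeprefix("Answer:").strip()
    if not question or not answer:
        return None
    # Cheap verifiability filter (an assumption, not the paper's actual check):
    # keep only answers grounded in the source text so an exact-match reward
    # can be computed during RL training.
    if answer.lower() not in doc.lower():
        return None
    return QAPair(question, answer, doc)
```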
Community
RL for LLMs has been bottlenecked by tiny datasets (<10B tokens) vs. pretraining corpora (>1T tokens).
Our Webscale-RL pipeline converts pretraining text into diverse RL-ready QA data — scaling RL to pretraining levels!
All code and datasets are open-source!
HF🤗: https://huggingface.co/datasets/Salesforce/Webscale-RL
Github 🤖: https://github.com/SalesforceAIResearch/PretrainRL-pipeline
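A minimal usage sketch for loading the released dataset with the Hugging Face `datasets` library follows; the `train` split name and the record fields are assumptions, so check the dataset card for the actual schema.

```python
# Minimal sketch: load the Webscale-RL dataset with the `datasets` library.
# The split name ("train") and field layout are assumptions; consult the
# dataset card at https://huggingface.co/datasets/Salesforce/Webscale-RL.
from datasets import load_dataset

ds = load_dataset("Salesforce/Webscale-RL", split="train")
print(ds[0])  # inspect one question-answer example
```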
Related papers recommended by the Semantic Scholar API (via Librarian Bot):
- Reinforcement Learning on Pre-Training Data (2025)
- MobileLLM-R1: Exploring the Limits of Sub-Billion Language Model Reasoners with Open Training Recipes (2025)
- MedGR2: Breaking the Data Barrier for Medical Reasoning via Generative Reward Learning (2025)
- Revealing the Power of Post-Training for Small Language Models via Knowledge Distillation (2025)
- PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning (2025)
- Empowering Lightweight MLLMs with Reasoning via Long CoT SFT (2025)
- Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers (2025)