Abstract
An iterative sampling algorithm enhances reasoning capabilities in base models without additional training, matching or outperforming reinforcement learning on single-shot tasks.
Frontier reasoning models have exhibited incredible capabilities across a wide array of disciplines, driven by post-training large language models (LLMs) with reinforcement learning (RL). However, despite the widespread success of this paradigm, much of the literature has been devoted to disentangling truly novel behaviors that emerge during RL but are not present in the base models. In our work, we approach this question from a different angle, instead asking whether comparable reasoning capabilities can be elicited from base models at inference time by pure sampling, without any additional training. Inspired by Markov chain Monte Carlo (MCMC) techniques for sampling from sharpened distributions, we propose a simple iterative sampling algorithm leveraging the base models' own likelihoods. Across different base models, we show that our algorithm offers substantial boosts in reasoning that nearly match, and even outperform, those from RL on a wide variety of single-shot tasks, including MATH500, HumanEval, and GPQA. Moreover, our sampler avoids the collapse in diversity over multiple samples that is characteristic of RL post-training. Crucially, our method does not require training, curated datasets, or a verifier, suggesting broad applicability beyond easily verifiable domains.
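The abstract does not spell out the exact sampler, but one natural instantiation of MCMC over a sharpened distribution is independence Metropolis-Hastings targeting the power distribution p(x)^α, with the base model serving both as the proposal and as the likelihood scorer; in that case the acceptance ratio reduces to (p(x_new)/p(x_old))^(α-1). The sketch below is a minimal illustration under those assumptions, using a toy categorical "model" in place of an LLM; the interface (sample_sequence, log_prob), the toy model, and the choice of α are hypothetical and not taken from the paper.

```python
import math
import random

# Toy stand-in for a base model: any real LLM exposing "sample a completion"
# and "score its log-likelihood" would slot into this (hypothetical) interface.
VOCAB = ["a", "b", "c"]
LOGITS = {"a": 1.0, "b": 0.0, "c": -1.0}

def sample_sequence(length=4):
    """Sample a sequence token-by-token from the toy base model p."""
    weights = [math.exp(LOGITS[t]) for t in VOCAB]
    return random.choices(VOCAB, weights=weights, k=length)

def log_prob(seq):
    """Log-likelihood of a sequence under the toy base model p."""
    log_z = math.log(sum(math.exp(v) for v in LOGITS.values()))
    return sum(LOGITS[t] - log_z for t in seq)

def power_sampler(alpha=4.0, steps=200, length=4):
    """Independence Metropolis-Hastings targeting p(x)^alpha, proposing from p.

    Because the proposal is p itself, the acceptance ratio simplifies to
    (p(x_new) / p(x_old)) ** (alpha - 1), so only base-model log-likelihoods
    are needed -- no training, curated dataset, or verifier.
    """
    current = sample_sequence(length)
    current_lp = log_prob(current)
    for _ in range(steps):
        proposal = sample_sequence(length)
        proposal_lp = log_prob(proposal)
        log_accept = (alpha - 1.0) * (proposal_lp - current_lp)
        if random.random() < math.exp(min(0.0, log_accept)):
            current, current_lp = proposal, proposal_lp
    return current, current_lp

if __name__ == "__main__":
    seq, lp = power_sampler()
    print("sampled:", seq, "| log-prob under base model:", round(lp, 3))
```

With a real LLM, sample_sequence and log_prob might wrap autoregressive decoding and summed token log-probabilities, and proposals might regenerate only a block or suffix of the current sequence rather than the whole thing; those choices are implementation details not specified by the abstract.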
Community
This paper explores a sampling-based method that improves LLM reasoning performance, surpassing the results of RL training with GRPO at a very low cost.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Let it Calm: Exploratory Annealed Decoding for Verifiable Reinforcement Learning (2025)
- Attention as a Compass: Efficient Exploration for Process-Supervised RL in Reasoning Models (2025)
- Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning (2025)
- Variational Reasoning for Language Models (2025)
- Representation-Based Exploration for Language Models: From Test-Time to Post-Training (2025)
- Online SFT for LLM Reasoning: Surprising Effectiveness of Self-Tuning without Rewards (2025)
- Unlocking Reasoning Capabilities in LLMs via Reinforcement Learning Exploration (2025)
🥺 Please also see our NeurIPS 2025 paper "A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning", which introduces a theoretical framework for sampling-based test-time scaling methods.
Interesting work!
It seems that the base model can achieve higher performance even without an extra training step. I have a question: have you tried this method on other VLM tasks, such as grounding or video understanding?