Abstract
Pretrained language models can execute layers dynamically through flexible program-of-layers strategies that improve accuracy while reducing computational overhead compared to standard fixed-depth inference.
Large language models (LLMs) perform inference by executing all layers once, in a fixed order and at a fixed depth. We show that training-free, flexible, dynamic programs of layers (PoLar) exist widely: pretrained layers can be packed into modules and then skipped or looped to form a customized program for each input. For most inputs, substantially shorter programs achieve the same or better accuracy, and incorrect predictions of the original LLM can be corrected by alternative programs with fewer layers. These observations indicate that inference admits multiple valid latent computations beyond the standard forward pass. To realize PoLar efficiently in practice, we propose a lightweight PoLar prediction network that learns to generate execution programs which dynamically skip or repeat pretrained layers for each input. Experiments on mathematical reasoning benchmarks demonstrate that PoLar consistently improves accuracy over standard inference and prior dynamic-depth methods, often while executing fewer layers, and that these gains persist under out-of-distribution evaluation. Our results suggest that fixed-depth execution captures only a narrow subset of an LLM's latent reasoning capacity.
Community
This paper asks whether LLM inference really needs to follow the same fixed layer order for every input.
We introduce PoLar, a program-of-layers framework that treats frozen transformer layers as reusable functions. Instead of always executing all layers in the default order, PoLar learns input-specific execution programs that can skip, keep, or repeat layer segments without modifying the pretrained LLM.
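To make the idea of an execution program concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the names `run_program` and `executed_layers`, the encoding of a program as one repeat count per segment (0 = skip, 1 = keep, 2 or more = repeat), and the toy scalar "layers" are all assumptions chosen for illustration. In a real model each layer would be a frozen transformer block acting on hidden states.

```python
from typing import Callable, Sequence

# Hypothetical stand-in for a frozen layer: any function from state to state.
Layer = Callable[[float], float]


def run_program(
    layers: Sequence[Layer],
    segments: Sequence[range],
    program: Sequence[int],
    x: float,
) -> float:
    """Run each segment program[i] times (0 = skip, 1 = keep, 2+ = repeat)."""
    if len(segments) != len(program):
        raise ValueError("program needs exactly one count per segment")
    for seg, count in zip(segments, program):
        for _ in range(count):
            for idx in seg:
                x = layers[idx](x)  # layers are only called, never modified
    return x


def executed_layers(segments: Sequence[range], program: Sequence[int]) -> int:
    """Number of layer calls a program makes."""
    return sum(len(seg) * count for seg, count in zip(segments, program))


# Toy "model" with four layers grouped into three segments (assumed grouping).
layers = [
    lambda h: h + 1.0,
    lambda h: h * 2.0,
    lambda h: h - 3.0,
    lambda h: h * h,
]
segments = [range(0, 2), range(2, 3), range(3, 4)]

standard = run_program(layers, segments, [1, 1, 1], 1.0)  # default forward pass
custom = run_program(layers, segments, [2, 0, 1], 1.0)    # repeat, skip, keep
```

The all-ones program reproduces the standard forward pass, so fixed-depth inference is one point in the space of programs. Any other program reuses the same frozen functions in a different schedule.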
A key finding is that fixed-depth inference is only one path through a richer latent computation space. Many inputs admit alternative valid programs: 75.5% of already-correct inputs have shorter valid programs, and 36.2% of originally-wrong inputs admit shorter correcting programs.
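One way to picture what "admits a shorter valid program" means is an exhaustive search over programs for a single input. The sketch below is an assumption made for illustration, not the search procedure used in the paper, which this page does not describe. The helper names (`run`, `cost`, `shortest_valid_program`), the `max_repeat` cap, and the toy scalar layers are all hypothetical.

```python
from itertools import product


def run(layers, segments, program, x):
    """Apply each segment program[i] times, in order."""
    for seg, count in zip(segments, program):
        for _ in range(count):
            for i in seg:
                x = layers[i](x)
    return x


def cost(segments, program):
    """Number of layer calls a program makes."""
    return sum(len(seg) * count for seg, count in zip(segments, program))


def shortest_valid_program(layers, segments, x, is_correct, max_repeat=2):
    """Brute-force search for the cheapest program whose output is correct."""
    best = None
    for program in product(range(max_repeat + 1), repeat=len(segments)):
        if is_correct(run(layers, segments, program, x)):
            if best is None or cost(segments, program) < cost(segments, best):
                best = program
    return best


# Toy setup: the last segment undoes itself, so it can be skipped safely.
layers = [
    lambda h: h + 1.0,
    lambda h: h * 2.0,
    lambda h: h - 3.0,
    lambda h: h + 3.0,
]
segments = [range(0, 1), range(1, 2), range(2, 4)]

default_output = run(layers, segments, (1, 1, 1), 3.0)
best = shortest_valid_program(
    layers, segments, 3.0, is_correct=lambda y: y == default_output
)
```

The search visits `(max_repeat + 1) ** len(segments)` programs per input, which grows exponentially with the number of segments. That cost is what makes per-input search expensive at scale.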
PoLar replaces expensive per-input search with a lightweight predictor that directly outputs layer programs, improving accuracy over standard inference and prior dynamic-depth methods while adding only ~0.8% inference overhead.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- BUDDY: BUdget-Driven DYnamic Depth Routing for Adaptive Large Language Model Inference (2026)
- LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models (2026)
- Sparse Layers are Critical to Scaling Looped Language Models (2026)
- LoopCTR: Unlocking the Loop Scaling Power for Click-Through Rate Prediction (2026)
- Test-Time Compute Scaling for ASR with Depth-Conditioned Looped Transformers (2026)
- Post-Trained MoE Can Skip Half Experts via Self-Distillation (2026)
- End-to-End Context Compression at Scale (2026)
