Once Upon an Input: Reasoning via Per-Instance Program Synthesis
Abstract
Per-Instance Program Synthesis (PIPS) enhances LLM performance by generating and refining instance-level programs with structural feedback, improving accuracy and reducing undesirable solutions.
Large language models (LLMs) excel at zero-shot inference but continue to struggle with complex, multi-step reasoning. Recent methods that augment LLMs with intermediate reasoning steps, such as Chain of Thought (CoT) and Program of Thought (PoT), improve performance but often produce undesirable solutions, especially in algorithmic domains. We introduce Per-Instance Program Synthesis (PIPS), a method that generates and refines programs at the instance level using structural feedback, without relying on task-specific guidance or explicit test cases. To further improve performance, PIPS incorporates a confidence metric that dynamically chooses between direct inference and program synthesis on a per-instance basis. Experiments across three frontier LLMs and 30 benchmarks, including all tasks of Big Bench Extra Hard (BBEH), visual question answering tasks, relational reasoning tasks, and mathematical reasoning tasks, show that PIPS improves the absolute harmonic mean accuracy by up to 8.6% and 9.4% compared to PoT and CoT, respectively, and reduces undesirable program generations by 65.1% on the algorithmic tasks compared to PoT with Gemini-2.0-Flash.
Community
Why do LLMs (and LLM agents) still struggle on hard reasoning problems that should be solvable by writing and executing code? We find a simple culprit: zero-shot “programs” often don’t compute anything—they just hard-code the answer. In fact, zero-shot Program-of-Thought (PoT) produces trivial programs >50% of the time on Gemini-2.0-Flash across 30 tasks.
To address the limitations of existing per-instance program synthesis methods, PIPS tackles three core challenges:
- Open-domain routing: When should the model code vs. think in text?
- No task specs: How to write useful code without examples/specs?
- Unstructured inputs: How to turn raw text/images into executable signals?
What PIPS does:
- Confidence-based routing. A calibrated classifier (10 criteria) chooses between program synthesis and Chain-of-Thought (CoT), and its score tracks closely with actual synthesis success (see the routing sketch after this list).
- Iterative refinement with structural feedback. We penalize trivial, hard-coded programs and push the model toward correct, general code (see the triviality-check sketch below).
- Explicit symbol extraction. Before any code is written, PIPS extracts symbolic inputs from raw text/images; removing this step significantly degrades performance.
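The paper does not spell out the exact routing prompt here, but the first bullet can be sketched as a simple vote over yes/no criteria scored by an LLM judge. Everything in this sketch is illustrative: the criterion wordings, the 0.5 threshold, and the `judge` callable are assumptions, not the actual PIPS implementation.

```python
# Minimal sketch of confidence-based routing between program synthesis (PoT-style)
# and Chain-of-Thought. PIPS scores ten criteria; the four listed here are
# hypothetical stand-ins, and the threshold is arbitrary.

from dataclasses import dataclass
from typing import Callable

# Hypothetical criteria an LLM judge could answer yes/no for a given instance.
CRITERIA = [
    "The task requires exact arithmetic or counting.",
    "The task has a well-defined algorithmic structure (search, sorting, simulation).",
    "All inputs needed by a program can be extracted as explicit symbols.",
    "A short program is likely to generalize beyond this single instance.",
    # ...the remaining criteria would follow in the same spirit.
]

@dataclass
class Route:
    use_program_synthesis: bool
    confidence: float

def route(problem: str, judge: Callable[[str, str], bool], threshold: float = 0.5) -> Route:
    """Average per-criterion judgments into a confidence score and route accordingly.

    `judge(problem, criterion)` is assumed to wrap an LLM yes/no call; any
    backend (or a heuristic stub for testing) can be plugged in.
    """
    votes = [judge(problem, criterion) for criterion in CRITERIA]
    confidence = sum(votes) / len(votes)
    return Route(use_program_synthesis=confidence >= threshold, confidence=confidence)

if __name__ == "__main__":
    # Toy stub: pretend the judge answers "yes" only for criteria mentioning arithmetic.
    stub = lambda problem, criterion: "arithmetic" in criterion
    print(route("How many distinct paths of length 4 exist in this graph?", stub))
```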
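Likewise, the structural feedback in the second bullet can be approximated by a static check that flags "programs" which never compute anything and just emit a hard-coded answer. The AST heuristic below is my own illustration of that idea under simple assumptions; the paper's actual triviality criteria may differ.

```python
# Sketch of a structural check for trivial, hard-coded candidate programs: code that
# only moves literals around (or just prints one) does no real computation and would
# be penalized, prompting regeneration. Heuristic is illustrative, not PIPS's exact rule.

import ast

def is_trivial_program(source: str) -> bool:
    """Heuristic: True if the candidate shows no sign of real computation."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return True  # unparseable candidates are treated as failed generations

    for node in ast.walk(tree):
        # Loops, comprehensions, arithmetic, comparisons, and branching count as real work.
        if isinstance(node, (ast.For, ast.While, ast.ListComp, ast.GeneratorExp,
                             ast.BinOp, ast.Compare, ast.If)):
            return False
        if isinstance(node, ast.Call):
            # A bare print() of a stored literal is not computation; other calls are.
            if not (isinstance(node.func, ast.Name) and node.func.id == "print"):
                return False
    return True

print(is_trivial_program("answer = 42\nprint(answer)"))          # True: hard-coded answer
print(is_trivial_program("print(sum(i * i for i in range(10)))"))  # False: actually computes
```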
Results (30 tasks: Big Bench Extra Hard, OmniMath, neuro-symbolic) 🚀:
- Up to +8.6% absolute improvement in harmonic-mean accuracy over PoT and +9.4% over CoT
- On algorithmic tasks, >65% reduction in trivial, hard-coded programs → more correct and verifiable reasoning