arxiv:2510.24211

MC-SJD: Maximal Coupling Speculative Jacobi Decoding for Autoregressive Visual Generation Acceleration

Published on Oct 28
· Submitted by Junhyuk So on Oct 30

Abstract

AI-generated summary: A new parallel decoding framework, MC-SJD, accelerates autoregressive visual generation by improving token stability and acceptance rate, achieving significant speedups without quality loss.

While autoregressive (AR) modeling has recently emerged as a new paradigm in visual generation, its practical adoption is severely constrained by the slow inference speed of per-token generation, which often requires thousands of steps to produce a single sample. To address this challenge, we propose MC-SJD, a training-free, lossless parallel decoding framework designed to accelerate AR visual generation by extending the recently introduced Speculative Jacobi Decoding (SJD). Although SJD shows strong potential for accelerating AR generation, we demonstrate that token instability across iterations significantly reduces the acceptance rate, a limitation that primarily arises from the independent sampling process used during draft token generation. To overcome this, we introduce MC-SJD, an information-theoretic approach based on coupling, which substantially accelerates standard SJD by maximizing the probability of sampling identical draft tokens across consecutive iterations, all while preserving its lossless property. Remarkably, this method requires only a single-line modification to the existing algorithm, yet achieves substantial performance gains, delivering up to a ~4.2x acceleration in image generation and ~13.3x acceleration in video generation compared to standard AR decoding, without any degradation in output quality.
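The coupling idea described above can be made concrete with a small sketch. The snippet below shows maximal coupling between the draft distribution from the previous Jacobi iteration and the updated distribution at the current iteration: the previous draft token is kept with probability min(1, p_curr/p_prev), and otherwise a fresh token is drawn from the normalized positive residual. This is a minimal illustration under assumed names (`coupled_sample`, NumPy arrays for the token distributions), not the paper's actual implementation.

```python
import numpy as np

def coupled_sample(prev_token, p_prev, p_curr, rng):
    """Maximal-coupling draw from p_curr, reusing prev_token when possible.

    prev_token is assumed to have been sampled from p_prev at the previous
    Jacobi iteration. Keeping it with probability min(1, p_curr/p_prev) and
    otherwise resampling from the normalized residual (p_curr - p_prev)_+
    yields an exact sample from p_curr while maximizing the probability
    that consecutive iterations emit the same draft token.
    """
    if rng.random() < min(1.0, p_curr[prev_token] / p_prev[prev_token]):
        return prev_token
    residual = np.maximum(p_curr - p_prev, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p_curr), p=residual)

rng = np.random.default_rng(0)
p_prev = np.array([0.5, 0.3, 0.2])
p_curr = np.array([0.45, 0.35, 0.2])
token = coupled_sample(0, p_prev, p_curr, rng)  # usually returns 0 again
```

Because the marginal of the coupled draw is exactly p_curr, the procedure stays lossless; only the joint behavior across iterations changes, which is why it can replace the independent sampling step with essentially a one-line edit.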

Community

Comment from the paper author and submitter:

MC-SJD is a novel acceleration technique for next-token-prediction-based autoregressive (AR) visual generation. By combining ideas from speculative decoding, fixed-point (Jacobi) iteration, and coupled sampling, MC-SJD achieves wall-clock speedups of ~4x in image generation and over ~13x in video generation. Because the coupled sampling step is lossless, the output distribution is provably identical to that of standard AR decoding, and the method applies plug-and-play to any AR model without additional training.
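The benefit of coupling over independent resampling can be quantified on a toy example. Under independent sampling, two consecutive draws from nearby distributions p and q agree with probability Σᵢ pᵢqᵢ, while any coupling is bounded by 1 − TV(p, q) = Σᵢ min(pᵢ, qᵢ), a bound the maximal coupling attains. The distributions below are made up for illustration and are not from the paper.

```python
import numpy as np

# Toy token distributions at two consecutive Jacobi iterations.
p_prev = np.array([0.5, 0.3, 0.2])
p_curr = np.array([0.45, 0.35, 0.2])

# Independent resampling: tokens match with probability sum_i p_i * q_i.
p_match_indep = float(np.dot(p_prev, p_curr))

# Maximal coupling attains the upper bound 1 - TV(p, q) = sum_i min(p_i, q_i).
p_match_coupled = float(np.minimum(p_prev, p_curr).sum())

print(p_match_indep, p_match_coupled)  # roughly 0.37 vs 0.95
```

The gap (≈0.37 vs ≈0.95 here) is what drives the higher acceptance rate, and hence the speedup, reported for MC-SJD.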

