DyPE: Dynamic Position Extrapolation for Ultra High Resolution Diffusion
Abstract
Dynamic Position Extrapolation (DyPE) enhances ultra-high-resolution image generation by dynamically adjusting positional encodings in pre-trained diffusion transformers, achieving state-of-the-art fidelity without additional sampling cost.
Diffusion Transformer models can generate images with remarkable fidelity and detail, yet training them at ultra-high resolutions remains extremely costly due to the self-attention mechanism's quadratic scaling with the number of image tokens. In this paper, we introduce Dynamic Position Extrapolation (DyPE), a novel, training-free method that enables pre-trained diffusion transformers to synthesize images at resolutions far beyond their training data, with no additional sampling cost. DyPE takes advantage of the spectral progression inherent to the diffusion process: low-frequency structures converge early, while high-frequency details take more steps to resolve. Specifically, DyPE dynamically adjusts the model's positional encoding at each diffusion step, matching its frequency spectrum to the current stage of the generative process. This approach allows us to generate images at resolutions that dramatically exceed the training resolution, e.g., 16 million pixels using FLUX. On multiple benchmarks, DyPE consistently improves performance and achieves state-of-the-art fidelity in ultra-high-resolution image generation, with gains becoming even more pronounced at higher resolutions. The project page is available at https://noamissachar.github.io/DyPE/.
Community
DyPE (Dynamic Position Extrapolation) enables pre-trained diffusion transformers to generate ultra-high-resolution images far beyond their training scale. It dynamically adjusts positional encodings during denoising to match evolving frequency content—achieving faithful 4K × 4K results without retraining or extra sampling cost.
Project page: https://noamissachar.github.io/DyPE/
Code: https://github.com/guyyariv/DyPE
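For intuition, here is a minimal PyTorch sketch of the core idea: tying the frequency spectrum of a RoPE-style positional encoding to the diffusion timestep. FLUX does use rotary position embeddings, but the cubic schedule `s_t`, the function names, and the lengths below are illustrative assumptions, not the paper's actual formulation; see the released code for the real implementation.

```python
import torch

def rope_freqs(dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE inverse frequencies for one head dimension."""
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

def dype_freqs(dim: int, t: float, train_len: int, target_len: int,
               base: float = 10000.0) -> torch.Tensor:
    """Hypothetical DyPE-style schedule (an assumption, not the paper's
    exact formula): scale positions aggressively early in sampling
    (t near 1, pure noise) so low-frequency structure stays within the
    trained range, then relax toward the native encoding as t -> 0 so
    high-frequency detail can resolve at full resolution."""
    scale = target_len / train_len          # e.g. 4096 / 1024 = 4.0
    s_t = scale ** (t ** 3)                 # illustrative cubic schedule
    return rope_freqs(dim, base) / s_t      # lower frequencies = interpolate

def rotary_angles(positions: torch.Tensor, inv_freq: torch.Tensor) -> torch.Tensor:
    """Outer product of token positions with the (time-scaled) frequencies."""
    return torch.einsum("n,f->nf", positions.float(), inv_freq)

# Recompute the positional angles at every denoising step.
positions = torch.arange(4096)              # token positions along one axis
for t in torch.linspace(1.0, 0.0, 28):      # 28 sampling steps, noise -> image
    angles = rotary_angles(positions, dype_freqs(64, t.item(), 1024, 4096))
    # cos(angles) / sin(angles) would be applied to Q and K inside attention.
```

The key property this sketch captures is that the encoding's effective spectrum starts low, when only coarse layout is being decided, and returns to the native spectrum by the final steps, when fine detail is resolved, all without retraining or changing the number of sampling steps.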
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Scale-DiT: Ultra-High-Resolution Image Generation with Hierarchical Local Attention (2025)
- UltraHR-100K: Enhancing UHR Image Synthesis with A Large-Scale High-Quality Dataset (2025)
- NoiseShift: Resolution-Aware Noise Recalibration for Better Low-Resolution Image Generation (2025)
- Asynchronous Denoising Diffusion Models for Aligning Text-to-Image Generation (2025)
- FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution (2025)
- HilbertA: Hilbert Attention for Image Generation with Diffusion Models (2025)
- Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer (2025)
Congrats! I've been looking for this. The high level of detail in these images gives them an artistic aesthetic, the kind that invites people to glance a second time with a sense of discovery and exploration. Thanks for helping us use texture and depth to pull the viewer in.