DyPE: Dynamic Position Extrapolation for Ultra High Resolution Diffusion
Abstract
Dynamic Position Extrapolation (DyPE) enhances ultra-high-resolution image generation by dynamically adjusting positional encodings in pre-trained diffusion transformers, achieving state-of-the-art fidelity without additional sampling cost.
Diffusion Transformer models can generate images with remarkable fidelity and detail, yet training them at ultra-high resolutions remains extremely costly due to the self-attention mechanism's quadratic scaling with the number of image tokens. In this paper, we introduce Dynamic Position Extrapolation (DyPE), a novel, training-free method that enables pre-trained diffusion transformers to synthesize images at resolutions far beyond their training data, with no additional sampling cost. DyPE takes advantage of the spectral progression inherent to the diffusion process: low-frequency structures converge early, while high-frequency details take more steps to resolve. Specifically, DyPE dynamically adjusts the model's positional encoding at each diffusion step, matching its frequency spectrum to the current stage of the generative process. This approach allows us to generate images at resolutions that dramatically exceed the training resolution, e.g., 16 million pixels using FLUX. On multiple benchmarks, DyPE consistently improves performance and achieves state-of-the-art fidelity in ultra-high-resolution image generation, with gains becoming even more pronounced at higher resolutions. The project page is available at https://noamissachar.github.io/DyPE/.
Community
DyPE (Dynamic Position Extrapolation) enables pre-trained diffusion transformers to generate ultra-high-resolution images far beyond their training scale. It dynamically adjusts positional encodings during denoising to match evolving frequency content—achieving faithful 4K × 4K results without retraining or extra sampling cost.
Project page: https://noamissachar.github.io/DyPE/
Code: https://github.com/guyyariv/DyPE
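For intuition, here is a minimal PyTorch sketch of the core idea: tying the frequency spectrum of a RoPE-style positional encoding to the diffusion timestep. FLUX does use rotary position embeddings, but the cubic schedule `s_t`, the function names, and the lengths below are illustrative assumptions, not the paper's actual formulation; see the released code for the real implementation.

```python
import torch

def rope_freqs(dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE inverse frequencies for one head dimension."""
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

def dype_freqs(dim: int, t: float, train_len: int, target_len: int,
               base: float = 10000.0) -> torch.Tensor:
    """Hypothetical DyPE-style schedule (an assumption, not the paper's
    exact formula): scale positions aggressively early in sampling
    (t near 1, pure noise) so low-frequency structure stays within the
    trained range, then relax toward the native encoding as t -> 0 so
    high-frequency detail can resolve at full resolution."""
    scale = target_len / train_len          # e.g. 4096 / 1024 = 4.0
    s_t = scale ** (t ** 3)                 # illustrative cubic schedule
    return rope_freqs(dim, base) / s_t      # lower frequencies = interpolate

def rotary_angles(positions: torch.Tensor, inv_freq: torch.Tensor) -> torch.Tensor:
    """Outer product of token positions with the (time-scaled) frequencies."""
    return torch.einsum("n,f->nf", positions.float(), inv_freq)

# Recompute the positional angles at every denoising step.
positions = torch.arange(4096)              # token positions along one axis
for t in torch.linspace(1.0, 0.0, 28):      # 28 sampling steps, noise -> image
    angles = rotary_angles(positions, dype_freqs(64, t.item(), 1024, 4096))
    # cos(angles) / sin(angles) would be applied to Q and K inside attention.
```

The key property this sketch captures is that the encoding's effective spectrum starts low, when only coarse layout is being decided, and returns to the native spectrum by the final steps, when fine detail is resolved, all without retraining or changing the number of sampling steps.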
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Scale-DiT: Ultra-High-Resolution Image Generation with Hierarchical Local Attention (2025)
- UltraHR-100K: Enhancing UHR Image Synthesis with A Large-Scale High-Quality Dataset (2025)
- NoiseShift: Resolution-Aware Noise Recalibration for Better Low-Resolution Image Generation (2025)
- Asynchronous Denoising Diffusion Models for Aligning Text-to-Image Generation (2025)
- FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution (2025)
- HilbertA: Hilbert Attention for Image Generation with Diffusion Models (2025)
- Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer (2025)
Congrats! I've been looking for this. The high level of detail in these images gives them an artistic aesthetic, the kind that invites people to glance a second time with a sense of discovery and exploration. Thanks for helping us use texture and depth to pull the viewer in.