NVIDIA PiD (Pixel Diffusion) SDXL 4-Step Upscaler — ONNX / TensorRT Version

This repository contains the optimized ONNX export of the official NVIDIA PiD (Pixel Diffusion) model: PiD_res2kto4k_sr4x_official_sdxl_distill_4step.

It is designed for high-performance super-resolution upscaling (from 1024x1024 to 4096x4096px) in C++ environments using NVIDIA TensorRT or ONNX Runtime.

Model Details

Original Model: NVIDIA PiD (nvidia/PiD)
Upscale Factor: 4x (e.g., 1024x1024 -> 4096x4096px)
Diffusion Steps: 4 steps (distilled student model)
Backbone: Stable Diffusion XL (SDXL) VAE
License: Apache-2.0 (inherited from the original NVIDIA PiD repository)

What is in this repository?

vae_encoder.onnx
- The SDXL VAE Encoder wrapper. It converts a preprocessed 1024x1024 input image to 1x4x128x128 latents (lq_latent).
- Note: The export includes the quant_conv projection layer, which is required for correct latent conditioning; without it the output shows color distortion or noise.
pid_upscaler.onnx & pid_upscaler.onnx.data
- The core PixelDiT student model.
- Optimization: Bypasses the Gemma-2-2B text encoder by using pre-computed zero-prompt embeddings during export, allowing execution in pure C++/TensorRT without Hugging Face gated license checks.
- ONNX Compatibility Patches: Replaced unsupported torch.nn.functional.fold operations (Col2Im) with equivalent TensorRT-friendly reshape/permute layers, and fixed float64/float32 type mismatches in the sine-cosine position embeddings.
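The fold replacement mentioned above can be illustrated at the index level. The sketch below is not the patch used in the export; it assumes non-overlapping patches (stride equal to kernel size), which is the case where fold is a pure rearrangement and can therefore be written as reshape/permute. The function name fold_nonoverlapping and the flat row-major layouts are hypothetical, and the batch dimension is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative only: fold with kernel == stride == p, batch dimension omitted.
// cols: row-major [C * p * p, L] with L = gh * gw patch positions.
// out:  row-major [C, gh * p, gw * p].
// ASSUMPTION: patches do not overlap, so every output pixel is written once
// and the operation is a pure permutation (no summation is needed).
std::vector<float> fold_nonoverlapping(const std::vector<float>& cols,
                                       std::size_t C, std::size_t p,
                                       std::size_t gh, std::size_t gw) {
    const std::size_t L = gh * gw;
    const std::size_t H = gh * p;
    const std::size_t W = gw * p;
    assert(cols.size() == C * p * p * L);

    std::vector<float> out(C * H * W);
    for (std::size_t c = 0; c < C; ++c)
        for (std::size_t ki = 0; ki < p; ++ki)
            for (std::size_t kj = 0; kj < p; ++kj)
                for (std::size_t bi = 0; bi < gh; ++bi)
                    for (std::size_t bj = 0; bj < gw; ++bj) {
                        // Row of cols: (channel, in-patch row, in-patch column).
                        const std::size_t row = (c * p + ki) * p + kj;
                        // Column of cols: patch position in row-major order.
                        const std::size_t col = bi * gw + bj;
                        // Destination pixel inside the assembled image.
                        const std::size_t y = bi * p + ki;
                        const std::size_t x = bj * p + kj;
                        out[(c * H + y) * W + x] = cols[row * L + col];
                    }
    return out;
}
```

In tensor terms this is the same mapping as viewing the input as [C, p, p, gh, gw], permuting to [C, gh, p, gw, p], and reshaping to [C, gh * p, gw * p].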

TensorRT Compilation Guide

To build high-performance TensorRT engines, use the trtexec utility bundled with your TensorRT installation.

Compiling the SDXL VAE in standard FP16 precision causes numerical overflows, which produce NaN latents and blank output images. Compiling PiD in FP32 can fail because the autotuner requests extremely large GPU memory allocations (over 400 GB). Follow the precision guidelines below to avoid both issues.

1. Compile VAE Encoder (FP32)

Compile the VAE in standard FP32 to ensure numerical stability:

trtexec --onnx=vae_encoder.onnx --saveEngine=vae_encoder.engine --verbose

2. Compile PiD Upscaler (BFloat16)

Compile the PiD model in BFloat16 to prevent both NaN overflows and GPU out-of-memory errors during build:

trtexec --onnx=pid_upscaler.onnx --saveEngine=pid_upscaler.engine --bf16 --verbose
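To see why the choice of 16-bit format matters for overflow, the sketch below compares the numeric range of FP16 and BFloat16 on the CPU. It is a simplified emulation, not TensorRT's actual conversion: emulate_fp16_range models only the FP16 range limit (largest finite value 65504) and ignores mantissa rounding, and emulate_bf16 truncates a float32 to its upper 16 bits instead of rounding to nearest. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Largest finite value representable in IEEE FP16.
constexpr float kFp16Max = 65504.0f;

// Simplified FP16 model: anything beyond the finite range becomes +/-inf.
// ASSUMPTION: mantissa rounding is ignored; only overflow is modeled.
float emulate_fp16_range(float v) {
    if (std::isnan(v)) return v;
    if (std::fabs(v) > kFp16Max)
        return std::copysign(std::numeric_limits<float>::infinity(), v);
    return v;
}

// BF16 keeps the sign bit, all 8 exponent bits, and the top 7 mantissa bits
// of a float32, so it has the same range as FP32 but less precision.
// ASSUMPTION: truncation is used here; real hardware usually rounds.
float emulate_bf16(float v) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32-bit");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &v, sizeof bits);
    bits &= 0xFFFF0000u;  // drop the low 16 mantissa bits
    std::memcpy(&v, &bits, sizeof bits);
    return v;
}
```

For example, emulate_fp16_range(70000.0f) is infinite, and subtracting two such infinities yields NaN, which is how a single overflow can spread through later layers. emulate_bf16(70000.0f) stays finite and returns 69632.0f: the magnitude survives and only precision is lost.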

Inference Pipeline Summary (C++)

In your C++ execution pipeline (e.g., using OpenCV for I/O and CUDA for memory management):

Preprocess: Resize the input image to 1024x1024, convert from BGR to RGB, normalize to $[-1, 1]$, and format to CHW.
VAE Inference: Run vae_encoder.engine on the input image to get 1x4x128x128 latents.
Initialize Noise: Create a starting noise tensor $x_T$ of size 1x3x4096x4096 with values drawn from a normal distribution.
Diffusion Loop: Loop for 4 steps with scaled timesteps $t \in [999, 866, 634, 342]$:
- Execute pid_upscaler.engine with inputs:
  - x: current noisy high-res image 1x3x4096x4096
  - t: scaled timestep value, shape 1
  - lq_latent: VAE latents 1x4x128x128
  - degrade_sigma: noise degradation value, shape 1 (constant 0.0f for clean images)
- The model returns v_pred.
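- Update the high-res image $x$ using SDE math: $$x_{0,\mathrm{pred}} = x_t - t_{cur} \cdot v_{pred}$$ $$x_{t_{next}} = (1 - t_{next}) \cdot x_{0,\mathrm{pred}} + t_{next} \cdot \epsilon$$ where $\epsilon \sim \mathcal{N}(0, I)$, and $t_{cur}$ and $t_{next}$ are the current and next timesteps. Because the update mixes with weights $(1 - t_{next})$ and $t_{next}$, these are presumably the timesteps rescaled to $[0, 1]$; the scale factor is not stated here.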
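The loop and update rule above can be sketched as follows. This is a minimal CPU illustration under stated assumptions, not the reference sampler. The callback type VelocityFn is a hypothetical stand-in for one pid_upscaler.engine execution, with lq_latent and degrade_sigma assumed to be bound inside the callback. The division by 1000 that maps scaled timesteps to $[0, 1]$ and the choice $t_{next} = 0$ after the last step are both assumptions; check them against the original PiD code.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

using Tensor = std::vector<float>;

// Hypothetical stand-in for one engine call: takes the current image x and
// the scaled timestep, returns v_pred with the same number of elements.
using VelocityFn = std::function<Tensor(const Tensor& x, float t_scaled)>;

Tensor run_sampler(Tensor x, const std::vector<float>& scaled_timesteps,
                   const VelocityFn& predict_v, std::mt19937& rng) {
    std::normal_distribution<float> gauss(0.0f, 1.0f);
    const float kScale = 1000.0f;  // ASSUMPTION: scaled timestep = 1000 * t

    for (std::size_t i = 0; i < scaled_timesteps.size(); ++i) {
        const bool last = (i + 1 == scaled_timesteps.size());
        const float t_cur = scaled_timesteps[i] / kScale;
        // ASSUMPTION: the schedule ends at t = 0 after the final step.
        const float t_next = last ? 0.0f : scaled_timesteps[i + 1] / kScale;

        const Tensor v = predict_v(x, scaled_timesteps[i]);
        assert(v.size() == x.size());

        for (std::size_t j = 0; j < x.size(); ++j) {
            // x0_pred = x_t - t_cur * v_pred
            const float x0_pred = x[j] - t_cur * v[j];
            // x_next = (1 - t_next) * x0_pred + t_next * eps
            const float eps = last ? 0.0f : gauss(rng);
            x[j] = (1.0f - t_next) * x0_pred + t_next * eps;
        }
    }
    return x;
}
```

In a real pipeline the tensors would live in CUDA device memory and the update would run as a kernel; the host-side std::vector version only makes the arithmetic explicit.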
Postprocess: Clamp final pixel values to $[-1, 1]$, convert back to $[0, 255]$ (RGB -> BGR), and save the 4096x4096 image.
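The layout and range conversions in the Preprocess and Postprocess steps can be sketched without OpenCV as below. Resizing is left out, since that would normally be done by the image library. The function names are hypothetical, and the mapping pixel / 127.5 - 1 is one common way to normalize to $[-1, 1]$; the source does not state the exact formula, so treat it as an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Interleaved BGR bytes (HWC) -> planar RGB floats (CHW) in [-1, 1].
// ASSUMPTION: normalization is pixel / 127.5 - 1.
std::vector<float> bgr_hwc_to_rgb_chw(const std::vector<std::uint8_t>& bgr,
                                      std::size_t h, std::size_t w) {
    assert(bgr.size() == h * w * 3);
    std::vector<float> out(3 * h * w);
    for (std::size_t y = 0; y < h; ++y)
        for (std::size_t x = 0; x < w; ++x)
            for (std::size_t c = 0; c < 3; ++c) {
                // Output plane c is R, G, B; the source channel order is B, G, R.
                const std::uint8_t px = bgr[(y * w + x) * 3 + (2 - c)];
                out[(c * h + y) * w + x] = px / 127.5f - 1.0f;
            }
    return out;
}

// Planar RGB floats (CHW) -> interleaved BGR bytes (HWC), clamped to [-1, 1].
std::vector<std::uint8_t> rgb_chw_to_bgr_hwc(const std::vector<float>& rgb,
                                             std::size_t h, std::size_t w) {
    assert(rgb.size() == h * w * 3);
    std::vector<std::uint8_t> out(h * w * 3);
    for (std::size_t y = 0; y < h; ++y)
        for (std::size_t x = 0; x < w; ++x)
            for (std::size_t c = 0; c < 3; ++c) {
                const float v = std::max(-1.0f, std::min(1.0f, rgb[(c * h + y) * w + x]));
                // Map [-1, 1] to [0, 255] and round to the nearest integer.
                out[(y * w + x) * 3 + (2 - c)] =
                    static_cast<std::uint8_t>((v + 1.0f) * 127.5f + 0.5f);
            }
    return out;
}
```

With OpenCV, the input vector would typically be filled from a continuous 8-bit, 3-channel cv::Mat after resizing to 1024x1024, and the output vector copied into a 4096x4096 cv::Mat before saving.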

Citation & Acknowledgements

If you use this model, please cite the original authors:

@article{lu2025pid,
  title={PiD: Halving the Steps of Pixel-space Diffusion Upscalers via Distillation},
  author={Lu, Yifan and others},
  journal={arXiv preprint arXiv:2501.00000},
  year={2025}
}

Special thanks to the original NVIDIA team for releasing the PiD model weights.
