# AutoencoderRAE
The Representation Autoencoder (RAE) model was introduced in *Diffusion Transformers with Representation Autoencoders* by Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie from NYU VISIONx.
RAE combines a frozen pretrained vision encoder (DINOv2, SigLIP2, or MAE) with a trainable ViT-MAE-style decoder. In the two-stage RAE training recipe, the autoencoder is trained in stage 1 (reconstruction), and then a diffusion model is trained on the resulting latent space in stage 2 (generation).
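For context, stage 2 consumes RAE latents as diffusion training targets. A minimal sketch of that data-preparation step, assuming the loading and `encode` API shown below (the image batch is a random placeholder):

```python
import torch
from diffusers import AutoencoderRAE

# Stage-2 data preparation sketch: encode images into RAE latents that a
# diffusion transformer would then be trained to generate.
rae = AutoencoderRAE.from_pretrained(
    "nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08"
).to("cuda").eval()

images = torch.rand(4, 3, 224, 224, device="cuda")  # placeholder batch in [0, 1]
with torch.no_grad():
    latents = rae.encode(images).latent  # (4, 768, 16, 16) stage-2 training targets
```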
The following RAE models are released and supported in Diffusers:
| Model | Encoder | Latent shape (224px input unless noted) |
|---|---|---|
| `nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08` | DINOv2-base | 768 x 16 x 16 |
| `nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08-i512` | DINOv2-base (512px) | 768 x 32 x 32 |
| `nyu-visionx/RAE-dinov2-wReg-small-ViTXL-n08` | DINOv2-small | 384 x 16 x 16 |
| `nyu-visionx/RAE-dinov2-wReg-large-ViTXL-n08` | DINOv2-large | 1024 x 16 x 16 |
| `nyu-visionx/RAE-siglip2-base-p16-i256-ViTXL-n08` | SigLIP2-base | 768 x 16 x 16 |
| `nyu-visionx/RAE-mae-base-p16-ViTXL-n08` | MAE-base | 768 x 16 x 16 |
## Loading a pretrained model
```python
from diffusers import AutoencoderRAE

model = AutoencoderRAE.from_pretrained(
    "nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08"
).to("cuda").eval()
```

## Encoding and decoding a real image
```python
import torch
from diffusers import AutoencoderRAE
from diffusers.utils import load_image
from torchvision.transforms.functional import to_tensor, to_pil_image

model = AutoencoderRAE.from_pretrained(
    "nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08"
).to("cuda").eval()

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")
image = image.convert("RGB").resize((224, 224))
x = to_tensor(image).unsqueeze(0).to("cuda")  # (1, 3, 224, 224), values in [0, 1]

with torch.no_grad():
    latents = model.encode(x).latent       # (1, 768, 16, 16)
    recon = model.decode(latents).sample   # (1, 3, 256, 256)

recon_image = to_pil_image(recon[0].clamp(0, 1).cpu())
recon_image.save("recon.png")
```

## Latent normalization
Some pretrained checkpoints include per-channel `latents_mean` and `latents_std` statistics for normalizing the latent space. When present, `encode` and `decode` automatically apply the normalization and denormalization, respectively.
```python
model = AutoencoderRAE.from_pretrained(
    "nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08"
).to("cuda").eval()

# Latent normalization is handled automatically inside encode/decode
# when the checkpoint config includes latents_mean/latents_std.
with torch.no_grad():
    latents = model.encode(x).latent       # normalized latents
    recon = model.decode(latents).sample
```
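If you need to apply or undo the normalization manually, for example to inspect raw latents, the statistics live on the model config. A minimal sketch, continuing the snippet above and assuming the checkpoint stores per-channel `latents_mean`/`latents_std` lists:

```python
import torch

# Hypothetical manual (de)normalization; encode/decode already do this
# internally when the stats are present in the config.
mean = torch.tensor(model.config.latents_mean, device="cuda").view(1, -1, 1, 1)
std = torch.tensor(model.config.latents_std, device="cuda").view(1, -1, 1, 1)

raw = latents * std + mean       # undo the normalization applied by encode
renorm = (raw - mean) / std      # re-apply it before calling decode
```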
## AutoencoderRAE

class diffusers.AutoencoderRAE

```python
AutoencoderRAE(
    encoder_type: str = 'dinov2',
    encoder_hidden_size: int = 768,
    encoder_patch_size: int = 14,
    encoder_num_hidden_layers: int = 12,
    decoder_hidden_size: int = 512,
    decoder_num_hidden_layers: int = 8,
    decoder_num_attention_heads: int = 16,
    decoder_intermediate_size: int = 2048,
    patch_size: int = 16,
    encoder_input_size: int = 224,
    image_size: int | None = None,
    num_channels: int = 3,
    encoder_norm_mean: list | None = None,
    encoder_norm_std: list | None = None,
    latents_mean: list | tuple | torch.Tensor | None = None,
    latents_std: list | tuple | torch.Tensor | None = None,
    noise_tau: float = 0.0,
    reshape_to_2d: bool = True,
    use_encoder_loss: bool = False,
    scaling_factor: float = 1.0,
)
```
Parameters

- `encoder_type` (`str`, *optional*, defaults to `"dinov2"`) — Type of frozen encoder to use. One of `"dinov2"`, `"siglip2"`, or `"mae"`.
- `encoder_hidden_size` (`int`, *optional*, defaults to `768`) — Hidden size of the encoder model.
- `encoder_patch_size` (`int`, *optional*, defaults to `14`) — Patch size of the encoder model.
- `encoder_num_hidden_layers` (`int`, *optional*, defaults to `12`) — Number of hidden layers in the encoder model.
- `patch_size` (`int`, *optional*, defaults to `16`) — Decoder patch size (used for unpatchify and the decoder head).
- `encoder_input_size` (`int`, *optional*, defaults to `224`) — Input size expected by the encoder.
- `image_size` (`int`, *optional*) — Decoder output image size. If `None`, it is derived from the encoder token count and `patch_size` like RAE-main: `image_size = patch_size * sqrt(num_patches)`, where `num_patches = (encoder_input_size // encoder_patch_size) ** 2` (see the worked example after this list).
- `num_channels` (`int`, *optional*, defaults to `3`) — Number of input/output channels.
- `encoder_norm_mean` (`list`, *optional*, defaults to `[0.485, 0.456, 0.406]`) — Channel-wise mean for encoder input normalization (ImageNet defaults).
- `encoder_norm_std` (`list`, *optional*, defaults to `[0.229, 0.224, 0.225]`) — Channel-wise std for encoder input normalization (ImageNet defaults).
- `latents_mean` (`list` or `tuple`, *optional*) — Optional mean for latent normalization. Tensor inputs are accepted and converted to config-serializable lists.
- `latents_std` (`list` or `tuple`, *optional*) — Optional standard deviation for latent normalization. Tensor inputs are accepted and converted to config-serializable lists.
- `noise_tau` (`float`, *optional*, defaults to `0.0`) — Noise level for training (adds noise to latents during training).
- `reshape_to_2d` (`bool`, *optional*, defaults to `True`) — Whether to reshape latents to 2D `(B, C, H, W)` format.
- `use_encoder_loss` (`bool`, *optional*, defaults to `False`) — Whether to use encoder hidden states in the loss (for advanced training).
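As a worked example of the `image_size` default, plugging in the stock DINOv2 configuration values from above:

```python
import math

encoder_input_size, encoder_patch_size, patch_size = 224, 14, 16  # defaults

num_patches = (encoder_input_size // encoder_patch_size) ** 2  # 16 ** 2 = 256
image_size = patch_size * int(math.sqrt(num_patches))          # 16 * 16 = 256
# Matches the (1, 3, 256, 256) reconstruction shape in the example above.
```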
Representation Autoencoder (RAE) model for encoding images to latents and decoding latents to images.
This model uses a frozen pretrained encoder (DINOv2, SigLIP2, or MAE) with a trainable ViT decoder to reconstruct images from learned representations.
This model inherits from ModelMixin. Check the superclass documentation for its generic methods implemented for all models (such as downloading or saving).
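For instance, the inherited `save_pretrained`/`from_pretrained` pair can round-trip the model; the local path here is only illustrative:

```python
from diffusers import AutoencoderRAE

# Save to a local directory (hypothetical path) and reload;
# both methods are inherited from ModelMixin.
model.save_pretrained("./rae-local")
reloaded = AutoencoderRAE.from_pretrained("./rae-local")
```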
## DecoderOutput

class diffusers.models.autoencoders.vae.DecoderOutput

```python
DecoderOutput(
    sample: torch.Tensor,
    commit_loss: torch.FloatTensor | None = None,
)
```
Output of decoding method.