---
tags:
- RDT
- rdt
- tokenizer
- action
- discrete
- vector-quantization
- RDT 2
license: apache-2.0
pipeline_tag: robotics
---

# RVQ-AT: Residual VQ Action Tokenizer for RDT 2

**RVQ-AT** is a fast, compact **Residual Vector-Quantization** (RVQ) tokenizer for robot action streams. It converts continuous control trajectories into short sequences of **discrete action tokens** that plug directly into autoregressive VLA models.

Unlike single-codebook VQ, RVQ-AT stacks multiple small codebooks and quantizes **residuals** level by level. This yields:

* **Higher fidelity at the same bitrate** (lower reconstruction MSE / higher SNR)
* **Shorter token sequences** for the same time horizon
* **Stable training** via commitment loss, EMA codebook updates, and dead-code revival

Here, we provide:

1. **RVQ-AT** — a general-purpose tokenizer trained on diverse UMI manipulation data that also generalizes well to tele-operation data.
2. **Simple APIs to fit your own tokenizer** on custom action datasets.

---

## Using the Universal RVQ-AT Tokenizer

We recommend chunking actions into ~**0.8 s windows** at fps = 30 and normalizing each action dimension to **[-1, 1]** with this [normalizer](http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt) before tokenization. Batched encode and decode are supported.

```python
# Run inside the repository: https://github.com/thu-ml/RDT2
import torch

from models.normalizer import LinearNormalizer
from vqvae.models.multivqvae import MultiVQVAE

# Load from the Hub (replace with your repo id once published)
vae = MultiVQVAE.from_pretrained("outputs/vqvae_hf").cuda().eval()
normalizer = LinearNormalizer.load(
    ""  # Download from:
        # http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
)

# Load your RELATIVE action chunk
action_chunk = torch.zeros((1, 24, 20))
# action_chunk shape: (B, T, action_dim)
# - T = 24: predicts the future 0.8 s at fps = 30 → 24 frames
# - action_dim = 20: following the UMI setting (both arms, right then left)
#   - [0-2]:   RIGHT ARM end-effector position (x, y, z), unit: m
#   - [3-8]:   RIGHT ARM end-effector rotation (6D representation)
#   - [9]:     RIGHT ARM gripper width in [0, 0.088]; 0.088 means fully open
#   - [10-12]: LEFT ARM end-effector position (x, y, z), unit: m
#   - [13-18]: LEFT ARM end-effector rotation (6D representation)
#   - [19]:    LEFT ARM gripper width in [0, 0.088]; 0.088 means fully open

# Normalize actions to [-1, 1]
nsample = normalizer["action"].normalize(action_chunk).cuda()

# Encode → tokens
# tokens: torch.LongTensor with shape (B, num_valid_action_token)
# num_valid_action_token = 27, values in range [0, 1024)
tokens = vae.encode(nsample)

# Decode back to continuous actions
recon_nsample = vae.decode(tokens)
recon_action_chunk = normalizer["action"].unnormalize(recon_nsample)
```

---

## [IMPORTANT] Recommended Preprocessing

Although our Residual VQ demonstrates strong generalization across both hand-held gripper data and real robot data, if you plan to fine-tune on your own dataset we recommend first verifying that the statistics of your data fall within the bounds of our RVQ. Then evaluate the reconstruction error on your data before using the tokenizer for your own purposes, especially fine-tuning; a sketch of such a check follows.
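Below is a minimal sketch of such a check, reusing `vae` and `normalizer` from the snippet above. `load_your_action_chunks` is a hypothetical placeholder for your own data loading; it should return relative action chunks shaped `(N, 24, 20)` in the format described earlier.

```python
# Round-trip your own data through the tokenizer and measure reconstruction error.
# NOTE: `load_your_action_chunks` is a hypothetical stand-in for your own loader.
import torch

action_chunks = load_your_action_chunks()  # float tensor, shape (N, 24, 20)

with torch.no_grad():
    nsample = normalizer["action"].normalize(action_chunks).cuda()
    tokens = vae.encode(nsample)
    recon = normalizer["action"].unnormalize(vae.decode(tokens)).cpu()

# Per-dimension MSE shows which channels (position, rotation, gripper width)
# reconstruct poorly on your data.
mse_per_dim = ((recon - action_chunks) ** 2).mean(dim=(0, 1))
print("overall MSE:", mse_per_dim.mean().item())
print("per-dim MSE:", mse_per_dim.tolist())
```

If the error is substantially higher than on UMI-style data, consider fitting your own tokenizer with the APIs mentioned above rather than fine-tuning on poorly reconstructed tokens.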
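For intuition about the residual, level-by-level scheme described at the top of this card, here is a toy sketch in which uniform scalar grids stand in for learned vector codebooks. It is purely illustrative and is not the RVQ-AT architecture; the point is only that each level quantizes what the previous levels left unexplained, so error shrinks as levels are added.

```python
# Toy residual quantization: each level snaps the remaining residual to a
# finer uniform grid, mimicking how RVQ refines a code level by level.
import torch

x = torch.rand(5) * 2 - 1  # values in [-1, 1], like normalized actions
quantized = torch.zeros_like(x)
for level in range(4):
    step = 0.5 / (4 ** level)                         # grid gets 4x finer per level
    residual = x - quantized                          # what is still unexplained
    quantized += torch.round(residual / step) * step  # quantize the residual
    max_err = (x - quantized).abs().max().item()
    print(f"level {level}: max error = {max_err:.5f}")  # bounded by step / 2
```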
---

## Safety & Intended Use

RVQ-AT is a representation-learning component. **Do not** deploy decoded actions directly to hardware without:

* Proper sim-to-real validation,
* Safety bounds/clamping and rate limiters,
* Task-level monitors and e-stop fallbacks.

---

## Citation

If you use RVQ-AT in your work, please cite:

```bibtex
@software{rdt2,
  title={RDT2: Enabling Zero-Shot Cross-Embodiment Generalization by Scaling Up UMI Data},
  author={RDT Team},
  url={https://github.com/thu-ml/RDT2},
  month={September},
  year={2025}
}
```

---

## Contact

* Issues & requests: open a GitHub issue (see [here](https://github.com/thu-ml/RDT2/blob/main/CONTRIBUTING.md) for guidelines) or start a Hub discussion on the model page.

---

## License

This repository and the released models are licensed under **Apache-2.0**. You may use, modify, and distribute them, provided you **keep a copy of the original license and notices** in your distributions and **state significant changes** when you make them.