---
tags:
- RDT
- rdt
- tokenizer
- action
- discrete
- vector-quantization
- RDT 2
license: apache-2.0
pipeline_tag: robotics
---

# RVQ-AT: Residual VQ Action Tokenizer for RDT 2

**RVQ-AT** is a fast, compact **Residual Vector-Quantization** (RVQ) tokenizer for robot action streams. It converts continuous control trajectories into short sequences of **discrete action tokens** that plug directly into autoregressive VLA models.

Unlike single-codebook VQ, RVQ-AT stacks multiple small codebooks and quantizes **residuals** level by level. This yields:

* **Higher fidelity at the same bitrate** (lower reconstruction MSE / higher SNR)
* **Shorter token sequences** for the same time horizon
* **Stable training** via commitment loss, EMA codebook updates, and dead-code revival

Here, we provide:

1. **RVQ-AT** — a general-purpose tokenizer trained on diverse UMI manipulation data that also generalizes well to tele-operation data.
2. **Simple APIs to fit your own tokenizer** on custom action datasets.

---

## Using the Universal RVQ-AT Tokenizer

We recommend chunking actions into ~**0.8 s windows** at fps = 30 and normalizing each action dimension to **[-1, 1]** with this [normalizer](http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt) before tokenization. Batched encode and decode are supported.

```python
# Run inside the repository: https://github.com/thu-ml/RDT2
import torch

from models.normalizer import LinearNormalizer
from vqvae.models.multivqvae import MultiVQVAE

# Load from the Hub (replace with your repo id once published)
vae = MultiVQVAE.from_pretrained("outputs/vqvae_hf").cuda().eval()
normalizer = LinearNormalizer.load(
    ""  # Download from:
        # http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
)

# Load your RELATIVE action chunk
action_chunk = torch.zeros((1, 24, 20))
# action_chunk shape: (B, T, action_dim)
# - T = 24: predicts the future 0.8 s at fps = 30 → 24 frames
# - action_dim = 20: following the UMI setting (both arms, right then left)
#   - [0-2]:   RIGHT ARM end-effector position (x, y, z), unit: m
#   - [3-8]:   RIGHT ARM end-effector rotation (6D representation)
#   - [9]:     RIGHT ARM gripper width in [0, 0.088]; 0.088 means fully open
#   - [10-12]: LEFT ARM end-effector position (x, y, z), unit: m
#   - [13-18]: LEFT ARM end-effector rotation (6D representation)
#   - [19]:    LEFT ARM gripper width in [0, 0.088]; 0.088 means fully open

# Normalize actions to [-1, 1]
nsample = normalizer["action"].normalize(action_chunk).cuda()

# Encode → tokens
# tokens: torch.LongTensor with shape (B, num_valid_action_token)
# num_valid_action_token = 27, values in range [0, 1024)
tokens = vae.encode(nsample)

# Decode back to continuous actions
recon_nsample = vae.decode(tokens)
recon_action_chunk = normalizer["action"].unnormalize(recon_nsample)
```

---

## [IMPORTANT] Recommended Preprocessing

Although our Residual VQ demonstrates strong generalization across both hand-held gripper data and real robot data, if you plan to fine-tune on your own dataset we recommend first verifying that the statistics of your data fall within the bounds of our RVQ. Then evaluate the reconstruction error on your data before using the tokenizer for your own purposes, especially fine-tuning; a sketch of such a check follows.
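Below is a minimal sketch of such a check, reusing `vae` and `normalizer` from the snippet above. `load_your_action_chunks` is a hypothetical placeholder for your own data loading; it should return relative action chunks shaped `(N, 24, 20)` in the format described earlier.

```python
# Round-trip your own data through the tokenizer and measure reconstruction error.
# NOTE: `load_your_action_chunks` is a hypothetical stand-in for your own loader.
import torch

action_chunks = load_your_action_chunks()  # float tensor, shape (N, 24, 20)

with torch.no_grad():
    nsample = normalizer["action"].normalize(action_chunks).cuda()
    tokens = vae.encode(nsample)
    recon = normalizer["action"].unnormalize(vae.decode(tokens)).cpu()

# Per-dimension MSE shows which channels (position, rotation, gripper width)
# reconstruct poorly on your data.
mse_per_dim = ((recon - action_chunks) ** 2).mean(dim=(0, 1))
print("overall MSE:", mse_per_dim.mean().item())
print("per-dim MSE:", mse_per_dim.tolist())
```

If the error is substantially higher than on UMI-style data, consider fitting your own tokenizer with the APIs mentioned above rather than fine-tuning on poorly reconstructed tokens.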
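For intuition about the residual, level-by-level scheme described at the top of this card, here is a toy sketch in which uniform scalar grids stand in for learned vector codebooks. It is purely illustrative and is not the RVQ-AT architecture; the point is only that each level quantizes what the previous levels left unexplained, so error shrinks as levels are added.

```python
# Toy residual quantization: each level snaps the remaining residual to a
# finer uniform grid, mimicking how RVQ refines a code level by level.
import torch

x = torch.rand(5) * 2 - 1  # values in [-1, 1], like normalized actions
quantized = torch.zeros_like(x)
for level in range(4):
    step = 0.5 / (4 ** level)                         # grid gets 4x finer per level
    residual = x - quantized                          # what is still unexplained
    quantized += torch.round(residual / step) * step  # quantize the residual
    max_err = (x - quantized).abs().max().item()
    print(f"level {level}: max error = {max_err:.5f}")  # bounded by step / 2
```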
---

## Safety & Intended Use

RVQ-AT is a representation-learning component. **Do not** deploy decoded actions directly to hardware without:

* Proper sim-to-real validation,
* Safety bounds/clamping and rate limiters,
* Task-level monitors and e-stop fallbacks.

---

## Citation

If you use RVQ-AT in your work, please cite:

```bibtex
@software{rdt2,
  title={RDT2: Enabling Zero-Shot Cross-Embodiment Generalization by Scaling Up UMI Data},
  author={RDT Team},
  url={https://github.com/thu-ml/RDT2},
  month={September},
  year={2025}
}
```

---

## Contact

* Issues & requests: open a GitHub issue (see [here](https://github.com/thu-ml/RDT2/blob/main/CONTRIBUTING.md) for guidelines) or start a Hub discussion on the model page.

---

## License

This repository and the released models are licensed under **Apache-2.0**. You may use, modify, and distribute them, provided you **keep a copy of the original license and notices** in your distributions and **state significant changes** when you make them.