Commit 4acb070 · verified · 1 Parent(s): bd5525c

Update README.md

Files changed (1)
  1. README.md +123 -5
README.md CHANGED
@@ -1,9 +1,127 @@
  ---
  tags:
- - model_hub_mixin
- - pytorch_model_hub_mixin
  ---

- This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
- - Library: [More Information Needed]
- - Docs: [More Information Needed]
  ---
  tags:
+ - RDT
+ - rdt
+ - tokenizer
+ - action
+ - discrete
+ - vector-quantization
+ license: apache-2.0
+ pipeline_tag: robotics
  ---

+ # RVQ-AT: Residual VQ Action Tokenizer for RDT 2
+
+ **RVQ-AT** is a fast, compact **Residual Vector Quantization** (RVQ) tokenizer for robot action streams.
+ It converts continuous control trajectories into short sequences of **discrete action tokens** that plug directly into autoregressive VLA models.
+
+ Unlike single-codebook VQ, RVQ-AT stacks multiple small codebooks and quantizes **residuals** level by level. This yields:
+
+ * **Higher fidelity at the same bitrate** (lower reconstruction MSE / higher SNR)
+ * **Shorter token sequences** for the same time horizon
+ * **Stable training** via commitment loss, EMA codebook updates, and dead-code revival
+
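+ The mechanism is small enough to sketch. Below is a minimal NumPy illustration of the residual encode/decode loop with fixed random codebooks; the helper names (`rvq_encode`, `rvq_decode`), the number of levels, and the latent sizes are illustrative assumptions, not the released model, which additionally learns its codebooks with the commitment/EMA machinery listed above:
+
+ ```python
+ import numpy as np
+
+ rng = np.random.default_rng(0)
+
+ # Illustrative sizes: 4 quantization levels, 256 codes per level, 32-dim latents.
+ num_levels, codebook_size, dim = 4, 256, 32
+ codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(num_levels)]
+
+ def rvq_encode(z, codebooks):
+     """Quantize z [N, dim] level by level; each level encodes the residual
+     left over by the previous ones. Returns indices of shape [N, num_levels]."""
+     residual = z.copy()
+     indices = []
+     for cb in codebooks:
+         dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)  # [N, K]
+         idx = dists.argmin(axis=1)
+         indices.append(idx)
+         residual = residual - cb[idx]      # hand the leftover to the next level
+     return np.stack(indices, axis=1)
+
+ def rvq_decode(indices, codebooks):
+     """Sum the selected codewords across levels to rebuild the latent."""
+     return sum(cb[indices[:, lvl]] for lvl, cb in enumerate(codebooks))
+
+ z = rng.normal(size=(8, dim))
+ codes = rvq_encode(z, codebooks)           # (8, 4): four tokens per latent vector
+ z_hat = rvq_decode(codes, codebooks)
+ print(codes.shape, float(((z - z_hat) ** 2).mean()))
+ ```
+
+ With random codebooks the reconstruction error is of course large; the point is only the level-by-level residual bookkeeping that gives RVQ its fidelity/bitrate trade-off.
+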
+ Here, we provide:
+
+ 1. **RVQ-AT (Universal)** — a general-purpose tokenizer trained on diverse manipulation & navigation logs.
+ 2. **Simple APIs to fit your own tokenizer** on custom action datasets.
+
+ ---
+
+ ## Using the Universal RVQ-AT Tokenizer
+
+ We recommend chunking actions into ~**0.8 s windows** (24 timesteps at 30 fps) and normalizing each action dimension to **[-1, 1]** with the [normalizer](http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt) before tokenization. Batched encode/decode are supported.
+
+ ```python
+ import numpy as np
+ from transformers import AutoProcessor
+
+ # Load from the Hub (replace with your repo id once published)
+ proc = AutoProcessor.from_pretrained(
+     "your-org/residual-vq-action-tokenizer",  # e.g., "your-org/rvq-at-universal"
+     trust_remote_code=True
+ )
+
+ # Dummy batch: [batch, T, action_dim], concretely [batch_size, 24, 20]
+ action_data = np.random.uniform(-1, 1, size=(256, 24, 20)).astype("float32")
+
+ # Encode → tokens (List[List[int]] or np.ndarray[int])
+ tokens = proc(action_data)  # or proc.encode(action_data)
+
+ # Decode back to continuous actions.
+ # The processor caches (T, action_dim) on the first forward,
+ # or you can specify them explicitly:
+ recon = proc.decode(tokens, time_horizon=24, action_dim=20)
+ ```
+
+ **Notes**
+
+ * If your pipeline uses variable-length chunks, pass `time_horizon` per sample to `decode(...)` (see the snippet below).
+ * Special tokens (`pad`, `eos`, optional `chunk_sep`) are reserved and shouldn’t be used as code indices.
+
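+ For example, chunks with different horizons can be encoded and decoded one sample at a time, passing each chunk’s own `time_horizon`. This continues the example above (`proc`, `np`); the chunk lengths here are made up:
+
+ ```python
+ # Illustrative variable-length chunks of a 20-dim action space.
+ chunks = [np.random.uniform(-1, 1, size=(T, 20)).astype("float32") for T in (18, 24, 30)]
+ tokens = [proc(c[None]) for c in chunks]                    # encode one chunk at a time
+ recons = [proc.decode(t, time_horizon=c.shape[0], action_dim=20)
+           for t, c in zip(tokens, chunks)]
+ ```
+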
+ ---
+
+ ## Recommended Preprocessing
+
+ * **Chunking:** 0.5–1.0 s windows work well for 10–50 Hz logs.
+ * **Normalization:** per-dimension robust scaling to `[-1, 1]` (e.g., 1–99% quantiles); save the stats in `preprocessor_config.json`. A sketch follows this list.
+ * **Padding:** for variable `T`, pad to a small multiple of the stride; RVQ-AT masks padding internally.
+ * **Action spaces:** mixed spaces are supported (e.g., 7-DoF joints + gripper + base); concatenate them into a flat vector per timestep.
+
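+ A minimal sketch of that recipe in plain NumPy follows. The helper names (`fit_normalizer`, `normalize`, `chunk`), the quantile values, the window length, and the toy action layout are illustrative assumptions, not the released normalizer:
+
+ ```python
+ import numpy as np
+
+ def fit_normalizer(actions, lo_q=1.0, hi_q=99.0):
+     """Per-dimension robust range from the 1st/99th percentiles of the training logs."""
+     return np.percentile(actions, lo_q, axis=0), np.percentile(actions, hi_q, axis=0)
+
+ def normalize(actions, lo, hi, eps=1e-8):
+     """Map each action dimension to [-1, 1] and clip outliers."""
+     x = 2.0 * (actions - lo) / (hi - lo + eps) - 1.0
+     return np.clip(x, -1.0, 1.0)
+
+ def chunk(actions, window=24):
+     """Split a [T, A] trajectory into [window, A] chunks, zero-padding the tail."""
+     T, A = actions.shape
+     pad = (-T) % window
+     padded = np.concatenate([actions, np.zeros((pad, A), dtype=actions.dtype)])
+     return padded.reshape(-1, window, A)
+
+ # Toy mixed action space: 7-DoF arm + 1-D gripper + 2-D base, flattened per timestep.
+ traj = np.concatenate([np.random.randn(500, 7),
+                        np.random.rand(500, 1),
+                        np.random.randn(500, 2)], axis=1).astype("float32")
+ lo, hi = fit_normalizer(traj)
+ chunks = chunk(normalize(traj, lo, hi), window=24)   # -> (21, 24, 10)
+ ```
+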
+ ---
+
+ <!-- ## Performance (Universal Model)
+
+ *(Representative, measured on internal eval — replace with your numbers when available.)*
+
+ * **Compression:** 4 levels × 1 token/step → 4 tokens/step (often reduced further with temporal stride).
+ * **Reconstruction:** MSE ↓ 25–40% vs. single-codebook VQ at equal bitrate.
+ * **Latency:** <1 ms per 50×14 chunk on A100/PCIe; CPU-only real-time at 50 Hz feasible.
+ * **Downstream VLA:** +1–3% SR on long-horizon tasks vs. raw-action modeling.
+
+ ---
+ -->
+
+ ## Safety & Intended Use
+
+ RVQ-AT is a representation learning component. **Do not** deploy decoded actions directly to hardware without:
+
+ * Proper sim-to-real validation,
+ * Safety bounds/clamping and rate limiters,
+ * Task-level monitors and e-stop fallbacks.
+
+ ---
+
+ ## FAQ
+
+ **Q: How do I get back a `[T, A]` matrix at decode?**
+ A: RVQ-AT caches `(time_horizon, action_dim)` on first `__call__`/`encode`. You can also pass them explicitly to `decode(...)`.
+
+ **Q: Can I store shorter token sequences?**
+ A: Yes—enable `temporal_stride>1` to quantize a downsampled latent; the decoder upsamples.
+
+ **Q: How do I integrate with `transformers` trainers?**
+ A: Treat RVQ-AT output as a discrete vocabulary and feed tokens to your VLA LM. Keep special token ids consistent across datasets.
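+
+ One common pattern (an assumption about your training setup, not an API of this repo) is to reserve a contiguous block of ids after the text vocabulary and shift the action codes into it; `ACTION_OFFSET`, `to_lm_ids`, `from_lm_ids`, and the vocabulary sizes below are hypothetical:
+
+ ```python
+ # Hypothetical glue code: map RVQ-AT codes into an extended LM vocabulary.
+ action_vocab_size = 1024            # e.g., 4 levels × 256 codes
+ lm_vocab_size = 32000               # base vocabulary of your VLA language model
+
+ ACTION_OFFSET = lm_vocab_size       # action ids then live in [32000, 33024)
+
+ def to_lm_ids(action_tokens):
+     """Shift tokenizer codes so they don't collide with text/special ids."""
+     return [t + ACTION_OFFSET for t in action_tokens]
+
+ def from_lm_ids(lm_ids):
+     """Inverse mapping before calling proc.decode(...)."""
+     return [i - ACTION_OFFSET for i in lm_ids if i >= ACTION_OFFSET]
+ ```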
+
+ ---
+
+ ## Citation
+
+ If you use RVQ-AT in your work, please cite:
+
+ ```bibtex
+
+ ```
+
+ ---
+
+ ## Contact
+
+ * Maintainers: Your Name [[email protected]](mailto:[email protected])
+ * Issues & requests: open a GitHub issue or start a Hub discussion on the model page.
+
+ ---
+
+ ## License
+
+ This repository and the released models are licensed under **Apache-2.0**. You may use, modify, and distribute them, provided you **keep a copy of the original license and notices** in your distributions and **state significant changes** when you make them.