Commit 4acb070 · verified · 1 Parent(s): bd5525c

Update README.md

Files changed (1)
  1. README.md +123 -5
README.md CHANGED
@@ -1,9 +1,127 @@
  ---
  tags:
- - model_hub_mixin
- - pytorch_model_hub_mixin
  ---

- This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
- - Library: [More Information Needed]
- - Docs: [More Information Needed]
  ---
  tags:
+ - RDT
+ - rdt
+ - tokenizer
+ - action
+ - discrete
+ - vector-quantization
+ license: apache-2.0
+ pipeline_tag: robotics
  ---

+ # RVQ-AT: Residual VQ Action Tokenizer for RDT 2
+
+ **RVQ-AT** is a fast, compact **Residual Vector Quantization** (RVQ) tokenizer for robot action streams.
+ It converts continuous control trajectories into short sequences of **discrete action tokens** that plug directly into autoregressive VLA models.
+
+ Unlike single-codebook VQ, RVQ-AT stacks multiple small codebooks and quantizes **residuals** level by level. This yields:
+
+ * **Higher fidelity at the same bitrate** (lower reconstruction MSE / higher SNR)
+ * **Shorter token sequences** for the same time horizon
+ * **Stable training** via commitment loss, EMA codebook updates, and dead-code revival
+
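+ The mechanism is small enough to sketch. Below is a minimal NumPy illustration of the residual encode/decode loop with fixed random codebooks; the helper names (`rvq_encode`, `rvq_decode`), the number of levels, and the latent sizes are illustrative assumptions, not the released model, which additionally learns its codebooks with the commitment/EMA machinery listed above:
+
+ ```python
+ import numpy as np
+
+ rng = np.random.default_rng(0)
+
+ # Illustrative sizes: 4 quantization levels, 256 codes per level, 32-dim latents.
+ num_levels, codebook_size, dim = 4, 256, 32
+ codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(num_levels)]
+
+ def rvq_encode(z, codebooks):
+     """Quantize z [N, dim] level by level; each level encodes the residual
+     left over by the previous ones. Returns indices of shape [N, num_levels]."""
+     residual = z.copy()
+     indices = []
+     for cb in codebooks:
+         dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)  # [N, K]
+         idx = dists.argmin(axis=1)
+         indices.append(idx)
+         residual = residual - cb[idx]      # hand the leftover to the next level
+     return np.stack(indices, axis=1)
+
+ def rvq_decode(indices, codebooks):
+     """Sum the selected codewords across levels to rebuild the latent."""
+     return sum(cb[indices[:, lvl]] for lvl, cb in enumerate(codebooks))
+
+ z = rng.normal(size=(8, dim))
+ codes = rvq_encode(z, codebooks)           # (8, 4): four tokens per latent vector
+ z_hat = rvq_decode(codes, codebooks)
+ print(codes.shape, float(((z - z_hat) ** 2).mean()))
+ ```
+
+ With random codebooks the reconstruction error is of course large; the point is only the level-by-level residual bookkeeping that gives RVQ its fidelity/bitrate trade-off.
+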
+ Here, we provide:
+
+ 1. **RVQ-AT (Universal)** — a general-purpose tokenizer trained on diverse manipulation & navigation logs.
+ 2. **Simple APIs to fit your own tokenizer** on custom action datasets.
+
+ ---
+
+ ## Using the Universal RVQ-AT Tokenizer
+
+ We recommend chunking actions into ~**0.8 s windows** (24 timesteps at 30 fps) and normalizing each action dimension to **[-1, 1]** with the [normalizer](http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt) before tokenization. Batched encode/decode are supported.
+
+ ```python
+ import numpy as np
+ from transformers import AutoProcessor
+
+ # Load from the Hub (replace with your repo id once published)
+ proc = AutoProcessor.from_pretrained(
+     "your-org/residual-vq-action-tokenizer",  # e.g., "your-org/rvq-at-universal"
+     trust_remote_code=True
+ )
+
+ # Dummy batch: [batch, T, action_dim], concretely [batch_size, 24, 20]
+ action_data = np.random.uniform(-1, 1, size=(256, 24, 20)).astype("float32")
+
+ # Encode → tokens (List[List[int]] or np.ndarray[int])
+ tokens = proc(action_data)  # or proc.encode(action_data)
+
+ # Decode back to continuous actions.
+ # The processor caches (T, action_dim) on the first forward,
+ # or you can specify them explicitly:
+ recon = proc.decode(tokens, time_horizon=24, action_dim=20)
+ ```
+
+ **Notes**
+
+ * If your pipeline uses variable-length chunks, pass `time_horizon` per sample to `decode(...)` (see the snippet below).
+ * Special tokens (`pad`, `eos`, optional `chunk_sep`) are reserved and shouldn’t be used as code indices.
+
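+ For example, chunks with different horizons can be encoded and decoded one sample at a time, passing each chunk’s own `time_horizon`. This continues the example above (`proc`, `np`); the chunk lengths here are made up:
+
+ ```python
+ # Illustrative variable-length chunks of a 20-dim action space.
+ chunks = [np.random.uniform(-1, 1, size=(T, 20)).astype("float32") for T in (18, 24, 30)]
+ tokens = [proc(c[None]) for c in chunks]                    # encode one chunk at a time
+ recons = [proc.decode(t, time_horizon=c.shape[0], action_dim=20)
+           for t, c in zip(tokens, chunks)]
+ ```
+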
+ ---
+
+ ## Recommended Preprocessing
+
+ * **Chunking:** 0.5–1.0 s windows work well for 10–50 Hz logs.
+ * **Normalization:** per-dimension robust scaling to `[-1, 1]` (e.g., 1–99% quantiles); save the stats in `preprocessor_config.json`. A sketch follows this list.
+ * **Padding:** for variable `T`, pad to a small multiple of the stride; RVQ-AT masks padding internally.
+ * **Action spaces:** mixed spaces are supported (e.g., 7-DoF joints + gripper + base); concatenate them into a flat vector per timestep.
+
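+ A minimal sketch of that recipe in plain NumPy follows. The helper names (`fit_normalizer`, `normalize`, `chunk`), the quantile values, the window length, and the toy action layout are illustrative assumptions, not the released normalizer:
+
+ ```python
+ import numpy as np
+
+ def fit_normalizer(actions, lo_q=1.0, hi_q=99.0):
+     """Per-dimension robust range from the 1st/99th percentiles of the training logs."""
+     return np.percentile(actions, lo_q, axis=0), np.percentile(actions, hi_q, axis=0)
+
+ def normalize(actions, lo, hi, eps=1e-8):
+     """Map each action dimension to [-1, 1] and clip outliers."""
+     x = 2.0 * (actions - lo) / (hi - lo + eps) - 1.0
+     return np.clip(x, -1.0, 1.0)
+
+ def chunk(actions, window=24):
+     """Split a [T, A] trajectory into [window, A] chunks, zero-padding the tail."""
+     T, A = actions.shape
+     pad = (-T) % window
+     padded = np.concatenate([actions, np.zeros((pad, A), dtype=actions.dtype)])
+     return padded.reshape(-1, window, A)
+
+ # Toy mixed action space: 7-DoF arm + 1-D gripper + 2-D base, flattened per timestep.
+ traj = np.concatenate([np.random.randn(500, 7),
+                        np.random.rand(500, 1),
+                        np.random.randn(500, 2)], axis=1).astype("float32")
+ lo, hi = fit_normalizer(traj)
+ chunks = chunk(normalize(traj, lo, hi), window=24)   # -> (21, 24, 10)
+ ```
+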
+ ---
+
+ <!-- ## Performance (Universal Model)
+
+ *(Representative, measured on internal eval — replace with your numbers when available.)*
+
+ * **Compression:** 4 levels × 1 token/step → 4 tokens/step (often reduced further with temporal stride).
+ * **Reconstruction:** MSE ↓ 25–40% vs. single-codebook VQ at equal bitrate.
+ * **Latency:** <1 ms per 50×14 chunk on A100/PCIe; CPU-only real-time at 50 Hz feasible.
+ * **Downstream VLA:** +1–3% SR on long-horizon tasks vs. raw-action modeling.
+
+ ---
+ -->
+
+ ## Safety & Intended Use
+
+ RVQ-AT is a representation learning component. **Do not** deploy decoded actions directly to hardware without:
+
+ * Proper sim-to-real validation,
+ * Safety bounds/clamping and rate limiters,
+ * Task-level monitors and e-stop fallbacks.
+
+ ---
+
+ ## FAQ
+
+ **Q: How do I get back a `[T, A]` matrix at decode?**
+ A: RVQ-AT caches `(time_horizon, action_dim)` on first `__call__`/`encode`. You can also pass them explicitly to `decode(...)`.
+
+ **Q: Can I store shorter token sequences?**
+ A: Yes—enable `temporal_stride>1` to quantize a downsampled latent; the decoder upsamples.
+
+ **Q: How do I integrate with `transformers` trainers?**
+ A: Treat RVQ-AT output as a discrete vocabulary and feed tokens to your VLA LM. Keep special token ids consistent across datasets.
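+
+ One common pattern (an assumption about your training setup, not an API of this repo) is to reserve a contiguous block of ids after the text vocabulary and shift the action codes into it; `ACTION_OFFSET`, `to_lm_ids`, `from_lm_ids`, and the vocabulary sizes below are hypothetical:
+
+ ```python
+ # Hypothetical glue code: map RVQ-AT codes into an extended LM vocabulary.
+ action_vocab_size = 1024            # e.g., 4 levels × 256 codes
+ lm_vocab_size = 32000               # base vocabulary of your VLA language model
+
+ ACTION_OFFSET = lm_vocab_size       # action ids then live in [32000, 33024)
+
+ def to_lm_ids(action_tokens):
+     """Shift tokenizer codes so they don't collide with text/special ids."""
+     return [t + ACTION_OFFSET for t in action_tokens]
+
+ def from_lm_ids(lm_ids):
+     """Inverse mapping before calling proc.decode(...)."""
+     return [i - ACTION_OFFSET for i in lm_ids if i >= ACTION_OFFSET]
+ ```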
+
+ ---
+
+ ## Citation
+
+ If you use RVQ-AT in your work, please cite:
+
+ ```bibtex
+
+ ```
+
+ ---
+
+ ## Contact
+
+ * Maintainers: Your Name [[email protected]](mailto:[email protected])
+ * Issues & requests: open a GitHub issue or start a Hub discussion on the model page.
+
+ ---
+
+ ## License
+
+ This repository and the released models are licensed under **Apache-2.0**. You may use, modify, and distribute them, provided you **keep a copy of the original license and notices** in your distributions and **state significant changes** when you make them.