---
license: apache-2.0
base_model:
- microsoft/Florence-2-large
tags:
- robotics
- vla
pipeline_tag: robotics
---

# X-VLA 0.9B (Google-Robot Edition)

**Repository:** [2toINF/X-VLA](https://github.com/2toinf/X-VLA)
**Authors:** [2toINF](https://github.com/2toINF) | **License:** Apache 2.0
**Paper:** *Zheng et al., 2025, "X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model"* ([arXiv:2510.10274](https://arxiv.org/pdf/2510.10274))

## 🚀 Overview

Successful generalist **Vision-Language-Action (VLA)** models rely on effective training across diverse robotic platforms with large-scale, cross-embodiment, heterogeneous datasets. To leverage the heterogeneity of these rich robotic data sources, **X-VLA** introduces a **soft-prompt approach** with minimal added parameters: we infuse prompt-learning concepts into cross-embodiment robot learning, introducing **separate sets of learnable embeddings** for each distinct embodiment. These embodiment-specific prompts let VLA models exploit cross-embodiment features effectively.

Our architecture, a clean flow-matching-based VLA design that relies exclusively on soft-prompted standard Transformers, achieves superior scalability and simplicity. Trained on **Bridge Data** and evaluated across **six simulations** and **three real-world robots**, the 0.9B-parameter X-VLA simultaneously achieves **state-of-the-art performance** across diverse benchmarks, demonstrating flexible dexterity and fast adaptation across embodiments, environments, and tasks.

🌐 **Project Website:** [https://thu-air-dream.github.io/X-VLA/](https://thu-air-dream.github.io/X-VLA/)

## ⚙️ Usage

### 🔹 Load the model

```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "2toINF/X-VLA-WidowX",
    trust_remote_code=True
)
```

### 🔹 Start the FastAPI server

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("2toINF/X-VLA-WidowX", trust_remote_code=True)
model.run(processor, host="0.0.0.0", port=8000)
```

### 🔹 Client-server evaluation

You can run the provided evaluation client from our GitHub:
👉 [2toINF/X-VLA – Client & Server Code](https://github.com/2toINF/X-VLA)
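The official client handles observation formatting and the evaluation loop. If you only want to smoke-test a running server, a minimal HTTP request along the lines below may help. This is a sketch, not the documented API: the route name `/act`, the JSON field names `instruction` and `image`, and the base64 JPEG encoding are assumptions made for illustration, so consult the GitHub client for the authoritative request format.

```python
# Minimal sketch of querying the FastAPI action server started above.
# Assumptions (verify against the official client in 2toINF/X-VLA):
# the route ("/act"), the JSON field names, and the image encoding are
# placeholders, not the repository's documented API.
import base64
import io

import requests
from PIL import Image


def encode_image(path: str) -> str:
    """Load an image and return it as a base64-encoded JPEG string."""
    with Image.open(path) as img:
        buf = io.BytesIO()
        img.convert("RGB").save(buf, format="JPEG")
    return base64.b64encode(buf.getvalue()).decode("utf-8")


payload = {
    "instruction": "put the carrot on the plate",  # natural-language task
    "image": encode_image("observation.jpg"),      # current camera frame
}

# POST the observation and read back the predicted action chunk.
response = requests.post("http://localhost:8000/act", json=payload, timeout=30)
response.raise_for_status()
print(response.json())
```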
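For intuition on the embodiment-specific soft prompts described in the Overview (and listed in the `SoftPromptedTransformer` row of the Architecture table below), here is a minimal, illustrative PyTorch sketch. It is not the repository's implementation; the class name, shapes, and initialization are invented for the example and show only the core idea of keeping a separate learnable prompt per embodiment and prepending it to the token sequence.

```python
# Illustrative sketch of embodiment-specific soft prompts (not the repo's code).
# Each embodiment owns a small set of learnable prompt tokens that are
# prepended to the transformer's input sequence.
import torch
import torch.nn as nn


class SoftPromptBank(nn.Module):
    def __init__(self, num_embodiments: int, prompt_len: int, dim: int):
        super().__init__()
        # One (prompt_len, dim) learnable prompt per embodiment.
        self.prompts = nn.Parameter(torch.zeros(num_embodiments, prompt_len, dim))
        nn.init.normal_(self.prompts, std=0.02)

    def forward(self, tokens: torch.Tensor, embodiment_id: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim); embodiment_id: (batch,) integer ids
        prompt = self.prompts[embodiment_id]       # (batch, prompt_len, dim)
        return torch.cat([prompt, tokens], dim=1)  # prepend the soft prompts


# Example: a batch of two sequences from two different embodiments.
bank = SoftPromptBank(num_embodiments=4, prompt_len=8, dim=256)
tokens = torch.randn(2, 32, 256)
out = bank(tokens, torch.tensor([0, 3]))
print(out.shape)  # torch.Size([2, 40, 256])
```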
## 🧩 Architecture

| Component | Role |
| :--- | :--- |
| **Florence-2 Encoder** | Vision-language representation backbone (encoder-only). |
| **SoftPromptedTransformer** | Flow-matching action denoiser using learnable soft prompts per embodiment. |
| **Action Hub** | Defines action spaces, masking rules, pre/post-processing, and losses. |

## 🧠 Training Summary

| Setting | Value |
| :--- | :--- |
| Training Data | Bridge Data V2 |
| Parameters | ≈ 0.9 B |
| Action Mode | `ee6d` |
| Precision | BF16 |
| Framework | PyTorch + Transformers |

---

## 🪪 License

```
Copyright 2025 2toINF (https://github.com/2toINF)

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0
```

---

## 📚 Citation

```bibtex
@article{zheng2025x,
  title   = {X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model},
  author  = {Zheng, Jinliang and Li, Jianxiong and Wang, Zhihao and Liu, Dongxiu and Kang, Xirui and Feng, Yuchun and Zheng, Yinan and Zou, Jiayin and Chen, Yilun and Zeng, Jia and others},
  journal = {arXiv preprint arXiv:2510.10274},
  year    = {2025}
}
```

---

## 🌐 Links

- 📄 **Paper:** [arXiv 2510.10274](https://arxiv.org/abs/2510.10274)
- 💻 **Code & Client/Server:** [GitHub – 2toINF/X-VLA](https://github.com/2toINF/X-VLA)
- 🤖 **Model Hub:** [Hugging Face – 2toINF/X-VLA](https://huggingface.co/collections/2toINF/x-vla)