Pelican-VL 1.0: A Foundation Brain Model for Embodied Intelligence
Pelican-VL 1.0 Report | Hugging Face | ModelScope | Quick Start | Project Website | Evaluation
News
- 2025-11-13: The Pelican1.0-VL-7B and Pelican1.0-VL-72B model checkpoints have been released on Hugging Face and ModelScope.
- 2025-10-30: We have released the Pelican-VL 1.0 Report. The open-source 7B and 72B models are coming soon. For more details, please check our report!
Introduction
We present Pelican-VL 1.0, a new family of open-source embodied brain models with parameter scales ranging from 7B to 72B. Pelican-VL 1.0 is currently the largest-scale open-source embodied multimodal brain model. Its core advantage lies in the deep integration of large-scale embodied data with an adaptive, self-correcting learning mechanism.
Overview:
Highlights:
- Multimodal Understanding and Reasoning: Pelican-VL processes both visual and textual inputs, trained on massive datasets of images, videos, and cross-modal annotations. It not only recognizes objects accurately but also performs physical reasoning, spatial relationship understanding, and functional prediction based on scene context. For example, in closed environments such as kitchens or supermarkets, it can distinguish the placement of fruits and vegetables, locate counters, and plan picking or placing actions accordingly.
- Spatio-Temporal Cognition: The model's training includes tens of thousands of hours of video and dynamic scene question-answering, enabling it to understand continuous temporal sequences. When processing video frames, Pelican-VL captures object motion and the temporal order of actions, allowing it to make coherent inferences about complex sequential tasks, for instance determining "which item should be moved first before operating the next."
- Embodied Interaction Capabilities: In robotic tasks such as object grasping, navigation, and collaborative manipulation, Pelican-VL not only understands high-level task goals but also produces detailed action sequences along with feasibility assessments for each step. This means that upon receiving an instruction, the model can determine appropriate grasp points and devise corresponding manipulation strategies. Its multi-task proficiency spans grasping, navigation, and human-robot interaction, demonstrating strong cross-task generalization.
- Self-Correction and Iterative Learning: Through DPPO cyclic training, Pelican-VL exhibits a "self-correcting" capability. After each reinforcement learning cycle, the model automatically generates new challenging samples for retraining, similar to repeated practice and reflection. Over time, its weaknesses are gradually addressed and its abilities continuously improve. This process mirrors the concept of "deliberate practice," allowing Pelican-VL to advance iteratively and achieve performance on par with top-tier proprietary systems.
Performance
Overall Tasks
Performance comparison of Pelican-VL 1.0. (Left) Comparison against models with ≤100B parameters; the shaded (pink) region highlights the performance gain over our baseline. (Right) Comparison against models with ≥100B parameters, including leading open-source and proprietary models, where our model also demonstrates SOTA performance.
Detail Dimensions
Benchmark performance radar comparison of Pelican-VL 1.0 (72B) against other models across nine dimensions.
Downstream Applications
Please see our project website: pelican-vl.github.io
Open-Source Weights
We have released our Pelican models on Hugging Face and ModelScope:
| Model Name | Parameters | Hugging Face Checkpoint | ModelScope Checkpoint |
|---|---|---|---|
| Pelican1.0-VL-7B | 7B | Link | Link |
| Pelican1.0-VL-72B | 72B | Link | Link |
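For a quick, standalone smoke test of a released checkpoint, the sketch below uses the standard `transformers` plus `qwen_vl_utils` loading path. It is a minimal sketch that assumes the released weights follow the Qwen2.5-VL interface, and the repo ID shown is a placeholder; substitute the actual checkpoint from the table above.

```python
# Minimal inference sketch, assuming the released checkpoints follow the
# Qwen2.5-VL interface. "your-org/Pelican1.0-VL-7B" is a placeholder repo ID;
# replace it with the actual checkpoint from the table above.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "your-org/Pelican1.0-VL-7B"  # placeholder
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "path/to/scene.jpg"},  # local path or URL
        {"type": "text", "text": "Which object should be moved first to clear the table?"},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
out = out[:, inputs.input_ids.shape[1]:]  # drop the prompt tokens
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```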
Quick Start
Here we provide a simple LoRA fine-tuning script together with a few embodied data samples, so you can experiment with embodied data yourself. Training is based on Swift, an LLM training and deployment framework.
Installation
```bash
# pip installation
# pip install ms-swift -U
# Source code installation
git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e .
pip install qwen-vl-utils[decord]==0.0.11 # qwen-vl-utils 0.0.11, decord 0.6.0
pip install deepspeed==0.16.9 # distributed training
pip install wandb # wandb=0.21.0
pip install msgspec
pip install transformers==4.51.1
pip install flash-attn==2.6.1 --no-build-isolation # if GPU supports
```
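As a quick sanity check that the pinned versions resolved correctly, you can query the installed package metadata (the package names below match the pip commands above):

```python
# Sanity check of the installed versions (package names match the pip commands above).
from importlib.metadata import version

for pkg in ["ms-swift", "qwen-vl-utils", "deepspeed", "transformers", "flash-attn"]:
    try:
        print(f"{pkg}: {version(pkg)}")
    except Exception:
        print(f"{pkg}: not installed")
```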
LoRA Fine-Tuning
Dataset Source
All embodied data used in this demo are JSON files derived from public datasets on Hugging Face:
| Dataset Name | Type | Link |
|---|---|---|
| Cosmos Reasoning SFT Data | Video | Link |
| Robopoint GQA Data | Image | Link |
| VSI-Bench ScanNetpp Data | Video | Link |
Download the three datasets from the links above and place the downloaded files in a local directory (e.g., /datasets/xxx) that matches the paths in the JSON files (or modify the JSON paths to point to your local storage).
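If you prefer to fetch the datasets programmatically, a sketch along these lines works with `huggingface_hub`; the repo IDs below are placeholders (use the actual dataset repositories linked in the table above), and `local_dir` should match the paths referenced in the JSON files.

```python
# Download sketch using huggingface_hub. The repo IDs are placeholders;
# substitute the actual dataset repositories linked in the table above.
from huggingface_hub import snapshot_download

dataset_repos = [
    "some-org/cosmos-reasoning-sft",   # placeholder
    "some-org/robopoint-gqa",          # placeholder
    "some-org/vsi-bench-scannetpp",    # placeholder
]

for repo_id in dataset_repos:
    snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        local_dir=f"/datasets/{repo_id.split('/')[-1]}",  # keep consistent with the JSON paths
    )
```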
```bash
# Using an interactive command line for training.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
swift sft \
--model Qwen2.5-VL-7B-Instruct \
--dataset /datasets/robopoint_example_500.json \
/datasets/vsibench_example_500.json \
/datasets/cosmos_example_500.json \
--train_type lora \
--torch_dtype bfloat16 \
--num_train_epochs 2 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--learning_rate 5e-5 \
--lora_rank 8 \
--lora_alpha 32 \
--lora_dropout 0.1 \
--freeze_vit true \
--target_modules all-linear \
--gradient_accumulation_steps 16 \
--split_dataset_ratio 0.1 \
--data_seed 42 \
--eval_steps 200 \
--save_strategy epoch \
--logging_steps 1 \
--max_length 8192 \
--output_dir /xxx/output \
--warmup_ratio 0.05 \
--dataloader_num_workers 16 \
--save_only_model True \
--attn_impl flash_attn
```
After training is complete, use the following command to run inference with the trained weights:
- Here, `--adapters` should be replaced with the last checkpoint folder generated during training. Since the adapters folder contains the training parameter file `args.json`, there is no need to specify `--model` or `--system` separately; Swift will read these parameters automatically.
```bash
# Using an interactive command line for inference.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift infer \
--adapters /xxx/output/checkpoint-xxx \
--stream true \
--infer_backend pt \
--max_new_tokens 2048
```
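If you want to deploy the fine-tuned weights without keeping the adapter separate, one option is to merge the LoRA weights into the base model with `peft`, as sketched below. This assumes the adapter was trained on Qwen2.5-VL-7B-Instruct as in the command above; Swift also provides its own export and merge tooling, see its documentation.

```python
# Merge the trained LoRA adapter into the base model for standalone deployment
# (peft-based sketch; adjust the checkpoint path to your last training checkpoint).
import torch
from peft import PeftModel
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "/xxx/output/checkpoint-xxx")
merged = model.merge_and_unload()  # folds the LoRA deltas into the base weights

merged.save_pretrained("/xxx/output/merged")
AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct").save_pretrained("/xxx/output/merged")
```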
For more detailed parameters, please refer to the official Swift documentation.
Evaluation Reproduction
To facilitate faithful reproduction of our reported results, we summarize our official evaluation settings in Evaluation.md.
Contact Us
- Email: {vito.dai, jason.ju}@x-humanoid.com
Citation
If you find our Pelican-VL useful in your research, please cite:
```bibtex
@article{Pelican-VL-1.0,
title={Pelican-VL 1.0: A Foundation Brain Model for Embodied Intelligence},
author={Yi Zhang and Che Liu and Xiancong Ren and Hanchu Ni and Shuai Zhang and Zeyuan Ding and Jiayu Hu and Hanzhe Shan and Zhenwei Niu and Zhaoyang Liu and Yue Zhao and Junbo Qi and Qinfan Zhang and Dengjie Li and Yidong Wang and Jiachen Luo and Yong Dai and Jian Tang and Xiaozhu Ju},
journal={arXiv preprint arXiv:2511.00108},
year={2025}
}
```
Discussions
If you're interested in Pelican-VL, you are welcome to join our WeChat group for discussions.
