
SAIL-VL2


We are excited to introduce SAIL-VL2 🚀, a state-of-the-art vision-language model that outperforms existing models of comparable scale across a broad range of vision-language tasks.

🔥 Updates

  • 2025.09.18 🌟 The SAIL-VL2 Technical Report is now available on arXiv.

🌟 Highlights

  • SAIL-VL2 is powerful and efficient, achieving top results among models under 2B parameters.
  • SAIL-VL2-Thinking boosts complex reasoning, matching far larger models.
  • SAIL-VL2 excels at fine-grained visual tasks, surpassing models of similar scale.
Overall performance of SAIL-VL2.

Model Architecture:

| Architecture | ViT | LLM | Adapter | Token Merge | Resolution |
| --- | --- | --- | --- | --- | --- |
| 🤗SAIL-VL2-2B | 🤗SAILViT-Huge | 🤗Qwen3-1.7B | 2-layer MLP | 2x2 | 448x448xN |
| 🤗SAIL-VL2-8B | 🤗SAILViT-Huge | 🤗Qwen3-8B | 2-layer MLP | 2x2 | 448x448xN |
| 🤗SAIL-VL2-2B-Thinking | 🤗SAILViT-Huge | 🤗Qwen3-1.7B | 2-layer MLP | 2x2 | 448x448xN |
| 🤗SAIL-VL2-8B-Thinking | 🤗SAILViT-Huge | 🤗Qwen3-8B | 2-layer MLP | 2x2 | 448x448xN |
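
To make the "Token Merge 2x2" column concrete: merging each 2x2 window of ViT patch features before the adapter cuts the number of visual tokens by 4x. The sketch below is illustrative only; the module and dimension names are assumptions, not the released implementation. Assuming 14-pixel patches, one 448x448 tile yields 32x32 = 1024 patch tokens, reduced to 256 after the merge.

import torch
import torch.nn as nn

class MergeAdapter(nn.Module):
    # Illustrative 2x2 token merge followed by a 2-layer MLP adapter.
    def __init__(self, vit_dim, llm_dim, merge=2):
        super().__init__()
        self.merge = merge
        self.mlp = nn.Sequential(
            nn.Linear(vit_dim * merge * merge, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x):
        # x: (B, H*W, vit_dim) patch features from the ViT; H and W divisible by merge.
        B, N, C = x.shape
        H = W = int(N ** 0.5)
        m = self.merge
        # Group each m x m spatial window into a single token of dimension m*m*C.
        x = x.view(B, H // m, m, W // m, m, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, (H // m) * (W // m), m * m * C)
        return self.mlp(x)  # (B, N/4, llm_dim) for m=2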

🎬 Quick Start

import torch
from transformers import AutoTokenizer, AutoModel, AutoProcessor
from PIL import Image


model_path = "your model path"

# Load the tokenizer, processor, and model; trust_remote_code is required for the custom architecture.
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
device = torch.cuda.current_device()
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, torch_dtype=torch.bfloat16).to(device)

print("##### with images")
# Thinking-mode prompt: reasoning must go inside <think> </think>, the final answer inside \boxed{}.
cot_prompt = r"You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in \boxed{}."
image_path = "your image path"
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image_path},
        {"type": "text", "text": "describe the image" + cot_prompt},
    ]}
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

image = Image.open(image_path)
# The dtype cast only affects floating-point tensors (e.g. pixel values); token ids stay integer.
inputs = processor(images=image, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device).to(torch.bfloat16)

generated_ids = model.generate(**inputs, max_new_tokens=512)
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
# The model terminates turns with <|im_end|>; keep only the assistant's reply.
response = response.split('<|im_end|>')[0].strip()
print(response)


print("##### without images")
cot_prompt = r"You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in \boxed{}."
messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": "中国的首都是哪里?" + cot_prompt}]
    }
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(images=None, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device).to(torch.bfloat16)
generated_ids = model.generate(**inputs, max_new_tokens=512)
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
response = response.split('<|im_end|>')[0].strip()
print(response)
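
For thinking-mode outputs, it can be handy to separate the reasoning from the boxed answer. Below is a small hypothetical helper (not part of the released code) that assumes the <think> </think> and \boxed{} format requested by cot_prompt above; a \boxed{} payload containing nested braces would need a more careful parser.

import re

def parse_thinking_output(response):
    # Hypothetical helper: split a thinking-mode response into (reasoning, answer).
    think = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    boxed = re.search(r"\\boxed\{([^{}]*)\}", response)  # assumes no nested braces
    reasoning = think.group(1).strip() if think else ""
    answer = boxed.group(1).strip() if boxed else response.strip()
    return reasoning, answer

reasoning, answer = parse_thinking_output(response)
print("Final answer:", answer)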

👀 Introduction

  • SAIL-VL2 is powerful yet efficient: Trained on 776B tokens, SAIL-VL2 has been validated across 106 datasets, achieving state-of-the-art results on a broad spectrum of influential benchmarks at the sub-2B-parameter scale. Remarkably, even without specialized prompting, the base SAIL-VL2 model delivers highly competitive performance on challenging reasoning benchmarks such as MMMU and MathVista, demonstrating strong out-of-the-box capabilities.

  • SAIL-VL2 as a deep thinker: Many real-world tasks demand sophisticated reasoning and multi-step thought processes, which remain challenging for standard LVLMs. To address this, we develop SAIL-VL2-Thinking, a specialized variant trained with advanced Chain-of-Thought (CoT) and reinforcement learning (RL) strategies. This design substantially improves performance on complex reasoning benchmarks, often matching or even surpassing models with far larger parameter scales, thereby setting a new standard for efficient architectures in high-level reasoning.

  • SAIL-VL2 perceives with clarity: Fine-grained visual understanding is a critical challenge for multimodal models. SAIL-VL2 delivers high-fidelity perception in tasks such as OCR, high-resolution document layout analysis, and complex chart interpretation, achieving detailed visual grounding beyond models of similar scale.

Overview of the SAIL-VL2 framework. The architecture is composed of a vision encoder that aligns visual inputs into the representation space of the LLM. A lightweight adapter further transforms visual embeddings into tokenized representations, which are jointly processed with linguistic embeddings for multimodal reasoning and prediction. SAIL-VL2 accommodates multiple LLM backbones, ensuring flexibility and scalability across model configurations.
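
In code terms, the paragraph above corresponds roughly to the following schematic forward pass. Function and argument names are illustrative assumptions, not the released API, and the sketch assumes one image placeholder token per visual token.

import torch

@torch.no_grad()
def multimodal_forward(vision_encoder, adapter, llm, pixel_values, input_ids, image_token_id):
    patch_feats = vision_encoder(pixel_values)    # (B, N, vit_dim) patch features
    visual_tokens = adapter(patch_feats)          # (B, N/4, llm_dim) after merge + MLP
    text_embeds = llm.get_input_embeddings()(input_ids).clone()  # (B, T, llm_dim)
    # Splice visual tokens into the image placeholder positions of the text sequence.
    mask = input_ids == image_token_id
    text_embeds[mask] = visual_tokens.reshape(-1, visual_tokens.shape[-1]).to(text_embeds.dtype)
    # The LLM then processes the fused sequence for reasoning and prediction.
    return llm(inputs_embeds=text_embeds)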

📚 Training Strategy

🌟 Data Construction

Data construction pipeline for SAIL-VL2 training. High-quality multimodal corpora are constructed by curating and filtering open-source datasets and by generating synthetic data, with both components systematically organized to meet the requirements of different training stages.
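
As a toy illustration of the curation-and-filtering step, the scoring functions and thresholds below are hypothetical stand-ins for the pipeline's actual quality models.

def filter_corpus(samples, caption_quality, image_text_match, min_quality=0.7, min_match=0.8):
    # Hypothetical filtering pass: keep samples whose captions are informative
    # and which pass both a text-quality check and an image-text alignment check.
    kept = []
    for sample in samples:
        # Drop captions too short to be informative.
        if len(sample["text"].split()) < 3:
            continue
        if caption_quality(sample["text"]) >= min_quality and \
           image_text_match(sample["image"], sample["text"]) >= min_match:
            kept.append(sample)
    return kept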

🌟 Pre-Train

  • Basic Multimodal Pre-Training: develops SAIL-VL2’s multimodal alignment by connecting SAIL-ViT to the LLM through a randomly initialized MLP adapter, training on 64M samples with AdaLRS and a batch size of 2048 (a toy sketch of the loss-guided idea follows this list).

  • Multi-task Pre-Training: strengthens SAIL-VL2’s visual understanding and instruction-following abilities; all parameters are unfrozen, instruction-tuning data is added, 180M samples are used, and AdaLRS is skipped.
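
AdaLRS adjusts the learning rate online based on how the training loss evolves. The sketch below only illustrates the general loss-guided idea; the window logic and scaling factors are made up, and this is not the published AdaLRS algorithm.

def adjust_lr(optimizer, prev_window, cur_window, up=1.25, down=0.8):
    # Toy loss-guided LR rule: compare the mean loss drop per step across two
    # windows of recorded losses and scale the LR accordingly.
    prev_slope = (prev_window[0] - prev_window[-1]) / len(prev_window)
    cur_slope = (cur_window[0] - cur_window[-1]) / len(cur_window)
    factor = up if cur_slope > prev_slope else down
    for group in optimizer.param_groups:
        group["lr"] *= factor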

Scaling curves of SAIL-VL2-2B during the multi-task pre-training stage. Results are reported on overall benchmarks, natural-scene VQA datasets, and OCR VQA tasks. "BMK Score" denotes the average benchmark score.

🌟 Post-Train

  • Basic Supervised Fine-Tuning: four phases; Model Soup weight averaging merges homogeneous models (see the sketches after this list).

  • LongCoT Supervised Fine-Tuning: enhances the model’s step-by-step reasoning capabilities for complex problems.

  • RL with Verifiable Rewards: refines the model against a reward system with two primary objectives: correctness of the final answer and adherence to the specified output format (also sketched below).

  • Think-Fusion Supervised Fine-Tuning: enhances the model’s reasoning capabilities while maintaining its broad general understanding.

  • RL with a Mixed Reward System: a further RL stage that combines reward signals to strengthen the model’s reasoning capabilities.
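
Two of the stages above lend themselves to short sketches. Model Soup merging amounts to averaging the weights of homogeneous checkpoints; a minimal sketch, assuming identical architectures:

import torch

def model_soup(state_dicts):
    # Uniformly average parameters across checkpoints with matching keys and shapes.
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }

And a verifiable reward can be as simple as checking answer correctness plus format adherence. The weights and exact-match rule below are assumptions for illustration, not the precise reward used to train SAIL-VL2:

import re

def verifiable_reward(response, gold_answer):
    # Format check: reasoning in <think> </think> and an answer in \boxed{}.
    format_ok = bool(re.search(r"<think>.*?</think>", response, re.DOTALL)) and "\\boxed{" in response
    # Correctness check: exact match of the boxed answer against the reference.
    boxed = re.search(r"\\boxed\{([^{}]*)\}", response)
    correct = boxed is not None and boxed.group(1).strip() == gold_answer.strip()
    return 1.0 * correct + 0.5 * format_ok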

📈 Experimental Results

🌟 Performance of 2B series

Overall comparison of the SAIL-VL2 series and existing open-source MLLMs (<4B).

🌟 Performance of 8B series

Overall comparison of the SAIL-VL2 series with existing open-source 8B MLLMs and closed-source models.

🌟 Performance of Thinking-mode models

Evaluation results on OpenCompass multimodal reasoning benchmarks.

🙏 Acknowledgments

Our model is built upon numerous outstanding open-source projects, and we are grateful for their contributions. We extend special thanks to the InternVL, Qwen, and Apple teams for their great base models; to the BAAI team (Infinity-MM) and the MAmmoTH-VL team (MAmmoTH-VL-Instruction-12M) for their generous release of data; and to the OpenCompass team for their valuable benchmarks.

✒️ Citation

All contributors are listed in reverse alphabetical order by last-name initial and contributed equally.

If you find our work helpful for your research, please consider citing it.

@misc{yin2025sailvl2technicalreport,
  title={SAIL-VL2 Technical Report}, 
  author={Weijie Yin and Yongjie Ye and Fangxun Shu and Yue Liao and Zijian Kang and Hongyuan Dong and Haiyang Yu and Dingkang Yang and Jiacong Wang and Han Wang and Wenzhuo Liu and Xiao Liang and Shuicheng Yan and Chao Feng},
  year={2025},
  eprint={2509.14033},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.14033}, 
}
@article{dong2025scalable,
  title={Scalable vision language model training via high quality data curation},
  author={Dong, Hongyuan and Kang, Zijian and Yin, Weijie and Liang, Xiao and Feng, Chao and Ran, Jiao},
  journal={arXiv preprint arXiv:2501.05952},
  year={2025}
}

📜 License

This project is licensed under Apache License 2.0.

📧 Contact

If you have any questions, please feel free to contact us: [email protected]
