SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models

This model, VLAA-Thinker-Qwen2VL-7B, is a vision-language model fine-tuned on the VLAA-Thinking dataset. As described in the accompanying paper, it combines supervised fine-tuning (SFT) and reinforcement learning (RL) to improve the reasoning capabilities of large vision-language models (LVLMs). The model excels at multimodal reasoning tasks, achieving state-of-the-art performance on the OpenCompass Multimodal Reasoning Leaderboard as of April 7th, 2025.

🌐 Project Page • arXiv • 💻 Code

Both VLAA-Thinker-Qwen2.5-3B and VLAA-Thinker-Qwen2.5-7B achieve state-of-the-art performance on the OpenCompass Multimodal Reasoning Leaderboard as of April 7th, 2025.



Quick Start 🚀

Inference

Run `python inference.py`. Note that our model is trained with a system prompt; please make sure it is included at inference time.
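As a minimal sketch of what "include the system prompt" means, the snippet below assembles a conversation in the standard Qwen2-VL chat message schema. The `SYSTEM_PROMPT` string here is a placeholder, not the official prompt; use the one shipped with `inference.py`.

```python
# Placeholder system prompt -- substitute the prompt from inference.py.
SYSTEM_PROMPT = "You are a helpful assistant."


def build_messages(image_path: str, question: str) -> list[dict]:
    """Assemble a Qwen2-VL-style conversation that includes the system prompt.

    The resulting list can be passed to the processor's chat template
    (e.g. processor.apply_chat_template in Hugging Face Transformers).
    """
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question},
            ],
        },
    ]


messages = build_messages("demo.jpg", "What is shown in this image?")
```

Dropping the system message from `messages` is the easiest way to silently degrade outputs, since the model was trained with it.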

Dataset Download

Run `bash ./utils/download_dataset.sh`, specifying the dataset root as an absolute path. The dataset should be organized as follows:

├── VLAA-Thinking-SFT-126K.json
├── VLAA-Thinking-GRPO-25K.json
└── images
    ├── allava_laion
    ├── arxivqa
    ├── chartqa
    ├── clevr_math
    ├── coco
    │   └── train2017
    ├── docvqa
    ├── geoqa170k
    ├── synthesis
    ├── vg
    │   ├── VG_100K
    │   └── VG_100K_2
    └── vizwiz
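To sanity-check a download against the layout above, a small helper like the following can be used. It is illustrative, not part of the official utilities; the file and directory names are taken directly from the tree in this README.

```python
from pathlib import Path

# Names taken from the dataset tree in this README.
EXPECTED_FILES = ["VLAA-Thinking-SFT-126K.json", "VLAA-Thinking-GRPO-25K.json"]
EXPECTED_IMAGE_DIRS = [
    "allava_laion", "arxivqa", "chartqa", "clevr_math", "coco",
    "docvqa", "geoqa170k", "synthesis", "vg", "vizwiz",
]


def missing_entries(dataset_root: str) -> list[str]:
    """Return expected files/directories that are absent under dataset_root."""
    root = Path(dataset_root)
    missing = [f for f in EXPECTED_FILES if not (root / f).is_file()]
    missing += [d for d in EXPECTED_IMAGE_DIRS if not (root / "images" / d).is_dir()]
    return missing
```

An empty return value means the top-level layout matches; anything listed was not found and likely indicates an incomplete download.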

Training

Code coming soon!

