SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models
This model, VLAA-Thinker-Qwen2VL-7B, is a vision-language model fine-tuned on the VLAA-Thinking dataset. As described in the paper *SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models*, it leverages a combination of supervised fine-tuning (SFT) and reinforcement learning (RL) to improve reasoning capabilities in large vision-language models (LVLMs). The model excels at multimodal reasoning tasks, achieving state-of-the-art performance on the OpenCompass Multimodal Reasoning Leaderboard as of April 7, 2025.
Project Page • arXiv • Code
Both VLAA-Thinker-Qwen2.5-3B and VLAA-Thinker-Qwen2.5-7B achieve SOTA performance on the OpenCompass Multimodal Reasoning Leaderboard as of April 7, 2025.

Quick Start
Inference
Run `python inference.py`. Note that our model is trained with a system prompt; please ensure that it is included at inference time.
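For reference, below is a minimal sketch of how inference with the system prompt might look using the Hugging Face `transformers` Qwen2-VL classes. The repo id and the `SYSTEM_PROMPT` placeholder are assumptions, not part of this README: use the model id you actually downloaded and copy the exact system prompt shipped with `inference.py`.

```python
# Sketch only: MODEL_ID and SYSTEM_PROMPT below are placeholders, not verified values.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

MODEL_ID = "UCSC-VLAA/VLAA-Thinker-Qwen2VL-7B"      # assumed repo id; adjust as needed
SYSTEM_PROMPT = "<system prompt from inference.py>"  # placeholder; copy the real one

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Solve the problem shown in the image step by step."},
        ],
    },
]

# The chat template inserts the vision tokens for the image placeholder above.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image = Image.open("example.jpg")
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the prompt.
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```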
Dataset Download
Run `bash ./utils/download_dataset.sh`, specifying the dataset root as an absolute path. The downloaded data should be organized as follows:
```
├── VLAA-Thinking-SFT-126K.json
├── VLAA-Thinking-GRPO-25K.json
└── images
    ├── allava_laion
    ├── arxivqa
    ├── chartqa
    ├── clevr_math
    ├── coco
    │   └── train2017
    ├── docvqa
    ├── geoqa170k
    ├── synthesis
    ├── vg
    │   ├── VG_100K
    │   └── VG_100K_2
    └── vizwiz
```
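After the download finishes, a quick way to sanity-check the layout is to load one of the JSON files and inspect a record. This is a minimal sketch under the assumption that the root path below is whatever absolute path you passed to the script; the record field names are not documented here, so inspect the keys before relying on them.

```python
import json
from pathlib import Path

DATA_ROOT = Path("/abs/path/to/dataset_root")  # the absolute path given to the script

with open(DATA_ROOT / "VLAA-Thinking-SFT-126K.json") as f:
    sft_records = json.load(f)

print(len(sft_records))       # expected to be on the order of 126K entries
print(sft_records[0].keys())  # inspect the schema before assuming field names

# Image folders live under DATA_ROOT / "images"; if a record stores a relative
# image path, it would resolve as, e.g.:
# image_path = DATA_ROOT / "images" / sft_records[0]["image"]  # field name assumed
```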
Training
Code coming soon!