We-Math/V-Thinker

V-Thinker Logo

Paper Dataset License Python 3.10+ Multimodal Reasoning


If you like our project, please give us a star ⭐ on GitHub for the latest update.
πŸ€— V-Interaction-400K | πŸ€— VTBench | πŸ€— Models(V-Thinker) | πŸ€— V-Perception-40K

πŸ“£ Latest News

  • [Nov 26, 2025]: πŸ”„ We have re-released VTBench on Hugging Face. The updated LMM judge prompt is now available in the Hugging Face README and should be followed exactly for consistent evaluation. All evaluation settings, together with the updated judge framework, will be included in a revised arXiv version of the paper to be released later this week!

    (If you downloaded VTBench before Nov 19, please re-download the latest version. We apologize for the inconvenience and appreciate your understanding!)

  • [Nov 19, 2025]: πŸ› οΈ [FIXED] During an internal reproduction check (downloading the dataset from Hugging Face as a user would), we noticed that the VTBench subset for the Instruction-Guided Interaction task on HF corresponds to a non-final export and does not match the finalized version used in our experiments. We are re-verifying the full subset and, together with an upgraded judge component for evaluation (we are currently exploring a multi-level judging framework), will release the updated version after Nov 21 (the CVPR appendix deadline). πŸ™‡ We apologize for the inconvenience and appreciate your understanding. 🚧

  • [Nov 7, 2025]: πŸ”₯ We released V-Interaction-400K (preview version, containing 252K samples) β€” a large-scale, high-quality visual interaction dataset that can also be extended to image-to-code tasks.

  • [Nov 7, 2025]: πŸ”₯ We released V-Perception-40K (preview version, containing 37K samples) β€” a high-quality dataset for point-level perceptual alignment.

  • [Nov 7, 2025]: πŸ”₯ We released VTBench, a standardized benchmark for interactive visual reasoning across three task types β€” Perception, Instruction-Guided Interaction, and Interactive Reasoning.

  • [Nov 7, 2025]: πŸ“„ Our paper is now available on arXiv and as a Hugging Face daily paper.


πŸ”Ž Roadmap

πŸ› οΈ V-Thinker is still evolving!

V-Thinker is still under active development: there are known issues and plenty of room for improvement, and we will continue to update the project. We also sincerely welcome contributions to this open-source toolkit.

  • Release codebase and datasets (preview version 252K+37K).
  • Release V-Thinker-7B.
  • Release VTBench.
  • Release knowledge system and visual tool system.
  • Release the complete version of datasets (planned before December).
  • Release improved checkpoints.



πŸ’‘ Overview

V-Thinker is a general-purpose multimodal reasoning assistant that enables Interactive Thinking with Images through end-to-end reinforcement learning. Unlike traditional vision-language models, V-Thinker actively interacts with visual contentβ€”editing, annotating, and transforming images to simplify complex problems.

πŸ“‚ Datasets

Dataset Description Download
V-Interaction-400K Large-scale interactive reasoning dataset πŸ€— HuggingFace
V-Perception-40K Point-level perception alignment dataset πŸ€— HuggingFace
VTBench Expert-verified interactive benchmark πŸ€— HuggingFace
image

πŸ’‘ Rethinking the Data Synthesis Paradigm

We rethink the traditional data synthesis paradigm by transforming models from "solvers" to "creators", enabling them to directly generate high-quality multimodal reasoning data through code-level rendering and reasoning generation. Furthermore, by leveraging knowledge-driven representations, structured knowledge systems guide models to produce diverse, coherent, and spatially aligned problems, expanding the scope and evolution of reasoning data.
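To make "code-level rendering" concrete, here is a minimal, hypothetical sketch (not the actual V-Thinker synthesis pipeline): because the figure is produced by code, the ground-truth answer is known by construction, which is what lets a model act as a data "creator" rather than a "solver".

```python
import random

def synthesize_sample(seed: int) -> dict:
    """Toy 'model as creator' step: render a rectangle as SVG code and
    pair it with an automatically derived question/answer.
    (Illustrative only -- not the actual V-Thinker pipeline.)"""
    rng = random.Random(seed)
    w, h = rng.randint(2, 9), rng.randint(2, 9)
    # Code-level rendering: the figure exists as markup generated by code,
    # so the ground-truth answer is known by construction.
    svg = (f'<svg xmlns="http://www.w3.org/2000/svg" width="120" height="120">'
           f'<rect x="10" y="10" width="{w * 10}" height="{h * 10}" '
           f'fill="none" stroke="black"/></svg>')
    question = f"A rectangle is {w} units wide and {h} units tall. What is its area?"
    return {"image_svg": svg, "question": question, "answer": w * h}

sample = synthesize_sample(0)
```

Varying the seed (or, in a knowledge-driven setting, the seed concept) yields diverse but always self-consistent QA pairs.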

πŸ”„ Data Evolution Flywheel

Automated synthesis of high-quality interactive reasoning data across three dimensions:

  • Diversity: Knowledge-driven synthesis from seed concepts (We-Math 2.0), expanding to 25 domains and 24,767 nodes and enabling continuous evolution from data expansion to genuine data creation.
  • Quality: A coordinated checker–repairer mechanism ensures cross-modal consistency and high fidelity across textual, visual, and image-action dimensions.
  • Difficulty: A progressive expansion stage enriches the difficulty ladder through parallel and sequential extension strategies, supporting scalable reasoning complexity.
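The checker–repairer idea behind the Quality dimension can be sketched in a few lines. Everything below (the single consistency rule, the `check`/`repair` functions) is a hypothetical illustration of the pattern, not the project's actual mechanism:

```python
def check(sample: dict) -> list:
    """Toy cross-modal consistency check: the stated answer must match the
    area implied by the figure spec. (Hypothetical rule, for illustration.)"""
    w, h = sample["figure"]["width"], sample["figure"]["height"]
    issues = []
    if sample["answer"] != w * h:
        issues.append("answer inconsistent with rendered figure")
    return issues

def repair(sample: dict, issues: list) -> dict:
    """Toy repairer: recompute the answer from the figure, the trusted modality."""
    fixed = dict(sample)
    if "answer inconsistent with rendered figure" in issues:
        fixed["answer"] = sample["figure"]["width"] * sample["figure"]["height"]
    return fixed

def checker_repairer(sample: dict, max_rounds: int = 3):
    for _ in range(max_rounds):
        issues = check(sample)
        if not issues:
            return sample, True   # consistent: keep the sample
        sample = repair(sample, issues)
    return sample, False          # still inconsistent: discard

# An inconsistent sample gets repaired before it enters the dataset.
fixed, ok = checker_repairer({"figure": {"width": 4, "height": 3}, "answer": 11})
```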

πŸ“š Visual Progressive Training Curriculum

Two-stage framework progressively building perception and interactive reasoning:

Stage 1: Perception Alignment β†’ Fine-grained visual grounding with point-level supervision

Stage 2: Interactive Reasoning β†’ Cold-start SFT + RL in sandboxed code executor.
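Schematically, the Stage-2 loop alternates model code actions with executor observations. Below is a toy sketch that assumes nothing about V-Thinker's real executor; a production sandbox would add process isolation, timeouts, and import restrictions:

```python
import contextlib
import io

def run_in_sandbox(code: str) -> str:
    """Minimal stand-in for a sandboxed code executor: run a code action in
    an isolated namespace and capture its stdout as the observation."""
    namespace = {}
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, namespace)  # toy sandbox only; not safe for untrusted code
    except Exception as exc:
        return f"Error: {exc}"
    return buf.getvalue().strip()

def interactive_reasoning(actions):
    """One rollout: feed each model code action (e.g. cropping or annotating
    an image) to the executor and collect (action, observation) pairs."""
    trace = []
    for code in actions:
        trace.append((code, run_in_sandbox(code)))
    return trace

# Scripted stand-in for the policy's emitted code actions.
trace = interactive_reasoning(["print(2 + 3)"])
```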

πŸ“Š VTBench Benchmark

Expert-verified benchmark with 1,500 QA pairs across three hierarchical dimensions:

| Task | Specification |
| --- | --- |
| Perception | Visual grounding via coordinate prediction and rendering. |
| Instruction-Guided Interaction | Visual editing and manipulation from instructions. |
| Interactive Reasoning | Multimodal reasoning and answer generation. |
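As an illustration of coordinate-prediction grounding in the Perception task (not VTBench's official judge), a simple point-in-box accuracy can be computed like this:

```python
def point_in_box(point, box) -> bool:
    """Return True if a predicted (x, y) point falls inside a ground-truth
    box given as (x_min, y_min, x_max, y_max). A common grounding check,
    shown only as an illustration."""
    x, y = point
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1

def perception_accuracy(preds, boxes) -> float:
    """Fraction of predicted points that land inside their target boxes."""
    hits = sum(point_in_box(p, b) for p, b in zip(preds, boxes))
    return hits / len(boxes)

# One hit, one miss -> 50% accuracy.
acc = perception_accuracy([(5, 5), (50, 50)], [(0, 0, 10, 10), (0, 0, 10, 10)])
```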

πŸš€ Quick Start

Installation

conda create -n vthinker python=3.10
conda activate vthinker
pip install -e .

Usage Example: How to use V-Thinker

We provide a simple script (eval/vtbench_IR/inference.py) to run inference on custom cases. Simply run:

cd ./eval/vtbench_IR
python inference.py

Training

Download the perception dataset (V-Perception-40K), the SFT dataset (V-Interaction-400K), and the RL datasets (We-Math 2.0, MMK12, ThinkLite) to the data folder, and modify the image paths as needed to match your environment.

Please ensure you have modified the model and dataset paths in the script to match your environment.

# Perception Alignment
sh scripts/perception.sh
# Interactive Reasoning (SFT + RL).
sh scripts/sft.sh
sh scripts/rl.sh

Inference

Environment setup for eval

pip install --upgrade vllm

Download VTBench to the data folder and the corresponding images to the eval/vtbench_IR, eval/vtbench_IGI, and eval/vtbench_Perception folders.

Please ensure you have modified the model paths in the script to match your environment.

# Run on VTBench
cd eval/vtbench_IR
sh run.sh

Download MathVision, We-Math, and VisuLogic to the data folder and modify the image paths as needed to match your environment.

For VisuLogic, you also need to download its images to the eval/visulogic folder.

# Run on general benchmarks
cd eval/mathvision
python src/run_vthinker.py --benchmark mathvision --eval

πŸ† Experiments Results

Quantitative Results on VTBench

| Model | Perception | Instruction-Guided Interaction | Interactive Reasoning |
| --- | --- | --- | --- |
| GPT-4o | 2.3 | 3.7 | 38.3 |
| InternVL3-78B | 10.8 | 16.0 | 43.4 |
| Qwen2.5-VL-7B | 9.6 | 8.8 | 32.2 |
| V-Thinker-7B | 18.0 (+8.4) | 34.6 (+25.8) | 41.8 (+9.6) |
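The bracketed gains in the table are V-Thinker-7B's scores minus those of its Qwen2.5-VL-7B backbone. A quick per-column check:

```python
# Scores copied from the VTBench results table above.
qwen = {"Perception": 9.6, "Instruction-Guided": 8.8, "Interactive Reasoning": 32.2}
vthinker = {"Perception": 18.0, "Instruction-Guided": 34.6, "Interactive Reasoning": 41.8}

# Gains of V-Thinker-7B over its Qwen2.5-VL-7B backbone, per task.
gains = {task: round(vthinker[task] - qwen[task], 1) for task in qwen}
```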

Qualitative Results

Figures: Qualitative Analysis, Rollout Sampling, CoT, Evolved Knowledge System.

πŸ“„ Citation

@article{qiao2025v,
  title={V-Thinker: Interactive Thinking with Images},
  author={Qiao, Runqi and Tan, Qiuna and Yang, Minghan and Dong, Guanting and Yang, Peiqing and Lang, Shiqiang and Wan, Enhui and Wang, Xiaowan and Xu, Yida and Yang, Lan and others},
  journal={arXiv preprint arXiv:2511.04460},
  year={2025}
}

🀝 Acknowledgements

This training implementation builds upon Thyme and Swift, while our models are trained using Qwen2.5-VL. For evaluation, we rely on MathVision, We-Math, VisuLogic, and VLMEvalKit. For the GRPO-stage data, we sincerely thank We-Math 2.0, MM-Eureka, and ThinkLite for their open contributions. We express our sincere gratitude to these projects for their invaluable contributions to the open-source community.


πŸ“ž Contact

For any questions or feedback, please reach out to us at qrq@bupt.edu.cn or qiunatan@bupt.edu.cn.


πŸ“„ License

This project is released under the MIT License.
