- [Nov 26, 2026]: We have re-released VTBench on HuggingFace. The updated LMM judge prompt is now available in the HuggingFace README and should be followed exactly for consistent evaluation. All evaluation settings, together with the updated judge framework, will be included in a revised arXiv version of the paper later this week! (If you downloaded VTBench before Nov 19, please re-download the latest version. We apologize for the inconvenience and appreciate your understanding!)
- [Nov 19, 2026]: [FIXED] During an internal reproduction check (downloading the dataset from HuggingFace as a user would), we noticed that the VTBench subset for the Instruction-Guided Interaction task on HF corresponds to a non-final export and does not match the finalized version used in our experiments. We are re-verifying the full subset and, together with an upgraded judge component for the evaluation (we are currently exploring a multi-level judging framework), will release the updated version after Nov 21 (the CVPR appendix deadline). We apologize for the inconvenience and appreciate your understanding.
- [Nov 7, 2026]: We released V-Interaction-400K (preview version, 252K samples), a large-scale, high-quality visual interaction dataset that can also be extended to image-to-code tasks.
- [Nov 7, 2026]: We released V-Perception-40K (preview version, 37K samples), a high-quality dataset for point-level perceptual alignment.
- [Nov 7, 2026]: We released VTBench, a standardized benchmark for interactive visual reasoning across three task types: Perception, Instruction-Guided Interaction, and Interactive Reasoning.
- [Nov 7, 2026]: Our paper is now available on arXiv and as a Hugging Face daily paper.
V-Thinker is still evolving!
V-Thinker is under active development; there are known issues and plenty of room for improvement. We will continue to update it, and we sincerely welcome contributions to this open-source toolkit.
- Release codebase and datasets (preview version 252K+37K).
- Release V-Thinker-7B.
- Release VTBench.
- Release knowledge system and visual tool system.
- Release the complete version of datasets (planned before December).
- Release improved checkpoints.
Note
Quick navigation guide for exploring V-Thinker
V-Thinker is a general-purpose multimodal reasoning assistant that enables Interactive Thinking with Images through end-to-end reinforcement learning. Unlike traditional vision-language models, V-Thinker actively interacts with visual content: editing, annotating, and transforming images to simplify complex problems.
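The interact-then-reason loop described above can be sketched in a few lines. This is a minimal, illustrative sketch only: the list-based "image", the `crop` action, and the `model_step` stub are invented placeholders, not the actual V-Thinker API.

```python
# Toy sketch of "interactive thinking with images": the policy alternates
# between emitting a visual action (here, a crop) and a final answer.

def crop(image, top, left, height, width):
    """Return a sub-region of a 2D list-based image."""
    return [row[left:left + width] for row in image[top:top + height]]

def model_step(image, question, history):
    """Placeholder policy: zoom into the top-left quadrant once, then answer."""
    if not history:  # first turn: request an image edit
        h, w = len(image) // 2, len(image[0]) // 2
        return {"action": "crop", "args": (0, 0, h, w)}
    return {"action": "answer", "args": sum(sum(row) for row in image)}

def interactive_reasoning(image, question, max_turns=4):
    history = []
    for _ in range(max_turns):
        step = model_step(image, question, history)
        if step["action"] == "answer":
            return step["args"]
        image = crop(image, *step["args"])  # execute the visual action
        history.append(step)
    return None  # no answer within the turn budget

image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(interactive_reasoning(image, "sum of the zoomed region"))
```

The real model emits executable image-editing code rather than a fixed action schema, but the control flow is the same: act on the image, observe the result, then reason.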
| Dataset | Description | Download |
|---|---|---|
| V-Interaction-400K | Large-scale interactive reasoning dataset | 🤗 HuggingFace |
| V-Perception-40K | Point-level perception alignment dataset | 🤗 HuggingFace |
| VTBench | Expert-verified interactive benchmark | 🤗 HuggingFace |
We rethink the traditional data synthesis paradigm by transforming models from "solvers" to "creators", enabling them to directly generate high-quality multimodal reasoning data through code-level rendering and reasoning generation. Furthermore, by leveraging knowledge-driven representations, structured knowledge systems guide models to produce diverse, coherent, and spatially aligned problems, expanding the scope and evolution of reasoning data.
Automated synthesis of high-quality interactive reasoning data across three dimensions:
- Diversity: Knowledge-driven synthesis from seed concepts (We-Math2.0) expanding to 25 domains and 24,767 nodes, enabling continuous evolution from data expansion to genuine data creation.
- Quality: A coordinated checker–repairer mechanism ensures cross-modal consistency and high fidelity across textual, visual, and image-action dimensions.
- Difficulty: A progressive expansion stage enriches the difficulty ladder through parallel and sequential extension strategies, supporting scalable reasoning complexity.
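The checker–repairer mechanism above can be sketched as a simple verify-and-patch loop. The concrete checks and repair rules below are invented for illustration; in the actual pipeline both roles are model-based.

```python
# Toy checker-repairer loop: a checker flags inconsistencies in a
# synthesized sample and a repairer patches them until the sample
# passes, or a retry budget is exhausted.

def check(sample):
    """Return a list of consistency issues (empty means the sample passes)."""
    issues = []
    if sample["answer"] not in sample["question"]:
        issues.append("answer_not_grounded")
    if not sample.get("image_code"):
        issues.append("missing_image_code")
    return issues

def repair(sample, issues):
    """Patch the sample according to the reported issues."""
    fixed = dict(sample)
    if "missing_image_code" in issues:
        fixed["image_code"] = "draw_triangle(a=3, b=4)"  # hypothetical renderer call
    if "answer_not_grounded" in issues:
        fixed["question"] += f" (answer: {fixed['answer']})"
    return fixed

def synthesize(sample, max_rounds=3):
    for _ in range(max_rounds):
        issues = check(sample)
        if not issues:
            return sample
        sample = repair(sample, issues)
    return None  # discard samples that cannot be repaired

sample = {"question": "What is the hypotenuse?", "answer": "5", "image_code": ""}
print(synthesize(sample))
```

Bounding the number of repair rounds keeps the pipeline from looping on unrepairable samples; those are simply discarded.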
Two-stage framework progressively building perception and interactive reasoning:
Stage 1: Perception Alignment – fine-grained visual grounding with point-level supervision.
Stage 2: Interactive Reasoning – cold-start SFT followed by RL in a sandboxed code executor.
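During RL, model-emitted image-manipulation code has to run in isolation so that crashes or hangs in generated code cannot take down the training loop. A minimal sketch of such a sandbox using a subprocess with a wall-clock timeout (illustrative only; a production executor would add stricter resource and import restrictions):

```python
import subprocess
import sys

def run_sandboxed(code: str, timeout: float = 5.0) -> str:
    """Execute generated code in a separate Python process with a timeout,
    capturing stdout so the result can be fed back to the model."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "[sandbox] timeout"
    if result.returncode != 0:
        return f"[sandbox] error: {result.stderr.strip()}"
    return result.stdout.strip()

print(run_sandboxed("print(2 + 3)"))                              # well-behaved code
print(run_sandboxed("import time; time.sleep(60)", timeout=1.0))  # hanging code
```

Returning the error or timeout message as a string, instead of raising, lets the RL loop treat failed executions as ordinary (negatively rewarded) observations.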
Expert-verified benchmark with 1,500 QA pairs across three hierarchical dimensions:
| Task | Specification |
|---|---|
| Perception | Visual grounding via coordinate prediction and rendering. |
| Instruction-Guided Interaction | Visual editing and manipulation from instructions. |
| Interactive Reasoning | Multimodal reasoning and answer generation. |
```bash
conda create -n vthinker python=3.10
conda activate vthinker
pip install -e .
```
We provide a simple script (`eval/vtbench_IR/inference.py`) to run inference on custom cases. Simply run:
```bash
cd ./eval/vtbench_IR
python inference.py
```
Download the perception dataset (V-Perception-40K), the SFT dataset (V-Interaction-400K), and the RL datasets (WeMath 2.0, MMK12, ThinkLite) to the `data` folder, and modify the image paths as needed to match your coding environment.
Please ensure you have modified the model and dataset paths in the script to match your environment.
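One convenient way to adapt the image paths is to rewrite the path prefix directly in the annotation files. A hedged sketch follows: the JSONL layout and the `image` field name are assumptions, so adapt them to the actual dataset files.

```python
import json

def rewrite_image_paths(in_path, out_path, old_prefix, new_prefix):
    """Rewrite the prefix of each sample's image path in a JSONL file."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            sample = json.loads(line)
            if sample.get("image", "").startswith(old_prefix):
                sample["image"] = new_prefix + sample["image"][len(old_prefix):]
            fout.write(json.dumps(sample) + "\n")

# Hypothetical usage, pointing annotations at a local data folder:
# rewrite_image_paths("data/v_perception_40k.jsonl",
#                     "data/v_perception_40k.local.jsonl",
#                     "/remote/images/", "./data/images/")
```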
```bash
# Perception Alignment
sh scripts/perception.sh
```
```bash
# Interactive Reasoning (SFT + RL)
sh scripts/sft.sh
sh scripts/rl.sh
```
Environment setup for eval:
```bash
pip install --upgrade vllm
```
Download VTBench to the `data` folder and the corresponding images to the `eval/vtbench_IR`, `eval/vtbench_IGI`, and `eval/vtbench_Perception` folders.
Please ensure you have modified the model paths in the script to match your environment.
```bash
# Run on VTBench
cd eval/vtbench_IR
sh run.sh
```
Download MathVision, WeMath, and VisuLogic to the `data` folder and modify the image paths as needed to match your coding environment.
For VisuLogic, you also need to download the corresponding VisuLogic images to the `eval/visulogic` folder.
```bash
# Run on general benchmarks
cd eval/mathvision
python src/run_vthinker.py --benchmark mathvision --eval
```
| Model | Perception | Instruction-Guided | Interactive Reasoning |
|---|---|---|---|
| GPT-4o | 2.3 | 3.7 | 38.3 |
| InternVL3-78B | 10.8 | 16.0 | 43.4 |
| Qwen2.5-VL-7B | 9.6 | 8.8 | 32.2 |
| V-Thinker-7B | 18.0 (+8.4) | 34.6 (+25.8) | 41.8 (+9.6) |
@article{qiao2025v,
title={V-Thinker: Interactive Thinking with Images},
author={Qiao, Runqi and Tan, Qiuna and Yang, Minghan and Dong, Guanting and Yang, Peiqing and Lang, Shiqiang and Wan, Enhui and Wang, Xiaowan and Xu, Yida and Yang, Lan and others},
journal={arXiv preprint arXiv:2511.04460},
year={2025}
}
This training implementation builds upon Thyme and Swift, while our models are trained using Qwen2.5-VL. For evaluation, we rely on MathVision, We-Math, VisuLogic, and VLMEvalKit. For the GRPO-stage data, we sincerely thank We-Math 2.0, MM-Eureka, and ThinkLite for their open contributions. We express our sincere gratitude to these projects for their invaluable contributions to the open-source community.
For any questions or feedback, please reach out to us at qrq@bupt.edu.cn or qiunatan@bupt.edu.cn.
This project is released under the MIT License.






