- [Nov 26, 2026]: We have re-released VTBench on HuggingFace. The updated LMM judge prompt is now available in the HuggingFace README and should be followed exactly for consistent evaluation. All evaluation settings, together with the updated judge framework, will be included in a revised arXiv version of the paper later this week! (If you downloaded VTBench before Nov 19, please re-download the latest version. We apologize for the inconvenience and appreciate your understanding!)
- [Nov 19, 2026]: [FIXED] During an internal reproduction check (downloading the dataset from HuggingFace as a user would), we noticed that the VTBench subset for the Instruction-Guided Interaction task on HF corresponds to a non-final export and does not match the finalized version used in our experiments. We are re-verifying the full subset and, together with an upgraded judge component for the evaluation (we are currently exploring a multi-level judging framework), will release the updated version after Nov 21 (the CVPR appendix deadline). We apologize for the inconvenience and appreciate your understanding.
- [Nov 7, 2026]: We released V-Interaction-400K (preview version, 252K samples), a large-scale, high-quality visual interaction dataset that can also be extended to image-to-code tasks.
- [Nov 7, 2026]: We released V-Perception-40K (preview version, 37K samples), a high-quality dataset for point-level perceptual alignment.
- [Nov 7, 2026]: We released VTBench, a standardized benchmark for interactive visual reasoning across three task types: Perception, Instruction-Guided Interaction, and Interactive Reasoning.
- [Nov 7, 2026]: Our paper is now available on arXiv and as a Hugging Face daily paper.
V-Thinker is still evolving!
V-Thinker is under active development; there are known issues and plenty of room for improvement. We will continue to update it, and we sincerely welcome contributions to this open-source toolkit.
- Release codebase and datasets (preview version 252K+37K).
- Release V-Thinker-7B.
- Release VTBench.
- Release knowledge system and visual tool system.
- Release the complete version of datasets (planned before December).
- Release improved checkpoints.
Note
Quick navigation guide for exploring V-Thinker
V-Thinker is a general-purpose multimodal reasoning assistant that enables Interactive Thinking with Images through end-to-end reinforcement learning. Unlike traditional vision-language models, V-Thinker actively interacts with visual content: editing, annotating, and transforming images to simplify complex problems.
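The interact-then-reason loop described above can be sketched in a few lines. This is a minimal, illustrative sketch only: the list-based "image", the `crop` action, and the `model_step` stub are invented placeholders, not the actual V-Thinker API.

```python
# Toy sketch of "interactive thinking with images": the policy alternates
# between emitting a visual action (here, a crop) and a final answer.

def crop(image, top, left, height, width):
    """Return a sub-region of a 2D list-based image."""
    return [row[left:left + width] for row in image[top:top + height]]

def model_step(image, question, history):
    """Placeholder policy: zoom into the top-left quadrant once, then answer."""
    if not history:  # first turn: request an image edit
        h, w = len(image) // 2, len(image[0]) // 2
        return {"action": "crop", "args": (0, 0, h, w)}
    return {"action": "answer", "args": sum(sum(row) for row in image)}

def interactive_reasoning(image, question, max_turns=4):
    history = []
    for _ in range(max_turns):
        step = model_step(image, question, history)
        if step["action"] == "answer":
            return step["args"]
        image = crop(image, *step["args"])  # execute the visual action
        history.append(step)
    return None  # no answer within the turn budget

image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(interactive_reasoning(image, "sum of the zoomed region"))
```

The real model emits executable image-editing code rather than a fixed action schema, but the control flow is the same: act on the image, observe the result, then reason.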
| Dataset | Description | Download |
|---|---|---|
| V-Interaction-400K | Large-scale interactive reasoning dataset | 🤗 HuggingFace |
| V-Perception-40K | Point-level perception alignment dataset | 🤗 HuggingFace |
| VTBench | Expert-verified interactive benchmark | 🤗 HuggingFace |
We rethink the traditional data synthesis paradigm by transforming models from "solvers" to "creators", enabling them to directly generate high-quality multimodal reasoning data through code-level rendering and reasoning generation. Furthermore, by leveraging knowledge-driven representations, structured knowledge systems guide models to produce diverse, coherent, and spatially aligned problems, expanding the scope and evolution of reasoning data.
Automated synthesis of high-quality interactive reasoning data across three dimensions:
- Diversity: Knowledge-driven synthesis from seed concepts (We-Math2.0) expanding to 25 domains and 24,767 nodes, enabling continuous evolution from data expansion to genuine data creation.
- Quality: A coordinated checker–repairer mechanism ensures cross-modal consistency and high fidelity across textual, visual, and image-action dimensions.
- Difficulty: A progressive expansion stage enriches the difficulty ladder through parallel and sequential extension strategies, supporting scalable reasoning complexity.
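The checker–repairer mechanism above can be sketched as a simple verify-and-patch loop. The concrete checks and repair rules below are invented for illustration; in the actual pipeline both roles are model-based.

```python
# Toy checker-repairer loop: a checker flags inconsistencies in a
# synthesized sample and a repairer patches them until the sample
# passes, or a retry budget is exhausted.

def check(sample):
    """Return a list of consistency issues (empty means the sample passes)."""
    issues = []
    if sample["answer"] not in sample["question"]:
        issues.append("answer_not_grounded")
    if not sample.get("image_code"):
        issues.append("missing_image_code")
    return issues

def repair(sample, issues):
    """Patch the sample according to the reported issues."""
    fixed = dict(sample)
    if "missing_image_code" in issues:
        fixed["image_code"] = "draw_triangle(a=3, b=4)"  # hypothetical renderer call
    if "answer_not_grounded" in issues:
        fixed["question"] += f" (answer: {fixed['answer']})"
    return fixed

def synthesize(sample, max_rounds=3):
    for _ in range(max_rounds):
        issues = check(sample)
        if not issues:
            return sample
        sample = repair(sample, issues)
    return None  # discard samples that cannot be repaired

sample = {"question": "What is the hypotenuse?", "answer": "5", "image_code": ""}
print(synthesize(sample))
```

Bounding the number of repair rounds keeps the pipeline from looping on unrepairable samples; those are simply discarded.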
Two-stage framework progressively building perception and interactive reasoning:
Stage 1: Perception Alignment – fine-grained visual grounding with point-level supervision.
Stage 2: Interactive Reasoning – cold-start SFT followed by RL in a sandboxed code executor.
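During RL, model-emitted image-manipulation code has to run in isolation so that crashes or hangs in generated code cannot take down the training loop. A minimal sketch of such a sandbox using a subprocess with a wall-clock timeout (illustrative only; a production executor would add stricter resource and import restrictions):

```python
import subprocess
import sys

def run_sandboxed(code: str, timeout: float = 5.0) -> str:
    """Execute generated code in a separate Python process with a timeout,
    capturing stdout so the result can be fed back to the model."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "[sandbox] timeout"
    if result.returncode != 0:
        return f"[sandbox] error: {result.stderr.strip()}"
    return result.stdout.strip()

print(run_sandboxed("print(2 + 3)"))                              # well-behaved code
print(run_sandboxed("import time; time.sleep(60)", timeout=1.0))  # hanging code
```

Returning the error or timeout message as a string, instead of raising, lets the RL loop treat failed executions as ordinary (negatively rewarded) observations.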
Expert-verified benchmark with 1,500 QA pairs across three hierarchical dimensions:
| Task | Specification |
|---|---|
| Perception | Visual grounding via coordinate prediction and rendering. |
| Instruction-Guided Interaction | Visual editing and manipulation from instructions. |
| Interactive Reasoning | Multimodal reasoning and answer generation. |
```bash
conda create -n vthinker python=3.10
conda activate vthinker
pip install -e .
```
We provide a simple script (`eval/vtbench_IR/inference.py`) to run inference on custom cases. Simply run:
```bash
cd ./eval/vtbench_IR
python inference.py
```
Download the perception dataset (V-Perception-40K), the SFT dataset (V-Interaction-400K), and the RL datasets (WeMath 2.0, MMK12, ThinkLite) to the `data` folder, and modify the image paths as needed to match your coding environment.
Please ensure you have modified the model and dataset paths in the script to match your environment.
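One convenient way to adapt the image paths is to rewrite the path prefix directly in the annotation files. A hedged sketch follows: the JSONL layout and the `image` field name are assumptions, so adapt them to the actual dataset files.

```python
import json

def rewrite_image_paths(in_path, out_path, old_prefix, new_prefix):
    """Rewrite the prefix of each sample's image path in a JSONL file."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            sample = json.loads(line)
            if sample.get("image", "").startswith(old_prefix):
                sample["image"] = new_prefix + sample["image"][len(old_prefix):]
            fout.write(json.dumps(sample) + "\n")

# Hypothetical usage, pointing annotations at a local data folder:
# rewrite_image_paths("data/v_perception_40k.jsonl",
#                     "data/v_perception_40k.local.jsonl",
#                     "/remote/images/", "./data/images/")
```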
```bash
# Perception Alignment
sh scripts/perception.sh
```
```bash
# Interactive Reasoning (SFT + RL)
sh scripts/sft.sh
sh scripts/rl.sh
```
Environment setup for eval:
```bash
pip install --upgrade vllm
```
Download VTBench to the `data` folder and the corresponding images to the `eval/vtbench_IR`, `eval/vtbench_IGI`, and `eval/vtbench_Perception` folders.
Please ensure you have modified the model paths in the script to match your environment.
```bash
# Run on VTBench
cd eval/vtbench_IR
sh run.sh
```
Download MathVision, WeMath, and VisuLogic to the `data` folder and modify the image paths as needed to match your coding environment.
For VisuLogic, you also need to download the corresponding VisuLogic images to the `eval/visulogic` folder.
```bash
# Run on general benchmarks
cd eval/mathvision
python src/run_vthinker.py --benchmark mathvision --eval
```
| Model | Perception | Instruction-Guided | Interactive Reasoning |
|---|---|---|---|
| GPT-4o | 2.3 | 3.7 | 38.3 |
| InternVL3-78B | 10.8 | 16.0 | 43.4 |
| Qwen2.5-VL-7B | 9.6 | 8.8 | 32.2 |
| V-Thinker-7B | 18.0 (+8.4) | 34.6 (+25.8) | 41.8 (+9.6) |
@article{qiao2025v,
title={V-Thinker: Interactive Thinking with Images},
author={Qiao, Runqi and Tan, Qiuna and Yang, Minghan and Dong, Guanting and Yang, Peiqing and Lang, Shiqiang and Wan, Enhui and Wang, Xiaowan and Xu, Yida and Yang, Lan and others},
journal={arXiv preprint arXiv:2511.04460},
year={2025}
}
This training implementation builds upon Thyme and Swift, while our models are trained using Qwen2.5-VL. For evaluation, we rely on MathVision, We-Math, VisuLogic, and VLMEvalKit. For the GRPO-stage data, we sincerely thank We-Math 2.0, MM-Eureka, and ThinkLite for their open contributions. We express our sincere gratitude to these projects for their invaluable contributions to the open-source community.
For any questions or feedback, please reach out to us at qrq@bupt.edu.cn or qiunatan@bupt.edu.cn.
This project is released under the MIT License.






