Alibaba's Tongyi Qianwen (Qwen) team published a blog post today (December 25) announcing the launch of QVQ-72B-Preview, an open-source visual reasoning model built on Qwen2-VL-72B. According to the team, the model can work through complex physics problems calmly and methodically by logical reasoning, much like a master physicist.
The Qwen team evaluated QVQ-72B-Preview on four datasets. 1AI has included a brief introduction to each below:
- MMMU: a university-level, multidisciplinary, multimodal evaluation set designed to examine a model's integrated vision-related understanding and reasoning skills.
- MathVista: a collection of math-related visual reasoning tests that assesses logical reasoning over puzzle figures, algebraic reasoning over function plots, and scientific reasoning over figures from academic papers.
- MathVision: a collection of high-quality multimodal mathematical reasoning tests from real math competitions, with more question diversity and subject breadth than MathVista.
- OlympiadBench: an Olympiad-level, bilingual, multimodal science benchmark containing 8,476 problems from Olympiad-level math and physics competitions, including the Chinese Gaokao. Each problem is accompanied by expert-level annotations detailing step-by-step reasoning.
Test results show that QVQ-72B-Preview achieved a score of 70.3 on the MMMU benchmark, significantly outperforming Qwen2-VL-72B-Instruct. Additionally, the model performed well on the three remaining benchmarks, which focus on math and science problems, effectively closing the gap with the state-of-the-art o1 model.
The Qwen team also stated that QVQ-72B-Preview is an experimental research model focused on enhancing visual reasoning. Although it performed beyond expectations, it still has several limitations to be aware of:
- Language mixing and switching: The model may unexpectedly mix languages or switch between them, which affects the clarity of its responses.
- Recursive reasoning: The model may fall into circular logic patterns, generating lengthy responses without reaching a conclusion.
- Safety and ethical considerations: The model requires enhanced safety measures to ensure reliable and safe performance, and users should exercise caution when deploying it.
- Performance and benchmark limitations: Although the model has improved in visual reasoning, it cannot fully replace the capabilities of Qwen2-VL-72B. In addition, during multi-step visual reasoning, the model may gradually lose focus on the image content, leading to hallucinations.