Alibaba's Qwen research team has released another open-source artificial intelligence (AI) model in preview. Dubbed QVQ-72B, it is a vision-based reasoning model that can analyze visual information from images and understand the context behind them. The tech giant has also shared benchmark scores of the AI model and highlighted that on one specific test, it was able to outperform OpenAI's o1 model. Notably, Alibaba has released several open-source AI models recently, including the QwQ-32B and Marco-o1 reasoning-focused large language models (LLMs).
Alibaba's Vision-Based QVQ-72B AI Model Launched
In a Hugging Face listing, Alibaba's Qwen team detailed the new open-source AI model. Calling it an experimental research model, the researchers highlighted that the QVQ-72B comes with enhanced visual reasoning capabilities. Interestingly, visual understanding and reasoning are two separate capabilities that the researchers have combined in this model.
Vision-based AI models are plentiful. They include an image encoder and can analyze the visual information in an image and the context behind it. Similarly, reasoning-focused models such as o1 and QwQ-32B come with test-time compute scaling capabilities that allow them to spend more processing time on a query. This enables the model to break down the problem, solve it in a step-by-step manner, assess the output and correct it against a verifier.
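To make the idea of spending more compute at answer time concrete, here is a minimal toy sketch in Python. It is not how o1, QwQ-32B or QVQ-72B work internally; the functions `propose`, `verify` and `solve_with_verifier` are hypothetical names invented for this illustration. The loop proposes an answer, checks it against a verifier, and uses the feedback to revise it until the answer passes or the step budget runs out.

```python
from typing import Callable, Optional, Tuple

Feedback = Optional[str]


def propose(previous: Optional[int], feedback: Feedback) -> int:
    """Hypothetical 'model': start with a guess, then revise it using feedback."""
    if previous is None:
        return 10  # arbitrary first guess
    return previous + 1 if feedback == "too low" else previous - 1


def verify(candidate: int) -> Tuple[bool, str]:
    """Hypothetical verifier: checks whether candidate squared equals 144."""
    square = candidate * candidate
    if square == 144:
        return (True, "ok")
    return (False, "too low" if square < 144 else "too high")


def solve_with_verifier(
    propose_fn: Callable[[Optional[int], Feedback], int],
    verify_fn: Callable[[int], Tuple[bool, str]],
    max_steps: int,
) -> Tuple[Optional[int], int]:
    """Propose, verify and revise until the verifier accepts or the budget ends."""
    candidate: Optional[int] = None
    feedback: Feedback = None
    for step in range(1, max_steps + 1):
        candidate = propose_fn(candidate, feedback)
        accepted, feedback = verify_fn(candidate)
        if accepted:
            return candidate, step
    return None, max_steps  # budget exhausted without a verified answer


if __name__ == "__main__":
    print(solve_with_verifier(propose, verify, max_steps=2))
    print(solve_with_verifier(propose, verify, max_steps=5))
```

In this toy setup, a budget of two steps ends without a verified answer, while a budget of five steps is enough to reach one. That is the basic trade-off behind test-time compute scaling: a larger step budget costs more processing time but gives the model more chances to correct itself.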
With QVQ-72B's preview model, Alibaba has combined these two functionalities. The model can analyze information from images and answer complex queries by using reasoning-focused structures. The team highlights that this combination has significantly improved the performance of the model.
Sharing evals from internal testing, the researchers claimed that the QVQ-72B was able to score 71.4 percent on the MathVista (mini) benchmark, narrowly outperforming the o1 model (71.0 percent). It is also said to score 70.3 percent on the Massive Multi-discipline Multimodal Understanding (MMMU) benchmark.
Despite the improved performance, there are several limitations, as is the case with most experimental models. The Qwen team stated that the AI model occasionally mixes different languages or unexpectedly switches between them, a behavior known as code-switching. Additionally, the model is prone to getting caught in recursive reasoning loops, affecting the final output.