A recent evaluation report shows that Gemini-Pro has made significant progress in the field of multimodality: it is comparable to GPT-4V and even performs better in some respects. First, in overall performance on the comprehensive multimodal benchmark MME, Gemini-Pro surpassed GPT-4V with a high score of 1933.4, demonstrating an all-round advantage across perception and cognition. Across the 37 visual understanding tasks, Gemini-Pro stood out in tasks such as text translation, color/landmark/person recognition, and OCR, showing strong capabilities in basic perception.
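For reference, MME scores yes/no questions per subtask. The sketch below assumes the commonly described scheme in which each image carries two questions and a subtask's score is accuracy (%) plus "accuracy+" (%), the share of images with both questions answered correctly, so each subtask caps at 200 and a headline number such as 1933.4 is the sum over subtasks. The function and record layout here are illustrative, not taken from the report.

```python
# Minimal sketch of MME-style scoring (assumed scheme, not the report's code).
from collections import defaultdict

def mme_subtask_scores(records):
    """records: iterable of (subtask, image_id, prediction, answer),
    where prediction/answer are 'yes' or 'no' strings."""
    # Group the (usually two) question outcomes belonging to each image.
    by_image = defaultdict(list)
    for subtask, image_id, pred, ans in records:
        by_image[(subtask, image_id)].append(pred.strip().lower() == ans.strip().lower())

    # Per subtask: question-level correct/total and image-level correct/total.
    stats = defaultdict(lambda: [0, 0, 0, 0])
    for (subtask, _), flags in by_image.items():
        s = stats[subtask]
        s[0] += sum(flags)      # questions answered correctly
        s[1] += len(flags)      # questions asked
        s[2] += int(all(flags)) # images with every question correct ("accuracy+")
        s[3] += 1               # images seen

    # Subtask score = accuracy(%) + accuracy+(%), i.e. at most 200 per subtask.
    return {
        subtask: 100.0 * qc / qt + 100.0 * ic / it
        for subtask, (qc, qt, ic, it) in stats.items()
    }
```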
Paper address: https://arxiv.org/pdf/2312.12436.pdf
Project address: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models
However, the evaluation also revealed differences between the two. In the celebrity recognition task, GPT-4V scored 0, mainly because it refused to answer such questions. In the position recognition task, both performed poorly, indicating that neither is very sensitive to spatial position information. In addition, the open-source model SPHINX is on par with, or even better than, GPT-4V and Gemini on perception tasks, but lags well behind both on cognition tasks.
The evaluation report examines Gemini-Pro's visual understanding capabilities in detail across four areas: basic perception, advanced cognition, challenging visual tasks, and various expert capabilities. The basic perception tests cover object-level perception, scene-level perception, and knowledge-based perception, and Gemini-Pro performed outstandingly in tasks such as color/landmark/person recognition and OCR.
The advanced cognition tests involved tasks such as text-rich visual reasoning, abstract visual reasoning, scientific problem solving, sentiment analysis, and intellectual games, with Gemini-Pro achieving good results in formula generation and in reasoning over abstract visual stimuli.
The challenging visual tasks include referring expression comprehension, object tracking, and visual story generation, in which Gemini-Pro demonstrated deep visual perception and understanding. Finally, the expert-capability tests covered tasks such as defect detection and economic analysis, where Gemini-Pro showed strong expertise in analyzing stock price charts. However, the evaluation also notes that Gemini-Pro suffers from hallucination problems in some tasks and still needs improvement.
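As context for how such prompt-based tests are typically run, the sketch below sends a single image plus a task instruction to the model and returns the raw text reply. It assumes the google-generativeai Python SDK and the gemini-pro-vision model name; the prompts, file names, and helper function are purely illustrative and not the report's actual protocol.

```python
# Hedged sketch of a single prompt-based query (assumes the google-generativeai SDK).
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")            # placeholder credential
model = genai.GenerativeModel("gemini-pro-vision")  # model name as of the report's timeframe

def ask(image_path: str, instruction: str) -> str:
    """Send one image plus a task instruction and return the model's text reply."""
    image = Image.open(image_path)
    response = model.generate_content([instruction, image])
    return response.text.strip()

# Illustrative queries: an OCR-style basic-perception prompt and a
# referring-expression prompt (file names are hypothetical).
print(ask("receipt.jpg", "Read out all the text visible in this image."))
print(ask("street.jpg", "Give the bounding box of 'the red car on the left' as [x1, y1, x2, y2]."))
```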
Gemini-Pro has achieved remarkable results in the field of multimodality, demonstrating its strong potential in visual understanding. However, the evaluation also highlights that there is still room for further improvement in specific tasks and fields. The performance of Gemini-Pro demonstrates the potential power of multimodal technology and provides useful inspiration for future research and applications.