A recent evaluation report shows that Gemini-Pro has made significant progress in the field of multimodality: it is comparable to GPT-4V and even performs better in some respects. First, in overall performance on the comprehensive multimodal benchmark MME, Gemini-Pro surpassed GPT-4V with a high score of 1933.4, demonstrating an all-round advantage across perception and cognition. Across the 37 visual understanding tasks, Gemini-Pro stood out in tasks such as text translation, color/landmark/person recognition, and OCR, showing strong capabilities in basic perception.
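For reference, MME scores yes/no questions per subtask. The sketch below assumes the commonly described scheme in which each image carries two questions and a subtask's score is accuracy (%) plus "accuracy+" (%), the share of images with both questions answered correctly, so each subtask caps at 200 and a headline number such as 1933.4 is the sum over subtasks. The function and record layout here are illustrative, not taken from the report.

```python
# Minimal sketch of MME-style scoring (assumed scheme, not the report's code).
from collections import defaultdict

def mme_subtask_scores(records):
    """records: iterable of (subtask, image_id, prediction, answer),
    where prediction/answer are 'yes' or 'no' strings."""
    # Group the (usually two) question outcomes belonging to each image.
    by_image = defaultdict(list)
    for subtask, image_id, pred, ans in records:
        by_image[(subtask, image_id)].append(pred.strip().lower() == ans.strip().lower())

    # Per subtask: question-level correct/total and image-level correct/total.
    stats = defaultdict(lambda: [0, 0, 0, 0])
    for (subtask, _), flags in by_image.items():
        s = stats[subtask]
        s[0] += sum(flags)      # questions answered correctly
        s[1] += len(flags)      # questions asked
        s[2] += int(all(flags)) # images with every question correct ("accuracy+")
        s[3] += 1               # images seen

    # Subtask score = accuracy(%) + accuracy+(%), i.e. at most 200 per subtask.
    return {
        subtask: 100.0 * qc / qt + 100.0 * ic / it
        for subtask, (qc, qt, ic, it) in stats.items()
    }
```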
Paper address: https://arxiv.org/pdf/2312.12436.pdf
Project address: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models
However, the evaluation also revealed differences between the two. In the celebrity recognition task, GPT-4V scored 0, mainly because it refused to answer such questions. In the position recognition task, both performed poorly, indicating that neither is very sensitive to spatial position information. In addition, the open-source model SPHINX is on par with, or even better than, GPT-4V and Gemini on perception tasks, but lags well behind both on cognition tasks.
The evaluation report examines Gemini-Pro's visual understanding capabilities in detail across four areas: basic perception, advanced cognition, challenging visual tasks, and various expert capabilities. The basic perception tests cover object-level perception, scene-level perception, and knowledge-based perception, and Gemini-Pro performed outstandingly in tasks such as color/landmark/person recognition and OCR.
The advanced cognition tests involved tasks such as text-rich visual reasoning, abstract visual reasoning, scientific problem solving, sentiment analysis, and intellectual games, with Gemini-Pro achieving good results in formula generation and in reasoning over abstract visual stimuli.
The challenging visual tasks include referring expression comprehension, object tracking, and visual story generation, in which Gemini-Pro demonstrated deep visual perception and understanding. Finally, the expert-capability tests covered tasks such as defect detection and economic analysis, where Gemini-Pro showed strong expertise in analyzing stock price charts. However, the evaluation also notes that Gemini-Pro suffers from hallucination problems in some tasks and still needs improvement.
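As context for how such prompt-based tests are typically run, the sketch below sends a single image plus a task instruction to the model and returns the raw text reply. It assumes the google-generativeai Python SDK and the gemini-pro-vision model name; the prompts, file names, and helper function are purely illustrative and not the report's actual protocol.

```python
# Hedged sketch of a single prompt-based query (assumes the google-generativeai SDK).
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")            # placeholder credential
model = genai.GenerativeModel("gemini-pro-vision")  # model name as of the report's timeframe

def ask(image_path: str, instruction: str) -> str:
    """Send one image plus a task instruction and return the model's text reply."""
    image = Image.open(image_path)
    response = model.generate_content([instruction, image])
    return response.text.strip()

# Illustrative queries: an OCR-style basic-perception prompt and a
# referring-expression prompt (file names are hypothetical).
print(ask("receipt.jpg", "Read out all the text visible in this image."))
print(ask("street.jpg", "Give the bounding box of 'the red car on the left' as [x1, y1, x2, y2]."))
```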
Gemini-Pro has achieved remarkable results in the field of multimodality, demonstrating its strong potential in visual understanding. However, the evaluation also highlights that there is still room for further improvement in specific tasks and fields. The performance of Gemini-Pro demonstrates the potential power of multimodal technology and provides useful inspiration for future research and applications.