Huazhong University of Science and Technology open-sources multimodal large model Monkey

Monkey is a high-performance multimodal large model jointly launched by Huazhong University of Science and Technology and Kingsoft. It addresses the challenges existing models face in processing complex scenes and fine visual details by increasing the input resolution and introducing a multi-level description generation method. Monkey can be built on top of existing vision encoders without pre-training from scratch, greatly improving R&D efficiency.

Monkey's multi-level description generation approach provides rich contextual information that guides the model in learning the associations between scenes and objects. Evaluated on 16 different datasets, Monkey achieves excellent results on multimodal tasks such as image captioning, visual question answering, and document understanding, demonstrating its ability to perceive subtle visual information and understand complex scenes, with a wide range of potential applications.


Open-source address: https://github.com/Yuliang-Liu/Monkey

Paper address: https://arxiv.org/abs/2311.06607v1

The quality of Monkey's training dataset is key to improving its capabilities. The researchers generated hundreds of thousands of high-quality image descriptions, using multiple models to automatically produce textual descriptions and fusing the outputs of the different models to enhance the large model's ability to understand image details.
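The article does not specify how the outputs of the different models are fused. As a minimal sketch of one plausible approach (entirely hypothetical, not Monkey's actual pipeline), candidate captions could be merged sentence-by-sentence, keeping each distinct detail only once:

```python
def fuse_descriptions(candidates):
    """Merge captions from several models into one richer description.

    Hypothetical fusion strategy: concatenate sentences in order of
    appearance, dropping duplicates (case-insensitive). Real systems
    would likely use an LLM or learned scorer to combine descriptions.
    """
    seen, fused = set(), []
    for caption in candidates:
        for sentence in caption.split(". "):
            s = sentence.strip().rstrip(".")
            if s and s.lower() not in seen:
                seen.add(s.lower())
                fused.append(s)
    return ". ".join(fused) + "." if fused else ""


# Example: two models describe the same image with overlapping detail.
merged = fuse_descriptions([
    "A dog runs. The grass is green.",
    "A dog runs. The dog is brown.",
])
print(merged)  # each unique sentence appears exactly once
```

This keeps the union of details while avoiding verbatim repetition, which is the stated goal of fusing multiple models' outputs.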

In terms of model selection, Monkey uses the open-source model Qwen-VL as its language decoder and the 2-billion-parameter ViT-BigHuge as its vision encoder, avoiding the waste of resources from repeated pre-training. To improve Monkey's recognition ability and input resolution, generate richer image descriptions, and strengthen its understanding of complex scenes, training proceeds in three stages: multi-level description generation, high-resolution encoding, and multi-task training.
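A common way to raise the input resolution of a fixed-size vision encoder is to split the high-resolution image into tiles and encode each tile separately with the shared encoder. As a hedged sketch (the 448-pixel tile size and the tiling scheme are assumptions for illustration, not confirmed details of Monkey's implementation), the tile geometry can be computed like this:

```python
def split_into_tiles(width, height, tile=448):
    """Compute (left, top, right, bottom) boxes covering an image.

    Assumption for illustration: each tile matches the fixed input
    resolution of the vision encoder (448 here is hypothetical), and
    edge tiles are clipped to the image bounds. Every tile would then
    be passed through the shared ViT encoder independently.
    """
    boxes = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            boxes.append((
                left,
                top,
                min(left + tile, width),
                min(top + tile, height),
            ))
    return boxes


# Example: an 896x896 input yields a 2x2 grid of 448x448 tiles.
print(split_into_tiles(896, 896))
```

Encoding tiles with one shared encoder is what lets a model handle larger inputs without pre-training a new high-resolution backbone, which matches the article's point about reusing existing components.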

Monkey was thoroughly validated on 16 different datasets, covering tasks such as image captioning, general visual question answering, and document-oriented question answering. On general visual question answering, Monkey shows significant strengths across multiple datasets. On image captioning, Monkey also performs well on the TextCaps dataset, demonstrating its multimodal understanding of text elements within images.

Monkey has also achieved good results on several document image understanding datasets for document-oriented question answering. The researchers said Monkey has broad application prospects in fields such as medical imaging and satellite imagery, and that they will continue to optimize the model's perception, association, inference, and generalization capabilities.

In summary, Monkey is a high-performance multimodal large model that addresses the challenges of complex scenes and fine visual details by increasing the input resolution and introducing a multi-level description generation method. Monkey does not need to be pre-trained from scratch and can be built on existing vision encoders, offering high efficiency and a wide range of applications. Evaluated on multiple datasets, Monkey has achieved excellent results on multimodal tasks, demonstrating strong visual perception and scene understanding. Going forward, the team will continue to optimize the model's perception, association, inference, and generalization capabilities to further enhance its value in various fields.
