Huazhong University of Science and Technology open-sources multimodal large model Monkey

Monkey is a high-performance multimodal large model jointly launched by Huazhong University of Science and Technology and Kingsoft. It addresses the challenges existing models face in processing complex scenes and fine visual details by increasing the input resolution and introducing a multi-level description generation method. Monkey can be built on top of existing vision encoders without pre-training from scratch, greatly improving R&D efficiency.

Monkey's multi-level description generation approach provides rich contextual information that guides the model in learning the associations between scenes and objects. Evaluated on 16 different datasets, Monkey achieves excellent results on multimodal tasks such as image captioning, visual question answering, and document understanding, demonstrating its ability to perceive subtle visual information and understand complex scenes, with a wide range of potential applications.


Open-source address: https://github.com/Yuliang-Liu/Monkey

Paper address: https://arxiv.org/abs/2311.06607v1

The quality of Monkey's training dataset is key to improving its capabilities. The researchers generated hundreds of thousands of high-quality image descriptions, using multiple models to automatically produce textual descriptions and fusing the outputs of the different models to enhance the large model's ability to understand image details.
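The article does not specify how the outputs of the different models are fused. As a minimal sketch of one plausible approach (entirely hypothetical, not Monkey's actual pipeline), candidate captions could be merged sentence-by-sentence, keeping each distinct detail only once:

```python
def fuse_descriptions(candidates):
    """Merge captions from several models into one richer description.

    Hypothetical fusion strategy: concatenate sentences in order of
    appearance, dropping duplicates (case-insensitive). Real systems
    would likely use an LLM or learned scorer to combine descriptions.
    """
    seen, fused = set(), []
    for caption in candidates:
        for sentence in caption.split(". "):
            s = sentence.strip().rstrip(".")
            if s and s.lower() not in seen:
                seen.add(s.lower())
                fused.append(s)
    return ". ".join(fused) + "." if fused else ""


# Example: two models describe the same image with overlapping detail.
merged = fuse_descriptions([
    "A dog runs. The grass is green.",
    "A dog runs. The dog is brown.",
])
print(merged)  # each unique sentence appears exactly once
```

This keeps the union of details while avoiding verbatim repetition, which is the stated goal of fusing multiple models' outputs.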

In terms of model selection, Monkey uses the open-source model Qwen-VL as its language decoder and the 2-billion-parameter ViT-BigHuge as its vision encoder, avoiding the waste of resources from repeated pre-training. To improve Monkey's recognition ability and input resolution, generate richer image descriptions, and strengthen its understanding of complex scenes, training proceeds in three stages: multi-level description generation, high-resolution encoding, and multi-task training.
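A common way to raise the input resolution of a fixed-size vision encoder is to split the high-resolution image into tiles and encode each tile separately with the shared encoder. As a hedged sketch (the 448-pixel tile size and the tiling scheme are assumptions for illustration, not confirmed details of Monkey's implementation), the tile geometry can be computed like this:

```python
def split_into_tiles(width, height, tile=448):
    """Compute (left, top, right, bottom) boxes covering an image.

    Assumption for illustration: each tile matches the fixed input
    resolution of the vision encoder (448 here is hypothetical), and
    edge tiles are clipped to the image bounds. Every tile would then
    be passed through the shared ViT encoder independently.
    """
    boxes = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            boxes.append((
                left,
                top,
                min(left + tile, width),
                min(top + tile, height),
            ))
    return boxes


# Example: an 896x896 input yields a 2x2 grid of 448x448 tiles.
print(split_into_tiles(896, 896))
```

Encoding tiles with one shared encoder is what lets a model handle larger inputs without pre-training a new high-resolution backbone, which matches the article's point about reusing existing components.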

Monkey was thoroughly validated on 16 different datasets, covering tasks such as image captioning, general visual question answering, and document-oriented question answering. On general visual question answering, Monkey shows significant strengths across multiple datasets. On image captioning, Monkey also performs well on the TextCaps dataset, demonstrating its multimodal understanding of text elements within images.

Monkey has also achieved good results on several document image understanding datasets for document-oriented question answering. The researchers said Monkey has broad application prospects in fields such as medical imaging and satellite imagery, and that they will continue to optimize the model's perception, association, inference, and generalization capabilities.

In summary, Monkey is a high-performance multimodal large model that addresses the challenges of complex scenes and fine visual details by increasing the input resolution and introducing a multi-level description generation method. Monkey does not need to be pre-trained from scratch and can be built on existing vision encoders, offering high efficiency and a wide range of applications. Evaluated on multiple datasets, Monkey has achieved excellent results on multimodal tasks, demonstrating strong visual perception and scene understanding. Going forward, the team will continue to optimize the model's perception, association, inference, and generalization capabilities to further enhance its value in various fields.
