Monkey is a high-performance multimodal large model jointly launched by Huazhong University of Science and Technology and Kingsoft. It addresses the difficulty existing models have with complex scenes and fine visual detail by increasing the input resolution and introducing a multi-level description generation method. Monkey can be built on top of existing vision encoders without pre-training from scratch, which greatly improves R&D efficiency.
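To make the high-resolution idea concrete, here is a minimal sketch of splitting a large image into encoder-sized tiles plus a resized global view, so a pretrained vision encoder can be reused at its native input size. The 448x448 resolution, function names, and tiling details are illustrative assumptions, not the authors' implementation:

```python
# Minimal sketch: cover a large image with encoder-sized tiles plus a
# low-resolution global view, so a pretrained vision encoder (assumed here
# to take 448x448 inputs) can be reused without retraining.
from PIL import Image

ENCODER_RES = 448  # assumed native input size of the reused vision encoder

def tile_image(img: Image.Image, tile: int = ENCODER_RES):
    """Return encoder-sized tiles covering the image, plus a global view."""
    w, h = img.size
    # Pad dimensions up to a multiple of the tile size before cropping.
    padded = Image.new("RGB", (((w + tile - 1) // tile) * tile,
                               ((h + tile - 1) // tile) * tile))
    padded.paste(img, (0, 0))
    tiles = [padded.crop((x, y, x + tile, y + tile))
             for y in range(0, padded.height, tile)
             for x in range(0, padded.width, tile)]
    global_view = img.resize((tile, tile))  # low-res view for global context
    return tiles, global_view

# Each tile (and the global view) would then pass through the same shared
# vision encoder, with the resulting features handed to the language decoder.
```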
Monkey's multi-level description generation approach provides rich contextual information that guides the model in learning the associations between scenes and objects. Tested on 16 different datasets, Monkey achieves excellent results on multimodal tasks such as image captioning, visual question answering, and document-oriented question answering, demonstrating its ability to perceive subtle visual information and understand complex scenes across a wide range of applications.
Open source address: https://github.com/Yuliang-Liu/Monkey
Paper address: https://arxiv.org/abs/2311.06607v1
The quality of Monkey's training data is key to its improved capability. The researchers generated hundreds of thousands of high-quality image descriptions, using multiple models to automatically produce textual descriptions and fusing the outputs of the different models to strengthen the large model's understanding of image details.
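As an illustration of this kind of data pipeline, the sketch below fuses the outputs of several annotation models into a single detailed description. The three stub functions stand in for real captioning, region-description, and OCR models; their names and outputs are invented for this example:

```python
# Hedged sketch of multi-model annotation fusion: several models contribute
# complementary signals, which are merged into one detailed description.
# The three "model" functions are stubs with invented outputs.

def overall_captioner(image_path: str) -> str:
    return "a man reading a menu outside a cafe"                   # stub output

def region_captioner(image_path: str) -> list[str]:
    return ["a red awning over the door", "a chalkboard menu"]     # stub output

def ocr_model(image_path: str) -> str:
    return "TODAY'S SPECIAL: SOUP"                                 # stub output

def fuse_descriptions(image_path: str) -> str:
    """Merge complementary annotations into one rich description prompt,
    which a text-only LLM could then rewrite into a fluent caption."""
    parts = [
        f"Overall: {overall_captioner(image_path)}",
        f"Objects: {'; '.join(region_captioner(image_path))}",
        f"Text in image: {ocr_model(image_path)}",
    ]
    return "Combine into one detailed description:\n" + "\n".join(parts)

print(fuse_descriptions("example.jpg"))
```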
In terms of model selection, Monkey uses the open-source model Qwen-VL as its language decoder and the roughly 2-billion-parameter ViT-BigG as its vision encoder, avoiding the resource waste of repeated pre-training. To improve Monkey's recognition ability and input resolution, and to enable richer image descriptions and understanding of complex scenes, training proceeds in three phases: multi-level description generation, high-resolution encoding, and multi-task training.
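One way to picture this staged recipe is as a small training plan, shown below. This is a sketch only: which modules are trainable at each stage, and the module names themselves, are assumptions for illustration rather than details from the paper.

```python
# Illustrative sketch (not the authors' actual config) of organizing the
# three training stages: reuse pretrained components and train selected
# modules per stage instead of pretraining everything from scratch.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    data: str
    trainable: tuple[str, ...]  # which modules receive gradients (assumed)

STAGES = [
    Stage("multi-level description generation",
          data="fused multi-model image descriptions",
          trainable=("resampler", "adapters")),
    Stage("high-resolution encoding",
          data="tiled high-resolution images",
          trainable=("vision_adapters", "resampler")),
    Stage("multi-task training",
          data="captioning + VQA + document QA mixtures",
          trainable=("resampler", "decoder_adapters")),
]

for s in STAGES:
    print(f"{s.name}: train {', '.join(s.trainable)} on {s.data}")
```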
Monkey was thoroughly validated on 16 different datasets, covering tasks such as image captioning, general visual question answering, and document-oriented question answering. On general visual question answering, Monkey shows clear strengths across multiple datasets. On image captioning, Monkey also performs well on the TextCaps dataset, demonstrating its multimodal understanding of the text elements within images.
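A multi-dataset evaluation of this kind might be organized as in the loop below. This is a generic sketch: the dataset names are examples of VQA-style benchmarks, `model` and `load_benchmark` are hypothetical placeholders, and exact-match accuracy stands in for each dataset's official metric:

```python
# Generic sketch of a multi-benchmark evaluation loop. `model` and
# `load_benchmark` are hypothetical placeholders, and exact-match accuracy
# stands in for each dataset's real scoring protocol.
BENCHMARKS = ["VQAv2", "TextVQA", "DocVQA"]  # example VQA-style datasets

def evaluate(model, load_benchmark):
    scores = {}
    for name in BENCHMARKS:
        dataset = load_benchmark(name)  # yields (image, question, answer) triples
        correct = sum(
            model.answer(image, question) == answer
            for image, question, answer in dataset
        )
        scores[name] = correct / len(dataset)
    return scores
```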
On document-oriented question answering, Monkey achieves good results across several document image understanding datasets. The researchers note that Monkey has broad application potential in fields such as medical imaging and satellite imagery, and they will continue to optimize the model's perception, association, inference, and generalization capabilities.
In summary, Monkey is a high-performance multimodal large model that addresses the challenges of complex scenes and fine visual detail by increasing the input resolution and introducing a multi-level description generation method. Monkey does not need to be pre-trained from scratch and can be built on top of existing vision encoders, offering high development efficiency and a wide range of applications. Tested on multiple datasets, Monkey achieves excellent results on multimodal tasks, demonstrating strong visual perception and scene understanding. Going forward, the team will continue to optimize the model's perception, association, inference, and generalization capabilities to further extend its value across fields.