Zhipu open-sources the next-generation multimodal large model CogVLM2

ZhipuAI recently announced CogVLM2, a new-generation multimodal large model. Compared with the previous-generation CogVLM, the model delivers significant gains on key benchmarks while supporting 8K text context and image resolutions of up to 1344×1344. CogVLM2 improves its score by 32% on the OCRbench benchmark and by 21.9% on the TextVQA benchmark, demonstrating strong document-image understanding. Although CogVLM2 has only 19B parameters, its performance approaches or exceeds that of GPT-4V.


The technical architecture of CogVLM2 builds on the previous-generation model, combining a 5-billion-parameter visual encoder with a 7-billion-parameter visual expert module that models the interaction between visual and language sequences through dedicated parameters. This deep-fusion strategy integrates the visual and language modalities more tightly while preserving the model's strengths in language processing. Moreover, thanks to its carefully designed multi-expert structure, CogVLM2 activates only about 12 billion parameters during inference, which significantly improves inference efficiency.
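The "visual expert" idea described above can be illustrated with a toy sketch: image tokens and text tokens share one sequence, but each modality is projected by its own weight matrix before mixing. This is only a minimal NumPy illustration of the routing concept; the function name, shapes, and weights are assumptions for demonstration and are not from the official CogVLM2 implementation.

```python
import numpy as np

def visual_expert_projection(hidden, is_image, w_text, w_image):
    """Project each token with the language weights or the visual-expert
    weights, selected by a per-token modality mask (illustrative only)."""
    out = np.empty_like(hidden)
    out[~is_image] = hidden[~is_image] @ w_text   # language pathway
    out[is_image] = hidden[is_image] @ w_image    # visual-expert pathway
    return out

rng = np.random.default_rng(0)
d = 8                                   # toy hidden size
seq = rng.standard_normal((6, d))       # 6 tokens: 4 image + 2 text
mask = np.array([True, True, True, True, False, False])
w_txt = rng.standard_normal((d, d))     # hypothetical language weights
w_img = rng.standard_normal((d, d))     # hypothetical visual-expert weights

mixed = visual_expert_projection(seq, mask, w_txt, w_img)
print(mixed.shape)  # (6, 8)
```

Because only one pathway's weights are applied per token, the extra visual-expert parameters add capacity for image tokens without increasing the compute spent on each token, which is consistent with the gap between total (19B) and activated (~12B) parameters.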

In terms of performance, CogVLM2 achieves excellent results on multiple multimodal benchmarks, including TextVQA, DocVQA, ChartQA, OCRbench, MMMU, MMVet, and MMBench. These tests cover a wide range of capabilities, from text and image understanding to complex reasoning and interdisciplinary tasks. The two CogVLM2 models rank first on several of these benchmarks, while on the others they approach the level of closed-source models.

Code repository:

GitHub: https://github.com/THUDM/CogVLM2

Model Download:

Hugging Face: huggingface.co/THUDM

ModelScope community: modelscope.cn/models/ZhipuAI

Wisemodel community: wisemodel.cn/models/ZhipuAI

Demo experience:

https://modelscope.cn/studios/ZhipuAI/Cogvlm2-llama3-chinese-chat-Demo/summary

CogVLM2 Technical Documentation:

https://zhipu-ai.feishu.cn/wiki/OQJ9wk5dYiqk93kp3SKcBGDPnGf
