Beijing Academy of Artificial Intelligence (BAAI) launches EVE, a new-generation encoder-free vision-language multimodal large model

Research on and applications of multimodal large models have recently made significant progress. Companies abroad such as OpenAI, Google, and Microsoft have released a series of advanced models, and domestic institutions such as Zhipu AI and Jieyuexingchen have also made breakthroughs in this field. These models usually rely on a visual encoder to extract visual features, which are then combined with a large language model. However, because the visual encoder is trained separately from the language model, it introduces a visual inductive bias that limits the deployment efficiency and performance of multimodal large models.

To address these problems, BAAI, in collaboration with Dalian University of Technology, Peking University, and other universities, launched EVE, a new-generation encoder-free vision-language model. Through a refined training strategy and additional visual supervision, EVE integrates visual-language representation, alignment, and reasoning into a unified, pure-decoder architecture. Trained only on publicly available data, EVE performs well on multiple vision-language benchmarks, approaching or even outperforming mainstream encoder-based multimodal methods.
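To make "encoder-free, pure decoder" concrete, here is a minimal conceptual sketch in PyTorch. The layer sizes, patch size, and module names are illustrative assumptions rather than EVE's actual configuration: image patches are embedded by a lightweight layer inside the model itself and concatenated with text tokens, so a single decoder-only transformer handles representation, alignment, and reasoning end to end.

```python
import torch
import torch.nn as nn

class EncoderFreeVLM(nn.Module):
    """Hypothetical encoder-free vision-language model (not EVE's actual code)."""
    def __init__(self, vocab_size: int = 32000, hidden: int = 1024, layers: int = 4, heads: int = 8):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, hidden)
        # A lightweight in-model patch embedding replaces the separate vision encoder.
        self.patch_embed = nn.Conv2d(3, hidden, kernel_size=16, stride=16)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads, batch_first=True)
        # With a causal mask, this stack behaves as a decoder-only transformer.
        self.decoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.lm_head = nn.Linear(hidden, vocab_size)

    def forward(self, image: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
        vis = self.patch_embed(image).flatten(2).transpose(1, 2)   # (B, N_img, hidden)
        txt = self.text_embed(text_ids)                            # (B, N_txt, hidden)
        seq = torch.cat([vis, txt], dim=1)                         # one unified token sequence
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        return self.lm_head(self.decoder(seq, mask=mask))

model = EncoderFreeVLM()
logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 8)))
print(logits.shape)  # torch.Size([1, 204, 32000]): 196 image tokens + 8 text tokens
```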


The main features of EVE include:

  • Native vision-language model: removes the visual encoder, handles arbitrary image aspect ratios, and significantly outperforms the comparable Fuyu-8B model.
  • Low data and training cost: pre-training uses only publicly available data such as OpenImages, SAM, and LAION, and training time is short.
  • Transparent and efficient exploration: provides an efficient, transparent development path for decoder-only native multimodal architectures.

Model structure:

  • Patch Embedding Layer: obtains a 2D feature map of the image through a single convolution layer followed by an average pooling layer, enhancing local features and global information (see the sketch after this list).
  • Patch Aligning Layer: integrates visual features from multiple network layers to achieve fine-grained alignment with the output of a visual encoder.
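As a hedged illustration of the patch embedding layer described above, the following PyTorch sketch uses a single convolution to produce a 2D feature map and average pooling to mix in neighboring context; the patch size and hidden dimension are illustrative assumptions, not EVE's published values.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Single convolution + average pooling, as in the patch embedding layer above (assumed sizes)."""
    def __init__(self, hidden_dim: int = 1024, patch_size: int = 16):
        super().__init__()
        # One convolution maps raw RGB pixels to a 2D grid of patch features.
        self.conv = nn.Conv2d(3, hidden_dim, kernel_size=patch_size, stride=patch_size)
        # Average pooling summarizes each local neighborhood, adding broader context.
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feat = self.pool(self.conv(image))          # (B, hidden_dim, H/32, W/32)
        # Flatten the 2D grid into a token sequence for the decoder-only LLM;
        # any aspect ratio works because the grid size simply follows the input.
        return feat.flatten(2).transpose(1, 2)      # (B, num_patches, hidden_dim)

tokens = PatchEmbedding()(torch.randn(1, 3, 448, 320))   # non-square input
print(tokens.shape)  # torch.Size([1, 140, 1024]): a 14 x 10 patch grid
```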

Training strategy:

  • Large language model-guided pre-training phase: establishes an initial connection between vision and language.
  • Generative pre-training phase: improves the model's ability to understand visual-linguistic content.
  • Supervised fine-tuning phase: refines the model's ability to follow language instructions and learn conversational patterns (a configuration sketch of the three stages follows this list).
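As a hedged sketch, the three stages above could be expressed as a training configuration like the following. Which modules are frozen in each stage and the exact data mixtures are illustrative assumptions, not EVE's published recipe.

```python
# Stage names follow the article; frozen/trainable splits and data are assumptions.
training_stages = [
    {
        "name": "llm_guided_pretraining",
        "trainable": ["patch_embedding", "patch_aligning"],
        "frozen": ["language_model"],   # keep the LLM fixed while vision connects to it
        "data": "image-caption pairs (e.g. OpenImages, SAM, LAION subsets)",
    },
    {
        "name": "generative_pretraining",
        "trainable": ["patch_embedding", "patch_aligning", "language_model"],
        "frozen": [],                   # train everything to deepen vision-language understanding
        "data": "large-scale image-text data",
    },
    {
        "name": "supervised_finetuning",
        "trainable": ["patch_embedding", "patch_aligning", "language_model"],
        "frozen": [],
        "data": "instruction-following and multi-turn dialogue data",
    },
]

for stage in training_stages:
    print(f"{stage['name']}: train {stage['trainable']}, freeze {stage['frozen'] or 'nothing'}")
```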

Quantitative analysis: EVE performs well on multiple vision-language benchmarks and is comparable to a variety of mainstream encoder-based vision-language models. Although it still faces challenges in responding accurately to specific instructions, its efficient training strategy allows EVE to achieve performance comparable to encoder-based vision-language models.

EVE demonstrates the potential of encoder-free native vision-language models. Through further performance improvements, optimization of the encoder-free architecture, and the construction of native multimodal capabilities, it may continue to drive the development of multimodal models.

Paper address: https://arxiv.org/abs/2406.11832

Project code: https://github.com/baaivision/EVE

Model address: https://huggingface.co/BAAI/EVE-7B-HD-v1.0
