Recently, research on and applications of multimodal large models have made significant progress. Companies abroad such as OpenAI, Google, and Microsoft have launched a series of advanced models, and Chinese institutions such as Zhipu AI and StepFun (Jieyuexingchen) have also made breakthroughs in this field. These models usually rely on a vision encoder to extract visual features and pair it with a large language model, but because the two components are trained separately, the vision encoder introduces inductive biases that limit the deployment efficiency and performance of multimodal large models.
To address these problems, the Beijing Academy of Artificial Intelligence (BAAI), in collaboration with Dalian University of Technology, Peking University, and other universities, has released EVE, a new-generation encoder-free vision-language model. Through a refined training strategy and additional visual supervision, EVE integrates vision-language representation, alignment, and reasoning into a unified decoder-only architecture. Trained only on publicly available data, EVE performs well across multiple vision-language benchmarks, approaching and in some cases outperforming mainstream encoder-based multimodal methods.
The main features of EVE include:
- Native vision-language model: removes the vision encoder, handles images of arbitrary aspect ratio, and significantly outperforms the comparable Fuyu-8B model.
- Low data and training cost: pre-training uses public data such as OpenImages, SAM, and LAION, with a comparatively short training time.
- Transparent and efficient exploration: offers an efficient, transparent development path for decoder-only native multimodal architectures.
Model structure:
- Patch Embedding Layer: obtains a 2D feature map of the image through a single convolution layer followed by an average pooling layer, combining local features with global information.
- Patch Aligning Layer: fuses visual features from multiple network layers to achieve fine-grained alignment with the output of a pretrained vision encoder (a minimal sketch of both modules follows this list).
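As a rough illustration, the two modules might look like the PyTorch sketch below. The layer dimensions, kernel sizes, and layer-fusion scheme are illustrative assumptions, not EVE's exact configuration; the reference implementation is in the project repository.

```python
import torch
import torch.nn as nn

class PatchEmbeddingLayer(nn.Module):
    """Turn an image of arbitrary aspect ratio into patch tokens with a single
    convolution, then average-pool the feature map to add coarser global context."""
    def __init__(self, in_chans=3, dim=1024, patch=14):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch, stride=patch)
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)

    def forward(self, image):                     # image: (B, 3, H, W)
        feat = self.proj(image)                   # (B, dim, H/patch, W/patch)
        coarse = self.pool(feat)                  # pooled map carrying global info
        tokens = feat.flatten(2).transpose(1, 2)  # (B, N, dim) local patch tokens
        coarse_tokens = coarse.flatten(2).transpose(1, 2)
        return torch.cat([tokens, coarse_tokens], dim=1)

class PatchAligningLayer(nn.Module):
    """Fuse hidden states from several decoder layers and project them into the
    feature space of a pretrained vision encoder used as a training-time teacher."""
    def __init__(self, dim=1024, teacher_dim=1024, num_layers=4):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.ones(num_layers) / num_layers)
        self.proj = nn.Linear(dim, teacher_dim)

    def forward(self, hidden_states):             # list of num_layers (B, N, dim) tensors
        stacked = torch.stack(hidden_states)                      # (L, B, N, dim)
        weights = torch.softmax(self.layer_weights, dim=0)
        fused = (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)  # (B, N, dim)
        return self.proj(fused)                   # compared against teacher features
```

In this reading, the extra visual supervision from the aligning branch acts only as a training signal; at inference the model runs as a pure decoder.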
Training strategy:
- LLM-guided pre-training phase: establishes an initial connection between vision and language.
- Generative pre-training phase: improves the model's ability to understand vision-language content.
- Supervised fine-tuning phase: regularizes the model to follow language instructions and learn conversational patterns (a staging sketch follows this list).
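The three phases can be thought of as a staged schedule of which modules are trainable and on what data. The following Python sketch uses assumed module names, data mixes, and objectives for illustration; it is not the official recipe.

```python
# Illustrative staging of the three-phase training recipe described above.
# Module prefixes, data descriptions, and objectives are assumptions.
TRAINING_STAGES = [
    {
        "name": "llm_guided_pretraining",
        "trainable": ["patch_embedding", "patch_aligning"],   # LLM kept frozen
        "objectives": ["next_token_prediction", "visual_alignment"],
        "data": "image-caption pairs (e.g. OpenImages/SAM/LAION subsets)",
    },
    {
        "name": "generative_pretraining",
        "trainable": ["patch_embedding", "patch_aligning", "llm"],  # full model
        "objectives": ["next_token_prediction", "visual_alignment"],
        "data": "large-scale image-text corpora",
    },
    {
        "name": "supervised_finetuning",
        "trainable": ["patch_embedding", "patch_aligning", "llm"],
        "objectives": ["next_token_prediction"],
        "data": "instruction-following and conversational data",
    },
]

def prepare_stage(model, stage):
    """Freeze all parameters, then unfreeze only the modules listed for this stage."""
    for p in model.parameters():
        p.requires_grad = False
    for name, p in model.named_parameters():
        if any(name.startswith(prefix) for prefix in stage["trainable"]):
            p.requires_grad = True
    # ...build the dataloader from stage["data"] and optimize stage["objectives"]...
```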
Quantitative analysis: EVE performs well on multiple vision-language benchmarks and is comparable to a variety of mainstream encoder-based vision-language models. Although accurately responding to certain specific instructions remains challenging, EVE's efficient training strategy lets it reach performance on par with encoder-based vision-language models.
EVE demonstrates the potential of encoder-free native vision-language models and may continue to drive the development of multimodal models through further performance improvements, optimization of the encoder-free architecture, and the construction of native multimodal systems.
Paper address: https://arxiv.org/abs/2406.11832
Project code: https://github.com/baaivision/EVE
Model address: https://huggingface.co/BAAI/EVE-7B-HD-v1.0