Microsoft has open-sourced the multimodal model LLaVA-1.5, which inherits the LLaVA architecture and introduces several new features. The researchers evaluated it on visual question answering, natural language processing, image generation, and other tasks, and showed that LLaVA-1.5 reaches the highest level among open-source models, with results comparable to GPT-4V.
The model consists of three parts: a visual model, a large language model, and a vision-language connector. The visual model is the pre-trained CLIP ViT-L/336px. The CLIP encoder turns an image into a fixed-length sequence of visual features that captures its semantic content. Compared with the previous version, both the parameter count of the CLIP model and the input resolution have increased significantly.
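As a rough illustration of this encoding step, the sketch below runs the public CLIP ViT-L/336px checkpoint from Hugging Face ("openai/clip-vit-large-patch14-336") on a single image and inspects the resulting patch features; the checkpoint name, image path, and use of the final hidden states are assumptions for the example, not details taken from the LLaVA-1.5 code.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# Assumed public checkpoint matching the CLIP ViT-L/336px encoder described above.
MODEL_ID = "openai/clip-vit-large-patch14-336"
processor = CLIPImageProcessor.from_pretrained(MODEL_ID)
vision_tower = CLIPVisionModel.from_pretrained(MODEL_ID)

image = Image.open("example.jpg")  # any RGB image; the path is hypothetical
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    outputs = vision_tower(pixel_values)

# A 336x336 input with 14x14 patches yields 24*24 = 576 patch tokens plus one CLS token.
patch_features = outputs.last_hidden_state[:, 1:, :]  # drop the CLS token
print(patch_features.shape)  # torch.Size([1, 576, 1024])
```

Each image therefore becomes a fixed-length sequence of 576 feature vectors, which is what the connector later maps into the language model's input space.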
The large language model is Vicuna v1.5 with 13 billion parameters, which parses the user's input text, captures its semantics, and provides strong reasoning and generation capabilities. Unlike methods that only tune the image encoder, LLaVA-1.5 also updates the parameters of the large language model during training, so the model learns directly how to integrate visual information into its reasoning, improving its autonomy.
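The snippet below is a hypothetical sketch of the parameter-freezing scheme this paragraph implies: the vision encoder stays frozen while the connector and the language model remain trainable. The module names and toy layers are placeholders, not LLaVA-1.5 code.

```python
import torch.nn as nn

vision_tower = nn.Linear(8, 8)    # stand-in for the CLIP ViT-L/336px encoder
projector = nn.Linear(8, 8)       # stand-in for the vision-language connector
language_model = nn.Linear(8, 8)  # stand-in for Vicuna v1.5 (13B)

# The image encoder is not tuned.
for p in vision_tower.parameters():
    p.requires_grad = False

# The connector and the language model are updated during training.
for module in (projector, language_model):
    for p in module.parameters():
        p.requires_grad = True

trainable = sum(p.numel()
                for m in (vision_tower, projector, language_model)
                for p in m.parameters() if p.requires_grad)
print(f"trainable parameters in this toy setup: {trainable}")
```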
For the vision-language connector, LLaVA-1.5 replaces the original linear projection with a two-layer MLP, which maps the output of the CLIP encoder into the word-embedding space of the large language model more effectively.
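A minimal PyTorch sketch of such a two-layer MLP connector is shown below. The dimensions are assumptions for illustration: 1024 matches the CLIP ViT-L hidden size and 5120 matches the Vicuna-13B embedding size; the class name and structure are not taken from the LLaVA-1.5 codebase.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Two-layer MLP that maps CLIP patch features into the LLM embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 5120):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),  # first projection into the LLM space
            nn.GELU(),                        # non-linearity a plain linear projection lacks
            nn.Linear(llm_dim, llm_dim),      # second layer refines the projected tokens
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.net(patch_features)

projector = MLPProjector()
dummy_patches = torch.randn(1, 576, 1024)  # e.g. the 576 CLIP patch tokens from above
print(projector(dummy_patches).shape)       # torch.Size([1, 576, 5120])
```

The extra layer and non-linearity give the connector more capacity than a single linear projection while adding only a small number of parameters relative to the 13B language model.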
For training, LLaVA-1.5 follows a two-stage recipe. It first pre-trains the vision-language alignment on about 600,000 image-text pairs, which takes roughly 1 hour. It then fine-tunes on 650,000 multimodal instruction-following samples, which takes roughly 20 hours. This efficient two-stage schedule ensures that the model converges while the entire run completes within a single day, greatly reducing compute and time costs compared with other models.
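The following sketch only summarizes that schedule in code form. The sample counts and times come from the text above; which modules are trainable in each stage follows the common LLaVA-style recipe (connector only in pre-training, connector plus language model in instruction tuning) and should be treated as an assumption, as is the dataclass itself.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    data: str
    num_samples: int
    trainable: tuple   # modules updated in this stage (assumed)
    approx_time: str

stages = [
    Stage("pre-training", "image-text pairs", 600_000,
          ("projector",), "~1 hour"),
    Stage("instruction tuning", "multimodal instruction data", 650_000,
          ("projector", "language_model"), "~20 hours"),
]

for s in stages:
    print(f"{s.name}: {s.num_samples:,} {s.data}, "
          f"trains {', '.join(s.trainable)}, {s.approx_time}")
```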
The researchers also designed matching response-format prompts that guide the model to adjust its output format to the type of interaction, so it can meet the needs of specific scenarios. For visual instruction tuning, LLaVA-1.5 draws on several types of datasets, including VQA, OCR, region-level VQA, visual conversation, and language conversation data, about 650,000 samples in total, giving the model a rich variety of visual reasoning scenes and interaction styles.
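As a hedged illustration of such a response-format prompt, the sketch below appends a short instruction to steer the model toward terse, benchmark-style answers for short-answer VQA data while leaving conversational questions unchanged. The function name and the branching logic are hypothetical; only the idea of appending a format instruction reflects the approach described above.

```python
def build_prompt(question: str, interaction_type: str) -> str:
    """Attach a response-format instruction that matches the interaction type."""
    if interaction_type == "short_vqa":
        # Constrain the model to a single word or phrase for benchmark-style VQA.
        return f"{question}\nAnswer the question using a single word or phrase."
    # Conversational data keeps the open-ended question as-is.
    return question

print(build_prompt("What color is the bus?", "short_vqa"))
```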
LLaVA-1.5 marks significant progress in the multimodal field, and its open-source release promotes its widespread application in visual question answering, natural language processing, image generation, and other areas.