Meta releases Chameleon, a GPT-4o-like multimodal model

Meta recently published Chameleon, a multimodal model that sets a new benchmark in the development of multimodal models. Chameleon is a family of early-fusion, token-based mixed-modal models that can understand and generate images and text in any order. It uses a unified Transformer architecture, is trained on a mixture of text, images, and code, and tokenizes images so that it can generate interleaved sequences of text and images.
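
To make the early-fusion idea concrete, here is a minimal sketch of how text and image tokens could be interleaved into one sequence for a single Transformer. The tokenizer functions and boundary markers below are illustrative placeholders, not Meta's actual code.

```python
# Illustrative sketch only: the tokenizers and boundary markers here are
# hypothetical placeholders, not Chameleon's implementation.
from typing import List, Tuple

BOI, EOI = "<begin_image>", "<end_image>"  # hypothetical image boundary tokens

def tokenize_text(text: str) -> List[str]:
    # Stand-in for a learned BPE tokenizer; whitespace split keeps the sketch runnable.
    return text.split()

def tokenize_image(image_id: str) -> List[str]:
    # Stand-in for a learned image tokenizer that turns one image into a
    # fixed-length sequence of discrete codebook indices (1024 per image).
    return [f"<img:{image_id}:{i}>" for i in range(1024)]

def build_interleaved_sequence(segments: List[Tuple[str, str]]) -> List[str]:
    """Flatten alternating text/image segments into one token stream that a
    single autoregressive Transformer can both read and generate."""
    tokens: List[str] = []
    for kind, payload in segments:
        if kind == "text":
            tokens.extend(tokenize_text(payload))
        else:  # "image"
            tokens.extend([BOI, *tokenize_image(payload), EOI])
    return tokens

# A caption, an image, and a follow-up question become one flat sequence.
sequence = build_interleaved_sequence([
    ("text", "A chameleon resting on a branch."),
    ("image", "photo_0"),
    ("text", "What color is it?"),
])
print(len(sequence))  # text tokens + 1024 image tokens + 2 boundary tokens
```

Because both modalities end up in one token stream, the same next-token objective covers text-to-image, image-to-text, and arbitrarily interleaved generation.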

The innovation of the Chameleon model lies in its early-fusion approach: all modalities are mapped into a common representation space from the start, allowing the model to process text and images seamlessly. It demonstrates a wide range of capabilities across tasks including visual question answering, image captioning, text generation, image generation, and long-form mixed-modal generation. On image captioning, Chameleon achieves state-of-the-art performance; on text tasks it surpasses Llama-2 and is competitive with models such as Mixtral 8x7B and Gemini Pro.


Paper address: https://arxiv.org/pdf/2405.09818

The Chameleon model faced significant technical challenges, and the Meta research team introduced a series of architectural innovations and training techniques. For example, they developed a new image tokenizer that encodes a 512×512 image into 1024 discrete tokens drawn from a codebook of size 8192. For text, Chameleon uses a BPE tokenizer trained with the open-source sentencepiece library.
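
As a rough illustration of the numbers involved, the sketch below shows a vector-quantization lookup with the sizes quoted above (an 8192-entry codebook and 1024 tokens for a 512×512 image, i.e. a 32×32 grid of patch codes). The embedding width and the random stand-in encoder are assumptions, not details from the paper.

```python
# Minimal vector-quantization sketch using the sizes quoted in the article.
# The embedding width and the random "encoder" output are assumptions; the
# real image tokenizer is a learned model.
import numpy as np

CODEBOOK_SIZE = 8192      # discrete codebook entries (from the article)
TOKENS_PER_IMAGE = 1024   # a 512x512 image -> 32x32 grid of patch codes
EMBED_DIM = 256           # hypothetical embedding width

rng = np.random.default_rng(0)
codebook = rng.normal(size=(CODEBOOK_SIZE, EMBED_DIM))

def quantize(patch_embeddings: np.ndarray) -> np.ndarray:
    """Snap each patch embedding to the index of its nearest codebook vector."""
    # Squared distances via ||a||^2 - 2ab + ||b||^2 to avoid a huge broadcast.
    d2 = (
        (patch_embeddings ** 2).sum(axis=1, keepdims=True)
        - 2.0 * patch_embeddings @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    return d2.argmin(axis=1)

# Stand-in for the learned encoder's output on one 512x512 image.
patch_embeddings = rng.normal(size=(TOKENS_PER_IMAGE, EMBED_DIM))
image_tokens = quantize(patch_embeddings)
print(image_tokens.shape)  # (1024,) discrete IDs in [0, 8192)
```

On the text side, a BPE vocabulary of this kind can be trained with sentencepiece's standard trainer, for example `spm.SentencePieceTrainer.train(..., model_type="bpe")`.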

In the pre-training phase, Chameleon uses mixed-modal data, including plain text, text-image pairs, and multimodal documents with interleaved text and images. Pre-training is divided into two stages: the first stage is large-scale unsupervised learning, and the second stage mixes in higher-quality data.

In benchmark evaluations, the Chameleon model surpassed Llama-2 across the board, achieving strong results in common-sense reasoning, reading comprehension, math problems, and world knowledge. In human evaluations and safety testing, Chameleon-34B also far outperformed Gemini Pro and GPT-4V.

Although Chameleon lacks GPT-4o's speech capabilities, Meta's director of product management said they are very proud to support this team and hope to bring GPT-4o-like capabilities closer to the open-source community. This may mean that in the near future we could see an open-source equivalent of GPT-4o.

The release of the Chameleon model demonstrates Meta's significant progress in the field of multimodal models. It not only advances the development of multimodal models but also opens up new possibilities for future research and applications.
