Meta recently published Chameleon, a multimodal model family that sets a new benchmark for the development of multimodal models. Chameleon is an early-fusion, token-based mixed-modal model family that can understand and generate images and text in any order. It uses a unified Transformer architecture, is trained on mixed modalities of text, images, and code, and tokenizes images so that it can generate interleaved sequences of text and image tokens.
The innovation of the Chameleon model lies in its early-fusion approach: all modalities are mapped into a common representation space from the very beginning, allowing the model to process text and images seamlessly. It demonstrates a wide range of capabilities across tasks including visual question answering, image captioning, text generation, image generation, and long-form mixed-modal generation. On image captioning, Chameleon achieves state-of-the-art performance; on text-only tasks it surpasses Llama-2 and is competitive with models such as Mixtral 8x7B and Gemini-Pro.
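The following Python sketch illustrates the early-fusion idea of a single token sequence: image codebook indices are offset into the same vocabulary as text tokens and then interleaved with them, so one autoregressive Transformer can model both. The function names, vocabulary size, and special-token ids are illustrative assumptions, not Chameleon's actual API; only the image codebook size (8192) and the 1024 tokens per image come from the article.

```python
# Minimal sketch of early fusion as one interleaved discrete-token sequence.
# Names, TEXT_VOCAB_SIZE, and the BOI/EOI ids are illustrative assumptions.
from typing import List

TEXT_VOCAB_SIZE = 65_536       # assumed text vocabulary size (illustrative)
IMAGE_CODEBOOK_SIZE = 8_192    # image codebook size reported for Chameleon
BOI, EOI = 0, 1                # hypothetical "begin/end of image" special tokens

def image_token_to_unified_id(code: int) -> int:
    """Map an image codebook index into the shared vocabulary,
    offset past the text tokens so both modalities share one id space."""
    return TEXT_VOCAB_SIZE + code

def build_interleaved_sequence(text_ids: List[int], image_codes: List[int]) -> List[int]:
    """Interleave text tokens with a tokenized image so a single
    autoregressive Transformer can model the whole sequence."""
    image_ids = [image_token_to_unified_id(c) for c in image_codes]
    return text_ids + [BOI] + image_ids + [EOI]

# Example: a short caption followed by the 1024 tokens of one tokenized image.
sequence = build_interleaved_sequence(text_ids=[17, 254, 93], image_codes=list(range(1024)))
print(len(sequence))  # 3 text tokens + 2 special tokens + 1024 image tokens = 1029
```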
Paper address: https://arxiv.org/pdf/2405.09818
The Chameleon model posed significant technical challenges, and the Meta research team introduced a series of architectural innovations and training techniques to address them. For example, they developed a new image tokenizer that encodes a 512×512 image into 1024 discrete tokens (a 32×32 grid of codes) drawn from a codebook of size 8192. In addition, Chameleon uses a BPE tokenizer trained with the open-source sentencepiece library.
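For the text side, the sentencepiece library exposes a simple Python API for training and applying a BPE model. The sketch below shows the general workflow; the corpus path and vocabulary size are placeholders, not Chameleon's actual training configuration.

```python
# Minimal sketch of training and using a BPE tokenizer with sentencepiece.
# "corpus.txt" and vocab_size are placeholders, not Chameleon's real settings.
import sentencepiece as spm

# Train a BPE model on a plain-text corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="corpus.txt",              # placeholder corpus
    model_prefix="chameleon_bpe_demo",
    vocab_size=8000,                 # illustrative size only
    model_type="bpe",
)

# Load the trained model and tokenize a sentence into subword ids.
sp = spm.SentencePieceProcessor(model_file="chameleon_bpe_demo.model")
ids = sp.encode("Chameleon interleaves text and image tokens.", out_type=int)
print(ids)
print(sp.decode(ids))
```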
In the pre-training phase, Chameleon uses mixed-modal data, including plain text, text-image pairs, and multimodal documents with text and images interleaved. Pre-training is divided into two stages: the first stage trains on large-scale unsupervised data, and the second stage mixes in higher-quality data. A sketch of how such a mixture might be sampled follows.
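The snippet below sketches weighted sampling over mixed-modal data sources during pre-training. The source names and mixing weights are illustrative assumptions; the article only states that the mixture contains text-only data, text-image pairs, and interleaved documents.

```python
# Minimal sketch of sampling pre-training examples from mixed-modal sources.
# Source names and weights are illustrative assumptions, not Chameleon's recipe.
import random

SOURCES = {
    "text_only": 0.5,          # illustrative weight
    "text_image_pairs": 0.3,   # illustrative weight
    "interleaved_docs": 0.2,   # illustrative weight
}

def sample_source(rng: random.Random) -> str:
    """Pick a data source proportionally to its mixing weight."""
    names, weights = zip(*SOURCES.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in SOURCES}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # counts are roughly proportional to the mixing weights
```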
In benchmark evaluations, the Chameleon model surpassed Llama-2 across the board, achieving strong results in common-sense reasoning, reading comprehension, math problems, and world knowledge. In human evaluation and safety testing, Chameleon-34B also far outperformed Gemini Pro and GPT-4V.
Although Chameleon lacks the speech capabilities of GPT-4o, Meta's director of product management said they are very proud to support this team and hope to bring capabilities like GPT-4o's closer to the open-source community. This may mean that in the near future we could see an open-source counterpart to GPT-4o.
The release of Chameleon demonstrates Meta's significant progress in the field of multimodal models; it not only advances their development but also opens new possibilities for future research and applications.