Apple launches all-around visual model 4M-21 that can handle 21 different modalities

Apple and researchers at the École Polytechnique Fédérale de Lausanne (EPFL) in Switzerland have jointly developed a single any-to-any model that can be trained on dozens of highly diverse modalities and co-trained on large-scale multimodal datasets and text corpora. The model, named 4M-21, is trained on 21 different modalities and solves at least three times as many tasks as existing models without any loss of performance.

The study used the 4M pre-training scheme, improving the model's performance and adaptability by scaling up the model and dataset size, increasing the type and number of modalities involved in training, and co-training on multiple datasets. The researchers used different tokenization methods to discretize modalities with different characteristics, such as global image embeddings, human poses, and semantic instances. For the architecture, the study uses a Transformer-based 4M encoder-decoder and adds extra modality embeddings to accommodate new modalities.
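As a rough illustration of this design, the sketch below shows a Transformer encoder-decoder in which every discrete token receives an added, learned per-modality embedding, so a new modality can be supported by registering one more embedding row. The class name, vocabulary size, and dimensions are assumptions for illustration only, not the actual 4M-21 code.

```python
# Hypothetical sketch of the design described above: a Transformer
# encoder-decoder where each token is summed with a learned embedding
# for its modality. Sizes and names are illustrative.
import torch
import torch.nn as nn

class MultimodalEncoderDecoder(nn.Module):
    def __init__(self, vocab_size=16384, num_modalities=21, d_model=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)         # shared token vocabulary
        self.modality_emb = nn.Embedding(num_modalities, d_model)  # one row per modality
        self.backbone = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=6, num_decoder_layers=6,
            batch_first=True,
        )
        self.head = nn.Linear(d_model, vocab_size)  # predicts target-modality tokens

    def forward(self, src_tokens, src_modality, tgt_tokens, tgt_modality):
        # Adding the modality embedding tells the model which modality
        # each token stream belongs to.
        src = self.token_emb(src_tokens) + self.modality_emb(src_modality)
        tgt = self.token_emb(tgt_tokens) + self.modality_emb(tgt_modality)
        hidden = self.backbone(src, tgt)
        return self.head(hidden)

# Example: map 16 tokens of modality 0 (say, RGB) to 16 tokens of modality 3 (say, depth).
model = MultimodalEncoderDecoder()
src = torch.randint(0, 16384, (1, 16))
tgt = torch.randint(0, 16384, (1, 16))
logits = model(src, torch.zeros(1, 16, dtype=torch.long),
               tgt, torch.full((1, 16), 3, dtype=torch.long))
print(logits.shape)  # torch.Size([1, 16, 16384])
```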

The model not only performs a range of common vision tasks out of the box, such as DIODE surface normal and depth estimation, COCO semantic and instance segmentation, and 3DPW 3D human pose estimation, but can also generate any of its training modalities, supports several methods of fine-grained and multimodal generation, and can retrieve RGB images or other modalities by using other modalities as queries. In addition, the researchers conducted multimodal transfer experiments on NYUv2, Hypersim semantic segmentation, and ARKitScenes.
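To make the retrieval claim concrete, here is a minimal, hypothetical sketch of how retrieval with a non-RGB query could work once query and gallery embeddings live in a shared space: rank a gallery of image embeddings by cosine similarity to the query embedding. The embedding extraction itself is assumed and not shown; this is not the paper's code.

```python
# Illustrative cross-modal retrieval: rank gallery embeddings by cosine
# similarity to a query embedding from another modality. Embeddings are
# random stand-ins here.
import torch
import torch.nn.functional as F

def retrieve(query_emb: torch.Tensor, gallery_embs: torch.Tensor, k: int = 5):
    """Return indices of the k gallery entries most similar to the query."""
    query = F.normalize(query_emb, dim=-1)       # (d,)
    gallery = F.normalize(gallery_embs, dim=-1)  # (n, d)
    scores = gallery @ query                     # cosine similarities, shape (n,)
    return torch.topk(scores, k).indices

# Example with a 512-dim query and a gallery of 1000 image embeddings.
query = torch.randn(512)
gallery = torch.randn(1000, 512)
print(retrieve(query, gallery, k=3))
```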

Important functional features include:

Any-to-any modality: Increases the number of modalities from the previous best of 7 for any-to-any models to 21 different modalities, enabling cross-modal retrieval, controllable generation, and strong out-of-the-box performance.

Versatility: Adds support for more structured data such as human poses, SAM instances, and metadata.

Tokenization: Investigates discrete tokenization for different modalities, such as global image embeddings, human poses, and semantic instances, using modality-specific approaches (see the sketch after this list).

Scaling: Extends the model size to 3B parameters and the dataset to 0.5B samples.

Co-training: Co-trains on vision and language simultaneously.
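The sketch referenced in the tokenization item above: a minimal, illustrative way to discretize a continuous modality (such as a global image embedding) by nearest-neighbour lookup in a codebook. The codebook here is random and purely for illustration; 4M-21's actual modality-specific tokenizers are learned and are not reproduced here.

```python
# Toy discretization of continuous features into token ids via a codebook.
# Everything here (codebook size, dimensions) is an illustrative assumption.
import torch

def quantize(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map each feature vector to the index of its closest codebook entry."""
    dists = torch.cdist(features, codebook)  # (n, codebook_size) pairwise distances
    return dists.argmin(dim=-1)              # discrete token ids, shape (n,)

codebook = torch.randn(1024, 256)  # 1024 codes of dimension 256
features = torch.randn(8, 256)     # 8 feature vectors to discretize
tokens = quantize(features, codebook)
print(tokens)                      # e.g. tensor([ 17, 903, ...])
```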

  • Paper address: https://arxiv.org/pdf/2406.09406