Apple and researchers at the École Polytechnique Fédérale de Lausanne (EPFL) in Switzerland have jointly developed a single any-to-any model that can be trained on dozens of highly diverse modalities and co-trained on large-scale multimodal datasets and text corpora. The model, named 4M-21, is trained on 21 different modalities and solves at least three times as many tasks as existing models without a loss in performance.
The study builds on the 4M pre-training scheme, improving the model's performance and adaptability by scaling up the model and dataset sizes, increasing the number and diversity of modalities used in training, and co-training on multiple datasets. The researchers used modality-specific tokenization methods to discretize modalities with different characteristics, such as global image embeddings, human poses, and semantic instances. For the architecture, the study uses a Transformer-based 4M encoder-decoder and adds extra modality embeddings to accommodate new modalities.
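To make the architecture idea concrete, here is a minimal, hypothetical PyTorch sketch (not the paper's actual implementation) of an encoder-decoder in which every discrete token is summed with a learned modality embedding, so a new modality can be supported by registering one more embedding row. All class, parameter, and dimension names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultimodalEncoderDecoder(nn.Module):
    """Illustrative sketch: a Transformer encoder-decoder where each
    modality's discrete tokens receive a learned modality embedding."""

    def __init__(self, vocab_size=16384, dim=512, num_modalities=21,
                 num_layers=6, num_heads=8):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)
        # One learned embedding per modality (RGB, depth, poses, ...);
        # adding a modality means adding another row here.
        self.modality_emb = nn.Embedding(num_modalities, dim)
        self.transformer = nn.Transformer(
            d_model=dim, nhead=num_heads,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.to_logits = nn.Linear(dim, vocab_size)

    def embed(self, tokens, modality_ids):
        # tokens: (B, T) discrete token ids; modality_ids: (B, T) modality index per token
        return self.token_emb(tokens) + self.modality_emb(modality_ids)

    def forward(self, src_tokens, src_modalities, tgt_tokens, tgt_modalities):
        src = self.embed(src_tokens, src_modalities)
        tgt = self.embed(tgt_tokens, tgt_modalities)
        out = self.transformer(src, tgt)   # masking omitted for brevity
        return self.to_logits(out)         # predict target-modality tokens
```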
The model not only performs a range of common vision tasks out of the box, such as DIODE surface normal and depth estimation, COCO semantic and instance segmentation, and 3DPW 3D human pose estimation, but can also generate any of its training modalities, supports several methods for fine-grained and multimodal generation, and can retrieve RGB images or other modalities using any other modality as the query. In addition, the researchers conducted multimodal transfer experiments on NYUv2, Hypersim semantic segmentation, and ARKitScenes.
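As an illustration of how such cross-modal retrieval can work, the following is a minimal sketch assuming the model can predict a global embedding vector for any input modality; retrieval then reduces to cosine similarity against precomputed gallery embeddings. The `predict_global_embedding` call in the usage comment is hypothetical, not a real 4M-21 API.

```python
import torch
import torch.nn.functional as F

def retrieve(query_embedding, gallery_embeddings, top_k=5):
    """Rank gallery items (e.g. RGB images) by cosine similarity to a query
    embedding predicted from another modality (e.g. a depth map or caption)."""
    q = F.normalize(query_embedding, dim=-1)      # (D,)
    g = F.normalize(gallery_embeddings, dim=-1)   # (N, D)
    scores = g @ q                                # (N,) cosine similarities
    return torch.topk(scores, k=top_k)            # best-matching indices

# Hypothetical usage (illustrative names only):
# query = model.predict_global_embedding(depth_map)
# hits = retrieve(query, precomputed_rgb_embeddings)
```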
Key features include:
- Any-to-any modality: increases the number of modalities from the previous best of 7 for any-to-any models to 21, enabling cross-modal retrieval, controllable generation, and strong out-of-the-box performance.
- Versatility: adds support for more structured data, such as human poses, SAM instances, and metadata.
- Tokenization: investigates discrete tokenization of different modalities, such as global image embeddings, human poses, and semantic instances, using modality-specific approaches (see the sketch below).
- Scaling: scales the model to 3B parameters and the dataset to 0.5B samples.
- Co-training: co-trains on vision and language simultaneously.
- Paper address: https://arxiv.org/pdf/2406.09406
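For readers unfamiliar with discrete tokenization, the snippet below is a toy vector-quantization step, assuming a learned codebook; it only illustrates the general idea of mapping continuous modality features (patch embeddings, pose descriptors, global image embeddings) to discrete token ids, and is not the tokenizer used in the paper.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Toy vector-quantization step: map each continuous feature vector
    to the index of its nearest codebook entry, yielding discrete tokens."""

    def __init__(self, codebook_size=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, features):                           # features: (N, dim)
        d = torch.cdist(features, self.codebook.weight)    # (N, codebook_size)
        tokens = d.argmin(dim=-1)                          # discrete token ids
        quantized = self.codebook(tokens)                  # quantized features
        return tokens, quantized
```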