Researchers at Apple studied autoregressive image modeling (AIM) and found that AIM can effectively utilize large amounts of uncurated image data, and that its training methodology and stability are similar to those of recent large language models (LLMs). This observation is consistent with previous findings on scaling large language models.
The models used in the paper's experiments are limited in size, so further exploration is needed to see whether this scaling trend holds for models with larger parameter scales. The pre-training objective follows the standard autoregressive formulation applied to sequences of image patches, and a series of experiments verifies that model capacity can be easily scaled to billions of parameters while delivering good performance on downstream tasks.
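To make the idea of an autoregressive objective over image patches concrete, here is a minimal sketch in plain Python. It is not the researchers' implementation. The raster patch order, the mean-squared-error loss, and the toy `causal_mean_predictor` (a stand-in for the real model) are all assumptions made for illustration, since the text does not specify them.

```python
def patchify(image, patch):
    """Split a 2D grayscale image (list of rows) into flattened patches.

    Patches are emitted in raster order (left to right, top to bottom),
    which is an assumption made for this sketch.
    """
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[top + i][left + j]
                            for i in range(patch) for j in range(patch)])
    return patches


def causal_mean_predictor(prefix):
    """Hypothetical stand-in for the model.

    Predicts the next patch as the element-wise mean of the patches seen
    so far. It only ever looks at the prefix, never at later patches.
    """
    n = len(prefix)
    return [sum(p[k] for p in prefix) / n for k in range(len(prefix[0]))]


def autoregressive_loss(patches, predict):
    """Average squared error of predicting patch t from patches 0..t-1."""
    total, count = 0.0, 0
    for t in range(1, len(patches)):
        pred = predict(patches[:t])      # only earlier patches are visible
        target = patches[t]
        total += sum((a - b) ** 2 for a, b in zip(pred, target)) / len(target)
        count += 1
    return total / count


image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [2, 2, 3, 3],
         [2, 2, 3, 3]]
patches = patchify(image, 2)             # four flattened 2x2 patches
loss = autoregressive_loss(patches, causal_mean_predictor)
print(len(patches), loss)
```

In an actual training setup the toy predictor would be replaced by a trainable network, and the loss would be minimized over a large image collection. The structure of the objective, predicting each patch only from the patches that precede it, stays the same.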
In addition, the researchers explored multiple aspects of training ViT models with an autoregressive objective and revisited previous work. Their experiments show that, throughout training, improvements in the pre-training objective translate directly into better downstream performance, and that both the loss value and downstream accuracy improve as model capacity increases. This observation is consistent with the trend observed in LLMs.
Among AIM's design choices, in addition to scaling the model's width, the researchers adopted a simple design using multi-layer perceptron (MLP) blocks that process each patch independently. They also reiterate that the scale of the models studied is limited, and that validating this law on models with larger parameter scales remains a subject for further exploration.
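The phrase "process each patch independently" can be illustrated with a small sketch in plain Python. It assumes a two-layer MLP with a ReLU activation, no biases, and random weights. The layer sizes and the helper names `make_mlp`, `mlp_block`, and `apply_per_patch` are hypothetical and are not taken from the paper.

```python
import random


def make_mlp(dim, hidden, seed=0):
    """Create random weights for a two-layer MLP (sizes are illustrative)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(hidden)]
    w2 = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(dim)]
    return w1, w2


def mlp_block(patch, weights):
    """Apply linear -> ReLU -> linear to a single patch vector."""
    w1, w2 = weights
    hidden = [max(0.0, sum(w * x for w, x in zip(row, patch))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def apply_per_patch(patches, weights):
    """Run the same MLP on every patch; no information flows between patches."""
    return [mlp_block(p, weights) for p in patches]


weights = make_mlp(dim=4, hidden=8)
patches = [[0.0, 1.0, 2.0, 3.0],
           [1.0, 1.0, 1.0, 1.0],
           [3.0, 2.0, 1.0, 0.0]]
outputs = apply_per_patch(patches, weights)

# Replacing the middle patch leaves the outputs for the other patches unchanged.
changed = apply_per_patch([patches[0], [9.0, 9.0, 9.0, 9.0], patches[2]], weights)
print(outputs[0] == changed[0], outputs[2] == changed[2])
```

Because the same weights are applied to each patch separately, changing one patch cannot affect the output for any other patch. That is the property the sketch checks at the end.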
The paper's experimental results indicate that visual models also follow the rule of "the more parameters, the stronger the performance," and that autoregressive training scales well for image models and can meet the requirements of learning visual features. This offers a new research direction for improving and optimizing the performance of future image models.