A team at Apple has published a paper on arXiv introducing AIM, a vision model pre-trained with an autoregressive generative objective. The study shows that autoregressive pre-training of image features exhibits scaling behavior similar to that of its textual counterparts (i.e., large language models). Specifically, the paper reports two main findings: model capacity can be easily scaled to billions of parameters, and AIM effectively exploits large, unfiltered image datasets.
Paper address:
https://arxiv.org/pdf/2401.08541
Project address:
https://github.com/apple/ml-aim
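For a concrete picture of the pre-training objective, here is a minimal PyTorch sketch of autoregressive image modeling in the spirit of AIM: an image is split into patches in raster order, a causally masked Transformer processes the sequence, and each position regresses the pixels of the next patch. The class name `ToyAIM`, the hyperparameters, and the plain MSE loss are illustrative assumptions, not the paper's exact implementation; see the project repository above for the official code.

```python
# Toy autoregressive image-modeling objective (illustrative only, not the
# official AIM implementation): next-patch pixel regression with a causal mask.
import torch
import torch.nn as nn

class ToyAIM(nn.Module):
    def __init__(self, image_size=32, patch_size=8, dim=128, depth=2, heads=4):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = (image_size // patch_size) ** 2
        patch_dim = 3 * patch_size * patch_size
        self.embed = nn.Linear(patch_dim, dim)
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, patch_dim)  # predicts the next patch's pixels

    def patchify(self, imgs):
        # (B, 3, H, W) -> (B, N, 3*p*p), patches in raster (left-to-right, top-down) order
        p = self.patch_size
        x = imgs.unfold(2, p, p).unfold(3, p, p)       # B, 3, H/p, W/p, p, p
        x = x.permute(0, 2, 3, 1, 4, 5).flatten(1, 2)  # B, N, 3, p, p
        return x.flatten(2)                            # B, N, 3*p*p

    def forward(self, imgs):
        patches = self.patchify(imgs)
        tokens = self.embed(patches) + self.pos
        # Additive causal mask: patch i may only attend to patches 0..i.
        n = self.num_patches
        mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1).to(imgs.device)
        h = self.blocks(tokens, mask=mask)
        pred = self.head(h[:, :-1])   # predictions for patches 1..N-1
        target = patches[:, 1:]       # ground-truth pixels of the next patch
        return nn.functional.mse_loss(pred, target)

# Usage: compute the autoregressive reconstruction loss on a random batch.
loss = ToyAIM()(torch.randn(2, 3, 32, 32))
loss.backward()
```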