The biggest problem with AI image generation models is speed: generating a single image with ChatGPT or Stable Diffusion can take several minutes. Even Meta CEO Mark Zuckerberg complained about image generation speed at last year's Meta Connect conference.
The Hugging Face team is trying to speed that up with a new model called aMUSEd, which can generate images in just a few seconds.
This lightweight text-to-image model is based on Google's MUSE model and has roughly 800 million parameters, small enough to deploy on devices such as mobile phones. Its speed comes from how it is built: instead of the latent diffusion used in Stable Diffusion and other image generation models, aMUSEd uses an architecture called a Masked Image Model (MIM).
According to the Hugging Face team, MIM reduces the number of inference steps, improving both generation speed and interpretability, and the model's small size makes it faster still.
aMUSEd paper page: https://huggingface.co/papers/2401.01808
You can try aMUSEd via the demo on Hugging Face. The model is currently available as a research preview, but it is released under an OpenRAIL license, which means it can be freely experimented with and tweaked, and is also friendly to commercial adaptation.
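If you would rather run it locally, the model is also integrated into the diffusers library. The snippet below is a minimal sketch, assuming the AmusedPipeline class and the amused/amused-256 checkpoint published alongside the release; check the model card for the exact identifiers and recommended settings.

```python
from diffusers import AmusedPipeline

# Minimal text-to-image sketch; the model id and step count are assumptions
# based on the published checkpoints, so verify them against the model card.
pipe = AmusedPipeline.from_pretrained("amused/amused-256")
pipe = pipe.to("cuda")

image = pipe(
    "a photo of a red fox in the snow",
    num_inference_steps=12,  # MIM needs only a handful of refinement steps
).images[0]
image.save("fox.png")
```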
The quality of images generated by aMUSEd still has room for improvement, and the team openly acknowledges this, saying it is releasing the model to "encourage the community to explore non-diffusion frameworks like MIM for image generation."
According to the Hugging Face team, aMUSEd can perform zero-shot image inpainting, which Stable Diffusion XL cannot do.
As for how the images are generated in seconds: the MIM approach in aMUSEd is similar to techniques used in language modeling, where certain parts of the data are hidden (masked) and the model learns to predict the hidden parts. In aMUSEd's case, it is the image rather than the text that gets masked.
During training, the Hugging Face team uses a VQGAN (Vector Quantized Generative Adversarial Network) to convert each input image into a series of tokens. The image tokens are then partially masked, and the model learns to predict the masked portion from the unmasked tokens and the prompt, which is processed by a text encoder. During inference, the text prompt is converted into a format the model understands by the same text encoder; aMUSEd starts from a set of masked tokens and gradually refines the image.
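To make that training recipe concrete, here is a conceptual sketch of a single MIM training step. This is not the actual aMUSEd code: vqgan, text_encoder, and transformer stand in for the real components, and MASK_ID and mask_rate are illustrative values.

```python
import torch
import torch.nn.functional as F

MASK_ID = 8192      # hypothetical id of the special [MASK] token
mask_rate = 0.6     # fraction of image tokens hidden in this step (illustrative)

def mim_training_step(image, prompt_ids, vqgan, text_encoder, transformer):
    # 1. The VQGAN turns the image into a sequence of discrete token ids.
    with torch.no_grad():
        tokens = vqgan.encode(image)            # (batch, seq_len) integer ids
        text_emb = text_encoder(prompt_ids)     # (batch, text_len, dim)

    # 2. Randomly mask a subset of the image tokens.
    mask = torch.rand_like(tokens, dtype=torch.float) < mask_rate
    masked_tokens = tokens.masked_fill(mask, MASK_ID)

    # 3. The transformer predicts the original id of every masked position,
    #    conditioned on the unmasked tokens and the text embedding.
    logits = transformer(masked_tokens, text_emb)  # (batch, seq_len, vocab)

    # 4. Cross-entropy loss is computed only on the masked positions.
    loss = F.cross_entropy(logits[mask], tokens[mask])
    return loss
```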
During each refinement, the model predicts parts of the image, retains the parts it is most confident about, and continues to refine the rest. After a certain number of steps, the model’s predictions are processed through the VQGAN decoder to generate the final image.
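The inference loop can be sketched in the same spirit. The code below is a simplified illustration of confidence-based unmasking with a linear reveal schedule (the real models follow a cosine-style schedule); transformer, text_encoder, and vqgan_decoder are again placeholders.

```python
import torch

MASK_ID = 8192  # hypothetical id of the special [MASK] token

@torch.no_grad()
def mim_generate(prompt_ids, text_encoder, transformer, vqgan_decoder,
                 seq_len=256, num_steps=12):
    text_emb = text_encoder(prompt_ids)
    tokens = torch.full((1, seq_len), MASK_ID)   # start with every token masked

    revealed = 0
    for step in range(num_steps):
        logits = transformer(tokens, text_emb)   # (1, seq_len, vocab)
        confidence, prediction = logits.softmax(dim=-1).max(dim=-1)

        # Only positions that are still masked are candidates for reveal.
        still_masked = tokens == MASK_ID
        confidence = confidence.masked_fill(~still_masked, -1.0)

        # Decide how many tokens should be visible after this step, then
        # keep the most confident new predictions and re-mask the rest.
        target = seq_len if step == num_steps - 1 else (seq_len * (step + 1)) // num_steps
        num_new = target - revealed
        if num_new > 0:
            top = confidence.topk(num_new, dim=-1).indices
            tokens.scatter_(1, top, prediction.gather(1, top))
            revealed = target

    # Decode the completed token sequence back into pixels.
    return vqgan_decoder(tokens)
```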
aMUSEd can also be fine-tuned on custom datasets. Hugging Face showed a model fine-tuned with an 8-bit Adam optimizer and float16 precision, using less than 11GB of GPU VRAM.
The training script for model fine-tuning can be accessed here:
https://github.com/huggingface/diffusers/blob/main/examples/amused/train_amused.py
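As a rough illustration of the memory-saving combination mentioned above (an 8-bit Adam optimizer plus float16 precision), a fine-tuning loop could be wired up roughly as follows. This is a generic sketch rather than the training script itself: model, compute_loss, and dataloader are placeholders for the transformer, the masked-token loss, and your custom dataset.

```python
import torch
import bitsandbytes as bnb

# model, compute_loss, and dataloader are placeholders (see the note above).
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4)  # 8-bit Adam states
scaler = torch.cuda.amp.GradScaler()                          # for float16 training

for batch in dataloader:
    optimizer.zero_grad()
    with torch.autocast("cuda", dtype=torch.float16):
        loss = compute_loss(model, batch)      # masked-token prediction loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```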