AI portrait generation has recently become very popular. This article introduces InstantID, which achieves personalized image synthesis from a single reference face image while preserving identity with high fidelity, and supports a wide variety of styles.
Project homepage: https://instantid.github.io/
Code: https://github.com/InstantID/InstantID
Online demo: https://huggingface.co/spaces/InstantX/InstantID
1. Introduction to InstantID
InstantID was introduced in the paper "InstantID: Zero-shot Identity-Preserving Generation in Seconds". As the title suggests, it preserves a person's identity in a zero-shot manner and generates images in a few seconds.
InstantID is a solution based on diffusion models. Its plug-and-play module handles image personalization in a variety of styles from a single face image while maintaining high fidelity. At its core is a novel IdentityNet, which combines face and landmark images with text prompts and guides image generation by imposing strong semantic and weak spatial conditions.
Given a single reference ID image, InstantID aims to generate customized images in various poses or styles while maintaining high fidelity. It consists of three key components:
(1) An ID embedding that captures semantic face information;
(2) A lightweight adapter module with decoupled cross-attention, which allows an image to be used as a visual prompt;
(3) IdentityNet, which encodes detailed features of the reference face image with additional spatial control.
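To make component (2) more concrete, here is a minimal sketch of decoupled cross-attention in plain Python. It assumes the formulation popularized by IP-Adapter, in which the query attends to the text tokens and the image tokens in two separate attention passes and the two results are summed. The function names, the `image_scale` parameter, and the toy vectors are illustrative assumptions, not InstantID's actual code, and the learned projection layers of a real model are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-query scaled dot-product attention over plain Python lists."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def decoupled_cross_attention(query, text_kv, image_kv, image_scale=1.0):
    """Attend to text and image tokens separately, then add the two outputs.

    text_kv and image_kv are (keys, values) pairs. image_scale is a
    hypothetical knob that weights the image branch.
    """
    text_out = attend(query, *text_kv)
    image_out = attend(query, *image_kv)
    return [t + image_scale * i for t, i in zip(text_out, image_out)]


# Toy example: one text token and one image token.
query = [1.0, 0.0]
text_kv = ([[1.0, 0.0]], [[2.0, 4.0]])
image_kv = ([[0.0, 1.0]], [[10.0, 20.0]])
print(decoupled_cross_attention(query, text_kv, image_kv, image_scale=0.5))
# -> [7.0, 14.0]
```

Because the image branch is added on top of the text branch rather than mixed into it, setting `image_scale` to 0 leaves the text-only result unchanged. This is one plausible reason a weight such as ip_adapter_scale (see the notes in section 5) can trade identity strength against prompt adherence.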
2. InstantID Features
Feature 1: Generate an image in any style from a face
Feature 2: Editability
You can edit the generated images with text prompts, for example by changing a character's expression, the background, or other elements of the image. You can also use ControlNet to control the details of image generation more precisely and customize the result further.
Feature 3: Multiple references
Multiple reference images can be used to generate a new image, which makes the generated images richer and more diverse.
When multiple reference images are given, the average of their ID embeddings is used as the image prompt. InstantID still achieves good results with only one reference image.
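The averaging step can be sketched as follows. This is only an illustration in plain Python: the function name is hypothetical, the embeddings are toy vectors rather than real face-recognition features, and any extra processing a real pipeline might apply (such as normalization) is not shown.

```python
def average_id_embedding(embeddings):
    """Return the element-wise mean of several equal-length ID embeddings."""
    if not embeddings:
        raise ValueError("at least one embedding is required")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same length")
    count = len(embeddings)
    return [sum(e[i] for e in embeddings) / count for i in range(dim)]


# Two toy "ID embeddings" from two reference photos of the same person.
print(average_id_embedding([[1.0, 2.0], [3.0, 6.0]]))  # -> [2.0, 4.0]
```

With a single reference image the function simply returns that image's embedding, which matches the single-reference case described above.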
InstantID is also flexible enough to add identity attributes to non-human characters.
3. Comparison of InstantID with Similar Methods
Comparison 1: InstantID vs. IP-Adapter/IP-Adapter-FaceID/PhotoMaker
This comparison covers IP-Adapter (IPA), IP-Adapter-FaceID, and the more recent PhotoMaker. Of these, PhotoMaker needs to train the LoRA parameters of the UNet. Both PhotoMaker and IP-Adapter-FaceID achieve good fidelity, but their text control degrades noticeably. In contrast, InstantID achieves better fidelity while retaining good text editability, with faces and styles blending more naturally.
Comparison 2: InstantID vs. LoRA
InstantID achieves results competitive with LoRA without any training.
Comparison 3: InstantID vs. InsightFace Swapper
In non-realistic styles, InstantID blends the face and the background more flexibly.
4. InstantID User Experience
Next, let’s try it out in the Hugging Face demo.
The operating steps are explained at the top of the page, and the core workflow takes only 4 steps.
[Step 1]: Upload a personal photo
If the image contains several people, only the largest face is detected. Make sure the face is not too small and is not significantly occluded or blurred.
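The "largest face" rule can be illustrated with a short sketch. It assumes a face detector has already returned bounding boxes as (x1, y1, x2, y2) tuples. The function names and the `min_fraction` threshold are hypothetical and only show the idea, since the demo does not state what counts as "too small".

```python
def largest_face(boxes):
    """Return the (x1, y1, x2, y2) box with the largest area, or None if empty."""
    if not boxes:
        return None
    return max(boxes, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))


def is_large_enough(box, image_width, image_height, min_fraction=0.05):
    """Check that the face covers at least min_fraction of the image area.

    min_fraction is an assumed, illustrative threshold.
    """
    face_area = (box[2] - box[0]) * (box[3] - box[1])
    return face_area >= min_fraction * image_width * image_height


# Three detected faces in a 100x100 image; the second one is the largest.
boxes = [(0, 0, 10, 10), (5, 5, 50, 60), (20, 20, 30, 45)]
face = largest_face(boxes)
print(face)                             # -> (5, 5, 50, 60)
print(is_large_enough(face, 100, 100))  # -> True
```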
For example, we upload a photo of Fairy Zixia here.
[Step 2]: (Optional) Upload an image of another person as the pose reference
If no pose image is uploaded, the landmarks are extracted from the image uploaded in step 1. If you used a cropped face in step 1, uploading a pose image is recommended so that a new pose can be extracted.
[Step 3]: Write the prompt
Prompt: A beautiful woman was sitting on the grass in the park
[Step 4]: Generate the image
First select a style, then click the "Submit" button to generate the image. Below we look at the results for several different styles.
Style 1: WaterColor
Style 2: Film Noir (black and white film)
Style 3: Neon
Style 4: Jungle
Style 5: Mars
Style 6: Vibrant Color
Style 7: Snow
Style 8: Line art
Judging from the generated images, the character's appearance stays consistent across styles and closely resembles the original photo.
5. Related Notes
(1) If the similarity is not satisfactory, increase controlnet_conditioning_scale (IdentityNet) and ip_adapter_scale (Adapter) appropriately.
(2) If the generated image is oversaturated, reduce ip_adapter_scale. If that does not help, reduce controlnet_conditioning_scale.
(3) If the result does not follow the text prompt well, reduce ip_adapter_scale.
(4) Choosing a good base model is important.
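Notes (1) to (3) can be summarized as a small lookup, sketched below. The helper, its symptom names, the step size of 0.1, and the clamping range of 0.0 to 1.5 are all assumptions made for illustration. Only the two parameter names and the direction of each adjustment come from the notes above.

```python
def adjust_scales(symptom, controlnet_conditioning_scale, ip_adapter_scale, step=0.1):
    """Return (controlnet_conditioning_scale, ip_adapter_scale) after one adjustment.

    Hypothetical helper: the symptom labels, step, and clamping range are assumed.
    """
    def clamp(value):
        # Keep the value in an assumed valid range and tidy float noise.
        return round(min(max(value, 0.0), 1.5), 2)

    if symptom == "low_similarity":
        # Note (1): raise both weights.
        return clamp(controlnet_conditioning_scale + step), clamp(ip_adapter_scale + step)
    if symptom in ("oversaturated", "prompt_ignored"):
        # Notes (2) and (3): lower the adapter weight first.
        return controlnet_conditioning_scale, clamp(ip_adapter_scale - step)
    if symptom == "still_oversaturated":
        # Note (2), fallback: lower the IdentityNet weight.
        return clamp(controlnet_conditioning_scale - step), ip_adapter_scale
    raise ValueError(f"unknown symptom: {symptom}")


# Example starting values; the best values depend on the base model.
print(adjust_scales("low_similarity", 0.8, 0.8))  # -> (0.9, 0.9)
print(adjust_scales("oversaturated", 0.8, 0.8))   # -> (0.8, 0.7)
```

In practice it is usually easier to change one weight at a time and regenerate, so that the effect of each adjustment stays visible.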
That’s all for today. If you are interested, give it a try.