Microsoft Research Asia recently published a paper on its new VASA-1 model. All the user needs to do is provide a static portrait image and a voice audio clip, and the model will automatically make the person in the image speak.
What's particularly interesting about VASA-1 is its ability to simulate natural facial expressions, a range of emotions, and accurate lip synchronization. Most importantly, the output shows virtually no artificial traces, which are hard to detect unless you look closely.
The researchers admit that, like other similar models, VASA-1 is currently unable to properly handle non-rigid elements such as hair, but its overall results are superior to those of comparable models.
The researchers also say that VASA-1 supports the generation of short, dynamic 512×512 videos at 45 fps in offline batch-processing mode, and at 40 fps in online live-streaming mode with a latency of only 170 ms, and that the entire generation process can run on a single computer equipped with an NVIDIA RTX 4090 graphics card.