Learn about generative AI video in one article

Last year was a year of explosive growth for AI video. In January 2023 there was no publicly available text-to-video model; today there are dozens of AI video generation products with millions of users. Let's review the development of AI-generated video over the past year, along with the noteworthy technologies and applications. This article covers the following topics:

  • Current AI video classification
  • Generative AI video technology
  • AI video extension technology and applications
  • Generative AI Video Outlook
  • Challenges of Generative AI Video

AI Video Classification

AI videos can basically be divided into the following four categories:

1. Text/Picture Generated Video

As the name suggests, you can generate the corresponding video by entering a text description/uploading a picture.
Common products in this category include Runway, Pika, NeverEnds, PixVerse, SVD (Stable Video Diffusion), and others.
For example: Runway's cinematic film style, Pika's anime style, and NeverEnds' portrait models.
Of course, there are also extended applications, such as Alibaba's recently popular "King of Dance", which is based on a diffusion model at the bottom layer combined with other technologies such as ControlNet, which will be discussed later.

2. Video-to-Video Generation

This category usually includes style transfer, in-video content replacement, partial redrawing, and AI-based video enhancement.
For example, Wonder Studio's character CG replacement:
DomoAI's video style transfer
The technologies involved include video sequence frame generation with ControlNet processing, style-transfer LoRAs, video upscaling, face restoration, and so on.
Video Face Swap
Common tools include Faceswap, DeepFaceLab, etc. The technologies involved include face detection, feature extraction, face conversion, and post-optimization.

3. Digital Humans

Represented by HeyGen and D-ID, digital humans are created through a combination of face detection, voice-cloning TTS, and lip-sync technology.

4. Video Editing

Material Matching
You can search existing footage and stitch it into a finished video according to a given theme or requirement. The most commonly used editing tool is Jianying (CapCut), which can search for material online to match your script.
Key segment clipping
Convert long videos into shorter clips, suitable for formats such as talk shows. The technologies involved may include using OpenCV and TensorFlow to analyze video content and identify key segments, then using MoviePy to cut and assemble those segments into a short video. A sketch of the assembly step follows.
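Below is a minimal sketch of that assembly step, assuming the key segments have already been identified by an upstream analysis pass; the file names and timestamps are hypothetical, and only MoviePy's (1.x) standard clip-cutting API is used.

```python
from moviepy.editor import VideoFileClip, concatenate_videoclips

# Hypothetical output of an upstream analysis step (e.g. OpenCV/TensorFlow
# scoring each segment): (start_seconds, end_seconds) pairs judged "key".
key_segments = [(12.0, 18.5), (47.0, 55.0), (90.0, 101.5)]

source = VideoFileClip("talk_show_episode.mp4")  # assumed input file

# Cut each key segment out of the long video and stitch them together.
clips = [source.subclip(start, end) for start, end in key_segments]
highlight = concatenate_videoclips(clips)

highlight.write_videofile("highlights.mp4", codec="libx264", audio_codec="aac")
```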
Video enhancement
Video quality is improved through super-resolution, noise reduction, frame interpolation, and related techniques.
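As a rough illustration of how such a pipeline might look, here is a hedged per-frame sketch with OpenCV; it assumes the opencv-contrib package and a separately downloaded EDSR super-resolution model file, and is not any particular product's implementation.

```python
import cv2

# Hedged sketch: per-frame enhancement. Assumes opencv-contrib-python and a
# pre-trained EDSR model file ("EDSR_x2.pb", downloaded separately).
sr = cv2.dnn_superres.DnnSuperResImpl_create()
sr.readModel("EDSR_x2.pb")
sr.setModel("edsr", 2)  # 2x super-resolution

cap = cv2.VideoCapture("input.mp4")   # assumed input file
fps = cap.get(cv2.CAP_PROP_FPS)
writer = None

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Noise reduction, then super-resolution upscaling.
    denoised = cv2.fastNlMeansDenoisingColored(frame, None, 3, 3, 7, 21)
    upscaled = sr.upsample(denoised)
    if writer is None:
        h, w = upscaled.shape[:2]
        writer = cv2.VideoWriter("enhanced.mp4",
                                 cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    writer.write(upscaled)

cap.release()
if writer:
    writer.release()
```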

Generative AI Video Technology

As you can see, the applications of AI video above are varied, but the underlying technologies boil down to three: GANs, diffusion models, and the Transformer architecture that has been so popular in the large-model field over the past two years.
Of course, they also include the Variational Autoencoder (VAE) and diffusion's predecessor DDPM (Denoising Diffusion Probabilistic Models). We will not go into those here, but will mainly introduce the first three in plain language.

1. Generative Adversarial Networks (GANs)

As the name implies, a GAN consists of a generator and a discriminator. The generator is like a painter trying to draw realistic images from text descriptions, while the discriminator is like an appraiser trying to distinguish real paintings from those drawn by the generator. The two compete constantly: the generator gets better and better at drawing realistic images, and the discriminator gets smarter and smarter at telling real from fake, until the generator produces highly realistic images.
"Isn't it like when you were a kid and your teacher was standing by your side with a ruler to guide you in your studies?"
GAN also has some shortcomings:
  • Distortion: Compared to images generated by diffusion models, GAN outputs tend to have more artifacts and distortions.
  • Training stability: GAN training involves an adversarial process between the generator and the discriminator, which can be unstable and difficult to tune. In contrast, diffusion model training is more stable because it does not rely on adversarial training.
  • Diversity: Compared to GANs, diffusion models exhibit higher diversity in the images they generate, producing richer and more varied results without over-depending on specific patterns in the training dataset.
Around 2020, diffusion models started to gain more attention in academia and industry, especially as they performed well in various aspects of image generation.
But this does not mean that GAN is completely outdated. It has also been widely explored and applied in style transfer and super-resolution.

2. Diffusion Model

Diffusion models are inspired by non-equilibrium thermodynamics. The theory first defines a Markov chain of diffusion steps that slowly adds random noise to the data, and then learns the reverse diffusion process to construct the desired data samples from the noise.
To explain in layman's terms, the way a diffusion model works is a bit like a sculptor starting with a rough block of stone (or in our case, a blurry, disordered image) and gradually refining and tweaking it until a fine sculpture (i.e., a clear, meaningful image) is formed.
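For readers who want to see the "add noise, then learn to remove it" idea concretely, here is a small sketch of the DDPM-style forward (noising) process; the noise-schedule values are standard defaults, and the denoising network itself is only described in a comment.

```python
import torch

# Sketch of the diffusion forward (noising) process: a Markov chain that
# gradually mixes the data with Gaussian noise. The generative model is then
# trained to reverse this, predicting the noise at each step.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal retention

def add_noise(x0, t):
    """Jump directly to step t of the chain (closed-form noising)."""
    eps = torch.randn_like(x0)
    xt = alphas_bar[t].sqrt() * x0 + (1 - alphas_bar[t]).sqrt() * eps
    return xt, eps

x0 = torch.randn(1, 3, 64, 64)       # stand-in for a clean image/frame
xt, eps = add_noise(x0, t=500)

# Training objective (schematically): a denoising network eps_theta(xt, t) is
# fit with an MSE loss to recover eps, i.e. loss = ||eps_theta(xt, t) - eps||^2.
```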
The Runway and Pika products we are familiar with are both based on the diffusion model, but the details differ: the two products use two different technical architectures.
Pika - Per Frame
In the “Per Frame” architecture, the diffusion model processes each frame in the video separately, as if they were independent pictures.
The advantage of this method is that it can guarantee the image quality of each frame. However, it cannot effectively capture the temporal coherence and dynamic changes in the video because each frame is processed independently.
As a result, a certain degree of accuracy is lost. The early videos generated by Pika look a bit "blurry", which may be related to this.
Runway - Per Clip
The "Per Clip" architecture treats the entire video clip as a single entity.
In this approach, the diffusion model takes into account the temporal relationship and coherence between frames in the video.
Its advantage is that it can better capture and generate the temporal dynamics of videos, including the coherence of motion and behavior, and more completely preserve the accuracy of training video data.
However, the “Per Clip” approach may require a more complex model and more computational resources since it needs to handle the temporal dependencies in the entire video clip.
Compared with Pika's per-frame architecture, per-clip retains the information in the training video material more completely; its cost is higher, but so is its ceiling.
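A small shape-level sketch may help make the distinction concrete; it only illustrates how the same video tensor is presented to the model under the two architectures, not either product's actual code.

```python
import torch

# Illustrative only: how the two architectures "see" the same video tensor.
B, T, C, H, W = 2, 16, 3, 64, 64
video = torch.randn(B, T, C, H, W)

# "Per frame": time is folded into the batch, so each frame is denoised as an
# independent image -- good per-frame quality, no temporal context.
per_frame_input = video.reshape(B * T, C, H, W)
print(per_frame_input.shape)   # torch.Size([32, 3, 64, 64])

# "Per clip": the clip is kept as one 5D tensor, so the model (e.g. a 3D UNet
# with temporal attention) can relate frames to each other -- better motion
# coherence, at a higher compute cost.
per_clip_input = video.permute(0, 2, 1, 3, 4)  # (B, C, T, H, W)
print(per_clip_input.shape)    # torch.Size([2, 3, 16, 64, 64])
```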
Since the diffusion model itself is computationally intensive, this computational burden will increase dramatically when generating long videos, and temporal consistency is also a considerable test for the diffusion model.
The Transformer architecture is particularly good at processing long sequence data, which is an important advantage for generating long videos. They can better understand and maintain the temporal coherence of video content.

3. Transformer Architecture (LLM architecture)

In the language model, Transformer learns the rules and structure of language by analyzing a large amount of text, and then infers subsequent text through probability.
When we apply this architecture to image generation, compared to the diffusion model that creates order and meaning from chaos, the application of Transformer in image generation is similar to learning and imitating the "language" of the visual world. For example, it learns how colors, shapes, and objects combine and interact visually, and then uses this information to generate new images.
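As a schematic sketch of this "visual language" idea (not any specific model), the snippet below assumes an image has already been discretized into codebook tokens by something like a VQ tokenizer, and shows a causal Transformer trained with next-token prediction over those tokens.

```python
import torch
import torch.nn as nn

# Stand-in visual tokens, as if produced by a VQ tokenizer (assumed, not shown).
vocab_size, seq_len, d_model = 1024, 256, 512
tokens = torch.randint(0, vocab_size, (1, seq_len))

embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=4)
head = nn.Linear(d_model, vocab_size)

# Causal mask: each position may only attend to earlier tokens.
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)

hidden = backbone(embed(tokens), mask=causal_mask)
logits = head(hidden)                                  # (1, seq_len, vocab)

# Next-token prediction loss: each position predicts the token that follows it,
# just as a language model predicts the next word.
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
```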
Transformer architectures have unique advantages, including explicit density modeling and more stable training processes. They are able to exploit the correlation between frames to generate coherent and natural video content.
In addition, the largest diffusion models today have only 7 to 8 billion parameters, while the largest Transformer models may have reached the trillion-parameter level, a difference of two orders of magnitude.
However, the Transformer architecture faces challenges in terms of computing resources, training data volume, and time. Compared with the diffusion model, it requires more model parameters and has relatively higher requirements for computing resources and data sets.
Therefore, in the early days when computing power and data volume were tight, the Transformer architecture for generating videos/images was not fully explored and applied.

AI video extension technology and applications

"Photo Dance" - Animate anyone

Based on a diffusion model plus ControlNet-related technologies

Technical overview: the network starts from multiple frames of noise as the initial input and adopts a denoising UNet structure based on the Stable Diffusion (SD) design. It is similar to the familiar AnimateDiff, combined with pose control and consistency-optimization techniques similar to ControlNet.
The network core consists of three key parts:
1. ReferenceNet, responsible for encoding the appearance features of the characters in the reference image to ensure visual consistency.
2. Pose Guider, used to encode motion control signals to achieve precise control of character movements;
3. Temporal Layer, which processes time series information to ensure the smoothness and naturalness of character movement between consecutive frames. The combination of these three components enables the network to generate animated characters that are visually consistent, motionally controllable, and temporally coherent.
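The following is a deliberately over-simplified structural sketch, with stand-in module internals, of how these three components could feed into a denoising step; it illustrates only the data flow described above and is not the paper's implementation.

```python
import torch
import torch.nn as nn

class ReferenceNet(nn.Module):          # encodes appearance of the reference image
    def forward(self, ref_image):
        return ref_image.mean(dim=(2, 3), keepdim=True)     # stand-in feature

class PoseGuider(nn.Module):            # encodes the pose/motion control signal
    def forward(self, pose_maps):
        return pose_maps                                     # stand-in encoding

class TemporalLayer(nn.Module):         # mixes information across frames
    def forward(self, frames):
        return frames + frames.mean(dim=1, keepdim=True)     # stand-in smoothing

def denoise_step(noisy_frames, ref_image, pose_maps):
    appearance = ReferenceNet()(ref_image)        # visual consistency
    control = PoseGuider()(pose_maps)             # motion control
    x = noisy_frames + appearance.unsqueeze(1) + control
    return TemporalLayer()(x)                     # temporal coherence

frames = torch.randn(1, 16, 3, 64, 64)   # (batch, time, channels, H, W)
ref = torch.randn(1, 3, 64, 64)
poses = torch.randn(1, 16, 3, 64, 64)
out = denoise_step(frames, ref, poses)
```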

"Converting live video into animation" - DomoAI

The basic model is also based on the Diffusion Model, and is combined with style transfer.
The first step, ControlNet Passes Export, is used to extract the control channels as the basis for making the initial raw animation frames.
The second step is Animation Raw - LCM, which is the core of the workflow and is mainly used to render the main raw animation.
The third step is AnimateDiff Refiner - LCM, which is used to further enhance the raw animation, adding detail, upscaling, and refinement.
Finally, there is AnimateDiff Face Fix - LCM, which is specifically designed to improve facial images that are still not ideal after being processed by the refinement workflow.

"AI video face swap" - Faceswap

In general, face swapping mainly involves the following steps: face detection, feature extraction, face conversion, and post-processing.
AI video face-swapping technology, commonly referred to as "deepfake", is based on deep learning, typically using models such as GANs (generative adversarial networks) or autoencoders. Because the technology carries significant misuse risks, it will not be introduced in detail here.

AI Video Technology Outlook

"The future of unification?" - Transformer architecture

Not only can you see, but you can also hear
Google recently released VideoPoet, a video generation tool that can generate video and audio in one stop, supports longer video generation, and offers a good solution to the motion-consistency problems common in existing video generation, especially the continuity of large-scale motion.
Unlike most models used in the video field, VideoPoet did not take the route of diffusion, but was developed along the transformer architecture, integrating multiple video generation functions into a single LLM (Large Language Model Transformer architecture), proving that in addition to its outstanding text generation capabilities, transformers also have great potential in video generation. In addition, it can also generate sound at the same time and support language control to modify videos.
"The largest diffusion model has only 7 to 8 billion parameters, but the largest transformer model may have reached the trillion level. In terms of language models, large companies have spent 5 years and invested tens of billions of dollars to bring the models to their current scale. Moreover, as the scale of the model increases, the cost of the large model architecture also increases exponentially." said Jiang Lu, a scientist at Google.
Essentially, the video model based on the large language model Transformer architecture is still a "language model" because the training and model framework have not changed. It's just that the input "language" has been expanded to other modalities such as vision, which can also be discretized and represented as symbols.
In the early days, we did not see outstanding results from Transformers in video generation because of limits on resources, computing power, and video data. In recent years, however, with the rapid development of large language models driven by GPT and the accompanying investment, the "one-stop" multimodal large model spanning text, image, sound, and video will attract much attention.

Is AI video also about to usher in its GPT moment?

It is worth noting that although Transformer is the most popular architecture with a highly scalable and parallel neural network architecture, the memory requirement of the full attention mechanism in Transformer is quadratically proportional to the length of the input sequence. When processing high-dimensional signals such as video, this scaling will result in excessive costs.
Therefore, researchers proposed the Window Attention Latent Transformer (WALT), a Transformer-based latent video diffusion model (LVDM). It can also be said that:

Transformer and Diffusion Model coexist

WALT is a collaboration with Professor Fei-Fei Li and her students. It is based on diffusion but also uses Transformers, combining the advantages of the diffusion model with the power of the Transformer.
In this structure, the diffusion model is responsible for handling the generation and quality details of video images, while the Transformer uses its self-attention mechanism to optimize the correlation and consistency between sequences.
This combination makes the video not only more realistic visually, but also smoother and more natural in motion transitions. Therefore, in the next 1-2 years, Transformer and Diffusion Model will likely coexist.
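To illustrate the window-attention idea itself (not WALT's actual architecture), the sketch below restricts self-attention to fixed-size windows, which reduces the attention memory from roughly N² to N·w for sequence length N and window size w.

```python
import torch
import torch.nn.functional as F

# Full attention over N tokens costs ~N^2 memory; window attention splits the
# sequence into windows of size w and attends only within each window (~N*w).
N, w, d = 4096, 256, 64                 # sequence length, window size, head dim
q = torch.randn(1, N, d)
k = torch.randn(1, N, d)
v = torch.randn(1, N, d)

def window_attention(q, k, v, w):
    B, N, d = q.shape
    # Reshape into (B * num_windows, w, d) and attend within each window.
    qw = q.reshape(B * N // w, w, d)
    kw = k.reshape(B * N // w, w, d)
    vw = v.reshape(B * N // w, w, d)
    out = F.scaled_dot_product_attention(qw, kw, vw)
    return out.reshape(B, N, d)

out = window_attention(q, k, v, w)      # same shape as full attention output
print(out.shape)                        # torch.Size([1, 4096, 64])
```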

Challenges facing AI video technology

In the field of AI video technology, @闲人一坤, a well-known AI video creator, raised several key challenges.
First, the clarity of generated video needs further improvement to reach higher visual quality. Second, maintaining the consistency of characters in a video is a difficult problem, which involves accurately capturing and reproducing the characters' features and movements. Finally, controllability needs to improve, especially the ability to adjust in three-dimensional space; current technology is mostly limited to two-dimensional fine-tuning and cannot effectively adjust along the Z axis. These challenges point to key areas that require attention and improvement in the development of AI video technology.