The Alibaba research team recently launched AtomoVideo, a high-fidelity image-to-video (I2V) framework aimed at generating high-quality video content from static images. It is compatible with various text-to-image (T2I) models.
▲ Image source: AtomoVideo team paper
AtomoVideo features include:
- High fidelity: The generated video is highly consistent with the input image in detail and style
- Motion consistency: The video moves smoothly and stays temporally consistent, without abrupt jumps
- Video frame prediction: Supports the generation of long video sequences by iteratively predicting subsequent frames (see the sketch after this list)
- Compatibility: Works with existing T2I models
- High semantic controllability: Can generate customized video content based on the user's specific needs
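To make the frame-prediction idea concrete, here is a minimal sketch of how a long video could be grown by iterative prediction. Since no code has been released, `generate_clip` is a hypothetical stand-in for AtomoVideo's inference call, and its name, signature, and parameters are all assumptions:

```python
# Hypothetical sketch of iterative frame prediction for long videos.
# `generate_clip` stands in for the (unreleased) AtomoVideo inference
# call; its name and signature are assumptions, not an official API.

def extend_video(first_image, generate_clip, num_iterations=4,
                 frames_per_clip=16, context_frames=1):
    """Grow a video by repeatedly conditioning on the latest frames."""
    video = [first_image]
    for _ in range(num_iterations):
        # Condition the next clip on the most recent frame(s) so that
        # consecutive clips stay temporally consistent.
        context = video[-context_frames:]
        new_frames = generate_clip(context, num_frames=frames_per_clip)
        video.extend(new_frames)
    return video
```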
AtomoVideo uses a pre-trained T2I model as its basis, adding new one-dimensional temporal convolution and temporal attention modules after each spatial convolution layer and attention layer. The parameters of the T2I model are frozen, and only the added temporal layers are trained. Because the concatenated input image is encoded only by the VAE, it carries low-level information, which helps enhance the fidelity of the generated video relative to the input image. At the same time, the team injects high-level image semantics via cross-attention to achieve greater semantic controllability.
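The freeze-the-spatial, train-the-temporal recipe can be illustrated with a short PyTorch sketch. This is an assumption-laden illustration, not the released implementation: the layer sizes, the reshape convention, and the zero-initialized residual are all design choices made here for clarity.

```python
import torch
import torch.nn as nn

class TemporalAdapter(nn.Module):
    """Sketch of the described recipe: freeze a pre-trained spatial layer
    and train only a newly added 1D temporal layer placed after it. Assumes
    the spatial layer preserves channel count and spatial resolution."""

    def __init__(self, spatial_layer: nn.Module, channels: int):
        super().__init__()
        self.spatial = spatial_layer
        for p in self.spatial.parameters():
            p.requires_grad = False           # T2I weights stay fixed
        # 1D convolution that mixes information across the time axis only.
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        nn.init.zeros_(self.temporal.weight)  # start as an identity-like residual
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, t, c, h, w = x.shape
        # Fold frames into the batch for the frozen 2D spatial layer.
        x = self.spatial(x.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        # Fold spatial positions into the batch, convolve over frames.
        y = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        y = self.temporal(y).reshape(b, h, w, c, t).permute(0, 4, 3, 1, 2)
        return x + y                          # residual temporal update

# Example: wrap one spatial convolution and run an 8-frame batch.
layer = TemporalAdapter(nn.Conv2d(64, 64, 3, padding=1), channels=64)
out = layer(torch.randn(2, 8, 64, 32, 32))   # -> (2, 8, 64, 32, 32)
```

Zero-initializing the new temporal layer makes the wrapped network behave exactly like the original image model at the start of training, a common trick in temporal-adapter designs; whether AtomoVideo itself uses this initialization is not stated in the source.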
Currently, the team has released only the AtomoVideo paper and demonstration videos, with no online demo available. An official GitHub account has been opened, but it is used only for hosting the project website, and no code has been uploaded.