News on August 28: Zhipu AI has open-sourced the CogVideoX-5B video generation model. Compared with the previously open-sourced CogVideoX-2B, the company says its video generation quality is higher and its visual effects are better.
According to the official statement, the model's inference performance has been greatly optimized and the hardware threshold for inference greatly lowered: CogVideoX-2B can run on older graphics cards such as the GTX 1080 Ti, and the CogVideoX-5B model can run on mainstream "sweet-spot" desktop cards such as the RTX 3060.
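For reference, below is a minimal inference sketch using the Hugging Face diffusers library, which ships a CogVideoX pipeline. The memory-saving options shown (CPU offload, VAE tiling and slicing) are what make running on consumer GPUs feasible; the prompt and sampling settings are illustrative assumptions, not official recommendations.

```python
# Minimal sketch: generating a clip with CogVideoX-5B via diffusers.
# CPU offload and VAE tiling/slicing reduce peak VRAM so the model can run
# on consumer GPUs; the prompt and sampling settings are illustrative only.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # keep only the active sub-module on the GPU
pipe.vae.enable_tiling()         # decode the latent video in tiles to save memory
pipe.vae.enable_slicing()

video = pipe(
    prompt="A panda playing guitar in a bamboo forest, cinematic lighting",
    num_inference_steps=50,
    guidance_scale=6.0,
    num_frames=49,
).frames[0]

export_to_video(video, "cogvideox_sample.mp4", fps=8)
```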
CogVideoX is a large-scale DiT (diffusion transformer) model for text-to-video tasks. It mainly uses the following techniques:
- 3D causal VAE: compresses video data into a latent space along both the spatial and temporal dimensions, enabling efficient video reconstruction.
- Expert Transformer: concatenates text and video embeddings, uses 3D-RoPE as the positional encoding, applies expert adaptive LayerNorm to normalize the two modalities separately, and uses full 3D attention for joint spatiotemporal modeling (a conceptual sketch of 3D-RoPE follows this list).
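To make the 3D-RoPE idea concrete, here is a conceptual sketch rather than the project's actual implementation: each video token's channel dimension is split into three groups, and a standard 1D rotary embedding is applied to each group using that token's temporal, height, and width coordinates respectively. The dimensions and the 2:1:1 channel split below are assumptions for illustration.

```python
# Conceptual sketch of 3D-RoPE: split each token's channels into three groups and
# rotate each group with a 1D rotary embedding driven by the token's (t, h, w)
# coordinate. Dimensions and the 2:1:1 channel split are illustrative assumptions.
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply a standard 1D rotary embedding to x using integer positions pos."""
    dim = x.shape[-1]                                   # must be even
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = pos[:, None].float() * inv_freq[None, :]   # (tokens, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                 # interleaved channel pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x: torch.Tensor, t: torch.Tensor, h: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """x: (tokens, dim) video tokens; t/h/w: (tokens,) coordinates of each token."""
    dim = x.shape[-1]
    d_t, d_h = dim // 2, dim // 4                       # assumed 2:1:1 split
    return torch.cat([
        rope_1d(x[:, :d_t], t),                         # temporal axis
        rope_1d(x[:, d_t:d_t + d_h], h),                # height axis
        rope_1d(x[:, d_t + d_h:], w),                   # width axis
    ], dim=-1)

# Example: tokens from a 4-frame, 8x8 latent grid with 64 channels.
T, H, W, D = 4, 8, 8, 64
coords = torch.stack(torch.meshgrid(
    torch.arange(T), torch.arange(H), torch.arange(W), indexing="ij"), dim=-1).reshape(-1, 3)
tokens = torch.randn(T * H * W, D)
rotated = rope_3d(tokens, coords[:, 0], coords[:, 1], coords[:, 2])
print(rotated.shape)  # torch.Size([256, 64])
```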
The detailed parameters of CogVideoX-5B and CogVideoX-2B are as follows:
Related links:
- Code repository: https://github.com/THUDM/CogVideo
- Model download: https://huggingface.co/THUDM/CogVideoX-5b
- Paper link: https://arxiv.org/pdf/2408.06072