The shock at the start of 2024 also comes from OpenAI. Before everyone was back from the holiday, Altman launched Sora, the second killer app after ChatGPT. After watching the 60-second demo video, Shidao had only one thought: there is no way for anyone else to play. Returning quickly to reason: is there any opportunity left under Sora's rule? Let's start from a16z's published outlook, "Why 2023 Was AI Video's Breakout Year, and What to Expect in 2024," to see what space this track leaves for other players.
Make good use of the window period before the giants' "war of annihilation"
OpenAI's launch of Sora is not surprising; what is surprising is that Sora is unimaginably powerful. Two threads in the AI video track are very clear. First, AI-generated video is developing rapidly. In early 2023, the first public text-to-video models appeared; just 12 months later, dozens of video generation products such as Runway, Pika, Genmo, and Stable Video Diffusion were in use. a16z believes such huge progress shows that we are at the beginning of a large-scale change, one that resembles the early development of image generation. Text-to-video models are constantly evolving and improving, and branches such as image-to-video and video-to-video are also booming. Second, it is only a matter of time before the giants enter the market. 2024 is destined to be the year of multimodal AI. And yet most of the 21 public AI video models of 2023 came from startups.
On the surface, tech giants such as Google and Meta look as calm as a lake, but undercurrents run beneath the water. The giants have never stopped publishing papers on video generation, and they release video demos without committing to a model release date, exactly as OpenAI has done with Sora. The demo work is clearly mature, so why are the giants in no hurry to ship? a16z believes that legal, security, and copyright considerations make it hard for giants to turn research results into products, forcing them to postpone releases and handing new players a first-mover advantage.

Shidao believes the most critical factor is that the "network effect" barely matters here: the first mover is not the winner; the technology leader is. With Sora able to generate 60-second videos, will you still be attached to Pika, which generates 4-second ones? But this does not mean startups are completely out of luck. Under this rule the giants will not move too fast, so startups need to seize the "window period" and ship products as quickly as possible to attract a wave of new users and make some quick money, especially in the domestic market. To this, add the views of Jia Yangqing, former vice president of technology at Alibaba and now founder of an AI infrastructure startup:
1. Companies that benchmark themselves against OpenAI have a chance of being acquired by other large companies.
2. Small algorithm companies can either compete with OpenAI on algorithms, focus on vertical applications, or choose open source. (Startup Blog)
How strong is "top student" Sora?
Temporal coherence: characters, objects, and backgrounds remain consistent between frames without distortion.
Length: can the model produce videos longer than a few seconds?
Video length is closely tied to temporal coherence. Many products cap video length because no form of coherence can be guaranteed beyond a few seconds. If you see a long AI video, it is most likely stitched together from many short segments, often requiring dozens or even hundreds of prompts, roughly as in the sketch below.
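As an illustration of that stitching, here is a minimal sketch of how "long" clips are typically assembled today: generate short segments autoregressively, seeding each one with the final frame of the last. `generate_clip` is a hypothetical stand-in for any image-to-video API; none of the products above expose exactly this call.

```python
# Minimal sketch: fake a long video by chaining short generations,
# conditioning each new clip on the previous clip's last frame.
from typing import List, Optional


def generate_clip(prompt: str, init_frame: Optional[bytes] = None) -> List[bytes]:
    """Hypothetical stand-in for a text/image-to-video API.

    Returns a few seconds of video as a list of encoded frames.
    """
    raise NotImplementedError("replace with a real video-generation API call")


def generate_long_video(prompt: str, num_segments: int) -> List[bytes]:
    frames: List[bytes] = []
    last_frame: Optional[bytes] = None
    for _ in range(num_segments):
        # Seed each segment with the previous final frame so the scene
        # does not reset -- the main trick for faking temporal coherence.
        clip = generate_clip(prompt, init_frame=last_frame)
        frames.extend(clip)
        last_frame = clip[-1]
    return frames
```

Because each hop carries only a single frame of context, errors accumulate from segment to segment, which is exactly why coherence collapses after a few seconds.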
In the "learning results" generation stage, Sora generates video content based on text prompts. This process relies on Sora's brain - the Diffusion Transformer Model.
Through the pre-trained transformer, Sora recognizes the content of each "small puzzle piece" and, guided by the text prompt, quickly retrieves the pieces it has learned and assembles them into video content that matches the text.
Through the diffusion model, Sora removes unnecessary "noise" and gradually makes chaotic video information clearer.
Imagine a doodle book full of meaningless lines: following text instructions, Sora refines those meaningless lines into a picture with a clear theme. A toy version of this loop is sketched below.
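OpenAI has not published Sora's architecture or code, so the following is only a toy illustration of the diffusion-transformer idea described above: a transformer denoises latent "spacetime patches" conditioned on a text embedding. All shapes, the linear denoising "schedule," and the `ToyVideoDiT` class are illustrative assumptions, not Sora's actual design.

```python
# Toy diffusion transformer: denoise video latent patches given a text embedding.
import torch
import torch.nn as nn


class ToyVideoDiT(nn.Module):
    def __init__(self, patch_dim: int = 256, n_heads: int = 8, n_layers: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=patch_dim, nhead=n_heads, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.noise_head = nn.Linear(patch_dim, patch_dim)

    def forward(self, noisy_patches: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # Prepend the text embedding as an extra token; the "puzzle pieces"
        # (spacetime patches) attend to it and to each other.
        tokens = torch.cat([text_emb.unsqueeze(1), noisy_patches], dim=1)
        hidden = self.backbone(tokens)
        return self.noise_head(hidden[:, 1:])  # predicted noise per patch


@torch.no_grad()
def sample(model: ToyVideoDiT, text_emb: torch.Tensor,
           n_patches: int = 64, patch_dim: int = 256, steps: int = 50) -> torch.Tensor:
    # Start from pure noise and iteratively denoise -- the "doodle book"
    # being cleaned into a clear picture. A real sampler would follow a
    # proper DDPM/DDIM noise schedule; this linear update is a toy.
    x = torch.randn(1, n_patches, patch_dim)
    for _ in range(steps):
        x = x - model(x, text_emb) / steps
    return x  # denoised latent patches, to be decoded back into frames


model = ToyVideoDiT()
text_emb = torch.randn(1, 256)  # stand-in for a text encoder's output
latents = sample(model, text_emb)
```

The key design point, and the part that does match OpenAI's public description, is that video is cut into patch tokens so a transformer can treat generation the way language models treat text.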
Previous AI video models mostly modeled video data with technologies such as recurrent networks, generative adversarial networks, autoregressive Transformers, and diffusion models.
The result: "top student" Sora has understood the principles behind the dynamic changes of the physical world and can extrapolate to everything, while the other contestants have only memorized the solution to each individual problem and can merely copy it. No wonder they are being "beaten."
How will AI video products develop in the future?
According to a16z's outlook, there is still some room for improvement in AI video products.
First, where will high-quality training data come from?
Compared with other content modalities, video models are harder to train, mainly because high-quality, labeled training data is scarce. Language models are usually trained on public datasets such as Common Crawl, while image models are trained on labeled text-image pair datasets such as LAION and ImageNet.
Video data is harder to come by. There is no shortage of publicly viewable video on platforms like YouTube and TikTok, but it is unlabeled and may not be diverse enough (cat videos and influencer apologies, for example, may be overrepresented).
For this reason, a16z believes the "holy grail" of video data may come from studios or production companies that hold long videos shot from multiple angles, complete with scripts and directions. Whether they are willing to license this data for training, however, is still unknown.
Shidao believes that besides the technology giants, industry giants cannot be ignored in the long run: Netflix and Disney abroad, iQIYI and Tencent Video at home. These companies have accumulated billions of member reviews, know their audiences' habits and needs, and hold both data barriers and application scenarios. In January last year, Netflix released an AI-assisted animated short, "Dog and Boy," whose animation scenes were drawn by AI. In China, by comparison, the AI video track will likely remain dominated by the Internet giants.
Second, how will use cases segment across platforms and models?
a16z believes no single model can cover all use cases. Midjourney, Ideogram, and DALL-E, for example, each have distinctive styles and excel at generating different types of images. Similar dynamics are expected for video models. Products built around these models may differentiate further in workflow and serve different end markets: animated avatars (HeyGen), visual effects (Wonder Dynamics), video-to-video (DomoAI), and so on.
Shidao believes Sora will eventually solve these problems. But for domestic players, this may also be an opportunity for "middlemen to earn the spread."
Third, who will dictate the workflow?
Most current products focus on a single type of content and offer limited functionality. We often see videos made like this: Midjourney creates the image, Pika animates it, and Topaz upscales it. The creator then imports the video into an editing platform such as CapCut or Kapwing and adds music and voiceover (generated by Suno, ElevenLabs, or other products). Scripted, the hand-offs look like the sketch below.
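Every function in this sketch is a hypothetical wrapper; none of these products expose exactly this API. The point is the shape of the workflow: four or five separate tools, each touching the asset once.

```python
# Hypothetical wrappers around today's fragmented toolchain; bodies omitted.
def midjourney_image(prompt: str) -> str: ...   # text -> image file path
def pika_animate(image_path: str) -> str: ...   # image -> short video
def topaz_upscale(video_path: str) -> str: ...  # enhance resolution
def suno_music(prompt: str) -> str: ...         # text -> soundtrack file
def capcut_merge(video_path: str, audio_path: str) -> str: ...  # final edit


def make_video(prompt: str) -> str:
    image = midjourney_image(prompt)
    clip = pika_animate(image)
    upscaled = topaz_upscale(clip)
    music = suno_music(prompt)
    return capcut_merge(upscaled, music)  # the manually glued result
```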
This process is obviously not "smart" enough; what users really want is a "one-click generation" platform.
According to a16z, some emerging generation products will add more workflow capabilities and expand into other types of content generation, whether by training their own models, leveraging open-source models, or partnering with other vendors.
First, video generation platforms will start adding features. Pika, for example, lets users enlarge videos on its website. Sora, for its part, can create perfectly looping videos, animate static images, extend videos forward or backward, and edit existing video. But we will have to wait until it opens for testing to see how well the editing actually works.
Second, AI-native editing platforms are emerging that let users "plug in" different models and piece the content together. One possible shape for that plug-in mechanism is sketched below.
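This is a design sketch under stated assumptions, not any shipping product's actual API: a shared plugin interface plus a registry that the editing timeline dispatches into by name.

```python
# Sketch of a plugin registry for an AI-native editor.
from typing import Callable, Dict, List

# Each plugin maps an input asset path to an output asset path.
Plugin = Callable[[str], str]

REGISTRY: Dict[str, Plugin] = {}


def register(name: str):
    """Decorator: make a model available to the editing timeline by name."""
    def wrap(fn: Plugin) -> Plugin:
        REGISTRY[name] = fn
        return fn
    return wrap


@register("upscale")
def upscale(asset: str) -> str:
    return asset + ".upscaled.mp4"  # placeholder for a real model call


def run_pipeline(asset: str, steps: List[str]) -> str:
    # The editor pieces content together by chaining whichever plugins
    # the user dropped onto the timeline.
    for step in steps:
        asset = REGISTRY[step](asset)
    return asset


print(run_pipeline("draft.mp4", ["upscale"]))  # draft.mp4.upscaled.mp4
```

A registry like this is what would let a user swap Pika for Sora, or Suno for ElevenLabs, without rebuilding the timeline.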
Foreseeably, a large number of content producers will soon work with both AI-generated and human-created content. Products that can "smoothly" edit the two together will therefore be very popular, and this may be the newest opportunity for players.