Sora's Computing Math Problem
By Matthias Plappert
Compiled by: Siqi, Lavida, Tianyi
After launching the video generation model Sora last month, OpenAI just yesterday showcased a series of stunning works that creative professionals have made with Sora. There is no doubt that Sora is the strongest video generation model to date in terms of generation quality. Its emergence will not only have a direct impact on the creative industry, but will also affect how key problems in fields such as robotics and autonomous driving are solved.
Although OpenAI published a technical report on Sora, the report offers very limited technical detail. This article is compiled from research by Matthias Plappert of Factorial Funds; Matthias previously worked at OpenAI and contributed to the Codex project. In this research, Matthias examines Sora's key technical details, the model's innovations and their broader implications, and analyzes the compute requirements of video generation models like Sora. Matthias argues that as video generation becomes more widely used and relied upon, compute demand in the inference phase will quickly exceed that of training, especially for diffusion-based models like Sora.
According to Matthias's estimates, Sora's compute requirements in the training phase are several times higher than those of LLMs: training would take roughly 4,200-10,500 Nvidia H100s running for one month. Moreover, once the model has generated 15.3 million to 38.1 million minutes of video, the compute spent on inference will quickly exceed that spent on training. For comparison, users currently upload 17 million minutes of video to TikTok and 43 million minutes to YouTube every day. OpenAI CTO Mira Murati also mentioned in a recent interview that the cost of video generation is one reason Sora cannot yet be opened to the public; OpenAI hopes to consider opening it once the cost approaches that of DALL-E image generation.
OpenAI recently released Sora, which shocked the world with its ability to generate extremely realistic video scenes. In this post, we discuss in depth the technical details behind Sora, the potential impact of these video models, and some of our current thinking. Finally, we share our estimates of the compute required to train a model like Sora and compare projected training compute with inference compute, which is important for estimating future GPU demand.
Core Viewpoint
The key conclusions of this report are as follows:
- Sora is a diffusion model. It builds on DiT and latent diffusion, and scales up both model size and training dataset;
- Sora proves the importance of scaling up video models: continued scaling will be the main driving force for improving model capabilities, similar to LLMs;
- Companies such as Runway, Genmo, and Pika are exploring intuitive interfaces and workflows on top of diffusion-based video generation models like Sora, which will determine how widely and easily such models are adopted;
- Training Sora requires a huge amount of compute. We estimate it takes 4,200-10,500 Nvidia H100s training for one month;
- In the inference phase, we estimate each H100 can generate at most about 5 minutes of video per hour. The inference cost of diffusion-based models such as Sora is several orders of magnitude higher than that of LLMs;
- As video generation models like Sora are widely adopted, inference will come to dominate compute consumption over training. The tipping point is at 15.3 million to 38.1 million minutes of generated video, after which the compute spent on inference exceeds the compute spent on the original training. For comparison, users upload 17 million minutes of video to TikTok and 43 million minutes to YouTube every day;
- Assuming AI is widely adopted on video platforms, for example 50% of videos on TikTok and 15% of videos on YouTube being AI-generated, and taking hardware efficiency and usage patterns into account, we estimate that inference at peak demand requires about 720,000 Nvidia H100s.
Overall, Sora not only represents a significant advancement in the quality and capabilities of video generation, but also indicates that the demand for GPUs in inference may increase significantly in the future.
01. Background
Sora is a diffusion model. Diffusion models are widely used for image generation: OpenAI's DALL-E and Stability AI's Stable Diffusion, for example, are both diffusion-based. Recently emerged video generation companies such as Runway, Genmo, and Pika also most likely use diffusion models.
Broadly speaking, a diffusion model is a generative model that learns to create data similar to its training data, such as images or videos, by learning to reverse a process that gradually adds random noise to the data. Generation starts from pure noise, which the model removes step by step, refining the sample until it becomes a coherent and detailed output.
Schematic diagram of the diffusion process:
Noise is gradually removed until detailed video content emerges
Source: Sora Technical Report
This process is significantly different from how LLMs work: an LLM generates tokens one at a time through iteration, a process also called autoregressive sampling. Once the model has generated a token, that token does not change. We can see this when using tools such as Perplexity or ChatGPT: the answer appears word by word, as if someone were typing.
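To make the contrast concrete, here is a minimal, purely illustrative Python sketch of the two sampling loops; `denoise_step` and `next_token` are hypothetical stand-ins for the trained models, and the step count and tensor shape are arbitrary.

```python
import numpy as np

def diffusion_sample(denoise_step, num_steps=250, shape=(64, 64, 4)):
    """Start from pure noise and iteratively refine the whole sample."""
    x = np.random.randn(*shape)           # complete noise
    for t in reversed(range(num_steps)):  # every step updates the entire output
        x = denoise_step(x, t)            # model predicts a slightly less noisy version
    return x                              # decoded back to pixels afterwards

def autoregressive_sample(next_token, prompt_tokens, max_new_tokens=100):
    """Generate one token at a time; earlier tokens are never revised."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))  # each token is final once emitted
    return tokens
```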
02. Technical details of Sora
At the same time as Sora was released, OpenAI also published a technical report on Sora, but the report gives few details. However, Sora's design appears to be heavily influenced by the paper Scalable Diffusion Models with Transformers, in which the two authors propose a Transformer-based architecture for image generation called DiT. Sora appears to extend the work of this paper to video generation. Combining Sora's technical report with the DiT paper, we can reconstruct the overall logic of Sora fairly accurately.
Three important things to know about Sora:
1. Sora does not choose to work at the pixel space level, but chooses to diffuse in latent space (also known as latent diffusion);
2. Sora uses the Transformer architecture;
3. Sora appears to use a very large dataset.
Detail 1: Latent Diffusion
To understand the latent diffusion mentioned in the first point above, consider how an image is generated. We could diffuse directly over pixels, but that would be quite inefficient: a 512×512 image, for example, has 262,144 pixels. Alternatively, we can first convert the pixels into a compressed latent representation, run diffusion in this smaller latent space, and finally convert the result back to pixels. This conversion dramatically reduces computational complexity: instead of processing 262,144 pixels, we only need to process 64×64 = 4,096 latent values. This idea was the key breakthrough of High-Resolution Image Synthesis with Latent Diffusion Models and is also the basis of Stable Diffusion.
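A quick back-of-the-envelope check of the numbers above (a sketch, using the same 512×512 image and 64×64 latent grid):

```python
pixel_values = 512 * 512   # 262,144 pixel positions in the original image
latent_values = 64 * 64    # 4,096 positions in the compressed latent grid
print(pixel_values, latent_values, pixel_values // latent_values)  # 262144 4096 64
```

The diffusion process therefore operates on 64 times fewer positions than it would in pixel space.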
Mapping the pixels on the left to the potential representation represented by the grid on the right
Source: Sora Technical Report
Both DiT and Sora use latent diffusion. For Sora, an additional consideration is that video has a time dimension: a video is a temporal sequence of images, also called frames. From Sora's technical report, we can see that the encoding from pixels to latent space happens both spatially, compressing the width and height of each frame, and temporally, compressing across time.
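As an illustration, the encoder can be thought of as a mapping between tensor shapes. The sketch below uses hypothetical numbers: Sora's actual compression factors and latent channel count are not public, so the 8× factors simply follow the assumption used later in this article.

```python
# Hypothetical numbers; Sora's real compression factors and latent size are not disclosed.
frames, height, width, channels = 1440, 1080, 1920, 3   # raw 1-minute video at 24 fps, 1080p
spatial_factor, temporal_factor, latent_channels = 8, 8, 4

latent_shape = (
    frames // temporal_factor,   # compress across time
    height // spatial_factor,    # compress height
    width // spatial_factor,     # compress width
    latent_channels,             # learned latent channels per position
)
print(latent_shape)              # (180, 135, 240, 4)
```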
Detail 2: Transformer Architecture
Regarding the second point, both DiT and Sora replace the commonly used U-Net architecture with a plain Transformer architecture. This is critical because the DiT authors found that the Transformer architecture scales predictably: as the amount of compute increases, whether through longer training, a larger model, or both, model capabilities improve. Sora's technical report makes the same point for video generation and includes an intuitive figure.
Model quality improves as training compute increases. From left to right: base compute, 4× compute, and 32× compute.
This scaling property can be quantified by what we often call the scaling law, which is also an important property. Scaling laws have been studied before in the context of LLM and autoregressive models in other modalities. The ability to get better models by scaling is one of the key driving forces behind the rapid development of LLM. Since image and video generation also have scaling properties, we should expect scaling laws to apply to these fields as well.
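Scaling laws are usually written as power laws relating quality (for example, loss) to compute. The sketch below uses a generic form with made-up coefficients, purely to illustrate how such a relationship extrapolates quality from compute; it is not fitted to any real measurements.

```python
def power_law_loss(compute_flops, a=1.0e3, b=0.05):
    """Illustrative scaling law L(C) = a * C**(-b); coefficients are made up."""
    return a * compute_flops ** (-b)

for c in (1e21, 4e21, 32e21):    # base compute, 4x compute, 32x compute
    print(f"{c:.0e} FLOPS -> loss {power_law_loss(c):.1f}")
```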
Detail 3: Dataset
To train a model like Sora, the last key factor to consider is labeled data, and we believe the data pipeline holds most of Sora's secrets. Training a text-to-video model like Sora requires paired data of videos and their corresponding text descriptions. OpenAI did not say much about the dataset, but hinted that it is large: the technical report notes that they take inspiration from large language models, which acquire generalist capabilities by training on internet-scale data.
Source: Sora Technical Report
OpenAI has also published a method for annotating images with detailed text captions, which was used to build the DALL-E 3 dataset. In short, the method trains a captioner model on a labeled subset of the data and then uses that model to automatically caption the rest. Sora's dataset likely uses a similar technique.
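Schematically, such a re-captioning pipeline might look like the sketch below; `train_captioner` is a hypothetical stand-in, since OpenAI has not published the actual pipeline used for Sora.

```python
from typing import Callable, Iterable, List, Tuple

def build_caption_dataset(
    labeled: List[Tuple[str, str]],              # (video_path, human_caption) pairs
    unlabeled: Iterable[str],                    # video paths without captions
    train_captioner: Callable[[List[Tuple[str, str]]], Callable[[str], str]],
) -> List[Tuple[str, str]]:
    """Train a captioner on the labeled subset, then auto-caption everything else."""
    captioner = train_captioner(labeled)         # hypothetical training routine
    auto_labeled = [(path, captioner(path)) for path in unlabeled]
    return labeled + auto_labeled                # combined text-video training pairs
```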
03. Sora's influence
Video models are beginning to be applied in practice
The quality of the videos generated by Sora is undoubtedly a breakthrough in terms of detail and temporal coherence. For example, Sora can keep objects consistent even when they are temporarily occluded, and can accurately generate reflections on water. We believe Sora's current video quality is already good enough for certain types of scenes, and these videos can be used in some real-world applications. For example, Sora may soon replace some uses of stock video footage.
Map of the video generation landscape
However, Sora still faces some challenges. We do not yet know how controllable Sora is. Because the model outputs pixels, editing a generated video is difficult and time-consuming. To make the model useful, intuitive UIs and workflows also need to be built around it. As the figure above shows, Runway, Genmo, Pika, and other video generation companies are already working on these problems.
Because of scaling, we can expect video generation to improve quickly
As we discussed earlier, a key conclusion from the DiT study is that model quality improves directly with more computation. This is very similar to the scaling law we have observed in LLM. We can therefore expect that the quality of video generation models will rapidly improve further as the model is trained on more computational resources. Sora strongly demonstrates this, and we expect OpenAI and other companies to double down on this.
Synthetic Data Generation and Data Augmentation
In areas such as robotics and autonomous driving, data is essentially a scarce resource: there is no "internet" full of robots doing work or cars driving themselves. Typically, problems in these fields are tackled by training in simulated environments, by collecting data at scale in the real world, or by a combination of both. Both approaches have challenges: simulated data is often not realistic enough, while collecting real-world data at scale is very expensive, and gathering enough data on rare events is harder still.
As shown in the figure above, you can enhance the video by modifying some of its properties, such as rendering the original video (left) into a dense jungle environment (right)
Source: Sora Technical Report
We believe that models like Sora can be useful for these problems. One possibility is to use such models to directly generate 100% synthetic data. Sora could also be used for data augmentation, that is, transforming how existing videos are presented.
The data enhancement mentioned here can actually be illustrated by the example in the technical report above. In the original video, a red car is driving on a forest road. After Sora's processing, the video becomes a car driving on a tropical jungle road. We can fully believe that using the same technology to re-render, we can also achieve day and night scene conversion, or change weather conditions.
Simulations and World Models
"World Models" is a very valuable research direction. If the models are accurate enough, these world models can allow people to train AI agents directly in them, or these models can be used for planning and search.
Models like Sora implicitly learn a basic model of how the real world works from video data. Although this "emergent simulation" is currently flawed, it is still exciting: it shows that we may be able to train world models by using video data at scale. In addition, Sora seems able to simulate very complex scenes, such as the flow of liquids, the reflection of light, and the movement of fibers and hair. OpenAI even titled Sora's technical report "Video generation models as world simulators", which clearly shows that they believe this is the most important way in which the model will have an impact.
Recently, DeepMind demonstrated a similar effect with its Genie model: trained only on game videos, the model learned to simulate these games and even to create new ones. In this case, the model learned to condition on actions even though it never observed them directly. With Genie, the goal is likewise for agents to eventually be able to learn inside these simulated environments.
Video from Google DeepMind's Genie: Generative Interactive Environments
Overall, we believe that models like Sora and Genie will have a role to play if we want to train embodied agents such as robots at scale on real-world tasks. However, this approach also has limitations: because the model is trained in pixel space, it simulates every detail, including tiny movements in the video that are completely irrelevant to the task at hand. The latent space is compressed, but it still has to retain much of this information in order to map back to pixels, so it remains unclear whether planning can be done efficiently in latent space.
04. Estimation of computing power
We pay close attention to the computing resource requirements for model training and inference, which can help us predict how much computing resources will be needed in the future. However, since there is very little detailed information about Sora's model size and dataset, it is difficult to estimate these numbers. Therefore, our estimates in this section do not truly reflect the actual situation, so please refer to them with caution.
Deducing the computing scale of Sora based on DiT
The details about Sora are quite limited, but we can again refer to the DiT paper, which is clearly the basis for Sora, and use its numbers to infer how much compute Sora requires. The largest DiT model, DiT-XL, has 675 million parameters and was trained with roughly 10^21 FLOPS of total compute. To put this into context, that is equivalent to running about 0.4 Nvidia H100s for one month, or a single H100 for 12 days.
Currently, DiT only generates images, but Sora is a video model that can generate videos up to one minute long. If we assume the video is encoded at 24 frames per second (fps), a video contains up to 1,440 frames. Sora's pixel-to-latent mapping compresses both time and space. If we assume Sora uses the same compression rate as the DiT paper, that is, 8×, we get 180 frames in latent space. Simply extrapolating the DiT compute linearly to video therefore means Sora's compute is about 180 times that of DiT.
In addition, we believe Sora has far more than 675 million parameters; 20 billion parameters seems plausible, which adds another factor of roughly 30 on top of DiT's compute.
Finally, we believe Sora was trained on a much larger dataset than DiT. DiT was trained for 3 million steps at a batch size of 256, that is, on a total of 768 million images; note that this means the same data was reused many times, since ImageNet contains only 14 million images. Sora appears to have been trained on a mixed dataset of images and videos, but we know almost nothing about it. We therefore simply assume that Sora's dataset consists of 50% still images and 50% videos, and that it is 10-100 times larger than DiT's. However, repeatedly training on the same data points, as DiT did, is likely suboptimal when a larger dataset is available, so we consider a 4-10× increase in compute from the dataset a more reasonable multiplier.
Combining the above and considering our low and high estimates of dataset size, we arrive at the following calculation:
Formula: Sora compute ≈ DiT base compute × 30 (larger model) × 4-10 (larger dataset) × 180/2 (extra compute from 180 latent video frames, applied to the 50% of the dataset that is video)
- Conservative estimate of dataset size: 10^21 FLOPS × 30 × 4 × (180 / 2) ≈ 1.1×10^25 FLOPS
- Optimistic estimate of dataset size: 10^21 FLOPS × 30 × 10 × (180 / 2) ≈ 2.7×10^25 FLOPS
This amount of compute is equivalent to running 4,211-10,528 H100s for one month.
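The arithmetic can be reproduced as follows. This is a sketch under the assumptions above; the H100 throughput of roughly 989 TFLOPS (dense BF16, at full utilization) is an assumed figure consistent with "10^21 FLOPS ≈ one H100 for 12 days".

```python
dit_train_flops = 1e21            # approx. total training compute of DiT-XL
model_size_multiplier = 30        # ~20B parameters vs DiT-XL's 675M
frames_multiplier = 180 / 2       # 180 latent frames, applied to the ~50% of data that is video
h100_flops = 989e12               # assumed H100 dense BF16 throughput, FLOP/s
h100_month = h100_flops * 3600 * 24 * 30

for dataset_multiplier in (4, 10):  # conservative and optimistic dataset estimates
    total = dit_train_flops * model_size_multiplier * dataset_multiplier * frames_multiplier
    print(f"{total:.1e} FLOPS -> {total / h100_month:,.0f} H100-months")
# ~1.1e25 FLOPS -> ~4,200 H100-months; ~2.7e25 FLOPS -> ~10,500 H100-months
```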
Compute requirements: model inference vs. model training
Another important comparison is between the compute spent on training and on inference. In theory, even if the training compute is huge, it is a one-time cost. In contrast, although the compute required for a single inference is much smaller than training, it is incurred every time the model generates content and grows with the number of users. Therefore, as the model is widely adopted and the user base grows, inference becomes more and more important.
Therefore, it is also valuable to find the tipping point where inference compute exceeds training compute.
We compare the training and inference compute of DiT (left) and Sora (right). The Sora numbers are based on the estimates above and are therefore not fully reliable. We also show two estimates of training compute: a low estimate (assuming a 4× dataset-size multiplier) and a high estimate (assuming a 10× dataset-size multiplier).
For the above data, we again use DiT to extrapolate to Sora. For DiT, the largest model DiT-XL uses 524×10^9 FLOPS per diffusion step, and DiT uses 250 diffusion steps to generate one image, for a total of 131×10^12 FLOPS per image. We can see that the "inference-training tipping point" is reached after 7.6 million images have been generated, after which inference starts to dominate compute requirements. For reference, users upload about 95 million images to Instagram every day.
For Sora, we extrapolate the per-step FLOPS to 524×10^9 FLOPS × 30 × 180 ≈ 2.8×10^15 FLOPS. If we again assume 250 diffusion steps per video, the total is about 708×10^15 FLOPS per video. For reference, this is roughly equivalent to one Nvidia H100 generating 5 minutes of video per hour. With the conservative dataset estimate, 15.3 million minutes of video would have to be generated to reach the "inference-training tipping point"; with the optimistic dataset estimate, 38.1 million minutes. For reference, approximately 43 million minutes of video are uploaded to YouTube every day.
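The tipping-point figures can be checked under the same assumptions (per-step FLOPS extrapolated from DiT, 250 diffusion steps per video, and the same assumed ~989 TFLOPS per H100 as above):

```python
dit_step_flops = 524e9                         # DiT-XL FLOPS per diffusion step
sora_step_flops = dit_step_flops * 30 * 180    # model-size and frame multipliers
sora_video_flops = sora_step_flops * 250       # ~7.1e17 FLOPS per 1-minute video

h100_flops = 989e12                            # assumed H100 throughput, FLOP/s
minutes_per_h100_hour = h100_flops * 3600 / sora_video_flops

train_low, train_high = 1.1e25, 2.7e25         # training-compute estimates from above
print(minutes_per_h100_hour)                   # ~5 minutes of video per H100-hour
print(train_low / sora_video_flops / 1e6)      # ~15 million minutes to break even
print(train_high / sora_video_flops / 1e6)     # ~38 million minutes to break even
```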
A few more things to note: FLOPS are not the only thing that matters for inference. Memory bandwidth, for example, is another important factor. In addition, several teams are actively working on reducing the number of diffusion steps, which would lower the compute per sample and make inference faster. FLOPS utilization may also differ between training and inference, which is another important consideration.
In Consistency Models, published in March 2023 by Yang Song, Prafulla Dhariwal, Mark Chen and Ilya Sutskever, the authors note that diffusion models have driven significant progress in image, audio and video generation, but their reliance on an iterative sampling process makes generation slow. The paper proposes consistency models, which support fast generation in very few steps while still allowing extra sampling compute to be traded for higher sample quality. https://arxiv.org/abs/2303.01469
Inference compute requirements across different modalities
We also studied inference compute per unit of output for models in different modalities. The purpose is to see how much more compute-intensive inference is across model categories, which has a direct impact on compute planning and demand. Since the models operate in different modalities, their output units differ: Sora's unit of output is a 1-minute video, DiT's is a 512×512 pixel image, and for Llama 2 and GPT-4 we define a unit of output as a 1,000-token text document (for reference, the average Wikipedia article contains about 670 tokens).
Comparison of inference compute per unit of output: Sora outputs 1 minute of video per unit, GPT-4 and Llama 2 output 1,000 text tokens per unit, and DiT outputs one 512×512px image per unit. The chart shows that Sora's estimated inference compute is several orders of magnitude higher.
We compared Sora, DiT-XL, Llama 2 70B, and GPT-4 and plotted their FLOPS on a log scale. For Sora and DiT we used the inference estimates above; for Llama 2 and GPT-4 we used the rule of thumb FLOPS ≈ 2 × number of parameters × number of generated tokens as a quick estimate. For GPT-4, we assumed the model is an MoE model in which each expert has 220 billion parameters and 2 experts are activated per forward pass. Note that these GPT-4 figures have not been officially confirmed by OpenAI and are for reference only.
Source: X
We can see that diffusion-based models like DiT and Sora consume more computing power during inference: DiT-XL with 675 million parameters consumes about the same amount of computing power during inference as LLama 2 with 70 billion parameters. Furthermore, we can see that Sora's inference consumption is several orders of magnitude higher than GPT-4.
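The per-unit numbers behind this comparison can be reproduced with the rule of thumb above; the GPT-4 parameters are the unconfirmed MoE assumption stated earlier, not official figures.

```python
tokens_per_unit = 1000
llama2_flops = 2 * 70e9 * tokens_per_unit        # ~1.4e14 FLOPS per 1,000 tokens
gpt4_flops = 2 * (2 * 220e9) * tokens_per_unit   # ~8.8e14, assuming 2 active 220B experts
dit_flops = 524e9 * 250                          # ~1.3e14 per 512x512 image
sora_flops = 524e9 * 30 * 180 * 250              # ~7.1e17 per minute of video (estimate)

for name, flops in [("Llama 2 70B", llama2_flops), ("GPT-4 (assumed MoE)", gpt4_flops),
                    ("DiT-XL", dit_flops), ("Sora (estimate)", sora_flops)]:
    print(f"{name:>20}: {flops:.1e} FLOPS per unit of output")
```

This reflects the observation above: DiT-XL's per-image inference compute is in the same ballpark as Llama 2 70B's per-document compute, while Sora's estimate sits roughly three orders of magnitude above GPT-4's.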
It is important to note again that many of the numbers used in the above calculations are estimates and rely on simplifying assumptions. For example, they do not take into account the actual FLOPS utilization of the GPU, memory capacity and memory bandwidth limitations, and more advanced techniques such as speculative decoding.
Forecast of inference computing demand when Sora is widely used:
In this section, we start from Sora's compute requirements and estimate how many Nvidia H100s would be needed if AI-generated video were widely used on video platforms such as TikTok and YouTube.
• As mentioned above, we assume that each H100 can produce 5 minutes of video per hour, which means that each H100 can produce 120 minutes of video per day.
• On TikTok: users currently upload 17 million minutes of video per day (34 million videos with an average length of 30 seconds); assume an AI penetration rate of 50%;
• On YouTube: users currently upload 43 million minutes of video per day; assume an AI penetration rate of 15% (mainly videos under 2 minutes);
• So the total amount of videos produced by AI every day: 8.5 million + 6.5 million = 15 million minutes.
• Total number of Nvidia H100s required to support the creator communities on TikTok and YouTube: about 89,000.
However, the number of 89,000 may be too low, because we need to consider the following factors:
• We assumed a FLOPS utilization of 100% in our calculations, and did not consider memory and communication bottlenecks. A utilization of 50% would be more realistic, which means that the actual GPU demand is twice our estimated value;
• Inference demand is not evenly distributed over time but comes in bursts; more GPUs are needed to maintain service at peak times. We believe that accounting for peak traffic requires multiplying the GPU count by another 2;
• Creators may generate several candidate videos and then upload only the best one. If we conservatively assume an average of two generations per uploaded video, GPU demand doubles again;
In general, at peak traffic, approximately 720,000 H100s are needed to meet inference requirements.
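As a quick check, applying the three 2× adjustments from the list above to the base estimate:

```python
base_h100s = 89_000       # base estimate for steady-state demand
utilization = 2           # ~50% FLOPS utilization instead of the assumed 100%
peak_traffic = 2          # headroom for bursty, peak-time demand
retries = 2               # ~2 generations per uploaded video

peak_h100s = base_h100s * utilization * peak_traffic * retries
print(peak_h100s)         # 712,000, i.e. roughly 720,000 H100s
```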
This also confirms our belief that as generative AI models become more popular and widely relied upon, the computational requirements of the inference phase will dominate, especially for diffusion-based models like Sora.
It is also important to note that model scaling will further significantly increase the demand for inference computing. However, on the other hand, this increased demand can be offset in part by optimizing inference technology and other optimization methods for the entire technology stack.
Video content production directly drives the demand for models like Sora