Artificial intelligence is learning to understand and simulate the physical world in motion, with the goal of training models that help solve problems requiring real-world interaction.
Introducing Sora — a new text-to-video model. Sora can generate videos up to one minute long while maintaining visual quality and alignment with user prompts.
Sora can generate complex scenes with multiple characters, specific types of motion, and precise details of objects and backgrounds. The model understands not only what the user described in the prompt, but also how those elements exist in the physical world.
The model has a deep understanding of language, allowing it to accurately interpret prompts and create compelling characters that express vivid emotions. Sora can also generate multiple shots within a single video that keep characters and visual style consistent.
The current model has limitations. It may struggle with accurately modeling the physics of complex scenes and understanding specific cause-and-effect relationships. For example, a person might take a bite of a cookie, but the cookie may remain visually unchanged.
The model can also confuse spatial details in prompts—such as mixing up left and right—and may have difficulty precisely representing events that unfold over time, like following a specific camera trajectory.
Before Sora is made available in OpenAI products, several critical safety measures will be implemented.
The model will undergo rigorous adversarial testing by red teamers — experts in areas such as misinformation, hateful content, and bias.
Detection tools are also being developed, including a classifier that can identify when a video was generated by Sora. If Sora is integrated into an OpenAI product in the future, the plan is to include C2PA metadata as well, for added traceability.
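As a rough illustration only (OpenAI's actual detector is not public, so the architecture, names, and threshold below are hypothetical stand-ins), a provenance classifier of this kind could score sampled frames of a clip and average the results:

```python
# Hypothetical sketch of a generated-video detector. Nothing here reflects
# OpenAI's real classifier; the architecture and threshold are assumptions.
import torch
import torch.nn as nn


class FrameDetector(nn.Module):
    """Scores a single RGB frame for 'likely AI-generated'."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, 1)  # one logit: generated vs. real

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        x = self.features(frames).flatten(1)
        return self.head(x).squeeze(-1)


def video_looks_generated(frames: torch.Tensor, threshold: float = 0.5) -> bool:
    """frames: (num_frames, 3, H, W). Averages per-frame scores over the clip."""
    detector = FrameDetector()  # in practice, load trained weights here
    with torch.no_grad():
        probs = torch.sigmoid(detector(frames))
    return probs.mean().item() > threshold


# Example: 16 random 128x128 frames standing in for a decoded video clip.
print(video_looks_generated(torch.rand(16, 3, 128, 128)))
```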
In addition to new deployment safeguards, existing safety mechanisms from products like DALL·E 3 will also apply to Sora.
For instance, if integrated into an OpenAI product, a text classifier will automatically screen and reject prompts that violate usage policies — including requests for extreme violence, sexual content, hateful imagery, or content mimicking celebrities or copyrighted IP.
Robust image classifiers are also in place to review frames of each generated video to ensure they comply with usage policies before being shown to the user.
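A minimal sketch of how such a two-stage gate could be wired together is shown below; `text_prompt_allowed` and `frame_allowed` are placeholder functions, not OpenAI's production classifiers:

```python
# Illustrative two-stage moderation gate: screen the text prompt before
# generation, then review every frame of the output before showing it.
# Both classifiers are placeholders, not OpenAI's actual systems.
from typing import Callable, List


def text_prompt_allowed(prompt: str) -> bool:
    """Placeholder text classifier; a real system would use a trained model."""
    blocked_terms = ["extreme violence", "hateful"]  # illustrative only
    return not any(term in prompt.lower() for term in blocked_terms)


def frame_allowed(frame) -> bool:
    """Placeholder per-frame image classifier."""
    return True  # a real system would score the frame against usage policies


def generate_video_safely(prompt: str, generate_fn: Callable[[str], List]) -> List:
    if not text_prompt_allowed(prompt):
        raise ValueError("Prompt rejected by usage-policy screening.")
    frames = generate_fn(prompt)
    if not all(frame_allowed(f) for f in frames):
        raise ValueError("Generated video rejected by frame review.")
    return frames


# Example with a stub generator that returns three dummy frames.
frames = generate_video_safely("a calm beach at sunset", lambda p: ["f0", "f1", "f2"])
print(len(frames), "frames approved")
```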
OpenAI is also engaging with policymakers, educators, and artists around the world to better understand their concerns and explore positive use cases for this new technology.
Despite extensive research and testing, it's impossible to predict every beneficial or harmful use case. That’s why learning from real-world use is seen as a key part of developing and releasing increasingly safer AI systems over time.
Sora is a diffusion model: it generates a video by starting with one that looks like static noise and gradually removing the noise over many steps.
It can create entire videos at once or extend existing ones to make them longer. By giving the model foresight over many frames at a time, we address the challenge of keeping a subject consistent even when it temporarily goes out of view.
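As a hedged sketch of the general technique rather than Sora's actual implementation, a DDPM-style sampler can denoise a tensor holding every frame of the clip at once; the `denoiser` network, noise schedule, and sizes below are assumptions:

```python
# Schematic reverse-diffusion loop over a whole video clip at once.
# The denoiser, schedule, and sizes are illustrative assumptions.
import torch


def denoiser(x: torch.Tensor, t: int, cond: torch.Tensor) -> torch.Tensor:
    """Stand-in for the learned noise-prediction network (conditioned on the prompt)."""
    return torch.zeros_like(x)  # a trained model would return its noise estimate


def generate_video(cond: torch.Tensor, num_frames: int = 16,
                   height: int = 64, width: int = 64, steps: int = 50) -> torch.Tensor:
    # Standard linear beta schedule used by DDPM-style samplers.
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    # Start from pure noise covering the entire clip: (frames, channels, H, W).
    # Denoising all frames jointly is what lets a model keep a subject
    # consistent across the clip, even when it leaves the view.
    x = torch.randn(num_frames, 3, height, width)
    for t in reversed(range(steps)):
        eps = denoiser(x, t, cond)
        # DDPM mean update, removing a bit of the predicted noise each step.
        x = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x


video = generate_video(cond=torch.randn(1, 512))
print(video.shape)  # torch.Size([16, 3, 64, 64])
```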
Like GPT models, Sora uses a transformer-based architecture, offering strong scaling performance.
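Purely as a generic illustration of the building block involved (the token and layer sizes below are arbitrary and say nothing about Sora's real architecture), a transformer encoder can be run over a sequence of video token embeddings:

```python
# Generic transformer encoder over a sequence of video tokens; sizes are arbitrary.
import torch
import torch.nn as nn

embed_dim, num_tokens = 256, 128  # assumed sizes, for illustration only

encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)

# One batch of token embeddings standing in for an encoded video clip.
tokens = torch.randn(1, num_tokens, embed_dim)
out = encoder(tokens)
print(out.shape)  # torch.Size([1, 128, 256])
```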
Sora builds on prior research from DALL·E and GPT models. In particular, it uses the “re-captioning” technique from DALL·E 3, which generates detailed captions for the visual training data, allowing the model to follow user prompts more faithfully in the generated video.
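As a hedged sketch of what re-captioning amounts to in a data pipeline (the captioning model here is a placeholder, since DALL·E 3's captioner is not public), short original captions are swapped for detailed generated ones before training:

```python
# Illustrative re-captioning pass over a training set: each clip's short
# original caption is replaced by a detailed generated one.
# `caption_model` is a placeholder, not DALL·E 3's actual captioner.
from typing import Callable, List, Tuple


def recaption_dataset(
    items: List[Tuple[str, str]],            # (video_path, original_caption)
    caption_model: Callable[[str], str],     # path -> detailed caption
) -> List[Tuple[str, str]]:
    recaptioned = []
    for video_path, _original_caption in items:
        detailed = caption_model(video_path)  # long, descriptive text
        recaptioned.append((video_path, detailed))
    return recaptioned


def stub_captioner(path: str) -> str:
    """Stand-in for a real captioning model."""
    return f"A detailed, multi-sentence description of the contents of {path}."


pairs = recaption_dataset([("clip_001.mp4", "a dog")], stub_captioner)
print(pairs[0][1])
```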
Beyond text-to-video generation, Sora can animate still images with impressive detail and realism, or take an existing video and either extend it or fill in missing frames.
Learn more in our technical report.
Sora lays the foundation for models that can understand and simulate the physical world—an important milestone on the path to achieving AGI.