OpenAI teases an amazing new generative video model called Sora [technologyreview.com]:
It may be some time before we find out. OpenAI’s announcement of Sora today is a tech tease, and the company says it has no current plans to release it to the public. Instead, OpenAI will today begin sharing the model with third-party safety testers for the first time.
In particular, the firm is worried about the potential misuses [technologyreview.com] of fake but photorealistic video [technologyreview.com]. “We’re being careful about deployment here and making sure we have all our bases covered before we put this in the hands of the general public,” says Aditya Ramesh, a scientist at OpenAI, who created the firm’s text-to-image model DALL-E [technologyreview.com].
But OpenAI is eyeing a product launch sometime in the future. As well as safety testers, the company is also sharing the model with a select group of video makers and artists to get feedback on how to make Sora as useful as possible to creative professionals. “The other goal is to show everyone what is on the horizon, to give a preview of what these models will be capable of,” says Ramesh.
To build Sora, the team adapted the tech behind DALL-E 3, the latest version of OpenAI’s flagship text-to-image model. Like most text-to-image models, DALL-E 3 uses what’s known as a diffusion model. These are trained to turn a fuzz of random pixels into a picture.
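To see what “turning a fuzz of random pixels into a picture” means in practice, here is a minimal sketch of the standard diffusion training objective. It is not OpenAI’s code; the noise schedule, shapes, and the `toy_denoiser` stand-in are placeholders for illustration.

```python
# Sketch of the diffusion idea: noise is mixed into an image, and a model is
# trained to predict that noise so the clean image can be recovered step by step.
import numpy as np

rng = np.random.default_rng(0)

def add_noise(image, t, T=1000):
    """Forward process: blend the clean image with Gaussian noise at step t."""
    alpha = 1.0 - t / T                       # crude linear noise schedule (illustrative only)
    noise = rng.standard_normal(image.shape)
    return np.sqrt(alpha) * image + np.sqrt(1.0 - alpha) * noise, noise

def training_loss(denoiser, image, t):
    """The model is trained to predict the injected noise (mean-squared-error objective)."""
    noised, noise = add_noise(image, t)
    predicted = denoiser(noised, t)           # `denoiser` stands in for the learned network
    return np.mean((predicted - noise) ** 2)

# Toy stand-in for a learned denoiser: it just guesses zero noise.
toy_denoiser = lambda x, t: np.zeros_like(x)
image = rng.standard_normal((64, 64, 3))      # a fake 64x64 RGB "image"
print(training_loss(toy_denoiser, image, t=500))
```

At generation time this runs in reverse: starting from pure noise, the trained denoiser is applied repeatedly until an image emerges.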
Sora takes this approach and applies it to videos rather than still images. But the researchers also added another technique to the mix. Unlike DALL-E or most other generative video models, Sora combines its diffusion model with a type of neural network called a transformer.
Transformers are great at processing long sequences of data, like words. That has made them the special sauce inside large language models like OpenAI’s GPT-4 [technologyreview.com] and Google DeepMind’s Gemini [technologyreview.com]. But videos are not made of words. Instead, the researchers had to find a way to cut videos into chunks that could be treated as if they were. The approach they came up with was to dice videos up across both space and time. “It's like if you were to have a stack of all the video frames and you cut little cubes from it,” says Tim Brooks, a scientist at OpenAI who worked on Sora.
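The toy sketch below makes that “cutting cubes” idea concrete: a stack of frames is split into small spacetime blocks, each of which becomes one token. The tensor layout and patch sizes here are assumptions for illustration, not Sora’s actual values.

```python
# Illustrative spacetime patching: split a (frames, height, width, channels) video
# into small cubes and flatten each cube into a single token vector.
import numpy as np

def to_spacetime_patches(video, pt=4, ph=16, pw=16):
    """Cut a video of shape (T, H, W, C) into cubes of shape (pt, ph, pw, C),
    then flatten each cube into one row ("token")."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0, "dimensions must divide evenly"
    patches = (video
               .reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
               .transpose(0, 2, 4, 1, 3, 5, 6)           # group cube indices first
               .reshape(-1, pt * ph * pw * C))           # one row per cube
    return patches

video = np.random.rand(16, 128, 128, 3)                   # 16 frames of 128x128 RGB
tokens = to_spacetime_patches(video)
print(tokens.shape)   # (256, 3072): 4 x 8 x 8 cubes, each flattened to a 3072-dim token
```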
The transformer inside Sora can then process these chunks of video data in much the same way that the transformer inside a large language model processes words in a block of text. The researchers say that this let them train Sora on many more types of video than other text-to-video models, including different resolutions, durations, aspect ratios, and orientations. “It really helps the model,” says Brooks. “That is something that we're not aware of any existing work on.”
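The sketch below shows why that flexibility falls out of the transformer design: self-attention operates on a token sequence of any length, so clips with different resolutions or durations simply become shorter or longer patch sequences fed to the same model. The dimensions and weights here are arbitrary placeholders, not anything from Sora itself.

```python
# One self-attention layer over a sequence of patch tokens; the sequence length
# can vary freely, just as sentence length varies for a language model.
import numpy as np

rng = np.random.default_rng(1)
d_model = 64
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) * 0.05 for _ in range(3))

def self_attention(tokens):
    """tokens: (num_tokens, d_model) -> (num_tokens, d_model)."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(d_model)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

short_clip = rng.standard_normal((256, d_model))    # e.g. a short, low-resolution clip
long_clip = rng.standard_normal((1024, d_model))    # e.g. a longer, higher-resolution clip
print(self_attention(short_clip).shape, self_attention(long_clip).shape)
```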
OpenAI is well aware of the risks that come with a generative video model. We are already seeing the large-scale misuse of deepfake images [technologyreview.com]. Photorealistic video takes this to another level.