Ever since image-generating AIs like Stable Diffusion, DALL·E, and Midjourney became publicly available, people have been waiting for the day when they could create videos from simple text prompts, just as they already create images. It was obvious that video was the next big step for generative AI. And that day is here.
In an era where short-form video dominates the internet, OpenAI has once again pushed the boundaries of artificial intelligence with the launch of Sora, its first text-to-video AI model, which generates everything from animated and picturesque clips to photorealistic footage from simple text prompts.
This development is not just an incremental step for artificial intelligence; it is a giant leap forward, one that promises to redefine how we imagine, create, and consume video content. That’s why Sora’s announcement captured the attention, excitement, and imagination of developers and content creators alike.
Built on the back of OpenAI’s extensive research into models like DALL·E and GPT, Sora stands as a testament to AI’s potential to bridge the gap between textual concepts and visual storytelling, hinting at a shift toward more immersive and intuitive content creation that requires little to no editing skill.
At its core, Sora is a diffusion model: it starts with a frame of static noise and, over many iterations, gradually transforms that noise into a clear, coherent, vivid video. Sora also borrows the recaptioning technique from DALL·E 3, training on highly descriptive captions for its videos, which helps it follow textual cues, otherwise known as prompts, more faithfully.
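To make that noise-to-video process concrete, here is a minimal, heavily simplified sketch of a text-conditioned diffusion loop. Nothing in it reflects Sora’s actual architecture or any OpenAI API; the noise predictor, the prompt embedding, and the update rule are hypothetical placeholders meant only to illustrate the idea described above.

```python
import torch

# Conceptual sketch of a reverse-diffusion loop (NOT Sora's real code).
# A diffusion model starts from pure Gaussian noise and repeatedly
# removes the noise it predicts, conditioned on the text prompt.

def denoise(noise_model, prompt_embedding, num_steps=50,
            shape=(16, 3, 64, 64)):      # (frames, channels, height, width)
    x = torch.randn(shape)               # step 0: pure static noise
    for t in reversed(range(num_steps)):
        # Predict the noise still present at step t, given the prompt,
        # then remove a fraction of it. Real samplers use carefully
        # derived schedules; this linear update is a simplification.
        predicted_noise = noise_model(x, t, prompt_embedding)
        x = x - predicted_noise / num_steps
    return x                             # a denoised tensor of video frames

# Dummy stand-ins so the sketch runs end to end.
dummy_model = lambda x, t, emb: 0.1 * x  # placeholder noise predictor
dummy_prompt = torch.randn(512)          # placeholder text embedding
frames = denoise(dummy_model, dummy_prompt)
print(frames.shape)                      # torch.Size([16, 3, 64, 64])
```

The key intuition is that generation runs the noising process in reverse: each step conditions on the prompt, so the text steers what the noise resolves into.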
Judging by the demo videos OpenAI has published on its website and its dedicated YouTube channel, Sora could plausibly take over much of the work video editors do today. Beyond generating videos from prompts, its ability to extend existing videos or fill in missing frames adds another layer of utility, making it a versatile tool for both generating and editing video content.
The videos generated by Sora seem flawless at first glance. The details are sharp, and the rendering of landscapes and camera styles is remarkably accurate. You can ask it for a video of almost anything, in any style, from any camera angle, and chances are it will get all three right most of the time.
For most of its videos, it’s nearly impossible to tell they were generated by an AI unless you are told beforehand.
However, Sora, like any other AI model, is not without its quirks. Even though it can generate photorealistic landscapes, architecture, and sometimes studio-quality humans, it still struggles at times to depict humans, animals, and moving objects convincingly.
Generating realistic human hands and adhering to the nuanced laws of the physical world remain challenges that Sora cannot always overcome. It knows what hands look like, but it doesn’t fully understand how they should move or behave.
Similarly, as the demo videos show, when it comes to moving subjects, such as a person running or a candle’s flame flickering in the wind, Sora knows what they look like. But because it doesn’t truly understand the laws of physics, it sometimes fails to make the movement look natural.
Despite these quirks, Sora’s appeal extends beyond mere novelty, offering tangible benefits to content creators and filmmakers. Its capability to produce videos up to one minute long that are both realistic and imaginative opens up new possibilities for storytelling, and even with occasional lapses in physical accuracy, the quality of Sora-generated content offers a glimpse into the future of automated video production.
For content creators and studios, Sora presents an opportunity to revolutionize the production process. Work that once took studios days and several visual-effects artists can now be done in minutes. The fur of a computer-generated animal is a good example.
Studios like Disney and Pixar used to spend days or weeks making sure the fur on their animals and monsters looked realistic. Now Sora can produce it without human supervision or touch-ups. The ‘pups playing in the snow’ demo video shows how easily the AI creates such details without multiple people working on them for hours. And that’s just one example of how Sora can save man-hours.
Sora can also come in handy for filler or stock footage. Right now, if you need specific footage of a scenario or a landscape, you either have to license it from providers like Shutterstock or iStock, or, if it isn’t available there, hire a camera crew, build the set, and shoot it yourself. Neither option is ideal.
Sora’s ability to generate royalty-free video from text instructions not only cuts the time and resources video creation typically requires but also democratizes access to high-quality visual content. That could significantly impact industries reliant on video, from marketing and advertising to education and entertainment, by enabling more creators to bring their visions to life.
Author: Rifat Ahmed