On Friday, Meta announced a preview of Movie Gen, a new suite of AI models designed to create and manipulate video, audio, and images, including creating a realistic video from a single photo of a person. The company claims the models outperform other video-synthesis models when evaluated by humans, pushing us closer to a future where anyone can synthesize a full video of any subject on demand.
The company does not yet have plans for when or how it will release these capabilities to the public, but Meta says Movie Gen is a tool that may allow people to “enhance their inherent creativity” rather than replace human artists and animators. The company envisions future applications such as easily creating and editing “day in the life” videos for social media platforms or generating personalized animated birthday greetings.
Movie Gen builds on Meta’s previous work in video synthesis, following 2022’s Make-A-Video generator and the Emu image-synthesis model. Using text prompts for guidance, this latest system can generate custom videos with sound for the first time, edit and insert changes into existing videos, and transform images of people into realistic personalized videos.
Meta isn’t the only game in town when it comes to AI video synthesis. Google showed off a new model called “Veo” in May, and Meta says that in human preference tests, its Movie Gen outputs beat OpenAI’s Sora, Runway Gen-3, and Chinese video model Kling.
Movie Gen’s video-generation model can create 1080p high-definition videos up to 16 seconds long at 16 frames per second from text descriptions or an image input. Meta claims the model can handle complex concepts like object motion, subject-object interactions, and camera movements.
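To put those numbers in perspective, here is a minimal, hypothetical sketch in Python of what a request matching the specs Meta describes might look like. Meta has not published an API, so every name and parameter below is invented for illustration; only the figures (1080p, 16 fps, up to 16 seconds) come from the company’s claims.

from dataclasses import dataclass
from typing import Optional

@dataclass
class MovieGenRequest:
    # Hypothetical request object; Meta has not released a public interface.
    prompt: str                       # text description guiding the video
    image_path: Optional[str] = None  # optional image input to condition on
    width: int = 1920                 # 1080p output, per Meta's claims
    height: int = 1080
    fps: int = 16                     # 16 frames per second
    duration_s: int = 16              # up to 16 seconds long

def frames_to_synthesize(req: MovieGenRequest) -> int:
    # At the claimed settings, a maximum-length clip is 16 fps x 16 s = 256 frames.
    return req.fps * req.duration_s

req = MovieGenRequest(prompt="a golden retriever surfing a wave at sunset")
print(frames_to_synthesize(req))  # 256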
Even so, as we’ve seen with previous AI video generators, Movie Gen’s ability to generate coherent scenes on a particular topic is likely dependent on the concepts found in the example videos that Meta used to train its video-synthesis model. It’s worth keeping in mind that cherry-picked results from video generators often differ dramatically from typical results, and getting a coherent result may require lots of trial and error.
Speaking of training data, Meta says it trained these models on a combination of “licensed and publicly available datasets,” which very likely includes videos uploaded by Facebook and Instagram users over the years, although this is speculation based on Meta’s current policies and previous behavior.
The new vanguard in video deepfakes
Meta calls one of the key features of Movie Gen “personalized video creation,” but there’s another name for it that has been around since 2017: deepfakes. Deepfake technology has raised alarm among some experts because it could be used to simulate authentic camera footage, making people appear to do things they didn’t actually do.
In this case, creating a deepfake with Movie Gen appears to be as easy as providing a single input image of a person, along with a text prompt describing what you want them to do or where you want them to be in the resulting video. The system then generates a video featuring that individual, aiming to preserve their identity and motion while incorporating details from the prompt.
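To illustrate just how low that barrier is, the inputs can be modeled as simply as the hypothetical Python snippet below; this is a sketch based on Meta’s description, not the company’s actual (unreleased) interface, and every name in it is invented.

from dataclasses import dataclass

@dataclass
class PersonalizedVideoJob:
    # Hypothetical inputs for the "personalized video" feature described above.
    reference_image: str            # a single photo of the person
    prompt: str                     # what they should be doing, and where
    preserve_identity: bool = True  # keep the subject's face and motion consistent

job = PersonalizedVideoJob(
    reference_image="vacation_photo.jpg",
    prompt="hiking a mountain trail at golden hour",
)

In other words, a single photo and one sentence of text describe the entire job.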
This technology could be abused in myriad ways, including creating humiliating videos, putting people in compromising fake situations, fabricating historical context, or generating deepfake video pornography. Thanks to fluid and eventually real-time AI media synthesis, it brings us closer to a cultural singularity where truth and fiction in media are indistinguishable without deeper context.
In April, Microsoft demonstrated a model called VASA-1 that can create a photorealistic video of a person talking from a single photo and a single audio track, but Movie Gen takes things a step further by placing a deepfaked person inside a video scene, AI-generated or otherwise. However, Movie Gen does not yet appear to generate or synchronize speech.
Editing and sound synthesis
Meta also showed off a video editing component of Movie Gen, which allows for precise modifications to existing videos based on text instructions. It can perform localized edits like adding or removing elements, as well as make global changes such as altering the background or overall style.
Also, until now, every video-synthesis model we’ve used has created silent videos. Meta is bringing sound synthesis to AI video courtesy of a separate audio-generation model that, guided by text prompts, can produce ambient sound, sound effects, and instrumental background music synced to video content. The company claims this model can generate audio for videos of any length, maintaining coherent audio throughout.
Despite the advances, Meta acknowledges that the current models have limitations. The company plans to speed up video-generation time and improve overall quality by scaling up the models further. You can read more about how the Movie Gen models work in a research paper Meta also released today.
Meta also plans to collaborate with filmmakers and creators to integrate their feedback into future versions of the model. However, after warnings from the SAG-AFTRA actors’ union last year and divisive reactions to video synthesis from some industry professionals, we can imagine that not all of that feedback will be positive.