Google announces Lumiere, an AI model capable of generating realistic videos

Researchers from Google, the Weizmann Institute of Science, and Tel Aviv University have published a paper announcing Lumiere, a “space-time diffusion model” capable of generating realistic and stylized short videos and editing them on command.

In the paper, the researchers say that this model takes a different approach from existing ones (Pika, for example) by synthesizing videos that portray realistic, diverse, and consistent motion – a challenge they describe as “fundamental” in video generation. For now, the paper detailing the technology is freely available, but no model has been released to test.

To use Lumiere, users provide text input describing what they want in natural language. The model then generates a video that matches the prompt. Users can also upload a static image and add a prompt to turn it into a dynamic video.
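Since no model or public API has been released, the snippet below is purely a hypothetical sketch of those two input modes; LumiereModel, Video, and both method names are invented for illustration and deliberately left unimplemented.

```python
# Hypothetical sketch only: Lumiere has no public API, so every name
# here is invented to illustrate the two input modes described above.
from dataclasses import dataclass

@dataclass
class Video:
    frames: int = 80   # per the paper: 80 frames...
    fps: int = 16      # ...at 16 fps, i.e. a five-second clip

class LumiereModel:
    def generate(self, prompt: str) -> Video:
        """Text-to-video: synthesize a clip from a natural-language prompt."""
        raise NotImplementedError("no public Lumiere model exists to call")

    def image_to_video(self, image_path: str, prompt: str) -> Video:
        """Image-to-video: animate a static image, guided by a prompt."""
        raise NotImplementedError("no public Lumiere model exists to call")

model = LumiereModel()
# model.generate("a bear cub splashing in a mountain stream")
# model.image_to_video("bear.png", "the cub shakes water off its fur")
```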

The model also supports additional features, including inpainting, which edits specific objects in a video following text instructions; cinemagraphs, which add movement to a specific part of an otherwise static scene; and stylized generation, which uses a reference image’s style to guide the creation of the video.
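The paper does not describe a public editing interface either, so the runnable sketch below only illustrates the kind of binary mask a cinemagraph feature implies: marking which region of the frame should move while the rest stays frozen. The coordinates and region are invented.

```python
# A cinemagraph animates only a masked region of an otherwise static
# frame. This sketch just builds such a mask; the actual conditioning
# mechanism used by Lumiere is not public.
import numpy as np

height, width = 128, 128               # base-model resolution from the paper
mask = np.zeros((height, width), dtype=np.uint8)

# Hypothetical: animate a 40x40 patch (say, a waterfall) and freeze the rest.
mask[60:100, 40:80] = 1

print(f"animated pixels: {mask.sum()} of {mask.size}")
```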

While these features are not new to the industry, Lumiere uses an architecture called “Space-Time U-Net” to generate the entire temporal duration of a video in a single pass, leading to more realistic and consistent motion. According to the researchers, this differs from existing video models, which first synthesize distant keyframes and then apply temporal super-resolution (TSR) models to generate the missing intermediate frames.
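As a rough illustration of that idea, the toy PyTorch module below downsamples a clip in height, width, and time inside a single encoder-decoder, so the whole duration is processed in one pass rather than keyframe by keyframe. It is a minimal sketch of the principle, not Lumiere's actual architecture.

```python
# Toy sketch of the space-time idea: 3D convolutions shrink the clip in
# height, width, AND time, so the network reasons over the full duration
# at a coarse scale. This is an illustration, not Lumiere's architecture.
import torch
import torch.nn as nn

class ToySpaceTimeUNet(nn.Module):
    def __init__(self, ch: int = 16):
        super().__init__()
        # Encoder: each Conv3d halves time, height, and width.
        self.down = nn.Sequential(
            nn.Conv3d(3, ch, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv3d(ch, ch * 2, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
        )
        # Decoder: transposed convolutions restore the full space-time volume.
        self.up = nn.Sequential(
            nn.ConvTranspose3d(ch * 2, ch, kernel_size=4, stride=2, padding=1),
            nn.SiLU(),
            nn.ConvTranspose3d(ch, 3, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, channels, time, height, width); the entire video is
        # processed at once instead of keyframes plus temporal
        # super-resolution.
        return self.up(self.down(clip))

net = ToySpaceTimeUNet()
clip = torch.randn(1, 3, 16, 64, 64)  # 16 frames of 64x64 RGB noise
print(net(clip).shape)                # torch.Size([1, 3, 16, 64, 64])
```

A real diffusion U-Net would also take a noise-level embedding and text conditioning; those are omitted here to keep the space-time downsampling point visible.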

Lumiere’s video model was trained on a dataset of 30 million videos, along with their text captions, and generates 80 frames at 16 fps. The base model works at a low 128×128 resolution; in the paper, the researchers say the full pipeline produces 1024×1024 pixel videos that are five seconds long. The source of the training data, however, remains unclear at this early stage.
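Those figures are consistent with the quoted clip length: 80 frames at 16 fps is exactly five seconds, as the quick check below confirms.

```python
# Sanity check of the reported numbers: 80 frames at 16 fps = 5 seconds.
frames, fps = 80, 16
print(f"{frames} frames / {fps} fps = {frames / fps:g} s")  # -> 5 s
```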

Despite these limitations, the researchers conducted a user study and report that Lumiere’s results were preferred over those of existing AI video synthesis models. Lumiere also cannot yet generate videos that consist of multiple shots or involve transitions between scenes, an open challenge left for future research.
