arxiv:2401.12945

Lumiere: A Space-Time Diffusion Model for Video Generation

Published on Jan 23 · Submitted by akhaliq on Jan 24

#1 Paper of the day

Authors: Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Yuanzhen Li, Tomer Michaeli, Oliver Wang, Tali Dekel

Abstract

We introduce Lumiere -- a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion -- a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution -- an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales. We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation.
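
The abstract's central architectural point is that the Space-Time U-Net downsamples in time as well as in space, so the whole clip is processed at once at several space-time scales, whereas a keyframe cascade first generates a sparse set of frames and fills the gaps with temporal super-resolution. The following is a minimal shape-level sketch of that contrast under stated assumptions, not Lumiere's implementation: the 80-frame, 128×128 input, factor-2 downsampling per level, average pooling as the temporal downsampler, the keyframe stride of 8, and all names (`VideoShape`, `space_time_pyramid`, `temporal_avg_pool`, `keyframe_indices`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VideoShape:
    """Hypothetical helper: spatio-temporal size T x H x W of a video activation."""
    frames: int  # temporal length T
    height: int  # spatial height H
    width: int   # spatial width W


def _halve(n: int, name: str) -> int:
    """Divide a dimension by 2, refusing odd sizes instead of silently flooring."""
    if n % 2:
        raise ValueError(f"{name}={n} is not divisible by 2")
    return n // 2


def space_time_pyramid(shape: VideoShape, levels: int, temporal: bool = True) -> List[VideoShape]:
    """Shapes seen at each encoder level when H and W (and T if `temporal`) are halved per level."""
    scales = [shape]
    current = shape
    for _ in range(levels):
        current = VideoShape(
            frames=_halve(current.frames, "frames") if temporal else current.frames,
            height=_halve(current.height, "height"),
            width=_halve(current.width, "width"),
        )
        scales.append(current)
    return scales


def temporal_avg_pool(frames: List[List[float]]) -> List[List[float]]:
    """Toy temporal downsampling: average consecutive pairs of frames (each a flat list of pixels)."""
    if len(frames) % 2:
        raise ValueError("expects an even number of frames")
    return [[(a + b) / 2 for a, b in zip(f0, f1)] for f0, f1 in zip(frames[0::2], frames[1::2])]


def keyframe_indices(num_frames: int, stride: int) -> List[int]:
    """Frames a keyframe cascade would generate directly; the rest come from temporal super-resolution."""
    return list(range(0, num_frames, stride))


if __name__ == "__main__":
    clip = VideoShape(frames=80, height=128, width=128)  # illustrative size only
    for level in space_time_pyramid(clip, levels=3):
        print(level)
    print(space_time_pyramid(clip, levels=3, temporal=False)[-1])
    print(keyframe_indices(clip.frames, stride=8))
```

In this toy setting, the coarsest space-time level is `VideoShape(frames=10, height=16, width=16)`, a compact summary of the entire clip, so layers at that level can relate content across its full duration. With spatial-only downsampling the coarsest level still carries all 80 frames. The keyframe cascade, by contrast, commits to 10 widely spaced frames before any in-between motion exists, which is the source of the global-consistency difficulty the abstract describes.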

Community

MonsterMMORPG

Jan 24

No model, just a showcase once again? Google, you are falling way, way behind in the AI race.

susanlee

Jan 24

decide

GoldMan2023

Jan 24

The video is amazing, but where is the action?! You could take first place in the AI race, so why don't you want it? What would push you to do this?!