Researchers at Google announced the development of an artificial intelligence system, Imagen Video, capable of generating video from textual prompts at a resolution of 1280×768 pixels and a frame rate of 24 frames per second.
The tool is based on the Imagen algorithm, an analogue of DALL-E 2 and Stable Diffusion. The image generator uses a large pre-trained language model and a cascaded diffusion model, combining "a deep level of understanding of words with an unprecedented degree of photorealism".
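For context, the "understanding of words" comes from a frozen pre-trained text encoder; the Imagen paper uses T5-XXL. Below is a minimal sketch of that text-encoding step, assuming the smaller t5-large checkpoint from Hugging Face transformers; the checkpoint choice, prompt, and variable names are illustrative, not Google's actual pipeline.

```python
# Minimal sketch: a frozen pre-trained text encoder turns a prompt into
# embeddings that condition the diffusion cascade. t5-large stands in for
# the much larger T5-XXL used by Imagen (an assumption for this sketch).
import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("t5-large")
encoder = T5EncoderModel.from_pretrained("t5-large").eval()  # frozen: no training

prompt = "a teddy bear washing dishes"  # illustrative prompt
tokens = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    text_embeddings = encoder(**tokens).last_hidden_state  # (1, seq_len, 1024)

# In a cascaded text-to-video system, embeddings like these would be passed
# to every model in the cascade as conditioning for the denoising steps.
print(text_embeddings.shape)
```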
According to Google researchers, Imagen Video takes a text description and creates a 16-frame clip at a resolution of 24×48 pixels and a frame rate of 3 FPS. The system then scales it up and "predicts" additional frames.
As a result, the algorithm generates a 128-frame animation at a resolution of 1280×768 pixels and a frame rate of 24 FPS.
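To illustrate the arithmetic of this cascade, here is a runnable toy sketch in which the diffusion stages are replaced by nearest-neighbour upsampling stand-ins. The function names (base_model, temporal_sr, spatial_sr) are hypothetical; only the frame counts, resolutions, and frame rates come from the article.

```python
# Toy sketch of the cascade's shape arithmetic. Real diffusion models are
# replaced by cheap upsampling stand-ins so the numbers can be verified.
import numpy as np

def base_model(rng):
    # Stage 1 stand-in: base output, 16 frames at 24x48 px, 3 FPS (~5.3 s)
    return rng.integers(0, 256, size=(16, 24, 48, 3), dtype=np.uint8)

def temporal_sr(video, factor=8):
    # Stage 2 stand-in: "predict" in-between frames, 16 -> 128 (3 -> 24 FPS)
    return np.repeat(video, factor, axis=0)

def spatial_sr(video, out_h=768, out_w=1280):
    # Stage 3 stand-in: spatial super-resolution to 1280x768; nearest-neighbour
    # index mapping copes with the non-integer width scale (48 -> 1280)
    _, h, w, _ = video.shape
    ys = np.arange(out_h) * h // out_h
    xs = np.arange(out_w) * w // out_w
    return video[:, ys][:, :, xs]

video = spatial_sr(temporal_sr(base_model(np.random.default_rng(0))))
print(video.shape)       # (128, 768, 1280, 3) -- about 380 MB of uint8
print(16 / 3, 128 / 24)  # both ~5.33 s: upscaling preserves clip duration
```

The final check makes the article's numbers consistent: 16 frames at 3 FPS and 128 frames at 24 FPS both span roughly 5.3 seconds, so the cascade adds smoothness and detail without changing the clip's length.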
To train Imagen Video, developers used 14 million video-caption pairs and 60 million image-text pairs, as well as the public dataset LAION-400M, which helped the model generalize to a range of aesthetics.
During testing, researchers found that the algorithm could produce videos in a "watercolor" style or mimic the style of Van Gogh. They said Imagen Video demonstrated an understanding of depth and three-dimensionality, enabling it to generate videos that resemble drone footage.
The system can also render text correctly.
"Unlike Stable Diffusion and DALL-E 2, which struggle to turn a prompt like 'logo for Diffusion' into legible words, Imagen Video reproduces it without issue," the project paper states.
According to Matthew Guzdial, an AI researcher at the University of Alberta, the problem of turning text into video remains unsolved.
\”We are unlikely to reach something like DALL-E 2 or Midjourney in terms of quality [of video creation] any time soon,\” he said.
To reduce jitter and distortions, the Imagen Video team plans to join forces with the developers of Phenaki, another Google generator that turns long, detailed prompts into two-minute low-quality clips.
Google also notes that the data used for training contained inappropriate content, which meant Imagen Video sometimes generated clips depicting violence or sexual content. The company therefore does not plan to release the model or its source code until the issue is fixed.
In September, an enthusiast developed a text-to-video animation generator based on Stable Diffusion.
In August, TikTok unveiled a tool for creating video backgrounds from text prompts.
In June, Chinese researchers developed the CogVideo transformer with 9 billion parameters to translate text into animation.
