
Stability AI is Testing Generative Video which Generates up to 25 Frames from a Still Image

Stability AI, the developer behind the Stable Diffusion image generator, is testing generative video, giving it the ability to animate its generative art. The firm has introduced a research preview of a new offering named Stable Video Diffusion, which enables users to generate videos from a single image. Stability AI describes the new AI video model as a substantial advancement in its mission to develop models that cater to all users.

The newly launched tool comes in the form of two image-to-video models, which generate clips of 14 and 25 frames respectively, at frame rates ranging from 3 to 30 frames per second and a resolution of 576 × 1024.
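At those settings the clips are short by design: 25 frames played back at, say, 7 frames per second works out to roughly 3.6 seconds of video, which is consistent with the sub-4-second limit the company acknowledges below.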

The generative video tool is capable of synthesizing multi-view images from a single frame and can be fine-tuned on multi-view datasets. At the time of release, external evaluations showed that these models outperform leading closed models in user preference studies, according to the company.

This comparison was made with text-to-video platforms such as Runway and Pika Labs. Currently, Stable Video Diffusion is only available for research purposes and not for real-world or commercial applications.

However, interested users can join a waitlist for access to a forthcoming web experience that features a text-to-video interface. This new generative video tool, Stable Video Diffusion, is expected to demonstrate its potential in various sectors, including advertising, education, entertainment, and more.

Stability AI’s Stable Video Diffusion is built on diffusion models: training images are progressively turned into noise through a process called forward diffusion, and the trained model then reverses that process, generating an image from noise (reverse diffusion). Unlike models that operate directly in pixel space, Stable Diffusion works in a lower-dimensional latent space, which makes it efficient enough for consumer-grade graphics cards.
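As a rough illustration of the forward process only (not Stability AI's actual training code), the sketch below adds noise to an image tensor in a generic DDPM-style way; the linear noise schedule, step count, and tensor shape are illustrative assumptions rather than values taken from this article.

```python
import torch

def forward_diffusion(x0: torch.Tensor, t: int, num_steps: int = 1000) -> torch.Tensor:
    """Noise a clean image x0 up to timestep t using a simple linear beta schedule.

    Generic DDPM-style sketch; the schedule endpoints (1e-4 to 0.02) are common
    illustrative defaults, not parameters published for Stable Video Diffusion.
    """
    betas = torch.linspace(1e-4, 0.02, num_steps)        # noise added at each step
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)   # fraction of signal kept through step t
    a_bar = alphas_cumprod[t]
    noise = torch.randn_like(x0)
    # Closed-form jump to step t: keep sqrt(a_bar) of the image, fill the rest with noise.
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

# Example: one RGB frame at the 576 x 1024 resolution cited above.
x0 = torch.rand(1, 3, 576, 1024)
x_t = forward_diffusion(x0, t=500)
```

The reverse process is what the trained network learns: starting from pure noise, it repeatedly predicts and removes the noise added above until an image (or, here, a sequence of frames) emerges.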

It uses a latent diffusion model, compressing each image into the latent space before the diffusion process runs. The model is trained on a large dataset of videos and fine-tuned on a smaller set. Although Stable Video Diffusion is currently available only for research purposes, it has shown potential in generating high-quality images and videos.
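For readers who want to experiment under the research license, a minimal image-to-video sketch is shown below. It assumes the Hugging Face diffusers library's StableVideoDiffusionPipeline and the publicly posted stabilityai/stable-video-diffusion-img2vid-xt checkpoint, neither of which is described in this article, and it requires a CUDA GPU with ample memory.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load the 25-frame (XT) image-to-video model in half precision to fit consumer GPUs.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

# A single still image is the only conditioning input; 1024 x 576 matches the model's resolution.
image = load_image("input.png").resize((1024, 576))

# Generate the frames and write them out as a short clip.
frames = pipe(image, decode_chunk_size=8).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```

The playback frame rate passed to export_to_video can be chosen within the 3 to 30 fps range mentioned above; higher values trade clip length for smoother motion.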

The video samples provided appear to be of relatively high quality, comparable to other generative systems. However, the company acknowledges several limitations: the generated videos are short (under four seconds), fall short of perfect photorealism, cannot perform camera motion other than slow pans, offer no text control, cannot render legible text, and may not generate people and faces properly.

The tool was trained on a dataset comprising millions of videos and subsequently fine-tuned on a smaller set. Stability AI states that the videos used for training were publicly available for research purposes.
