Gemini Omni Flash generates video from multimodal input

Conversational video generation and editing from text prompts plus image, audio, and video references, now rolling out in the Gemini app, Google Flow, and YouTube Shorts.

May 22, 2026

Summary

Gemini Omni Flash replaces manual video editing workflows with natural-language instructions that keep characters and physics consistent across multi-turn edits. Developers building content generation APIs can now use this native multimodal capability as a reference point.

Implementation verdict

Gemini Omni Flash is live in the Gemini app, Google Flow, and YouTube Shorts today. It accepts text, image, audio, and video input and produces video output; image and audio output modalities are coming later. It is worth testing now for prompt engineering patterns, but production integration depends on API availability and rate limits, neither of which the announcement specifies.
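Since the announcement describes no public API, one way to test prompt patterns today is to log multi-turn edit sessions locally and replay them by hand in the Gemini app or Google Flow. The sketch below is only an illustration of that bookkeeping: `Reference`, `EditSession`, the file names, and the sample prompts are all hypothetical, and nothing here calls a Google service.

```python
from dataclasses import dataclass, field


@dataclass
class Reference:
    """A media reference attached to a turn (hypothetical structure)."""
    kind: str  # "image", "audio", or "video", the input modalities named in the brief
    path: str  # local file path; placeholder values only

    def __post_init__(self):
        if self.kind not in {"image", "audio", "video"}:
            raise ValueError(f"unsupported input modality: {self.kind}")


@dataclass
class EditSession:
    """Records an ordered multi-turn editing conversation for later review."""
    turns: list = field(default_factory=list)

    def add_turn(self, instruction, references=()):
        # Each turn pairs one natural-language instruction with its references.
        self.turns.append({"instruction": instruction, "references": list(references)})
        return len(self.turns)  # 1-based index of the turn just added

    def transcript(self):
        # Render the session as plain text so prompt variants can be diffed.
        lines = []
        for i, turn in enumerate(self.turns, start=1):
            refs = ", ".join(f"{r.kind}:{r.path}" for r in turn["references"]) or "none"
            lines.append(f"Turn {i}: {turn['instruction']} [refs: {refs}]")
        return "\n".join(lines)


if __name__ == "__main__":
    session = EditSession()
    session.add_turn("Generate a short clip of the dog catching a ball.",
                     [Reference("image", "dog.png")])
    session.add_turn("Keep the same dog, but make it rain.")
    print(session.transcript())
```

Keeping the transcript as plain text makes it easy to compare which phrasing preserved character consistency across turns; if an API does ship, the same log could feed whatever request format it defines.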

Sources

  1. Omni is our new model that can create anything from any input — starting with video
  2. With Omni, you can combine images, audio, video and text as input and generate high-quality videos
  3. we're rolling out the first model in the Omni family: Gemini Omni Flash, to the Gemini app, Google Flow and YouTube Shorts
  4. Edit your videos through conversation
  5. Omni has an improved intuitive understanding of forces like gravity, kinetic energy and fluid dynamics
