Gemini Omni Flash generates video from multimodal input
Conversational video generation and editing from text prompts plus image, audio, and video references, now in the Gemini app, Google Flow, and YouTube Shorts.
May 22, 2026
Summary
Gemini Omni Flash lets users edit video through natural-language instructions instead of manual editing workflows, and is described as keeping characters consistent and physics plausible across multi-turn edits. Developers building content generation APIs can use it as a reference point for native multimodal generation.
Implementation verdict
Gemini Omni Flash is rolling out to the Gemini app, Google Flow, and YouTube Shorts. It accepts text, image, audio, and video input and produces video output; image and audio output are planned for later. Worth testing now for prompt engineering patterns, but production integration depends on API availability and rate limits, neither of which the announcement specifies.
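For testing prompt patterns in the meantime, it can help to log each multi-turn edit session in a structured form. The sketch below is a minimal, purely local example: it calls no Google API (none is specified in the announcement), and the names `Reference`, `EditSession`, `attach`, `instruct`, and `transcript` are hypothetical. The modality sets simply mirror what the brief describes.

```python
from dataclasses import dataclass, field

# Modalities as described in the brief: text/image/audio/video in, video out.
INPUT_MODALITIES = {"text", "image", "audio", "video"}
OUTPUT_MODALITIES = {"video"}  # image and audio output are described as coming later


@dataclass
class Reference:
    """A local reference asset attached to a session (hypothetical structure)."""
    modality: str
    path: str

    def __post_init__(self):
        if self.modality not in INPUT_MODALITIES:
            raise ValueError(f"unsupported input modality: {self.modality}")


@dataclass
class EditSession:
    """Records a multi-turn edit conversation so prompt patterns can be compared."""
    references: list = field(default_factory=list)
    turns: list = field(default_factory=list)

    def attach(self, modality, path):
        self.references.append(Reference(modality, path))

    def instruct(self, prompt, output="video"):
        if output not in OUTPUT_MODALITIES:
            raise ValueError(f"output modality not available yet: {output}")
        if not prompt.strip():
            raise ValueError("prompt must not be empty")
        self.turns.append(
            {"turn": len(self.turns) + 1, "prompt": prompt.strip(), "output": output}
        )

    def transcript(self):
        """Return the session as a plain dict, e.g. for logging or later replay."""
        return {
            "references": [
                {"modality": r.modality, "path": r.path} for r in self.references
            ],
            "turns": list(self.turns),
        }


# Example: two conversational edits against the same reference assets.
session = EditSession()
session.attach("image", "character.png")
session.attach("audio", "voiceover.wav")
session.instruct("Animate the character walking through rain.")
session.instruct("Keep the same character, but switch the scene to night.")
log = session.transcript()
```

Keeping the transcript as a plain dict means it can be serialized with `json.dumps` and mapped onto a real request format once API details are published.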
Sources
- 1.Omni is our new model that can create anything from any input — starting with video
- 2.With Omni, you can combine images, audio, video and text as input and generate high-quality videos
- 3.we're rolling out the first model in the Omni family: Gemini Omni Flash, to the Gemini app, Google Flow and YouTube Shorts
- 4.Edit your videos through conversation
- 5.Omni has an improved intuitive understanding of forces like gravity, kinetic energy and fluid dynamics
Dev Signal