Produce cinematic, sound-on video generations with Veo 3.1—without the technical overhead
Veo 3.1 is a state-of-the-art generative video model from Google DeepMind, built for creating high-fidelity video with natively generated audio from text prompts and images. Google introduced Veo 3.1 as part of major Flow updates on Oct 15, 2025, and later highlighted expanded “Ingredients to Video” capabilities on Jan 13, 2026. The model targets filmmaking-grade realism, stronger prompt adherence, and more controllable storytelling workflows.
Through the Gemini API, Veo 3.1 supports short, production-focused clips, with documented options including video duration presets (4s / 6s / 8s), output resolutions of 720p, 1080p, and 4K, and a 24fps output frame rate. It also adds filmmaker-friendly controls such as portrait or landscape output (9:16 / 16:9), generating from first/last frames, extending previous generations, and guiding scenes with up to 3 reference images.
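If you're calling the Gemini API directly, a text-to-video request looks roughly like the sketch below. The model identifier, config field names, and polling pattern follow the google-genai SDK's general video-generation flow, but treat them as assumptions and confirm the details against the current Gemini API Veo documentation before relying on them.

```python
# Minimal sketch of a Veo 3.1 text-to-video request via the google-genai SDK.
# Model ID and config field names are assumptions -- verify against the
# current Gemini API Veo documentation before use.
import time
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",  # assumed model identifier
    prompt="Slow dolly-in on a rain-soaked neon street at night, ambient city sound",
    config=types.GenerateVideosConfig(
        aspect_ratio="16:9",   # or "9:16" for vertical output
        resolution="1080p",    # documented options include 720p / 1080p / 4K
        duration_seconds=8,    # documented presets: 4, 6, or 8 seconds
    ),
)

# Video generation is asynchronous: poll the long-running operation.
while not operation.done:
    time.sleep(10)
    operation = client.operations.get(operation)

# Download the first generated clip (24fps output with native audio).
video = operation.response.generated_videos[0]
client.files.download(file=video.video)
video.video.save("veo_clip.mp4")
```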
Vidofy.ai turns these capabilities into a streamlined, creator-first workflow: pick Veo 3.1, select orientation and quality settings, add reference images for tighter continuity, and generate. Whether you’re producing social-first vertical cuts or sound-on cinematic shots for editing, Vidofy gives you a single place to iterate quickly—without juggling multiple tools or complicated setup.
Creator’s Choice Showdown: Veo 3.1 vs Sora 2 Pro (Quality, Control, and Output Options)
Both Veo 3.1 and Sora 2 Pro are built for modern AI video generation with audio—but they emphasize control and production workflows in different ways. Here’s a spec-led comparison using only officially published details where available.
| Feature/Spec | Veo 3.1 | Sora 2 Pro |
|---|---|---|
| Model type & modalities | Video generation with native audio; supports text-to-video and image-to-video | Video generation with synced audio; supports text and image input, video and audio output |
| Clip duration presets / control | Video duration presets: 4s, 6s, 8s | Not verified in official sources (latest check) |
| Output resolution options | 720p; 1080p (8s length only); 4K (8s length only) | Portrait: 720x1280; 1024x1792. Landscape: 1280x720; 1792x1024 |
| Orientation / aspect ratio | Landscape 16:9 and portrait 9:16 | Portrait and landscape output sizes available |
| Frame rate (FPS) | 24fps | Not verified in official sources (latest check) |
| Reference-image guidance | Up to 3 reference images to guide generation | Image reference supported via input_reference (acts as the first frame) |
| API pricing (per second) | $0.40 per second (720p and 1080p); $0.60 per second (4K) | $0.30 per second (portrait 720x1280 / landscape 1280x720); $0.50 per second (portrait 1024x1792 / landscape 1792x1024) |
| Accessibility | Instant access on Vidofy | Sora 2 Pro is also available on Vidofy |
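To put the listed per-second API rates in concrete terms: an 8-second Veo 3.1 clip works out to 8 × $0.40 = $3.20 at 720p or 1080p and 8 × $0.60 = $4.80 at 4K, while an 8-second Sora 2 Pro clip works out to 8 × $0.30 = $2.40 at 720x1280 / 1280x720 and 8 × $0.50 = $4.00 at the larger sizes. Pricing on Vidofy may differ from these raw API rates.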
Detailed Analysis
Analysis: Creative control with images (continuity without heavy post)
Veo 3.1’s official Gemini API documentation emphasizes strong image-led direction, including guidance with up to 3 reference images. In practice, that matters when you need consistency across characters, wardrobe, props, and environments—especially when generating multiple shots that must cut together cleanly.
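As a rough sketch of what reference-guided generation could look like in an API call (the `reference_images` config field, the way images are wrapped, and the model identifier are assumptions here; the Gemini API Veo documentation is the source of truth for the actual parameter names and limits):

```python
# Sketch only: guiding a Veo 3.1 generation with reference images.
# The reference_images field and image wrapping are assumptions -- check the
# current Gemini API Veo documentation for the real parameter names.
from pathlib import Path

from google import genai
from google.genai import types

client = genai.Client()

def load_image(path: str) -> types.Image:
    """Load a local PNG into the SDK's Image type."""
    return types.Image(image_bytes=Path(path).read_bytes(), mime_type="image/png")

operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",  # assumed model identifier
    prompt="The hero walks through the neon street set at dusk, handheld tracking shot",
    config=types.GenerateVideosConfig(
        aspect_ratio="16:9",
        # Up to 3 reference images per the docs; field name/shape assumed here.
        reference_images=[
            load_image("hero_character.png"),
            load_image("wardrobe_style.png"),
            load_image("street_set.png"),
        ],
    ),
)
```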
On Vidofy, this becomes a predictable workflow: you can keep a “reference set” per project (hero character, environment, style frame), then iterate prompts without rebuilding the look every time.
Analysis: Sound-on generations and production-ready outputs
Both models support video generation with audio, but Veo 3.1 is explicitly positioned by Google as “video, meet audio,” and its Gemini API page documents native audio generation alongside video output. That’s valuable when your shot needs dialogue, ambience, or sound design as part of the first draft—not as a separate toolchain.
Vidofy helps teams turn this into a repeatable pipeline: generate multiple takes, compare pacing and tone, then export the best candidate for editing—without needing to stitch together video and audio from different systems.
Verdict: Choose Veo 3.1 when you need controlled, sound-on shots you can iterate on quickly
How It Works
Follow these 3 simple steps to get started with our platform.
Step 1: Choose Veo 3.1 on Vidofy
Select Veo 3.1 from Vidofy’s model library to start a sound-on video generation session from a prompt or an image-led concept.
Step 2: Add direction (prompt + optional reference images)
Describe the shot with camera movement, lighting, and action. For tighter continuity, attach reference images and generate portrait or landscape outputs depending on where the video will be published.
Step 3: Generate, iterate, and export
Create multiple takes, refine the shot, then export your chosen result for editing and delivery—all from one workflow.
Frequently Asked Questions
What is Veo 3.1 (and who developed it)?
Veo 3.1 is a generative video model from Google DeepMind designed to create videos with natively generated audio from text prompts and images. It’s referenced in official Google product updates and is available via Google surfaces including the Gemini API.
Does Veo 3.1 generate audio automatically?
Yes. The Gemini API documentation for Veo 3.1 describes video generation with native audio output.
Can Veo 3.1 generate portrait (vertical) videos?
Yes. The Gemini API documentation lists both portrait (9:16) and landscape (16:9) outputs for Veo 3.1.
What are the official Veo 3.1 limits for duration, resolution, and FPS?
Google documents Veo 3.1 generation settings (including duration presets, resolution options, and frame rate) in the Gemini API Veo video generation documentation.
Can I use Veo 3.1 commercially?
Commercial usage depends on the terms that apply to how you access the model (for example, platform terms and applicable product policies). Review the relevant terms before launching client work; this is not legal advice.
Do Veo 3.1 videos include any watermarking or provenance signals?
Google’s Veo documentation describes watermarking using SynthID for videos created by Veo.