Create Production-Grade Videos with Synchronized Audio
LTX 2.3 is the latest multimodal video generation model from Lightricks, built on a Diffusion Transformer (DiT) architecture with 22 billion parameters. It generates synchronized video and audio in a single pass, supporting text-to-video, image-to-video, audio-to-video, video extension, and keyframe interpolation — all within one unified system. The model ships with open weights on HuggingFace, a desktop editor for fully local execution, and API access for cloud-based integration.
Key improvements over the prior LTX-2 release include a redesigned VAE for sharper fine details and cleaner edges, a 4× larger text connector for stronger prompt adherence on complex multi-subject scenes, native 9:16 portrait video trained on portrait-orientation data, and significantly cleaner audio output after noise filtering of the training set. For teams building products, the model is available in full (dev), distilled, and FP8 quantized variants to balance quality and inference speed.
Whether you need cinematic concept clips, social-ready vertical video, or audio-driven animation, the LTX 2.3 engine handles it without switching between fragmented tools — and it can run entirely on your own hardware.
Technical Capabilities at a Glance
Core generation limits and supported modalities for this model.
| Spec | Detail |
|---|---|
| Max Video Duration | Up to ~20 seconds per generation |
| Max Resolution | Up to 4K (2160p) via two-stage upscaler pipeline; native 1080p generation |
| Frame Rate Options | 24 FPS and 48 FPS |
| Native Audio | Synchronized audio-video generation in a single model pass |
| Portrait Video | Native 9:16 at 1080×1920, trained on portrait-orientation data |
| Model Variants | 22B dev (bf16), 22B distilled (8-step), FP8 quantized |
Before You Generate: Key Checks for This Model
Avoid common failures and quality loss by verifying these settings first.
Resolution Must Be Divisible by 32
Both width and height must be divisible by 32; non-compliant values cause padding artifacts or generation errors. Note that 1920 is a multiple of 32 but 1080 is not (1080 = 33 × 32 + 24), so when targeting 1080p or 9:16 portrait output, confirm whether your pipeline snaps or pads the 1080 dimension to a compliant size such as 1088.
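A quick pre-flight check in plain Python (an illustrative sketch, not part of any official LTX tooling) can validate or snap dimensions before you queue a job:

```python
def is_compliant(dim: int) -> bool:
    """True if a width or height satisfies the divisible-by-32 rule."""
    return dim % 32 == 0

def snap_to_32(dim: int) -> int:
    """Snap a dimension up to the nearest multiple of 32."""
    return ((dim + 31) // 32) * 32

print(is_compliant(1920))  # True: 1920 = 60 * 32
print(is_compliant(1080))  # False: 1080 = 33 * 32 + 24
print(snap_to_32(1080))    # 1088, the nearest compliant size at or above 1080
```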
Frame Count Formula: Divisible by 8, Plus 1
The number of frames must follow the pattern (n × 8) + 1. For example, 121 frames (15 × 8 + 1) at 24 FPS yields roughly 5 seconds. Incorrect frame counts break the generation pipeline.
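The same pre-flight idea applies to frame counts; a small illustrative helper can turn a target duration into the nearest valid count:

```python
def valid_frame_count(seconds: float, fps: int = 24) -> int:
    """Return the frame count of the form n * 8 + 1 closest to a target duration."""
    target = seconds * fps
    n = max(1, round((target - 1) / 8))
    return n * 8 + 1

print(valid_frame_count(5, fps=24))   # 121 frames -> ~5.04 s at 24 FPS
print(valid_frame_count(10, fps=48))  # 481 frames -> ~10.02 s at 48 FPS
```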
Write Prompts Like a Shot List — Under 200 Words
Start directly with the action, describe chronologically in a single flowing paragraph, and include camera angles, wardrobe, lighting, and motion. Avoid abstract or vague language. The official guideline caps prompts at 200 words.
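For example, an illustrative prompt following these rules might read: "A chef in a white apron lifts a steaming bowl of ramen toward the camera, medium close-up, shallow depth of field, warm tungsten kitchen light; steam curls upward as she smiles and sets the bowl on a wooden counter." It opens on the action, runs chronologically, and names the camera, lighting, and wardrobe in a single paragraph.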
Choose Dev vs. Distilled Based on Your Quality/Speed Need
The dev checkpoint runs 40 inference steps for maximum quality but is slower. The distilled variant uses just 8 steps for rapid iteration. Pick the right variant before queuing generation to avoid wasting time or getting subpar results.
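If you script your generations, the tradeoff reduces to a single step-count setting. A minimal sketch; the dictionary keys and parameter names here are illustrative, not taken from an official SDK:

```python
# Illustrative mapping only; the step counts mirror the dev (40-step)
# and distilled (8-step) guidance above.
VARIANTS = {
    "dev":       {"num_inference_steps": 40, "best_for": "final renders"},
    "distilled": {"num_inference_steps": 8,  "best_for": "fast iteration"},
}

def pick_variant(final_render: bool) -> dict:
    """Dev for finals, distilled for drafts."""
    return VARIANTS["dev" if final_render else "distilled"]

print(pick_variant(final_render=False))  # 8-step distilled settings for drafts
```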
Verify Audio Source Quality for Audio-to-Video
When conditioning on audio input, the model maps sound events to visual motion. Noisy or clipped audio sources produce weaker synchronization. Use clean, well-separated audio tracks for best results.
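One way to screen a track before conditioning on it is to check for clipping and headroom. A minimal sketch using the numpy and soundfile libraries (assumed installed); the thresholds are rough rules of thumb, not official requirements:

```python
import numpy as np
import soundfile as sf

def audio_health_check(path: str) -> None:
    """Flag clipping and report headroom before using a track as
    conditioning input. Thresholds are heuristics, not model-mandated."""
    data, sr = sf.read(path)
    if data.ndim > 1:                         # mix stereo down to mono
        data = data.mean(axis=1)
    peak = float(np.abs(data).max())
    clipped = float(np.mean(np.abs(data) > 0.999))
    print(f"sample rate: {sr} Hz, peak: {peak:.3f}, clipped: {clipped:.2%}")
    if clipped > 0.001:
        print("warning: likely clipping; re-export or pick a cleaner source")

audio_health_check("voiceover.wav")  # hypothetical input file
```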
Choose the Right Workflow: LTX 2.3 or Kling 3.0
Both models generate video with native audio from text and image inputs, but they differ sharply in openness, deployment flexibility, duration limits, and multi-shot capabilities. This comparison highlights the practical differences that matter when selecting a model for your project.
| Feature/Spec | LTX 2.3 (Recommended) | Kling 3.0 |
|---|---|---|
| Developer | Lightricks | Kuaishou Technology |
| Max Video Duration | Up to ~20 seconds per generation | Up to 15 seconds per generation |
| Max Resolution | Up to 4K (2160p) with upscaler; native 1080p | Up to 1080p natively (Pro mode); 4K claimed in marketing |
| Frame Rate | 24 / 48 FPS | 24–30 FPS standard; up to 60 FPS in some modes |
| Native Audio Languages | Synchronized audio-video; supported language scope not officially specified | 5 languages (Chinese, English, Japanese, Korean, Spanish) with lip-sync |
| Multi-Shot Storyboarding | Not supported natively (single continuous clip) | Up to 6 shots per clip with per-shot camera/duration control |
| Open Weights / Local Execution | Yes — open weights on HuggingFace; runs locally on consumer GPUs | No — closed/proprietary model; cloud-only through Kuaishou platform or third-party APIs |
| LoRA Fine-Tuning | Yes — LoRA and IC-LoRA training supported; can complete in under 1 hour | No official support documented as of the latest check |
| Availability | Available on Vidofy.ai | Also available on Vidofy.ai |
Practical Tradeoffs to Consider
Openness and Deployment Flexibility
LTX 2.3 provides a fundamentally different deployment model than Kling 3.0. With open weights, a desktop editor, ComfyUI integration, and API access, creators can choose between fully local execution (where no data leaves the machine), cloud API calls for scale, or hybrid workflows that prototype locally and render via API. This makes it uniquely suitable for teams with IP sensitivity, custom pipeline needs, or fine-tuning requirements. Kling 3.0, by contrast, is a cloud-only service — powerful, but locked behind platform access with no self-hosting option.
Multi-Shot Narrative vs. Single-Clip Depth
Kling 3.0's multi-shot storyboarding — with up to six camera cuts per generation and per-shot control over duration, framing, and dialogue — is a clear advantage for structured short-form storytelling. LTX 2.3 generates single continuous clips but compensates with longer maximum duration, audio-to-video conditioning, and the ability to extend clips via a dedicated endpoint. The choice depends on whether your workflow needs pre-structured multi-shot sequences or flexible, longer single-shot clips that can be assembled in post.
When to Choose LTX 2.3 vs Kling 3.0
Use this quick guidance to pick the best option for your workflow: choose LTX 2.3 when you need local or self-hosted execution, LoRA fine-tuning, audio-to-video conditioning, or longer single clips you can assemble in post. Choose Kling 3.0 when your priority is multi-shot storyboarding with per-shot camera and duration control, or multilingual lip-synced dialogue.
From Prompt to Video in Four Steps
Generate your first clip in minutes with this straightforward workflow.
Step 1: Select LTX 2.3 as Your Model
Open Vidofy.ai, navigate to the video generation tool, and select LTX 2.3 from the available models. Choose your generation mode: text-to-video, image-to-video, or audio-to-video.
Step 2: Configure Your Output Settings
Set resolution (up to 1080p or 4K with upscaler), aspect ratio (16:9 or 9:16 portrait), duration, and frame rate. Enable audio generation if you want synchronized sound output.
Step 3: Write Your Prompt and Generate
Describe the scene chronologically in a single flowing paragraph — include camera angle, subject action, lighting, and environment. Keep it under 200 words. Click Generate to start the process.
Step 4: Download, Extend, or Iterate
Preview your generated clip with audio. Download the result, use the extend function to add more seconds, or adjust your prompt and regenerate. Export your final video for use in any editing tool.
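If you later move from the UI to API-based integration, the same four choices become request parameters. A rough sketch with Python's requests library; the endpoint URL and field names below are hypothetical placeholders, not a documented API, so consult the provider's actual reference:

```python
import requests

# Hypothetical endpoint and payload shape, shown only to tie the four
# steps together; check the real API documentation before use.
payload = {
    "model": "ltx-2.3",
    "mode": "text-to-video",          # or image-to-video / audio-to-video
    "prompt": "A chef in a white apron lifts a steaming bowl of ramen...",
    "width": 1920,
    "height": 1088,                   # both dimensions divisible by 32
    "num_frames": 121,                # n * 8 + 1 -> ~5 s at 24 FPS
    "fps": 24,
    "generate_audio": True,
}
resp = requests.post("https://api.example.com/v1/generate", json=payload, timeout=600)
resp.raise_for_status()
print(resp.json())                    # typically a job id or a result URL
```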
Frequently Asked Questions
What types of video can I generate with LTX 2.3?
LTX 2.3 supports text-to-video, image-to-video, audio-to-video, video extension, retake (regenerating a specific time region), and keyframe interpolation — all with optional synchronized audio output.
How long can a single generated clip be?
A single generation produces up to approximately 20 seconds of video. You can extend clips beyond that using the dedicated extend-video function.
Can I run this model locally on my own GPU?
Yes. Open weights are available on HuggingFace, and a desktop editor supports local inference on NVIDIA GPUs (RTX 30/40/50 series). For full-quality two-stage generation at higher resolutions, higher-VRAM GPUs are recommended. macOS users can generate via the API fallback mode in the desktop app.
Can I fine-tune LTX 2.3 for my own style or characters?
Yes. The dev checkpoint supports LoRA and IC-LoRA training through the included trainer. Lightricks states that training for motion, style, or likeness can complete in under an hour in many settings.
Is commercial use of generated content permitted?
The model weights are available under the LTX-2 Community License Agreement. For companies under $10M annual revenue, usage is free. Companies above that threshold need a commercial licensing agreement. Check the specific license terms on HuggingFace before deploying in a commercial product.
What are the main resolution and frame count constraints to watch?
Width and height must be divisible by 32, and frame count must follow the formula (n × 8) + 1. Non-compliant values can cause generation failures or padding artifacts. A preset such as 1920×1088 at 121 frames (≈5 s at 24 FPS) satisfies both rules; since 1080 itself is not a multiple of 32, confirm how your pipeline handles exact 1080p or 1080×1920 portrait targets.