Create Production-Grade Videos with Synchronized Audio
LTX 2.3 is the latest multimodal video generation model from Lightricks, built on a Diffusion Transformer (DiT) architecture with 22 billion parameters. It generates synchronized video and audio in a single pass, supporting text-to-video, image-to-video, audio-to-video, video extension, and keyframe interpolation — all within one unified system. The model ships with open weights on HuggingFace, a desktop editor for fully local execution, and API access for cloud-based integration.
Key improvements over the prior LTX-2 release include a redesigned VAE for sharper fine details and cleaner edges, a 4× larger text connector for stronger prompt adherence on complex multi-subject scenes, native 9:16 portrait video trained on portrait-orientation data, and significantly cleaner audio output after noise filtering of the training set. For teams building products, the model is available in full (dev), distilled, and FP8 quantized variants to balance quality and inference speed.
Whether you need cinematic concept clips, social-ready vertical video, or audio-driven animation, the LTX 2.3 engine handles it without switching between fragmented tools — and it can run entirely on your own hardware.
Technical Capabilities at a Glance
Core generation limits and supported modalities for this model.
| Spec | Detail |
|---|---|
| Max Video Duration | Up to ~20 seconds per generation |
| Max Resolution | Up to 4K (2160p) via two-stage upscaler pipeline; native 1080p generation |
| Frame Rate Options | 24 FPS and 48 FPS |
| Native Audio | Synchronized audio-video generation in a single model pass |
| Portrait Video | Native 9:16 at 1080×1920, trained on portrait-orientation data |
| Model Variants | 22B dev (bf16), 22B distilled (8-step), FP8 quantized |
Before You Generate: Key Checks for This Model
Avoid common failures and quality loss by verifying these settings first.
Resolution Must Be Divisible by 32
Both width and height must be divisible by 32; non-compliant values cause padding artifacts or generation errors. Note that 1920 is a multiple of 32 but 1080 is not (1080 = 33 × 32 + 24), so when targeting 1080p or 9:16 portrait output, confirm whether your pipeline snaps or pads the 1080 dimension to a compliant size such as 1088.
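A quick pre-flight check in plain Python (an illustrative sketch, not part of any official LTX tooling) can validate or snap dimensions before you queue a job:

```python
def is_compliant(dim: int) -> bool:
    """True if a width or height satisfies the divisible-by-32 rule."""
    return dim % 32 == 0

def snap_to_32(dim: int) -> int:
    """Snap a dimension up to the nearest multiple of 32."""
    return ((dim + 31) // 32) * 32

print(is_compliant(1920))  # True: 1920 = 60 * 32
print(is_compliant(1080))  # False: 1080 = 33 * 32 + 24
print(snap_to_32(1080))    # 1088, the nearest compliant size at or above 1080
```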
Frame Count Formula: Divisible by 8, Plus 1
The number of frames must follow the pattern (n × 8) + 1. For example, 121 frames (15 × 8 + 1) at 24 FPS yields roughly 5 seconds. Incorrect frame counts break the generation pipeline.
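The same pre-flight idea applies to frame counts; a small illustrative helper can turn a target duration into the nearest valid count:

```python
def valid_frame_count(seconds: float, fps: int = 24) -> int:
    """Return the frame count of the form n * 8 + 1 closest to a target duration."""
    target = seconds * fps
    n = max(1, round((target - 1) / 8))
    return n * 8 + 1

print(valid_frame_count(5, fps=24))   # 121 frames -> ~5.04 s at 24 FPS
print(valid_frame_count(10, fps=48))  # 481 frames -> ~10.02 s at 48 FPS
```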
Write Prompts Like a Shot List — Under 200 Words
Start directly with the action, describe chronologically in a single flowing paragraph, and include camera angles, wardrobe, lighting, and motion. Avoid abstract or vague language. The official guideline caps prompts at 200 words.
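For example, an illustrative prompt following these rules might read: "A chef in a white apron lifts a steaming bowl of ramen toward the camera, medium close-up, shallow depth of field, warm tungsten kitchen light; steam curls upward as she smiles and sets the bowl on a wooden counter." It opens on the action, runs chronologically, and names the camera, lighting, and wardrobe in a single paragraph.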
Choose Dev vs. Distilled Based on Your Quality/Speed Need
The dev checkpoint runs 40 inference steps for maximum quality but is slower. The distilled variant uses just 8 steps for rapid iteration. Pick the right variant before queuing generation to avoid wasting time or getting subpar results.
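If you script your generations, the tradeoff reduces to a single step-count setting. A minimal sketch; the dictionary keys and parameter names here are illustrative, not taken from an official SDK:

```python
# Illustrative mapping only; the step counts mirror the dev (40-step)
# and distilled (8-step) guidance above.
VARIANTS = {
    "dev":       {"num_inference_steps": 40, "best_for": "final renders"},
    "distilled": {"num_inference_steps": 8,  "best_for": "fast iteration"},
}

def pick_variant(final_render: bool) -> dict:
    """Dev for finals, distilled for drafts."""
    return VARIANTS["dev" if final_render else "distilled"]

print(pick_variant(final_render=False))  # 8-step distilled settings for drafts
```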
Verify Audio Source Quality for Audio-to-Video
When conditioning on audio input, the model maps sound events to visual motion. Noisy or clipped audio sources produce weaker synchronization. Use clean, well-separated audio tracks for best results.
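One way to screen a track before conditioning on it is to check for clipping and headroom. A minimal sketch using the numpy and soundfile libraries (assumed installed); the thresholds are rough rules of thumb, not official requirements:

```python
import numpy as np
import soundfile as sf

def audio_health_check(path: str) -> None:
    """Flag clipping and report headroom before using a track as
    conditioning input. Thresholds are heuristics, not model-mandated."""
    data, sr = sf.read(path)
    if data.ndim > 1:                         # mix stereo down to mono
        data = data.mean(axis=1)
    peak = float(np.abs(data).max())
    clipped = float(np.mean(np.abs(data) > 0.999))
    print(f"sample rate: {sr} Hz, peak: {peak:.3f}, clipped: {clipped:.2%}")
    if clipped > 0.001:
        print("warning: likely clipping; re-export or pick a cleaner source")

audio_health_check("voiceover.wav")  # hypothetical input file
```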
Choose the Right Workflow: LTX 2.3 or Kling 3.0
Both models generate video with native audio from text and image inputs, but they differ sharply in openness, deployment flexibility, duration limits, and multi-shot capabilities. This comparison highlights the practical differences that matter when selecting a model for your project.
| Feature/Spec | LTX 2.3 (Recommended) | Kling 3.0 |
|---|---|---|
| Developer | Lightricks | Kuaishou Technology |
| Max Video Duration | Up to ~20 seconds per generation | Up to 15 seconds per generation |
| Max Resolution | Up to 4K (2160p) with upscaler; native 1080p | Up to 1080p natively (Pro mode); 4K claimed in marketing |
| Frame Rate | 24 / 48 FPS | 24–30 FPS standard; up to 60 FPS in some modes |
| Native Audio Languages | Synchronized audio-video; supported language scope not officially specified | 5 languages (Chinese, English, Japanese, Korean, Spanish) with lip-sync |
| Multi-Shot Storyboarding | Not supported natively (single continuous clip) | Up to 6 shots per clip with per-shot camera/duration control |
| Open Weights / Local Execution | Yes — open weights on HuggingFace; runs locally on consumer GPUs | No — closed/proprietary model; cloud-only through Kuaishou platform or third-party APIs |
| LoRA Fine-Tuning | Yes — LoRA and IC-LoRA training supported; can complete in under 1 hour | No official support documented as of the latest check |
| Availability | Available on Vidofy.ai | Also available on Vidofy.ai |
Practical Tradeoffs to Consider
Openness and Deployment Flexibility
LTX 2.3 provides a fundamentally different deployment model than Kling 3.0. With open weights, a desktop editor, ComfyUI integration, and API access, creators can choose between fully local execution (where no data leaves the machine), cloud API calls for scale, or hybrid workflows that prototype locally and render via API. This makes it uniquely suitable for teams with IP sensitivity, custom pipeline needs, or fine-tuning requirements. Kling 3.0, by contrast, is a cloud-only service — powerful, but locked behind platform access with no self-hosting option.
Multi-Shot Narrative vs. Single-Clip Depth
Kling 3.0's multi-shot storyboarding — with up to six camera cuts per generation and per-shot control over duration, framing, and dialogue — is a clear advantage for structured short-form storytelling. LTX 2.3 generates single continuous clips but compensates with longer maximum duration, audio-to-video conditioning, and the ability to extend clips via a dedicated endpoint. The choice depends on whether your workflow needs pre-structured multi-shot sequences or flexible, longer single-shot clips that can be assembled in post.
When to Choose LTX 2.3 vs Kling 3.0
Use this quick guidance to pick the best option for your workflow: choose LTX 2.3 when you need local or self-hosted execution, LoRA fine-tuning, audio-to-video conditioning, or longer single clips you can assemble in post. Choose Kling 3.0 when your priority is multi-shot storyboarding with per-shot camera and duration control, or multilingual lip-synced dialogue.
From Prompt to Video in Four Steps
Generate your first clip in minutes with this straightforward workflow.
Step 1: Select LTX 2.3 as Your Model
Open Vidofy.ai, navigate to the video generation tool, and select LTX 2.3 from the available models. Choose your generation mode: text-to-video, image-to-video, or audio-to-video.
Step 2: Configure Your Output Settings
Set resolution (up to 1080p or 4K with upscaler), aspect ratio (16:9 or 9:16 portrait), duration, and frame rate. Enable audio generation if you want synchronized sound output.
Step 3: Write Your Prompt and Generate
Describe the scene chronologically in a single flowing paragraph — include camera angle, subject action, lighting, and environment. Keep it under 200 words. Click Generate to start the process.
Step 4: Download, Extend, or Iterate
Preview your generated clip with audio. Download the result, use the extend function to add more seconds, or adjust your prompt and regenerate. Export your final video for use in any editing tool.
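If you later move from the UI to API-based integration, the same four choices become request parameters. A rough sketch with Python's requests library; the endpoint URL and field names below are hypothetical placeholders, not a documented API, so consult the provider's actual reference:

```python
import requests

# Hypothetical endpoint and payload shape, shown only to tie the four
# steps together; check the real API documentation before use.
payload = {
    "model": "ltx-2.3",
    "mode": "text-to-video",          # or image-to-video / audio-to-video
    "prompt": "A chef in a white apron lifts a steaming bowl of ramen...",
    "width": 1920,
    "height": 1088,                   # both dimensions divisible by 32
    "num_frames": 121,                # n * 8 + 1 -> ~5 s at 24 FPS
    "fps": 24,
    "generate_audio": True,
}
resp = requests.post("https://api.example.com/v1/generate", json=payload, timeout=600)
resp.raise_for_status()
print(resp.json())                    # typically a job id or a result URL
```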
Frequently Asked Questions
What types of video can I generate with LTX 2.3?
LTX 2.3 supports text-to-video, image-to-video, audio-to-video, video extension, retake (regenerating a specific time region), and keyframe interpolation — all with optional synchronized audio output.
How long can a single generated clip be?
A single generation produces up to approximately 20 seconds of video. You can extend clips beyond that using the dedicated extend-video function.
Can I run this model locally on my own GPU?
Yes. Open weights are available on HuggingFace, and a desktop editor supports local inference on NVIDIA GPUs (RTX 30/40/50 series). For full-quality two-stage generation at higher resolutions, higher-VRAM GPUs are recommended. macOS users can generate via the API fallback mode in the desktop app.
Can I fine-tune LTX 2.3 for my own style or characters?
Yes. The dev checkpoint supports LoRA and IC-LoRA training through the included trainer. Lightricks states that training for motion, style, or likeness can complete in under an hour in many settings.
Is commercial use of generated content permitted?
The model weights are available under the LTX-2 Community License Agreement. For companies under $10M annual revenue, usage is free. Companies above that threshold need a commercial licensing agreement. Check the specific license terms on HuggingFace before deploying in a commercial product.
What are the main resolution and frame count constraints to watch?
Width and height must be divisible by 32, and frame count must follow the formula (n × 8) + 1. Non-compliant values can cause generation failures or padding artifacts. A preset such as 1920×1088 at 121 frames (≈5 s at 24 FPS) satisfies both rules; since 1080 itself is not a multiple of 32, confirm how your pipeline handles exact 1080p or 1080×1920 portrait targets.