LTX-2.3 Open-Sourced: An Engine-Level Upgrade for Video Generation

ModelScope Community

ModelScope Community · Published 2026-03-09 10:51:29

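Lightricks has officially open-sourced LTX-2.3, its open-source follow-up and a major version update of the LTX-2 audio-video foundation model. It improves both video quality and prompt adherence. LTX-2.3 rebuilds the VAE architecture, expands the text connector, improves I2V training, replaces the audio vocoder, and supports native portrait (vertical) video generation for the first time.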

📎ltx2-3-hero-fin-opt1.mp4

Open-source links:

Model weights:

https://modelscope.cn/models/Lightricks/LTX-2.3

GitHub:

https://github.com/Lightricks/LTX-2

Technical paper:

https://arxiv.org/abs/2601.03233

Key updates

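Rebuilt VAE, sharper details

The VAE architecture was retrained, producing a new latent space. High-frequency details such as hair, edges, and text are preserved more fully throughout the generation pipeline, and the softening seen at low resolutions is noticeably reduced. If you previously relied on post-process sharpening to compensate for generation quality, version 2.3 should reduce that need.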

📎01-SharpDetails.mp4

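Text connector capacity increased 4x, more accurate prompt understanding

The text connector has more capacity and an improved architecture. Multi-subject scenes, descriptions of spatial relationships, and specific style instructions now carry over to the output more faithfully. Where you previously had to simplify a prompt to get stable results, you can now try more specific descriptions.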

📎02-Prompt-Enharance.mp4

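Much-improved I2V, more real motion

This was the issue the community raised most often. I2V was retrained to reduce the "Ken Burns effect" (static zoom and pan), eliminate frozen videos, reduce unexpected jump cuts, and improve visual consistency from the input frame to the output. The I2V discard rate in production pipelines should drop noticeably.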

📎03-Image2Video.mp4

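Cleaner audio

Silent segments and noise were filtered out of the training set, and the vocoder was replaced at the same time (an improved HiFi-GAN supporting 24 kHz stereo). Audio-video alignment is tighter, with less random noise and fewer unexpected silences. This applies to both text-to-video and audio-conditioned generation pipelines.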

📎04-Audio-After.mp4

Improved version

📎04-Audio-Before.mp4

Original version

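Native portrait video, up to 1080×1920

This is the first time the LTX series supports portrait generation. The model was trained on portrait data rather than on crops of landscape footage. The format maps directly onto mainstream short-video platforms such as TikTok, Instagram Reels, and YouTube Shorts.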

📎05-Portrait-1.mp4

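Model specifications

Item	Value
Architecture	Asymmetric dual-stream diffusion Transformer (DiT)
Parameters	22B (full dev version)
Maximum generation length	20 seconds
Maximum resolution	4K; portrait 1080×1920 supported
Frame rate	25 fps / 50 fps
Text encoder	Gemma-3
Training data source	Licensed data (partnerships with Getty Images and Shutterstock)
Supported tasks	Text-to-video, image-to-video, video-to-video, audio-to-video, video extension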
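
To make the duration and frame-rate rows concrete: the number of frames in a clip is its duration multiplied by the frame rate. The helper below is an illustrative sketch and not part of the LTX-2 codebase. The name `frame_count` is hypothetical, it assumes the 20-second limit applies at both listed frame rates (the table does not spell this out), and it ignores any additional frame-count constraints the model itself may impose.

```python
def frame_count(duration_s: float, fps: int) -> int:
    """Number of frames in a clip of the given duration and frame rate."""
    if duration_s < 0 or fps <= 0:
        raise ValueError("duration must be non-negative and fps must be positive")
    return round(duration_s * fps)


# Upper bounds implied by the table (assumption: 20 s is allowed at both rates).
max_frames_25 = frame_count(20, 25)  # 500 frames
max_frames_50 = frame_count(20, 50)  # 1000 frames
print(max_frames_25, max_frames_50)
```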

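Open-source model weights

Name	Description
ltx-2.3-22b-dev	Full model, bf16 precision; supports fine-tuning and LoRA training
ltx-2.3-22b-distilled	8-step distilled version; faster inference and lower VRAM usage (CFG=1)
ltx-2.3-22b-distilled-lora-384	Distilled LoRA that can be applied on top of the full model
ltx-2.3-spatial-upscaler-x2-1.0	2x spatial upscaler module
ltx-2.3-spatial-upscaler-x1.5-1.0	1.5x spatial upscaler module
ltx-2.3-temporal-upscaler-x2-1.0	2x temporal upscaler, used to raise the frame rate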
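
As a rough guide to what "22B, bf16" means for storage and memory, the weights alone take about two bytes per parameter. The sketch below is back-of-the-envelope arithmetic only. It treats 22B as exactly 22e9 parameters (an assumption) and does not account for the text encoder, the VAE, activations, or any offloading. The function name `weight_size_gib` is hypothetical.

```python
def weight_size_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate size of the raw weights in GiB (bf16 = 2 bytes per parameter)."""
    return num_params * bytes_per_param / 2**30


# 22e9 parameters is an assumed reading of "22B".
dev_weights_gib = weight_size_gib(22e9)
print(f"{dev_weights_gib:.1f} GiB")  # 41.0 GiB
```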

Model inference

Inference with the official GitHub repository
Environment setup

git clone https://github.com/Lightricks/LTX-2.git
cd LTX-2

uv sync --frozen
source .venv/bin/activate

Inference script

# Run a pipeline (example: two-stage text-to-video)
python -m ltx_pipelines.ti2vid_two_stages \
    --checkpoint-path path/to/checkpoint.safetensors \
    --distilled-lora path/to/distilled_lora.safetensors 0.8 \
    --spatial-upsampler-path path/to/upsampler.safetensors \
    --gemma-root path/to/gemma \
    --prompt "A beautiful sunset over the ocean" \
    --output-path output.mp4
    
# View all available options for any pipeline
python -m ltx_pipelines.ti2vid_two_stages --help
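
If you would rather drive the same command from Python, for example to loop over prompts, one option is to build the argument list and hand it to `subprocess`. This is only a sketch: the flags are copied from the command above, while the helper name `build_ti2vid_command` and the parameter name `lora_value` (the number that follows the LoRA path, 0.8 above) are hypothetical. The paths are the same placeholders as in the shell example.

```python
import shlex
import sys


def build_ti2vid_command(checkpoint, distilled_lora, upsampler, gemma_root,
                         prompt, output_path, lora_value=0.8):
    """Build the argv list for the two-stage pipeline shown above."""
    return [
        sys.executable, "-m", "ltx_pipelines.ti2vid_two_stages",
        "--checkpoint-path", checkpoint,
        "--distilled-lora", distilled_lora, str(lora_value),
        "--spatial-upsampler-path", upsampler,
        "--gemma-root", gemma_root,
        "--prompt", prompt,
        "--output-path", output_path,
    ]


cmd = build_ti2vid_command(
    checkpoint="path/to/checkpoint.safetensors",
    distilled_lora="path/to/distilled_lora.safetensors",
    upsampler="path/to/upsampler.safetensors",
    gemma_root="path/to/gemma",
    prompt="A beautiful sunset over the ocean",
    output_path="output.mp4",
)
print(shlex.join(cmd))  # inspect the command before running it
# To run it from inside the LTX-2 checkout:
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing a list instead of a single shell string avoids quoting problems when the prompt contains spaces or quotation marks.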

Inference with DiffSynth-Studio

Install DiffSynth-Studio: https://github.com/modelscope/DiffSynth-Studio

git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .

Inference code:

import torch
from diffsynth.pipelines.ltx2_audio_video import LTX2AudioVideoPipeline, ModelConfig
from diffsynth.utils.data.media_io_ltx2 import write_video_audio_ltx2

# VRAM management settings: as the key names suggest, idle weights are offloaded
# to the CPU and moved to the GPU for computation, all in bfloat16.
vram_config = {
    "offload_dtype": torch.bfloat16,
    "offload_device": "cpu",
    "onload_dtype": torch.bfloat16,
    "onload_device": "cuda",
    "preparing_dtype": torch.bfloat16,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}
pipe = LTX2AudioVideoPipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized", origin_file_pattern="model-*.safetensors", **vram_config),
        ModelConfig(model_id="Lightricks/LTX-2.3", origin_file_pattern="ltx-2.3-22b-dev.safetensors", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized"),
)
prompt = "A girl is very happy, she is speaking: “I enjoy working with Diffsynth-Studio, it's a perfect framework.”"
negative_prompt = "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
height, width, num_frames = 512, 768, 121
video, audio = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    seed=43,
    height=height,
    width=width,
    num_frames=num_frames,
    tiled=True,
)
write_video_audio_ltx2(
    video=video,
    audio=audio,
    output_path='ltx2.3_onestage.mp4',
    fps=24,
    audio_sample_rate=pipe.audio_vocoder.output_sampling_rate,
)
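
The example above generates a landscape clip (`height=512`, `width=768`). Since 2.3 is trained on portrait data, a portrait clip should only need the two values swapped. The helper below is a minimal sketch of that swap. The name `frame_size` is hypothetical, and it assumes the pipeline accepts the swapped size just as it accepts the landscape one, which has not been verified here.

```python
def frame_size(short_side: int, long_side: int,
               orientation: str = "landscape") -> tuple[int, int]:
    """Return (height, width) for the requested orientation."""
    if short_side > long_side:
        short_side, long_side = long_side, short_side
    if orientation == "portrait":
        return long_side, short_side
    if orientation == "landscape":
        return short_side, long_side
    raise ValueError(f"unknown orientation: {orientation!r}")


# Same pixel budget as the example above, rotated to portrait.
height, width = frame_size(512, 768, "portrait")
print(height, width)  # 768 512
```

The resulting `height` and `width` can then be passed to `pipe(...)` in place of the landscape values.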

LoRA training with DiffSynth-Studio

Download the example dataset:

modelscope download --dataset DiffSynth-Studio/example_video_dataset --local_dir ./data/example_video_dataset

Data processing:

accelerate launch examples/ltx2/model_training/train.py \
  --dataset_base_path data/example_video_dataset/ltx2 \
  --dataset_metadata_path data/example_video_dataset/ltx2_t2av.csv \
  --data_file_keys "video,input_audio" \
  --extra_inputs "input_audio" \
  --height 512 \
  --width 768 \
  --num_frames 121 \
  --dataset_repeat 1 \
  --model_id_with_origin_paths "DiffSynth-Studio/LTX-2.3-Repackage:text_encoder_post_modules.safetensors,DiffSynth-Studio/LTX-2.3-Repackage:video_vae_encoder.safetensors,DiffSynth-Studio/LTX-2.3-Repackage:audio_vae_encoder.safetensors,google/gemma-3-12b-it-qat-q4_0-unquantized:model-*.safetensors" \
  --learning_rate 1e-4 \
  --num_epochs 5 \
  --remove_prefix_in_ckpt "pipe.dit." \
  --output_path "./models/train/LTX2.3-T2AV_lora-splited-cache" \
  --lora_base_model "dit" \
  --lora_target_modules "to_k,to_q,to_v,to_out.0" \
  --lora_rank 32 \
  --use_gradient_checkpointing \
  --task "sft:data_process"

LoRA training (requires 80 GB of VRAM):

accelerate launch examples/ltx2/model_training/train.py \
  --dataset_base_path ./models/train/LTX2.3-T2AV_lora-splited-cache \
  --data_file_keys "video,input_audio" \
  --extra_inputs "input_audio" \
  --height 512 \
  --width 768 \
  --num_frames 121 \
  --dataset_repeat 100 \
  --model_id_with_origin_paths "DiffSynth-Studio/LTX-2.3-Repackage:transformer.safetensors" \
  --learning_rate 1e-4 \
  --num_epochs 5 \
  --remove_prefix_in_ckpt "pipe.dit." \
  --output_path "./models/train/LTX2.3-T2AV_lora" \
  --lora_base_model "dit" \
  --lora_target_modules "to_k,to_q,to_v,to_out.0" \
  --lora_rank 32 \
  --use_gradient_checkpointing \
  --task "sft:train"
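
To get a feel for what `--lora_rank 32` costs, recall that a LoRA adapter on a linear layer with input size d_in and output size d_out adds two low-rank matrices with rank × (d_in + d_out) parameters in total. The sketch below applies that standard formula to the four target modules named in the command. The hidden size of 4096 is purely hypothetical, since the actual projection sizes of the LTX-2.3 DiT are not given here, and the total also depends on how many modules in the model match the target names.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


HIDDEN = 4096  # hypothetical projection size, for illustration only
TARGETS = ["to_k", "to_q", "to_v", "to_out.0"]  # from --lora_target_modules

per_layer = lora_param_count(HIDDEN, HIDDEN, rank=32)
per_attention = len(TARGETS) * per_layer
print(per_layer, per_attention)  # 262144 1048576
```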

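Note:

Upgrading from LTX-2 to 2.3 involves one breaking change: LoRAs trained for LTX-2 cannot be used directly with 2.3. Rebuilding the VAE changed the latent space, so existing custom LoRAs need to be retrained.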