New Qwen3-VL members 2B and 32B are here, and a better fit for developers!

ModelScope Community


ModelScope Community · Published 2025-10-23 11:34:12

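The Qwen3-VL family is growing again!
Today the Tongyi Qianwen (Qwen) team released two new Dense models, 2B and 32B, covering developers' visual-understanding needs from lightweight experiments to heavy-duty reasoning.

Whether you want to run a demo on-device or take on complex multimodal tasks on a server cluster, there is a Qwen3-VL model for you.

Model collection: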
👉 https://modelscope.cn/collections/Qwen3-VL-5c7a94c8cb144b

01 Model Overview

Two variants, pick what you need

Each model ships in both an Instruct and a Thinking variant to suit different kinds of tasks:

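Instruct: faster responses and more stable execution, optimized for dialogue, tool calling, and similar scenarios;
Thinking: strengthened long-chain reasoning and complex visual understanding, so the model can reason over what it sees and handle harder tasks.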
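
As a rough illustration of the size and variant choice above, here is a minimal sketch of a helper that builds a model ID. The function name `pick_model_id` is hypothetical, and only the `-Instruct` IDs appear in this article; the `-Thinking` IDs are assumed to follow the same naming pattern, so check the model collection before relying on them.

```python
# Hypothetical helper, not part of modelscope or transformers.
# Assumption: "-Thinking" IDs mirror the "-Instruct" naming used in this article.
SIZES = ("2B", "32B")
VARIANTS = ("Instruct", "Thinking")


def pick_model_id(size: str, variant: str) -> str:
    """Build a model ID such as 'Qwen/Qwen3-VL-2B-Instruct'."""
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}, expected one of {SIZES}")
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}, expected one of {VARIANTS}")
    return f"Qwen/Qwen3-VL-{size}-{variant}"


# Quick on-device chat demo vs. harder reasoning on a server
print(pick_model_id("2B", "Instruct"))
print(pick_model_id("32B", "Thinking"))
```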

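Strong performance: small models with big impact

Qwen3-VL-32B: 32B parameters, beating 235B?

Qwen3-VL-32B performs strongly on a range of established benchmarks:

In key areas such as STEM, VQA, OCR, video understanding, and agent tasks, it outperforms GPT-5 mini and Claude 4 Sonnet across the board;
On OSWorld (an operating-system interaction benchmark), it even beats an MoE model with 235B parameters.

You can win without piling on parameters.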

Qwen3-VL-2B: small size, big smarts

Don't underestimate these 2B parameters. The model is built for on-device deployment:

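It runs smoothly on resource-constrained devices such as phones and edge boxes;
Developers can run experiments and validate ideas quickly without waiting in a cloud queue;
Lightweight, efficient, and ready out of the box, it is a great companion for prototyping.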
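
To put the two model sizes in perspective, here is a back-of-the-envelope sketch of the memory needed just to hold the weights. It assumes the nominal parameter counts (2B and 32B) and 2 bytes per parameter (bf16); it ignores activations, the KV cache, and framework overhead, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in GiB (bf16/fp16 = 2 bytes per parameter)."""
    return num_params * bytes_per_param / 2**30


# Nominal parameter counts taken from the model names (an approximation).
for name, num_params in (("Qwen3-VL-2B", 2e9), ("Qwen3-VL-32B", 32e9)):
    print(f"{name}: ~{weight_memory_gib(num_params):.1f} GiB of weights in bf16")
```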

Instruct model results

Thinking model results

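Text capabilities are just as strong: multimodal models are the future foundation models

Don't let the words "vision-language" limit your imagination. The Qwen3-VL series is equally strong on pure text: on core measures such as language understanding, logical reasoning, and code generation, it holds its own against text-only LLMs of the same size. That means you no longer have to sacrifice language performance for multimodality, or choose between text and vision tasks.

Multimodal LLMs are becoming standard for the next generation of foundation models.
Whether it is mixed image-text input, video semantic parsing, plain-text dialogue, or complex programming, Qwen3-VL can handle it all: one model across many scenarios, which is what a general-purpose intelligence base should be.

02 Model Inference

Run inference with the free compute provided by the ModelScope community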

from modelscope import Qwen3VLForConditionalGeneration, AutoProcessor
# default: Load the model on the available device(s)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-2B-Instruct", dtype="auto", device_map="auto"
)
# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# (This variant also needs `import torch` for torch.bfloat16.)
# model = Qwen3VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen3-VL-2B-Instruct",
#     dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-2B-Instruct")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
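
The `generated_ids_trimmed` step in the script can be easy to misread: `generate` returns the prompt tokens followed by the new tokens, so the prompt prefix is sliced off before decoding. Here is a minimal sketch of the same slicing on plain Python lists. The token IDs are made up for illustration and `trim_prompt_tokens` is a hypothetical name.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the prompt prefix from each generated sequence, keeping only new tokens."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Made-up token IDs: each generated row starts with its own prompt.
prompt_batch = [[101, 7, 8], [101, 9, 10]]
generated_batch = [[101, 7, 8, 42, 43], [101, 9, 10, 44]]

print(trim_prompt_tokens(prompt_batch, generated_batch))  # [[42, 43], [44]]
```

Without this step, the decoded text would repeat the chat template and the user prompt before the answer.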

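GPU memory usage

03 Model Training

ms-swift supports fine-tuning Qwen3-VL. ms-swift is the official training and deployment framework for LLMs and multimodal LLMs from the ModelScope community. ms-swift on GitHub: https://github.com/modelscope/ms-swift

Before you start fine-tuning, make sure your environment is ready.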

pip install transformers qwen_vl_utils -U
# pip install git+https://github.com/modelscope/ms-swift.git
git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e .

To fine-tune on a custom dataset, prepare your data in the following format and pass `--dataset train.jsonl --val_dataset val.jsonl` on the command line. The validation set is optional.

{"messages": [{"role": "user", "content": "Where is the capital of Zhejiang?"}, {"role": "assistant", "content": "The capital of Zhejiang is Hangzhou."}]}
{"messages": [{"role": "user", "content": "<image><image>What is the difference between the two images?"}, {"role": "assistant", "content": "The first one is a kitten, and the second one is a puppy."}], "images": ["/xxx/x.jpg", "/xxx/x.png"]}
{"messages": [{"role": "system", "content": "You are a helpful and harmless assistant."}, {"role": "user", "content": "<image>What is in the image, <video>what is in the video?"}, {"role": "assistant", "content": "The image shows an elephant, and the video shows a puppy running on the grass."}], "images": ["/xxx/x.jpg"], "videos": ["/xxx/x.mp4"]}
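
Before starting a long training run, it can help to sanity-check a JSONL file. The sketch below assumes, based only on the examples above, that each `<image>` or `<video>` tag in the user messages corresponds to one entry in `images` or `videos`. That convention is inferred from the samples rather than taken from the ms-swift documentation, and `check_sample` is a hypothetical helper.

```python
import json


def check_sample(sample: dict) -> list:
    """Return a list of problems; an empty list means tags and file lists line up."""
    user_text = "".join(
        m["content"] for m in sample["messages"] if m["role"] == "user"
    )
    problems = []
    for tag, key in (("<image>", "images"), ("<video>", "videos")):
        n_tags = user_text.count(tag)
        n_files = len(sample.get(key, []))
        if n_tags != n_files:
            problems.append(f"{n_tags} {tag} tag(s) vs {n_files} path(s) in '{key}'")
    return problems


samples = [
    {   # consistent: two tags, two image paths
        "messages": [
            {"role": "user", "content": "<image><image>What is the difference?"},
            {"role": "assistant", "content": "A kitten and a puppy."},
        ],
        "images": ["/xxx/x.jpg", "/xxx/x.png"],
    },
    {   # inconsistent: one tag, but no "images" field
        "messages": [
            {"role": "user", "content": "<image>What is this?"},
            {"role": "assistant", "content": "An elephant."},
        ],
    },
]

for sample in samples:
    line = json.dumps(sample, ensure_ascii=False)  # one JSON object per line, as in train.jsonl
    print(check_sample(json.loads(line)) or "ok")
```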

The format for grounding tasks is as follows:

{"messages": [{"role": "user", "content": "<image>Find <ref-object> in the image"}, {"role": "assistant", "content": "[\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"},\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n]"}], "images": ["cat.png"], "objects": {"ref": ["sheep", "sheep", "sheep"], "bbox": [[90.9, 160.8, 135, 212.8], [360.9, 480.8, 495, 532.8]]}}
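
In the grounding sample above, the three `<ref-object>` placeholders line up with the three entries in `objects.ref`, and the two `<bbox>` placeholders with the two entries in `objects.bbox`. Here is a minimal sketch of a check for that pairing. It assumes placeholders are filled in order, one per list entry, which is inferred from the example rather than confirmed against ms-swift; `check_grounding` is a hypothetical helper.

```python
def check_grounding(sample: dict) -> dict:
    """Return {placeholder: (count_in_text, count_in_objects)} for every mismatch."""
    text = "".join(m["content"] for m in sample["messages"])
    objects = sample.get("objects", {})
    counts = {
        "<ref-object>": (text.count("<ref-object>"), len(objects.get("ref", []))),
        "<bbox>": (text.count("<bbox>"), len(objects.get("bbox", []))),
    }
    return {tag: pair for tag, pair in counts.items() if pair[0] != pair[1]}


sample = {
    "messages": [
        {"role": "user", "content": "<image>Find <ref-object> in the image"},
        {
            "role": "assistant",
            "content": '[\n\t{"bbox_2d": <bbox>, "label": "<ref-object>"},'
                       '\n\t{"bbox_2d": <bbox>, "label": "<ref-object>"}\n]',
        },
    ],
    "images": ["cat.png"],
    "objects": {
        "ref": ["sheep", "sheep", "sheep"],
        "bbox": [[90.9, 160.8, 135, 212.8], [360.9, 480.8, 495, 532.8]],
    },
}

print(check_grounding(sample))  # {} means every placeholder has a matching entry
```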

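ms-swift supports training Qwen3-VL with both the transformers and megatron backends; only the transformers backend is covered here. To train Qwen3-VL with megatron, see the documentation: https://swift.readthedocs.io/en/latest/BestPractices/Qwen3-VL-Best-Practice.html#training

Below is a fine-tuning script for Qwen3-VL-32B-Instruct. It uses mixed-modality data as a demo dataset and is for demonstration only. Training uses 2 * 50GiB of GPU memory and takes 80 minutes.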

# 2 * 50GiB
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
IMAGE_MAX_TOKEN_NUM=1024 \
VIDEO_MAX_TOKEN_NUM=128 \
FPS_MAX_FRAMES=16 \
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
swift sft \
    --model Qwen/Qwen3-VL-32B-Instruct \
    --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#10000' \
              'AI-ModelScope/LaTeX_OCR:human_handwrite#5000' \
              'swift/VideoChatGPT:Generic#2000' \
    --load_from_cache_file true \
    --split_dataset_ratio 0.01 \
    --train_type lora \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --attn_impl flash_attn \
    --padding_free true \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --freeze_vit true \
    --freeze_aligner true \
    --packing true \
    --gradient_checkpointing true \
    --vit_gradient_checkpointing false \
    --gradient_accumulation_steps 2 \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 4096 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --deepspeed zero3 \
    --dataset_num_proc 4 \
    --dataloader_num_workers 4
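
A few of the numbers in this script interact, so here is a small worked calculation. It is a sketch under stated assumptions: that the effective batch size is processes × per-device batch × accumulation steps, that with `--packing true` each batch item is one packed sequence of at most `--max_length` tokens, and that the LoRA update is scaled by `lora_alpha / lora_rank` as in standard LoRA. How ms-swift rounds the validation split is not stated here, so that figure is approximate.

```python
def effective_batch_size(num_processes: int, per_device_batch: int, grad_accum: int) -> int:
    """Sequences contributing to one optimizer step across all processes."""
    return num_processes * per_device_batch * grad_accum


# Values copied from the script above
batch = effective_batch_size(num_processes=2, per_device_batch=1, grad_accum=2)
max_tokens_per_step = batch * 4096          # upper bound, assuming packed sequences <= max_length
lora_scale = 32 / 8                         # lora_alpha / lora_rank (standard LoRA scaling)
requested_samples = 10000 + 5000 + 2000     # the "#N" suffixes on the three datasets
approx_val_samples = round(requested_samples * 0.01)  # --split_dataset_ratio 0.01

print(f"effective batch size: {batch}")
print(f"max tokens per optimizer step: {max_tokens_per_step}")
print(f"LoRA scale: {lora_scale}")
print(f"requested samples: {requested_samples}, of which ~{approx_val_samples} for validation")
```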

Training GPU memory:

After training, run inference on the validation set with the following script:

PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
CUDA_VISIBLE_DEVICES=0 \
IMAGE_MAX_TOKEN_NUM=1024 \
VIDEO_MAX_TOKEN_NUM=128 \
FPS_MAX_FRAMES=16 \
swift infer \
    --adapters output/vx-xxx/checkpoint-xxx \
    --stream true \
    --max_new_tokens 2048 \
    --load_data_args true
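
The `output/vx-xxx/checkpoint-xxx` path above is a placeholder for a real run directory and step number. Here is a minimal sketch for picking the highest-numbered checkpoint inside one run directory. It assumes checkpoints are saved as `checkpoint-<step>` subdirectories, as the placeholder suggests; `latest_checkpoint` is a hypothetical helper, not an ms-swift API.

```python
import re
from pathlib import Path


def latest_checkpoint(run_dir):
    """Return the checkpoint-<step> directory with the largest step, or None if there is none."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for path in Path(run_dir).glob("checkpoint-*"):
        match = pattern.match(path.name)
        if match and path.is_dir():
            # Compare steps numerically so checkpoint-1000 beats checkpoint-200.
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]
```

The returned path can then be passed to `--adapters` in place of the placeholder.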

Push the model to ModelScope:

swift export \
    --adapters output/vx-xxx/checkpoint-xxx \
    --push_to_hub true \
    --hub_model_id '<your-model-id>' \
    --hub_token '<your-sdk-token>'

Click to go to the model collection:

https://modelscope.cn/collections/Qwen3-VL-5c7a94c8cb144b