Gemma 4 12B Open-Sourced: Encoder-Free Unified Multimodal Architecture, Runs Locally on a 16GB Laptop, Performance Close to the 26B Model

ModelScope Community · Published 2026-06-05 09:53:07

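Google DeepMind has open-sourced Gemma 4 12B, a unified multimodal model with about 11.95B parameters. Its most distinctive feature is that it drops the separate vision and audio encoders found in conventional multimodal models, so image and audio signals go directly into the LLM backbone. It is also the first mid-sized model in the Gemma 4 family with native audio input. In performance it approaches the 26B MoE model, which is roughly twice its size, while needing less than half the GPU memory: 78.8% on GPQA Diamond, 77.5% on AIME 2026 (no tools), and 77.2% on MMLU Pro, surpassing the previous-generation Gemma 3 27B across the board. More importantly, a laptop with 16GB of GPU memory can run it locally. The model is released under the Apache 2.0 license and supports a 256K context window and 140+ languages.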
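
The memory claim can be sanity-checked with a rough weight-only estimate. The sketch below assumes common storage widths (2 bytes per parameter for bf16, 1 for 8-bit, 0.5 for 4-bit). The article does not say which precision the 16GB figure refers to, and the estimate ignores the KV cache and activations, so treat it as a lower bound rather than a measurement.

```python
# Rough weight-only memory estimate for an ~11.95B-parameter model.
PARAMS = 11.95e9  # parameter count quoted in the article

# Assumed storage widths in bytes per parameter (not stated in the article).
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gib(num_params: float, bytes_per_param: float) -> float:
    """Return weight storage in GiB, ignoring KV cache and activations."""
    return num_params * bytes_per_param / 2**30


for name, width in BYTES_PER_PARAM.items():
    print(f"{name}: {weight_gib(PARAMS, width):.1f} GiB")
```

Under these assumptions the bf16 weights alone exceed 16GB, so a 16GB device would presumably rely on a quantized variant.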

Open-source links:

● gemma-4-12B：

https://www.modelscope.cn/models/google/gemma-4-12B

● gemma-4 model collection:

https://www.modelscope.cn/collections/google/Gemma-4

Key Technical Points

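Unified encoder-free multimodality: Conventional multimodal models rely on separate encoders to first translate images and audio into representations the LLM can read, which adds latency and memory overhead. Gemma 4 12B replaces the vision encoder with a lightweight embedding module consisting of "a single matrix multiplication + position embedding + normalization", and removes the audio encoder entirely: the raw audio signal is projected directly into the same dimensional space as text tokens. All modalities feed into the same decoder-only Transformer, leaving the heavy lifting of multimodal understanding to the LLM backbone. This lowers latency, shrinks the deployment footprint, and lets the whole model be fine-tuned end to end in one pass.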
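
To make the "single matrix multiplication + position embedding + normalization" recipe concrete, here is a minimal dependency-free sketch. The patch size, channel count, embedding width, and the choice of RMS normalization are all illustrative assumptions; the article does not specify them, and this is not the model's actual implementation.

```python
import math
import random


def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened patches (row-major)."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [
                channel
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for channel in pixel
            ]
            patches.append(flat)
    return patches


def rms_norm(vec, eps=1e-6):
    """Scale a vector to unit root-mean-square (assumed normalization type)."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms for v in vec]


def embed_patches(patches, weight, pos_emb):
    """One matrix multiply per patch, plus position embedding, then normalization."""
    tokens = []
    for idx, flat in enumerate(patches):
        # weight has shape (patch_dim, d_model); zip(*weight) iterates its columns.
        projected = [sum(x * w for x, w in zip(flat, col)) for col in zip(*weight)]
        with_pos = [p + e for p, e in zip(projected, pos_emb[idx])]
        tokens.append(rms_norm(with_pos))
    return tokens


# Hypothetical toy sizes, chosen only so the example runs instantly.
random.seed(0)
H = W = 4
C, PATCH, D_MODEL = 3, 2, 8
image = [[[random.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
patch_dim = PATCH * PATCH * C
weight = [[random.gauss(0, 0.02) for _ in range(D_MODEL)] for _ in range(patch_dim)]
num_patches = (H // PATCH) * (W // PATCH)
pos_emb = [[random.gauss(0, 0.02) for _ in range(D_MODEL)] for _ in range(num_patches)]

tokens = embed_patches(patchify(image, PATCH), weight, pos_emb)
print(len(tokens), "image tokens of width", len(tokens[0]))
```

The resulting vectors have the same width as text token embeddings, so they can be placed in the same sequence and handed to the decoder without a separate encoder stack.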

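Hybrid attention + long-context optimization: The model interleaves local sliding-window attention (window of 1024 tokens) with global attention and guarantees that the final layer is always global, preserving the global awareness that long contexts need while keeping the speed and low memory footprint of a lightweight model. The global layers use unified Keys and Values and apply Proportional RoPE (p-RoPE), further optimizing memory usage for long contexts.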
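
The interleaving can be sketched as a per-layer schedule plus a visibility rule. The 1024-token window and the "last layer is global" rule come from the text; the layer count, the local-to-global ratio, and the exact window boundary convention are assumptions made for illustration.

```python
WINDOW = 1024  # sliding-window size quoted in the article


def layer_schedule(num_layers, global_every):
    """Mark every `global_every`-th layer global; force the last layer global.

    `num_layers` and `global_every` are hypothetical knobs, not published values.
    """
    kinds = [
        "global" if (i + 1) % global_every == 0 else "local"
        for i in range(num_layers)
    ]
    kinds[-1] = "global"
    return kinds


def can_attend(query_pos, key_pos, kind, window=WINDOW):
    """Causal visibility: local layers see only the most recent `window` tokens."""
    if key_pos > query_pos:
        return False  # never attend to future tokens
    if kind == "global":
        return True
    return query_pos - key_pos < window


def cached_positions(seq_len, kind, window=WINDOW):
    """Key/value positions a layer must keep in order to decode the next token."""
    return seq_len if kind == "global" else min(seq_len, window)


schedule = layer_schedule(num_layers=10, global_every=4)
print(schedule)
print(sum(cached_positions(100_000, kind) for kind in schedule), "cached positions in total")
```

Because local layers cap their cache at the window size, only the global layers grow with sequence length, which is where the long-context memory savings come from.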

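Configurable thinking mode: Adding the <|think|> control token at the start of the system prompt turns on step-by-step reasoning, and removing it turns reasoning off. In multi-turn conversations, previous replies in the history keep only the final answer; the thinking content from earlier turns is not carried forward.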
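
A minimal sketch of these two rules follows. The `<|think|>` token is taken from the text; the history record layout (`user`, `thinking`, `answer` keys) and the direct concatenation of the token with the system prompt are assumptions. In practice the chat template's `enable_thinking` flag handles the token for you.

```python
THINK_TOKEN = "<|think|>"  # control token named in the article


def build_messages(system_prompt, history, user_turn, thinking):
    """Assemble chat messages, toggling thinking and dropping old reasoning.

    `history` is a list of dicts with hypothetical keys
    'user', 'thinking', and 'answer'.
    """
    system = f"{THINK_TOKEN}{system_prompt}" if thinking else system_prompt
    messages = [{"role": "system", "content": system}]
    for turn in history:
        messages.append({"role": "user", "content": turn["user"]})
        # Keep only the final answer; turn["thinking"] is intentionally dropped.
        messages.append({"role": "assistant", "content": turn["answer"]})
    messages.append({"role": "user", "content": user_turn})
    return messages


history = [{"user": "Hi", "thinking": "step-by-step notes", "answer": "Hello!"}]
for message in build_messages("You are a helpful assistant.", history, "What is 2 + 2?", thinking=True):
    print(message)
```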

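Variable image resolution: The image token budget can be set to 70, 140, 280, 560, or 1120. Classification, captioning, and video frame sampling trade a lower budget for speed, while small-text tasks such as OCR and document parsing use a higher budget to preserve detail.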
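
One way to reason about this trade-off is a small helper that maps a task to a budget and counts how many images fit in a given context. The five budget values come from the text; the task-to-budget mapping, the default, and the function names are hypothetical, and the sketch does not show how the budget is passed to the processor.

```python
IMAGE_TOKEN_BUDGETS = (70, 140, 280, 560, 1120)  # values listed in the article

# Hypothetical mapping following the article's guidance:
# coarse tasks get low budgets, small-text tasks get high budgets.
TASK_BUDGET = {
    "classification": 70,
    "captioning": 140,
    "video_frame": 70,
    "ocr": 1120,
    "document_parsing": 1120,
}


def pick_budget(task, default=280):
    """Return a supported image token budget for a task name."""
    budget = TASK_BUDGET.get(task, default)
    if budget not in IMAGE_TOKEN_BUDGETS:
        raise ValueError(f"unsupported budget: {budget}")
    return budget


def images_that_fit(context_tokens, budget, reserved_text_tokens):
    """How many images of a given budget fit beside the reserved text tokens."""
    return max(0, (context_tokens - reserved_text_tokens) // budget)


print(pick_budget("ocr"), pick_budget("video_frame"))
print(images_that_fit(context_tokens=10_000, budget=70, reserved_text_tokens=3_000))
```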

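Agents and decoding acceleration: The model natively supports structured function calling and adds a native system role, which makes controllable agent workflows easier to build. A built-in Multi-Token Prediction (MTP) draft module reduces inference latency.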
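
On the application side, structured function calling usually means declaring tools with a schema and dispatching the calls the model requests. The sketch below shows only that host-side loop. The tool, its schema layout, and the parsed call shape (`name` plus `arguments`) are assumptions; the serialized format the model actually emits is defined by its chat template and is not covered here.

```python
import json


def get_weather(city: str) -> dict:
    """Hypothetical tool that returns canned data instead of calling a service."""
    return {"city": city, "forecast": "sunny"}


# Tool registry: the callable plus a JSON-schema-style description (assumed layout).
TOOLS = {
    "get_weather": {
        "fn": get_weather,
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }
}


def dispatch(call: dict) -> dict:
    """Validate a parsed tool call and return a tool-role message with the result."""
    name = call.get("name")
    spec = TOOLS.get(name)
    if spec is None:
        payload = {"error": f"unknown tool: {name}"}
    else:
        args = call.get("arguments", {})
        missing = [key for key in spec["parameters"]["required"] if key not in args]
        payload = {"error": f"missing arguments: {missing}"} if missing else spec["fn"](**args)
    return {"role": "tool", "name": name, "content": json.dumps(payload)}


print(dispatch({"name": "get_weather", "arguments": {"city": "Paris"}}))
```

Returning errors as tool messages, rather than raising, lets the model see the failure and retry with corrected arguments.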

Model Inference

Inference with Transformers

First, install the required dependencies in your environment:

pip install -U transformers torch accelerate

Next, download the model:

modelscope download --model google/gemma-4-12B-it --local_dir google/gemma-4-12B-it

Then load the model with the following code:

from transformers import AutoProcessor, AutoModelForMultimodalLM
MODEL_ID = "google/gemma-4-12B-it"
# Load model
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto"
)

You can then run the following script to generate output:

# Prompt
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a short joke about saving RAM."},
]
# Process input
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    enable_thinking=False
).to(model.device)
input_len = inputs["input_ids"].shape[-1]
# Generate output
outputs = model.generate(**inputs, max_new_tokens=1024)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
# Parse output and print the structured result
print(processor.parse_response(response))
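
The script above sends text only. For image input, a plausible message layout is the content-list convention that Transformers chat templates use for other multimodal models. Whether this checkpoint's template accepts exactly this layout is an assumption, and the helper name and URL below are hypothetical.

```python
def image_question(image_url: str, question: str) -> list[dict]:
    """Build a single-turn chat with one image part and one text part.

    The {"type": "image", "url": ...} layout is an assumption based on the
    general Transformers chat-template convention.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": question},
            ],
        }
    ]


# Hypothetical URL, used only to show the structure.
messages = image_question("https://example.com/receipt.png", "What is the total amount?")
print(messages)
```

If the layout is accepted, `messages` would replace the text-only list passed to `processor.apply_chat_template`, with the rest of the script unchanged.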

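Model Fine-tuning

ms-swift supports fine-tuning gemma-4-12B-it series models with either the transformers or megatron backend. ms-swift is the official large-model training framework from the ModelScope community, open-sourced at https://github.com/modelscope/ms-swift

This section covers training with the megatron backend. For training scripts using the transformers backend, see: https://github.com/modelscope/ms-swift/tree/main/examples/models/gemma4

Environment setup: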

pip install git+https://github.com/modelscope/ms-swift.git
pip install git+https://github.com/modelscope/mcore-bridge.git
# For more dependency installation details, see: https://swift.readthedocs.io/zh-cn/latest/Megatron-SWIFT/Quick-start.html

pip install "transformers>=5.10" -U

The fine-tuning script is shown below; GPU memory usage is 4 * 65GiB:

PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
megatron sft \
    --model google/gemma-4-12B-it \
    --save_safetensors true \
    --dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#2000' \
    --load_from_cache_file true \
    --add_non_thinking_prefix true \
    --loss_scale ignore_empty_think \
    --split_dataset_ratio 0.01 \
    --tuner_type full \
    --tensor_model_parallel_size 4 \
    --micro_batch_size 16 \
    --global_batch_size 16 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --num_train_epochs 1 \
    --finetune true \
    --freeze_llm false \
    --freeze_vit true \
    --freeze_aligner true \
    --cross_entropy_loss_fusion true \
    --lr 1e-5 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-6 \
    --output_dir megatron_output/gemma-4-12B-it \
    --eval_steps 500 \
    --save_steps 500 \
    --max_length 4096 \
    --dataloader_num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --sequence_parallel true \
    --attention_backend unfused \
    --group_by_length true \
    --padding_free false
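
The parallelism and batch flags in the script interact, and the arithmetic is easy to check. The sketch below assumes no pipeline or context parallelism, that the validation split is rounded to the nearest sample, and that a partial final batch is dropped. These are assumptions about framework behavior, not documented ms-swift semantics.

```python
def training_plan(num_gpus, tp_size, micro_batch, global_batch,
                  num_samples, val_ratio, warmup_fraction):
    """Derive data-parallel size, gradient accumulation, and step counts."""
    dp_size = num_gpus // tp_size  # GPUs left for data parallelism
    grad_accum = global_batch // (micro_batch * dp_size)
    val_samples = round(num_samples * val_ratio)
    train_samples = num_samples - val_samples
    steps = train_samples // global_batch  # assumes the partial last batch is dropped
    return {
        "dp_size": dp_size,
        "grad_accum": grad_accum,
        "train_samples": train_samples,
        "val_samples": val_samples,
        "steps_per_epoch": steps,
        "warmup_steps": int(steps * warmup_fraction),
    }


# Values taken from the script: 4 GPUs, TP=4, micro/global batch 16,
# 2000 sampled examples, 1% validation split, 5% warmup.
plan = training_plan(num_gpus=4, tp_size=4, micro_batch=16, global_batch=16,
                     num_samples=2000, val_ratio=0.01, warmup_fraction=0.05)
print(plan)
```

Under these assumptions all four GPUs form a single tensor-parallel group, each optimizer step consumes one micro-batch, and the one-epoch run is shorter than the 500-step eval and save interval.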

After training, run inference on the validation set:

CUDA_VISIBLE_DEVICES=0 swift infer \
    --model megatron_output/gemma-4-12B-it/vx-xxx/checkpoint-xxx \
    --stream true \
    --enable_thinking false \
    --load_data_args true \
    --max_new_tokens 2048