Add Phi-4 multimodal support #4648

Open
yicycyc wants to merge 1 commit into PaddlePaddle:develop from yicycyc:feat/phi4-multimodal-migration

yicycyc commented Jun 10, 2026

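New model support: Phi-4-multimodal

This PR adds support for microsoft/Phi-4-multimodal-instruct. The reference for reproduction and alignment is the Microsoft/HuggingFace remote-code implementation, together with the original safetensors weights on ModelScope.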

Main changes

  • Add paddleformers/transformers/phi4_multimodal/
    • Phi4MultimodalConfig
    • Phi4MultimodalModel
    • Phi4MultimodalForCausalLM
    • Phi4MultimodalForCausalLMPipe
    • Phi4MultimodalProcessor
    • Phi4MultimodalImageProcessor
    • Phi4MultimodalFeatureExtractor
    • Text decoder, LongRoPE, RMSNorm, MLP, eager attention, and cached generation
    • Vision branch: ViT, patch embedding, vision attention, pooling head, image projector, image feature merge
    • Audio branch: audio feature extractor, Conformer encoder, audio projector, speech / vision-speech adapter routing
  • Register Phi-4-multimodal with:
    • paddleformers.transformers
    • AutoConfig
    • AutoModel
    • AutoModelForCausalLM
    • AutoProcessor
    • AutoImageProcessor
    • AutoFeatureExtractor
  • Add a multimodal SFT input pipeline for Phi-4-multimodal:
    • Add the phi4_multimodal template / MM plugin
    • mm_collate_fn supports image_pixel_values, image_sizes, and image_attention_mask
    • mm_collate_fn supports audio_input_features, audio_attention_mask, and audio_embed_sizes
    • Support input_mode, which selects the text / vision / speech / vision-speech adapter route
    • The SFT label mask can mask out image and audio tokens
  • Weight loading:
    • Phi4MultimodalPreTrainedModel._gen_aoa_config implements the automatic conversion rules from HF/ModelScope safetensors to PaddleFormers weights
    • The original phi4mm config is converted to the Paddle Phi4MultimodalConfig
    • Users can load directly from the original safetensors directory without running an offline weight-conversion script by hand
  • Documentation:
    • Update the model list in the README
    • Update the capability matrix in docs/zh/model_capability.md
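
To illustrate the SFT input items listed above, here is a minimal sketch of routing by input_mode and masking media tokens out of the labels. It is not the PR's code: the token ids, the `InputMode` enum, and both function names are hypothetical, and `-100` is simply the usual "ignore in the loss" convention.

```python
from enum import Enum

IGNORE_INDEX = -100  # conventional "skip this position in the loss" label

# Hypothetical placeholder ids, chosen only for this example.
IMAGE_TOKEN_ID = 9001
AUDIO_TOKEN_ID = 9002


class InputMode(Enum):
    """Hypothetical enum mirroring the four adapter routes named in the PR."""
    TEXT = "text"
    VISION = "vision"
    SPEECH = "speech"
    VISION_SPEECH = "vision-speech"


def infer_input_mode(has_image, has_audio):
    """Pick an adapter route from which modalities a sample contains."""
    if has_image and has_audio:
        return InputMode.VISION_SPEECH
    if has_image:
        return InputMode.VISION
    if has_audio:
        return InputMode.SPEECH
    return InputMode.TEXT


def mask_media_labels(input_ids):
    """Copy input_ids into labels, ignoring image/audio placeholder tokens."""
    media_ids = {IMAGE_TOKEN_ID, AUDIO_TOKEN_ID}
    return [IGNORE_INDEX if tok in media_ids else tok for tok in input_ids]


if __name__ == "__main__":
    ids = [1, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 5, AUDIO_TOKEN_ID, 7]
    print(infer_input_mode(has_image=True, has_audio=True))
    print(mask_media_labels(ids))
```

Masking the placeholders keeps the loss from rewarding the model for predicting positions that are later overwritten by image or audio embeddings.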

Forward alignment verification

Model: microsoft/Phi-4-multimodal-instruct

Environment settings:

  • Paddle: FLAGS_use_accuracy_compatible_kernel=1, FLAGS_cudnn_deterministic=1, NVIDIA_TF32_OVERRIDE=0
  • Torch: torch.backends.cudnn.deterministic=True, torch.backends.cudnn.allow_tf32=False
  • Attention backend: eager on both sides
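
The settings above could be applied along these lines. This is a sketch that assumes the Paddle flags are passed as environment variables set before the framework is imported; the helper name `apply_deterministic_env` is hypothetical, and the Torch lines are left as comments so the snippet runs without Torch installed.

```python
import os

# Values copied from the environment settings listed in the PR.
PADDLE_ENV = {
    "FLAGS_use_accuracy_compatible_kernel": "1",
    "FLAGS_cudnn_deterministic": "1",
    "NVIDIA_TF32_OVERRIDE": "0",
}


def apply_deterministic_env(env=os.environ):
    """Write the flags into env (defaults to the current process environment)."""
    for name, value in PADDLE_ENV.items():
        env[name] = value
    return env


if __name__ == "__main__":
    # Call this before importing paddle so the flags are seen at start-up.
    apply_deterministic_env()
    # Torch side, as listed in the PR (uncomment where torch is available):
    # import torch
    # torch.backends.cudnn.deterministic = True
    # torch.backends.cudnn.allow_tf32 = False
    print({k: os.environ[k] for k in PADDLE_ENV})
```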

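Final logits in fp32:

| Modality | Adapter | HF top token | Paddle top token | Logits max abs diff | Logits mean abs diff | Result |
|---|---|---|---|---|---|---|
| text | none | 1 | 1 | 0.0 | 0.0 | Aligned |
| vision | vision | 199999 | 199999 | 0.0001831055 | 0.0000245297 | Aligned |
| audio | speech | 11 | 11 | 0.0012969971 | 0.0002102744 | Aligned |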
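
The three metrics reported per modality (top token, max abs diff, mean abs diff) can be computed as in the sketch below. It works on plain Python lists for a single position's logits; the function name `compare_logits` is hypothetical, and the PR's actual comparison script is not shown in this description.

```python
def compare_logits(ref_logits, test_logits):
    """Compare two logit vectors of equal length for one sequence position."""
    if not ref_logits or len(ref_logits) != len(test_logits):
        raise ValueError("logit vectors must be non-empty and of equal length")
    diffs = [abs(a - b) for a, b in zip(ref_logits, test_logits)]
    return {
        # argmax over the vocabulary dimension
        "ref_top_token": max(range(len(ref_logits)), key=ref_logits.__getitem__),
        "test_top_token": max(range(len(test_logits)), key=test_logits.__getitem__),
        "max_abs_diff": max(diffs),
        "mean_abs_diff": sum(diffs) / len(diffs),
    }


if __name__ == "__main__":
    # Toy values for illustration only, not taken from the PR.
    print(compare_logits([0.0, 2.0, 1.0], [0.5, 2.0, 1.0]))
```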

Additional text results in bf16:

| Input length | Logits max abs diff | Logits mean abs diff | Top token matches? |
|---|---|---|---|
| 1 token | 0.0 | 0.0 | |
| 2 tokens | 0.0 | 0.0 | |
| 4 tokens | 0.0 | 0.0 | |

Generation alignment

Text-only generation:

| dtype | Input | First 10 tokens | Step logits max abs diff | Step logits mean abs diff | Result |
|---|---|---|---|---|---|
| bf16 | [[100, 101]] | [220, 17, 87, 659, 220, 18, 88, 314, 220, 899] | 0.0 | 0.0 | Aligned |
| fp32 | [[100, 101]] | [220, 17, 87, 659, 220, 18, 88, 314, 220, 899] | 3.6239624e-05 | 2.6317712e-06 | Aligned |

Vision generation:

| dtype | Input | First 10 tokens | Step logits max abs diff | Step logits mean abs diff | Result |
|---|---|---|---|---|---|
| bf16 | synthetic real-shape image tokens, seq len 549 | [200, 200, 200, 200, 200, 200, 200, 200, 200, 200] | 0.0 | 0.0 | Aligned |
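
A check of the "first 10 tokens" columns can be as simple as finding the first step at which two generated sequences differ. The sketch below assumes both sides produce plain lists of token ids; the helper name `first_divergence` is hypothetical.

```python
def first_divergence(ref_tokens, test_tokens):
    """Return the first step where the sequences differ, or None if identical."""
    for step, (ref_tok, test_tok) in enumerate(zip(ref_tokens, test_tokens)):
        if ref_tok != test_tok:
            return step
    if len(ref_tokens) != len(test_tokens):
        # One sequence is a strict prefix of the other.
        return min(len(ref_tokens), len(test_tokens))
    return None


if __name__ == "__main__":
    # Token ids from the text-only generation table in the PR.
    hf_tokens = [220, 17, 87, 659, 220, 18, 88, 314, 220, 899]
    paddle_tokens = [220, 17, 87, 659, 220, 18, 88, 314, 220, 899]
    print(first_divergence(hf_tokens, paddle_tokens))
```

Reporting the step index rather than a boolean makes it easier to see whether a mismatch appears at the prefill step or only after several cached decoding steps.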

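Audio notes:

  • In bf16, the first 10 generated tokens for audio already match.
  • The logits still differ; the difference has been traced to the bf16 Conv1D operator behaving differently in Paddle and Torch.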
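
The PR does not say which aspect of the two Conv1D kernels differs. As a generic illustration only, the sketch below shows one way two bf16 kernels can disagree on the same inputs: the order in which they accumulate products. bf16 is emulated here by rounding float32 values to their top 16 bits (round to nearest, ties to even); `to_bf16` and `bf16_sum` are hypothetical helpers, not part of Paddle or Torch.

```python
import struct


def to_bf16(x):
    """Round a Python float to the nearest bf16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add half of the dropped range, plus the low kept bit to break ties to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_sum(values):
    """Sum left to right, rounding the accumulator to bf16 after every add."""
    acc = to_bf16(0.0)
    for v in values:
        acc = to_bf16(acc + to_bf16(v))
    return acc


if __name__ == "__main__":
    values = [256.0, 1.0, 1.0, 1.0, 1.0]
    # Near 256 the bf16 spacing is 2, so each "+ 1" is rounded away.
    print(bf16_sum(values))                   # large term first
    print(bf16_sum(list(reversed(values))))   # small terms first
```

With the large term first the small terms are lost one by one, while adding the small terms first preserves them, so the two orders give different sums from identical inputs.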

Training verification

1. Text

BF16 full SFT on GSM8K. The Paddle side uses sharding stage3; the Torch side uses ms-swift with ZeRO-3.

Common settings:

  • global batch size = 4
  • learning_rate = 1e-5
  • warmup_steps = 0
  • weight_decay = 0
  • shuffle disabled
  • attention eager

The loss curves are as follows:

[Image: phi4_gsm8k_len512_300_loss_curve_paddle_vs_transformers]

2. Vision multimodal

BF16 full SFT on a small multimodal SFT dataset. The Paddle side uses sharding stage3; the Torch side uses ms-swift with ZeRO-3.

The loss curves are as follows:

[Image: phi4mm_tiny_vision_300_loss_curve_detailed]
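
Since the curves themselves are only available as images, a numeric comparison of two per-step loss logs could look like the following sketch. The function name `compare_loss_curves` is hypothetical and the sample values are made up for illustration; they are not results from this PR.

```python
def compare_loss_curves(ref_losses, test_losses):
    """Return (worst_step, max_abs_diff, mean_abs_diff) for two loss series."""
    if not ref_losses or len(ref_losses) != len(test_losses):
        raise ValueError("loss series must be non-empty and of equal length")
    abs_diffs = [abs(a - b) for a, b in zip(ref_losses, test_losses)]
    worst_step = max(range(len(abs_diffs)), key=abs_diffs.__getitem__)
    return worst_step, abs_diffs[worst_step], sum(abs_diffs) / len(abs_diffs)


if __name__ == "__main__":
    # Made-up per-step losses, NOT taken from the PR's training runs.
    torch_losses = [2.0, 1.5, 1.0]
    paddle_losses = [2.0, 1.25, 1.0]
    print(compare_loss_curves(torch_losses, paddle_losses))
```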
