
[DO NOT MERGE] GLM-4.5 MoE forward + backward bit-wise alignment#4668

Open
zhanghonggeng wants to merge 1 commit into PaddlePaddle:develop from zhanghonggeng:glm_algin


@zhanghonggeng


Before submitting

  • Lint the code. If there are lint issues, format the code first.
# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Run the hooks on previously committed files individually
pre-commit run --files XXXX.py
  • Add test cases to the tests folder. If there are codecov issues, add test cases first.

PR types

PR changes

Description

@Paddle-CI-Bot

Paddle-CI-Bot commented Jun 14, 2026


PaddleFormers Log Analysis

Run #27546777556 · Attempt 1

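Log Analysis Report

| Pipeline | Issue label | Suggested fix | Log snippet |
| --- | --- | --- | --- |
| Unittest GPU CI | dtype mismatch bug | Check the dtype alignment between attention_probs (float32) and value (bfloat16) in the GlmMoeDsa forward pass; explicitly cast both to the same dtype before the paddle.bmm call in dot_product_attention.py | Error code |
| Model Unittest GPU CI | Loss Diff | Merge the develop branch or update the base loss baseline files for glm4_moe | Error code |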

Failed test cases:

# Unittest GPU CI
tests/transformers/glm_moe_dsa/test_modeling.py::GlmMoeDsaModelTest::test_GlmMoeDsa_lm_head_model

# Model Unittest GPU CI
scripts/regression/test_models.py::TestTrain::test_full[glm4_moe-sft]
scripts/regression/test_models.py::TestTrain::test_full[glm4_moe-dpo]
scripts/regression/test_models.py::TestTrain::test_full[glm4_moe-pt]
scripts/regression/test_models.py::TestTrain::test_full_map[glm4_moe-sft]
scripts/regression/test_models.py::TestTrain::test_lora[glm4_moe-sft]
scripts/regression/test_models.py::TestTrain::test_lora[glm4_moe-dpo]
scripts/regression/test_models.py::TestTrain::test_lora[glm4_moe-pt]
scripts/regression/test_models.py::TestTrain::test_full_tp_pp[glm4_moe-sft]
scripts/regression/test_models.py::TestTrain::test_full_tp_pp[glm4_moe-dpo]
scripts/regression/test_models.py::TestTrain::test_full_tp_pp[glm4_moe-pt]
scripts/regression/test_models.py::TestTrain::test_lora_tp_pp[glm4_moe-sft]
scripts/regression/test_models.py::TestTrain::test_lora_tp_pp[glm4_moe-pt]
scripts/regression/test_models.py::TestTrain::test_lora_tp_pp[glm4_moe-dpo]
scripts/regression/test_models.py::TestTrain::test_full_function_call[glm4_moe-sft]
scripts/regression/test_models.py::TestTrain::test_full_function_call[glm4_moe-dpo]

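Root cause analysis:

  • Unittest GPU CI: The GlmMoeDsa model calls paddle.bmm at paddlefleet/transformer/dot_product_attention.py:648 with attention_probs in float32 and value in bfloat16, and the dtype mismatch raises InvalidArgument. The glm_moe_dsa model introduced by this PR does not unify dtypes correctly during initialization or weight loading.

  • Model Unittest GPU CI: In the glm4_moe full-parameter, LoRA, and TP-PP training scenarios, the actual loss deviates from the stored baseline (base_loss) in every case, by roughly 0.001 to 0.013. This PR's changes to glm4_moe affect the numerics, so either merge the latest baseline from develop or update the base loss files.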

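Suggested fixes:

  1. dtype mismatch: In the GlmMoeDsa model's create_and_check_lm_head_model or in the model initialization code, make sure the attention tensors (q/k/v) are cast to one dtype (for example bfloat16); alternatively, configure dtype=bfloat16 in the GlmMoeDsaModelTest test so that the model is initialized at the correct precision.

  2. Loss Diff (glm4_moe):

    • First merge the latest develop branch to confirm whether the baseline drift already exists on develop.
    • If this PR does change the glm4_moe computation logic, update the corresponding base loss files under scripts/regression/ (glm4_moe_*_base_loss) to the new expected values.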
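Before deciding between the two Loss Diff options, it can help to see which training steps drift from the baseline and by how much. The sketch below is only an illustration: `find_loss_drift`, the in-memory list format, the tolerance, and the loss values are all hypothetical, and the real layout of the base loss files under scripts/regression/ is not shown in this report.

```python
import math


def find_loss_drift(base_losses, actual_losses, abs_tol=1e-6):
    """Return (step, base, actual, diff) for each step whose loss differs by more than abs_tol."""
    if len(base_losses) != len(actual_losses):
        raise ValueError("loss sequences must have the same length")
    drift = []
    for step, (base, actual) in enumerate(zip(base_losses, actual_losses)):
        # Absolute comparison only; rel_tol=0.0 disables the relative check.
        if not math.isclose(base, actual, rel_tol=0.0, abs_tol=abs_tol):
            drift.append((step, base, actual, abs(actual - base)))
    return drift


# Made-up values, chosen only to mirror the 0.001 to 0.013 range reported above.
base = [2.500, 2.400, 2.300]
actual = [2.500, 2.401, 2.313]

drift = find_loss_drift(base, actual)
for step, b, a, diff in drift:
    print(f"step {step}: base={b:.3f} actual={a:.3f} diff={diff:.3f}")
```

If the same steps drift on a clean checkout of develop, the baseline is likely stale. If they drift only with this PR applied, the change in numerics probably comes from the PR itself.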

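🔍 Accuracy feedback: click the 😊 icon at the bottom of this comment and choose 👍 (accurate) or 👎 (inaccurate); the response is recorded automatically in the CI monitoring system.

🔄 Updated automatically after every re-run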
