[DO NOT MERGE] GLM-4.5 MoE forward + backward bit-wise alignment by zhanghonggeng · Pull Request #4668 · PaddlePaddle/PaddleFormers

zhanghonggeng · 2026-06-14T07:32:56Z

Before submitting

Lint code. If there are lint issues, please format the code first.

# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Process previous code files separately
pre-commit run --file XXXX.py

Add test cases to the tests folder. If there are codecov issues, please add test cases first.

PR types

PR changes

Description

Paddle-CI-Bot · 2026-06-14T08:18:19Z

PaddleFormers Log Analysis

Run #27546777556 · Attempt 1

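Log Analysis Report

Pipeline	Issue label	Suggested fix	Log snippet
Unittest GPU CI	dtype mismatch bug	Check the dtype alignment between `attention_probs` (float32) and `value` (bfloat16) in the `GlmMoeDsa` model forward pass, and explicitly cast both to the same dtype before `paddle.bmm` is called in `dot_product_attention.py`	Error code
Model Unittest GPU CI	Loss diff	Merge the develop branch or update the glm4_moe base loss baseline files	Error code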

Failed test cases:

# Unittest GPU CI
tests/transformers/glm_moe_dsa/test_modeling.py::GlmMoeDsaModelTest::test_GlmMoeDsa_lm_head_model

# Model Unittest GPU CI
scripts/regression/test_models.py::TestTrain::test_full[glm4_moe-sft]
scripts/regression/test_models.py::TestTrain::test_full[glm4_moe-dpo]
scripts/regression/test_models.py::TestTrain::test_full[glm4_moe-pt]
scripts/regression/test_models.py::TestTrain::test_full_map[glm4_moe-sft]
scripts/regression/test_models.py::TestTrain::test_lora[glm4_moe-sft]
scripts/regression/test_models.py::TestTrain::test_lora[glm4_moe-dpo]
scripts/regression/test_models.py::TestTrain::test_lora[glm4_moe-pt]
scripts/regression/test_models.py::TestTrain::test_full_tp_pp[glm4_moe-sft]
scripts/regression/test_models.py::TestTrain::test_full_tp_pp[glm4_moe-dpo]
scripts/regression/test_models.py::TestTrain::test_full_tp_pp[glm4_moe-pt]
scripts/regression/test_models.py::TestTrain::test_lora_tp_pp[glm4_moe-sft]
scripts/regression/test_models.py::TestTrain::test_lora_tp_pp[glm4_moe-pt]
scripts/regression/test_models.py::TestTrain::test_lora_tp_pp[glm4_moe-dpo]
scripts/regression/test_models.py::TestTrain::test_full_function_call[glm4_moe-sft]
scripts/regression/test_models.py::TestTrain::test_full_function_call[glm4_moe-dpo]

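Root cause analysis:

Unittest GPU CI: When the GlmMoeDsa model calls paddle.bmm at paddlefleet/transformer/dot_product_attention.py:648, attention_probs is float32 while value is bfloat16, and the dtype mismatch raises InvalidArgument. The glm_moe_dsa model introduced by this PR does not unify dtypes correctly during initialization or weight loading.
Model Unittest GPU CI: In the glm4_moe full-parameter, LoRA, and TP-PP training scenarios, the actual loss deviates from the stored baseline (base_loss) in every case, by roughly 0.001 to 0.013. This PR's changes to glm4_moe affect the numerics, so either merge the latest baseline from develop or update the base loss files.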
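
The dtype mismatch described above can be illustrated with a minimal sketch of an explicit cast placed before the batched matmul. This is not the code in dot_product_attention.py. `FakeTensor` and `align_to_value_dtype` are hypothetical names introduced here, and the stand-in class only mimics the `dtype` attribute and `astype` method that Paddle tensors also expose, so the snippet runs without Paddle installed.

```python
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Hypothetical stand-in exposing only `dtype` and `astype`."""

    dtype: str

    def astype(self, dtype):
        # Return a new object with the requested dtype, like a tensor cast.
        return FakeTensor(dtype=dtype)


def align_to_value_dtype(attention_probs, value):
    """Cast attention_probs to value.dtype only when the two differ."""
    if attention_probs.dtype != value.dtype:
        attention_probs = attention_probs.astype(value.dtype)
    return attention_probs


probs = FakeTensor("float32")
value = FakeTensor("bfloat16")
probs = align_to_value_dtype(probs, value)
print(probs.dtype)  # bfloat16
# With real Paddle tensors, the next step would be: paddle.bmm(probs, value)
```

Whether to cast `attention_probs` down to bfloat16 or cast `value` up to float32 is a numerical decision. For a bit-wise alignment effort, the direction presumably has to match what the reference implementation does.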

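Suggested fixes:

dtype mismatch: In the GlmMoeDsa model's create_and_check_lm_head_model or in the model initialization code, make sure the attention tensors (q/k/v) are cast to the same dtype (for example bfloat16). Alternatively, configure dtype=bfloat16 in the GlmMoeDsaModelTest test so that the model is initialized with the correct precision.
Loss Diff (glm4_moe):
- First merge the latest develop branch to check whether this is baseline drift that already exists on develop.
- If this PR does change the glm4_moe computation logic, also update the corresponding base loss files under scripts/regression/ (glm4_moe_*_base_loss) to the new expected values.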
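
The loss check behind the Loss Diff suggestion can be sketched as a per-step comparison between the actual and baseline losses. This is a minimal illustration, not the logic in scripts/regression/test_models.py. The function name `find_loss_diffs`, the `atol` parameter, and the loss values are all made up for the example.

```python
def find_loss_diffs(actual, baseline, atol=0.0):
    """Return (step, actual, baseline, abs_diff) for steps whose diff exceeds atol."""
    if len(actual) != len(baseline):
        raise ValueError("loss sequences differ in length")
    diffs = []
    for step, (a, b) in enumerate(zip(actual, baseline)):
        diff = abs(a - b)
        if diff > atol:
            diffs.append((step, a, b, diff))
    return diffs


# Hypothetical per-step losses; step 1 drifts by 0.0078125.
baseline = [2.5, 2.25, 2.0]
actual = [2.5, 2.2421875, 2.0]

print(find_loss_diffs(actual, baseline))
# [(1, 2.2421875, 2.25, 0.0078125)]
print(find_loss_diffs(actual, baseline, atol=0.01))
# []
```

With `atol=0.0` the comparison is exact, which is the stricter setting one would presumably want when the goal is bit-wise alignment. A nonzero tolerance hides small drifts such as the one at step 1.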

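🔍 Accuracy feedback: click the 😊 icon at the bottom of this comment and choose 👍 (accurate) or 👎 (incorrect). Your choice is recorded automatically in the CI monitoring system.

🔄 Automatically updated after each re-run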

[DO NOT MERGE] GLM-4.5 MoE forward + backward bit-wise alignment

bc5dd0d

zhanghonggeng force-pushed the glm_algin branch from 916609f to bc5dd0d Compare June 15, 2026 11:58

zhanghonggeng wants to merge 1 commit into PaddlePaddle:develop from zhanghonggeng:glm_algin
