[codex] Adapt DSv4 backward alignment to latest PaddleFormers by SigureMo · Pull Request #4611 · PaddlePaddle/PaddleFormers

SigureMo · 2026-06-06T06:45:06Z

Summary

Port the DSv4 backward-alignment changes onto the latest PaddleFormers develop baseline.

This branch includes the two backward-alignment commits ported from the old alignment branch, plus local run-entry/data-path adaptation used by the latest 10-step alignment run:

- keep the SFT precision run from saving final model/state artifacts
- add DSV4_FLEET_FIXED_TOKENS JSON fixed-token batch support in collate_fn
- preserve the existing LOAD_FIXED_DATA_PATH path and print which fixed-data source is used
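
A minimal sketch of how a fixed-token override in collate_fn could work. It assumes DSV4_FLEET_FIXED_TOKENS names a JSON file holding a list of token-id lists; that file layout, the helper name load_fixed_tokens, the fixed_tokens argument, and the returned keys are all hypothetical and are not taken from the PR diff.

```python
import json
import os


def load_fixed_tokens(env_var="DSV4_FLEET_FIXED_TOKENS"):
    """Return fixed token rows from the JSON file named by env_var, or None."""
    path = os.environ.get(env_var)
    if not path:
        return None
    with open(path, "r", encoding="utf-8") as f:
        rows = json.load(f)
    # Assumed layout: [[tok, tok, ...], ...], one inner list per sample.
    if not rows or not all(isinstance(r, list) for r in rows):
        raise ValueError(f"{path} must contain a non-empty list of token-id lists")
    return rows


def collate_fn(batch, fixed_tokens=None):
    """Build a batch, replacing input ids with fixed tokens when provided."""
    if fixed_tokens is not None:
        print("fixed-data source: DSV4_FLEET_FIXED_TOKENS")
        # Cycle through the fixed rows so every sample slot gets a row.
        input_ids = [list(fixed_tokens[i % len(fixed_tokens)]) for i in range(len(batch))]
    else:
        print("fixed-data source: none (using dataset samples)")
        input_ids = [list(sample["input_ids"]) for sample in batch]
    # Next-token labels: shift left by one; the last position has no target.
    labels = [ids[1:] + [-100] for ids in input_ids]
    return {"input_ids": input_ids, "labels": labels}
```

Feeding both the old and the latest branch the same token rows removes data-loader differences from the comparison, so any remaining loss mismatch points at the model or backward path.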

This PR is intended to be used together with the matching PaddleFleet PR from branch codex/dsv4-backward-align-latest.

Validation

- git diff --check origin/develop..HEAD
- 10-step DSv4 old/latest comparison completed with paired PaddleFleet changes:
  - main loss md5 diff: 0/10
  - MTP loss md5 diff: 0/10
  - latest log: /root/paddlejob/share-storage/gpfs/system-public/huangjiyi/dsv4_backward_align/0605_latest_align_22steps/outputs/dpskv4_ep8_4layer_1k_align_20260606_1318/output_0/workerlog.0
  - old log: /root/paddlejob/share-storage/gpfs/system-public/huangjiyi/dsv4_backward_align/0605_align_22steps/outputs/dpskv4_ep8_4layer_1k_align_20260606_1325/output_0/workerlog.0
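
The md5 diff counts above can be reproduced with a check along these lines. This is only a sketch: the `loss: <value>` log format, the key names, and hashing the printed loss string (rather than raw tensor bytes) are assumptions, since the actual workerlog format is not shown here.

```python
import hashlib
import re


def loss_md5s(log_text, key="loss"):
    """Return one md5 hex digest per logged value of `key`, in step order."""
    # Assumed log format: "<key>: <value>", e.g. "loss: 2.5, mtp_loss: 1.0".
    pattern = re.compile(rf"\b{re.escape(key)}:\s*([^\s,]+)")
    return [hashlib.md5(m.group(1).encode()).hexdigest() for m in pattern.finditer(log_text)]


def count_md5_diffs(old_log, new_log, key="loss"):
    """Return (number of steps whose digest differs, number of steps compared)."""
    old, new = loss_md5s(old_log, key), loss_md5s(new_log, key)
    if len(old) != len(new):
        raise ValueError(f"step count mismatch: {len(old)} vs {len(new)}")
    diffs = sum(a != b for a, b in zip(old, new))
    return diffs, len(old)


old_log = "step 1 loss: 2.5, mtp_loss: 1.0\nstep 2 loss: 2.25, mtp_loss: 0.5"
new_log = "step 1 loss: 2.5, mtp_loss: 1.0\nstep 2 loss: 2.25, mtp_loss: 0.5"
print(count_md5_diffs(old_log, new_log))              # (0, 2)
print(count_md5_diffs(old_log, new_log, "mtp_loss"))  # (0, 2)
```

A result of 0 differing digests means the two runs printed the same loss string at every step; a digest comparison is stricter than a tolerance check, since any difference in the printed value counts.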

CLAassistant · 2026-06-06T06:45:12Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

root seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

Paddle-CI-Bot · 2026-06-06T07:22:20Z

PaddleFormers Log Analysis

Run #27056714883 · Attempt 1

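Log Analysis Report

Pipeline name	Issue label	Suggested fix	Log snippet
Fleet Model Test - Integration test (H20, single card)	PP segmentation bug	In `glm4_moe/modeling.py`, the `Glm4MoeDecoderLayer` layer name fails to match the regex in paddlefleet's `_gen_layer_weight`; check whether the `seg_method` passed in by `gpt_provider.py`, or the decoder layer class name, matches what paddlefleet expects	Error code
Fleet Model Test - Integration test (H20, multi-card)	PP segmentation bug	Same as above; in addition, the EP4 scenario reports `number of layers (1) should be divided by part number(2)`, which indicates that the layer count configured for GLM4.5 MoE DSA is not divisible by the number of PP stages; check the `num_layers` or PP size setting	Error code
Fleet Model Test - Integration test (A100)	PP segmentation bug	Same root cause as H20; all GLM4.5 tasks (pre-train/sft/lora/dpo/dpo_lora) fail because `weight_idxs` is empty or the layer count is not divisible	Error code
Unittest GPU CI - unittest-gpu-ci	PP segmentation bug + missing SwanLab initialization	1) `GlmMoeDsaModelTest` has the same root cause as the Fleet tests; 2) `SwanLabCallback.on_train_begin` runs without a prior `swanlab.init()` call; the test needs to either mock swanlab.get_run() or call init in setUp	Error code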

Failed test cases:

# Fleet Model Test (H20 single-card)
- Integration test (GLM4.5 single-card)        → AssertionError: weight_idxs' length should be greater than 0
- Integration test (Qwen3-30B-A3B single-card) → AssertionError: weight_idxs' length should be greater than 0

# Fleet Model Test (H20 multi-card)
- GLM4.5 pre-train       → weight_idxs length 0 / exit code 241
- GLM4.5 sft             → exit code 241
- GLM4.5 sft cp          → exit code 1
- GLM4.5 lora            → exit code 1
- GLM4.5 dpo             → exit code 241
- GLM4.5 dpo_lora        → exit code 1
- GLM4.5 pre-train (EP4) → number of layers (1) should be divided by part number(2)
- GLM4.5 pre-train (FP8) → weight_idxs length 0
- GLM4.5 pre-train (Grouped GEMM) → weight_idxs length 0
- Qwen pre-train         → weight_idxs length 0
- Qwen sft               → exit code 1
- Qwen lora              → exit code 1
- Qwen vl lora           → exit code 1

# Fleet Model Test (A100)
- GLM4.5 pre-train       → weight_idxs length 0 / exit code 241
- GLM4.5 sft             → exit code 241
- GLM4.5 lora            → exit code 1
- GLM4.5 dpo             → exit code 241
- GLM4.5 dpo_lora        → number of layers (1) / exit code 241
- Qwen pre-train         → weight_idxs length 0
- Qwen sft               → exit code 1
- Qwen lora              → exit code 1
- Qwen vl lora           → exit code 1

# Unittest GPU CI
- tests/trainer/test_trainer_visualization.py::TestSwanlabCallback::test_swanlabcallback
- tests/transformers/glm_moe_dsa/test_modeling.py::GlmMoeDsaModelTest::test_GlmMoeDsa_lm_head_model

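Root cause analysis:

This PR introduces the new glm_moe_dsa (GLM4.5 MoE DSA) model, whose decoder layer is named Glm4MoeDecoderLayer, but the seg_method (the layer-name pattern used to split the model into PP pipeline segments) that gpt_provider.py passes to gpt_builder still points to the old layer name (for example, Glm4MoeDecoderLayer cannot be matched by regex.search(name) in paddlefleet's _gen_layer_weight). As a result, weight_idxs is empty, and every GLM4.5 MoE scenario that goes through this GPTProvider path (pre-train / sft / lora / dpo, plus GlmMoeDsaModelTest in the unit tests) crashes. The SwanLabCallback test fails because the new case does not initialize a swanlab run in setUp.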
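
The failure mode described here can be illustrated with a simplified stand-in for the layer-name matching step. This is not paddlefleet's `_gen_layer_weight`; the function name, the `layer:<pattern>` convention as used here, and the layer names in the example are assumptions chosen only to show how a stale pattern produces an empty index list.

```python
import re


def matching_layer_indices(layer_names, seg_method):
    """Indices of layers whose class name matches a 'layer:<pattern>' seg_method."""
    prefix = "layer:"
    if not seg_method.startswith(prefix):
        raise ValueError(f"seg_method must start with {prefix!r}, got {seg_method!r}")
    regex = re.compile(seg_method[len(prefix):], re.IGNORECASE)
    weight_idxs = [i for i, name in enumerate(layer_names) if regex.search(name)]
    if not weight_idxs:
        # Same message as the CI failure, raised when no layer name matches.
        raise AssertionError("weight_idxs' length should be greater than 0")
    return weight_idxs


# Illustrative layer class names only; the real model's names may differ.
layers = ["Embedding", "GlmMoeDsaDecoderLayer", "GlmMoeDsaDecoderLayer", "LMHead"]
print(matching_layer_indices(layers, "layer:DecoderLayer"))  # [1, 2]
```

With a pattern such as `layer:Glm4MoeDecoderLayer` the same call raises the AssertionError, because no class name in this example contains that string.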

Suggested fixes:

PP layer-name matching fix: in the provide() method of paddleformers/transformers/gpt_provider.py, confirm that the layername passed via seg_method matches the decoder layer class name actually used by GlmMoeDsaForCausalLM. To investigate, print the __name__ of each layer in self._layers_desc in glm_moe_dsa/modeling.py and confirm that the regex pattern hits. If the DSA version renamed the decoder layer (for example to GlmMoeDsaDecoderLayer), update the pattern string in seg_method accordingly.
EP4 layer-count divisibility: number of layers (1) should be divided by part number(2) indicates that the CI config sets num_layers to 1 while PP size=2; check whether num_hidden_layers in tests/config/ci/glm45_*_ep4*.yaml is set to a value divisible by the number of PP stages.
SwanLabCallback test fix: in TestSwanlabCallback.setUp in tests/trainer/test_trainer_visualization.py, add swanlab.init(mode="disabled") or mock swanlab.get_run() to return a fake run object, so that on_train_begin does not raise RuntimeError for lack of an active run.
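
The layer-count constraint in the second item can be checked before launching a run. The helper below is a hypothetical pre-flight check, not part of PaddleFormers or paddlefleet; it only mirrors the wording of the error seen in CI.

```python
def layers_per_stage(num_layers, pp_size):
    """Return layers per pipeline stage, or raise if the split is uneven."""
    if pp_size < 1:
        raise ValueError(f"pp_size must be >= 1, got {pp_size}")
    if num_layers % pp_size != 0:
        raise ValueError(
            f"number of layers ({num_layers}) should be divided by part number({pp_size})"
        )
    return num_layers // pp_size


print(layers_per_stage(4, 2))  # 2
```

Calling `layers_per_stage(1, 2)` raises the same message reported for the EP4 case, which is why raising the layer count in the CI config or lowering the PP size would clear that particular error.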

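🔍 Accuracy feedback: click the 😊 icon at the bottom of this comment and choose 👍 (accurate) or 👎 (inaccurate); the response is recorded automatically in the CI monitoring system.

🔄 Updated automatically after each re-run.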

root added 3 commits June 6, 2026 13:13

dsv4 backward align

d52381a

dsv4_backward_align_22steps

afbf270

Adapt DSv4 backward alignment to latest PaddleFormers

bbda6c6

Remove DSv4 alignment probe logs

47aa363

SigureMo wants to merge 4 commits into develop from codex/dsv4-backward-align-latest
