[codex] Adapt DSv4 backward alignment to latest PaddleFormers #4611

Draft
SigureMo wants to merge 4 commits into develop from codex/dsv4-backward-align-latest


SigureMo commented Jun 6, 2026

Member

Summary

Port the DSv4 backward-alignment changes onto the latest PaddleFormers develop baseline.

This branch includes the two backward-alignment commits ported from the old alignment branch, plus local run-entry/data-path adaptation used by the latest 10-step alignment run:

  • keep the SFT precision run from saving final model/state artifacts
  • add DSV4_FLEET_FIXED_TOKENS JSON fixed-token batch support in collate_fn
  • preserve the existing LOAD_FIXED_DATA_PATH path and print which fixed-data source is used
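
The fixed-token override described in the list above could look roughly like the following sketch. Only the variable name DSV4_FLEET_FIXED_TOKENS comes from this description. The JSON layout (a file holding a list of token-id lists), the assumption that the variable holds a file path, the `input_ids` key, and the helper names `load_fixed_tokens` and `collate_fn` are all hypothetical and may not match the PR's actual code.

```python
import json
import os
import tempfile

ENV_VAR = "DSV4_FLEET_FIXED_TOKENS"  # name taken from the PR description


def load_fixed_tokens(env=os.environ):
    """Return the fixed token batch, or None when the variable is unset.

    Assumption: the variable holds a path to a JSON file containing a list
    of token-id lists, e.g. [[1, 2, 3], [4, 5, 6]].
    """
    path = env.get(ENV_VAR)
    if not path:
        return None
    with open(path, encoding="utf-8") as f:
        tokens = json.load(f)
    # Report which fixed-data source is in use.
    print(f"[collate_fn] using fixed tokens from {ENV_VAR}={path}")
    return tokens


def collate_fn(samples, env=os.environ):
    """Hypothetical collate function: fixed tokens replace the real samples."""
    fixed = load_fixed_tokens(env)
    if fixed is not None:
        return {"input_ids": fixed}
    return {"input_ids": [s["input_ids"] for s in samples]}


# Usage: write a tiny fixed batch, then collate with and without the override.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump([[1, 2, 3], [4, 5, 6]], tmp)

samples = [{"input_ids": [9, 9, 9]}]
overridden = collate_fn(samples, env={ENV_VAR: tmp.name})
normal = collate_fn(samples, env={})
os.remove(tmp.name)
```

Passing the environment as a parameter keeps the sketch testable without mutating `os.environ`; a real collate function would more likely read the variable once at startup.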

This PR is intended to be used together with the matching PaddleFleet PR from branch codex/dsv4-backward-align-latest.

Validation

  • git diff --check origin/develop..HEAD
  • 10-step DSv4 old/latest comparison completed with paired PaddleFleet changes:
    • main loss md5 diff: 0/10
    • MTP loss md5 diff: 0/10
    • latest log: /root/paddlejob/share-storage/gpfs/system-public/huangjiyi/dsv4_backward_align/0605_latest_align_22steps/outputs/dpskv4_ep8_4layer_1k_align_20260606_1318/output_0/workerlog.0
    • old log: /root/paddlejob/share-storage/gpfs/system-public/huangjiyi/dsv4_backward_align/0605_align_22steps/outputs/dpskv4_ep8_4layer_1k_align_20260606_1325/output_0/workerlog.0
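
A per-step md5 comparison like the one reported above can be sketched as follows. The log line shape (`global_step: N, loss: X`) and the helper names are assumptions made for illustration; the real workerlog format is not shown in this PR and may differ.

```python
import hashlib
import re

# Assumed log line shape; the real workerlog format may differ.
LOSS_RE = re.compile(r"global_step:\s*(\d+).*?\bloss:\s*([0-9.eE+-]+)")


def loss_md5_by_step(log_text):
    """Map step -> md5 of the loss string exactly as printed in the log."""
    digests = {}
    for line in log_text.splitlines():
        m = LOSS_RE.search(line)
        if m:
            digests[int(m.group(1))] = hashlib.md5(m.group(2).encode()).hexdigest()
    return digests


def count_md5_diffs(old_log, new_log):
    """Return (number of differing steps, number of compared steps)."""
    old, new = loss_md5_by_step(old_log), loss_md5_by_step(new_log)
    steps = sorted(old.keys() & new.keys())
    return sum(old[s] != new[s] for s in steps), len(steps)


# Two toy logs that agree on step 1 and disagree on step 2.
old_log = "global_step: 1, loss: 11.25\nglobal_step: 2, loss: 10.5\n"
new_log = "global_step: 1, loss: 11.25\nglobal_step: 2, loss: 10.75\n"
diffs, total = count_md5_diffs(old_log, new_log)
print(f"loss md5 diff: {diffs}/{total}")  # prints "loss md5 diff: 1/2"
```

Hashing the printed string rather than a parsed float makes the check sensitive to any difference in the logged digits, which is the point of a bitwise alignment run.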

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


root seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

Paddle-CI-Bot commented Jun 6, 2026


PaddleFormers Log Analysis

Run #27056714883 · Attempt 1

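Log Analysis Report

| Pipeline | Issue label | Suggested fix | Log snippet |
| --- | --- | --- | --- |
| Fleet Model Test - Integration test (H20, single card) | PP segmentation bug | The Glm4MoeDecoderLayer layer name in glm4_moe/modeling.py fails to match the regex in paddlefleet's _gen_layer_weight. Check whether the seg_method passed in by gpt_provider.py, or the decoder layer class name, matches what paddlefleet expects. | Error code |
| Fleet Model Test - Integration test (H20, multi-card) | PP segmentation bug | Same as above. In addition, the EP4 scenario reports "number of layers (1) should be divided by part number(2)", which indicates that the GLM4.5 MoE DSA layer count is not divisible by the number of PP stages. Check the num_layers or PP size setting. | Error code |
| Fleet Model Test - Integration test (A100) | PP segmentation bug | Same root cause as H20. All GLM4.5 tasks (pre-train/sft/lora/dpo/dpo_lora) fail because weight_idxs is empty or the layer count is not divisible. | Error code |
| Unittest GPU CI - unittest-gpu-ci | PP segmentation bug + missing SwanLab initialization | 1) GlmMoeDsaModelTest has the same root cause as the Fleet tests. 2) SwanLabCallback.on_train_begin runs without a prior swanlab.init() call; the test needs to mock swanlab.get_run() or call init first in setUp. | Error code |

Failed test cases: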

# Fleet Model Test (H20 single-card)
- Integration test (GLM4.5 single-card)        → AssertionError: weight_idxs' length should be greater than 0
- Integration test (Qwen3-30B-A3B single-card) → AssertionError: weight_idxs' length should be greater than 0

# Fleet Model Test (H20 multi-card)
- GLM4.5 pre-train       → weight_idxs length 0 / exit code 241
- GLM4.5 sft             → exit code 241
- GLM4.5 sft cp          → exit code 1
- GLM4.5 lora            → exit code 1
- GLM4.5 dpo             → exit code 241
- GLM4.5 dpo_lora        → exit code 1
- GLM4.5 pre-train (EP4) → number of layers (1) should be divided by part number(2)
- GLM4.5 pre-train (FP8) → weight_idxs length 0
- GLM4.5 pre-train (Grouped GEMM) → weight_idxs length 0
- Qwen pre-train         → weight_idxs length 0
- Qwen sft               → exit code 1
- Qwen lora              → exit code 1
- Qwen vl lora           → exit code 1

# Fleet Model Test (A100)
- GLM4.5 pre-train       → weight_idxs length 0 / exit code 241
- GLM4.5 sft             → exit code 241
- GLM4.5 lora            → exit code 1
- GLM4.5 dpo             → exit code 241
- GLM4.5 dpo_lora        → number of layers (1) / exit code 241
- Qwen pre-train         → weight_idxs length 0
- Qwen sft               → exit code 1
- Qwen lora              → exit code 1
- Qwen vl lora           → exit code 1

# Unittest GPU CI
- tests/trainer/test_trainer_visualization.py::TestSwanlabCallback::test_swanlabcallback
- tests/transformers/glm_moe_dsa/test_modeling.py::GlmMoeDsaModelTest::test_GlmMoeDsa_lm_head_model

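Root cause analysis:

This PR introduces the new glm_moe_dsa (GLM4.5 MoE DSA) model, whose decoder layer is named Glm4MoeDecoderLayer. However, the seg_method that gpt_provider.py passes to gpt_builder (the layer-name pattern used to split the PP pipeline into segments) still points to the old layer name (for example, Glm4MoeDecoderLayer cannot be matched by regex.search(name) in paddlefleet's _gen_layer_weight). As a result, weight_idxs is empty and every GLM4.5 MoE scenario that goes through this GPTProvider path (pre-train / sft / lora / dpo, plus GlmMoeDsaModelTest in the unit tests) crashes. The SwanLabCallback test fails because the new case does not initialize a swanlab run in setUp.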
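
The failure mode described above, a layer-name pattern that matches nothing and so yields an empty index list, can be illustrated with a simplified stand-in. This is not paddlefleet's `_gen_layer_weight`; the function `matching_layer_indices` and the layer name list are hypothetical, and GlmMoeDsaDecoderLayer is only the example rename mentioned in this report.

```python
import re


def matching_layer_indices(layer_names, layername_pattern):
    """Simplified stand-in: indices of layers whose name matches the pattern."""
    regex = re.compile(layername_pattern)
    return [i for i, name in enumerate(layer_names) if regex.search(name)]


# Hypothetical layer descriptions, for illustration only.
layer_names = ["Embedding", "GlmMoeDsaDecoderLayer", "GlmMoeDsaDecoderLayer", "LMHead"]

# A stale pattern matches no layer, so the index list is empty and an
# assertion such as "weight_idxs' length should be greater than 0" would fire.
stale = matching_layer_indices(layer_names, "Glm4MoeDecoderLayer")

# A pattern that agrees with the actual class name finds the decoder layers.
fixed = matching_layer_indices(layer_names, "GlmMoeDsaDecoderLayer")

print(stale, fixed)  # prints "[] [1, 2]"
```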

Fix suggestions:

  1. PP layer-name matching: in the provide() method of paddleformers/transformers/gpt_provider.py, confirm that the layername passed through seg_method matches the decoder layer class name that GlmMoeDsaForCausalLM actually uses. To diagnose, print the __name__ of each layer in self._layers_desc from glm_moe_dsa/modeling.py and confirm that the regex pattern hits. If the DSA version renames the decoder layer (for example, to GlmMoeDsaDecoderLayer), update the pattern string in seg_method accordingly.

  2. EP4 layer-count divisibility: "number of layers (1) should be divided by part number(2)" indicates that the CI config sets num_layers to 1 while PP size is 2. Check whether num_hidden_layers in tests/config/ci/glm45_*_ep4*.yaml is set to a value divisible by the number of PP stages.

  3. SwanLabCallback test: in TestSwanlabCallback.setUp in tests/trainer/test_trainer_visualization.py, add swanlab.init(mode="disabled") or mock swanlab.get_run() to return a fake run object, so that on_train_begin does not raise RuntimeError for lack of an active run.
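
The divisibility requirement in item 2 can be checked before launching a job with a small guard like the sketch below. The function name `layers_per_stage` is hypothetical; the error text mirrors the message quoted in the report, but the check itself is an illustration and not paddlefleet's code.

```python
def layers_per_stage(num_hidden_layers, pp_size):
    """Return layers per PP stage, or raise if the split is uneven."""
    if pp_size <= 0:
        raise ValueError(f"pp_size must be positive, got {pp_size}")
    if num_hidden_layers % pp_size != 0:
        raise ValueError(
            f"number of layers ({num_hidden_layers}) should be divided by "
            f"part number({pp_size})"
        )
    return num_hidden_layers // pp_size


# A 4-layer model splits evenly across 2 stages.
even_split = layers_per_stage(4, 2)

# The failing CI combination: 1 layer with PP size 2.
try:
    layers_per_stage(1, 2)
    failed = False
except ValueError as err:
    failed = True
    message = str(err)
```

Running such a check when the config is loaded surfaces the mismatch as a configuration error rather than a crash deep inside pipeline segmentation.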


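🔍 Accuracy feedback: click the 😊 icon at the bottom of this comment and choose 👍 (accurate) or 👎 (inaccurate); the response is recorded automatically in the CI monitoring system.

🔄 Updated automatically after each re-run.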
