[DSv4] Support Megatron-aligned training loop #4623

Closed
huangjiyi wants to merge 1 commit into develop from codex/dsv4-paddleformers-develop-a13fcde3-align

Conversation

@huangjiyi (Member) commented Jun 8, 2026

Summary

This PR adds the PaddleFormers-side DSv4 training loop support needed by the current Megatron-aligned PaddleFleet training path on develop.

Main areas covered:

  • Trainer / optimizer behavior used by the DSv4 alignment run.
  • DeepSeek V4 model-entry and GPT provider compatibility changes.
  • Data collate and SFT workflow adjustments needed by the aligned run script.
  • MoE hybrid optimizer integration with the aligned PaddleFleet numerical path.

Pairs with PaddlePaddle/PaddleFleet#1151.

Validation

Validated in /root/paddlejob/share-storage/gpfs/system-public/huangjiyi/dsv4_backward_align/0605_latest_align_22steps with PaddleFleet 60700ccb5edb1da29183ca27c4b34c7f06dbb9fc and PaddleFormers a13fcde3f809a011edff0a22c809bd0a1403f46f:

RUN_RANK=3 MASTER_PORT=40401 RUN_STAMP=latest_paddlefleet60700cc_paddleformers_a13fcde3_100step_20260608 OUTPUT_PREFIX=dpskv4_ep8_4layer_1k_latest_pfleet60700cc_pformers_a13fcde3_100step bash run_align_100step_lr1e3_mbs2_acc2.sh

Result:

main_matches=200/200
mtp_matches=200/200
first_main_diff=None
first_mtp_diff=None
step100 main final_loss=0.74668949842453002930 md5=9d82f1177a18c2810367a158003c281c
step100 mtp final_loss=4.16679096221923828125 md5=d36254d4f44da16479cafc2199f8d41b
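
The fields in the result above (match counts, first differing step, and an md5 of the final loss) can be produced by a bitwise comparison of two per-step loss logs. The following is a minimal sketch, assuming each run's losses are available as a list of Python floats. The helper name `compare_losses`, the float32 packing, and hashing the packed bytes of the last loss are illustrative assumptions, not the comparison script used for this PR.

```python
import hashlib
import struct


def compare_losses(ref, test):
    """Compare two per-step loss sequences bit-for-bit (hypothetical helper)."""
    if len(ref) != len(test):
        raise ValueError("loss logs have different lengths")
    # Pack as little-endian float32 so equality is checked on raw bits,
    # not within a numerical tolerance.
    ref_bits = [struct.pack("<f", x) for x in ref]
    test_bits = [struct.pack("<f", x) for x in test]
    matches = sum(a == b for a, b in zip(ref_bits, test_bits))
    # Zero-based index of the first entry whose bits differ, or None.
    first_diff = next(
        (i for i, (a, b) in enumerate(zip(ref_bits, test_bits)) if a != b),
        None,
    )
    return {
        "matches": f"{matches}/{len(ref)}",
        "first_diff": first_diff,
        # Assumption: the digest is taken over the packed bytes of the last loss.
        "final_md5": hashlib.md5(test_bits[-1]).hexdigest() if test_bits else None,
    }
```

A bitwise check is stricter than a tolerance-based one such as `math.isclose`: any change in reduction order or kernel choice that alters the last bit of a loss shows up as a mismatch.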

Also ran git diff --cached --check and py_compile for the touched PaddleFormers files before commit.

Signed-off-by: huangjiyi <947613776@qq.com>
@Paddle-CI-Bot


PaddleFormers Log Analysis

Run #27133430627 · Attempt 1

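Log analysis report

| Pipeline | Issue tag | Suggested fix | Log snippet |
|---|---|---|---|
| Integration test (H20, single card) | Other: PP segmentation AssertionError | The pp_size configured in qwen3_pt.yaml does not match the model's layer count, so _gen_layer_weight cannot find the corresponding layer weights. Check whether the layer names used when registering the qwen3_moe model with gpt_provider agree with seg_method, and verify pipeline_parallel_degree in the CI config file. | Error code |
| Integration test (H20, multi-card) | Other: checkpoint path used as repo_id / FSDP exit -6 | qwen_lora.sh: the checkpoint path /workspace/checkpoints/qwen-sft produced by the preceding SFT step is passed to AutoConfig.from_pretrained and triggers HFValidationError. This indicates that the LoRA task depends on the preceding SFT checkpoint, which the SFT step did not produce in this run. qwen3vl_sft_fsdp.yaml: the process exits with code -6 (SIGABRT). Training itself completed (10 steps ran normally) and the multi-card FSDP run aborted after training ended, so the FSDP save/cleanup path needs investigation. | Error code |
| unittest-gpu-ci | Other: Numba JIT compilation failure | GlmMoeDsaModelTest::test_GlmMoeDsa_lm_head_model raises TypingError / NotImplementedError while Numba compiles the DSA kernel, because the Numba version is incompatible with how the DSA kernel is written. Check whether the static_setitem / resolve_hasattr usage in the DSA kernel added by this PR matches the current Numba version, or upgrade Numba. | Error code |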

Failed test cases:

# Integration test (H20, single card)
Qwen3-30B-A3B single card PT (qwen3_pt.yaml) — AssertionError: weight_idxs' length should be greater than 0 when instantiating GPTModel

# Integration test (H20, multi-card)
qwen_lora multi-card (qwen3vl_lora.sh tp8 h20) — HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/workspace/checkpoints/qwen-sft'
qwen3vl_sft FSDP multi-card (qwen3vl_sft_fsdp.yaml) — Container rank 0 exit code -6 (SIGABRT)

# unittest-gpu-ci
tests/transformers/glm_moe_dsa/test_modeling.py::GlmMoeDsaModelTest::test_GlmMoeDsa_lm_head_model
  NotImplementedError: No definition for lowering static_setitem(Array(bool,1,'A',...), slice<a:b>, Array(bool,1,'A',...)) -> none
  numba.core.errors.TypingError: No implementation of Function(<intrinsic resolve_hasattr>) for (int64, unicode_type)

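Root cause analysis:

This PR adds a PT training path for qwen3_moe and the GlmMoeDsa model implementation: (1) when qwen3_moe runs with Pipeline Parallel, the seg_method that gpt_provider passes to paddlefleet does not match the new model's layer names, so the PP segmentation assertion fails; (2) the multi-card LoRA test takes the checkpoint from the preceding SFT step as input, so after SFT failed it received an invalid path and the downstream task failed in turn; (3) the DSA kernel exercised by GlmMoeDsaModelTest uses a static_setitem(bool[:], slice, bool[:]) operation that the current Numba version does not support, so JIT compilation fails.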
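
To illustrate cause (1), the sketch below shows how name-based pipeline segmentation can end up with no matching layers. It is a simplified stand-in, not PaddleFleet's implementation: the function name `gen_layer_weight_idxs`, the `layer:<Name>` form of seg_method, and the example layer names are assumptions made for illustration. Only the assertion message is taken from the CI log.

```python
import re


def gen_layer_weight_idxs(layer_names, seg_method):
    """Return indices of layers whose class name matches 'layer:<Name>'.

    Simplified stand-in for illustration; not PaddleFleet's code.
    """
    if not seg_method.startswith("layer:"):
        raise ValueError("this sketch only handles 'layer:<Name>' methods")
    pattern = re.compile(seg_method.split(":", 1)[1], re.IGNORECASE)
    weight_idxs = [i for i, name in enumerate(layer_names) if pattern.search(name)]
    # Same message as the CI failure: no layer matched the requested name.
    assert len(weight_idxs) > 0, "weight_idxs' length should be greater than 0"
    return weight_idxs


# Hypothetical layer class names for a small model.
layers = ["Embedding", "Qwen3MoeDecoderLayer", "Qwen3MoeDecoderLayer", "LMHead"]
print(gen_layer_weight_idxs(layers, "layer:Qwen3MoeDecoderLayer"))  # [1, 2]
```

Calling the same function with a name that no layer carries, for example `"layer:TransformerLayer"`, trips the assertion, which is the failure mode described above.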

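Suggested fixes:

  1. PP segmentation (H20 single card): in paddleformers/transformers/qwen3_moe/modeling.py (__new__) or gpt_provider.py (provide), print the list of layer names that seg_method actually uses and compare it with the model's named_children() to make sure LayerName matches exactly; if the Uniform segmentation strategy is used and there are too few layers, increase num_micro_batches or reduce pp_size.

  2. LoRA dependency on the SFT checkpoint (H20 multi-card): add a checkpoint existence check before qwen3vl_lora.sh ([[ -f /workspace/checkpoints/qwen-sft/config.json ]]) and, when the file is missing, exit early with a clear error instead of passing an invalid repo_id; also fix the root cause of the upstream SFT failure (same as issue 1).

  3. FSDP SIGABRT (H20 multi-card): in the post-training handling for qwen3vl_sft_fsdp.yaml, check the FSDP state_dict save logic and confirm that the multi-card barrier synchronizes correctly before exit; rerun first to see whether the failure is intermittent.

  4. GlmMoeDsa Numba compatibility (unittest): in tests/transformers/glm_moe_dsa/test_modeling.py or the corresponding kernel implementation, replace the bool[:] slice assignment inside the Numba JIT with a native Python slice assignment, or change that code to @numba.jit(nopython=False) or move it out of the JIT scope; also pin numba>=0.60 and confirm that static_setitem (bool slice) is supported in that version.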
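
The existence check in suggestion 2 can also be done on the Python side before the path reaches a `from_pretrained` call. This is a minimal sketch that assumes a checkpoint directory counts as usable when it contains `config.json`; the helper name `require_local_checkpoint` is hypothetical.

```python
from pathlib import Path


def require_local_checkpoint(path):
    """Return the checkpoint directory as a string, or raise if it is unusable."""
    ckpt = Path(path)
    config = ckpt / "config.json"
    # Assumption: a usable checkpoint directory always contains config.json.
    if not config.is_file():
        raise FileNotFoundError(
            f"expected checkpoint config at {config}; "
            "did the preceding SFT step finish successfully?"
        )
    return str(ckpt)
```

Failing here reports the missing upstream artifact directly, rather than letting the path be interpreted as a hub repo id and rejected with HFValidationError.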


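🔍 Accuracy record: please click the 😊 icon at the bottom of this comment and choose 👍 (accurate) or 👎 (inaccurate); the response is recorded automatically in the CI monitoring system.

🔄 Updated automatically after each re-run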

@huangjiyi huangjiyi closed this Jun 26, 2026