[DSv4] Align training numerics with Megatron reference by huangjiyi · Pull Request #1151 · PaddlePaddle/PaddleFleet

huangjiyi · 2026-06-08T10:32:41Z

Summary

This PR adds the PaddleFleet-side DSv4 numerical alignment paths used to reproduce the Megatron reference training trajectory on current develop.

Main areas covered:

Megatron-compatible DSv4 SwiGLU forward / wgrad behavior.
Embedding, LM head, language loss, tensor-parallel linear, CSA/DSA attention, mHC, MoE router / dispatcher / MLP, MTP, norm, and utility alignment paths.
Revert3 DSv4 attention / CSA compressor / mHC numerical semantics matching the Megatron CleanAlign reference.
DeepEP / EP8 / mbs2 / acc2 training path support for bitwise loss reproduction.
A single-card repro test for CSA sink-softmax backward behavior.

Pairs with PaddlePaddle/PaddleFormers#4623.

Validation

Validated in /root/paddlejob/share-storage/gpfs/system-public/huangjiyi/dsv4-flash-workspace against Megatron CleanAlign revert3:

RUN_STAMP=retest_source_revert3_100step_20260608 \
OUTPUT_PREFIX=dpskv4_ep8_4layer_1k_retest_source_revert3_100step \
MASTER_PORT=40531 \
bash paddlefleet_dsv4/run_align_100step_lr1e3_mbs2_acc2.sh

Result versus Megatron CleanAlign revert3 100-step log:

main_matches=200/200
mtp_matches=200/200
first_main_diff=None
first_mtp_diff=None
step100 main final_loss=1.04997014999389648438 md5=e24faebe91ac491b4be0b8c71060a06a
step100 mtp final_loss=2.01599717140197753906 md5=80c6ec222c562733922f22dabb4fc8fe
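
A minimal sketch of how counters like `main_matches` and `first_main_diff` could be computed from two loss sequences. The function names are hypothetical and the actual comparison script used for this PR is not shown here; the sketch assumes losses have already been parsed into Python floats and compares their bit patterns rather than using `==`, so signed zeros and NaN payloads are distinguished.

```python
import struct


def float_bits(value):
    """Return the IEEE-754 double bit pattern of a parsed loss value."""
    return struct.pack("<d", value)


def compare_loss_logs(reference, candidate):
    """Compare two loss sequences bit-for-bit.

    Returns (matches, total, first_diff), where first_diff is the index of
    the first mismatching entry, or None when every entry matches.
    """
    if len(reference) != len(candidate):
        raise ValueError("loss logs have different lengths")
    matches = 0
    first_diff = None
    for index, (ref, cand) in enumerate(zip(reference, candidate)):
        if float_bits(ref) == float_bits(cand):
            matches += 1
        elif first_diff is None:
            first_diff = index
    return matches, len(reference), first_diff


if __name__ == "__main__":
    ref = [1.5, 2.25, 0.0]
    matches, total, first_diff = compare_loss_logs(ref, list(ref))
    print(f"main_matches={matches}/{total}")
    print(f"first_main_diff={first_diff}")
```

Comparing bit patterns is stricter than a tolerance check, which is the point of a bitwise reproduction claim: any single differing bit shows up as a mismatch.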

Also validated the same commit in /root/paddlejob/share-storage/gpfs/system-public/huangjiyi/dsv4_backward_align/0605_latest_align_22steps after syncing these changes:

main_matches=200/200
mtp_matches=200/200
first_main_diff=None
first_mtp_diff=None

Static checks:

python -m py_compile src/paddlefleet/transformer/csa_attention.py src/paddlefleet/transformer/dsv4_hybrid_attention.py src/paddlefleet/transformer/hyper_connection.py tests/single_card_tests/transformer/test_csa_sink_subtract_backward_repro.py
git diff --check

Signed-off-by: huangjiyi <947613776@qq.com>

Copilot

Pull request overview

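This PR introduces and adjusts DSv4 numerical alignment paths on the PaddleFleet side to reproduce the Megatron reference training trajectory more precisely. This includes sequence-first handling, alignment branches for key operators such as MoE, HC, attention, and linear, and adjustments to several debug and logging paths.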

Changes:

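Adds a set of DSv4 environment-variable switches and "sequence-first" alignment branches (RMSNorm, LM Head, MTP, MoE, and others).
Adjusts implementation details in Transformer, MoE, HC, attention, and TP Linear to match Megatron's computation and gradient-accumulation order, and removes some old MD5 probe code.
Extends and adjusts several fused/FP8 paths and loss-side log output to support bitwise alignment verification.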

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 4 comments.


File	Description
src/paddlefleet/utils.py	Adds DSv4 environment-variable boolean switches and helper functions for sequence-first feature switches
src/paddlefleet/transformer/transformer_layer.py	TransformerLayer alignment and simplification changes (including MTP handling and a changed attention call form)
src/paddlefleet/transformer/paddle_norm.py	Adds sequence-first numerical alignment branches to RMSNorm/FusedRMSNorm
src/paddlefleet/transformer/multi_token_prediction.py	Adjusts MTP initialization, linear alignment branches, and the magic-send path
src/paddlefleet/transformer/moe/token_dispatcher.py	Adds alignment switches to the MoE dispatcher and saves/passes additional probability routing tensors
src/paddlefleet/transformer/moe/moe_utils.py	fp32 accumulation and custom-backward alignment switches for MoE permute/unpermute
src/paddlefleet/transformer/moe/moe_router.py	Alignment branches for router matmul/wgrad (including torch mm and fp32 wgrad accumulation switches)
src/paddlefleet/transformer/moe/moe_layer.py	MoE layer alignment branches, reordered input branch splitting, and reworked sonic moe / fused paths
src/paddlefleet/transformer/moe/fp8_utils.py	Adjusts FP8/MoE fused paths and clamp/backward logic; adjusts indices generation and dtype
src/paddlefleet/transformer/hyper_connection.py	Major changes to mHC/contract/sinkhorn alignment logic (including an optional torch backward branch)
src/paddlefleet/transformer/dsv4_hybrid_attention.py	DSv4 hybrid attention alignment branches (including an optional torch backward path)
src/paddlefleet/transformer/dsa_attention.py	Optional torch backward alignment branches for DSA hadamard/grad_k
src/paddlefleet/transformer/csa_attention.py	Adds doc mask to CSA index construction, an optional torch backward branch for sink-softmax, and more
src/paddlefleet/tensor_parallel/layers.py	Adds alignment switches such as TE dgrad and seq-first wgrad to TP linear, embedding, and backward branches
src/paddlefleet/models/gpt/lm_head.py	Adds a sequence-first linear alignment branch to LM Head and removes old probes
src/paddlefleet/models/gpt/gpt_embedding.py	MTP alignment for GPT embedding (pad, position_ids, optional separate embedding, and others)
src/paddlefleet/models/common/language_loss/language_loss.py	Cleans up loss-side alignment logging and MD5 printing; adjusts MTP loss merging logic
src/paddlefleet/fusions/fused_swiglu_scale.py	Adjusts clamp and the CPU/XPU fallback implementation of fused swiglu*scale
src/paddlefleet/fusions/fused_bias_swiglu.py	Adjusts SwiGLU/backward and small-chunk summation alignment logic, and clamp backward details
packages/paddlefleet_ops/src/paddlefleet_ops/utils.py	Changes patch_module_namespace from "move" to "copy" so the original names remain usable
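
As an illustration of the environment-variable boolean switches mentioned for `src/paddlefleet/utils.py`, here is a minimal sketch of such a helper. The function name `env_flag`, the accepted spellings, and the variable name `DSV4_EXAMPLE_SWITCH` are assumptions for illustration only; the helper actually added by this PR is not shown on this page and may differ.

```python
import os

# Accepted spellings are an assumption of this sketch, not the PR's rules.
_TRUE_VALUES = {"1", "true", "yes", "on"}
_FALSE_VALUES = {"0", "false", "no", "off", ""}


def env_flag(name, default=False, environ=None):
    """Read a boolean switch from the environment.

    `environ` can be any mapping; it defaults to os.environ so the helper
    can be exercised without touching the real process environment.
    """
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in _TRUE_VALUES:
        return True
    if value in _FALSE_VALUES:
        return False
    raise ValueError(f"{name}={raw!r} is not a recognized boolean value")


if __name__ == "__main__":
    # DSV4_EXAMPLE_SWITCH is a hypothetical variable name.
    fake_env = {"DSV4_EXAMPLE_SWITCH": "1"}
    print(env_flag("DSV4_EXAMPLE_SWITCH", environ=fake_env))
```

Raising on unrecognized values, rather than silently treating them as false, makes a mistyped switch visible early, which matters when a single switch can change the numerics of a run.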

Comments suppressed due to low confidence (1)

src/paddlefleet/transformer/transformer_layer.py:809

TransformerLayer._forward_attention() no longer forwards KV-cache arguments (past_key_values / layer_idx / use_cache) into self.self_attn. This breaks the native KV cache inference path (e.g. src/paddlefleet/generation/greedy_generator.py expects these kwargs to flow through to DotProductAttention.forward).

            attention_output_with_bias = self.self_attn(
                input_layernorm_output,
                attention_mask=attention_mask,
                attn_mask_startend_row_indices=attn_mask_startend_row_indices,
                rope_freqs_cis=rope_freqs_cis,

 from paddle.distributed.fleet.utils import recompute
-from paddle.distributed.fleet.utils.sequence_parallel_utils import (
-    ScatterOp,
-)



+    d_scale = paddle.sum(
+        out_grad.cast(paddle.float32) * swiglu_val.cast(paddle.float32), axis=-1
+    ).cast(scale.dtype)
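
The `d_scale` expression above can be read as the gradient of a per-row scale. This is a sketch under the assumption that the forward pass multiplies the SwiGLU output $v$ by a scale $s$ broadcast across the last axis; the forward code is not shown on this page. Writing $g = \partial L / \partial y$ for `out_grad` and $v$ for `swiglu_val`:

```latex
y_{t,j} = s_t \, v_{t,j}
\quad\Longrightarrow\quad
\frac{\partial L}{\partial s_t} = \sum_{j} g_{t,j} \, v_{t,j}
```

The diff line evaluates this sum over the last axis with both operands cast to float32, then casts the result back to the dtype of `scale`.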


+        # Megatron clean path applies H_res.T to residual.
+        h_res_batched = h_res.astype(residual.dtype).transpose([0, 1, 3, 2]).reshape([num_tokens, n, n])
        # [..., n*C] -> [..., n, C] -> [batch, n, C]


@@ -924,16 +820,10 @@ def _forward_attention(
                rotary_pos_emb=rotary_pos_emb,


codecov-commenter · 2026-06-08T11:37:16Z

Codecov Report

❌ Patch coverage is 41.33710% with 623 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@60700cc).

Files with missing lines	Patch %	Lines
src/paddlefleet/transformer/hyper_connection.py	31.84%	252 Missing and 7 partials ⚠️
src/paddlefleet/tensor_parallel/layers.py	23.59%	62 Missing and 6 partials ⚠️
src/paddlefleet/transformer/csa_attention.py	31.11%	59 Missing and 3 partials ⚠️
...c/paddlefleet/transformer/dsv4_hybrid_attention.py	36.61%	44 Missing and 1 partial ⚠️
src/paddlefleet/transformer/moe/moe_layer.py	47.67%	40 Missing and 5 partials ⚠️
src/paddlefleet/transformer/dsa_attention.py	26.19%	28 Missing and 3 partials ⚠️
src/paddlefleet/transformer/moe/moe_router.py	51.02%	20 Missing and 4 partials ⚠️
src/paddlefleet/fusions/fused_bias_swiglu.py	51.21%	17 Missing and 3 partials ⚠️
src/paddlefleet/fusions/fused_swiglu_scale.py	69.76%	10 Missing and 3 partials ⚠️
...rc/paddlefleet/transformer/moe/token_dispatcher.py	54.16%	9 Missing and 2 partials ⚠️
... and 6 more

❌ Your patch status has failed because the patch coverage (41.33%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #1151   +/-   ##
==========================================
  Coverage           ?   44.02%           
==========================================
  Files              ?       28           
  Lines              ?     1129           
  Branches           ?      137           
==========================================
  Hits               ?      497           
  Misses             ?      582           
  Partials           ?       50
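
The project coverage figure follows from the counts in the table above. A small worked check, assuming partially covered lines count as not hit (the function name is hypothetical; the counts satisfy Hits + Misses + Partials = Lines):

```python
def coverage_percent(hits, misses, partials):
    """Coverage as hits over all tracked lines, in percent."""
    lines = hits + misses + partials
    return 100.0 * hits / lines


if __name__ == "__main__":
    # Counts taken from the coverage diff: 497 hits, 582 misses, 50 partials.
    print(f"{coverage_percent(497, 582, 50):.2f}%")  # prints 44.02%
```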

Flag	Coverage Δ
coverage_combine	`44.02% <41.33%> (?)`


Files with missing lines	Coverage Δ
src/paddlefleet/transformer/moe/fp8_utils.py	`100.00% <100.00%> (ø)`
src/paddlefleet/transformer/transformer_layer.py	`100.00% <100.00%> (ø)`
src/paddlefleet/utils.py	`100.00% <ø> (ø)`
src/paddlefleet/models/gpt/lm_head.py	`73.33% <73.33%> (ø)`
src/paddlefleet/models/gpt/gpt_embedding.py	`53.33% <53.33%> (ø)`
src/paddlefleet/transformer/moe/moe_utils.py	`56.25% <56.25%> (ø)`
src/paddlefleet/transformer/paddle_norm.py	`72.00% <72.00%> (ø)`
...fleet/models/common/language_loss/language_loss.py	`56.52% <56.52%> (ø)`
.../paddlefleet/transformer/multi_token_prediction.py	`41.17% <41.17%> (ø)`
...rc/paddlefleet/transformer/moe/token_dispatcher.py	`54.16% <54.16%> (ø)`
... and 9 more


Paddle-CI-Bot · 2026-06-08T11:43:35Z

PaddleFleet Log Analysis

Run #27135240073 · Attempt 1

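Log Analysis Report

Pipeline name	Issue label	Suggested fix	Log snippet
Build Fleet whl	Docker image pull timeout (network error)	Unrelated to this PR. Runner Paddle-cpu-0623-3 timed out accessing the CCR image registry; CI maintainers should check that machine's network/proxy configuration	Error code

Failed test cases:

Build Fleet whl / Check docker image and run container
  - docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle_manylinux_devel:cuda12.9-cudnn9.9-trt10.5-gcc11 failed
    net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Root cause analysis:
While pulling the build image, runner Paddle-cpu-0623-3 timed out waiting for response headers on an HTTPS request to the CCR authentication service ccr-auth.bj.baidubce.com. As a result, docker pull failed, the container was not created, and all subsequent build steps were skipped. This is unrelated to the code changes in this PR.

Suggested fix:
CI maintainers should check whether the network egress and HTTP proxy (180.76.138.26:18801) on Paddle-cpu-0623-3 can reach ccr-auth.bj.baidubce.com, and restart the proxy service or re-trigger the pipeline if necessary.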


[DSv4] Align training numerics with Megatron reference

aa05648

Signed-off-by: huangjiyi <947613776@qq.com>

Copilot AI review requested due to automatic review settings June 8, 2026 10:32

Copilot started reviewing on behalf of huangjiyi June 8, 2026 10:32

huangjiyi mentioned this pull request Jun 8, 2026

[DSv4] Support Megatron-aligned training loop PaddlePaddle/PaddleFormers#4623

Closed

Copilot AI reviewed Jun 8, 2026


[DSv4] Match revert3 reference numerics

6ff7151

Signed-off-by: huangjiyi <947613776@qq.com>

huangjiyi wants to merge 2 commits into develop from codex/dsv4-align-origin-develop-20260608_155649

4 participants
