[DSv4] Align training numerics with Megatron reference#1151

Open
huangjiyi wants to merge 2 commits into develop from codex/dsv4-align-origin-develop-20260608_155649

@huangjiyi huangjiyi commented Jun 8, 2026

Member

Summary

This PR adds the PaddleFleet-side DSv4 numerical alignment paths used to reproduce the Megatron reference training trajectory on current develop.

Main areas covered:

  • Megatron-compatible DSv4 SwiGLU forward / wgrad behavior.
  • Embedding, LM head, language loss, tensor-parallel linear, CSA/DSA attention, mHC, MoE router / dispatcher / MLP, MTP, norm, and utility alignment paths.
  • Revert3 DSv4 attention / CSA compressor / mHC numerical semantics matching the Megatron CleanAlign reference.
  • DeepEP / EP8 / mbs2 / acc2 training path support for bitwise loss reproduction.
  • A single-card repro test for CSA sink-softmax backward behavior.

Pairs with PaddlePaddle/PaddleFormers#4623.

Validation

Validated in /root/paddlejob/share-storage/gpfs/system-public/huangjiyi/dsv4-flash-workspace against Megatron CleanAlign revert3:

RUN_STAMP=retest_source_revert3_100step_20260608 \
OUTPUT_PREFIX=dpskv4_ep8_4layer_1k_retest_source_revert3_100step \
MASTER_PORT=40531 \
bash paddlefleet_dsv4/run_align_100step_lr1e3_mbs2_acc2.sh

Result versus Megatron CleanAlign revert3 100-step log:

main_matches=200/200
mtp_matches=200/200
first_main_diff=None
first_mtp_diff=None
step100 main final_loss=1.04997014999389648438 md5=e24faebe91ac491b4be0b8c71060a06a
step100 mtp final_loss=2.01599717140197753906 md5=80c6ec222c562733922f22dabb4fc8fe
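
A minimal sketch of how a per-entry bitwise comparison like the one summarized above could be computed. It assumes each run's losses are already parsed into equal-length lists of floats and that "match" means identical float32 bit patterns; the function names are hypothetical, and this is not the script used to produce the numbers in this PR.

```python
import struct


def fp32_bits(value: float) -> bytes:
    """Return the little-endian float32 bit pattern of a value."""
    return struct.pack("<f", value)


def compare_losses(reference, candidate):
    """Count bitwise-equal entries and report the first differing index (1-based)."""
    matches = 0
    first_diff = None
    for index, (ref, cand) in enumerate(zip(reference, candidate), start=1):
        if fp32_bits(ref) == fp32_bits(cand):
            matches += 1
        elif first_diff is None:
            first_diff = index
    return matches, first_diff


# Illustrative values only; these are not losses from the runs above.
same = compare_losses([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = compare_losses([1.0, 2.0, 3.0], [1.0, 2.5, 3.0])
print(f"matches={same[0]}/3 first_diff={same[1]}")
print(f"matches={different[0]}/3 first_diff={different[1]}")
```

Comparing bit patterns rather than using a tolerance is what distinguishes a bitwise reproduction from an approximate one: any rounding difference in any entry shows up as a mismatch.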

Also validated the same commit in /root/paddlejob/share-storage/gpfs/system-public/huangjiyi/dsv4_backward_align/0605_latest_align_22steps after syncing these changes:

main_matches=200/200
mtp_matches=200/200
first_main_diff=None
first_mtp_diff=None

Static checks:

python -m py_compile src/paddlefleet/transformer/csa_attention.py src/paddlefleet/transformer/dsv4_hybrid_attention.py src/paddlefleet/transformer/hyper_connection.py tests/single_card_tests/transformer/test_csa_sink_subtract_backward_repro.py
git diff --check

Signed-off-by: huangjiyi <947613776@qq.com>

Copilot AI left a comment

Contributor

Pull request overview

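This PR introduces and adjusts DSv4 numerical-alignment paths on the PaddleFleet side so that the Megatron reference training trajectory can be reproduced more precisely. It covers sequence-first handling, alignment branches for key operators such as MoE, HC, attention, and linear layers, and adjustments to several debug and logging paths.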

Changes:

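  • Adds a set of DSv4 environment-variable switches and "sequence-first" alignment branches (RMSNorm, LM head, MTP, MoE, and others).
  • Adjusts implementation details of the Transformer, MoE, HC, attention, and TP linear layers to match Megatron's computation and gradient-accumulation order, and removes some old MD5 probe code.
  • Extends and adjusts several fused/FP8 paths and loss-side log output to support bitwise alignment verification.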
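
As an illustration of the environment-variable switches mentioned in the list above, a boolean flag helper might look roughly like the sketch below. The helper name `env_flag`, the accepted truthy strings, and the variable name `DSV4_EXAMPLE_SEQ_FIRST` are all assumptions made for this example; the actual helpers in src/paddlefleet/utils.py are not shown on this page.

```python
import os

# Strings treated as "enabled"; this set is an assumption for the sketch.
_TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean switch from the environment, falling back to a default."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUTHY


# Hypothetical variable name, used only to demonstrate the helper.
os.environ["DSV4_EXAMPLE_SEQ_FIRST"] = "1"
if env_flag("DSV4_EXAMPLE_SEQ_FIRST"):
    print("sequence-first alignment branch enabled")
```

Reading the flag through one helper keeps the default behavior unchanged when the variable is unset, so an alignment branch only activates when a run opts in explicitly.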

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 4 comments.

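| File | Description |
| --- | --- |
| src/paddlefleet/utils.py | Adds DSv4 environment-variable boolean switches and helper functions for the sequence-first feature switch |
| src/paddlefleet/transformer/transformer_layer.py | TransformerLayer alignment and simplification changes (including MTP-related handling and a changed attention call form) |
| src/paddlefleet/transformer/paddle_norm.py | Adds sequence-first numerical-alignment branches to RMSNorm/FusedRMSNorm |
| src/paddlefleet/transformer/multi_token_prediction.py | MTP initialization and linear alignment branches; adjusts the magic-send path |
| src/paddlefleet/transformer/moe/token_dispatcher.py | Adds alignment switches to the MoE dispatcher and saves/passes additional routing-probability tensors |
| src/paddlefleet/transformer/moe/moe_utils.py | fp32 accumulation for MoE permute/unpermute and custom-backward alignment switches |
| src/paddlefleet/transformer/moe/moe_router.py | Alignment branches for router matmul/wgrad (including torch mm and fp32 wgrad accumulation switches) |
| src/paddlefleet/transformer/moe/moe_layer.py | MoE layer alignment branches, reordered input branch splitting, reworked sonic moe and fused paths |
| src/paddlefleet/transformer/moe/fp8_utils.py | Adjusts FP8/MoE fused paths and clamp/backward logic; adjusts indices generation and dtype |
| src/paddlefleet/transformer/hyper_connection.py | Major changes to mHC/contract/sinkhorn alignment logic (including an optional torch backward branch) |
| src/paddlefleet/transformer/dsv4_hybrid_attention.py | DSv4 hybrid attention alignment branches (including an optional torch backward path) |
| src/paddlefleet/transformer/dsa_attention.py | Optional torch backward alignment branches for DSA hadamard/grad_k |
| src/paddlefleet/transformer/csa_attention.py | Adds a doc mask to CSA index construction, an optional torch backward branch for sink-softmax, and related changes |
| src/paddlefleet/tensor_parallel/layers.py | Adds alignment switches such as TE dgrad and seq-first wgrad to the TP linear, embedding, and backward branches |
| src/paddlefleet/models/gpt/lm_head.py | Adds a sequence-first linear alignment branch to the LM head and removes old probes |
| src/paddlefleet/models/gpt/gpt_embedding.py | MTP alignment for the GPT embedding (pad, position_ids, optional separate embedding, etc.) |
| src/paddlefleet/models/common/language_loss/language_loss.py | Tidies loss-side alignment logging and MD5 printing; adjusts the MTP loss merging logic |
| src/paddlefleet/fusions/fused_swiglu_scale.py | Adjusts the clamp and the CPU/XPU fallback implementation of fused swiglu*scale |
| src/paddlefleet/fusions/fused_bias_swiglu.py | Adjusts SwiGLU/backward and small-chunk summation alignment logic, plus clamp backward details |
| packages/paddlefleet_ops/src/paddlefleet_ops/utils.py | Changes patch_module_namespace from "move" to "copy" so the original names stay usable |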
Comments suppressed due to low confidence (1)

src/paddlefleet/transformer/transformer_layer.py:809

  • TransformerLayer._forward_attention() no longer forwards KV-cache arguments (past_key_values / layer_idx / use_cache) into self.self_attn. This breaks the native KV cache inference path (e.g. src/paddlefleet/generation/greedy_generator.py expects these kwargs to flow through to DotProductAttention.forward).
            attention_output_with_bias = self.self_attn(
                input_layernorm_output,
                attention_mask=attention_mask,
                attn_mask_startend_row_indices=attn_mask_startend_row_indices,
                rope_freqs_cis=rope_freqs_cis,
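
The concern above is that cache-related keyword arguments stop reaching the attention module. A toy sketch of the forwarding pattern the comment asks for follows; `ToyAttention` and `ToyLayer` are hypothetical stand-ins and do not reproduce the PaddleFleet classes or their signatures.

```python
class ToyAttention:
    """Stand-in attention module that reports which cache arguments it received."""

    def forward(self, hidden, attention_mask=None, past_key_values=None,
                layer_idx=None, use_cache=False):
        cache_active = bool(use_cache) and past_key_values is not None
        return {"hidden": hidden, "cache_active": cache_active, "layer_idx": layer_idx}


class ToyLayer:
    """Stand-in layer that passes cache kwargs through instead of dropping them."""

    def __init__(self, attention):
        self.self_attn = attention

    def forward_attention(self, hidden, attention_mask=None, **cache_kwargs):
        # Forward past_key_values / layer_idx / use_cache unchanged, if provided.
        return self.self_attn.forward(
            hidden, attention_mask=attention_mask, **cache_kwargs
        )


layer = ToyLayer(ToyAttention())
with_cache = layer.forward_attention("h", past_key_values=[], layer_idx=3, use_cache=True)
without_cache = layer.forward_attention("h")
print(with_cache["cache_active"], without_cache["cache_active"])
```

If the layer omitted `**cache_kwargs`, a generator passing `use_cache=True` would either fail with an unexpected-keyword error or silently run without a cache, depending on the layer's signature.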

Comment on lines 31 to 32
from paddle.distributed.fleet.utils import recompute
from paddle.distributed.fleet.utils.sequence_parallel_utils import (
    ScatterOp,
)

Comment on lines +102 to +104
d_scale = paddle.sum(
out_grad.cast(paddle.float32) * swiglu_val.cast(paddle.float32), axis=-1
).cast(scale.dtype)
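
The snippet above casts both operands to float32 before the reduction and casts the result back once at the end. A small sketch of why the order of rounding matters for a sum is shown below. It uses Python's `struct` half-precision format as the low-precision type and Python floats as the high-precision accumulator; both are stand-ins chosen for this example, not the dtypes used in the PR.

```python
import struct


def round_to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def sum_in_low_precision(values):
    """Accumulate with a rounding step after every addition."""
    acc = 0.0
    for v in values:
        acc = round_to_fp16(acc + round_to_fp16(v))
    return acc


def sum_then_cast(values):
    """Accumulate in high precision and round once at the end."""
    return round_to_fp16(sum(float(v) for v in values))


values = [1.0] * 4096
low = sum_in_low_precision(values)
high = sum_then_cast(values)
print(low, high)
```

In half precision the spacing between representable values is 2 once the running sum reaches 2048, so adding 1 no longer changes the accumulator, while the high-precision sum reaches 4096 and survives the final cast. Two implementations that round at different points can therefore disagree, which is why the accumulation dtype matters for bitwise alignment.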
Comment on lines 898 to 900
# Megatron clean path applies H_res.T to residual.
h_res_batched = h_res.astype(residual.dtype).transpose([0, 1, 3, 2]).reshape([num_tokens, n, n])
# [..., n*C] -> [..., n, C] -> [batch, n, C]
Comment on lines 816 to 820
@@ -924,16 +820,10 @@ def _forward_attention(
rotary_pos_emb=rotary_pos_emb,
@codecov-commenter


Codecov Report

❌ Patch coverage is 41.33710% with 623 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@60700cc). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/paddlefleet/transformer/hyper_connection.py | 31.84% | 252 Missing and 7 partials ⚠️ |
| src/paddlefleet/tensor_parallel/layers.py | 23.59% | 62 Missing and 6 partials ⚠️ |
| src/paddlefleet/transformer/csa_attention.py | 31.11% | 59 Missing and 3 partials ⚠️ |
| ...c/paddlefleet/transformer/dsv4_hybrid_attention.py | 36.61% | 44 Missing and 1 partial ⚠️ |
| src/paddlefleet/transformer/moe/moe_layer.py | 47.67% | 40 Missing and 5 partials ⚠️ |
| src/paddlefleet/transformer/dsa_attention.py | 26.19% | 28 Missing and 3 partials ⚠️ |
| src/paddlefleet/transformer/moe/moe_router.py | 51.02% | 20 Missing and 4 partials ⚠️ |
| src/paddlefleet/fusions/fused_bias_swiglu.py | 51.21% | 17 Missing and 3 partials ⚠️ |
| src/paddlefleet/fusions/fused_swiglu_scale.py | 69.76% | 10 Missing and 3 partials ⚠️ |
| ...rc/paddlefleet/transformer/moe/token_dispatcher.py | 54.16% | 9 Missing and 2 partials ⚠️ |

... and 6 more

❌ Your patch status has failed because the patch coverage (41.33%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #1151   +/-   ##
==========================================
  Coverage           ?   44.02%           
==========================================
  Files              ?       28           
  Lines              ?     1129           
  Branches           ?      137           
==========================================
  Hits               ?      497           
  Misses             ?      582           
  Partials           ?       50           
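
The percentages in this report can be checked from the counts it prints. The sketch below assumes coverage is hits divided by the sum of hits, misses, and partials, which is consistent with the totals above (497 + 582 + 50 = 1129). The patch line count is inferred from the reported 41.33710% and 623 uncovered lines; it is not stated in the report.

```python
def coverage_percent(hits: int, misses: int, partials: int) -> float:
    """Coverage as hits over all tracked lines, in percent."""
    total = hits + misses + partials
    return 100.0 * hits / total


# Project totals taken from the coverage diff above.
project_coverage = coverage_percent(hits=497, misses=582, partials=50)

# Inferred patch size: 623 uncovered lines at 41.33710% patch coverage.
patch_uncovered = 623
patch_coverage = 41.33710
inferred_patch_lines = round(patch_uncovered / (1.0 - patch_coverage / 100.0))

print(f"project coverage: {project_coverage:.2f}%")
print(f"inferred patch lines: {inferred_patch_lines}")
```

Under these assumptions the patch touches about 1062 tracked lines, of which roughly 439 are covered, well short of the 90.00% target quoted above.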
| Flag | Coverage Δ |
| --- | --- |
| coverage_combine | 44.02% <41.33%> (?) |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/paddlefleet/transformer/moe/fp8_utils.py | 100.00% <100.00%> (ø) |
| src/paddlefleet/transformer/transformer_layer.py | 100.00% <100.00%> (ø) |
| src/paddlefleet/utils.py | 100.00% <ø> (ø) |
| src/paddlefleet/models/gpt/lm_head.py | 73.33% <73.33%> (ø) |
| src/paddlefleet/models/gpt/gpt_embedding.py | 53.33% <53.33%> (ø) |
| src/paddlefleet/transformer/moe/moe_utils.py | 56.25% <56.25%> (ø) |
| src/paddlefleet/transformer/paddle_norm.py | 72.00% <72.00%> (ø) |
| ...fleet/models/common/language_loss/language_loss.py | 56.52% <56.52%> (ø) |
| .../paddlefleet/transformer/multi_token_prediction.py | 41.17% <41.17%> (ø) |
| ...rc/paddlefleet/transformer/moe/token_dispatcher.py | 54.16% <54.16%> (ø) |

... and 9 more

Signed-off-by: huangjiyi <947613776@qq.com>

Paddle-CI-Bot commented Jun 8, 2026


PaddleFleet Log Analysis

Run #27135240073 · Attempt 1

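Log analysis report

| Pipeline | Issue label | Suggested fix | Log snippet |
| --- | --- | --- | --- |
| Build Fleet whl | Docker image pull timed out (network error) | Unrelated to this PR. Runner Paddle-cpu-0623-3 timed out while accessing the CCR image registry; CI maintainers should check that machine's network and proxy configuration. | Error code |

Failed test cases:

Build Fleet whl / Check docker image and run container
  - docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle_manylinux_devel:cuda12.9-cudnn9.9-trt10.5-gcc11 failed
    net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Root cause analysis:
While pulling the build image, runner Paddle-cpu-0623-3 sent an HTTPS request to the CCR authentication service ccr-auth.bj.baidubce.com that timed out waiting for response headers. As a result, docker pull failed, the container was never created, and all subsequent build steps were skipped. This is unrelated to the code changes in this PR.

Suggested fix:
CI maintainers should check whether the network egress and HTTP proxy (180.76.138.26:18801) on Paddle-cpu-0623-3 can reach ccr-auth.bj.baidubce.com, and restart the proxy service or re-trigger the pipeline if necessary.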


