[DSv4] Align training numerics with Megatron reference #1151
Open
huangjiyi wants to merge 2 commits into
Signed-off-by: huangjiyi <947613776@qq.com>
Pull request overview
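This PR introduces and adjusts DSv4 numerical alignment paths on the PaddleFleet side so that the Megatron reference training trajectory can be reproduced more precisely. It covers sequence-first handling, alignment branches for key operators (MoE, HC, Attention, Linear), and adjustments to several debugging and logging paths.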
Changes:
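- Adds a set of DSv4 environment-variable switches and "sequence-first" alignment branches (RMSNorm, LM Head, MTP, MoE, etc.).
- Adjusts implementation details of Transformer/MoE/HC/Attention/TP Linear to match Megatron's computation and gradient-accumulation order, and removes some legacy MD5 probe code.
- Extends and adjusts several fused/FP8 paths and loss-side log output to support bitwise alignment verification.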
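Why accumulation order matters for bitwise alignment can be illustrated with a minimal, standard-library-only sketch. It is not code from this PR: the helper names (`md5_of_floats`, `sum_left_to_right`, `sum_right_to_left`) are hypothetical, and plain Python doubles stand in for tensors. The same three values summed in two different orders give results that agree to within rounding but hash differently, which is the kind of mismatch an MD5 comparison would flag.

```python
import hashlib
import struct


def md5_of_floats(values):
    """Hash the exact IEEE-754 double bytes of a sequence of floats."""
    payload = struct.pack(f"<{len(values)}d", *values)
    return hashlib.md5(payload).hexdigest()


def sum_left_to_right(values):
    """Accumulate in the given order."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_right_to_left(values):
    """Accumulate in the reverse order."""
    total = 0.0
    for v in reversed(values):
        total += v
    return total


grads = [0.1, 0.2, 0.3]
a = sum_left_to_right(grads)   # (0.1 + 0.2) + 0.3
b = sum_right_to_left(grads)   # (0.3 + 0.2) + 0.1

# Numerically close, but not bit-identical, so the digests differ.
print(a == b)
print(md5_of_floats([a]) == md5_of_floats([b]))
```

A tolerance-based check would accept both results here; only a byte-level comparison distinguishes them, which is why the order of reductions has to match the reference exactly.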
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/paddlefleet/utils.py | Adds DSv4 environment-variable boolean switches and helper functions for the sequence-first feature switch |
| src/paddlefleet/transformer/transformer_layer.py | Adjusts TransformerLayer alignment and simplified paths (including MTP handling and changes to how attention is called) |
| src/paddlefleet/transformer/paddle_norm.py | Adds sequence-first numerical alignment branches to RMSNorm/FusedRMSNorm |
| src/paddlefleet/transformer/multi_token_prediction.py | Adjusts MTP initialization, linear alignment branches, and the magic-send path |
| src/paddlefleet/transformer/moe/token_dispatcher.py | Adds alignment switches to the MoE dispatcher and additionally saves and passes the probability routing tensors |
| src/paddlefleet/transformer/moe/moe_utils.py | Adds switches for fp32 accumulation and custom backward alignment in MoE permute/unpermute |
| src/paddlefleet/transformer/moe/moe_router.py | Adds alignment branches for router matmul/wgrad (including torch mm and fp32 wgrad accumulation switches) |
| src/paddlefleet/transformer/moe/moe_layer.py | Adds MoE layer alignment branches, reorders input-branch splitting, and reworks the sonic moe/fused paths |
| src/paddlefleet/transformer/moe/fp8_utils.py | Adjusts FP8/MoE fused paths and clamp/backward logic, and changes indices generation and dtype handling |
| src/paddlefleet/transformer/hyper_connection.py | Substantially reworks mHC/contract/sinkhorn alignment logic (including an optional torch backward branch) |
| src/paddlefleet/transformer/dsv4_hybrid_attention.py | Adds DSv4 hybrid attention alignment branches (including an optional torch backward path) |
| src/paddlefleet/transformer/dsa_attention.py | Adds optional torch backward alignment branches for DSA hadamard/grad_k |
| src/paddlefleet/transformer/csa_attention.py | Adds doc mask to CSA index construction, an optional torch backward branch for sink-softmax, etc. |
| src/paddlefleet/tensor_parallel/layers.py | Adds alignment switches such as TE dgrad and seq-first wgrad to the TP linear/embedding/backward branches |
| src/paddlefleet/models/gpt/lm_head.py | Adds a sequence-first linear alignment branch to LM Head and removes old probes |
| src/paddlefleet/models/gpt/gpt_embedding.py | Aligns MTP handling in GPT embedding (pad, position_ids, optional separate embedding, etc.) |
| src/paddlefleet/models/common/language_loss/language_loss.py | Tidies loss-side alignment logging and MD5 printing, and adjusts the MTP loss merging logic |
| src/paddlefleet/fusions/fused_swiglu_scale.py | Adjusts the clamp and the CPU/XPU fallback implementation of fused swiglu*scale |
| src/paddlefleet/fusions/fused_bias_swiglu.py | Adjusts SwiGLU/backward and small-chunk summation alignment logic, plus clamp backward details |
| packages/paddlefleet_ops/src/paddlefleet_ops/utils.py | Changes patch_module_namespace behavior from "move" to "copy" so that the original names remain usable |
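The table mentions boolean environment-variable switches and a sequence-first feature helper in `src/paddlefleet/utils.py`, but the page does not show them. The following is only a sketch of how such helpers are commonly written. The function names (`env_flag`, `sequence_first_enabled`) and the variable names (`DEMO_ALIGN_SEQ_FIRST`, `DEMO_ALIGN_SEQ_FIRST_<FEATURE>`) are hypothetical and are not the names used in this PR.

```python
import os

# Strings treated as "true"; everything else, including "", is false.
_TRUE_VALUES = {"1", "true", "yes", "on"}


def env_flag(name, default=False, environ=None):
    """Read a boolean switch from the environment (hypothetical helper)."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUE_VALUES


def sequence_first_enabled(feature, environ=None):
    """Hypothetical scheme: one global switch plus a per-feature opt-out."""
    if not env_flag("DEMO_ALIGN_SEQ_FIRST", environ=environ):
        return False
    per_feature = f"DEMO_ALIGN_SEQ_FIRST_{feature.upper()}"
    return env_flag(per_feature, default=True, environ=environ)


# Usage with an explicit mapping, so the real environment is not touched.
demo_env = {"DEMO_ALIGN_SEQ_FIRST": "1", "DEMO_ALIGN_SEQ_FIRST_RMSNORM": "0"}
print(sequence_first_enabled("lm_head", environ=demo_env))
print(sequence_first_enabled("rmsnorm", environ=demo_env))
```

Passing the mapping explicitly keeps the helper testable. A real implementation might also cache the parsed value so that the switch cannot change partway through a training run.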
Comments suppressed due to low confidence (1)
src/paddlefleet/transformer/transformer_layer.py:809
TransformerLayer._forward_attention() no longer forwards KV-cache arguments (past_key_values/layer_idx/use_cache) into self.self_attn. This breaks the native KV cache inference path (e.g. src/paddlefleet/generation/greedy_generator.py expects these kwargs to flow through to DotProductAttention.forward).
attention_output_with_bias = self.self_attn(
input_layernorm_output,
attention_mask=attention_mask,
attn_mask_startend_row_indices=attn_mask_startend_row_indices,
rope_freqs_cis=rope_freqs_cis,
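The concern in this comment can be sketched with stub classes. This is not PaddleFleet code: `StubSelfAttention` and `StubLayer` are hypothetical stand-ins, and only the three keyword names (`past_key_values`, `layer_idx`, `use_cache`) come from the comment. The sketch shows a `_forward_attention` that passes the KV-cache arguments through to the attention module, so that a generator relying on them still receives them.

```python
class StubSelfAttention:
    """Hypothetical attention module that records the kwargs it receives."""

    def __init__(self):
        self.last_kwargs = None

    def __call__(self, hidden_states, **kwargs):
        self.last_kwargs = kwargs
        return hidden_states


class StubLayer:
    """Hypothetical layer whose attention call keeps the KV-cache kwargs."""

    def __init__(self):
        self.self_attn = StubSelfAttention()

    def _forward_attention(
        self,
        hidden_states,
        attention_mask=None,
        past_key_values=None,
        layer_idx=None,
        use_cache=False,
    ):
        # Forward the cache-related arguments instead of dropping them.
        return self.self_attn(
            hidden_states,
            attention_mask=attention_mask,
            past_key_values=past_key_values,
            layer_idx=layer_idx,
            use_cache=use_cache,
        )


layer = StubLayer()
cache = {"keys": [], "values": []}
layer._forward_attention([1.0, 2.0], past_key_values=cache, layer_idx=3, use_cache=True)
print(sorted(layer.self_attn.last_kwargs))
```

A small test like this, run against the real layer with a recording stub in place of the attention module, would catch the regression the comment describes.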
Comment on lines 31 to 32
from paddle.distributed.fleet.utils import recompute
from paddle.distributed.fleet.utils.sequence_parallel_utils import (
    ScatterOp,
)
Comment on lines +102 to +104
d_scale = paddle.sum(
    out_grad.cast(paddle.float32) * swiglu_val.cast(paddle.float32), axis=-1
).cast(scale.dtype)
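The reduction in this snippet can be checked with a pure-Python sketch, under one assumption the page does not confirm: that the forward pass is `out[i][j] = swiglu_val[i][j] * scale[i]`, with one scale per row. Under that assumption the gradient with respect to `scale[i]` is the sum over the last axis of `out_grad * swiglu_val`, which is the form shown above. All function names below are hypothetical, and Python doubles are used throughout, so the fp32 casts in the original are not modelled.

```python
def scaled_output(swiglu_val, scale):
    """Assumed forward: each row of swiglu_val is multiplied by its scale."""
    return [[v * s for v in row] for row, s in zip(swiglu_val, scale)]


def d_scale_from_grads(out_grad, swiglu_val):
    """Row-wise sum of out_grad * swiglu_val (reduction over the last axis)."""
    return [
        sum(g * v for g, v in zip(g_row, v_row))
        for g_row, v_row in zip(out_grad, swiglu_val)
    ]


def surrogate_loss(swiglu_val, scale, out_grad):
    """Scalar whose gradient w.r.t. the output equals out_grad."""
    out = scaled_output(swiglu_val, scale)
    return sum(g * o for g_row, o_row in zip(out_grad, out) for g, o in zip(g_row, o_row))


def finite_difference(swiglu_val, scale, out_grad, index, eps=1e-3):
    """Central difference of the surrogate loss w.r.t. scale[index]."""
    plus = list(scale)
    minus = list(scale)
    plus[index] += eps
    minus[index] -= eps
    hi = surrogate_loss(swiglu_val, plus, out_grad)
    lo = surrogate_loss(swiglu_val, minus, out_grad)
    return (hi - lo) / (2 * eps)


swiglu_val = [[3.0, 4.0], [2.0, 2.0]]
out_grad = [[1.0, 2.0], [0.5, -1.0]]
scale = [0.5, 2.0]

d_scale = d_scale_from_grads(out_grad, swiglu_val)
numeric = [finite_difference(swiglu_val, scale, out_grad, i) for i in range(len(scale))]
print(d_scale, numeric)
```

The analytic and numeric gradients agree here up to rounding. In the real kernel the open question is the dtype in which the product and the sum are carried out, which this sketch deliberately leaves aside.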
Comment on lines 898 to 900
# Megatron clean path applies H_res.T to residual.
h_res_batched = h_res.astype(residual.dtype).transpose([0, 1, 3, 2]).reshape([num_tokens, n, n])
# [..., n*C] -> [..., n, C] -> [batch, n, C]
Comment on lines 816 to 820
@@ -924,16 +820,10 @@ def _forward_attention(
rotary_pos_emb=rotary_pos_emb,
Signed-off-by: huangjiyi <947613776@qq.com>
PaddleFleet Log Analysis
Log analysis report
Failed test cases: Root cause analysis: Suggested fixes:
🔄 Automatically updated after each re-run
Summary
This PR adds the PaddleFleet-side DSv4 numerical alignment paths used to reproduce the Megatron reference training trajectory on current develop.
Main areas covered:
Pairs with PaddlePaddle/PaddleFormers#4623.
Validation
Validated in /root/paddlejob/share-storage/gpfs/system-public/huangjiyi/dsv4-flash-workspace against Megatron CleanAlign revert3:
Result versus Megatron CleanAlign revert3 100-step log:
Also validated the same commit in /root/paddlejob/share-storage/gpfs/system-public/huangjiyi/dsv4_backward_align/0605_latest_align_22steps after syncing these changes:
Static checks: