DD: refactor finishMoveKeys and finishMoveShards into focused helpers#13608
Open
saintstack wants to merge 1 commit into
Open
DD: refactor finishMoveKeys and finishMoveShards into focused helpers#13608saintstack wants to merge 1 commit into
saintstack wants to merge 1 commit into
Conversation
Both functions grew past 400-500 lines of a single deeply-nested try/loop.
Extract the four sequential phases each performs — decode planning snapshot,
build dest SS interfaces, wait outside a transaction for destination
readiness, re-verify and commit — into standalone helpers so the top-level
functions read as orchestration loops.
finishMoveKeys (~440 → ~180 lines) delegates to:
* DecodedKeyServersState + decodeKeyServersState() —
two-pass keyServers validation with the empty-dest / subset-of-team
tolerance from the planning loop; returns nullopt when the iteration's
range is already fully moved.
* buildKeysDestServerInterfaces() — reads serverList entries and returns
(interfaces, read version) so the caller can drop the txn before the
long wait.
* waitForKeysDestServers() — issues waitForShardReady() for SSes and
TSSes, applies the counter-based TSS ignore policy, returns the ready
count.
* reverifyKeysDestAndCommit() — post-wait re-read, KRM-boundary
re-truncation, destUnchanged() check, keyServers/serverKeys writes,
commit. Returns false on dest change so the caller retries via the
existing retryAfterPostWaitChange().
finishMoveShards (~525 → ~271 lines) delegates to shards-specific analogs
(the two flavors' TSS policy, dest-decode invariants, and post-wait writes
differ enough that forcing shared code hurts readability):
* DecodedShardsKeyServers + decodeAndPreCheckShards() — dataMoveId stamp
check on every sub-range and the per-sub-range AUDIT_DATAMOVE_PRE_CHECK
(coroutine because of the audit call). Throws retry() and sets
cancelDataMove on stamp mismatch, matching prior behavior.
* buildShardsDestServerInterfaces() — like the keys flavor but throws
retry() on missing serverList entries rather than asserting, matching
the pre-refactor shards code.
* waitForShardsDestServers() — implements the time-based skipTss latch
(all SSes ready for DD_WAIT_TSS_DATA_MOVE_DELAY without TSS catching
up → skip TSS for the remainder of the move).
* ReverifyShardsResult (enum) + reverifyShardsAndCommit() — handles all
three concurrent-change branches (dataMove deleted / phase changed /
dest changed) uniformly via retryAfterPostWaitChange(), plus KRM
re-truncation, bulk-load task completion, partial-vs-full commit, and
dataMove metadata clear. Returns {RetryLoop, PartialCommitted,
FullyCommitted} so the outer loop can distinguish "keep going on the
same move" from "move complete".
Pure refactor: no behavior change intended. Every CODE_PROBE, TraceEvent,
ASSERT, error-code branch, and control-flow decision from the current
finishMoveKeys / finishMoveShards on origin/main is preserved — including
the newer post-apple#13364 machinery (retryAfterPostWaitChange, KRM boundary
re-truncation with rare probes, ASSERT rereadEnd <= end, per-iteration
FlowLock take/release). The catch handlers, backoff paths, and retry-cap
semantics are untouched.
Contributor
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Contributor
Result of foundationdb-pr-macos-m1 on macOS 14.x
|
Contributor
Result of foundationdb-pr-clang on Linux RHEL 9
|
Contributor
Result of foundationdb-pr-clang-arm on Linux RHEL 9
|
Contributor
Result of foundationdb-pr-macos on macOS 14.x
|
Contributor
Result of foundationdb-pr on Linux RHEL 9
|
Contributor
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Both functions grew past 400-500 lines of a single deeply-nested try/loop. Extract the four sequential phases each performs — decode planning snapshot, build dest SS interfaces, wait outside a transaction for destination readiness, re-verify and commit — into standalone helpers so the top-level functions read as orchestration loops.
finishMoveKeys (~440 → ~180 lines) delegates to:
finishMoveShards (~525 → ~271 lines) delegates to shards-specific analogs (the two flavors' TSS policy, dest-decode invariants, and post-wait writes differ enough that forcing shared code hurts readability):
Pure refactor: no behavior change intended. Every CODE_PROBE, TraceEvent, ASSERT, error-code branch, and control-flow decision from the current finishMoveKeys / finishMoveShards on origin/main is preserved — including the newer post-#13364 machinery (retryAfterPostWaitChange, KRM boundary re-truncation with rare probes, ASSERT rereadEnd <= end, per-iteration FlowLock take/release). The catch handlers, backoff paths, and retry-cap semantics are untouched.
Two joshua runs back-to-back:
20260701-195314-stack-refactor-8d063fe367fe89d0 compressed=True data_size=37315233 duration=2111916 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:33:10 sanity=False started=100000 stopped=20260701-202624 submitted=20260701-195314 timeout=5400 username=stack-refactor20260701-211819-stack-refactor2-dbca4f2c3122dcd3 compressed=True data_size=37316315 duration=2610790 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:41:54 sanity=False started=100000 stopped=20260701-220013 submitted=20260701-211819 timeout=5400 username=stack-refactor2Code review
review-findings.md
(This PR includes attention for item #2 in the review-findings.md list)