HDDS-11063. TestSnapshotDiffManager#testThreadPoolIsFull is flaky without wait between batches #10581

Open
rhalm wants to merge 2 commits into apache:master from rhalm:HDDS-11063
rhalm (Contributor) commented Jun 22, 2026

What changes were proposed in this pull request?

HDDS-10604 changed OzoneConfiguration initialization in a way that temporarily increased the cost of constructing a new instance, which caused TestSnapshotDiffManager#testThreadPoolIsFull to fail in the no-wait-between-batches scenario. That change was later reverted, but it exposed two issues that this PR addresses:

  • testThreadPoolIsFull relied on timing (using Thread.sleep) to verify pool rejection behavior under load, making the test fragile to any potential latency in the snapshot diff submission path (such as ozone config initialization). The test was rewritten to use a CountDownLatch, making the IN_PROGRESS/REJECTED split deterministic.
  • getSnapshotRootPath constructed a new OzoneConfiguration() on every call just to build a path. It now uses ozoneManager.getConfiguration() instead.
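
To illustrate the first point, here is a minimal, self-contained sketch of a latch-based pool-saturation check. It is not the code from this PR: `LatchedPoolSketch`, its constants, and the plain `ThreadPoolExecutor` setup are hypothetical stand-ins for the snapshot diff job pool, with sizes taken from the "10 running + 10 queued" comment in the test. Every job blocks on a `CountDownLatch`, so no slot can free up while jobs are being submitted and the accepted/rejected split does not depend on timing.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the snapshot diff job pool; names and sizes are assumptions.
class LatchedPoolSketch {
  static final int POOL_SIZE = 10;   // worker threads
  static final int QUEUE_SIZE = 10;  // bounded queue behind the workers

  /** Submits {@code jobs} blocked tasks and returns {accepted, rejected}. */
  static int[] submitWhileBlocked(int jobs) {
    CountDownLatch release = new CountDownLatch(1);
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        POOL_SIZE, POOL_SIZE, 0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(QUEUE_SIZE));
    int accepted = 0;
    int rejected = 0;
    try {
      for (int i = 0; i < jobs; i++) {
        try {
          // Each job parks on the latch, so running and queued slots stay occupied.
          pool.execute(() -> {
            try {
              release.await();
            } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
            }
          });
          accepted++;
        } catch (RejectedExecutionException e) {
          // Default AbortPolicy: thrown once workers and queue are both full.
          rejected++;
        }
      }
    } finally {
      // Open the latch only after all submissions, then let the pool drain.
      release.countDown();
      pool.shutdown();
      try {
        pool.awaitTermination(10, TimeUnit.SECONDS);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
    return new int[] {accepted, rejected};
  }
}
```

Calling `LatchedPoolSketch.submitWhileBlocked(45)` in this sketch accepts 20 jobs and rejects 25, regardless of how slow each submission is, because the latch is opened only after the last job has been submitted.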

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11063

How was this patch tested?

adoroszlai changed the title from "HDDS-11063. TestSnapshotDiffManager#testThreadPoolIsFull is flaky when there is no wait between the batches" to "HDDS-11063. Flaky TestSnapshotDiffManager#testThreadPoolIsFull with no wait between the batches" on Jun 22, 2026
adoroszlai added the snapshot label (https://issues.apache.org/jira/browse/HDDS-6517) on Jun 22, 2026
true, 45, 0),
// 10 running + 10 queued = 20 accepted, remaining 25 rejected
Arguments.of("When the pool does not drain between job batches",
false, 20, 25)

SaketaChalamchala (Contributor) commented Jun 22, 2026

Thanks for the patch @rhalm.
nit: Use full_thread_pool_size = 2 * OZONE_OM_SNAPSHOT_DIFF_THREAD_POOL_SIZE and 45 - full_thread_pool_size for the expected accepted and rejected jobs here.
Otherwise, LGTM.

rhalm (Contributor, Author) replied:

Thanks for the review @SaketaChalamchala, done.
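
A minimal sketch of what the nit above asks for, deriving the expected counts instead of hard-coding 20 and 25. This is not the code that was pushed to the PR: `ExpectedCountsSketch` and its constants are hypothetical, and the pool size of 10 is an assumption taken from the "10 running + 10 queued" comment in the test.

```java
// Hypothetical helper; only the arithmetic mirrors the review suggestion.
class ExpectedCountsSketch {
  // Assumed stand-in for the value configured via OZONE_OM_SNAPSHOT_DIFF_THREAD_POOL_SIZE.
  static final int THREAD_POOL_SIZE = 10;
  static final int TOTAL_JOBS = 45;
  // Running slots plus an equally sized queue.
  static final int FULL_THREAD_POOL_SIZE = 2 * THREAD_POOL_SIZE;

  /** Jobs expected to be accepted for the given scenario. */
  static int expectedAccepted(boolean drainBetweenBatches) {
    // When the pool drains between batches, every job eventually gets a slot.
    return drainBetweenBatches
        ? TOTAL_JOBS
        : Math.min(TOTAL_JOBS, FULL_THREAD_POOL_SIZE);
  }

  /** Jobs expected to be rejected for the given scenario. */
  static int expectedRejected(boolean drainBetweenBatches) {
    return TOTAL_JOBS - expectedAccepted(drainBetweenBatches);
  }
}
```

With these assumed values the two scenarios evaluate to (45, 0) and (20, 25), matching the arguments in the snippet under review, and they stay consistent if the pool size changes.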

totalSubmitted++;
}
if (drainBetweenBatches) {
final int expected = totalSubmitted;

Review comment (Contributor):

Not in scope for the current PR, which fixes the flakiness, but this could be considered as a follow-up test improvement:
In the drainBetweenBatches scenario, would a better check be to:

  1. Keep the latch closed until full_thread_pool_size jobs are submitted
  2. Open the latch
  3. Submit new jobs when totalSubmitted - completedJobs.get() < full_thread_pool_size

rhalm (Contributor, Author) replied:

Great idea, I opened HDDS-15648 as a follow-up.
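
The three steps proposed in the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the HDDS-15648 implementation: `DrainAwareSubmitSketch` is hypothetical, a plain `ThreadPoolExecutor` stands in for the snapshot diff pool, and in-flight work is tracked as accepted jobs minus `completedJobs`. One caveat is visible in the sketch: a job bumps `completedJobs` slightly before its worker actually frees the slot, so a submission can still occasionally be rejected. The sketch therefore reports rejections instead of assuming there are none.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of in-flight-throttled submission; not the Ozone test code.
class DrainAwareSubmitSketch {
  /** Returns {accepted, rejected, completed} after submitting totalJobs jobs. */
  static int[] run(int poolSize, int totalJobs) {
    final int fullThreadPoolSize = 2 * poolSize; // running + queued slots
    final CountDownLatch gate = new CountDownLatch(1);
    final AtomicInteger completedJobs = new AtomicInteger();
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        poolSize, poolSize, 0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(poolSize));
    Runnable job = () -> {
      try {
        gate.await(); // Step 1: jobs stay parked while the gate is closed.
        completedJobs.incrementAndGet();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    };
    int accepted = 0;
    int rejected = 0;
    try {
      for (int i = 0; i < totalJobs; i++) {
        // Step 2: open the gate once the pool has been filled (no-op afterwards).
        if (accepted >= fullThreadPoolSize) {
          gate.countDown();
        }
        // Step 3: submit only while fewer than fullThreadPoolSize jobs are in flight.
        while (accepted - completedJobs.get() >= fullThreadPoolSize) {
          Thread.yield();
        }
        try {
          pool.execute(job);
          accepted++;
        } catch (RejectedExecutionException e) {
          // Possible in the short window before a finished worker polls the queue.
          rejected++;
        }
      }
    } finally {
      gate.countDown(); // also covers totalJobs < fullThreadPoolSize
      pool.shutdown();
      try {
        pool.awaitTermination(10, TimeUnit.SECONDS);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
    return new int[] {accepted, rejected, completedJobs.get()};
  }
}
```

In this sketch the first `fullThreadPoolSize` submissions are always accepted because the gate is still closed. A real test would likely need a completion signal that fires only after the slot is free if it wants to assert zero rejections.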

adoroszlai changed the title from "HDDS-11063. Flaky TestSnapshotDiffManager#testThreadPoolIsFull with no wait between the batches" to "HDDS-11063. TestSnapshotDiffManager#testThreadPoolIsFull is flaky without wait between batches" on Jun 23, 2026
adoroszlai (Contributor) commented Jun 23, 2026

Thanks @SaketaChalamchala for the review. When requesting changes, the CI workflow should not be approved on the PR, since it will need to be re-run anyway after the review comments are addressed.

rhalm requested a review from SaketaChalamchala on June 23, 2026 12:04