HDDS-11063. TestSnapshotDiffManager#testThreadPoolIsFull is flaky without wait between batches#10581
HDDS-11063. TestSnapshotDiffManager#testThreadPoolIsFull is flaky without wait between batches#10581rhalm wants to merge 2 commits into
Conversation
…n there is no wait between the batches
| true, 45, 0), | ||
| // 10 running + 10 queued = 20 accepted, remaining 25 rejected | ||
| Arguments.of("When the pool does not drain between job batches", | ||
| false, 20, 25) |
There was a problem hiding this comment.
Thanks for the patch @rhalm.
nit: Use full_thread_pool_size = 2 * OZONE_OM_SNAPSHOT_DIFF_THREAD_POOL_SIZE and 45 - full_thread_pool_size for the expected accepted and rejected jobs here.
Otherwise, LGTM.
There was a problem hiding this comment.
Thanks for the review @SaketaChalamchala, done.
| totalSubmitted++; | ||
| } | ||
| if (drainBetweenBatches) { | ||
| final int expected = totalSubmitted; |
There was a problem hiding this comment.
Not in scope for the current PR to fix flakiness but may be considered as a follow up test improvement:
In the drainBetweenBatches scenario would a better check be to
- Keep the latch closed until
full_thread_pool_sizejobs are submitted - Open the latch
- Submit new jobs when
totalSubmitted - completedJobs.get() < full_thread_pool_size
There was a problem hiding this comment.
Great idea, I opened HDDS-15648 as a follow-up.
|
Thanks @SaketaChalamchala for the review. When requesting changes, CI workflow should not be approved on the PR, since it will anyway need to be re-run after addressing review comments. |
What changes were proposed in this pull request?
Previously in HDDS-10604 there were changes made to
OzoneConfigurationinitialization that briefly increased the cost of constructing a new instance, which causedTestSnapshotDiffManager#testThreadPoolIsFullto fail in the no-wait-between-batches scenario. That change was reverted later, but it exposed two issues that are addressed by this PR:testThreadPoolIsFullrelied on timing (usingThread.sleep) to verify pool rejection behavior under load, making the test fragile to any potential latency in the snapshot diff submission path (such as ozone config initialization). The test was rewritten to use aCountDownLatch, making the IN_PROGRESS/REJECTED split deterministic.getSnapshotRootPathconstructed anew OzoneConfiguration()on every call just to build a path. It now usesozoneManager.getConfiguration()instead.What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-11063
How was this patch tested?
OzoneConfigurationconstructor: the old test fails on master, the fixed test consistently passes with the same setup.