Fix deadlocks in bgzf reading#704
Merged
Merged
Conversation
BgzfMtReader's destructor stored mStop and then called notify_all() on the produce/decompress/consume condition variables without holding their mutexes. mStop is part of the wait predicate for all three CVs, so mutating it outside the locks defeats the usual guarantee that a notification cannot be lost between a waiter's predicate check and its park. A reader or decompress worker that had evaluated its predicate as false but had not yet parked could miss the shutdown notification and block forever; the destructor then hung in join() and the whole process deadlocked while tearing down the BGZF input reader. Publish mStop while holding all three mutexes before notifying, so any waiter sitting in the predicate-check-to-park window is serialized against the stop. The teardown runs whenever a BgzfMtReader is destroyed: the evaluator constructs and discards several per run, and the main read pass tears one down at the end, so the hang could surface either at startup or at the very end of a run. It reproduced intermittently at low frequency -- on the order of one in a thousand runs -- with BGZF-compressed inputs, and was independent of output settings (it is purely on the input-decompression side). An isolated stress harness that loops construct/read/destroy confirms both halves: the stock teardown hangs within tens of thousands of cycles with the captured stack (decompress worker parked in condition_variable::wait, main thread blocked in ~BgzfMtReader -> join), while the fixed teardown runs cleanly across millions of cycles. This affects every release since the parallel BGZF reader was introduced.
The previous commit fixed the shutdown (mStop) lost wakeup, but the same pattern remained at the producer/consumer hand-off points: each stage published a slot's new state with an atomic store and then notified the next stage's condition variable without holding that CV's mutex. A waiter that had evaluated its predicate as false but not yet parked could miss the notification and block forever. Continued hammering surfaced a second deadlock with BGZF input, again at low frequency (on the order of one in a few thousand runs): the reader thread blocked forever in BgzfMtReader::read() on mConsumeCv while every downstream worker sat idle in the backpressure wait and the main thread hung joining the reader. The lost wakeup was the READY/DONE publication for the slot the consumer was waiting on. Publish each slot transition under the mutex of the CV whose predicate reads it, before notifying: - read(): FREE under mProduceMtx (readerLoop waits on mProduceCv) - readerLoop(): COMPRESSED under mDecompMtx (workers wait on mDecompCv) - decompWorker(): READY under mConsumeMtx (consumer waits on mConsumeCv) - markDone(): DONE under mConsumeMtx (terminal EOF notify to consumer) No thread holds more than one of these mutexes at a time, so there is no new lock-ordering hazard, and the locks sit outside the per-block decompress work so the added cost is negligible.
Member
|
thank you very much! @tfenne merged |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
While working on chelae I spent quite a bit of time benchmarking fastp, and ran into hangs in fastp often enough that it was problematic for me. They are infrequent, but painful since fastp just never returns, and has to then be monitored and killed if it sits at 0% CPU for long enough.
I finally got around to spending some time debugging it today (with Claude's assistance). We set up a harness to read small (30k reads) paired-end bgzipped fastqs, with several threads and compressed outputs. With the current version on
mainI hit a deadlock something like once in every thousand invocations. The first commit fixes the one that hit most frequently. Once that was fixed, my observed rate dropped to maybe 1/2500-5000. The second commit resolves all places with similar deadlock-capable setups in the bgzf read layer.After both commits, I've run >> 10,000 iterations of my test, with no deadlocks.