Skip to content

Fix deadlocks in bgzf reading#704

Merged
sfchen merged 2 commits into
OpenGene:masterfrom
tfenne:tf_bgzf_deadlock_fixes
Jun 21, 2026
Merged

Fix deadlocks in bgzf reading#704
sfchen merged 2 commits into
OpenGene:masterfrom
tfenne:tf_bgzf_deadlock_fixes

Conversation

@tfenne

@tfenne tfenne commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

While working on chelae I spent quite a bit of time benchmarking fastp, and ran into hangs in fastp often enough that it was problematic for me. They are infrequent, but painful since fastp just never returns, and has to then be monitored and killed if it sits at 0% CPU for long enough.

I finally got around to spending some time debugging it today (with Claude's assistance). We set up a harness to read small (30k reads) paired-end bgzipped fastqs, with several threads and compressed outputs. With the current version on main I hit a deadlock something like once in every thousand invocations. The first commit fixes the one that hit most frequently. Once that was fixed, my observed rate dropped to maybe 1/2500-5000. The second commit resolves all places with similar deadlock-capable setups in the bgzf read layer.

After both commits, I've run >> 10,000 iterations of my test, with no deadlocks.

tfenne added 2 commits June 16, 2026 12:02
BgzfMtReader's destructor stored mStop and then called notify_all() on the
produce/decompress/consume condition variables without holding their mutexes.
mStop is part of the wait predicate for all three CVs, so mutating it outside
the locks defeats the usual guarantee that a notification cannot be lost
between a waiter's predicate check and its park. A reader or decompress worker
that had evaluated its predicate as false but had not yet parked could miss the
shutdown notification and block forever; the destructor then hung in join() and
the whole process deadlocked while tearing down the BGZF input reader.

Publish mStop while holding all three mutexes before notifying, so any waiter
sitting in the predicate-check-to-park window is serialized against the stop.

The teardown runs whenever a BgzfMtReader is destroyed: the evaluator
constructs and discards several per run, and the main read pass tears one down
at the end, so the hang could surface either at startup or at the very end of a
run. It reproduced intermittently at low frequency -- on the order of one in a
thousand runs -- with BGZF-compressed inputs, and was independent of output
settings (it is purely on the input-decompression side). An isolated stress
harness that loops construct/read/destroy confirms both halves: the stock
teardown hangs within tens of thousands of cycles with the captured stack
(decompress worker parked in condition_variable::wait, main thread blocked in
~BgzfMtReader -> join), while the fixed teardown runs cleanly across millions of
cycles.

This affects every release since the parallel BGZF reader was introduced.
The previous commit fixed the shutdown (mStop) lost wakeup, but the same
pattern remained at the producer/consumer hand-off points: each stage
published a slot's new state with an atomic store and then notified the next
stage's condition variable without holding that CV's mutex. A waiter that had
evaluated its predicate as false but not yet parked could miss the
notification and block forever.

Continued hammering surfaced a second deadlock with BGZF input, again at low
frequency (on the order of one in a few thousand runs): the reader thread
blocked forever in BgzfMtReader::read() on mConsumeCv while every downstream
worker sat idle in the backpressure wait and the main thread hung joining the
reader. The lost wakeup was the READY/DONE publication for the slot the
consumer was waiting on.

Publish each slot transition under the mutex of the CV whose predicate reads
it, before notifying:
  - read():         FREE       under mProduceMtx  (readerLoop waits on mProduceCv)
  - readerLoop():   COMPRESSED under mDecompMtx   (workers   wait on mDecompCv)
  - decompWorker(): READY      under mConsumeMtx  (consumer  waits on mConsumeCv)
  - markDone():     DONE       under mConsumeMtx  (terminal EOF notify to consumer)

No thread holds more than one of these mutexes at a time, so there is no new
lock-ordering hazard, and the locks sit outside the per-block decompress work
so the added cost is negligible.
@KimYannn KimYannn mentioned this pull request Jun 20, 2026
@sfchen sfchen merged commit 9b8a0b9 into OpenGene:master Jun 21, 2026
2 checks passed
@sfchen

sfchen commented Jun 21, 2026

Copy link
Copy Markdown
Member

thank you very much! @tfenne

merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants