smp: fix proxy reconnection to relay after restart #1806

Open
shumvgolove wants to merge 3 commits into master from sh/fix-proxy-reconnect

Conversation

@shumvgolove

Collaborator

Problem

An SMP proxy permanently stops reconnecting to a destination relay after the relay restarts. The logs show repeated PCEResponseTimeout for that relay, and only restarting the proxy server recovers it.

Cause

A PRXY request makes the proxy open a connection to the relay in a worker forked from the sender's client. The worker inserts an empty session var into smpClients and then blocks in the connection/handshake. If the sender disconnects while that connect is in flight, the worker is killed by an async exception before the session var is ever filled.

Nothing removes an empty session var, so every later request to that relay waits on it until the connection timeout and fails with PROXY (BROKER TIMEOUT). This repeats indefinitely, even once the relay is healthy again.
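
The failure mode can be reproduced in isolation with a minimal sketch. It is not the library's code: Server, SessVar, Clients, getVar, connectUnsafe, request and demoStuck are hypothetical stand-ins, the "handshake" is just a threadDelay, and it assumes the stm and containers packages that ship with GHC.

```haskell
import Control.Concurrent (forkIO, killThread, threadDelay)
import Control.Concurrent.STM
import qualified Data.Map.Strict as M
import System.Timeout (timeout)

-- Hypothetical, simplified stand-ins; not the library's real types.
type Server = String
type SessVar = TMVar (Either String String)
type Clients = TVar (M.Map Server SessVar)

-- Left: a new empty var was inserted and the caller must connect.
-- Right: another thread already owns the connection attempt.
getVar :: Clients -> Server -> STM (Either SessVar SessVar)
getVar clients srv = do
  m <- readTVar clients
  case M.lookup srv m of
    Just v -> pure (Right v)
    Nothing -> do
      v <- newEmptyTMVar
      writeTVar clients (M.insert srv v m)
      pure (Left v)

-- Unprotected connect: if the thread is killed during the delay,
-- putTMVar never runs and the empty var stays in the map.
connectUnsafe :: Clients -> Server -> IO ()
connectUnsafe clients srv = do
  r <- atomically (getVar clients srv)
  case r of
    Right _ -> pure ()
    Left v -> do
      threadDelay 1000000 -- stands in for a slow connect/handshake
      atomically (putTMVar v (Right "connected"))

-- A later request waits on the var found in the map, for up to 0.2 s.
request :: Clients -> Server -> IO (Maybe (Either String String))
request clients srv = do
  m <- readTVarIO clients
  case M.lookup srv m of
    Nothing -> pure Nothing -- no pending session; not exercised here
    Just v -> timeout 200000 (atomically (readTMVar v))

demoStuck :: IO (Bool, Maybe (Either String String))
demoStuck = do
  clients <- newTVarIO M.empty
  tid <- forkIO (connectUnsafe clients "relay")
  threadDelay 100000 -- let the worker insert its empty var
  killThread tid -- sender disconnects: the worker dies mid-connect
  res <- request clients "relay"
  stillThere <- M.member "relay" <$> readTVarIO clients
  pure (stillThere, res)
```

Running demoStuck in GHCi should show that the var is still in the map and that the waiter timed out. Every further call to request would time out the same way, because nothing ever fills or removes the var.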

A test reproduces the proxy failing to reconnect to a destination relay when
the sender disconnects mid-connection, which leaves an empty session var in
smpClients.

getSessVar inserts an empty session var that the connect path fills with
putTMVar. If the connecting thread was killed by an async exception before
that (e.g. a proxy worker on client disconnect, or an agent worker on
cancel), the empty var stayed in the map forever, and every later request
for that server blocked on it until timing out (permanent PCEResponseTimeout).

Add clearSessVarOnInterrupt and run it via onException at the SMP proxy
(newSMPClient), agent (newProtocolClient, newProxiedRelay) and ntf push
worker (getOrCreatePushWorker) connect sites: on interrupt before fill,
release waiters with an error and drop the var so the next request reconnects.

UtilTests: tryAllErrors rethrows ThreadKilled/StackOverflow (the mechanism
that skips putTMVar).

SMPProxyTests: agent client reconnection after a cancelled connect, plus a
control proving the stalling relay alone does not cause the failure; refine
the relay reconnection tests.
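
The clean-up on interrupt can be sketched in the same simplified setting. This is an illustration under assumptions, not the PR's implementation: clearOnInterrupt, connectSafe, getVar, demoRecovered and the types are hypothetical names, the real clearSessVarOnInterrupt signature is not shown here, and the use of mask to close the gap between inserting the var and installing the handler is this sketch's own choice.

```haskell
import Control.Concurrent (forkIO, killThread, threadDelay)
import Control.Concurrent.STM
import Control.Exception (mask, onException)
import Control.Monad (when)
import qualified Data.Map.Strict as M
import System.Timeout (timeout)

-- Hypothetical, simplified stand-ins; not the library's real types.
type Server = String
type SessVar = TMVar (Either String String)
type Clients = TVar (M.Map Server SessVar)

-- Left: a new empty var was inserted and the caller must connect.
-- Right: another thread already owns the connection attempt.
getVar :: Clients -> Server -> STM (Either SessVar SessVar)
getVar clients srv = do
  m <- readTVar clients
  case M.lookup srv m of
    Just v -> pure (Right v)
    Nothing -> do
      v <- newEmptyTMVar
      writeTVar clients (M.insert srv v m)
      pure (Left v)

-- If the var is still empty, fill it with an error to release waiters
-- and drop it from the map, but only if the map still holds this var.
clearOnInterrupt :: Clients -> Server -> SessVar -> IO ()
clearOnInterrupt clients srv v = atomically $ do
  wasEmpty <- tryPutTMVar v (Left "interrupted")
  when wasEmpty $ modifyTVar' clients (M.update keepOther srv)
  where
    keepOther v' = if v' == v then Nothing else Just v'

-- Connect with clean-up. mask (an assumption of this sketch) keeps an
-- async exception from landing between getVar and onException.
connectSafe :: Clients -> Server -> IO ()
connectSafe clients srv = mask $ \restore -> do
  r <- atomically (getVar clients srv)
  case r of
    Right _ -> pure ()
    Left v -> restore (connect v) `onException` clearOnInterrupt clients srv v
  where
    connect v = do
      threadDelay 1000000 -- stands in for a slow connect/handshake
      atomically (putTMVar v (Right "connected"))

demoRecovered :: IO (Maybe (Either String String), Bool)
demoRecovered = do
  clients <- newTVarIO M.empty
  tid <- forkIO (connectSafe clients "relay")
  threadDelay 100000 -- let the worker insert its empty var
  pending <- M.lookup "relay" <$> readTVarIO clients
  killThread tid -- interrupt the connect before the var is filled
  waiter <- case pending of
    Just v -> timeout 200000 (atomically (readTMVar v))
    Nothing -> pure Nothing
  stillThere <- M.member "relay" <$> readTVarIO clients
  pure (waiter, stillThere)
```

In demoRecovered the waiter that already held the var should receive the error instead of timing out, and the map should no longer contain an entry for the relay, so the next getVar call inserts a fresh var and reconnects. Filling and removing happen in one STM transaction, so a waiter never observes the error while the stale var is still in the map.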