Skip to content

fix: recover device Realtime channel from half-open socket on reconnect#520

Open
edgarsskore wants to merge 2 commits into
mainfrom
fix/remote-device-reconnect-loop
Open

fix: recover device Realtime channel from half-open socket on reconnect#520
edgarsskore wants to merge 2 commits into
mainfrom
fix/remote-device-reconnect-loop

Conversation

@edgarsskore

@edgarsskore edgarsskore commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator

After idle / wifi-loss / sleep the device's Supabase Realtime socket can go half-open (conn.readyState stays OPEN but the peer is gone). recreateChannel() removed the old channel un-awaited and synchronously pushed a new one, so the channel registry never reached 0, realtime-js never tore the dead socket down, and every re-subscribe TIMED_OUT forever -- only a process restart recovered.

Fix (remote-channel.ts):

  • recreateChannel(): add a re-entrancy guard, await removeChannel(), and force a fresh WebSocket via realtime.disconnect() before re-subscribing.
  • checkConnectionHealth(): treat 'joining' as healthy so realtime-js's own rejoin backoff can converge instead of being torn down mid-join.

Also enrich the existing reconnect/timeout logs with a compact connState() line (socket state + readyState + channel state + attempt), turn on the previously commented-out CHANNEL_ERROR log, and add a CLOSED branch.

Add test/remote-channel-reconnect.test.ts -- a deterministic repro that fails on the old behavior (8x TIMED_OUT, dead socket reused) and passes with the fix.

Summary by CodeRabbit

  • Bug Fixes

    • Improved channel recreation with automatic retry handling and timeout protection to prevent stuck operations
    • Enhanced connection health monitoring to prevent overlapping recreation attempts
    • Better recovery from half-open connection states
  • Tests

    • Added regression tests for subscription recovery scenarios

After idle / wifi-loss / sleep the device's Supabase Realtime socket can go
half-open (conn.readyState stays OPEN but the peer is gone). recreateChannel()
removed the old channel un-awaited and synchronously pushed a new one, so the
channel registry never reached 0, realtime-js never tore the dead socket down,
and every re-subscribe TIMED_OUT forever -- only a process restart recovered.

Fix (remote-channel.ts):
- recreateChannel(): add a re-entrancy guard, await removeChannel(), and force a
  fresh WebSocket via realtime.disconnect() before re-subscribing.
- checkConnectionHealth(): treat 'joining' as healthy so realtime-js's own rejoin
  backoff can converge instead of being torn down mid-join.

Also enrich the existing reconnect/timeout logs with a compact connState() line
(socket state + readyState + channel state + attempt), turn on the previously
commented-out CHANNEL_ERROR log, and add a CLOSED branch.

Add test/remote-channel-reconnect.test.ts -- a deterministic repro that fails on
the old behavior (8x TIMED_OUT, dead socket reused) and passes with the fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Caution

Review failed

An error occurred during the review process. Please try again later.

📝 Walkthrough

Walkthrough

RemoteChannel gains a reconnectAttempt counter and isRecreatingChannel re-entrancy guard. recreateChannel() is rewritten as an async, timeout-capped operation. Subscription status handlers are expanded with connState() snapshots and targeted telemetry. A new regression test file validates half-open socket recovery using in-file realtime-js fakes.

Changes

RemoteChannel Reconnection Hardening

Layer / File(s) Summary
Reconnection state fields and connState() helper
src/remote-device/remote-channel.ts
Adds RECREATE_TIMEOUT_MS constant, reconnectAttempt counter, and isRecreatingChannel re-entrancy guard as the foundational state for the new reconnection logic.
Subscription status logging and failure telemetry
src/remote-device/remote-channel.ts
Updates SUBSCRIBED to reset reconnectAttempt and log recovery; expands CHANNEL_ERROR, TIMED_OUT, and CLOSED to include connState(), mark device offline, emit captureRemote telemetry, and reject the channel-creation promise. Updates registerDevice retry log with connState().
Health check gating and async recreateChannel()
src/remote-device/remote-channel.ts
Health check treats joined and joining as healthy; other states emit attempt-aware telemetry and trigger recreation. recreateChannel() becomes async and guarded by isRecreatingChannel, awaits old channel removal, best-effort disconnects the socket, awaits createChannel(), captures errors, and clears the guard in finally. Introduces withTimeout().
Fake realtime-js infrastructure for regression test
test/test-remote-channel-reconnect.js
Describes the half-open socket problem; imports RemoteChannel with telemetry disabled; implements FakeChannel, FakeRealtime, and FakeClient modeling deferred teardown, socket rebuild on disconnect(), and TIMED_OUT vs SUBSCRIBED callbacks.
Test harness utilities and regression scenarios
test/test-remote-channel-reconnect.js
Adds flush(), makeRemoteChannel(), goHalfOpen(), driveHealthChecks(), withQuietLogs(), and a test() helper. Runs three scenarios: disconnect-first recovery, health-check-driven half-open recovery, and joining skips recreation.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • wonderwhy-er/DesktopCommanderMCP#332: Modifies the same createChannel/reconnect logic in remote-channel.ts, including device offline marking and retry diagnostics that this PR directly extends.

Suggested labels

size:M

🐇 A socket half-open, the channel went dark,
So I hopped in to add a guard and a mark.
With reconnectAttempt I count every try,
connState() snapshots the who, what, and why.
Now fakes and health checks make recovery bright —
The channel comes back, and all tests go right! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: fixing recovery of device Realtime channels from half-open sockets on reconnect, which is the primary focus of the substantial changes to remote-channel.ts and the new regression test.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/remote-device-reconnect-loop

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
test/remote-channel-reconnect.test.ts (1)

245-248: 💤 Low value

Consider documenting the magic number for maxAttempts.

The test uses maxAttempts: 8 but doesn't explain why this value was chosen. Adding a brief comment would help future maintainers understand whether this represents a worst-case scenario, matches a production timeout budget, or is simply sufficient to demonstrate convergence.

📝 Suggested documentation
 async function goHalfOpenThenDrive(rc: any, client: FakeClient) {
   await goHalfOpen(rc, client);
-  return driveHealthChecks(rc, 8);
+  // 8 iterations is sufficient to observe the recreate→recover cycle;
+  // in practice, recovery happens on the first attempt after the fix.
+  return driveHealthChecks(rc, 8);
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/remote-channel-reconnect.test.ts` around lines 245 - 248, The
goHalfOpenThenDrive function passes a magic number 8 to driveHealthChecks as
maxAttempts without explaining the reasoning behind this value. Add a brief
comment above or inline with the driveHealthChecks(rc, 8) call that documents
why the value 8 was chosen—whether it represents a worst-case scenario, matches
a production timeout budget, or is simply sufficient to demonstrate convergence
in the test.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/remote-device/remote-channel.ts`:
- Around line 237-239: In the else if block that checks for status === 'CLOSED',
the promise is not being settled, causing the await on createChannel() to hang
indefinitely. After the console.warn statement in the CLOSED state handler, add
a reject() call with an appropriate error (similar to how CHANNEL_ERROR and
TIMED_OUT states are handled). Additionally, consider calling
setOnlineStatus(this.deviceId!, 'offline') before rejecting to mark the device
as offline, ensuring proper state management when the channel is terminated by
the server. This aligns the CLOSED handling with the existing error handling
patterns in the code.

---

Nitpick comments:
In `@test/remote-channel-reconnect.test.ts`:
- Around line 245-248: The goHalfOpenThenDrive function passes a magic number 8
to driveHealthChecks as maxAttempts without explaining the reasoning behind this
value. Add a brief comment above or inline with the driveHealthChecks(rc, 8)
call that documents why the value 8 was chosen—whether it represents a
worst-case scenario, matches a production timeout budget, or is simply
sufficient to demonstrate convergence in the test.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6d29a1be-6e32-4d56-8fe9-c73d4235b670

📥 Commits

Reviewing files that changed from the base of the PR and between be82290 and 305c9d2.

📒 Files selected for processing (2)
  • src/remote-device/remote-channel.ts
  • test/remote-channel-reconnect.test.ts

Comment thread src/remote-device/remote-channel.ts
…ED, suite-native test)

- recreateChannel(): wrap the awaited section in a 30s timeout (Promise.race,
  mirrors closeWithTimeout) so a never-settling await can't pin
  isRecreatingChannel=true and silently disable the 10s watchdog.
- createChannel() CLOSED branch: reject() (was un-settled -> could hang the
  recreate) and setOnlineStatus('offline') for parity with CHANNEL_ERROR/TIMED_OUT.
- Rewrite the repro test as test/test-remote-channel-reconnect.js (plain JS,
  imports compiled dist, runs under node) so run-all-tests.js discovers it and it
  runs in `npm test` -- no runner/package.json special-casing. Removes the tsx-only
  .ts version. Adds a 'joining is treated as healthy' case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant