fix: recover device Realtime channel from half-open socket on reconnect #520
edgarsskore wants to merge 2 commits into
After idle / wifi-loss / sleep the device's Supabase Realtime socket can go half-open (conn.readyState stays OPEN but the peer is gone). recreateChannel() removed the old channel un-awaited and synchronously pushed a new one, so the channel registry never reached 0, realtime-js never tore the dead socket down, and every re-subscribe TIMED_OUT forever -- only a process restart recovered.

Fix (remote-channel.ts):
- recreateChannel(): add a re-entrancy guard, await removeChannel(), and force a fresh WebSocket via realtime.disconnect() before re-subscribing.
- checkConnectionHealth(): treat 'joining' as healthy so realtime-js's own rejoin backoff can converge instead of being torn down mid-join.

Also enrich the existing reconnect/timeout logs with a compact connState() line (socket state + readyState + channel state + attempt), turn on the previously commented-out CHANNEL_ERROR log, and add a CLOSED branch.

Add test/remote-channel-reconnect.test.ts -- a deterministic repro that fails on the old behavior (8x TIMED_OUT, dead socket reused) and passes with the fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
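The recovery sequence described in this PR can be illustrated roughly as follows. This is a minimal sketch, not the code in remote-channel.ts: `ChannelLike`, `ClientLike`, `ChannelRecoverer`, and the `subscribe` helper are hypothetical stand-ins, and only `removeChannel()`, `realtime.disconnect()`, and the `isRecreatingChannel` flag are taken from the description.

```typescript
// Hypothetical minimal shapes; the real types come from the Supabase client.
type ChannelState = 'closed' | 'errored' | 'joined' | 'joining' | 'leaving';

interface ChannelLike {
  state: ChannelState;
}

interface ClientLike {
  removeChannel(channel: ChannelLike): Promise<unknown>;
  realtime: { disconnect(): void };
}

class ChannelRecoverer {
  private isRecreatingChannel = false;
  private client: ClientLike;
  private channel: ChannelLike;
  // Assumed helper: builds a new channel and resolves once it is subscribed.
  private subscribe: () => Promise<ChannelLike>;

  constructor(
    client: ClientLike,
    channel: ChannelLike,
    subscribe: () => Promise<ChannelLike>,
  ) {
    this.client = client;
    this.channel = channel;
    this.subscribe = subscribe;
  }

  // 'joining' counts as healthy so an in-flight rejoin is not torn down.
  isHealthy(): boolean {
    return this.channel.state === 'joined' || this.channel.state === 'joining';
  }

  async recreateChannel(): Promise<boolean> {
    if (this.isRecreatingChannel) return false; // re-entrancy guard
    this.isRecreatingChannel = true;
    try {
      await this.client.removeChannel(this.channel); // wait until the old channel is gone
      this.client.realtime.disconnect(); // drop the possibly half-open socket
      this.channel = await this.subscribe(); // join over a fresh connection
      return true;
    } finally {
      this.isRecreatingChannel = false;
    }
  }
}
```

The ordering is the point of the sketch: the old channel is removed and awaited first, the socket is dropped second, and only then is a new subscription attempted, so a second caller arriving mid-recreate is turned away by the guard.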
Changes: RemoteChannel Reconnection Hardening
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🧹 Nitpick comments (1)
test/remote-channel-reconnect.test.ts (1)
245-248: 💤 Low value. Consider documenting the magic number for maxAttempts.
The test uses `maxAttempts: 8` but doesn't explain why this value was chosen. Adding a brief comment would help future maintainers understand whether this represents a worst-case scenario, matches a production timeout budget, or is simply sufficient to demonstrate convergence.
📝 Suggested documentation
 async function goHalfOpenThenDrive(rc: any, client: FakeClient) {
   await goHalfOpen(rc, client);
-  return driveHealthChecks(rc, 8);
+  // 8 iterations is sufficient to observe the recreate→recover cycle;
+  // in practice, recovery happens on the first attempt after the fix.
+  return driveHealthChecks(rc, 8);
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/remote-channel-reconnect.test.ts` around lines 245 - 248, The goHalfOpenThenDrive function passes a magic number 8 to driveHealthChecks as maxAttempts without explaining the reasoning behind this value. Add a brief comment above or inline with the driveHealthChecks(rc, 8) call that documents why the value 8 was chosen, whether it represents a worst-case scenario, matches a production timeout budget, or is simply sufficient to demonstrate convergence in the test.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/remote-device/remote-channel.ts`:
- Around line 237-239: In the else if block that checks for status === 'CLOSED',
the promise is not being settled, causing the await on createChannel() to hang
indefinitely. After the console.warn statement in the CLOSED state handler, add
a reject() call with an appropriate error (similar to how CHANNEL_ERROR and
TIMED_OUT states are handled). Additionally, consider calling
setOnlineStatus(this.deviceId!, 'offline') before rejecting to mark the device
as offline, ensuring proper state management when the channel is terminated by
the server. This aligns the CLOSED handling with the existing error handling
patterns in the code.
---
Nitpick comments:
In `@test/remote-channel-reconnect.test.ts`:
- Around line 245-248: The goHalfOpenThenDrive function passes a magic number 8
to driveHealthChecks as maxAttempts without explaining the reasoning behind this
value. Add a brief comment above or inline with the driveHealthChecks(rc, 8)
call that documents why the value 8 was chosen—whether it represents a
worst-case scenario, matches a production timeout budget, or is simply
sufficient to demonstrate convergence in the test.
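The inline comment about the CLOSED branch comes down to making sure every terminal subscribe status settles the promise. A minimal sketch of that pattern, under stated assumptions: `waitForSubscribed`, `SubscribeFn`, and `markOffline` are hypothetical names (the last one standing in for the `setOnlineStatus(this.deviceId!, 'offline')` call named in the review), and the actual createChannel() may be structured differently.

```typescript
// Status strings are the ones named in the review; everything else is hypothetical.
type SubscribeStatus = 'SUBSCRIBED' | 'CHANNEL_ERROR' | 'TIMED_OUT' | 'CLOSED';

// Assumed shape of a channel's subscribe(): it calls back on each status change.
type SubscribeFn = (onStatus: (status: SubscribeStatus) => void) => void;

function waitForSubscribed(
  subscribe: SubscribeFn,
  markOffline: () => void, // stand-in for setOnlineStatus(deviceId, 'offline')
): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    subscribe((status) => {
      if (status === 'SUBSCRIBED') {
        resolve();
      } else {
        // CHANNEL_ERROR, TIMED_OUT and CLOSED all settle the promise,
        // so an awaiting caller can never hang on an unhandled status.
        markOffline();
        reject(new Error(`channel subscribe failed: ${status}`));
      }
    });
  });
}
```

Using a single `else` for every non-success status is one way to keep a newly added status from being left un-settled; separate per-status branches that each call `reject()` would work equally well.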
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6d29a1be-6e32-4d56-8fe9-c73d4235b670
📒 Files selected for processing (2)
- src/remote-device/remote-channel.ts
- test/remote-channel-reconnect.test.ts
…ED, suite-native test)
- recreateChannel(): wrap the awaited section in a 30s timeout (Promise.race,
mirrors closeWithTimeout) so a never-settling await can't pin
isRecreatingChannel=true and silently disable the 10s watchdog.
- createChannel() CLOSED branch: reject() (was un-settled -> could hang the
recreate) and setOnlineStatus('offline') for parity with CHANNEL_ERROR/TIMED_OUT.
- Rewrite the repro test as test/test-remote-channel-reconnect.js (plain JS,
imports compiled dist, runs under node) so run-all-tests.js discovers it and it
runs in `npm test` -- no runner/package.json special-casing. Removes the tsx-only
.ts version. Adds a 'joining is treated as healthy' case.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
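The timeout wrapper mentioned in the first bullet above could look something like this. It is a sketch under assumptions, not the committed code: `withTimeout` and `guardedRecreate` are hypothetical names, and only the Promise.race approach, the 30s budget, and the `isRecreatingChannel` flag come from the commit message.

```typescript
// Rejects if `work` has not settled within `ms`; the timer is cleared either way.
async function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

let isRecreatingChannel = false;

// Returns true if recreate() finished in time, false if it was skipped, failed, or timed out.
async function guardedRecreate(recreate: () => Promise<void>, ms = 30000): Promise<boolean> {
  if (isRecreatingChannel) return false;
  isRecreatingChannel = true;
  try {
    await withTimeout(recreate(), ms, 'recreateChannel');
    return true;
  } catch (err) {
    return false; // a real implementation would likely log err here
  } finally {
    // Released even when recreate() never settles, so a watchdog can retry later.
    isRecreatingChannel = false;
  }
}
```

Note that Promise.race does not cancel the losing promise. A `recreate()` that never settles keeps whatever it holds, and the wrapper only guarantees the flag is released.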