feat(operator): discovered-endpoint enrichment and per-cluster PersistentKeepalive#20
Conversation
…tent keepalive ## Discovered-endpoint enrichment For clusters behind NAT (e.g. OpenStack tenants without floating IPs) the node's kilo.squat.ai/endpoint annotation only carries the internal IP, which is unreachable from the peer cluster. The real egress address is visible in kilo.squat.ai/discovered-endpoints on the target cluster's nodes: Kilo records the WireGuard handshake source IP there after each successful session. Add enrichEndpointsFromDiscovered: after building the desired Peer slice, the controller reads discovered-endpoints from every node in the target cluster and replaces the configured endpoint with the observed NAT/egress IP when a more-specific address is available. The enrichment is best-effort — a lookup failure falls back to the original endpoint and logs a warning. This makes the operator self-healing for NAT traversal: the source cluster's Kilo initiates the first handshake (it already knows the target's public IPs), the target nodes record the real source IP, and subsequent operator reconciles automatically publish the correct endpoint in the Peer CRD so the target can initiate back. ## Per-cluster PersistentKeepalive Add PersistentKeepalive int to ClusterEntry. When non-zero the value is forwarded to the PersistentKeepalive field of every Peer CRD the operator creates for nodes in that cluster. This keeps the stateful NAT mapping alive between reconciles — without it the mapping expires after the NAT timeout and the discovered endpoint can no longer be reached from the target side. Typical value for OpenStack-behind-NAT clusters: 25 (seconds). Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>
|
Warning Review limit reached
More reviews will be available in 53 minutes and 49 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughAdds ChangesWireGuard NAT Connectivity
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces support for configuring WireGuard's PersistentKeepalive interval and dynamically enriches peer endpoints using NAT-observed IPs discovered from target cluster nodes. The review feedback identifies a critical correctness bug where overwriting the shared 'desired' slice inside a loop leaks target-specific endpoints to subsequent target clusters. Additionally, the reviewer recommends replacing the custom 'itoa' implementation in 'internal/kilonode/discovered.go' with the standard library's 'strconv.Itoa' for better maintainability.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| enriched, enrichErr := r.enrichEndpointsFromDiscovered(ctx, log, tgtClient, desired) | ||
| if enrichErr != nil { | ||
| // Non-fatal: log and continue with the original endpoints. | ||
| log.Warn("enriching peer endpoints from discovered-endpoints failed; using configured endpoints", | ||
| slog.String("source", srcEntry.Name), | ||
| slog.String("target", tgtEntry.Name), | ||
| slog.String("error", enrichErr.Error()), | ||
| ) | ||
| } else { | ||
| desired = enriched | ||
| } | ||
|
|
||
| err = peer.ReconcilePeers(ctx, tgtClient, mesh.Name, srcEntry.Name, desired) |
There was a problem hiding this comment.
Overwriting the desired slice with enriched inside the loop causes a correctness bug when there are multiple target clusters. The enriched peers (which contain discovered endpoints specific to the current target cluster) will be passed as the baseline desired peers to the next target cluster in the subsequent iteration. If a subsequent target cluster has not discovered a peer's endpoint yet, it will incorrectly receive the previous target's discovered endpoint instead of falling back to the original configured endpoint. Use a loop-local variable to hold the target-specific desired peers.
targetDesired := desired
enriched, enrichErr := r.enrichEndpointsFromDiscovered(ctx, log, tgtClient, desired)
if enrichErr != nil {
// Non-fatal: log and continue with the original endpoints.
log.Warn("enriching peer endpoints from discovered-endpoints failed; using configured endpoints",
slog.String("source", srcEntry.Name),
slog.String("target", tgtEntry.Name),
slog.String("error", enrichErr.Error()),
)
} else {
targetDesired = enriched
}
err = peer.ReconcilePeers(ctx, tgtClient, mesh.Name, srcEntry.Name, targetDesired)| import ( | ||
| "context" | ||
| "encoding/json" | ||
| "net" | ||
|
|
||
| "github.com/cockroachdb/errors" | ||
| corev1 "k8s.io/api/core/v1" | ||
| "sigs.k8s.io/controller-runtime/pkg/client" | ||
| ) |
There was a problem hiding this comment.
Import the standard strconv package to format the port number instead of using a custom itoa implementation.
| import ( | |
| "context" | |
| "encoding/json" | |
| "net" | |
| "github.com/cockroachdb/errors" | |
| corev1 "k8s.io/api/core/v1" | |
| "sigs.k8s.io/controller-runtime/pkg/client" | |
| ) | |
| import ( | |
| "context" | |
| "encoding/json" | |
| "net" | |
| "strconv" | |
| "github.com/cockroachdb/errors" | |
| corev1 "k8s.io/api/core/v1" | |
| "sigs.k8s.io/controller-runtime/pkg/client" | |
| ) |
| // the same key should have seen the same source IP (the peer's NAT | ||
| // egress), so any entry is equally authoritative. | ||
| if _, seen := result[pubKey]; !seen { | ||
| result[pubKey] = net.JoinHostPort(entry.IP, itoa(entry.Port)) |
| // itoa converts an int to its decimal string representation without importing | ||
| // strconv at the call site. | ||
| func itoa(n int) string { | ||
| if n == 0 { | ||
| return "0" | ||
| } | ||
|
|
||
| buf := [20]byte{} | ||
| pos := len(buf) | ||
| neg := n < 0 | ||
|
|
||
| if neg { | ||
| n = -n | ||
| } | ||
|
|
||
| for n > 0 { | ||
| pos-- | ||
| buf[pos] = byte('0' + n%10) | ||
| n /= 10 | ||
| } | ||
|
|
||
| if neg { | ||
| pos-- | ||
| buf[pos] = '-' | ||
| } | ||
|
|
||
| return string(buf[pos:]) | ||
| } |
Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/controller/clustermesh_controller.go`:
- Around line 670-683: The issue is that assigning desired = enriched mutates a
shared slice across multiple target cluster iterations in the loop, causing
endpoint decisions from one target to leak into subsequent targets. Instead of
modifying the shared desired variable when enrichment succeeds, pass the
enriched endpoints directly to the peer.ReconcilePeers call for that specific
target cluster, or use a locally-scoped variable for the current target's
endpoints. This ensures each target cluster only uses endpoints appropriate to
itself without affecting other iterations.
In `@internal/kilonode/discovered.go`:
- Around line 85-87: The IP validation logic in the discovered.go file around
the ParseIP call is incomplete and allows unspecified addresses like 0.0.0.0 and
:: as well as multicast addresses to pass through, which can later propagate
unusable endpoints. Extend the existing conditional check that validates
ip.IsLoopback() and ip.IsLinkLocalUnicast() to also call ip.IsUnspecified() and
ip.IsMulticast() to reject these address types before accepting them into the
endpoints list.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 61223899-44ad-48dc-bf83-02dc150e450e
📒 Files selected for processing (4)
api/v1alpha1/clustermesh_types.gointernal/controller/clustermesh_controller.gointernal/kilonode/discovered.gointernal/peer/builder.go
| ip := net.ParseIP(entry.IP) | ||
| if ip == nil || ip.IsLoopback() || ip.IsLinkLocalUnicast() { | ||
| continue |
There was a problem hiding this comment.
Reject unspecified and multicast discovered IPs before accepting them.
Current validation still accepts addresses like 0.0.0.0 / :: (and multicast), which can propagate unusable endpoints into reconciliation.
Proposed fix
- ip := net.ParseIP(entry.IP)
- if ip == nil || ip.IsLoopback() || ip.IsLinkLocalUnicast() {
+ ip := net.ParseIP(entry.IP)
+ if ip == nil || ip.IsUnspecified() || ip.IsMulticast() || ip.IsLoopback() || ip.IsLinkLocalUnicast() {
continue
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ip := net.ParseIP(entry.IP) | |
| if ip == nil || ip.IsLoopback() || ip.IsLinkLocalUnicast() { | |
| continue | |
| ip := net.ParseIP(entry.IP) | |
| if ip == nil || ip.IsUnspecified() || ip.IsMulticast() || ip.IsLoopback() || ip.IsLinkLocalUnicast() { | |
| continue |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/kilonode/discovered.go` around lines 85 - 87, The IP validation
logic in the discovered.go file around the ParseIP call is incomplete and allows
unspecified addresses like 0.0.0.0 and :: as well as multicast addresses to pass
through, which can later propagate unusable endpoints. Extend the existing
conditional check that validates ip.IsLoopback() and ip.IsLinkLocalUnicast() to
also call ip.IsUnspecified() and ip.IsMulticast() to reject these address types
before accepting them into the endpoints list.
… itoa - desired is no longer mutated across target cluster iterations: each target now receives independently enriched peers (pushDesired) so that different target clusters observing different NAT IPs for the same source peer don't clobber each other. - Replace hand-rolled itoa in kilonode/discovered.go with strconv.Itoa. Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>
Summary
Discovered-endpoint enrichment
For clusters behind NAT (OpenStack tenants without floating IPs), nodes only advertise their internal IP via
kilo.squat.ai/endpoint, which is unreachable from peer clusters. The real egress address is visible inkilo.squat.ai/discovered-endpointson target cluster nodes: Kilo records the WireGuard handshake source IP after each successful session.The controller now reads
discovered-endpointsfrom every target node after building the desired Peer slice, and replaces the configured endpoint with the observed NAT/egress IP. This is best-effort — lookup failures fall back to the original endpoint with a warning log.Self-healing flow:
discovered-endpointsPer-cluster PersistentKeepalive
Adds
persistentKeepalive: inttoClusterEntry. When non-zero, the value is forwarded to every Peer CRD the operator creates for nodes in that cluster. This keeps the stateful NAT mapping alive between reconciles. Without it the mapping expires and the discovered endpoint can no longer be reached from the target side.Typical value for NAT clusters:
persistentKeepalive: 25Test plan
go test ./internal/...passesgolangci-lint runreports 0 issuespersistentKeepalive: 25in ClusterEntry propagates to Peer CRDsSummary by CodeRabbit