feat(operator): discovered-endpoint enrichment and per-cluster PersistentKeepalive by IvanHunters · Pull Request #20 · cozystack/kilo-clustermesh-operator

IvanHunters · 2026-06-19T09:41:34Z

Summary

Discovered-endpoint enrichment

For clusters behind NAT (OpenStack tenants without floating IPs), nodes only advertise their internal IP via kilo.squat.ai/endpoint, which is unreachable from peer clusters. The real egress address is visible in kilo.squat.ai/discovered-endpoints on target cluster nodes: Kilo records the WireGuard handshake source IP after each successful session.

The controller now reads discovered-endpoints from every target node after building the desired Peer slice, and replaces the configured endpoint with the observed NAT/egress IP. This is best-effort — lookup failures fall back to the original endpoint with a warning log.

Self-healing flow:

Source cluster's Kilo initiates first handshake (it knows the target's public IPs)
Target nodes record the real source IP in discovered-endpoints
Next operator reconcile publishes the correct endpoint in the Peer CRD
Target cluster can now initiate back → bidirectional traffic restored

Per-cluster PersistentKeepalive

Adds persistentKeepalive: int to ClusterEntry. When non-zero, the value is forwarded to every Peer CRD the operator creates for nodes in that cluster. This keeps the stateful NAT mapping alive between reconciles. Without it the mapping expires and the discovered endpoint can no longer be reached from the target side.

Typical value for NAT clusters: persistentKeepalive: 25

Test plan

go test ./internal/... passes
golangci-lint run reports 0 issues
Deploy to cluster with NAT tenant cluster → verify bidirectional WireGuard connectivity self-heals after first handshake from source side
Verify persistentKeepalive: 25 in ClusterEntry propagates to Peer CRDs

Summary by CodeRabbit

New Features
- Added WireGuard persistent keepalive interval configuration (0-65535 seconds) to enhance peer connectivity for NAT-traversed connections
- Enabled automatic discovery and usage of NAT-discovered peer endpoints from cluster nodes, improving mesh connectivity and reliability

…tent keepalive ## Discovered-endpoint enrichment For clusters behind NAT (e.g. OpenStack tenants without floating IPs) the node's kilo.squat.ai/endpoint annotation only carries the internal IP, which is unreachable from the peer cluster. The real egress address is visible in kilo.squat.ai/discovered-endpoints on the target cluster's nodes: Kilo records the WireGuard handshake source IP there after each successful session. Add enrichEndpointsFromDiscovered: after building the desired Peer slice, the controller reads discovered-endpoints from every node in the target cluster and replaces the configured endpoint with the observed NAT/egress IP when a more-specific address is available. The enrichment is best-effort — a lookup failure falls back to the original endpoint and logs a warning. This makes the operator self-healing for NAT traversal: the source cluster's Kilo initiates the first handshake (it already knows the target's public IPs), the target nodes record the real source IP, and subsequent operator reconciles automatically publish the correct endpoint in the Peer CRD so the target can initiate back. ## Per-cluster PersistentKeepalive Add PersistentKeepalive int to ClusterEntry. When non-zero the value is forwarded to the PersistentKeepalive field of every Peer CRD the operator creates for nodes in that cluster. This keeps the stateful NAT mapping alive between reconciles — without it the mapping expires after the NAT timeout and the discovered endpoint can no longer be reached from the target side. Typical value for OpenStack-behind-NAT clusters: 25 (seconds). Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>

coderabbitai · 2026-06-19T09:41:43Z

Warning

Review limit reached

@IvanHunters, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 53 minutes and 49 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 721db270-007d-4a3e-b24b-b665fc5ed32a

📥 Commits

Reviewing files that changed from the base of the PR and between ddcaa37 and 34f4507.

📒 Files selected for processing (3)

config/crd/bases/kilo.squat.ai_clustermeshes.yaml
internal/controller/clustermesh_controller.go
internal/kilonode/discovered.go

📝 Walkthrough

Walkthrough

Adds PersistentKeepalive to ClusterEntry with kubebuilder validation and propagates it through BuildPeer into PeerSpec. Introduces internal/kilonode/discovered.go to read NAT-observed WireGuard endpoints from node annotations, and integrates endpoint enrichment into pushPeersToTargets in the controller, overriding peer endpoints best-effort when discovered addresses differ.

Changes

WireGuard NAT Connectivity

Layer / File(s)	Summary
PersistentKeepalive field and builder wiring `api/v1alpha1/clustermesh_types.go`, `internal/peer/builder.go`	`ClusterEntry` adds an optional `PersistentKeepalive` int field with kubebuilder min/max validation; `BuildPeer` sets `PeerSpec.PersistentKeepalive` from the entry.
kilonode: discovered endpoint lookup `internal/kilonode/discovered.go`	New file defines `AnnotationDiscoveredEndpoints`, the `discoveredEndpointEntry` JSON struct, and `DiscoveredEndpointsByKey` which lists nodes, decodes the annotation, validates IP scope (rejects loopback and link-local), and returns a `pubKey→host:port` map using first-seen semantics. Includes a local `itoa` helper.
Controller endpoint enrichment `internal/controller/clustermesh_controller.go`	`pushPeersToTargets` switches to index-based iteration and calls `enrichEndpointsFromDiscovered` per target cluster before peer reconciliation. Adds `enrichEndpointsFromDiscovered` (looks up discovered endpoints, deep-copies and overrides `Peer.Spec.Endpoint` when absent or differing) and `parseDiscoveredEndpoint` (parses `host:port` into `*PeerEndpoint`). `handleDeletion` loop is also refactored to index-based pointer iteration.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

cozystack/kilo-clustermesh-operator#11: Modifies BuildPeer signature and AllowedIPs construction in the same internal/peer/builder.go file where this PR adds PersistentKeepalive population.

Suggested reviewers

Arsolitt

Poem

🐇 Hop behind the NAT, no fear!
The keepalive ticks so peers stay near,
Discovered endpoints, fresh from the node,
Override the config, lighten the load.
Through tunnels and mesh the packets fly free—
A well-connected warren, happy as can be! 🌐

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately captures both main features: discovered-endpoint enrichment and per-cluster PersistentKeepalive support for NAT scenarios, matching the changeset additions across API types, controller logic, and peer building.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request introduces support for configuring WireGuard's PersistentKeepalive interval and dynamically enriches peer endpoints using NAT-observed IPs discovered from target cluster nodes. The review feedback identifies a critical correctness bug where overwriting the shared 'desired' slice inside a loop leaks target-specific endpoints to subsequent target clusters. Additionally, the reviewer recommends replacing the custom 'itoa' implementation in 'internal/kilonode/discovered.go' with the standard library's 'strconv.Itoa' for better maintainability.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-19T09:43:05Z

+		enriched, enrichErr := r.enrichEndpointsFromDiscovered(ctx, log, tgtClient, desired)
+		if enrichErr != nil {
+			// Non-fatal: log and continue with the original endpoints.
+			log.Warn("enriching peer endpoints from discovered-endpoints failed; using configured endpoints",
+				slog.String("source", srcEntry.Name),
+				slog.String("target", tgtEntry.Name),
+				slog.String("error", enrichErr.Error()),
+			)
+		} else {
+			desired = enriched
+		}
+
 		err = peer.ReconcilePeers(ctx, tgtClient, mesh.Name, srcEntry.Name, desired)


Overwriting the desired slice with enriched inside the loop causes a correctness bug when there are multiple target clusters. The enriched peers (which contain discovered endpoints specific to the current target cluster) will be passed as the baseline desired peers to the next target cluster in the subsequent iteration. If a subsequent target cluster has not discovered a peer's endpoint yet, it will incorrectly receive the previous target's discovered endpoint instead of falling back to the original configured endpoint. Use a loop-local variable to hold the target-specific desired peers.

targetDesired := desired enriched, enrichErr := r.enrichEndpointsFromDiscovered(ctx, log, tgtClient, desired) if enrichErr != nil { // Non-fatal: log and continue with the original endpoints. log.Warn("enriching peer endpoints from discovered-endpoints failed; using configured endpoints", slog.String("source", srcEntry.Name), slog.String("target", tgtEntry.Name), slog.String("error", enrichErr.Error()), ) } else { targetDesired = enriched } err = peer.ReconcilePeers(ctx, tgtClient, mesh.Name, srcEntry.Name, targetDesired)

gemini-code-assist · 2026-06-19T09:43:05Z

+import (
+	"context"
+	"encoding/json"
+	"net"
+
+	"github.com/cockroachdb/errors"
+	corev1 "k8s.io/api/core/v1"
+	"sigs.k8s.io/controller-runtime/pkg/client"
+)


Import the standard strconv package to format the port number instead of using a custom itoa implementation.

Suggested change

import (

"context"

"encoding/json"

"net"

"github.com/cockroachdb/errors"

corev1 "k8s.io/api/core/v1"

"sigs.k8s.io/controller-runtime/pkg/client"

)

import (

"context"

"encoding/json"

"net"

"strconv"

"github.com/cockroachdb/errors"

corev1 "k8s.io/api/core/v1"

"sigs.k8s.io/controller-runtime/pkg/client"

)

gemini-code-assist · 2026-06-19T09:43:05Z

+			// the same key should have seen the same source IP (the peer's NAT
+			// egress), so any entry is equally authoritative.
+			if _, seen := result[pubKey]; !seen {
+				result[pubKey] = net.JoinHostPort(entry.IP, itoa(entry.Port))


Use strconv.Itoa from the standard library instead of the custom itoa function.

Suggested change

result[pubKey] = net.JoinHostPort(entry.IP, itoa(entry.Port))

result[pubKey] = net.JoinHostPort(entry.IP, strconv.Itoa(entry.Port))

gemini-code-assist · 2026-06-19T09:43:05Z

+// itoa converts an int to its decimal string representation without importing
+// strconv at the call site.
+func itoa(n int) string {
+	if n == 0 {
+		return "0"
+	}
+
+	buf := [20]byte{}
+	pos := len(buf)
+	neg := n < 0
+
+	if neg {
+		n = -n
+	}
+
+	for n > 0 {
+		pos--
+		buf[pos] = byte('0' + n%10)
+		n /= 10
+	}
+
+	if neg {
+		pos--
+		buf[pos] = '-'
+	}
+
+	return string(buf[pos:])
+}


Remove the custom itoa function as it is no longer needed.

Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/controller/clustermesh_controller.go`:
- Around line 670-683: The issue is that assigning desired = enriched mutates a
shared slice across multiple target cluster iterations in the loop, causing
endpoint decisions from one target to leak into subsequent targets. Instead of
modifying the shared desired variable when enrichment succeeds, pass the
enriched endpoints directly to the peer.ReconcilePeers call for that specific
target cluster, or use a locally-scoped variable for the current target's
endpoints. This ensures each target cluster only uses endpoints appropriate to
itself without affecting other iterations.

In `@internal/kilonode/discovered.go`:
- Around line 85-87: The IP validation logic in the discovered.go file around
the ParseIP call is incomplete and allows unspecified addresses like 0.0.0.0 and
:: as well as multicast addresses to pass through, which can later propagate
unusable endpoints. Extend the existing conditional check that validates
ip.IsLoopback() and ip.IsLinkLocalUnicast() to also call ip.IsUnspecified() and
ip.IsMulticast() to reject these address types before accepting them into the
endpoints list.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 61223899-44ad-48dc-bf83-02dc150e450e

📥 Commits

Reviewing files that changed from the base of the PR and between b6d45c2 and ddcaa37.

📒 Files selected for processing (4)

api/v1alpha1/clustermesh_types.go
internal/controller/clustermesh_controller.go
internal/kilonode/discovered.go
internal/peer/builder.go

coderabbitai · 2026-06-19T09:53:44Z

+			ip := net.ParseIP(entry.IP)
+			if ip == nil || ip.IsLoopback() || ip.IsLinkLocalUnicast() {
+				continue


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reject unspecified and multicast discovered IPs before accepting them.

Current validation still accepts addresses like 0.0.0.0 / :: (and multicast), which can propagate unusable endpoints into reconciliation.

Proposed fix

- ip := net.ParseIP(entry.IP) - if ip == nil || ip.IsLoopback() || ip.IsLinkLocalUnicast() { + ip := net.ParseIP(entry.IP) + if ip == nil || ip.IsUnspecified() || ip.IsMulticast() || ip.IsLoopback() || ip.IsLinkLocalUnicast() { continue }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

ip := net.ParseIP(entry.IP)

if ip == nil || ip.IsLoopback() || ip.IsLinkLocalUnicast() {

continue

ip := net.ParseIP(entry.IP)

if ip == nil || ip.IsUnspecified() || ip.IsMulticast() || ip.IsLoopback() || ip.IsLinkLocalUnicast() {

continue

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/kilonode/discovered.go` around lines 85 - 87, The IP validation logic in the discovered.go file around the ParseIP call is incomplete and allows unspecified addresses like 0.0.0.0 and :: as well as multicast addresses to pass through, which can later propagate unusable endpoints. Extend the existing conditional check that validates ip.IsLoopback() and ip.IsLinkLocalUnicast() to also call ip.IsUnspecified() and ip.IsMulticast() to reject these address types before accepting them into the endpoints list.

… itoa - desired is no longer mutated across target cluster iterations: each target now receives independently enriched peers (pushDesired) so that different target clusters observing different NAT IPs for the same source peer don't clobber each other. - Replace hand-rolled itoa in kilonode/discovered.go with strconv.Itoa. Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>

gemini-code-assist Bot reviewed Jun 19, 2026

View reviewed changes

IvanHunters mentioned this pull request Jun 19, 2026

feat(clustermesh): add persistentKeepalive support for NAT traversal aenix-org/kubernetes-switchcloud#151

Closed

IvanHunters marked this pull request as ready for review June 19, 2026 09:49

chore: regenerate CRD manifests for PersistentKeepalive field

24cd3c1

Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>

coderabbitai Bot reviewed Jun 19, 2026

View reviewed changes

Andrei Kvapil (kvaps) approved these changes Jun 19, 2026

View reviewed changes

Andrei Kvapil (kvaps) merged commit ebf6f55 into cozystack:main Jun 19, 2026
9 checks passed

coderabbitai Bot mentioned this pull request Jun 20, 2026

fix(peer): host-netns reachability and NAT bootstrap reliability #25

Merged

	result[pubKey] = net.JoinHostPort(entry.IP, itoa(entry.Port))
	result[pubKey] = net.JoinHostPort(entry.IP, strconv.Itoa(entry.Port))

Uh oh!

Conversation

IvanHunters commented Jun 19, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Discovered-endpoint enrichment

Per-cluster PersistentKeepalive

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

IvanHunters commented Jun 19, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 19, 2026 •

edited

Loading