Skip to content

feat(operator): discovered-endpoint enrichment and per-cluster PersistentKeepalive#20

Merged
Andrei Kvapil (kvaps) merged 3 commits into
cozystack:mainfrom
IvanHunters:feat/discovered-endpoint-and-keepalive
Jun 19, 2026
Merged

feat(operator): discovered-endpoint enrichment and per-cluster PersistentKeepalive#20
Andrei Kvapil (kvaps) merged 3 commits into
cozystack:mainfrom
IvanHunters:feat/discovered-endpoint-and-keepalive

Conversation

@IvanHunters

@IvanHunters IvanHunters commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator

Summary

Discovered-endpoint enrichment

For clusters behind NAT (OpenStack tenants without floating IPs), nodes only advertise their internal IP via kilo.squat.ai/endpoint, which is unreachable from peer clusters. The real egress address is visible in kilo.squat.ai/discovered-endpoints on target cluster nodes: Kilo records the WireGuard handshake source IP after each successful session.

The controller now reads discovered-endpoints from every target node after building the desired Peer slice, and replaces the configured endpoint with the observed NAT/egress IP. This is best-effort — lookup failures fall back to the original endpoint with a warning log.

Self-healing flow:

  1. Source cluster's Kilo initiates first handshake (it knows the target's public IPs)
  2. Target nodes record the real source IP in discovered-endpoints
  3. Next operator reconcile publishes the correct endpoint in the Peer CRD
  4. Target cluster can now initiate back → bidirectional traffic restored

Per-cluster PersistentKeepalive

Adds persistentKeepalive: int to ClusterEntry. When non-zero, the value is forwarded to every Peer CRD the operator creates for nodes in that cluster. This keeps the stateful NAT mapping alive between reconciles. Without it the mapping expires and the discovered endpoint can no longer be reached from the target side.

Typical value for NAT clusters: persistentKeepalive: 25

Test plan

  • go test ./internal/... passes
  • golangci-lint run reports 0 issues
  • Deploy to cluster with NAT tenant cluster → verify bidirectional WireGuard connectivity self-heals after first handshake from source side
  • Verify persistentKeepalive: 25 in ClusterEntry propagates to Peer CRDs

Summary by CodeRabbit

  • New Features
    • Added WireGuard persistent keepalive interval configuration (0-65535 seconds) to enhance peer connectivity for NAT-traversed connections
    • Enabled automatic discovery and usage of NAT-discovered peer endpoints from cluster nodes, improving mesh connectivity and reliability

…tent keepalive

## Discovered-endpoint enrichment

For clusters behind NAT (e.g. OpenStack tenants without floating IPs)
the node's kilo.squat.ai/endpoint annotation only carries the internal
IP, which is unreachable from the peer cluster. The real egress address
is visible in kilo.squat.ai/discovered-endpoints on the target cluster's
nodes: Kilo records the WireGuard handshake source IP there after each
successful session.

Add enrichEndpointsFromDiscovered: after building the desired Peer slice,
the controller reads discovered-endpoints from every node in the target
cluster and replaces the configured endpoint with the observed NAT/egress
IP when a more-specific address is available. The enrichment is
best-effort — a lookup failure falls back to the original endpoint and
logs a warning.

This makes the operator self-healing for NAT traversal: the source
cluster's Kilo initiates the first handshake (it already knows the
target's public IPs), the target nodes record the real source IP, and
subsequent operator reconciles automatically publish the correct endpoint
in the Peer CRD so the target can initiate back.

## Per-cluster PersistentKeepalive

Add PersistentKeepalive int to ClusterEntry. When non-zero the value
is forwarded to the PersistentKeepalive field of every Peer CRD the
operator creates for nodes in that cluster. This keeps the stateful NAT
mapping alive between reconciles — without it the mapping expires after
the NAT timeout and the discovered endpoint can no longer be reached
from the target side.

Typical value for OpenStack-behind-NAT clusters: 25 (seconds).

Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>
@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@IvanHunters, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 53 minutes and 49 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 721db270-007d-4a3e-b24b-b665fc5ed32a

📥 Commits

Reviewing files that changed from the base of the PR and between ddcaa37 and 34f4507.

📒 Files selected for processing (3)
  • config/crd/bases/kilo.squat.ai_clustermeshes.yaml
  • internal/controller/clustermesh_controller.go
  • internal/kilonode/discovered.go
📝 Walkthrough

Walkthrough

Adds PersistentKeepalive to ClusterEntry with kubebuilder validation and propagates it through BuildPeer into PeerSpec. Introduces internal/kilonode/discovered.go to read NAT-observed WireGuard endpoints from node annotations, and integrates endpoint enrichment into pushPeersToTargets in the controller, overriding peer endpoints best-effort when discovered addresses differ.

Changes

WireGuard NAT Connectivity

Layer / File(s) Summary
PersistentKeepalive field and builder wiring
api/v1alpha1/clustermesh_types.go, internal/peer/builder.go
ClusterEntry adds an optional PersistentKeepalive int field with kubebuilder min/max validation; BuildPeer sets PeerSpec.PersistentKeepalive from the entry.
kilonode: discovered endpoint lookup
internal/kilonode/discovered.go
New file defines AnnotationDiscoveredEndpoints, the discoveredEndpointEntry JSON struct, and DiscoveredEndpointsByKey which lists nodes, decodes the annotation, validates IP scope (rejects loopback and link-local), and returns a pubKey→host:port map using first-seen semantics. Includes a local itoa helper.
Controller endpoint enrichment
internal/controller/clustermesh_controller.go
pushPeersToTargets switches to index-based iteration and calls enrichEndpointsFromDiscovered per target cluster before peer reconciliation. Adds enrichEndpointsFromDiscovered (looks up discovered endpoints, deep-copies and overrides Peer.Spec.Endpoint when absent or differing) and parseDiscoveredEndpoint (parses host:port into *PeerEndpoint). handleDeletion loop is also refactored to index-based pointer iteration.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • cozystack/kilo-clustermesh-operator#11: Modifies BuildPeer signature and AllowedIPs construction in the same internal/peer/builder.go file where this PR adds PersistentKeepalive population.

Suggested reviewers

  • Arsolitt

Poem

🐇 Hop behind the NAT, no fear!
The keepalive ticks so peers stay near,
Discovered endpoints, fresh from the node,
Override the config, lighten the load.
Through tunnels and mesh the packets fly free—
A well-connected warren, happy as can be! 🌐

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately captures both main features: discovered-endpoint enrichment and per-cluster PersistentKeepalive support for NAT scenarios, matching the changeset additions across API types, controller logic, and peer building.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for configuring WireGuard's PersistentKeepalive interval and dynamically enriches peer endpoints using NAT-observed IPs discovered from target cluster nodes. The review feedback identifies a critical correctness bug where overwriting the shared 'desired' slice inside a loop leaks target-specific endpoints to subsequent target clusters. Additionally, the reviewer recommends replacing the custom 'itoa' implementation in 'internal/kilonode/discovered.go' with the standard library's 'strconv.Itoa' for better maintainability.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines 670 to 682
enriched, enrichErr := r.enrichEndpointsFromDiscovered(ctx, log, tgtClient, desired)
if enrichErr != nil {
// Non-fatal: log and continue with the original endpoints.
log.Warn("enriching peer endpoints from discovered-endpoints failed; using configured endpoints",
slog.String("source", srcEntry.Name),
slog.String("target", tgtEntry.Name),
slog.String("error", enrichErr.Error()),
)
} else {
desired = enriched
}

err = peer.ReconcilePeers(ctx, tgtClient, mesh.Name, srcEntry.Name, desired)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Overwriting the desired slice with enriched inside the loop causes a correctness bug when there are multiple target clusters. The enriched peers (which contain discovered endpoints specific to the current target cluster) will be passed as the baseline desired peers to the next target cluster in the subsequent iteration. If a subsequent target cluster has not discovered a peer's endpoint yet, it will incorrectly receive the previous target's discovered endpoint instead of falling back to the original configured endpoint. Use a loop-local variable to hold the target-specific desired peers.

		targetDesired := desired
		enriched, enrichErr := r.enrichEndpointsFromDiscovered(ctx, log, tgtClient, desired)
		if enrichErr != nil {
			// Non-fatal: log and continue with the original endpoints.
			log.Warn("enriching peer endpoints from discovered-endpoints failed; using configured endpoints",
				slog.String("source", srcEntry.Name),
				slog.String("target", tgtEntry.Name),
				slog.String("error", enrichErr.Error()),
			)
		} else {
			targetDesired = enriched
		}

		err = peer.ReconcilePeers(ctx, tgtClient, mesh.Name, srcEntry.Name, targetDesired)

Comment on lines +19 to +27
import (
"context"
"encoding/json"
"net"

"github.com/cockroachdb/errors"
corev1 "k8s.io/api/core/v1"
"sigs.k8s.io/controller-runtime/pkg/client"
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Import the standard strconv package to format the port number instead of using a custom itoa implementation.

Suggested change
import (
"context"
"encoding/json"
"net"
"github.com/cockroachdb/errors"
corev1 "k8s.io/api/core/v1"
"sigs.k8s.io/controller-runtime/pkg/client"
)
import (
"context"
"encoding/json"
"net"
"strconv"
"github.com/cockroachdb/errors"
corev1 "k8s.io/api/core/v1"
"sigs.k8s.io/controller-runtime/pkg/client"
)

Comment thread internal/kilonode/discovered.go Outdated
// the same key should have seen the same source IP (the peer's NAT
// egress), so any entry is equally authoritative.
if _, seen := result[pubKey]; !seen {
result[pubKey] = net.JoinHostPort(entry.IP, itoa(entry.Port))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Use strconv.Itoa from the standard library instead of the custom itoa function.

Suggested change
result[pubKey] = net.JoinHostPort(entry.IP, itoa(entry.Port))
result[pubKey] = net.JoinHostPort(entry.IP, strconv.Itoa(entry.Port))

Comment thread internal/kilonode/discovered.go Outdated
Comment on lines +102 to +129
// itoa converts an int to its decimal string representation without importing
// strconv at the call site.
func itoa(n int) string {
if n == 0 {
return "0"
}

buf := [20]byte{}
pos := len(buf)
neg := n < 0

if neg {
n = -n
}

for n > 0 {
pos--
buf[pos] = byte('0' + n%10)
n /= 10
}

if neg {
pos--
buf[pos] = '-'
}

return string(buf[pos:])
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Remove the custom itoa function as it is no longer needed.

Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/controller/clustermesh_controller.go`:
- Around line 670-683: The issue is that assigning desired = enriched mutates a
shared slice across multiple target cluster iterations in the loop, causing
endpoint decisions from one target to leak into subsequent targets. Instead of
modifying the shared desired variable when enrichment succeeds, pass the
enriched endpoints directly to the peer.ReconcilePeers call for that specific
target cluster, or use a locally-scoped variable for the current target's
endpoints. This ensures each target cluster only uses endpoints appropriate to
itself without affecting other iterations.

In `@internal/kilonode/discovered.go`:
- Around line 85-87: The IP validation logic in the discovered.go file around
the ParseIP call is incomplete and allows unspecified addresses like 0.0.0.0 and
:: as well as multicast addresses to pass through, which can later propagate
unusable endpoints. Extend the existing conditional check that validates
ip.IsLoopback() and ip.IsLinkLocalUnicast() to also call ip.IsUnspecified() and
ip.IsMulticast() to reject these address types before accepting them into the
endpoints list.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 61223899-44ad-48dc-bf83-02dc150e450e

📥 Commits

Reviewing files that changed from the base of the PR and between b6d45c2 and ddcaa37.

📒 Files selected for processing (4)
  • api/v1alpha1/clustermesh_types.go
  • internal/controller/clustermesh_controller.go
  • internal/kilonode/discovered.go
  • internal/peer/builder.go

Comment thread internal/controller/clustermesh_controller.go
Comment on lines +85 to +87
ip := net.ParseIP(entry.IP)
if ip == nil || ip.IsLoopback() || ip.IsLinkLocalUnicast() {
continue

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reject unspecified and multicast discovered IPs before accepting them.

Current validation still accepts addresses like 0.0.0.0 / :: (and multicast), which can propagate unusable endpoints into reconciliation.

Proposed fix
-			ip := net.ParseIP(entry.IP)
-			if ip == nil || ip.IsLoopback() || ip.IsLinkLocalUnicast() {
+			ip := net.ParseIP(entry.IP)
+			if ip == nil || ip.IsUnspecified() || ip.IsMulticast() || ip.IsLoopback() || ip.IsLinkLocalUnicast() {
 				continue
 			}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
ip := net.ParseIP(entry.IP)
if ip == nil || ip.IsLoopback() || ip.IsLinkLocalUnicast() {
continue
ip := net.ParseIP(entry.IP)
if ip == nil || ip.IsUnspecified() || ip.IsMulticast() || ip.IsLoopback() || ip.IsLinkLocalUnicast() {
continue
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/kilonode/discovered.go` around lines 85 - 87, The IP validation
logic in the discovered.go file around the ParseIP call is incomplete and allows
unspecified addresses like 0.0.0.0 and :: as well as multicast addresses to pass
through, which can later propagate unusable endpoints. Extend the existing
conditional check that validates ip.IsLoopback() and ip.IsLinkLocalUnicast() to
also call ip.IsUnspecified() and ip.IsMulticast() to reject these address types
before accepting them into the endpoints list.

… itoa

- desired is no longer mutated across target cluster iterations: each
  target now receives independently enriched peers (pushDesired) so that
  different target clusters observing different NAT IPs for the same source
  peer don't clobber each other.
- Replace hand-rolled itoa in kilonode/discovered.go with strconv.Itoa.

Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>
@kvaps Andrei Kvapil (kvaps) merged commit ebf6f55 into cozystack:main Jun 19, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants