Skip to content

feat: full stale-Peer cleanup + fold anchor CIDRs into node Peer#11

Merged
IvanHunters merged 10 commits into
cozystack:mainfrom
IvanHunters:feat/stale-peer-cleanup-and-anchor-fold-in
May 27, 2026
Merged

feat: full stale-Peer cleanup + fold anchor CIDRs into node Peer#11
IvanHunters merged 10 commits into
cozystack:mainfrom
IvanHunters:feat/stale-peer-cleanup-and-anchor-fold-in

Conversation

@IvanHunters

@IvanHunters IvanHunters commented May 27, 2026

Copy link
Copy Markdown
Collaborator

Summary

Five related fixes that together let `kilo-clustermesh-operator` handle a real-world hub-and-spoke topology end-to-end, validated live on dev3 with three switchcloud tenants peering with a shared external Ceph hub.

1. multicluster.Build: tolerate missing kubeconfig secrets

`Build` used to bail on the first cluster entry whose kubeconfig Secret could not be read. That crashed the operator at startup during tenant teardown — the upstream Helm release deletes the admin-kubeconfig Secret before the `ClusterMesh` finalizer reconcile runs, so the leftover `KubeconfigSecretRef` made `buildInitialRegistry` fail. The operator then crashloop'd, the finalizer never ran, and `ClusterMesh` objects wedged with `finalizers:[kilo-clustermesh.io/cleanup]` forever.

`Build` now logs a warning per problematic entry and skips it, returning the rest of the registry as usable. The reconciler already does best-effort sweeps via `r.Registry.Client(name)` with `if !ok { continue }`, so a partial registry preserves forward progress: finalizers clean peers in clusters they can still reach, the `ClusterMesh` releases, and the operator can come back up clean.

2. cleanupStaleSourceClusters: sweep peers when a cluster leaves spec

`ReconcilePeers` only sweeps orphans inside one `(mesh, source-cluster)` pair. Once a cluster entry is removed from `spec.Clusters`, no per-pair sweep ever runs for it again and its `Peer` objects in surviving clusters persist — with stale endpoints, conflicting AllowedIPs, and unreachable public keys that Kilo still picks up.

New `DeleteStaleSourceClusters` lists every `Peer` labelled with this mesh and deletes those whose `source-cluster` label is no longer in spec. Wired into `Reconcile` after `reconcileAllClusters`. Cross-CR coverage: the sweep also walks clusters present only in OTHER `ClusterMesh` resources' specs (via the merged registry), and clusters fully removed from THIS mesh's spec get every peer wiped (not just stale-source ones).

3. cleanupOrphanMeshPeers: sweep peers of deleted ClusterMesh CRs

`handleDeletion` is the primary cleanup path, but the finalizer can be skipped — force-delete, manual finalizer removal (to unblock a crashloop'd operator), legacy CR predating the finalizer. Without this sweep, the cluster accumulates ghost peers that no per-CR reconcile would ever notice — none of the per-CR cleanup paths look at peers labelled for CRs other than their own.

Every live reconcile of any `ClusterMesh` now also collects the set of living mesh names in the namespace and deletes any `Peer` whose `mesh` label points outside that set. Failures log per peer and continue.

4. peer.BuildPeer: fold anchor CIDRs into the first node Peer

WireGuard identifies peers exclusively by their public key and keeps only one peer entry per pubkey. The operator had been emitting two `Peer` objects per source-cluster anchor — one node Peer with `allowedIPs=[podCIDR, wgIP/32]` and one separate "anchor" Peer with `allowedIPs=[serviceCIDR, additionalCIDRs]` — both reusing the anchor node's pubkey. On the receiving cluster, `kilo` applied whichever `Peer` it processed last; the other entry's AllowedIPs were silently dropped from kernel WG state. The result was a racy half-outage: sometimes the receiving cluster could route the pod-CIDR but lost the service-CIDR, sometimes the inverse. We observed this live — mesh1 happened to land in the favourable order; mesh2's handshake completed (visible in tcpdump) but inner packets were dropped at ingress because their source IP was outside the surviving WG peer's AllowedIPs.

Stop emitting the separate anchor `Peer`. `BuildPeer` gains an `extraAllowedIPs []string` parameter; `buildDesiredPeers` passes `CollectAnchorCIDRs(entry)` to the first valid node and `nil` to the rest. One WG peer entry per pubkey with the full union of AllowedIPs.

5. Tests

Unit:

  • `internal/peer/reconciler_test.go`: `TestDeleteStaleSourceClusters_*` (removal / no-op / empty valid set / mesh isolation)
  • `internal/multicluster/registry_test.go`: `TestBuild_*` (missing-secret tolerance / local entry fast-path / nil logger)
  • `internal/peer/builder_test.go`: `TestBuildPeer_AnchorExtras_` replaces the legacy `TestBuildAnchorPeer_` set with fold-in semantics (with-serviceCIDR / with-additionalCIDRs / no-extras / no-endpoint-source error / ExternalIP fallback)

Integration (`test/integration/cleanup_test.go`):

  • `TestCleanupStaleSourceClusters_RemovesPeersOfRemovedCluster` — reproduces the production regression
  • `TestCleanupStaleSourceClusters_SweepsClustersOutsideSpec` — pins the cross-CR behaviour
  • `TestCleanupStaleSourceClusters_IgnoresPeersFromOtherMeshes` — guards isolation between meshes
  • `TestCleanupOrphanMeshPeers_DeletesGhostsAfterCRGone` — defense-in-depth sweep
  • `TestCleanupOrphanMeshPeers_LeavesPeersOfLivingMeshAlone` — regression guard

Live validation (dev3)

  • Three switchcloud tenants (mesh1 / mesh2 / mesh3) peering with one external Ceph cluster as hub
  • `mesh1 → ceph`, `mesh2 → ceph`, `mesh3 → ceph` — PING-OK across all three
  • `mesh1 → mesh2`, `mesh2 → mesh3`, `mesh3 → mesh1` — isolated by WG AllowedIPs as expected
  • Tenant rotation (add → remove cluster from spec) cleanly removes stale peers; CR teardown via finalizer cleans every cluster; force-deletion handled by orphan sweep

Test plan

  • `go build ./...`, `go vet ./...` pass locally
  • `go test ./internal/...` pass locally (envtest integration runs in CI)
  • `helm-unittest charts/kilo-clustermesh-operator` still 41/41 passing
  • CI green on `build`, `lint`, `test`, `integration`, `helm`

Summary by CodeRabbit

  • New Features

    • Automatic cleanup of orphaned peers referencing removed meshes.
    • Anchor/service CIDRs are folded into peers so each node has a single consolidated peer entry.
  • Bug Fixes

    • Better stale-peer reconciliation across cluster boundaries.
    • Cluster registry initialization now logs and skips invalid cluster entries (tolerates missing configs).
  • Tests

    • New unit and integration tests covering registry resilience, peer-building behavior, and orphan/stale-peer cleanup.

Review Change Stack

Walk every cluster in the merged Registry (built from all ClusterMesh
specs), not just this mesh's spec.Clusters. This closes the second half
of the stale-peer problem: peers a mesh pushed into a cluster that has
since been removed from its spec stay reachable via any sibling
ClusterMesh that still references the cluster, so the sweep can reach
them.

Adds a cross-CR integration case to test/integration/cleanup_test.go
that pins this contract: a peer planted in 'remote' for a mesh whose
spec contains only 'local' is removed after reconcile.

Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>
Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>
Build returned an error from the first cluster entry whose kubeconfig
Secret could not be read. During tenant teardown this crashed the
operator at startup: the upstream KubernetesSwitchcloud Helm release
deletes the tenant's admin-kubeconfig Secret before the ClusterMesh CR
that references it is finalized, leaving a stale secretRef that
buildInitialRegistry then dereferences. The operator crashlooped, the
finalizer never ran, the ClusterMesh CR stuck with finalizers forever,
and downstream teardown stalled.

Build now logs a warning per problematic entry and skips it, returning
the rest of the registry as usable. cluster.Cluster construction
failures are treated the same way. The reconciler already does
best-effort sweeps via r.Registry.Client(name) with 'if !ok { continue }',
so a partial registry preserves forward progress: finalizers clean
peers in the clusters they can still reach, the ClusterMesh CR
releases, and the operator can come back up clean on the next reconcile.

A nil logger is accepted to keep the signature ergonomic; it routes to
slog.DiscardHandler so test callers and other call sites do not need
to thread a logger through to use Build.

cmd/main.go threads the existing slogger through buildInitialRegistry
so missing-secret warnings show up in the operator log at WARN level.

Three unit tests cover the new behaviour:
- TestBuild_SkipsEntryWithMissingSecret: reproduces the crashloop bug
  with a mesh+ceph pair where only ceph's kubeconfig Secret exists.
- TestBuild_LocalEntryDoesNotNeedSecret: guards the Local: true
  fast-path, which must keep succeeding regardless of Secret state.
- TestBuild_NilLoggerIsAccepted: protects against nil-deref for
  call sites that have not yet wired a logger.

Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>
handleDeletion only runs when the finalizer reconcile gets to fire. In
production we have already seen the finalizer skipped:
  - the operator crashlooped (missing kubeconfig Secret blocked
    buildInitialRegistry), so the finalizer reconcile never ran, and
    the user manually removed the finalizer to unblock teardown.
  - force-delete via the API or controller-runtime client.

In all three cases the cleanup half of teardown is bypassed, leaving
Peer objects labeled with a mesh name whose ClusterMesh CR no longer
exists. The per-CR cleanupStaleSourceClusters pass added in the
previous commits cannot touch those: it only looks at peers labeled
with its OWN mesh name. The ghosts persist until a human notices.

Add cleanupOrphanMeshPeers: every reconcile of any ClusterMesh now
lists every other ClusterMesh in the namespace, collects the set of
living mesh names, walks every cluster in the registry, and deletes
any Peer whose kilo-clustermesh.io/mesh label points at a name not in
that set. Failures (list, delete) are logged per-peer and the pass
continues. The sweep is namespace-scoped: it only considers ClusterMesh
objects in the reconciling CR's namespace, so it cannot accidentally
clobber peers managed from another namespace by another operator.

Wired into Reconcile() after the per-CR cleanupStaleSourceClusters
pass so that every live reconcile contributes to the global sweep —
the cluster reconverges as soon as any mesh's reconcile fires.

Two integration cases in test/integration/cleanup_test.go:
  - DeletesGhostsAfterCRGone: plants a peer labeled with a ghost-mesh
    name; live mesh reconcile removes it.
  - LeavesPeersOfLivingMeshAlone: regression guard ensuring the sweep
    keeps peers whose mesh label still corresponds to an existing CR.

Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>
…Peer

Kilo (and WireGuard generally) identifies peers exclusively by their
public key. The receiving cluster's kilo daemon issues `wg setconf` to
load Peer CRs into kernel state, and a second peer entry with the same
pubkey REPLACES the first — AllowedIPs are not merged. The operator had
been emitting two Peer objects per source-cluster node anchor: one
"node" Peer with AllowedIPs=[podCIDR, wgIP/32] and one "anchor" Peer
with AllowedIPs=[serviceCIDR, additionalCIDRs], both reusing the anchor
node's pubkey.

On the receiving cluster (e.g. nuvolos-ceph in our live test), kilo
applied whichever Peer it processed last; the other entry's AllowedIPs
were silently dropped from the WG kernel state. The result was a racy
half-outage: sometimes pod-CIDR traffic worked but the serviceCIDR was
unrouted; sometimes the inverse. The mesh1 path happened to land in the
favourable order; mesh2 didn't, and its handshake completed (we
observed it via tcpdump) yet inner packets were dropped on ingress
because their source IP was outside the WG peer's AllowedIPs.

Stop emitting the separate anchor Peer. Fold the cluster-wide CIDRs
into the first valid node's Peer.AllowedIPs via a new extraAllowedIPs
parameter on BuildPeer, with CollectAnchorCIDRs exposing the shape that
was previously private. The buildDesiredPeers loop picks the first node
as the anchor carrier; other nodes get nil extras.

Old anchor Peer objects on remote clusters disappear naturally: their
NAME (<mesh>--<cluster>--anchor) is no longer in the desired set, so
ReconcilePeers' per-pair orphan sweep deletes them on the next
reconcile.

Knock-on changes:
- builder_test.go updates every BuildPeer call site for the new
  4-argument signature (existing tests pass nil extras).
- TestBuildAnchorPeer_* renamed to TestBuildPeer_AnchorExtras_* and
  rewritten to exercise the new fold-in semantics. The NoEndpointSource
  case is now a hard error from BuildPeer instead of a nil from
  BuildAnchorPeer, which is strictly stricter: misconfiguration
  surfaces at reconcile time rather than silently losing
  cluster-wide CIDR routing.

Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors peer generation by folding cluster-wide CIDRs directly into the first valid node's Peer instead of emitting a separate anchor Peer, which prevents WireGuard public key deduplication from clobbering AllowedIPs. It also enhances the operator's resilience by skipping failed cluster entries during registry builds rather than aborting, and introduces self-healing passes to clean up stale and orphan peers. The review feedback highlights critical improvements, including adding a defensive check for empty node.Spec.PodCIDRs to prevent runtime panics, and wrapping remote cluster operations with timeout contexts to avoid blocking the reconciliation loop.

Comment thread internal/peer/builder.go
Comment on lines +72 to +74
allowedIPs := make([]string, 0, 2+len(extraAllowedIPs))
allowedIPs = append(allowedIPs, node.Spec.PodCIDRs[0], netutil.HostRoute(hostIP))
allowedIPs = append(allowedIPs, extraAllowedIPs...)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Accessing node.Spec.PodCIDRs[0] directly without checking if node.Spec.PodCIDRs is empty can cause a runtime panic if a node is registered but does not have its PodCIDRs allocated yet. We should add a defensive check to ensure node.Spec.PodCIDRs has at least one element.

	if len(node.Spec.PodCIDRs) == 0 {
		return nil, errors.Newf("node %q has no PodCIDRs allocated", node.Name)
	}

	allowedIPs := make([]string, 0, 2+len(extraAllowedIPs))
	allowedIPs = append(allowedIPs, node.Spec.PodCIDRs[0], netutil.HostRoute(hostIP))
	allowedIPs = append(allowedIPs, extraAllowedIPs...)

Comment on lines 178 to 184
err := peer.DeleteStaleSourceClusters(ctx, tgtClient, mesh.Name, sources)
if err != nil {
log.Warn("cleaning stale source-cluster peers",
slog.String("target", tgtEntry.Name),
slog.String("target", name),
slog.String("error", err.Error()),
)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Reconciling stale source-cluster peers synchronously across all clusters in the registry can block the entire reconciliation loop if any remote cluster is offline or slow. To prevent a single unhealthy cluster from degrading the operator, we should use a context with a short timeout for the cleanup call.

		cleanupCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
		err := peer.DeleteStaleSourceClusters(cleanupCtx, tgtClient, mesh.Name, sources)
		cancel()
		if err != nil {
			log.Warn("cleaning stale source-cluster peers",
				slog.String("target", name),
				slog.String("error", err.Error()),
			)
		}

Comment on lines +226 to +273
for _, clusterName := range r.Registry.Clusters() {
tgtClient, ok := r.Registry.Client(clusterName)
if !ok {
continue
}

var peers kilov1alpha1.PeerList

err := tgtClient.List(ctx, &peers)
if err != nil {
log.Warn("listing peers for orphan sweep",
slog.String("target", clusterName),
slog.String("error", err.Error()),
)

continue
}

for i := range peers.Items {
peerObj := &peers.Items[i]

meshLabel := peerObj.Labels[peer.LabelMesh]
if meshLabel == "" {
continue
}

if _, alive := living[meshLabel]; alive {
continue
}

deleteErr := tgtClient.Delete(ctx, peerObj)
if deleteErr != nil && !apierrors.IsNotFound(deleteErr) {
log.Warn("deleting orphan peer",
slog.String("target", clusterName),
slog.String("peer", peerObj.Name),
slog.String("error", deleteErr.Error()),
)

continue
}

log.Info("deleted orphan peer whose ClusterMesh CR no longer exists",
slog.String("target", clusterName),
slog.String("peer", peerObj.Name),
slog.String("orphan-mesh", meshLabel),
)
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Listing and deleting orphan peers synchronously across all clusters in the registry can block the reconciliation loop if any remote cluster is offline or slow. We should wrap the operations for each cluster with a short timeout context. Make sure to call cancel() at the end of each iteration to avoid context leaks.

	for _, clusterName := range r.Registry.Clusters() {
		tgtClient, ok := r.Registry.Client(clusterName)
		if !ok {
			continue
		}

		clusterCtx, cancel := context.WithTimeout(ctx, 5*time.Second)

		var peers kilov1alpha1.PeerList

		err := tgtClient.List(clusterCtx, &peers)
		if err != nil {
			log.Warn("listing peers for orphan sweep",
				slog.String("target", clusterName),
				slog.String("error", err.Error()),
			)
			cancel()

			continue
		}

		for i := range peers.Items {
			peerObj := &peers.Items[i]

			meshLabel := peerObj.Labels[peer.LabelMesh]
			if meshLabel == "" {
				continue
			}

			if _, alive := living[meshLabel]; alive {
				continue
			}

			deleteErr := tgtClient.Delete(clusterCtx, peerObj)
			if deleteErr != nil && !apierrors.IsNotFound(deleteErr) {
				log.Warn("deleting orphan peer",
					slog.String("target", clusterName),
					slog.String("peer", peerObj.Name),
					slog.String("error", deleteErr.Error()),
				)

				continue
			}

			log.Info("deleted orphan peer whose ClusterMesh CR no longer exists",
				slog.String("target", clusterName),
				slog.String("peer", peerObj.Name),
				slog.String("orphan-mesh", meshLabel),
			)
		}
		cancel()
	}

@coderabbitai

coderabbitai Bot commented May 27, 2026

Copy link
Copy Markdown

Warning

Review limit reached

@IvanHunters, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 13 minutes and 50 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2a1dacb4-281b-4335-bb29-d822d7e5e721

📥 Commits

Reviewing files that changed from the base of the PR and between addc84b and be32cde.

📒 Files selected for processing (1)
  • internal/controller/clustermesh_controller.go
📝 Walkthrough

Walkthrough

This PR threads a slog.Logger into multicluster.Build, makes Build skip and warn on per-entry failures, changes peer construction to fold cluster-wide CIDRs into per-node extras, and adds controller sweeps to delete stale and orphan peers across registry clusters. Tests (unit and integration) cover these behaviors.

Changes

ClusterMesh peer lifecycle and registry robustness

Layer / File(s) Summary
Logger threading to registry Build
cmd/main.go, internal/multicluster/registry.go
run() passes initialized slogger into buildInitialRegistry; multicluster.Build signature now accepts *slog.Logger.
Registry partial build on per-entry failure
internal/multicluster/registry.go, internal/multicluster/registry_test.go
Build defaults nil logger to discard, logs structured warnings for per-entry config/cluster init failures, skips those entries, and returns a partial registry. Unit tests cover missing-secret skipping, local entry behavior, and nil-logger acceptance.
Peer builder anchor CIDR folding and extras
internal/peer/builder.go, internal/peer/builder_test.go
BuildPeer gains extraAllowedIPs []string; CollectAnchorCIDRs(entry) added; builder validates PodCIDRs and appends node routes then extras. Anchor CIDRs are folded into the first node's peer instead of emitting separate anchor peers; tests updated to anchor-extras model.
Controller stale and orphan peer cleanup
internal/controller/clustermesh_controller.go, test/integration/cleanup_test.go
cleanupStaleSourceClusters rewritten to sweep all registry clusters using spec-derived valid source names; new cleanupOrphanMeshPeers deletes peers labeled to non-existent ClusterMesh CRs (ignoring NotFound); Reconcile invokes orphan sweep after per-cluster reconcile. Integration tests added/adjusted for stale and orphan scenarios.

Sequence Diagram

sequenceDiagram
  participant Reconcile
  participant PerCluster
  participant StaleCleanup as cleanupStaleSourceClusters
  participant OrphanCleanup as cleanupOrphanMeshPeers
  participant Registry
  participant APIServer as Peer objects
  Reconcile->>PerCluster: reconcile each mesh.Spec.Clusters
  PerCluster->>APIServer: create/update peers
  Reconcile->>StaleCleanup: after per-cluster done
  StaleCleanup->>Registry: list all known clusters
  StaleCleanup->>APIServer: delete peers in removed clusters or with invalid source labels
  Reconcile->>OrphanCleanup: after stale cleanup
  OrphanCleanup->>APIServer: list all Peer objects with kilo-clustermesh.io/mesh label
  OrphanCleanup->>APIServer: for each peer, check if mesh label references live ClusterMesh CR
  OrphanCleanup->>APIServer: delete peers labeled to non-existent ClusterMesh CRs
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • Arsolitt

Poem

🐰 I folded CIDRs into a single peer so bright,
The registry whispers warnings in the night,
Skipped clusters slip away with grace,
Orphan ghosts are cleared from every place,
Hoppity hops — cleanup finished, systems light!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 26.92% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately captures the two main themes of the changeset: stale-Peer cleanup across cluster boundaries and the refactored anchor CIDR handling in peer construction.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

CI feedback on the first push:

1. funlen: cleanupOrphanMeshPeers was 65 lines (limit 60). Split into
   three: cleanupOrphanMeshPeers (orchestrator), collectLivingMeshes
   (CR-list helper), sweepOrphanPeersInCluster (per-cluster scan).
   Behaviour identical; just shorter functions.

2. TestCleanupStaleSourceClusters_IgnoresPeersFromOtherMeshes had been
   planting a peer labelled with a non-existent mesh and asserting the
   per-source sweep leaves it alone. That was true before this PR
   added the orphan-mesh sweep — which legitimately deletes peers
   whose mesh CR doesn't exist. The test now creates a sibling
   ClusterMesh in the same namespace so the foreign peer's mesh label
   references a LIVING CR. The property under test (per-source sweep
   doesn't touch peers owned by another mesh) is unchanged; orphan
   handling is covered by the dedicated TestCleanupOrphanMeshPeers_*
   cases.

3. TestCleanupOrphanMeshPeers_LeavesPeersOfLivingMeshAlone planted a
   peer in localClient without a t.Cleanup, so it survived the
   reconciling mesh's deletion and polluted subsequent integration
   tests whose reconciles legitimately swept it as an orphan. Added
   the missing Delete in t.Cleanup.

Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>
Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>
@IvanHunters IvanHunters marked this pull request as ready for review May 27, 2026 12:22

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/controller/clustermesh_controller.go`:
- Around line 208-246: collectLivingMeshes is listing ClusterMesh only in the
reconciler's namespace which causes peers owned by same-name meshes in other
namespaces to be treated as orphans; change the ownership check to be
namespace-qualified or use a cluster-wide live-mesh lookup. Concretely: modify
collectLivingMeshes to include namespace in the returned identity (e.g.,
map["<namespace>/<meshName>"]struct{} or accept a cluster-wide list by removing
client.InNamespace), update all callers (cleanupOrphanMeshPeers and
sweepOrphanPeersInCluster) to build and check owner keys using
mesh.Namespace+mesh.Name, and ensure the peer.Labels creation/inspection
(peer.Labels or wherever labels are generated) uses the namespace-qualified mesh
identity so deletes only remove peers whose namespace+name owner is truly absent
cluster-wide.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4b5af09e-1a1a-4edd-9704-c689981be8ca

📥 Commits

Reviewing files that changed from the base of the PR and between 9dd8d56 and a15c139.

📒 Files selected for processing (7)
  • cmd/main.go
  • internal/controller/clustermesh_controller.go
  • internal/multicluster/registry.go
  • internal/multicluster/registry_test.go
  • internal/peer/builder.go
  • internal/peer/builder_test.go
  • test/integration/cleanup_test.go

Comment thread internal/controller/clustermesh_controller.go
…r-wide CR list)

Three concrete findings from gemini-code-assist and one critical from
coderabbitai on PR cozystack#11.

1. peer.BuildPeer: guard against empty Spec.PodCIDRs

   Accessing PodCIDRs[0] would panic on a node that joined before the
   kube-controller-manager had allocated a pod CIDR. Surface this as a
   clean BuildPeer error so the reconciler's per-node validation path
   skips the node and continues, instead of crash-looping the operator.

2. cleanupStaleSourceClusters: per-target timeout

   Synchronous List/Delete calls across the merged registry could stall
   the whole reconcile loop if one remote cluster's apiserver is slow
   or unreachable. Wrap each target's sweep in a context.WithTimeout
   bounded by cleanupSweepTimeout (5s), so a single bad cluster degrades
   only its own per-tick progress.

3. cleanupOrphanMeshPeers: per-target timeout (same rationale).

4. cleanupOrphanMeshPeers: list ClusterMesh CLUSTER-WIDE

   Critical bug spotted by coderabbitai. collectLivingMeshes previously
   restricted the lookup to the reconciler's own namespace, but the
   operator's registry is built cluster-wide from every ClusterMesh it
   sees. A Peer labelled mesh=foo on a target cluster could legitimately
   belong to a foo CR in another namespace, and reconciling a CR in
   namespace A would treat that peer as orphaned and delete it. Drop
   the InNamespace filter and consider every ClusterMesh in the cluster
   as a living mesh. Namespace-qualified Peer ownership (a dedicated
   label) is a cleaner long-term fix and worth a follow-up, but the
   cluster-wide list is enough to stop the data-loss path described in
   the review.

Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
internal/controller/clustermesh_controller.go (2)

550-570: 💤 Low value

Consider sorting nodes for deterministic anchor assignment.

The first node in the slice receives anchor CIDRs, but node ordering depends on API server list response which can vary. If the order changes between reconciles, different nodes become the "anchor", causing peer updates and brief routing churn.

Sorting nodes by name before iteration would make anchor assignment deterministic:

🔧 Proposed fix to sort nodes
+import "sort"

 func buildDesiredPeers(meshName string, entry *v1alpha1.ClusterEntry, nodes []*corev1.Node) ([]*kilov1alpha1.Peer, error) {
 	peers := make([]*kilov1alpha1.Peer, 0, len(nodes))
 
+	// Sort nodes by name for deterministic anchor assignment. The first
+	// node carries the cluster-wide CIDRs, so a stable order avoids
+	// peer churn when the API server returns nodes in a different order.
+	sort.Slice(nodes, func(i, j int) bool {
+		return nodes[i].Name < nodes[j].Name
+	})
+
 	anchorExtras := peer.CollectAnchorCIDRs(entry)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/controller/clustermesh_controller.go` around lines 550 - 570,
buildDesiredPeers currently assigns anchor CIDRs to the first element of the
nodes slice, but node list order from the API server is non-deterministic; to
avoid flapping anchors sort the nodes slice by node.Name before the loop. Update
buildDesiredPeers to perform a stable sort (e.g., sort.Slice) on the incoming
nodes slice by each corev1.Node.Name, then keep the rest of the logic (use
peer.CollectAnchorCIDRs(entry) and call peer.BuildPeer(meshName, entry, node,
extras)) unchanged so anchorExtras are consistently applied to the same node
across reconciles.

185-198: 💤 Low value

Lint: missing blank line before cancel().

The wsl linter flags a missing blank line. Given the tight coupling between the sweepCtx creation and its immediate use/cleanup, consider either adding a blank line or using a helper to satisfy the linter without cluttering the loop.

🧹 Proposed fix to add blank line
 		sweepCtx, cancel := context.WithTimeout(ctx, cleanupSweepTimeout)
 
 		err := peer.DeleteStaleSourceClusters(sweepCtx, tgtClient, mesh.Name, sources)
+
 		cancel()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/controller/clustermesh_controller.go` around lines 185 - 198, The
linter flags a missing blank line before the cancel() call; update the loop
around the WithTimeout usage (symbols: sweepCtx, cancel, and
peer.DeleteStaleSourceClusters) to either insert a blank line between the "err
:= peer.DeleteStaleSourceClusters(...)" statement and the "cancel()" call, or
refactor into a small helper that uses defer cancel() immediately after
context.WithTimeout to satisfy the linter without changing behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@internal/controller/clustermesh_controller.go`:
- Around line 550-570: buildDesiredPeers currently assigns anchor CIDRs to the
first element of the nodes slice, but node list order from the API server is
non-deterministic; to avoid flapping anchors sort the nodes slice by node.Name
before the loop. Update buildDesiredPeers to perform a stable sort (e.g.,
sort.Slice) on the incoming nodes slice by each corev1.Node.Name, then keep the
rest of the logic (use peer.CollectAnchorCIDRs(entry) and call
peer.BuildPeer(meshName, entry, node, extras)) unchanged so anchorExtras are
consistently applied to the same node across reconciles.
- Around line 185-198: The linter flags a missing blank line before the cancel()
call; update the loop around the WithTimeout usage (symbols: sweepCtx, cancel,
and peer.DeleteStaleSourceClusters) to either insert a blank line between the
"err := peer.DeleteStaleSourceClusters(...)" statement and the "cancel()" call,
or refactor into a small helper that uses defer cancel() immediately after
context.WithTimeout to satisfy the linter without changing behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 52629c81-5449-405e-b0c4-18256975e427

📥 Commits

Reviewing files that changed from the base of the PR and between a15c139 and addc84b.

📒 Files selected for processing (2)
  • internal/controller/clustermesh_controller.go
  • internal/peer/builder.go

Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>
Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>
@IvanHunters IvanHunters merged commit 16a8f64 into cozystack:main May 27, 2026
9 checks passed
@IvanHunters IvanHunters deleted the feat/stale-peer-cleanup-and-anchor-fold-in branch May 27, 2026 15:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants