feat: full stale-Peer cleanup + fold anchor CIDRs into node Peer by IvanHunters · Pull Request #11 · cozystack/kilo-clustermesh-operator

IvanHunters · 2026-05-27T12:07:26Z

Summary

Five related fixes that together let `kilo-clustermesh-operator` handle a real-world hub-and-spoke topology end-to-end, validated live on dev3 with three switchcloud tenants peering with a shared external Ceph hub.

1. multicluster.Build: tolerate missing kubeconfig secrets

`Build` used to bail on the first cluster entry whose kubeconfig Secret could not be read. That crashed the operator at startup during tenant teardown — the upstream Helm release deletes the admin-kubeconfig Secret before the `ClusterMesh` finalizer reconcile runs, so the leftover `KubeconfigSecretRef` made `buildInitialRegistry` fail. The operator then crashloop'd, the finalizer never ran, and `ClusterMesh` objects wedged with `finalizers:[kilo-clustermesh.io/cleanup]` forever.

`Build` now logs a warning per problematic entry and skips it, returning the rest of the registry as usable. The reconciler already does best-effort sweeps via `r.Registry.Client(name)` with `if !ok { continue }`, so a partial registry preserves forward progress: finalizers clean peers in clusters they can still reach, the `ClusterMesh` releases, and the operator can come back up clean.

2. cleanupStaleSourceClusters: sweep peers when a cluster leaves spec

`ReconcilePeers` only sweeps orphans inside one `(mesh, source-cluster)` pair. Once a cluster entry is removed from `spec.Clusters`, no per-pair sweep ever runs for it again and its `Peer` objects in surviving clusters persist — with stale endpoints, conflicting AllowedIPs, and unreachable public keys that Kilo still picks up.

New `DeleteStaleSourceClusters` lists every `Peer` labelled with this mesh and deletes those whose `source-cluster` label is no longer in spec. Wired into `Reconcile` after `reconcileAllClusters`. Cross-CR coverage: the sweep also walks clusters present only in OTHER `ClusterMesh` resources' specs (via the merged registry), and clusters fully removed from THIS mesh's spec get every peer wiped (not just stale-source ones).

3. cleanupOrphanMeshPeers: sweep peers of deleted ClusterMesh CRs

`handleDeletion` is the primary cleanup path, but the finalizer can be skipped — force-delete, manual finalizer removal (to unblock a crashloop'd operator), legacy CR predating the finalizer. Without this sweep, the cluster accumulates ghost peers that no per-CR reconcile would ever notice — none of the per-CR cleanup paths look at peers labelled for CRs other than their own.

Every live reconcile of any `ClusterMesh` now also collects the set of living mesh names in the namespace and deletes any `Peer` whose `mesh` label points outside that set. Failures log per peer and continue.

4. peer.BuildPeer: fold anchor CIDRs into the first node Peer

WireGuard identifies peers exclusively by their public key and keeps only one peer entry per pubkey. The operator had been emitting two `Peer` objects per source-cluster anchor — one node Peer with `allowedIPs=[podCIDR, wgIP/32]` and one separate "anchor" Peer with `allowedIPs=[serviceCIDR, additionalCIDRs]` — both reusing the anchor node's pubkey. On the receiving cluster, `kilo` applied whichever `Peer` it processed last; the other entry's AllowedIPs were silently dropped from kernel WG state. The result was a racy half-outage: sometimes the receiving cluster could route the pod-CIDR but lost the service-CIDR, sometimes the inverse. We observed this live — mesh1 happened to land in the favourable order; mesh2's handshake completed (visible in tcpdump) but inner packets were dropped at ingress because their source IP was outside the surviving WG peer's AllowedIPs.

Stop emitting the separate anchor `Peer`. `BuildPeer` gains an `extraAllowedIPs []string` parameter; `buildDesiredPeers` passes `CollectAnchorCIDRs(entry)` to the first valid node and `nil` to the rest. One WG peer entry per pubkey with the full union of AllowedIPs.

5. Tests

Unit:

`internal/peer/reconciler_test.go`: `TestDeleteStaleSourceClusters_*` (removal / no-op / empty valid set / mesh isolation)
`internal/multicluster/registry_test.go`: `TestBuild_*` (missing-secret tolerance / local entry fast-path / nil logger)
`internal/peer/builder_test.go`: `TestBuildPeer_AnchorExtras_` replaces the legacy `TestBuildAnchorPeer_` set with fold-in semantics (with-serviceCIDR / with-additionalCIDRs / no-extras / no-endpoint-source error / ExternalIP fallback)

Integration (`test/integration/cleanup_test.go`):

`TestCleanupStaleSourceClusters_RemovesPeersOfRemovedCluster` — reproduces the production regression
`TestCleanupStaleSourceClusters_SweepsClustersOutsideSpec` — pins the cross-CR behaviour
`TestCleanupStaleSourceClusters_IgnoresPeersFromOtherMeshes` — guards isolation between meshes
`TestCleanupOrphanMeshPeers_DeletesGhostsAfterCRGone` — defense-in-depth sweep
`TestCleanupOrphanMeshPeers_LeavesPeersOfLivingMeshAlone` — regression guard

Live validation (dev3)

Three switchcloud tenants (mesh1 / mesh2 / mesh3) peering with one external Ceph cluster as hub
`mesh1 → ceph`, `mesh2 → ceph`, `mesh3 → ceph` — PING-OK across all three
`mesh1 → mesh2`, `mesh2 → mesh3`, `mesh3 → mesh1` — isolated by WG AllowedIPs as expected
Tenant rotation (add → remove cluster from spec) cleanly removes stale peers; CR teardown via finalizer cleans every cluster; force-deletion handled by orphan sweep

Test plan

`go build ./...`, `go vet ./...` pass locally
`go test ./internal/...` pass locally (envtest integration runs in CI)
`helm-unittest charts/kilo-clustermesh-operator` still 41/41 passing
CI green on `build`, `lint`, `test`, `integration`, `helm`

Summary by CodeRabbit

New Features
- Automatic cleanup of orphaned peers referencing removed meshes.
- Anchor/service CIDRs are folded into peers so each node has a single consolidated peer entry.
Bug Fixes
- Better stale-peer reconciliation across cluster boundaries.
- Cluster registry initialization now logs and skips invalid cluster entries (tolerates missing configs).
Tests
- New unit and integration tests covering registry resilience, peer-building behavior, and orphan/stale-peer cleanup.

Walk every cluster in the merged Registry (built from all ClusterMesh specs), not just this mesh's spec.Clusters. This closes the second half of the stale-peer problem: peers a mesh pushed into a cluster that has since been removed from its spec stay reachable via any sibling ClusterMesh that still references the cluster, so the sweep can reach them. Adds a cross-CR integration case to test/integration/cleanup_test.go that pins this contract: a peer planted in 'remote' for a mesh whose spec contains only 'local' is removed after reconcile. Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>

Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>

Build returned an error from the first cluster entry whose kubeconfig Secret could not be read. During tenant teardown this crashed the operator at startup: the upstream KubernetesSwitchcloud Helm release deletes the tenant's admin-kubeconfig Secret before the ClusterMesh CR that references it is finalized, leaving a stale secretRef that buildInitialRegistry then dereferences. The operator crashlooped, the finalizer never ran, the ClusterMesh CR stuck with finalizers forever, and downstream teardown stalled. Build now logs a warning per problematic entry and skips it, returning the rest of the registry as usable. cluster.Cluster construction failures are treated the same way. The reconciler already does best-effort sweeps via r.Registry.Client(name) with 'if !ok { continue }', so a partial registry preserves forward progress: finalizers clean peers in the clusters they can still reach, the ClusterMesh CR releases, and the operator can come back up clean on the next reconcile. A nil logger is accepted to keep the signature ergonomic; it routes to slog.DiscardHandler so test callers and other call sites do not need to thread a logger through to use Build. cmd/main.go threads the existing slogger through buildInitialRegistry so missing-secret warnings show up in the operator log at WARN level. Three unit tests cover the new behaviour: - TestBuild_SkipsEntryWithMissingSecret: reproduces the crashloop bug with a mesh+ceph pair where only ceph's kubeconfig Secret exists. - TestBuild_LocalEntryDoesNotNeedSecret: guards the Local: true fast-path, which must keep succeeding regardless of Secret state. - TestBuild_NilLoggerIsAccepted: protects against nil-deref for call sites that have not yet wired a logger. Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>

handleDeletion only runs when the finalizer reconcile gets to fire. In production we have already seen the finalizer skipped: - the operator crashlooped (missing kubeconfig Secret blocked buildInitialRegistry), so the finalizer reconcile never ran, and the user manually removed the finalizer to unblock teardown. - force-delete via the API or controller-runtime client. In all three cases the cleanup half of teardown is bypassed, leaving Peer objects labeled with a mesh name whose ClusterMesh CR no longer exists. The per-CR cleanupStaleSourceClusters pass added in the previous commits cannot touch those: it only looks at peers labeled with its OWN mesh name. The ghosts persist until a human notices. Add cleanupOrphanMeshPeers: every reconcile of any ClusterMesh now lists every other ClusterMesh in the namespace, collects the set of living mesh names, walks every cluster in the registry, and deletes any Peer whose kilo-clustermesh.io/mesh label points at a name not in that set. Failures (list, delete) are logged per-peer and the pass continues. The sweep is namespace-scoped: it only considers ClusterMesh objects in the reconciling CR's namespace, so it cannot accidentally clobber peers managed from another namespace by another operator. Wired into Reconcile() after the per-CR cleanupStaleSourceClusters pass so that every live reconcile contributes to the global sweep — the cluster reconverges as soon as any mesh's reconcile fires. Two integration cases in test/integration/cleanup_test.go: - DeletesGhostsAfterCRGone: plants a peer labeled with a ghost-mesh name; live mesh reconcile removes it. - LeavesPeersOfLivingMeshAlone: regression guard ensuring the sweep keeps peers whose mesh label still corresponds to an existing CR. Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>

…Peer Kilo (and WireGuard generally) identifies peers exclusively by their public key. The receiving cluster's kilo daemon issues `wg setconf` to load Peer CRs into kernel state, and a second peer entry with the same pubkey REPLACES the first — AllowedIPs are not merged. The operator had been emitting two Peer objects per source-cluster node anchor: one "node" Peer with AllowedIPs=[podCIDR, wgIP/32] and one "anchor" Peer with AllowedIPs=[serviceCIDR, additionalCIDRs], both reusing the anchor node's pubkey. On the receiving cluster (e.g. nuvolos-ceph in our live test), kilo applied whichever Peer it processed last; the other entry's AllowedIPs were silently dropped from the WG kernel state. The result was a racy half-outage: sometimes pod-CIDR traffic worked but the serviceCIDR was unrouted; sometimes the inverse. The mesh1 path happened to land in the favourable order; mesh2 didn't, and its handshake completed (we observed it via tcpdump) yet inner packets were dropped on ingress because their source IP was outside the WG peer's AllowedIPs. Stop emitting the separate anchor Peer. Fold the cluster-wide CIDRs into the first valid node's Peer.AllowedIPs via a new extraAllowedIPs parameter on BuildPeer, with CollectAnchorCIDRs exposing the shape that was previously private. The buildDesiredPeers loop picks the first node as the anchor carrier; other nodes get nil extras. Old anchor Peer objects on remote clusters disappear naturally: their NAME (<mesh>--<cluster>--anchor) is no longer in the desired set, so ReconcilePeers' per-pair orphan sweep deletes them on the next reconcile. Knock-on changes: - builder_test.go updates every BuildPeer call site for the new 4-argument signature (existing tests pass nil extras). - TestBuildAnchorPeer_* renamed to TestBuildPeer_AnchorExtras_* and rewritten to exercise the new fold-in semantics. The NoEndpointSource case is now a hard error from BuildPeer instead of a nil from BuildAnchorPeer, which is strictly stricter: misconfiguration surfaces at reconcile time rather than silently losing cluster-wide CIDR routing. Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>

gemini-code-assist

Code Review

This pull request refactors peer generation by folding cluster-wide CIDRs directly into the first valid node's Peer instead of emitting a separate anchor Peer, which prevents WireGuard public key deduplication from clobbering AllowedIPs. It also enhances the operator's resilience by skipping failed cluster entries during registry builds rather than aborting, and introduces self-healing passes to clean up stale and orphan peers. The review feedback highlights critical improvements, including adding a defensive check for empty node.Spec.PodCIDRs to prevent runtime panics, and wrapping remote cluster operations with timeout contexts to avoid blocking the reconciliation loop.

gemini-code-assist · 2026-05-27T12:08:38Z

+	allowedIPs := make([]string, 0, 2+len(extraAllowedIPs))
+	allowedIPs = append(allowedIPs, node.Spec.PodCIDRs[0], netutil.HostRoute(hostIP))
+	allowedIPs = append(allowedIPs, extraAllowedIPs...)


Accessing node.Spec.PodCIDRs[0] directly without checking if node.Spec.PodCIDRs is empty can cause a runtime panic if a node is registered but does not have its PodCIDRs allocated yet. We should add a defensive check to ensure node.Spec.PodCIDRs has at least one element.

if len(node.Spec.PodCIDRs) == 0 { return nil, errors.Newf("node %q has no PodCIDRs allocated", node.Name) } allowedIPs := make([]string, 0, 2+len(extraAllowedIPs)) allowedIPs = append(allowedIPs, node.Spec.PodCIDRs[0], netutil.HostRoute(hostIP)) allowedIPs = append(allowedIPs, extraAllowedIPs...)

gemini-code-assist · 2026-05-27T12:08:38Z

+		err := peer.DeleteStaleSourceClusters(ctx, tgtClient, mesh.Name, sources)
 		if err != nil {
 			log.Warn("cleaning stale source-cluster peers",
-				slog.String("target", tgtEntry.Name),
+				slog.String("target", name),
 				slog.String("error", err.Error()),
 			)
 		}


Reconciling stale source-cluster peers synchronously across all clusters in the registry can block the entire reconciliation loop if any remote cluster is offline or slow. To prevent a single unhealthy cluster from degrading the operator, we should use a context with a short timeout for the cleanup call.

cleanupCtx, cancel := context.WithTimeout(ctx, 5*time.Second) err := peer.DeleteStaleSourceClusters(cleanupCtx, tgtClient, mesh.Name, sources) cancel() if err != nil { log.Warn("cleaning stale source-cluster peers", slog.String("target", name), slog.String("error", err.Error()), ) }

gemini-code-assist · 2026-05-27T12:08:38Z

+	for _, clusterName := range r.Registry.Clusters() {
+		tgtClient, ok := r.Registry.Client(clusterName)
+		if !ok {
+			continue
+		}
+
+		var peers kilov1alpha1.PeerList
+
+		err := tgtClient.List(ctx, &peers)
+		if err != nil {
+			log.Warn("listing peers for orphan sweep",
+				slog.String("target", clusterName),
+				slog.String("error", err.Error()),
+			)
+
+			continue
+		}
+
+		for i := range peers.Items {
+			peerObj := &peers.Items[i]
+
+			meshLabel := peerObj.Labels[peer.LabelMesh]
+			if meshLabel == "" {
+				continue
+			}
+
+			if _, alive := living[meshLabel]; alive {
+				continue
+			}
+
+			deleteErr := tgtClient.Delete(ctx, peerObj)
+			if deleteErr != nil && !apierrors.IsNotFound(deleteErr) {
+				log.Warn("deleting orphan peer",
+					slog.String("target", clusterName),
+					slog.String("peer", peerObj.Name),
+					slog.String("error", deleteErr.Error()),
+				)
+
+				continue
+			}
+
+			log.Info("deleted orphan peer whose ClusterMesh CR no longer exists",
+				slog.String("target", clusterName),
+				slog.String("peer", peerObj.Name),
+				slog.String("orphan-mesh", meshLabel),
+			)
+		}
+	}


Listing and deleting orphan peers synchronously across all clusters in the registry can block the reconciliation loop if any remote cluster is offline or slow. We should wrap the operations for each cluster with a short timeout context. Make sure to call cancel() at the end of each iteration to avoid context leaks.

for _, clusterName := range r.Registry.Clusters() { tgtClient, ok := r.Registry.Client(clusterName) if !ok { continue } clusterCtx, cancel := context.WithTimeout(ctx, 5*time.Second) var peers kilov1alpha1.PeerList err := tgtClient.List(clusterCtx, &peers) if err != nil { log.Warn("listing peers for orphan sweep", slog.String("target", clusterName), slog.String("error", err.Error()), ) cancel() continue } for i := range peers.Items { peerObj := &peers.Items[i] meshLabel := peerObj.Labels[peer.LabelMesh] if meshLabel == "" { continue } if _, alive := living[meshLabel]; alive { continue } deleteErr := tgtClient.Delete(clusterCtx, peerObj) if deleteErr != nil && !apierrors.IsNotFound(deleteErr) { log.Warn("deleting orphan peer", slog.String("target", clusterName), slog.String("peer", peerObj.Name), slog.String("error", deleteErr.Error()), ) continue } log.Info("deleted orphan peer whose ClusterMesh CR no longer exists", slog.String("target", clusterName), slog.String("peer", peerObj.Name), slog.String("orphan-mesh", meshLabel), ) } cancel() }

coderabbitai · 2026-05-27T12:09:52Z

Warning

Review limit reached

@IvanHunters, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 13 minutes and 50 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2a1dacb4-281b-4335-bb29-d822d7e5e721

📥 Commits

Reviewing files that changed from the base of the PR and between addc84b and be32cde.

📒 Files selected for processing (1)

internal/controller/clustermesh_controller.go

📝 Walkthrough

Walkthrough

This PR threads a slog.Logger into multicluster.Build, makes Build skip and warn on per-entry failures, changes peer construction to fold cluster-wide CIDRs into per-node extras, and adds controller sweeps to delete stale and orphan peers across registry clusters. Tests (unit and integration) cover these behaviors.

Changes

ClusterMesh peer lifecycle and registry robustness

Layer / File(s)	Summary
Logger threading to registry Build `cmd/main.go`, `internal/multicluster/registry.go`	`run()` passes initialized `slogger` into `buildInitialRegistry`; `multicluster.Build` signature now accepts `*slog.Logger`.
Registry partial build on per-entry failure `internal/multicluster/registry.go`, `internal/multicluster/registry_test.go`	`Build` defaults nil logger to discard, logs structured warnings for per-entry config/cluster init failures, skips those entries, and returns a partial registry. Unit tests cover missing-secret skipping, local entry behavior, and nil-logger acceptance.
Peer builder anchor CIDR folding and extras `internal/peer/builder.go`, `internal/peer/builder_test.go`	`BuildPeer` gains `extraAllowedIPs []string`; `CollectAnchorCIDRs(entry)` added; builder validates PodCIDRs and appends node routes then extras. Anchor CIDRs are folded into the first node's peer instead of emitting separate anchor peers; tests updated to anchor-extras model.
Controller stale and orphan peer cleanup `internal/controller/clustermesh_controller.go`, `test/integration/cleanup_test.go`	`cleanupStaleSourceClusters` rewritten to sweep all registry clusters using spec-derived valid source names; new `cleanupOrphanMeshPeers` deletes peers labeled to non-existent ClusterMesh CRs (ignoring NotFound); Reconcile invokes orphan sweep after per-cluster reconcile. Integration tests added/adjusted for stale and orphan scenarios.

Sequence Diagram

sequenceDiagram
  participant Reconcile
  participant PerCluster
  participant StaleCleanup as cleanupStaleSourceClusters
  participant OrphanCleanup as cleanupOrphanMeshPeers
  participant Registry
  participant APIServer as Peer objects
  Reconcile->>PerCluster: reconcile each mesh.Spec.Clusters
  PerCluster->>APIServer: create/update peers
  Reconcile->>StaleCleanup: after per-cluster done
  StaleCleanup->>Registry: list all known clusters
  StaleCleanup->>APIServer: delete peers in removed clusters or with invalid source labels
  Reconcile->>OrphanCleanup: after stale cleanup
  OrphanCleanup->>APIServer: list all Peer objects with kilo-clustermesh.io/mesh label
  OrphanCleanup->>APIServer: for each peer, check if mesh label references live ClusterMesh CR
  OrphanCleanup->>APIServer: delete peers labeled to non-existent ClusterMesh CRs

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

cozystack/kilo-clustermesh-operator#6: Modifies multicluster.Build API and registry-construction wiring similar to this PR.
cozystack/kilo-clustermesh-operator#2: Touches peer-construction and anchor/BuildPeer behavior related to anchor CIDR handling.
cozystack/kilo-clustermesh-operator#10: Related controller cleanup changes for stale source-cluster peer deletion.

Suggested reviewers

Arsolitt

Poem

🐰 I folded CIDRs into a single peer so bright,
The registry whispers warnings in the night,
Skipped clusters slip away with grace,
Orphan ghosts are cleared from every place,
Hoppity hops — cleanup finished, systems light!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 26.92% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately captures the two main themes of the changeset: stale-Peer cleanup across cluster boundaries and the refactored anchor CIDR handling in peer construction.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

CI feedback on the first push: 1. funlen: cleanupOrphanMeshPeers was 65 lines (limit 60). Split into three: cleanupOrphanMeshPeers (orchestrator), collectLivingMeshes (CR-list helper), sweepOrphanPeersInCluster (per-cluster scan). Behaviour identical; just shorter functions. 2. TestCleanupStaleSourceClusters_IgnoresPeersFromOtherMeshes had been planting a peer labelled with a non-existent mesh and asserting the per-source sweep leaves it alone. That was true before this PR added the orphan-mesh sweep — which legitimately deletes peers whose mesh CR doesn't exist. The test now creates a sibling ClusterMesh in the same namespace so the foreign peer's mesh label references a LIVING CR. The property under test (per-source sweep doesn't touch peers owned by another mesh) is unchanged; orphan handling is covered by the dedicated TestCleanupOrphanMeshPeers_* cases. 3. TestCleanupOrphanMeshPeers_LeavesPeersOfLivingMeshAlone planted a peer in localClient without a t.Cleanup, so it survived the reconciling mesh's deletion and polluted subsequent integration tests whose reconciles legitimately swept it as an orphan. Added the missing Delete in t.Cleanup. Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>

Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/controller/clustermesh_controller.go`:
- Around line 208-246: collectLivingMeshes is listing ClusterMesh only in the
reconciler's namespace which causes peers owned by same-name meshes in other
namespaces to be treated as orphans; change the ownership check to be
namespace-qualified or use a cluster-wide live-mesh lookup. Concretely: modify
collectLivingMeshes to include namespace in the returned identity (e.g.,
map["<namespace>/<meshName>"]struct{} or accept a cluster-wide list by removing
client.InNamespace), update all callers (cleanupOrphanMeshPeers and
sweepOrphanPeersInCluster) to build and check owner keys using
mesh.Namespace+mesh.Name, and ensure the peer.Labels creation/inspection
(peer.Labels or wherever labels are generated) uses the namespace-qualified mesh
identity so deletes only remove peers whose namespace+name owner is truly absent
cluster-wide.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4b5af09e-1a1a-4edd-9704-c689981be8ca

📥 Commits

Reviewing files that changed from the base of the PR and between 9dd8d56 and a15c139.

📒 Files selected for processing (7)

cmd/main.go
internal/controller/clustermesh_controller.go
internal/multicluster/registry.go
internal/multicluster/registry_test.go
internal/peer/builder.go
internal/peer/builder_test.go
test/integration/cleanup_test.go

…r-wide CR list) Three concrete findings from gemini-code-assist and one critical from coderabbitai on PR cozystack#11. 1. peer.BuildPeer: guard against empty Spec.PodCIDRs Accessing PodCIDRs[0] would panic on a node that joined before the kube-controller-manager had allocated a pod CIDR. Surface this as a clean BuildPeer error so the reconciler's per-node validation path skips the node and continues, instead of crash-looping the operator. 2. cleanupStaleSourceClusters: per-target timeout Synchronous List/Delete calls across the merged registry could stall the whole reconcile loop if one remote cluster's apiserver is slow or unreachable. Wrap each target's sweep in a context.WithTimeout bounded by cleanupSweepTimeout (5s), so a single bad cluster degrades only its own per-tick progress. 3. cleanupOrphanMeshPeers: per-target timeout (same rationale). 4. cleanupOrphanMeshPeers: list ClusterMesh CLUSTER-WIDE Critical bug spotted by coderabbitai. collectLivingMeshes previously restricted the lookup to the reconciler's own namespace, but the operator's registry is built cluster-wide from every ClusterMesh it sees. A Peer labelled mesh=foo on a target cluster could legitimately belong to a foo CR in another namespace, and reconciling a CR in namespace A would treat that peer as orphaned and delete it. Drop the InNamespace filter and consider every ClusterMesh in the cluster as a living mesh. Namespace-qualified Peer ownership (a dedicated label) is a cleaner long-term fix and worth a follow-up, but the cluster-wide list is enough to stop the data-loss path described in the review. Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>

coderabbitai

🧹 Nitpick comments (2)

internal/controller/clustermesh_controller.go (2)

550-570: 💤 Low value

Consider sorting nodes for deterministic anchor assignment.

The first node in the slice receives anchor CIDRs, but node ordering depends on API server list response which can vary. If the order changes between reconciles, different nodes become the "anchor", causing peer updates and brief routing churn.

Sorting nodes by name before iteration would make anchor assignment deterministic:

🔧 Proposed fix to sort nodes

+import "sort"

 func buildDesiredPeers(meshName string, entry *v1alpha1.ClusterEntry, nodes []*corev1.Node) ([]*kilov1alpha1.Peer, error) {
 	peers := make([]*kilov1alpha1.Peer, 0, len(nodes))
 
+	// Sort nodes by name for deterministic anchor assignment. The first
+	// node carries the cluster-wide CIDRs, so a stable order avoids
+	// peer churn when the API server returns nodes in a different order.
+	sort.Slice(nodes, func(i, j int) bool {
+		return nodes[i].Name < nodes[j].Name
+	})
+
 	anchorExtras := peer.CollectAnchorCIDRs(entry)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/controller/clustermesh_controller.go` around lines 550 - 570,
buildDesiredPeers currently assigns anchor CIDRs to the first element of the
nodes slice, but node list order from the API server is non-deterministic; to
avoid flapping anchors sort the nodes slice by node.Name before the loop. Update
buildDesiredPeers to perform a stable sort (e.g., sort.Slice) on the incoming
nodes slice by each corev1.Node.Name, then keep the rest of the logic (use
peer.CollectAnchorCIDRs(entry) and call peer.BuildPeer(meshName, entry, node,
extras)) unchanged so anchorExtras are consistently applied to the same node
across reconciles.

185-198: 💤 Low value

Lint: missing blank line before cancel().

The wsl linter flags a missing blank line. Given the tight coupling between the sweepCtx creation and its immediate use/cleanup, consider either adding a blank line or using a helper to satisfy the linter without cluttering the loop.

🧹 Proposed fix to add blank line

 		sweepCtx, cancel := context.WithTimeout(ctx, cleanupSweepTimeout)
 
 		err := peer.DeleteStaleSourceClusters(sweepCtx, tgtClient, mesh.Name, sources)
+
 		cancel()

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/controller/clustermesh_controller.go` around lines 185 - 198, The
linter flags a missing blank line before the cancel() call; update the loop
around the WithTimeout usage (symbols: sweepCtx, cancel, and
peer.DeleteStaleSourceClusters) to either insert a blank line between the "err
:= peer.DeleteStaleSourceClusters(...)" statement and the "cancel()" call, or
refactor into a small helper that uses defer cancel() immediately after
context.WithTimeout to satisfy the linter without changing behavior.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@internal/controller/clustermesh_controller.go`:
- Around line 550-570: buildDesiredPeers currently assigns anchor CIDRs to the
first element of the nodes slice, but node list order from the API server is
non-deterministic; to avoid flapping anchors sort the nodes slice by node.Name
before the loop. Update buildDesiredPeers to perform a stable sort (e.g.,
sort.Slice) on the incoming nodes slice by each corev1.Node.Name, then keep the
rest of the logic (use peer.CollectAnchorCIDRs(entry) and call
peer.BuildPeer(meshName, entry, node, extras)) unchanged so anchorExtras are
consistently applied to the same node across reconciles.
- Around line 185-198: The linter flags a missing blank line before the cancel()
call; update the loop around the WithTimeout usage (symbols: sweepCtx, cancel,
and peer.DeleteStaleSourceClusters) to either insert a blank line between the
"err := peer.DeleteStaleSourceClusters(...)" statement and the "cancel()" call,
or refactor into a small helper that uses defer cancel() immediately after
context.WithTimeout to satisfy the linter without changing behavior.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 52629c81-5449-405e-b0c4-18256975e427

📥 Commits

Reviewing files that changed from the base of the PR and between a15c139 and addc84b.

📒 Files selected for processing (2)

internal/controller/clustermesh_controller.go
internal/peer/builder.go

Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>

IvanHunters added 5 commits May 27, 2026 11:48

wip(controller): also wipe peers in targets fully removed from spec

ef923e1

Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>

gemini-code-assist Bot reviewed May 27, 2026

View reviewed changes

IvanHunters added 2 commits May 27, 2026 15:13

fixup: satisfy CRD minItems=2 on spec.clusters in cross-CR sweep test

a15c139

Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>

IvanHunters marked this pull request as ready for review May 27, 2026 12:22

coderabbitai Bot reviewed May 27, 2026

View reviewed changes

Comment thread internal/controller/clustermesh_controller.go

coderabbitai Bot reviewed May 27, 2026

View reviewed changes

IvanHunters added 2 commits May 27, 2026 16:30

fixup: wsl_v5 lint — collapse blank lines around sweepCtx block

72243dc

Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>

fixup: wrap per-target sweeps in anon-funcs for defer cancel() pairing

be32cde

Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>

Arsolitt (Arsolitt) approved these changes May 27, 2026

View reviewed changes

IvanHunters merged commit 16a8f64 into cozystack:main May 27, 2026
9 checks passed

IvanHunters deleted the feat/stale-peer-cleanup-and-anchor-fold-in branch May 27, 2026 15:13

This was referenced Jun 19, 2026

feat(operator): discovered-endpoint enrichment and per-cluster PersistentKeepalive #20

Merged

feat(api)!: collapse typed CIDR fields into flat allowedNetworks #21

Merged

fix(peer): host-netns reachability and NAT bootstrap reliability #25

Merged

Uh oh!

Conversation

IvanHunters commented May 27, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

1. multicluster.Build: tolerate missing kubeconfig secrets

2. cleanupStaleSourceClusters: sweep peers when a cluster leaves spec

3. cleanupOrphanMeshPeers: sweep peers of deleted ClusterMesh CRs

4. peer.BuildPeer: fold anchor CIDRs into the first node Peer

5. Tests

Live validation (dev3)

Test plan

Summary by CodeRabbit

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

IvanHunters commented May 27, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 27, 2026 •

edited

Loading