55 changes: 55 additions & 0 deletions test/resources/c2cc.resource
@@ -434,3 +434,58 @@ Curl Remote Service Via DNS
    ${stdout}=    Curl DNS From Cluster    ${source}
    ...    hello-microshift.${NAMESPACES}[${destination}].svc.${DOMAIN_MAP}[${destination}]    8080
    Should Contain    ${stdout}    Hello from

Reconnect Cluster
    [Documentation]    Re-establish SSH connection to a cluster after reboot.
    ...    Closes the old connection (if any) and re-registers via Register Remote Cluster.
    [Arguments]    ${alias}
    IF    '${alias}' == 'cluster-a'
        VAR    ${host}    ${USHIFT_HOST}
        VAR    ${port}    ${SSH_PORT}
    ELSE IF    '${alias}' == 'cluster-b'
        VAR    ${host}    ${HOST2_IP}
        VAR    ${port}    ${HOST2_SSH_PORT}
    ELSE IF    '${alias}' == 'cluster-c'
        VAR    ${host}    ${HOST3_IP}
        VAR    ${port}    ${HOST3_SSH_PORT}
    ELSE
        Fail    Unknown cluster alias: ${alias}
    END
    Run Keyword And Ignore Error    SSHLibrary.Switch Connection    ${alias}
    Run Keyword And Ignore Error    SSHLibrary.Close Connection
    ${kubeconfig}=    Get From Dictionary    ${C2CC_KUBECONFIGS}    ${alias}
    Remove Values From List    ${C2CC_REMOTE_ALIASES}    ${alias}
    Register Remote Cluster    ${alias}    ${host}    ${port}    ${kubeconfig}
Comment on lines +456 to +458

Contributor

🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Re-registration is not failure-safe for teardown.

Remove Values From List drops the alias from ${C2CC_REMOTE_ALIASES} before Register Remote Cluster re-adds it. If Register Remote Cluster errors (host still down), the alias is gone from the tracked list. Within Wait Until Keyword Succeeds retries this self-heals, but if all retries exhaust, Teardown All Remote Clusters will never switch to / close that connection, leaking it and leaving teardown state inconsistent.

Consider only mutating the tracking list after a successful re-registration, or guarding with TRY/FINALLY so the list is reconciled even on the failure path.

Based on learnings: teardown state (the alias/interface list consumed by teardown keywords) must be populated reliably even when the mutating keyword errors before completing.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/resources/c2cc.resource` around lines 456 - 458, The re-registration
flow in the remote cluster setup is mutating `${C2CC_REMOTE_ALIASES}` before
`Register Remote Cluster` succeeds, which can leave teardown state inconsistent
if that keyword fails. Update the logic around `Remove Values From List` and
`Register Remote Cluster` so the alias is only removed after a successful
re-registration, or use `TRY/FINALLY` to restore/reconcile
`${C2CC_REMOTE_ALIASES}` on failure. Keep the teardown-tracked alias list in
sync in the re-registration path used by `Wait Until Keyword Succeeds`.

Source: Learnings
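
The TRY/FINALLY option suggested above could be sketched roughly as follows. This is a hypothetical wrapper, not code from the PR: the keyword name `Register Remote Cluster Keeping Alias Tracked` is made up for illustration, and the sketch assumes it lives in `c2cc.resource` next to `Register Remote Cluster` and `${C2CC_REMOTE_ALIASES}`, that `Register Remote Cluster` re-adds the alias to the list when it succeeds, and that the Collections library is imported.

```robotframework
*** Settings ***
Library    Collections


*** Keywords ***
Register Remote Cluster Keeping Alias Tracked
    [Documentation]    Hypothetical wrapper (not part of the PR): re-register a cluster
    ...    while keeping its alias in the teardown tracking list even on failure.
    [Arguments]    ${alias}    ${host}    ${port}    ${kubeconfig}
    Remove Values From List    ${C2CC_REMOTE_ALIASES}    ${alias}
    TRY
        Register Remote Cluster    ${alias}    ${host}    ${port}    ${kubeconfig}
    FINALLY
        # Assumption: on success Register Remote Cluster has already re-added the alias,
        # so it is appended here only when registration failed before doing so.
        IF    $alias not in $C2CC_REMOTE_ALIASES
            Append To List    ${C2CC_REMOTE_ALIASES}    ${alias}
        END
    END
```

With FINALLY the original failure still propagates, so `Wait Until Keyword Succeeds` keeps retrying as before. Whether `Teardown All Remote Clusters` tolerates a tracked alias that has no open connection is not visible in this diff and would need checking.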


Reboot Clusters Simultaneously
    [Documentation]    Reboot one or more clusters and wait for all to come back
    ...    with a new boot ID and greenboot health check passed.
    ...    Issues reboot on all targets before waiting, so reboots overlap.
    [Arguments]    @{cluster_aliases}
    &{boot_ids}=    Create Dictionary
    FOR    ${alias}    IN    @{cluster_aliases}
        ${conn_id}=    Get From Dictionary    ${C2CC_SSH_IDS}    ${alias}
        SSHLibrary.Switch Connection    ${conn_id}
        ${bootid}=    Get Current Boot Id
        Set To Dictionary    ${boot_ids}    ${alias}    ${bootid}
    END
    FOR    ${alias}    IN    @{cluster_aliases}
        ${conn_id}=    Get From Dictionary    ${C2CC_SSH_IDS}    ${alias}
        SSHLibrary.Switch Connection    ${conn_id}
        SSHLibrary.Start Command    reboot    sudo=True
    END
    FOR    ${alias}    IN    @{cluster_aliases}
        ${old_bootid}=    Get From Dictionary    ${boot_ids}    ${alias}
        Wait Until Keyword Succeeds    10m    15s
        ...    Cluster Should Be Rebooted    ${alias}    ${old_bootid}
    END

Cluster Should Be Rebooted
    [Documentation]    Assert that a cluster has rebooted and greenboot health check has passed.
    [Arguments]    ${alias}    ${old_bootid}
    Reconnect Cluster    ${alias}
    ${new_bootid}=    Get Current Boot Id
    Should Not Be Equal    ${new_bootid}    ${old_bootid}
    ${stdout}=    Command On Cluster    ${alias}
    ...    systemctl show -p SubState greenboot-healthcheck.service --value
    Should Be Equal As Strings    ${stdout}    exited    strip_spaces=True
146 changes: 146 additions & 0 deletions test/suites/c2cc/reboot.robot
@@ -0,0 +1,146 @@
*** Settings ***
Documentation    Verify C2CC survives VM reboots in escalating scenarios.
...    Tests single cluster, two clusters simultaneously, and all three
...    clusters simultaneously. After each reboot cycle, full-stack
...    verification confirms connectivity, infrastructure, health probes,
...    and DNS all recover.

Resource    ../../resources/c2cc.resource

Suite Setup    Setup
Suite Teardown    Teardown

Test Tags    c2cc

Contributor

I think we should add the disruptive tag here so that this suite is included in the Disruptive runs, same as the ones Patryk created on the other PR, example

@pmtk wdyt?

Member

If they can fit into a single scenario, then definitely. If not, it should be a separate one.

Except I'd replace c2cc with disruptive. If c2cc is present, it would run within regular scenarios.
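
For illustration, the replacement described here would amount to a one-line change in the suite's settings. This is only a sketch: it assumes the tag is spelled `disruptive` exactly as in the comment above and that the Disruptive scenarios select suites by that tag, neither of which is confirmed by this diff.

```robotframework
*** Settings ***
Resource    ../../resources/c2cc.resource

Suite Setup    Setup
Suite Teardown    Teardown

# Assumed tag name, taken from the comment above; replaces the original c2cc tag.
Test Tags    disruptive
```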



*** Test Cases ***
Reboot Single Cluster
    [Documentation]    Reboot cluster-a while cluster-b and cluster-c remain up.
    ...    Verifies that the rebooted cluster re-establishes C2CC connectivity
    ...    with both peers.
    [Setup]    Ensure All Clusters Healthy
    Reboot Clusters Simultaneously    cluster-a
    Wait For Clusters Ready
    Verify Full C2CC Stack

Reboot Two Clusters Simultaneously
    [Documentation]    Reboot cluster-b and cluster-c at the same time.
    ...    The surviving cluster-a must wait for both peers to recover.
    ...    The two rebooted clusters must also reconnect with each other.
    [Setup]    Ensure All Clusters Healthy
    Reboot Clusters Simultaneously    cluster-b    cluster-c
    Wait For Clusters Ready
    Verify Full C2CC Stack

Reboot All Three Clusters Simultaneously
    [Documentation]    Reboot all three clusters at once.
    ...    Every cluster starts from scratch simultaneously - no running peer
    ...    to reference. All must independently reconstruct C2CC state.
    [Setup]    Ensure All Clusters Healthy
    Reboot Clusters Simultaneously    cluster-a    cluster-b    cluster-c
    Wait For Clusters Ready
    Verify Full C2CC Stack


*** Keywords ***
Setup
    [Documentation]    Set up clusters and deploy test workloads on all.
    Check Required Env Variables
    Login MicroShift Host
    Setup Kubeconfig
    Logout MicroShift Host

    Register Remote Cluster    cluster-a    ${USHIFT_HOST}    ${SSH_PORT}    ${KUBECONFIG}
    Register Remote Cluster    cluster-b    ${HOST2_IP}    ${HOST2_SSH_PORT}    ${KUBECONFIG_B}
    Register Remote Cluster    cluster-c    ${HOST3_IP}    ${HOST3_SSH_PORT}    ${KUBECONFIG_C}
    Deploy Test Workloads
    Verify Full C2CC Stack

Teardown
    [Documentation]    Remove test workloads and close connections.
    Cleanup Test Workloads
    Teardown All Remote Clusters
    Remove Kubeconfig

Wait For Clusters Ready
    [Documentation]    Wait for test pods and service endpoints to become ready
    ...    after a reboot cycle.
    Wait For Test Pods
    Wait For Service Endpoints

Verify Full C2CC Stack

Contributor

@vimauro @pmtk I feel we are rewriting (almost) the same Verify keywords to check that the clusters are OK on every PR, don't you?

This PR is OK as it is. I'd prefer to merge it now and later, in a follow-up PR, refactor to group these Verify checks better. What do you think?

    [Documentation]    Comprehensive verification of all C2CC components across all clusters.
    Wait Until Keyword Succeeds    10m    10s    Verify C2CC Connectivity
    Wait Until Keyword Succeeds    10m    10s    Verify C2CC Infrastructure
    Wait Until Keyword Succeeds    10m    10s    Verify C2CC Health Probes
    Wait Until Keyword Succeeds    10m    10s    Verify C2CC DNS

Verify C2CC Connectivity
    [Documentation]    Verify pod-to-pod, pod-to-service connectivity and source IP preservation
    ...    across all 6 cluster pairs.
    FOR    ${src}    ${dst}    IN
    ...    cluster-a    cluster-b
    ...    cluster-a    cluster-c
    ...    cluster-b    cluster-a
    ...    cluster-b    cluster-c
    ...    cluster-c    cluster-a
    ...    cluster-c    cluster-b
        Test Connectivity Between Clusters    ${src}    ${dst}    pod
        Test Connectivity Between Clusters    ${src}    ${dst}    service
        Test Source IP Preserved Between Clusters    ${src}    ${dst}    pod
        Test Source IP Preserved Between Clusters    ${src}    ${dst}    service
    END

Verify C2CC Infrastructure
    [Documentation]    Verify routes, IP rules, nftables, OVN static routes,
    ...    and node annotations for all cluster-peer combinations.
    Verify Infra For Remote Peer    cluster-a    ${CLUSTER_B_POD_CIDR}    ${CLUSTER_B_SVC_CIDR}    ${CLUSTER_A_SVC_CIDR}
    Verify Infra For Remote Peer    cluster-a    ${CLUSTER_C_POD_CIDR}    ${CLUSTER_C_SVC_CIDR}    ${CLUSTER_A_SVC_CIDR}
    Verify Infra For Remote Peer    cluster-b    ${CLUSTER_A_POD_CIDR}    ${CLUSTER_A_SVC_CIDR}    ${CLUSTER_B_SVC_CIDR}
    Verify Infra For Remote Peer    cluster-b    ${CLUSTER_C_POD_CIDR}    ${CLUSTER_C_SVC_CIDR}    ${CLUSTER_B_SVC_CIDR}
    Verify Infra For Remote Peer    cluster-c    ${CLUSTER_A_POD_CIDR}    ${CLUSTER_A_SVC_CIDR}    ${CLUSTER_C_SVC_CIDR}
    Verify Infra For Remote Peer    cluster-c    ${CLUSTER_B_POD_CIDR}    ${CLUSTER_B_SVC_CIDR}    ${CLUSTER_C_SVC_CIDR}

Verify Infra For Remote Peer
[Documentation] Verify all infrastructure components on a cluster for one remote peer.
[Arguments] ${alias} ${remote_pod_cidr} ${remote_svc_cidr} ${local_svc_cidr}
Verify Routes In Table 200 ${alias} ${remote_pod_cidr} ${remote_svc_cidr}
Verify IP Rules For Table 200 ${alias} ${remote_pod_cidr} ${remote_svc_cidr}
Verify Routes In Table 201 ${alias} ${local_svc_cidr}
Verify Service IP Rules ${alias} ${remote_pod_cidr} ${remote_svc_cidr} ${local_svc_cidr}
Verify NFTables Bypass Rules ${alias} ${remote_pod_cidr} ${remote_svc_cidr}
Verify OVN Static Routes ${alias} ${remote_pod_cidr} ${remote_svc_cidr}
Verify Node SNAT Annotation ${alias} ${remote_pod_cidr} ${remote_svc_cidr}
Verify C2CC Tracking Annotation ${alias} ${remote_pod_cidr} ${remote_svc_cidr}

Verify C2CC Health Probes
    [Documentation]    Verify all RemoteCluster CRs report Healthy with populated timestamps.
    FOR    ${alias}    IN    cluster-a    cluster-b    cluster-c
        Verify RemoteCluster State    ${alias}    Healthy
        ${stdout}=    Oc On Cluster    ${alias}
        ...    oc get remoteclusters.microshift.io -o jsonpath='{.items[*].status.lastProbeTime}'
        Should Not Be Empty    ${stdout}
        ${stdout}=    Oc On Cluster    ${alias}
        ...    oc get remoteclusters.microshift.io -o jsonpath='{.items[*].status.lastSuccessfulProbe}'
        Should Not Be Empty    ${stdout}
    END

Verify C2CC DNS
    [Documentation]    Verify CoreDNS Corefile contains C2CC server blocks and
    ...    cross-cluster DNS resolution works for all pairs.
    Verify Corefile Contains C2CC Server Block    cluster-a    ${CLUSTER_B_DOMAIN}
    Verify Corefile Contains C2CC Server Block    cluster-a    ${CLUSTER_C_DOMAIN}
    Verify Corefile Contains C2CC Server Block    cluster-b    ${CLUSTER_A_DOMAIN}
    Verify Corefile Contains C2CC Server Block    cluster-b    ${CLUSTER_C_DOMAIN}
    Verify Corefile Contains C2CC Server Block    cluster-c    ${CLUSTER_A_DOMAIN}
    Verify Corefile Contains C2CC Server Block    cluster-c    ${CLUSTER_B_DOMAIN}
    Curl Remote Service Via DNS    cluster-a    cluster-b
    Curl Remote Service Via DNS    cluster-a    cluster-c
    Curl Remote Service Via DNS    cluster-b    cluster-a
    Curl Remote Service Via DNS    cluster-b    cluster-c
    Curl Remote Service Via DNS    cluster-c    cluster-a
    Curl Remote Service Via DNS    cluster-c    cluster-b

Ensure All Clusters Healthy
    [Documentation]    Pre-condition: all clusters must have Healthy RemoteCluster CRs.
    Verify All RemoteClusters Healthy