From f1966d928610a01eaf4df290ba0971f589bf951a Mon Sep 17 00:00:00 2001 From: Aleksei Sviridkin Date: Wed, 24 Jun 2026 11:16:14 +0300 Subject: [PATCH 1/5] docs(design-proposals): add unified TLS and PKI model Record a written target design for the TLS/PKI convergence: two-tier (edge issuance + interior operator-owned PKI), a per-engine PKI-ownership contract, and a delivery mechanism for the cert-manager-minting engines where charts cannot wire the CA-only helper directly. Assisted-By: Claude Signed-off-by: Aleksei Sviridkin --- design-proposals/unified-tls-pki/README.md | 195 +++++++++++++++++++++ 1 file changed, 195 insertions(+) create mode 100644 design-proposals/unified-tls-pki/README.md diff --git a/design-proposals/unified-tls-pki/README.md b/design-proposals/unified-tls-pki/README.md new file mode 100644 index 0000000..2ffc3a2 --- /dev/null +++ b/design-proposals/unified-tls-pki/README.md @@ -0,0 +1,195 @@ + +# Unified TLS and PKI model for managed applications + +- **Title:** `Unified TLS and PKI model for managed applications` +- **Author(s):** `@lexfrei` +- **Date:** `2026-06-24` +- **Status:** Draft + +## Overview + +Certificate handling in Cozystack grew out of four uncoordinated mechanisms: per-application issuance inside each chart, per-host ACME on the default ingress path, an opt-in Gateway API path that mints a wildcard via DNS-01, and no supported way to bring an externally-issued wildcard. The epic `cozystack/cozystack#2811` set out to converge them, and the edge of that convergence has largely landed. What has not landed — and what this proposal exists to pin down before it does — is the part that lives *inside* the engines: who owns the PKI for each managed application, and how a tenant receives the trust anchor it needs to verify a TLS connection. + +This proposal records a written target design for the whole model, so that the two architectural forks at its center are decided on paper rather than discovered one pull request at a time. 
The first fork is the issuance abstraction: rather than "cert-manager as the single issuer", the target is **a single operator-facing interface to choose the certificate source, plus a uniform contract for consuming `ca.crt`**. The second fork is mint-versus-consume: rather than forcing every application to stop minting and consume a central certificate, the target is an explicitly **two-tier** model — cert-manager (or a bring-your-own wildcard) at the edge, and operator-owned PKI inside the engines — where what is unified is the *consume contract*, not the certificate authority itself. + +## Scope and related proposals + +This proposal is the umbrella design for the work tracked by epic `cozystack/cozystack#2811`. It does not re-specify the edge work that has already merged; it states the target the whole model converges on and focuses on the interior contract that is still open. + +- **Edge, merged:** `cozystack/cozystack#2988` (ACME wildcard on the default ingress-nginx path), `cozystack/cozystack#2989` (the CA-only trust-anchor helper). +- **Edge, open:** `cozystack/cozystack#2990` (propagate the operator wildcard to per-tenant termination points — the PR implementing issue `cozystack/cozystack#2820`). +- **Workstreams (issues):** `cozystack/cozystack#2812` and `cozystack/cozystack#2400` (closed, edge wildcard); `cozystack/cozystack#2814` (converge per-app TLS and fix the CA / private-key coupling — the first consumer of this contract); `cozystack/cozystack#2815` (external DB exposure via Gateway TLS-passthrough); `cozystack/cozystack#2816` (end-to-end TLS for databases); `cozystack/cozystack#2977` (opt-in east-west encryption). Throughout this document a `cozystack/cozystack#NNNN` reference is an issue unless called out as a PR. +- **Per-app TLS series (open):** `cozystack/cozystack#2729` (redis), `cozystack/cozystack#2692` (mongodb), `cozystack/cozystack#2683` (rabbitmq), `cozystack/cozystack#2682` (opensearch), `cozystack/cozystack#2680` (mariadb). 
These are the pull requests that should land *after* this contract is accepted, not before. + +All repository paths below refer to the `cozystack/cozystack` repository; paths attributed to an open PR (for example the wildcard-secret reconciler in PR `cozystack/cozystack#2990`) are not yet on `main`. + +## Context + +### Edge today + +System ingresses (dashboard, grafana, keycloak, harbor) get a per-host certificate via the cert-manager `cluster-issuer` annotation and ingress-shim. The `letsencrypt-prod`, `letsencrypt-stage`, and `selfsigned-cluster-issuer` ClusterIssuers exist with HTTP-01 and DNS-01 solvers. Gateway API is present but opt-in, and in DNS-01 mode the `TenantGateway` controller renders a per-apex wildcard. With `cozystack/cozystack#2988` an operator can now drop in a wildcard Secret and have the default ingress path serve it via `--default-ssl-certificate`; `cozystack/cozystack#2990` extends that to per-tenant termination points. + +### Interior today + +The managed engines fall into two classes, verified against the current charts. + +The first class **mints** its own chain from a chart-rendered cert-manager graph (self-signed Issuer → CA Certificate → CA Issuer → leaf Certificate). `nats` and `qdrant` do this in `main` (`packages/apps/nats/templates/certmanager.yaml`, `packages/apps/qdrant/templates/certmanager.yaml`); the open per-app pull requests for redis, rabbitmq, mariadb, and opensearch add the same shape. In this class the CA Secret (`-ca`) is a cert-manager CA-certificate Secret and therefore **carries the private key**, and the `ca.crt` is not delivered to tenants at all today. + +The second class **consumes** PKI that its operator owns end-to-end. `postgres` (CloudNativePG) renders no cert-manager objects; the operator auto-generates a self-signed CA and signs the server certificate, and the `ca.crt` is present in every `-credentials` Secret (`packages/apps/postgres/templates/db.yaml`). 
`kafka` (Strimzi) is the same shape and is the reference for the *consume* contract: it exposes `-cluster-ca-cert` and `-clients-ca-cert`, each a CA-certificate-only object with no private key (`packages/apps/kafka/templates/dashboard-resourcemap.yaml`). `mongodb` (Percona PSMDB) is operator-owned, but a tenant-facing `ca.crt` path is still in flight under `cozystack/cozystack#2692`, not in `main`. + +One engine is deliberately outside this model. `kubernetes` (Kamaji) owns the control-plane CA, it is not swappable, and the kubeconfig pins that cluster CA — a public edge certificate is meaningless there. It needs no unification and is excluded. + +### Platform mechanisms this proposal builds on + +- **The CA-only helper.** `cozy-lib.tls.caCertSecret` (`packages/library/cozy-lib/templates/_tls.tpl`, from `cozystack/cozystack#2989`) renders a Secret containing only `ca.crt`. It fails closed if the input PEM contains any private-key header, and it always stamps the label `internal.cozystack.io/tenantresource: "true"`. It is covered by `packages/tests/cozy-lib-tests/tests/tls_cacert_test.yaml`. +- **Why the helper is needed.** The CA+leaf chain is rendered per-app by each chart's own cert-manager graph (for example `packages/apps/nats/templates/certmanager.yaml`), and the resulting CA Secret (`-ca`) carries `ca.key` — so it is not itself a `ca.crt`-only object and cannot be handed to a tenant. The shared library carries no CA-rendering chart; the only TLS helper there is `cozy-lib.tls.caCertSecret` above. +- **The tenant projection.** Secrets carrying the `internal.cozystack.io/tenantresource` label are exposed to tenants as the virtual resource `core.cozystack.io/tenantsecrets` (`pkg/registry/core/tenantsecret/rest.go`; the label constant lives in `pkg/apis/core/v1alpha1/tenantresource_types.go`; the RBAC grant is on the virtual resource, not on raw `core/v1` Secrets — `packages/system/cozystack-basics/templates/clusterroles.yaml`). 
The projection is **label-filtered, not field-filtered**: the entire Secret `Data` is delivered. This is the security pivot of the whole model — a labelled Secret must contain only safe material. +- **The values channel.** Global values ride a `cozystack-values` Secret under `_cluster.*` keys, injected into every application HelmRelease via `valuesFrom`. + +### The problem + +The epic's original headline — "consume not mint, cert-manager as the single issuance abstraction" — is not realizable as stated, for two reasons. + +First, there is no written contract for the interior. Nothing records, per engine, who owns the PKI and how `ca.crt` reaches the tenant. As a result the per-app TLS pull requests each re-derive the answer, and the answer differs between them. + +Second, the cert-manager-minting engines (nats, qdrant, and the four open per-app pull requests) have **no path to consume a `ca.crt` via the helper today**, because of three compounding constraints that are real and verified: + +- `valuesFrom` is pinned. `expectedValuesFrom()` (`internal/controller/applicationdefinition_helmreconciler.go:99-107`) hardcodes a single `{Kind: Secret, Name: cozystack-values}` reference, and the reconciler overwrites any drift. An application chart cannot add a sideways `valuesFrom` pointing at its own `-ca`. +- `lookup` cannot drive it. PR `cozystack/cozystack#1787` moved the global-values channel off `lookup` onto `valuesFrom`; `lookup` itself is still available and is used by several charts to read a pre-existing per-release Secret. But it runs at template-render time and is invisible to the Flux digest, so a chart that reads an asynchronously-created Secret via `lookup` does not re-render when that Secret appears — it would need a manual `helm upgrade`. +- The per-release CA is created **asynchronously** by cert-manager, so it does not exist at template-render time at all. 
+ +So the helper, by itself, closes the *output* shape (a key-free `ca.crt` Secret) but not the *input* path (where the chart gets that `ca.crt` for an asynchronously-issued, per-release CA). That gap is the substance of this proposal. + +## Goals + +- Record **two-tier** as the target architecture: edge issuance plus interior operator-owned PKI. +- Define a per-engine **PKI-ownership contract**: for each managed engine, who owns the CA, whether the engine mints or consumes, which Secret carries `ca.crt`, which carries the key, and whether the tenant sees it. +- Define a single **consume contract**: a `ca.crt`-only object, stamped with the tenant-resource label, delivered through the existing projection. Kafka's CA-certificate-only Secret is the reference shape. +- Define a **delivery mechanism** for `ca.crt` on the engines where the chart cannot wire the helper directly. +- Decide **mint-versus-consume explicitly, per engine**, rather than as one global rule. + +### Non-goals + +- This proposal does **not** force pure-consume on engines that own their PKI; doing so would break CloudNativePG and Strimzi certificate rotation, which are mutually exclusive with an externally-supplied server certificate. +- It does **not** homogenize the certificate authority. CA ownership stays the operator's choice — one corporate CA for everything, per-engine self-signed, or a cert-manager issuer are all legitimate. +- It does **not** make Cozystack a public/WebPKI certificate authority, and it does not issue certificates for a tenant's own external domain. +- It does **not** redesign the edge, which already merged (`cozystack/cozystack#2988`, `cozystack/cozystack#2989`, `cozystack/cozystack#2990`); it only references it and reframes the top-line goal. + +## Design + +### 1. The two-tier model + +```mermaid +flowchart TB + subgraph edge["Edge tier — issuance and external exposure"] + CM["cert-manager ClusterIssuers<br/>(ACME HTTP-01 / DNS-01)"] + WC["operator wildcard Secret<br/>(bring-your-own or ACME)"] + ING["ingress-nginx --default-ssl-certificate<br/>Gateway existingSecret"] + CM --> ING + WC --> ING + end + subgraph interior["Interior tier — per-engine operator-owned PKI"] + OP["DB operator owns CA + server cert<br/>(CNPG / Strimzi / PSMDB / cert-manager chart graph)"] + CACERT["<release>-ca-cert<br/>(ca.crt only, no private key)"] + OP --> CACERT + end + subgraph tenant["Tenant"] + T["client verifies the server<br/>using ca.crt"] + end + ING -->|"public / edge TLS"| T + CACERT -->|"tenantsecrets projection<br/>(label internal.cozystack.io/tenantresource)"| T +``` + +The two tiers are independent. The edge tier answers "what certificate does a public client see when it reaches the platform", and is satisfied by an operator-chosen source (ACME HTTP-01, ACME DNS-01 wildcard, or a bring-your-own wildcard). The interior tier answers "what does a client that connects directly to a managed engine need to trust", and is satisfied by the engine's own operator-owned PKI. The only thing that crosses between them is the *shape* of the trust-anchor object a tenant consumes. + +### 2. Edge tier (already landed) + +The edge is done and is recorded here only to fix the framing. An operator selects the certificate source once; the default ingress path serves a supplied wildcard via `--default-ssl-certificate` (`cozystack/cozystack#2988`), the Gateway path consumes the same Secret via an `existingSecret` mode, and `cozystack/cozystack#2990` propagates it to per-tenant termination points. The reframed top-line goal applies here: this is "a single interface to choose the source", not "a single issuer for everything". + +### 3. Interior PKI-ownership contract + +The contract is a per-engine table. It is the artifact the per-app pull requests must conform to, and it makes the two-tier reality explicit: the operator owns the CA; the platform unifies only how `ca.crt` is consumed. + +| Engine | PKI owner | Mint / consume | CA-bearing Secret today | Key in that Secret? | `ca.crt` delivered to tenant today?
| +| --- | --- | --- | --- | --- | --- | +| postgres (CloudNativePG) | operator | consume | `-credentials` | no | yes (in credentials) | +| kafka (Strimzi) | operator | consume | `-cluster-ca-cert`, `-clients-ca-cert` | no | yes — reference shape | +| mongodb (Percona PSMDB) | operator | consume | none in `main` (PR `cozystack/cozystack#2692`) | n/a | no (in flight) | +| nats | chart + cert-manager | mint | `-ca` | yes | no | +| qdrant | chart + cert-manager | mint | `-ca` | yes | no | +| redis, rabbitmq, mariadb, opensearch | chart + cert-manager (open PRs) | mint | in PR branches | yes | varies | +| kubernetes (Kamaji) | Kamaji | internal only | Kamaji-owned, not swappable | n/a | no — out of model | + +The takeaway: the operator-owned engines are already close to the target (postgres and kafka deliver `ca.crt` without a key), while the cert-manager-minting engines are the gap — their `-ca` carries the private key and the trust anchor never reaches the tenant. + +### 4. The uniform consume contract + +Every engine, regardless of who owns its CA, exposes its trust anchor through one canonical object: a Secret named `-ca-cert`, containing only `ca.crt`, stamped with `internal.cozystack.io/tenantresource: "true"`. The platform already has the building block — `cozy-lib.tls.caCertSecret` renders exactly this object and fails closed if the input contains a private key. + +This is where the label-filtered projection matters. Because `tenantsecrets` delivers the whole Secret `Data`, the helper's fail-closed guard is not a nicety — it is the boundary that keeps a server or CA private key out of a tenant's hands. Kafka's `-clients-ca-cert` is the shape to match: a CA certificate, no key, readable by the tenant. + +### 5. Delivery: closing the cert-manager-engine gap + +The contract in §4 fixes the output object. The remaining work is the input path, and it splits by engine class. 
+ +For **operator-owned engines** (postgres, kafka, mongodb), the operator already publishes a CA-bearing Secret synchronously enough to consume, and for kafka it is already key-free. The work is to converge each on the canonical `-ca-cert` shape and label — no new controller is required. + +For **cert-manager-minting engines** (nats, qdrant, and the four open per-app pull requests), none of the chart-time paths work: `valuesFrom` is pinned, `lookup` cannot trigger a re-render when the CA appears, and the CA is asynchronous (see "The problem"). The proposed mechanism is a small **CA-distribution controller**, modelled on the wildcard-secret reconciler introduced in PR `cozystack/cozystack#2990` (`internal/controller/wildcardsecret/reconciler.go`). That reconciler demonstrates the projection/ownership pattern: it projects copies into the right namespaces, marks every copy it owns with a management label (`cozystack.io/wildcard-secret-copy: "true"`), garbage-collects stale copies, and refuses to touch a foreign Secret that happens to share the name (it catches in-place rotation of its dynamically-named source via a periodic resync rather than a source watch). The CA-distribution controller does the analogous thing one level down on the per-release `-ca` Secret that cert-manager creates asynchronously, projecting a `-ca-cert` object carrying only `ca.crt` with the tenant-resource label. Because `-ca` has a deterministic name, the controller can watch the source directly and project as soon as the CA appears after render; because it writes a Secret the engine and the projection already understand, it needs neither `lookup` nor a custom `valuesFrom`. + +### 6. Per-engine application order + +`postgres` goes first (under the tracking issue `cozystack/cozystack#2814`): it already carries `ca.crt` in `-credentials`, so it validates the consume contract with the least new machinery. `kafka` and `mongodb` follow with minimal shape adaptation. 
The cert-manager-minting engines (nats, qdrant, and then rabbitmq, mariadb, opensearch) adopt the CA-distribution controller. `redis` (`cozystack/cozystack#2729`) is a hybrid — it renders a cert-manager chain but its operator fork already publishes a key-free `-ca-cert`, so it may converge by shape adaptation like the operator-owned engines rather than needing the controller (this is part of the controller-scope open question below). `kubernetes` (Kamaji) is explicitly out. + +## User-facing changes + +A tenant sees one canonical, key-free trust-anchor object per managed application — `-ca-cert`, carrying only `ca.crt` — through the dashboard resource map and the `tenantsecrets` projection, in the same shape across every engine. An operator sees one interface to choose the edge certificate source. There is no new tenant-authored input. + +## Upgrade and rollback compatibility + +This document changes nothing at runtime; it records a target. The pull requests that implement it are individually backward-compatible: per-app TLS is opt-in (tri-state), external exposure is opt-in, and the CA-distribution controller adds a new object without altering existing Secrets. Reverting any one of them removes the new `-ca-cert` object and the controller that maintains it, leaving the engine's existing PKI untouched. + +## Security + +The trust boundary is precise: a tenant receives `ca.crt` and never receives `tls.key` or `ca.key`. The label-filtered, full-object nature of the `tenantsecrets` projection makes the helper's fail-closed guard load-bearing — any labelled Secret is delivered in full, so the guard is what prevents a private key from leaking. The CA-distribution controller adds a new trust surface: read access to per-release cert-manager CA Secrets and write access for the key-free copy. It must adopt the same foreign-collision guard as the wildcard reconciler, so it never overwrites a Secret it did not create. 
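The two guards the Security section relies on can be sketched in Go. This is a minimal illustration under stated assumptions, not the code of `cozy-lib.tls.caCertSecret` or of any controller in the repository: `extractCACert`, `mayWrite`, and the label key `example.cozystack.io/ca-cert-copy` are hypothetical names introduced only for this example. The first function keeps only `CERTIFICATE` PEM blocks and fails closed on any private-key header; the second refuses to write over an existing Secret that does not carry the management label.

```go
package main

import (
	"bytes"
	"encoding/pem"
	"errors"
	"fmt"
)

// managedLabel is a hypothetical management label key; the proposal does not
// fix the real name for the CA-distribution controller.
const managedLabel = "example.cozystack.io/ca-cert-copy"

var (
	errPrivateKey = errors.New("input contains private-key material")
	errNoCert     = errors.New("input contains no CERTIFICATE block")
)

// extractCACert returns only the CERTIFICATE blocks found in the input.
// It fails closed: any private-key header rejects the whole input, even if
// the key block itself is malformed and would not decode.
func extractCACert(in []byte) ([]byte, error) {
	// Matches "BEGIN PRIVATE KEY", "BEGIN RSA PRIVATE KEY", "BEGIN EC PRIVATE KEY", etc.
	if bytes.Contains(in, []byte("PRIVATE KEY-----")) {
		return nil, errPrivateKey
	}
	var out bytes.Buffer
	rest := in
	for {
		var block *pem.Block
		block, rest = pem.Decode(rest)
		if block == nil {
			break // no further PEM blocks
		}
		if block.Type != "CERTIFICATE" {
			continue // drop anything that is not a certificate
		}
		if err := pem.Encode(&out, block); err != nil {
			return nil, err
		}
	}
	if out.Len() == 0 {
		return nil, errNoCert
	}
	return out.Bytes(), nil
}

// mayWrite is the foreign-collision guard: writing is allowed when the target
// does not exist yet, or when the existing object carries the management label.
func mayWrite(exists bool, labels map[string]string) bool {
	return !exists || labels[managedLabel] == "true"
}

func main() {
	// Placeholder payloads: pem does not validate the DER content.
	ca := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: []byte("placeholder")})
	key := pem.EncodeToMemory(&pem.Block{Type: "PRIVATE KEY", Bytes: []byte("placeholder")})

	out, err := extractCACert(ca)
	fmt.Println("certificate only accepted:", len(out) > 0 && err == nil)

	_, err = extractCACert(append(ca, key...))
	fmt.Println("key-bearing input rejected:", errors.Is(err, errPrivateKey))

	fmt.Println("foreign Secret writable:", mayWrite(true, map[string]string{"app": "foreign"}))
}
```

The header pre-check runs before decoding on purpose: a truncated or malformed key block that `pem.Decode` would skip still rejects the input, which is the fail-closed behaviour the trust boundary depends on.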
+ +## Failure and edge cases + +- The helper's input PEM contains a private-key header → render fails closed; the chart does not deploy a key-bearing Secret. +- The per-release CA Secret does not exist yet → the controller waits for the watch event; it does not error or busy-loop. +- A foreign Secret already occupies the target name → the controller leaves it untouched (management-label guard). +- The CA rotates → the controller re-projects `ca.crt` on the next watch event; no chart re-render is required. + +## Testing + +- The helper is already covered by `packages/tests/cozy-lib-tests/tests/tls_cacert_test.yaml`, including the fail-closed assertions. +- The CA-distribution controller gets an envtest/Ginkgo suite modelled on `internal/controller/wildcardsecret/reconciler_test.go`: source appears after the consumer, foreign-collision, rotation, and garbage-collection cases. +- Each per-app pull request adds helm-unittest fixtures asserting the `-ca-cert` shape and label, plus an end-to-end check under `hack/e2e-apps/` that a tenant can read `ca.crt` and verify the server. + +## Rollout + +1. Edge — done (`cozystack/cozystack#2988`, `cozystack/cozystack#2989`, `cozystack/cozystack#2990`). +2. This contract — accepted. +3. CA-distribution controller — implemented and tested. +4. Per-app convergence — postgres first (tracked by `cozystack/cozystack#2814`), then the remaining per-app TLS pull requests onto the contract. + +## Open questions + +- The exact name and namespace convention for the `-ca-cert` object (per-release in the app namespace is assumed here). +- Whether the CA-distribution controller is scoped to the cert-manager-minting engines or generalized into one projector that also normalizes the operator-owned engines onto the canonical shape. 
+- How this intersects with per-tenant wildcard propagation (issue `cozystack/cozystack#2820`, implemented by PR `cozystack/cozystack#2990`), which solves a structurally similar cross-namespace replication problem and may share the controller mechanism. + +## Alternatives considered + +- **Pure-consume (applications stop minting, consume a central cert-manager output).** Rejected: it breaks the rotation lifecycle of CloudNativePG and Strimzi, whose own-CA management is mutually exclusive with an externally-supplied server certificate. +- **`lookup` for the asynchronous CA.** Rejected: `lookup` runs at render time and is invisible to the Flux digest, so the chart would not re-render when the CA appears. (`cozystack/cozystack#1787` already moved the global-values channel off `lookup` onto `valuesFrom` for the same digest reason.) +- **A custom `valuesFrom` pointing at `-ca`.** Rejected: `expectedValuesFrom()` pins every application HelmRelease to the single `cozystack-values` Secret and overwrites drift. +- **A general-purpose cluster-secret replication operator.** Rejected during the edge work (`cozystack/cozystack#2990`) in favour of a purpose-built reconciler with a tight ownership guard. +- **A copy-issuer webhook.** Rejected during the wildcard work (`cozystack/cozystack#2812`) in favour of native references that move only the Secret name, never key material. + +--- + + From 20b7604d96fc8a4df9986a6e6cd9640860f13f7d Mon Sep 17 00:00:00 2001 From: Aleksei Sviridkin Date: Wed, 24 Jun 2026 18:38:46 +0300 Subject: [PATCH 2/5] docs(unified-tls-pki): correct postgres PKI model, rework CA delivery The PKI-ownership table claimed postgres already delivers a key-free ca.crt. In fact CNPG's -credentials Secret carries the server tls.key and is the object surfaced to tenants, so postgres is the canonical CA / private-key coupling case, not a converged engine. 
Fix the table, the takeaway, and the interior-state text, and state that convergence for postgres means a dedicated key-free -ca-cert plus removing -credentials from the tenant-facing surface. Replace the name-convention CA-distribution controller with an explicit source-selection label plus a small extraction controller, reusing the lineage webhook's spec.secrets label selector, owner-reference walk, and authoritative tenantresource stamping for marking and garbage collection. The controller sanitises to ca.crt at write time, owner-refs the projected Secret to the application instance, and emits a Warning Event on a name collision instead of failing silently. Record the field-filter projection as the principled root fix, align the per-tenant-propagation PR status across the document, and note the clustersecret-operator overlap with cross-namespace wildcard propagation. Assisted-By: Claude Signed-off-by: Aleksei Sviridkin --- design-proposals/unified-tls-pki/README.md | 61 ++++++++++++++-------- 1 file changed, 39 insertions(+), 22 deletions(-) diff --git a/design-proposals/unified-tls-pki/README.md b/design-proposals/unified-tls-pki/README.md index 2ffc3a2..667d5f1 100644 --- a/design-proposals/unified-tls-pki/README.md +++ b/design-proposals/unified-tls-pki/README.md @@ -35,7 +35,7 @@ The managed engines fall into two classes, verified against the current charts. The first class **mints** its own chain from a chart-rendered cert-manager graph (self-signed Issuer → CA Certificate → CA Issuer → leaf Certificate). `nats` and `qdrant` do this in `main` (`packages/apps/nats/templates/certmanager.yaml`, `packages/apps/qdrant/templates/certmanager.yaml`); the open per-app pull requests for redis, rabbitmq, mariadb, and opensearch add the same shape. In this class the CA Secret (`-ca`) is a cert-manager CA-certificate Secret and therefore **carries the private key**, and the `ca.crt` is not delivered to tenants at all today. 
-The second class **consumes** PKI that its operator owns end-to-end. `postgres` (CloudNativePG) renders no cert-manager objects; the operator auto-generates a self-signed CA and signs the server certificate, and the `ca.crt` is present in every `-credentials` Secret (`packages/apps/postgres/templates/db.yaml`). `kafka` (Strimzi) is the same shape and is the reference for the *consume* contract: it exposes `-cluster-ca-cert` and `-clients-ca-cert`, each a CA-certificate-only object with no private key (`packages/apps/kafka/templates/dashboard-resourcemap.yaml`). `mongodb` (Percona PSMDB) is operator-owned, but a tenant-facing `ca.crt` path is still in flight under `cozystack/cozystack#2692`, not in `main`. +The second class **consumes** PKI that its operator owns end-to-end, but the two merged engines in it sit on opposite sides of the trust-anchor contract. `postgres` (CloudNativePG) renders no cert-manager objects; the operator auto-generates a self-signed CA and signs the server certificate, and the `ca.crt` is present in every `-credentials` Secret (`packages/apps/postgres/templates/db.yaml`). That same Secret also carries the server `tls.key`, and it is the object surfaced to tenants through the dashboard resource map (`packages/apps/postgres/templates/dashboard-resourcemap.yaml`) — so today postgres delivers the trust anchor and the private key coupled in a single tenant-facing object, which is precisely the coupling `cozystack/cozystack#2814` exists to break. `kafka` (Strimzi) is the **clean** reference for the *consume* contract: it exposes `-cluster-ca-cert` and `-clients-ca-cert`, each a CA-certificate-only object with no private key (`packages/apps/kafka/templates/dashboard-resourcemap.yaml`). `mongodb` (Percona PSMDB) is operator-owned, but a tenant-facing `ca.crt` path is still in flight under `cozystack/cozystack#2692`, not in `main`. One engine is deliberately outside this model. 
`kubernetes` (Kamaji) owns the control-plane CA, it is not swappable, and the kubeconfig pins that cluster CA — a public edge certificate is meaningless there. It needs no unification and is excluded. @@ -44,6 +44,7 @@ One engine is deliberately outside this model. `kubernetes` (Kamaji) owns the co - **The CA-only helper.** `cozy-lib.tls.caCertSecret` (`packages/library/cozy-lib/templates/_tls.tpl`, from `cozystack/cozystack#2989`) renders a Secret containing only `ca.crt`. It fails closed if the input PEM contains any private-key header, and it always stamps the label `internal.cozystack.io/tenantresource: "true"`. It is covered by `packages/tests/cozy-lib-tests/tests/tls_cacert_test.yaml`. - **Why the helper is needed.** The CA+leaf chain is rendered per-app by each chart's own cert-manager graph (for example `packages/apps/nats/templates/certmanager.yaml`), and the resulting CA Secret (`-ca`) carries `ca.key` — so it is not itself a `ca.crt`-only object and cannot be handed to a tenant. The shared library carries no CA-rendering chart; the only TLS helper there is `cozy-lib.tls.caCertSecret` above. - **The tenant projection.** Secrets carrying the `internal.cozystack.io/tenantresource` label are exposed to tenants as the virtual resource `core.cozystack.io/tenantsecrets` (`pkg/registry/core/tenantsecret/rest.go`; the label constant lives in `pkg/apis/core/v1alpha1/tenantresource_types.go`; the RBAC grant is on the virtual resource, not on raw `core/v1` Secrets — `packages/system/cozystack-basics/templates/clusterroles.yaml`). The projection is **label-filtered, not field-filtered**: the entire Secret `Data` is delivered. This is the security pivot of the whole model — a labelled Secret must contain only safe material. +- **The label's authority is the lineage webhook, not the chart.** The `tenantresource` label is not honoured just because something stamped it. 
The lineage admission webhook (`internal/lineagecontrollerwebhook/webhook.go`) is its authority: it walks a Secret's `ownerReferences` to the owning application and sets `tenantresource` to `true` or `false` from that application's `spec.secrets` selector on **every** admission. A statically-stamped label therefore does not survive on its own — any Secret meant to be tenant-visible must also match a `spec.secrets.include` entry, or the webhook overwrites the label to `false`. This is why the consume contract below marks the CA object through `spec.secrets` (by label), not by stamping the label alone. - **The values channel.** Global values ride a `cozystack-values` Secret under `_cluster.*` keys, injected into every application HelmRelease via `valuesFrom`. ### The problem @@ -73,7 +74,7 @@ So the helper, by itself, closes the *output* shape (a key-free `ca.crt` Secret) - This proposal does **not** force pure-consume on engines that own their PKI; doing so would break CloudNativePG and Strimzi certificate rotation, which are mutually exclusive with an externally-supplied server certificate. - It does **not** homogenize the certificate authority. CA ownership stays the operator's choice — one corporate CA for everything, per-engine self-signed, or a cert-manager issuer are all legitimate. - It does **not** make Cozystack a public/WebPKI certificate authority, and it does not issue certificates for a tenant's own external domain. -- It does **not** redesign the edge, which already merged (`cozystack/cozystack#2988`, `cozystack/cozystack#2989`, `cozystack/cozystack#2990`); it only references it and reframes the top-line goal. +- It does **not** redesign the edge, which already merged (`cozystack/cozystack#2988`, `cozystack/cozystack#2989`) or is in flight (`cozystack/cozystack#2990`); it only references it and reframes the top-line goal. ## Design @@ -102,9 +103,9 @@ flowchart TB The two tiers are independent. 
The edge tier answers "what certificate does a public client see when it reaches the platform", and is satisfied by an operator-chosen source (ACME HTTP-01, ACME DNS-01 wildcard, or a bring-your-own wildcard). The interior tier answers "what does a client that connects directly to a managed engine need to trust", and is satisfied by the engine's own operator-owned PKI. The only thing that crosses between them is the *shape* of the trust-anchor object a tenant consumes. -### 2. Edge tier (already landed) +### 2. Edge tier (largely landed) -The edge is done and is recorded here only to fix the framing. An operator selects the certificate source once; the default ingress path serves a supplied wildcard via `--default-ssl-certificate` (`cozystack/cozystack#2988`), the Gateway path consumes the same Secret via an `existingSecret` mode, and `cozystack/cozystack#2990` propagates it to per-tenant termination points. The reframed top-line goal applies here: this is "a single interface to choose the source", not "a single issuer for everything". +The edge is largely done and is recorded here only to fix the framing. An operator selects the certificate source once; the default ingress path serves a supplied wildcard via `--default-ssl-certificate` (`cozystack/cozystack#2988`), and the Gateway path consumes the same Secret via an `existingSecret` mode. Propagating that wildcard to per-tenant termination points (`cozystack/cozystack#2990`) is the one open remainder. The reframed top-line goal applies here: this is "a single interface to choose the source", not "a single issuer for everything". ### 3. Interior PKI-ownership contract @@ -112,15 +113,15 @@ The contract is a per-engine table. It is the artifact the per-app pull requests | Engine | PKI owner | Mint / consume | CA-bearing Secret today | Key in that Secret? | `ca.crt` delivered to tenant today? 
| | --- | --- | --- | --- | --- | --- | -| postgres (CloudNativePG) | operator | consume | `-credentials` | no | yes (in credentials) | -| kafka (Strimzi) | operator | consume | `-cluster-ca-cert`, `-clients-ca-cert` | no | yes — reference shape | +| postgres (CloudNativePG) | operator | consume | `-credentials` | **yes — `tls.key`** | coupled — `ca.crt` only via the key-bearing `-credentials`, exposed to tenants today (the `#2814` bug) | +| kafka (Strimzi) | operator | consume | `-cluster-ca-cert`, `-clients-ca-cert` | no | yes — clean reference shape | | mongodb (Percona PSMDB) | operator | consume | none in `main` (PR `cozystack/cozystack#2692`) | n/a | no (in flight) | | nats | chart + cert-manager | mint | `-ca` | yes | no | | qdrant | chart + cert-manager | mint | `-ca` | yes | no | | redis, rabbitmq, mariadb, opensearch | chart + cert-manager (open PRs) | mint | in PR branches | yes | varies | | kubernetes (Kamaji) | Kamaji | internal only | Kamaji-owned, not swappable | n/a | no — out of model | -The takeaway: the operator-owned engines are already close to the target (postgres and kafka deliver `ca.crt` without a key), while the cert-manager-minting engines are the gap — their `-ca` carries the private key and the trust anchor never reaches the tenant. +The takeaway: only **kafka** matches the target today — it delivers `ca.crt` with no key. **postgres does not.** Its `ca.crt` rides inside the `-credentials` Secret, which also holds the server `tls.key`, and that same key-bearing Secret is the one surfaced to tenants through the dashboard resource map (`packages/apps/postgres/templates/dashboard-resourcemap.yaml`). So postgres is not an already-converged engine — it is the canonical instance of the CA / private-key coupling that `cozystack/cozystack#2814` exists to fix, and convergence for postgres means publishing a dedicated key-free `-ca-cert` *and* removing `-credentials` from the tenant-facing surface. 
The cert-manager-minting engines are the other gap: their `-ca` carries the private key and the trust anchor never reaches the tenant at all. ### 4. The uniform consume contract @@ -132,13 +133,25 @@ This is where the label-filtered projection matters. Because `tenantsecrets` del The contract in §4 fixes the output object. The remaining work is the input path, and it splits by engine class. -For **operator-owned engines** (postgres, kafka, mongodb), the operator already publishes a CA-bearing Secret synchronously enough to consume, and for kafka it is already key-free. The work is to converge each on the canonical `-ca-cert` shape and label — no new controller is required. +For **operator-owned engines**, the operator already publishes CA material synchronously enough to consume. `kafka` is already key-free and needs no change. `postgres` is the coupling case from §3: convergence means publishing a dedicated key-free `-ca-cert` (extracted from the `ca.crt` the CNPG operator already materialises) **and** dropping the key-bearing `-credentials` from its dashboard resource map, so the trust anchor reaches the tenant without the server `tls.key`. `mongodb` adds the same `-ca-cert` shape (`cozystack/cozystack#2692`). No new controller is required for this class. -For **cert-manager-minting engines** (nats, qdrant, and the four open per-app pull requests), none of the chart-time paths work: `valuesFrom` is pinned, `lookup` cannot trigger a re-render when the CA appears, and the CA is asynchronous (see "The problem"). The proposed mechanism is a small **CA-distribution controller**, modelled on the wildcard-secret reconciler introduced in PR `cozystack/cozystack#2990` (`internal/controller/wildcardsecret/reconciler.go`). 
That reconciler demonstrates the projection/ownership pattern: it projects copies into the right namespaces, marks every copy it owns with a management label (`cozystack.io/wildcard-secret-copy: "true"`), garbage-collects stale copies, and refuses to touch a foreign Secret that happens to share the name (it catches in-place rotation of its dynamically-named source via a periodic resync rather than a source watch). The CA-distribution controller does the analogous thing one level down on the per-release `-ca` Secret that cert-manager creates asynchronously, projecting a `-ca-cert` object carrying only `ca.crt` with the tenant-resource label. Because `-ca` has a deterministic name, the controller can watch the source directly and project as soon as the CA appears after render; because it writes a Secret the engine and the projection already understand, it needs neither `lookup` nor a custom `valuesFrom`. +For **cert-manager-minting engines** (nats, qdrant, and the four open per-app pull requests), none of the chart-time paths work: `valuesFrom` is pinned, `lookup` cannot trigger a re-render when the CA appears, and the CA is asynchronous (see "The problem"). These need a small **extraction controller** — and it can be far smaller than a generic replicator, because the platform's lineage machinery already provides the marking, ownership, and garbage collection a replicator would otherwise re-implement. The contract is three pieces, only the middle of which is new code. + +**(a) An explicit source-selection label, not a name convention.** The chart stamps `internal.cozystack.io/publish-ca-cert: "true"` on the CA-bearing Secret it wants extracted, with an optional `internal.cozystack.io/publish-ca-cert-key` annotation naming the key to lift (default `ca.crt`). For the cert-manager-minting charts this rides `Certificate.spec.secretTemplate.labels`, so even the asynchronously-created `-ca` Secret carries the marker from the moment cert-manager writes it. 
This replaces "the controller watches `-ca` because the name is deterministic" — a brittle implicit contract the next chart author silently breaks by naming their CA Secret anything else — with an explicit opt-in the source declares. + +**(b) A small extraction controller.** Its watch/upsert skeleton can follow the wildcard-secret reconciler from PR `cozystack/cozystack#2990` (`internal/controller/wildcardsecret/reconciler.go`), but it carries none of that reconciler's copy-marking or prune logic — lineage provides those (part (c)). For each label-selected source Secret it upserts a `type: Opaque` Secret named `-ca-cert` containing **only** `ca.crt`, re-copying on every source change so a CA rotation propagates without a chart re-render. It does three security-load-bearing things and nothing more: + +- **Sanitize at write time, not just render time.** The `cozy-lib.tls.caCertSecret` helper's fail-closed guard runs at *chart-render* time; this controller writes at *runtime*, so it must itself copy only the single `ca.crt` key (an explicit whitelist) and re-assert that the value carries no `-----BEGIN … PRIVATE KEY-----` header before writing. It never copies the whole `Data`. +- **Owner-ref the projected Secret to the application instance CR**, resolved from the `app.kubernetes.io/instance` label already on the source Secret — *not* to the source `-ca` Secret. cert-manager does not own its output Secrets, so a `-ca` Secret typically has no `ownerReferences` and an ownership-graph walk from it dead-ends before reaching the app. Pointing the projected Secret one hop from the app root makes Kubernetes garbage-collect it on app deletion for free, which is why the controller needs no prune logic of its own. 
+- **Refuse to touch a foreign Secret, and say so.** If the target name is already occupied by a Secret the controller did not create, it leaves it untouched (a management-label guard) **and emits a Kubernetes Warning Event** on the application — a silent skip would otherwise surface only as an unexplained TLS-verification failure for the tenant, which is exactly the kind of dead end a low-skill operator cannot diagnose. + +**(c) Marking stays in `spec.secrets`, by label not by name.** Because `ApplicationDefinition.spec.secrets.include` accepts a label selector (`internal/lineagecontrollerwebhook/matcher.go`), one generic entry — `matchLabels: {internal.cozystack.io/tenant-ca: "true"}`, stamped by the controller on every projected Secret — covers every engine with no per-release `resourceName` templating. The lineage admission webhook then does the rest: on admission it walks the projected Secret's `ownerReferences` to the owning application and authoritatively stamps `internal.cozystack.io/tenantresource` to `true` (and to `false` should the Secret ever stop matching). + +What this *reuses* rather than rebuilds: the label selector in `ApplicationDefinition.spec.secrets` (`internal/lineagecontrollerwebhook/matcher.go`), the lineage webhook's owner-reference walk and authoritative `tenantresource` stamping (`internal/lineagecontrollerwebhook/webhook.go`, `pkg/lineage/lineage.go`), the private-key guard in `cozy-lib.tls.caCertSecret` (`packages/library/cozy-lib/templates/_tls.tpl`), and native Kubernetes garbage collection via the owner reference. The irreducibly new work is the extraction step itself — read one key, write a key-free copy, re-copy on rotation — the same job an operator does natively on the engines that already ship a key-free CA object (kafka's `-clients-ca-cert`, the redis fork's `-ca-cert`). ### 6. 
Per-engine application order -`postgres` goes first (under the tracking issue `cozystack/cozystack#2814`): it already carries `ca.crt` in `-credentials`, so it validates the consume contract with the least new machinery. `kafka` and `mongodb` follow with minimal shape adaptation. The cert-manager-minting engines (nats, qdrant, and then rabbitmq, mariadb, opensearch) adopt the CA-distribution controller. `redis` (`cozystack/cozystack#2729`) is a hybrid — it renders a cert-manager chain but its operator fork already publishes a key-free `-ca-cert`, so it may converge by shape adaptation like the operator-owned engines rather than needing the controller (this is part of the controller-scope open question below). `kubernetes` (Kamaji) is explicitly out. +`postgres` goes first (under the tracking issue `cozystack/cozystack#2814`): it is the engine whose coupling the epic most needs to fix, and converging it — a dedicated key-free `-ca-cert` plus removing `-credentials` from the tenant-facing surface — validates the consume contract on the operator-owned path with no new controller. `kafka` is already there; `mongodb` follows with the same shape adaptation (`cozystack/cozystack#2692`). The cert-manager-minting engines (nats, qdrant, and then rabbitmq, mariadb, opensearch) adopt the extraction controller. `redis` (`cozystack/cozystack#2729`) is a hybrid — it renders a cert-manager chain but its operator fork already publishes a key-free `-ca-cert`, so it may converge by shape adaptation like the operator-owned engines rather than needing the controller (this is part of the controller-scope open question below). `kubernetes` (Kamaji) is explicitly out. ## User-facing changes @@ -146,37 +159,39 @@ A tenant sees one canonical, key-free trust-anchor object per managed applicatio ## Upgrade and rollback compatibility -This document changes nothing at runtime; it records a target. 
The pull requests that implement it are individually backward-compatible: per-app TLS is opt-in (tri-state), external exposure is opt-in, and the CA-distribution controller adds a new object without altering existing Secrets. Reverting any one of them removes the new `-ca-cert` object and the controller that maintains it, leaving the engine's existing PKI untouched. +This document changes nothing at runtime; it records a target. The pull requests that implement it are individually backward-compatible in the engine's own PKI: per-app TLS is opt-in (tri-state), external exposure is opt-in, and the extraction controller adds a new object without altering existing Secrets. One deliberate, security-motivated break: converging postgres removes the key-bearing `-credentials` Secret from the tenant-facing dashboard surface, so a tenant that reads `ca.crt` out of `-credentials` today must switch to the new key-free `-ca-cert`. Reverting any implementing PR removes the new `-ca-cert` object and the controller that maintains it, leaving the engine's existing PKI untouched. ## Security -The trust boundary is precise: a tenant receives `ca.crt` and never receives `tls.key` or `ca.key`. The label-filtered, full-object nature of the `tenantsecrets` projection makes the helper's fail-closed guard load-bearing — any labelled Secret is delivered in full, so the guard is what prevents a private key from leaking. The CA-distribution controller adds a new trust surface: read access to per-release cert-manager CA Secrets and write access for the key-free copy. It must adopt the same foreign-collision guard as the wildcard reconciler, so it never overwrites a Secret it did not create. +The trust boundary is precise: a tenant receives `ca.crt` and never receives `tls.key` or `ca.key`. 
The label-filtered, full-object nature of the `tenantsecrets` projection makes the fail-closed guard load-bearing — any labelled Secret is delivered in full, so removing the key from the *object* (rather than relying on field filtering at projection time) is what prevents a private key from leaking. Two consequences follow. First, postgres' current exposure of the key-bearing `-credentials` to tenants is the live instance of this risk, and convergence closes it. Second, the extraction controller writes at runtime, after chart-render, so it cannot lean on the helper's render-time guard alone: it must whitelist the single `ca.crt` key and re-assert the no-private-key check itself on every write. The controller adds a new trust surface — read access to per-release cert-manager CA Secrets, write access for the key-free copy — and must never overwrite a Secret it did not create, surfacing any name collision as a Warning Event rather than failing silently. ## Failure and edge cases - The helper's input PEM contains a private-key header → render fails closed; the chart does not deploy a key-bearing Secret. -- The per-release CA Secret does not exist yet → the controller waits for the watch event; it does not error or busy-loop. -- A foreign Secret already occupies the target name → the controller leaves it untouched (management-label guard). -- The CA rotates → the controller re-projects `ca.crt` on the next watch event; no chart re-render is required. +- The source `ca.crt` somehow carries a private-key header at runtime → the controller refuses to write the copy (runtime whitelist plus guard); no key-bearing Secret is ever projected. +- The per-release CA Secret does not exist yet → the controller waits for the watch event on the labelled source; it does not error or busy-loop. 
+- A foreign Secret already occupies the target name → the controller leaves it untouched (management-label guard) and emits a Warning Event on the application, so the operator sees the collision. +- The CA rotates → the controller re-copies `ca.crt` on the next source change; no chart re-render is required. +- The application is deleted → the projected `-ca-cert` is garbage-collected by Kubernetes via its owner reference to the application instance; the controller needs no delete path. ## Testing - The helper is already covered by `packages/tests/cozy-lib-tests/tests/tls_cacert_test.yaml`, including the fail-closed assertions. -- The CA-distribution controller gets an envtest/Ginkgo suite modelled on `internal/controller/wildcardsecret/reconciler_test.go`: source appears after the consumer, foreign-collision, rotation, and garbage-collection cases. -- Each per-app pull request adds helm-unittest fixtures asserting the `-ca-cert` shape and label, plus an end-to-end check under `hack/e2e-apps/` that a tenant can read `ca.crt` and verify the server. +- The extraction controller gets an envtest/Ginkgo suite (its skeleton mirrors `internal/controller/wildcardsecret/reconciler_test.go`): a labelled source appearing after the consumer, a source whose `ca.crt` is swapped (rotation), a foreign-name collision (asserting the Secret is untouched and a Warning Event is emitted), a source value that smuggles a private-key header (asserting the controller refuses to write), and owner-reference-driven garbage collection on app deletion. +- Each per-app pull request adds helm-unittest fixtures asserting the `-ca-cert` shape and label, plus an end-to-end check under `hack/e2e-apps/` that a tenant can read `ca.crt`, cannot read any object carrying `tls.key`, and can verify the server. ## Rollout -1. Edge — done (`cozystack/cozystack#2988`, `cozystack/cozystack#2989`, `cozystack/cozystack#2990`). +1. 
Edge — `cozystack/cozystack#2988` and `cozystack/cozystack#2989` merged; per-tenant propagation `cozystack/cozystack#2990` still open. 2. This contract — accepted. -3. CA-distribution controller — implemented and tested. +3. Extraction controller — implemented and tested. 4. Per-app convergence — postgres first (tracked by `cozystack/cozystack#2814`), then the remaining per-app TLS pull requests onto the contract. ## Open questions -- The exact name and namespace convention for the `-ca-cert` object (per-release in the app namespace is assumed here). -- Whether the CA-distribution controller is scoped to the cert-manager-minting engines or generalized into one projector that also normalizes the operator-owned engines onto the canonical shape. -- How this intersects with per-tenant wildcard propagation (issue `cozystack/cozystack#2820`, implemented by PR `cozystack/cozystack#2990`), which solves a structurally similar cross-namespace replication problem and may share the controller mechanism. +- The exact namespace convention for the `-ca-cert` object (per-release in the app namespace is assumed here) and the final label/annotation names. This proposal uses `internal.cozystack.io/publish-ca-cert` on the source and `internal.cozystack.io/tenant-ca` on the projected copy, matching the `internal.cozystack.io/` convention of the existing `tenantresource` and `managed-by-cozystack` markers. +- Whether the extraction controller is scoped to the cert-manager-minting engines, or generalized into one projector that also normalizes the operator-owned engines (postgres, mongodb) onto the canonical shape instead of each chart doing its own shape adaptation. +- How this intersects with per-tenant wildcard propagation (issue `cozystack/cozystack#2820`, implemented by PR `cozystack/cozystack#2990`). 
That is a *cross-namespace* replication problem, and Cozystack already ships `clustersecret-operator` (a namespace-selector `ClusterSecret` CRD) for that class; the extraction controller here is *intra-namespace, per-release*, so the two do not share a mechanism — but they should share the management-label and foreign-collision conventions. ## Alternatives considered @@ -185,6 +200,8 @@ The trust boundary is precise: a tenant receives `ca.crt` and never receives `tl - **A custom `valuesFrom` pointing at `-ca`.** Rejected: `expectedValuesFrom()` pins every application HelmRelease to the single `cozystack-values` Secret and overwrites drift. - **A general-purpose cluster-secret replication operator.** Rejected during the edge work (`cozystack/cozystack#2990`) in favour of a purpose-built reconciler with a tight ownership guard. - **A copy-issuer webhook.** Rejected during the wildcard work (`cozystack/cozystack#2812`) in favour of native references that move only the Secret name, never key material. +- **A name-convention CA-distribution controller** (the controller watches `-ca` because its name is deterministic, and carries its own marking, garbage-collection, and collision guard). Rejected in favour of §5's explicit source label plus lineage reuse: the name convention is an implicit contract the next chart author silently breaks, and the marking/GC it re-implements is already provided by `spec.secrets` and owner references. +- **A field filter on the projection itself (the principled root fix).** Today `tenantsecrets` delivers the whole Secret `Data`, which is the only reason a key-free *copy* must be materialised at all. Giving the projection (or the `spec.secrets` selector) a per-key field filter would let a single `ca.crt` key be projected straight out of a key-bearing Secret — no extraction controller, no second object — attacking the CA / private-key coupling of `cozystack/cozystack#2814` at its source and generalizing to the operator-owned engines. 
Deferred as a larger change: the projection is whole-`Data` today (`pkg/registry/core/tenantsecret/rest.go`), so it touches the tenant-facing API surface rather than a single controller. Recorded as the direction the extraction controller is a pragmatic first step toward. --- From b50a3cb1c2973e545f7b1019d56dcdff540bd6a6 Mon Sep 17 00:00:00 2001 From: Aleksei Sviridkin Date: Fri, 26 Jun 2026 17:26:59 +0300 Subject: [PATCH 3/5] docs(unified-tls-pki): reframe interior model as a uniform delivery gap State the interior contract precisely against the current charts: no managed engine hands a private key to tenants today. postgres exposes only a passwords Secret, and every merged engine withholds its key-bearing CA, so the interior problem is a uniform trust-anchor delivery gap (the tenant never receives ca.crt), not a CA/private-key coupling. Recast the Security section as a preventive invariant and note that the live risk is in the in-flight per-app PRs that propose labelling a key-bearing Secret to tenants. Recast ca.crt delivery as one engine-agnostic, label-driven extraction controller serving every engine whose operator does not self-publish a key-free CA (kafka and the redis fork opt out), keyed on a source label and secret content rather than mint-vs-consume or a name convention. Add the per-engine operator-capability matrix, including mongodb's key-bearing ca-cert and the CloudNativePG/Strimzi asymmetry; the trust-manager, ClusterTrustBundle, and field-filter alternatives with rationale; and the open question of labelling operator-created CA secrets. 
Assisted-By: Claude Signed-off-by: Aleksei Sviridkin --- design-proposals/unified-tls-pki/README.md | 111 ++++++++++++--------- 1 file changed, 65 insertions(+), 46 deletions(-) diff --git a/design-proposals/unified-tls-pki/README.md b/design-proposals/unified-tls-pki/README.md index 667d5f1..c2e0b6b 100644 --- a/design-proposals/unified-tls-pki/README.md +++ b/design-proposals/unified-tls-pki/README.md @@ -10,7 +10,7 @@ Certificate handling in Cozystack grew out of four uncoordinated mechanisms: per-application issuance inside each chart, per-host ACME on the default ingress path, an opt-in Gateway API path that mints a wildcard via DNS-01, and no supported way to bring an externally-issued wildcard. The epic `cozystack/cozystack#2811` set out to converge them, and the edge of that convergence has largely landed. What has not landed — and what this proposal exists to pin down before it does — is the part that lives *inside* the engines: who owns the PKI for each managed application, and how a tenant receives the trust anchor it needs to verify a TLS connection. -This proposal records a written target design for the whole model, so that the two architectural forks at its center are decided on paper rather than discovered one pull request at a time. The first fork is the issuance abstraction: rather than "cert-manager as the single issuer", the target is **a single operator-facing interface to choose the certificate source, plus a uniform contract for consuming `ca.crt`**. The second fork is mint-versus-consume: rather than forcing every application to stop minting and consume a central certificate, the target is an explicitly **two-tier** model — cert-manager (or a bring-your-own wildcard) at the edge, and operator-owned PKI inside the engines — where what is unified is the *consume contract*, not the certificate authority itself. 
+This proposal records a written target design for the whole model, so that the two architectural forks at its center are decided on paper rather than discovered one pull request at a time. The first fork is the issuance abstraction: rather than "cert-manager as the single issuer", the target is **a single operator-facing interface to choose the certificate source, plus a uniform contract for consuming `ca.crt`**. The second fork is mint-versus-consume: rather than forcing every application to stop minting and consume a central certificate, the target is an explicitly **two-tier** model — cert-manager (or a bring-your-own wildcard) at the edge, and operator-owned PKI inside the engines — where what is unified is the *consume contract*, not the certificate authority itself. The second fork is the answer to the open question the epic body still carries as a goal ("should applications stop minting certs entirely and only consume cert-manager output?"): no — two-tier, because forcing pure-consume breaks the engines that rotate their own CA. ## Scope and related proposals @@ -18,8 +18,8 @@ This proposal is the umbrella design for the work tracked by epic `cozystack/coz - **Edge, merged:** `cozystack/cozystack#2988` (ACME wildcard on the default ingress-nginx path), `cozystack/cozystack#2989` (the CA-only trust-anchor helper). - **Edge, open:** `cozystack/cozystack#2990` (propagate the operator wildcard to per-tenant termination points — the PR implementing issue `cozystack/cozystack#2820`). -- **Workstreams (issues):** `cozystack/cozystack#2812` and `cozystack/cozystack#2400` (closed, edge wildcard); `cozystack/cozystack#2814` (converge per-app TLS and fix the CA / private-key coupling — the first consumer of this contract); `cozystack/cozystack#2815` (external DB exposure via Gateway TLS-passthrough); `cozystack/cozystack#2816` (end-to-end TLS for databases); `cozystack/cozystack#2977` (opt-in east-west encryption). 
Throughout this document a `cozystack/cozystack#NNNN` reference is an issue unless called out as a PR. -- **Per-app TLS series (open):** `cozystack/cozystack#2729` (redis), `cozystack/cozystack#2692` (mongodb), `cozystack/cozystack#2683` (rabbitmq), `cozystack/cozystack#2682` (opensearch), `cozystack/cozystack#2680` (mariadb). These are the pull requests that should land *after* this contract is accepted, not before. +- **Workstreams (issues):** `cozystack/cozystack#2812` and `cozystack/cozystack#2400` (closed, edge wildcard); `cozystack/cozystack#2814` (converge per-app TLS and close the trust-anchor delivery gap — the first consumer of this contract); `cozystack/cozystack#2815` (external DB exposure via Gateway TLS-passthrough); `cozystack/cozystack#2816` (end-to-end TLS for databases); `cozystack/cozystack#2977` (opt-in east-west encryption). Throughout this document a `cozystack/cozystack#NNNN` reference is an issue unless called out as a PR. +- **Per-app TLS series (open):** `cozystack/cozystack#2729` (redis), `cozystack/cozystack#2692` (mongodb), `cozystack/cozystack#2683` (rabbitmq), `cozystack/cozystack#2682` (opensearch), `cozystack/cozystack#2680` (mariadb). These are the pull requests that should land *after* this contract is accepted, not before; several currently propose handing tenants a key-bearing Secret, which this contract exists to correct (see Security). All repository paths below refer to the `cozystack/cozystack` repository; paths attributed to an open PR (for example the wildcard-secret reconciler in PR `cozystack/cozystack#2990`) are not yet on `main`. @@ -31,19 +31,23 @@ System ingresses (dashboard, grafana, keycloak, harbor) get a per-host certifica ### Interior today -The managed engines fall into two classes, verified against the current charts. +The managed engines fall into two classes by *who mints*, but — and this is the load-bearing correction — that axis is **not** the axis that decides who needs a delivery mechanism. 
Verified against the current charts on `main` and the open per-app PRs. -The first class **mints** its own chain from a chart-rendered cert-manager graph (self-signed Issuer → CA Certificate → CA Issuer → leaf Certificate). `nats` and `qdrant` do this in `main` (`packages/apps/nats/templates/certmanager.yaml`, `packages/apps/qdrant/templates/certmanager.yaml`); the open per-app pull requests for redis, rabbitmq, mariadb, and opensearch add the same shape. In this class the CA Secret (`-ca`) is a cert-manager CA-certificate Secret and therefore **carries the private key**, and the `ca.crt` is not delivered to tenants at all today. +The first class **mints** its own chain from a chart-rendered cert-manager graph (self-signed Issuer → CA Certificate → CA Issuer → leaf Certificate). `nats` and `qdrant` do this in `main` (`packages/apps/nats/templates/certmanager.yaml`, `packages/apps/qdrant/templates/certmanager.yaml`); the open per-app pull requests for redis, rabbitmq, mariadb, and opensearch add the same shape. In this class the CA Secret (`-ca` or `-ca-tls`) is a cert-manager CA-certificate Secret and therefore **carries the private key**, and the `ca.crt` is not delivered to tenants at all today. -The second class **consumes** PKI that its operator owns end-to-end, but the two merged engines in it sit on opposite sides of the trust-anchor contract. `postgres` (CloudNativePG) renders no cert-manager objects; the operator auto-generates a self-signed CA and signs the server certificate, and the `ca.crt` is present in every `-credentials` Secret (`packages/apps/postgres/templates/db.yaml`). That same Secret also carries the server `tls.key`, and it is the object surfaced to tenants through the dashboard resource map (`packages/apps/postgres/templates/dashboard-resourcemap.yaml`) — so today postgres delivers the trust anchor and the private key coupled in a single tenant-facing object, which is precisely the coupling `cozystack/cozystack#2814` exists to break. 
`kafka` (Strimzi) is the **clean** reference for the *consume* contract: it exposes `-cluster-ca-cert` and `-clients-ca-cert`, each a CA-certificate-only object with no private key (`packages/apps/kafka/templates/dashboard-resourcemap.yaml`). `mongodb` (Percona PSMDB) is operator-owned, but a tenant-facing `ca.crt` path is still in flight under `cozystack/cozystack#2692`, not in `main`. +The second class **consumes** PKI that its operator owns end-to-end — but, crucially, operator-owned does **not** imply already-converged. Only one engine in it actually delivers a key-free trust anchor today: -One engine is deliberately outside this model. `kubernetes` (Kamaji) owns the control-plane CA, it is not swappable, and the kubeconfig pins that cluster CA — a public edge certificate is meaningless there. It needs no unification and is excluded. +- `kafka` (Strimzi) is the **clean** reference: it exposes `-cluster-ca-cert` and `-clients-ca-cert`, each a CA-certificate-only object with no private key (`packages/apps/kafka/templates/dashboard-resourcemap.yaml`). Strimzi can serve a public certificate on its external listener while keeping internal broker mTLS on its own CA (`generateCertificateAuthority: false` + per-listener `brokerCertChainAndKey`). +- `postgres` (CloudNativePG) renders no cert-manager objects; the CNPG operator auto-generates a self-signed CA and signs the server certificate. The CA lives in the operator-created `-ca` Secret, which **carries `ca.key`** and is created **asynchronously**; it is **not** delivered to tenants. The only tenant-facing object is `-credentials` (`packages/apps/postgres/templates/init-script.yaml`), which is a chart-rendered Opaque Secret holding **only `user: password` pairs** — no `ca.crt`, no `tls.key` — surfaced through the dashboard resource map (`packages/apps/postgres/templates/dashboard-resourcemap.yaml`). 
So postgres is a **trust-anchor delivery gap** (the tenant gets passwords but never `ca.crt`), the same shape as nats/qdrant — *not* a private-key coupling. The comment at `packages/apps/postgres/templates/db.yaml:20-22` claiming `ca.crt` rides in `-credentials` contradicts what the chart actually renders and is a documentation bug. +- `mongodb` (Percona PSMDB) is operator-owned, but the PSMDB operator mints a cert-manager chain and publishes `-ca-cert` as a **key-bearing** cert-manager `isCA` Secret (`tls.crt` + `tls.key` + `ca.crt`). Same gap as the rest, plus a trap: the name `-ca-cert` is key-free for the redis fork and key-bearing for PSMDB — opposite shapes under one name. + +One engine is deliberately outside this model. `kubernetes` (Kamaji) owns the control-plane CA, it is not swappable, and the kubeconfig pins that cluster CA — a public edge certificate is meaningless there. It is the lone "private-CA-mandatory" case, needs no unification, and is excluded. ### Platform mechanisms this proposal builds on -- **The CA-only helper.** `cozy-lib.tls.caCertSecret` (`packages/library/cozy-lib/templates/_tls.tpl`, from `cozystack/cozystack#2989`) renders a Secret containing only `ca.crt`. It fails closed if the input PEM contains any private-key header, and it always stamps the label `internal.cozystack.io/tenantresource: "true"`. It is covered by `packages/tests/cozy-lib-tests/tests/tls_cacert_test.yaml`. -- **Why the helper is needed.** The CA+leaf chain is rendered per-app by each chart's own cert-manager graph (for example `packages/apps/nats/templates/certmanager.yaml`), and the resulting CA Secret (`-ca`) carries `ca.key` — so it is not itself a `ca.crt`-only object and cannot be handed to a tenant. The shared library carries no CA-rendering chart; the only TLS helper there is `cozy-lib.tls.caCertSecret` above. 
-- **The tenant projection.** Secrets carrying the `internal.cozystack.io/tenantresource` label are exposed to tenants as the virtual resource `core.cozystack.io/tenantsecrets` (`pkg/registry/core/tenantsecret/rest.go`; the label constant lives in `pkg/apis/core/v1alpha1/tenantresource_types.go`; the RBAC grant is on the virtual resource, not on raw `core/v1` Secrets — `packages/system/cozystack-basics/templates/clusterroles.yaml`). The projection is **label-filtered, not field-filtered**: the entire Secret `Data` is delivered. This is the security pivot of the whole model — a labelled Secret must contain only safe material. +- **The CA-only helper.** `cozy-lib.tls.caCertSecret` (`packages/library/cozy-lib/templates/_tls.tpl`, from `cozystack/cozystack#2989`) renders a Secret containing only `ca.crt`. It fails closed if the input PEM contains any private-key header, and it always stamps the label `internal.cozystack.io/tenantresource: "true"`. It is covered by `packages/tests/cozy-lib-tests/tests/tls_cacert_test.yaml`. The merged PR frames the situation it addresses as exactly this — a delivery gap: "the cert-manager apps grant tenants no access to those Secrets at all, so a tenant currently cannot obtain `ca.crt`". +- **Why the helper is needed but not sufficient.** The CA+leaf chain is rendered per-app by each chart's own cert-manager graph (for example `packages/apps/nats/templates/certmanager.yaml`), and the resulting CA Secret carries `ca.key` — so it is not itself a `ca.crt`-only object and cannot be handed to a tenant. The helper fixes the *output shape* (a key-free `ca.crt` Secret) but takes the CA cert as a Helm value, and nothing in the current architecture can feed it that value for an asynchronously-issued, per-release CA (see "The problem"). 
+- **The tenant projection.** Secrets carrying the `internal.cozystack.io/tenantresource` label are exposed to tenants as the virtual resource `core.cozystack.io/tenantsecrets` (`pkg/registry/core/tenantsecret/rest.go`; the label constant lives in `pkg/apis/core/v1alpha1/tenantresource_types.go`; the RBAC grant is on the virtual resource, not on raw `core/v1` Secrets — `packages/system/cozystack-basics/templates/clusterroles.yaml`). The projection is **label-filtered, not field-filtered**: the entire Secret `Data` is delivered (`secretToTenant` copies `sec.Data` whole). This is the security pivot of the whole model — a labelled Secret must contain only safe material. - **The label's authority is the lineage webhook, not the chart.** The `tenantresource` label is not honoured just because something stamped it. The lineage admission webhook (`internal/lineagecontrollerwebhook/webhook.go`) is its authority: it walks a Secret's `ownerReferences` to the owning application and sets `tenantresource` to `true` or `false` from that application's `spec.secrets` selector on **every** admission. A statically-stamped label therefore does not survive on its own — any Secret meant to be tenant-visible must also match a `spec.secrets.include` entry, or the webhook overwrites the label to `false`. This is why the consume contract below marks the CA object through `spec.secrets` (by label), not by stamping the label alone. - **The values channel.** Global values ride a `cozystack-values` Secret under `_cluster.*` keys, injected into every application HelmRelease via `valuesFrom`. @@ -51,22 +55,22 @@ One engine is deliberately outside this model. `kubernetes` (Kamaji) owns the co The epic's original headline — "consume not mint, cert-manager as the single issuance abstraction" — is not realizable as stated, for two reasons. -First, there is no written contract for the interior. Nothing records, per engine, who owns the PKI and how `ca.crt` reaches the tenant. 
As a result the per-app TLS pull requests each re-derive the answer, and the answer differs between them. +First, there is no written contract for the interior. Nothing records, per engine, who owns the PKI and how `ca.crt` reaches the tenant. As a result the per-app TLS pull requests each re-derive the answer, and the answer differs between them — to the point that some now propose projecting a key-bearing Secret to tenants (see Security). -Second, the cert-manager-minting engines (nats, qdrant, and the four open per-app pull requests) have **no path to consume a `ca.crt` via the helper today**, because of three compounding constraints that are real and verified: +Second, most engines have **no path to deliver a key-free `ca.crt` today**. The cert-manager-minting engines (nats, qdrant, and the four open per-app pull requests) cannot consume the helper because of three compounding constraints that are real and verified — and the operator-owned engines whose operator emits a key-bearing CA (postgres/CNPG, mongodb/PSMDB) hit the same wall, because their CA Secret is also key-bearing and (for CNPG) asynchronous: - `valuesFrom` is pinned. `expectedValuesFrom()` (`internal/controller/applicationdefinition_helmreconciler.go:99-107`) hardcodes a single `{Kind: Secret, Name: cozystack-values}` reference, and the reconciler overwrites any drift. An application chart cannot add a sideways `valuesFrom` pointing at its own `-ca`. - `lookup` cannot drive it. PR `cozystack/cozystack#1787` moved the global-values channel off `lookup` onto `valuesFrom`; `lookup` itself is still available and is used by several charts to read a pre-existing per-release Secret. But it runs at template-render time and is invisible to the Flux digest, so a chart that reads an asynchronously-created Secret via `lookup` does not re-render when that Secret appears — it would need a manual `helm upgrade`. 
-- The per-release CA is created **asynchronously** by cert-manager, so it does not exist at template-render time at all. +- The per-release CA is created **asynchronously** by cert-manager (or by the CNPG/PSMDB operator), so it does not exist at template-render time at all. So the helper, by itself, closes the *output* shape (a key-free `ca.crt` Secret) but not the *input* path (where the chart gets that `ca.crt` for an asynchronously-issued, per-release CA). That gap is the substance of this proposal. ## Goals - Record **two-tier** as the target architecture: edge issuance plus interior operator-owned PKI. -- Define a per-engine **PKI-ownership contract**: for each managed engine, who owns the CA, whether the engine mints or consumes, which Secret carries `ca.crt`, which carries the key, and whether the tenant sees it. +- Define a per-engine **PKI-ownership contract**: for each managed engine, who owns the CA, whether the engine mints or consumes, which Secret carries `ca.crt`, which carries the key, whether the operator already self-publishes a key-free CA, and whether the tenant sees it. - Define a single **consume contract**: a `ca.crt`-only object, stamped with the tenant-resource label, delivered through the existing projection. Kafka's CA-certificate-only Secret is the reference shape. -- Define a **delivery mechanism** for `ca.crt` on the engines where the chart cannot wire the helper directly. +- Define a **delivery mechanism** for `ca.crt` on the engines whose operator does not already emit a key-free CA object. - Decide **mint-versus-consume explicitly, per engine**, rather than as one global rule. ### Non-goals @@ -109,19 +113,23 @@ The edge is largely done and is recorded here only to fix the framing. An operat ### 3. Interior PKI-ownership contract -The contract is a per-engine table. 
It is the artifact the per-app pull requests must conform to, and it makes the two-tier reality explicit: the operator owns the CA; the platform unifies only how `ca.crt` is consumed. +The contract is a per-engine table. It is the artifact the per-app pull requests must conform to, and it makes the two-tier reality explicit: the operator owns the CA; the platform unifies only how `ca.crt` is consumed. The decisive column is **"self-publishes a key-free `ca.crt`?"** — that, not "mint vs consume", is what determines whether an engine needs a delivery mechanism. -| Engine | PKI owner | Mint / consume | CA-bearing Secret today | Key in that Secret? | `ca.crt` delivered to tenant today? | -| --- | --- | --- | --- | --- | --- | -| postgres (CloudNativePG) | operator | consume | `-credentials` | **yes — `tls.key`** | coupled — `ca.crt` only via the key-bearing `-credentials`, exposed to tenants today (the `#2814` bug) | -| kafka (Strimzi) | operator | consume | `-cluster-ca-cert`, `-clients-ca-cert` | no | yes — clean reference shape | -| mongodb (Percona PSMDB) | operator | consume | none in `main` (PR `cozystack/cozystack#2692`) | n/a | no (in flight) | -| nats | chart + cert-manager | mint | `-ca` | yes | no | -| qdrant | chart + cert-manager | mint | `-ca` | yes | no | -| redis, rabbitmq, mariadb, opensearch | chart + cert-manager (open PRs) | mint | in PR branches | yes | varies | -| kubernetes (Kamaji) | Kamaji | internal only | Kamaji-owned, not swappable | n/a | no — out of model | +| Engine | PKI owner | CA-bearing Secret today | Key in that Secret? | Self-publishes key-free `ca.crt`? | `ca.crt` to tenant today? 
| Operator certificate capability | +| --- | --- | --- | --- | --- | --- | --- | +| kafka (Strimzi) | operator | `-cluster-ca-cert`, `-clients-ca-cert` | no | **yes** | yes — clean reference | `generateCertificateAuthority: false` BYO-CA + per-listener `brokerCertChainAndKey`; public cert on external listener, own-CA mTLS internally | +| redis (forked operator) | operator fork | `-ca-tls` (chart cert-manager) | yes | **yes** → operator emits `-ca-cert` | yes | forked `redis-operator` `spec.tls.caCertSecretName` outputs a key-free `ca.crt`-only Opaque Secret (v1.4.0+) | +| postgres (CloudNativePG) | operator | `-ca` (operator-created, async) | **yes — `ca.key`** | no | no — tenant gets passwords-only `-credentials` (gap) | `serverCASecret` / `serverTLSSecret` BYO; single server cert, no edge/internal split | +| mongodb (Percona PSMDB) | operator | `-ca-cert` (operator-created) | **yes — cert-manager `isCA`** | no | no (gap; name collides with redis, opposite shape) | `spec.tls.issuerConf` BYO issuer; auto-mints a cert-manager chain | +| nats | chart + cert-manager | `-ca` | yes | no | no | — | +| qdrant | chart + cert-manager | `-ca` | yes | no | no | — | +| mariadb (open PR) | chart + cert-manager | `-ca-tls` | yes | no | no | mariadb-operator `serverCertSecretRef` / `serverCASecretRef` / `serverCertIssuerRef` | +| opensearch (open PR) | chart + cert-manager | `-http-ca` | yes | no | no | opster `security.tls.http.secret.name`; transport operator-generated | +| rabbitmq (open PR) | chart + cert-manager | `-ca` | yes | no | no | cluster-operator `spec.tls.secretName` / `caSecretName` | +| clickhouse (Altinity) | — (BYO-cert) | supplied cert mount | n/a | no | no (not in TLS series yet) | mounts a supplied cert; no native issuance | +| kubernetes (Kamaji) | Kamaji | Kamaji-owned, not swappable | n/a | n/a | no — out of model | control-plane CA, kubeconfig-pinned, non-swappable | -The takeaway: only **kafka** matches the target today — it delivers `ca.crt` with no 
key. **postgres does not.** Its `ca.crt` rides inside the `-credentials` Secret, which also holds the server `tls.key`, and that same key-bearing Secret is the one surfaced to tenants through the dashboard resource map (`packages/apps/postgres/templates/dashboard-resourcemap.yaml`). So postgres is not an already-converged engine — it is the canonical instance of the CA / private-key coupling that `cozystack/cozystack#2814` exists to fix, and convergence for postgres means publishing a dedicated key-free `-ca-cert` *and* removing `-credentials` from the tenant-facing surface. The cert-manager-minting engines are the other gap: their `-ca` carries the private key and the trust anchor never reaches the tenant at all. +The takeaway: only **kafka** (native) and **redis** (forked operator) are at the target today — they self-publish a key-free `ca.crt`. **Every other engine has a key-bearing CA Secret and delivers no `ca.crt` to the tenant** — one uniform delivery gap. The gap does **not** split along mint-versus-consume: postgres and mongodb are operator-owned, yet their CA Secret is key-bearing (and operator-created, hence asynchronous), so they need the same key-free projection as the cert-manager-minting engines. The cert-manager-minting engines additionally never reach the tenant at all. ### 4. The uniform consume contract @@ -129,21 +137,25 @@ Every engine, regardless of who owns its CA, exposes its trust anchor through on This is where the label-filtered projection matters. Because `tenantsecrets` delivers the whole Secret `Data`, the helper's fail-closed guard is not a nicety — it is the boundary that keeps a server or CA private key out of a tenant's hands. Kafka's `-clients-ca-cert` is the shape to match: a CA certificate, no key, readable by the tenant. -### 5. Delivery: closing the cert-manager-engine gap +### 5. Delivery: one engine-agnostic extraction controller + +The contract in §4 fixes the output object. 
The remaining work is the input path, and it splits on a single question — **does the engine's operator already emit a key-free `ca.crt` object?** -The contract in §4 fixes the output object. The remaining work is the input path, and it splits by engine class. +- **Engines that self-publish a key-free CA** need no platform machinery. `kafka` (Strimzi, native) and `redis` (forked operator via `caCertSecretName`) are here: they converge by matching the canonical name and stamping the tenant-resource label. They opt **out** of the controller below simply by not stamping the source label. +- **Every other engine** — nats, qdrant, mariadb, opensearch, rabbitmq (cert-manager chains, chart-rendered, key-bearing CA) and mongodb, postgres (operator-rendered, key-bearing CA, asynchronous) — has a key-bearing CA Secret and no key-free output. None can feed the helper at chart-render time (see "The problem"). These opt **in** via an explicit source label and are served by **one engine-agnostic extraction controller**. -For **operator-owned engines**, the operator already publishes CA material synchronously enough to consume. `kafka` is already key-free and needs no change. `postgres` is the coupling case from §3: convergence means publishing a dedicated key-free `-ca-cert` (extracted from the `ca.crt` the CNPG operator already materialises) **and** dropping the key-bearing `-credentials` from its dashboard resource map, so the trust anchor reaches the tenant without the server `tls.key`. `mongodb` adds the same `-ca-cert` shape (`cozystack/cozystack#2692`). No new controller is required for this class. +The alternative to one controller is to fork each upstream operator the way redis was forked (adding a `caCertSecretName`-style key-free output). Forking CloudNativePG, PSMDB, mariadb-operator, opster, and the rabbitmq cluster-operator is far more surface to own than a single small controller, and redis's fork was a reaction to a review blocker, not a designed-up-front pattern. 
So the controller is the target for every non-self-publishing engine. -For **cert-manager-minting engines** (nats, qdrant, and the four open per-app pull requests), none of the chart-time paths work: `valuesFrom` is pinned, `lookup` cannot trigger a re-render when the CA appears, and the CA is asynchronous (see "The problem"). These need a small **extraction controller** — and it can be far smaller than a generic replicator, because the platform's lineage machinery already provides the marking, ownership, and garbage collection a replicator would otherwise re-implement. The contract is three pieces, only the middle of which is new code. +The controller is deliberately **engine-agnostic**: it does not branch on cert-manager-vs-operator-owned, and it does not key off Secret names. Its contract is three pieces, only the middle of which is new code. -**(a) An explicit source-selection label, not a name convention.** The chart stamps `internal.cozystack.io/publish-ca-cert: "true"` on the CA-bearing Secret it wants extracted, with an optional `internal.cozystack.io/publish-ca-cert-key` annotation naming the key to lift (default `ca.crt`). For the cert-manager-minting charts this rides `Certificate.spec.secretTemplate.labels`, so even the asynchronously-created `-ca` Secret carries the marker from the moment cert-manager writes it. This replaces "the controller watches `-ca` because the name is deterministic" — a brittle implicit contract the next chart author silently breaks by naming their CA Secret anything else — with an explicit opt-in the source declares. +**(a) An explicit source-selection label, engine-agnostic, not a name convention.** The owner of the CA-bearing Secret stamps `internal.cozystack.io/publish-ca-cert: "true"`, with an optional `internal.cozystack.io/publish-ca-cert-key` annotation naming the key to lift (default `ca.crt`). 
For the cert-manager-minting charts this rides `Certificate.spec.secretTemplate.labels`, so even the asynchronously-created CA Secret carries the marker from the moment cert-manager writes it. The label, not the name, is the contract — which matters because the CA-bearing Secret names are non-uniform (`-ca`, `-ca-tls`, `-http-ca`, `-ca-cert`) and the name `-ca-cert` is overloaded: key-free for the redis fork, key-bearing for PSMDB. A name convention would mis-handle one of them; the label plus a content check does not. -**(b) A small extraction controller.** Its watch/upsert skeleton can follow the wildcard-secret reconciler from PR `cozystack/cozystack#2990` (`internal/controller/wildcardsecret/reconciler.go`), but it carries none of that reconciler's copy-marking or prune logic — lineage provides those (part (c)). For each label-selected source Secret it upserts a `type: Opaque` Secret named `-ca-cert` containing **only** `ca.crt`, re-copying on every source change so a CA rotation propagates without a chart re-render. It does three security-load-bearing things and nothing more: +**(b) A small extraction controller.** Its watch/upsert skeleton can follow the wildcard-secret reconciler from PR `cozystack/cozystack#2990` (`internal/controller/wildcardsecret/reconciler.go`), but it carries none of that reconciler's copy-marking or prune logic — lineage provides those (part (c)) — and, being **intra-namespace**, it can do something that reconciler cannot. For each label-selected source Secret it upserts a `type: Opaque` Secret named `-ca-cert` containing **only** `ca.crt`, re-copying on every source change so a CA rotation propagates without a chart re-render. 
It does four security-load-bearing things and nothing more: -- **Sanitize at write time, not just render time.** The `cozy-lib.tls.caCertSecret` helper's fail-closed guard runs at *chart-render* time; this controller writes at *runtime*, so it must itself copy only the single `ca.crt` key (an explicit whitelist) and re-assert that the value carries no `-----BEGIN … PRIVATE KEY-----` header before writing. It never copies the whole `Data`. -- **Owner-ref the projected Secret to the application instance CR**, resolved from the `app.kubernetes.io/instance` label already on the source Secret — *not* to the source `-ca` Secret. cert-manager does not own its output Secrets, so a `-ca` Secret typically has no `ownerReferences` and an ownership-graph walk from it dead-ends before reaching the app. Pointing the projected Secret one hop from the app root makes Kubernetes garbage-collect it on app deletion for free, which is why the controller needs no prune logic of its own. -- **Refuse to touch a foreign Secret, and say so.** If the target name is already occupied by a Secret the controller did not create, it leaves it untouched (a management-label guard) **and emits a Kubernetes Warning Event** on the application — a silent skip would otherwise surface only as an unexplained TLS-verification failure for the tenant, which is exactly the kind of dead end a low-skill operator cannot diagnose. +- **Key off the label and the content, never the name.** Select sources by the `publish-ca-cert` label; before writing, verify the lifted value is a CA certificate and carries no `-----BEGIN … PRIVATE KEY-----` header. This is what lets one controller serve a key-free redis source it should ignore and a key-bearing PSMDB source it must strip, both named `-ca-cert`-adjacent, without confusion. +- **Tolerate operator-created, asynchronous sources.** The source may be created by the CNPG or PSMDB operator after the chart renders (not just by cert-manager). 
The controller waits on the watch event for the labelled source; it does not assume the chart owns the Secret and does not error or busy-loop before the source exists. +- **Sanitize at write time, not just render time.** The `cozy-lib.tls.caCertSecret` helper's fail-closed guard runs at *chart-render* time; this controller writes at *runtime*, so it must itself copy only the single `ca.crt` key (an explicit whitelist) and re-assert the no-private-key check on every write. It never copies the whole `Data`. +- **Owner-ref the projected Secret to the application instance CR**, resolved from the `app.kubernetes.io/instance` label already on the source Secret. This is where intra-namespace beats the `cozystack/cozystack#2990` reconciler, which tracks its replicas by a management label plus a back-reference annotation *because* a cross-namespace replica cannot carry a valid `ownerReference`. Our projection is same-namespace, so a real owner-reference is valid: Kubernetes garbage-collects the `-ca-cert` on app deletion for free, and the controller needs no prune logic of its own. (It does not owner-ref the *source* `-ca`: cert-manager and the DB operators do not own their output Secrets, so a walk from the source would dead-end before the app.) **(c) Marking stays in `spec.secrets`, by label not by name.** Because `ApplicationDefinition.spec.secrets.include` accepts a label selector (`internal/lineagecontrollerwebhook/matcher.go`), one generic entry — `matchLabels: {internal.cozystack.io/tenant-ca: "true"}`, stamped by the controller on every projected Secret — covers every engine with no per-release `resourceName` templating. The lineage admission webhook then does the rest: on admission it walks the projected Secret's `ownerReferences` to the owning application and authoritatively stamps `internal.cozystack.io/tenantresource` to `true` (and to `false` should the Secret ever stop matching). 
@@ -151,7 +163,7 @@ What this *reuses* rather than rebuilds: the label selector in `ApplicationDefin ### 6. Per-engine application order -`postgres` goes first (under the tracking issue `cozystack/cozystack#2814`): it is the engine whose coupling the epic most needs to fix, and converging it — a dedicated key-free `-ca-cert` plus removing `-credentials` from the tenant-facing surface — validates the consume contract on the operator-owned path with no new controller. `kafka` is already there; `mongodb` follows with the same shape adaptation (`cozystack/cozystack#2692`). The cert-manager-minting engines (nats, qdrant, and then rabbitmq, mariadb, opensearch) adopt the extraction controller. `redis` (`cozystack/cozystack#2729`) is a hybrid — it renders a cert-manager chain but its operator fork already publishes a key-free `-ca-cert`, so it may converge by shape adaptation like the operator-owned engines rather than needing the controller (this is part of the controller-scope open question below). `kubernetes` (Kamaji) is explicitly out. +`postgres` goes first (under the tracking issue `cozystack/cozystack#2814`): it is the engine the epic most wants converged, and it exercises the controller against the hardest input — an **operator-created, asynchronous** CA Secret — so validating it there validates the mechanism everywhere. Convergence for postgres means publishing a key-free `-ca-cert` extracted from CNPG's key-bearing `-ca`, while leaving the passwords-only `-credentials` untouched. `kafka` is already at the target and needs only documentation. `redis` (`cozystack/cozystack#2729`) converges by shape adaptation: its forked operator self-publishes a key-free `-ca-cert` via `caCertSecretName`, so it does not stamp the source label and the controller leaves it alone. `mongodb`, then the cert-manager-minting engines (nats, qdrant, and then rabbitmq, mariadb, opensearch) adopt the controller via the source label. `kubernetes` (Kamaji) is explicitly out. 
## User-facing changes @@ -159,17 +171,22 @@ A tenant sees one canonical, key-free trust-anchor object per managed applicatio ## Upgrade and rollback compatibility -This document changes nothing at runtime; it records a target. The pull requests that implement it are individually backward-compatible in the engine's own PKI: per-app TLS is opt-in (tri-state), external exposure is opt-in, and the extraction controller adds a new object without altering existing Secrets. One deliberate, security-motivated break: converging postgres removes the key-bearing `-credentials` Secret from the tenant-facing dashboard surface, so a tenant that reads `ca.crt` out of `-credentials` today must switch to the new key-free `-ca-cert`. Reverting any implementing PR removes the new `-ca-cert` object and the controller that maintains it, leaving the engine's existing PKI untouched. +This document changes nothing at runtime; it records a target. The pull requests that implement it are individually backward-compatible in the engine's own PKI: per-app TLS is opt-in (tri-state), external exposure is opt-in, and the extraction controller adds a new object without altering existing Secrets. postgres convergence is purely additive — it publishes a new key-free `-ca-cert` and leaves the passwords-only `-credentials` in place (removing it would drop the tenant's connection passwords). Reverting any implementing PR removes the new `-ca-cert` object and the controller that maintains it, leaving the engine's existing PKI untouched. ## Security -The trust boundary is precise: a tenant receives `ca.crt` and never receives `tls.key` or `ca.key`. The label-filtered, full-object nature of the `tenantsecrets` projection makes the fail-closed guard load-bearing — any labelled Secret is delivered in full, so removing the key from the *object* (rather than relying on field filtering at projection time) is what prevents a private key from leaking. Two consequences follow. 
First, postgres' current exposure of the key-bearing `-credentials` to tenants is the live instance of this risk, and convergence closes it. Second, the extraction controller writes at runtime, after chart-render, so it cannot lean on the helper's render-time guard alone: it must whitelist the single `ca.crt` key and re-assert the no-private-key check itself on every write. The controller adds a new trust surface — read access to per-release cert-manager CA Secrets, write access for the key-free copy — and must never overwrite a Secret it did not create, surfacing any name collision as a Warning Event rather than failing silently. +The trust boundary is precise: a tenant receives `ca.crt` and never receives `tls.key` or `ca.key`. The label-filtered, full-object nature of the `tenantsecrets` projection makes this a **preventive invariant** rather than the fix of a live leak: because any labelled Secret is delivered in full, the platform's standing rule must be that no key-bearing Secret is ever labelled tenant-facing — and the way to honour it is to remove the key from the *object* (a separate `ca.crt`-only Secret), not to rely on field filtering at projection time. + +No merged engine leaks a private key to a tenant today. The merged charts withhold their key-bearing Secrets, and the tenant-facing objects (postgres `-credentials`, the nats/qdrant credentials Secrets) are passwords-only. The live instance of the risk is not in `main` — it is in the **in-flight per-app PRs**, several of which currently propose labelling a key-bearing Secret to tenants: mariadb (`cozystack/cozystack#2680`) projects the key-bearing CA Secret `-ca-tls`, mongodb (`cozystack/cozystack#2692`) projects the key-bearing `-ca-cert`, and rabbitmq (`cozystack/cozystack#2683`) and opensearch (`cozystack/cozystack#2682`) project the leaf key. This contract exists to stop them landing that way; the consume object and the extraction controller are what let those PRs deliver `ca.crt` without the key.
+ +Two consequences for the controller. First, it writes at runtime, after chart-render, so it cannot lean on the helper's render-time guard alone: it must whitelist the single `ca.crt` key and re-assert the no-private-key check itself on every write. Second, it adds a new trust surface — read access to per-release CA Secrets, write access for the key-free copy — and must never overwrite a Secret it did not create, surfacing any name collision as a Warning Event rather than failing silently. ## Failure and edge cases - The helper's input PEM contains a private-key header → render fails closed; the chart does not deploy a key-bearing Secret. - The source `ca.crt` somehow carries a private-key header at runtime → the controller refuses to write the copy (runtime whitelist plus guard); no key-bearing Secret is ever projected. -- The per-release CA Secret does not exist yet → the controller waits for the watch event on the labelled source; it does not error or busy-loop. +- The labelled source is a key-free Secret that should be left alone (a self-publishing engine mis-stamped) → the content check sees no private key to strip and the controller still writes only `ca.crt`; no key is ever exposed. +- The per-release CA Secret does not exist yet (operator-created, asynchronous) → the controller waits for the watch event on the labelled source; it does not error or busy-loop. - A foreign Secret already occupies the target name → the controller leaves it untouched (management-label guard) and emits a Warning Event on the application, so the operator sees the collision. - The CA rotates → the controller re-copies `ca.crt` on the next source change; no chart re-render is required. - The application is deleted → the projected `-ca-cert` is garbage-collected by Kubernetes via its owner reference to the application instance; the controller needs no delete path. 
@@ -177,7 +194,7 @@ The trust boundary is precise: a tenant receives `ca.crt` and never receives `tl ## Testing - The helper is already covered by `packages/tests/cozy-lib-tests/tests/tls_cacert_test.yaml`, including the fail-closed assertions. -- The extraction controller gets an envtest/Ginkgo suite (its skeleton mirrors `internal/controller/wildcardsecret/reconciler_test.go`): a labelled source appearing after the consumer, a source whose `ca.crt` is swapped (rotation), a foreign-name collision (asserting the Secret is untouched and a Warning Event is emitted), a source value that smuggles a private-key header (asserting the controller refuses to write), and owner-reference-driven garbage collection on app deletion. +- The extraction controller gets an envtest/Ginkgo suite (its skeleton mirrors `internal/controller/wildcardsecret/reconciler_test.go`): a labelled source appearing after the consumer (both cert-manager-created and operator-created), a source whose `ca.crt` is swapped (rotation), a key-free source that must be left byte-for-byte unchanged (no spurious rewrite), a key-bearing source that must be stripped to `ca.crt`, a foreign-name collision (asserting the Secret is untouched and a Warning Event is emitted), a source value that smuggles a private-key header (asserting the controller refuses to write), and owner-reference-driven garbage collection on app deletion. - Each per-app pull request adds helm-unittest fixtures asserting the `-ca-cert` shape and label, plus an end-to-end check under `hack/e2e-apps/` that a tenant can read `ca.crt`, cannot read any object carrying `tls.key`, and can verify the server. ## Rollout @@ -189,19 +206,21 @@ The trust boundary is precise: a tenant receives `ca.crt` and never receives `tl ## Open questions +- **Labelling operator-created sources.** The cert-manager charts stamp the source label via `Certificate.spec.secretTemplate.labels`.
Operators that create the CA Secret themselves (CNPG `-ca`, PSMDB `-ca-cert`) may not let the chart label their output. For those, the controller needs either an operator that supports output labels or a small per-engine source-name configuration as a fallback — this is the one place the engine-agnostic label is not yet sufficient, and it should be resolved before postgres/mongodb convergence. - The exact namespace convention for the `-ca-cert` object (per-release in the app namespace is assumed here) and the final label/annotation names. This proposal uses `internal.cozystack.io/publish-ca-cert` on the source and `internal.cozystack.io/tenant-ca` on the projected copy, matching the `internal.cozystack.io/` convention of the existing `tenantresource` and `managed-by-cozystack` markers. -- Whether the extraction controller is scoped to the cert-manager-minting engines, or generalized into one projector that also normalizes the operator-owned engines (postgres, mongodb) onto the canonical shape instead of each chart doing its own shape adaptation. -- How this intersects with per-tenant wildcard propagation (issue `cozystack/cozystack#2820`, implemented by PR `cozystack/cozystack#2990`). That is a *cross-namespace* replication problem, and Cozystack already ships `clustersecret-operator` (a namespace-selector `ClusterSecret` CRD) for that class; the extraction controller here is *intra-namespace, per-release*, so the two do not share a mechanism — but they should share the management-label and foreign-collision conventions. +- How this intersects with per-tenant wildcard propagation (issue `cozystack/cozystack#2820`, implemented by PR `cozystack/cozystack#2990`). That is a *cross-namespace* replication problem; the extraction controller here is *intra-namespace, per-release*, so the two do not share a mechanism — but they should share the management-label and foreign-collision conventions. 
Note that `clustersecret-operator` is packaged in the repo but is not wired into a default bundle, so it is not an installed primitive this design can assume. ## Alternatives considered - **Pure-consume (applications stop minting, consume a central cert-manager output).** Rejected: it breaks the rotation lifecycle of CloudNativePG and Strimzi, whose own-CA management is mutually exclusive with an externally-supplied server certificate. - **`lookup` for the asynchronous CA.** Rejected: `lookup` runs at render time and is invisible to the Flux digest, so the chart would not re-render when the CA appears. (`cozystack/cozystack#1787` already moved the global-values channel off `lookup` onto `valuesFrom` for the same digest reason.) - **A custom `valuesFrom` pointing at `-ca`.** Rejected: `expectedValuesFrom()` pins every application HelmRelease to the single `cozystack-values` Secret and overwrites drift. -- **A general-purpose cluster-secret replication operator.** Rejected during the edge work (`cozystack/cozystack#2990`) in favour of a purpose-built reconciler with a tight ownership guard. -- **A copy-issuer webhook.** Rejected during the wildcard work (`cozystack/cozystack#2812`) in favour of native references that move only the Secret name, never key material. -- **A name-convention CA-distribution controller** (the controller watches `-ca` because its name is deterministic, and carries its own marking, garbage-collection, and collision guard). Rejected in favour of §5's explicit source label plus lineage reuse: the name convention is an implicit contract the next chart author silently breaks, and the marking/GC it re-implements is already provided by `spec.secrets` and owner references. -- **A field filter on the projection itself (the principled root fix).** Today `tenantsecrets` delivers the whole Secret `Data`, which is the only reason a key-free *copy* must be materialised at all. 
Giving the projection (or the `spec.secrets` selector) a per-key field filter would let a single `ca.crt` key be projected straight out of a key-bearing Secret — no extraction controller, no second object — attacking the CA / private-key coupling of `cozystack/cozystack#2814` at its source and generalizing to the operator-owned engines. Deferred as a larger change: the projection is whole-`Data` today (`pkg/registry/core/tenantsecret/rest.go`), so it touches the tenant-facing API surface rather than a single controller. Recorded as the direction the extraction controller is a pragmatic first step toward. +- **Forking each upstream operator (the redis path) to emit a key-free CA.** Rejected as the general mechanism: it works for redis because the fork is small, but applying it to CloudNativePG, PSMDB, mariadb-operator, opster, and the rabbitmq cluster-operator is far more surface to own than one engine-agnostic controller. +- **cert-manager `trust-manager`.** Considered and rejected as the mechanism, recorded as the validating prior art. trust-manager is the canonical "watch a CA source, materialize a key-free copy, never touch `tls.key`" operator and confirms the copy-model is the mainstream choice for trust-anchor distribution. But it does not fit operationally: its `Bundle`/`ClusterBundle` is cluster-scoped and fans out to namespaces by selector (not one object per release); it reads its sources from its own trust namespace, not from an arbitrary per-release tenant namespace; and it targets ConfigMaps first (Secret targets are opt-in). Modelling per-release, intra-namespace extraction on it would mean one cluster-scoped Bundle per database release reading a source it cannot natively see — against the grain. The small intra-namespace controller is a better fit. +- **A name-convention CA-distribution controller** (the controller watches `-ca` because its name is deterministic, and carries its own marking, garbage-collection, and collision guard). 
Rejected in favour of §5's explicit source label plus lineage reuse: the CA Secret names are non-uniform and `-ca-cert` is overloaded across engines, so a name convention would mis-handle at least one; and the marking/GC it re-implements is already provided by `spec.secrets` and owner references. +- **A field filter on the projection itself (the principled root fix — deferred to its own proposal).** Today `tenantsecrets` delivers the whole Secret `Data`, which is the only reason a key-free *copy* must be materialised at all. A per-key field filter on the projection (or the `spec.secrets` selector) would let a single `ca.crt` key be projected straight out of a key-bearing Secret — no controller, no second object — and would generalise to the operator-owned engines. It is **not** a local change, which is why it is deferred rather than adopted here: the projection is **writable** at the registry level (`pkg/registry/core/tenantsecret/rest.go` implements Create/Update/Patch/Delete), and the write path (`tenantToSecret`) replaces the underlying Secret's `Data` **wholesale** (`out.Data = ts.Data`). A read-side key filter without a matching write-side filter would, the moment any principal with write access used the filtered view, silently drop the keys it could not see. Tenant roles grant only `get/list/watch` today, but that read-only posture lives in a different package (`packages/system/cozystack-basics` clusterroles), so a field filter's safety is entangled with a write path and an RBAC posture defined elsewhere. That makes it a redesign of the tenant-secret API with its own blast-radius analysis — a separate design proposal — not a rider on the TLS rollout. Recorded here as the future simplification the extraction controller could later retire, explicitly **not** a dependency of this proposal. 
+- **A native `ClusterTrustBundle` (KEP-3257).** Considered, recorded as the direction Kubernetes itself is taking for trust anchors (cluster-scoped, world-readable, public-only by construction — the API server rejects PEM with a private key — with central rotation). It does not fit this use case: it is consumed through the pod-mount `clusterTrustBundle` volume projection, whereas a Cozystack tenant reads `-ca-cert` as a named Secret object through the dashboard and `tenantsecrets`. Different consumer model; a copy object is what an object-by-name consumer needs. +- **A general-purpose cluster-secret replication operator / a copy-issuer webhook.** Rejected during the edge work (`cozystack/cozystack#2990`, `cozystack/cozystack#2812`) in favour of native references and a purpose-built reconciler with a tight ownership guard that move only the Secret name, never key material. --- From 0febad1e69d09bd5e74ed388d4e29e25130f62a1 Mon Sep 17 00:00:00 2001 From: Aleksei Sviridkin Date: Thu, 2 Jul 2026 12:58:43 +0300 Subject: [PATCH 4/5] docs(unified-tls-pki): present Kamaji as the model's boundary case MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Promote Kamaji from an exclusion row to a first-class member of the issuer enumeration: it is the fourth independent PKI owner and the one that is fundamentally non-swappable (the kubeconfig pins its CA), which makes it the strongest single argument for unifying the consume interface rather than the certificate authority. The rollout exclusion is unchanged. Sharpen the mongodb row: the PSMDB PKI is entirely operator-created at runtime — the chart renders no TLS objects on main — which is exactly why it needs the same asynchronous-source treatment as CloudNativePG. 
Cross-reference the proposals that consume this contract (the sibling external-database-exposure design and the structured expose model), normalize spelling to American English throughout, and move the status to Review per the template's lifecycle. Assisted-By: Claude Signed-off-by: Aleksei Sviridkin --- design-proposals/unified-tls-pki/README.md | 33 +++++++++++----------- 1 file changed, 17 insertions(+), 16 deletions(-) diff --git a/design-proposals/unified-tls-pki/README.md b/design-proposals/unified-tls-pki/README.md index c2e0b6b..473a6e9 100644 --- a/design-proposals/unified-tls-pki/README.md +++ b/design-proposals/unified-tls-pki/README.md @@ -4,7 +4,7 @@ - **Title:** `Unified TLS and PKI model for managed applications` - **Author(s):** `@lexfrei` - **Date:** `2026-06-24` -- **Status:** Draft +- **Status:** Review ## Overview @@ -20,6 +20,7 @@ This proposal is the umbrella design for the work tracked by epic `cozystack/coz - **Edge, open:** `cozystack/cozystack#2990` (propagate the operator wildcard to per-tenant termination points — the PR implementing issue `cozystack/cozystack#2820`). - **Workstreams (issues):** `cozystack/cozystack#2812` and `cozystack/cozystack#2400` (closed, edge wildcard); `cozystack/cozystack#2814` (converge per-app TLS and close the trust-anchor delivery gap — the first consumer of this contract); `cozystack/cozystack#2815` (external DB exposure via Gateway TLS-passthrough); `cozystack/cozystack#2816` (end-to-end TLS for databases); `cozystack/cozystack#2977` (opt-in east-west encryption). Throughout this document a `cozystack/cozystack#NNNN` reference is an issue unless called out as a PR. - **Per-app TLS series (open):** `cozystack/cozystack#2729` (redis), `cozystack/cozystack#2692` (mongodb), `cozystack/cozystack#2683` (rabbitmq), `cozystack/cozystack#2682` (opensearch), `cozystack/cozystack#2680` (mariadb). 
These are the pull requests that should land *after* this contract is accepted, not before; several currently propose handing tenants a key-bearing Secret, which this contract exists to correct (see Security). +- **Related (consumers of this contract):** `design-proposals/external-database-exposure` — the sibling proposal under the same epic; its end-to-end TLS story has external clients validate against the `-ca-cert` object defined here. `design-proposals/structured-external-exposure` (community pull request #29) — the structured `expose` model whose "external implies TLS" rule likewise points clients at this trust anchor. Both consume this proposal's output and define no PKI of their own. All repository paths below refer to the `cozystack/cozystack` repository; paths attributed to an open PR (for example the wildcard-secret reconciler in PR `cozystack/cozystack#2990`) are not yet on `main`. @@ -39,16 +40,16 @@ The second class **consumes** PKI that its operator owns end-to-end — but, cru - `kafka` (Strimzi) is the **clean** reference: it exposes `-cluster-ca-cert` and `-clients-ca-cert`, each a CA-certificate-only object with no private key (`packages/apps/kafka/templates/dashboard-resourcemap.yaml`). Strimzi can serve a public certificate on its external listener while keeping internal broker mTLS on its own CA (`generateCertificateAuthority: false` + per-listener `brokerCertChainAndKey`). - `postgres` (CloudNativePG) renders no cert-manager objects; the CNPG operator auto-generates a self-signed CA and signs the server certificate. The CA lives in the operator-created `-ca` Secret, which **carries `ca.key`** and is created **asynchronously**; it is **not** delivered to tenants. 
The only tenant-facing object is `-credentials` (`packages/apps/postgres/templates/init-script.yaml`), which is a chart-rendered Opaque Secret holding **only `user: password` pairs** — no `ca.crt`, no `tls.key` — surfaced through the dashboard resource map (`packages/apps/postgres/templates/dashboard-resourcemap.yaml`). So postgres is a **trust-anchor delivery gap** (the tenant gets passwords but never `ca.crt`), the same shape as nats/qdrant — *not* a private-key coupling. The comment at `packages/apps/postgres/templates/db.yaml:20-22` claiming `ca.crt` rides in `-credentials` contradicts what the chart actually renders and is a documentation bug. -- `mongodb` (Percona PSMDB) is operator-owned, but the PSMDB operator mints a cert-manager chain and publishes `-ca-cert` as a **key-bearing** cert-manager `isCA` Secret (`tls.crt` + `tls.key` + `ca.crt`). Same gap as the rest, plus a trap: the name `-ca-cert` is key-free for the redis fork and key-bearing for PSMDB — opposite shapes under one name. +- `mongodb` (Percona PSMDB) is operator-owned, but the PSMDB operator mints a cert-manager chain **at runtime** and publishes `-ca-cert` as a **key-bearing** cert-manager `isCA` Secret (`tls.crt` + `tls.key` + `ca.crt`). The chart itself renders no TLS objects on `main` today — the whole PKI is operator-created after deploy, which is exactly why it needs the same asynchronous-source treatment as CNPG. Same gap as the rest, plus a trap: the name `-ca-cert` is key-free for the redis fork and key-bearing for PSMDB — opposite shapes under one name. -One engine is deliberately outside this model. `kubernetes` (Kamaji) owns the control-plane CA, it is not swappable, and the kubeconfig pins that cluster CA — a public edge certificate is meaningless there. It is the lone "private-CA-mandatory" case, needs no unification, and is excluded. +And one engine is the proof of the whole framing rather than a mere exclusion. 
`kubernetes` (Kamaji) is the **fourth independent PKI owner** — alongside cert-manager chart graphs, CNPG, and Strimzi — and the one that is fundamentally non-swappable: Kamaji owns the control-plane CA, the kubeconfig it hands out *pins* that CA, and every client authenticates against it by construction, so a public edge certificate is not just unnecessary there but meaningless. Kamaji is the strongest single argument for "unify the interface, not the CA": any design that homogenized the certificate authority would break it outright, whereas the consume contract below (a key-free trust anchor, delivered uniformly) is something even Kamaji already satisfies in spirit — the kubeconfig *is* its trust-anchor delivery. It needs no unification work and is excluded from the rollout, but it belongs in the enumeration as the boundary case that fixes the model's shape. ### Platform mechanisms this proposal builds on - **The CA-only helper.** `cozy-lib.tls.caCertSecret` (`packages/library/cozy-lib/templates/_tls.tpl`, from `cozystack/cozystack#2989`) renders a Secret containing only `ca.crt`. It fails closed if the input PEM contains any private-key header, and it always stamps the label `internal.cozystack.io/tenantresource: "true"`. It is covered by `packages/tests/cozy-lib-tests/tests/tls_cacert_test.yaml`. The merged PR frames the situation it addresses as exactly this — a delivery gap: "the cert-manager apps grant tenants no access to those Secrets at all, so a tenant currently cannot obtain `ca.crt`". - **Why the helper is needed but not sufficient.** The CA+leaf chain is rendered per-app by each chart's own cert-manager graph (for example `packages/apps/nats/templates/certmanager.yaml`), and the resulting CA Secret carries `ca.key` — so it is not itself a `ca.crt`-only object and cannot be handed to a tenant. 
The helper fixes the *output shape* (a key-free `ca.crt` Secret) but takes the CA cert as a Helm value, and nothing in the current architecture can feed it that value for an asynchronously-issued, per-release CA (see "The problem"). -- **The tenant projection.** Secrets carrying the `internal.cozystack.io/tenantresource` label are exposed to tenants as the virtual resource `core.cozystack.io/tenantsecrets` (`pkg/registry/core/tenantsecret/rest.go`; the label constant lives in `pkg/apis/core/v1alpha1/tenantresource_types.go`; the RBAC grant is on the virtual resource, not on raw `core/v1` Secrets — `packages/system/cozystack-basics/templates/clusterroles.yaml`). The projection is **label-filtered, not field-filtered**: the entire Secret `Data` is delivered (`secretToTenant` copies `sec.Data` whole). This is the security pivot of the whole model — a labelled Secret must contain only safe material. -- **The label's authority is the lineage webhook, not the chart.** The `tenantresource` label is not honoured just because something stamped it. The lineage admission webhook (`internal/lineagecontrollerwebhook/webhook.go`) is its authority: it walks a Secret's `ownerReferences` to the owning application and sets `tenantresource` to `true` or `false` from that application's `spec.secrets` selector on **every** admission. A statically-stamped label therefore does not survive on its own — any Secret meant to be tenant-visible must also match a `spec.secrets.include` entry, or the webhook overwrites the label to `false`. This is why the consume contract below marks the CA object through `spec.secrets` (by label), not by stamping the label alone. 
+- **The tenant projection.** Secrets carrying the `internal.cozystack.io/tenantresource` label are exposed to tenants as the virtual resource `core.cozystack.io/tenantsecrets` (`pkg/registry/core/tenantsecret/rest.go`; the label constant lives in `pkg/apis/core/v1alpha1/tenantresource_types.go`; the RBAC grant is on the virtual resource, not on raw `core/v1` Secrets — `packages/system/cozystack-basics/templates/clusterroles.yaml`). The projection is **label-filtered, not field-filtered**: the entire Secret `Data` is delivered (`secretToTenant` copies `sec.Data` whole). This is the security pivot of the whole model — a labeled Secret must contain only safe material. +- **The label's authority is the lineage webhook, not the chart.** The `tenantresource` label is not honored just because something stamped it. The lineage admission webhook (`internal/lineagecontrollerwebhook/webhook.go`) is its authority: it walks a Secret's `ownerReferences` to the owning application and sets `tenantresource` to `true` or `false` from that application's `spec.secrets` selector on **every** admission. A statically-stamped label therefore does not survive on its own — any Secret meant to be tenant-visible must also match a `spec.secrets.include` entry, or the webhook overwrites the label to `false`. This is why the consume contract below marks the CA object through `spec.secrets` (by label), not by stamping the label alone. - **The values channel.** Global values ride a `cozystack-values` Secret under `_cluster.*` keys, injected into every application HelmRelease via `valuesFrom`. ### The problem @@ -120,7 +121,7 @@ The contract is a per-engine table. 
It is the artifact the per-app pull requests | kafka (Strimzi) | operator | `-cluster-ca-cert`, `-clients-ca-cert` | no | **yes** | yes — clean reference | `generateCertificateAuthority: false` BYO-CA + per-listener `brokerCertChainAndKey`; public cert on external listener, own-CA mTLS internally | | redis (forked operator) | operator fork | `-ca-tls` (chart cert-manager) | yes | **yes** → operator emits `-ca-cert` | yes | forked `redis-operator` `spec.tls.caCertSecretName` outputs a key-free `ca.crt`-only Opaque Secret (v1.4.0+) | | postgres (CloudNativePG) | operator | `-ca` (operator-created, async) | **yes — `ca.key`** | no | no — tenant gets passwords-only `-credentials` (gap) | `serverCASecret` / `serverTLSSecret` BYO; single server cert, no edge/internal split | -| mongodb (Percona PSMDB) | operator | `-ca-cert` (operator-created) | **yes — cert-manager `isCA`** | no | no (gap; name collides with redis, opposite shape) | `spec.tls.issuerConf` BYO issuer; auto-mints a cert-manager chain | +| mongodb (Percona PSMDB) | operator | `-ca-cert` (operator-created at runtime; the chart renders no TLS objects) | **yes — cert-manager `isCA`** | no | no (gap; name collides with redis, opposite shape) | `spec.tls.issuerConf` BYO issuer; auto-mints a cert-manager chain | | nats | chart + cert-manager | `-ca` | yes | no | no | — | | qdrant | chart + cert-manager | `-ca` | yes | no | no | — | | mariadb (open PR) | chart + cert-manager | `-ca-tls` | yes | no | no | mariadb-operator `serverCertSecretRef` / `serverCASecretRef` / `serverCertIssuerRef` | @@ -129,7 +130,7 @@ The contract is a per-engine table. 
It is the artifact the per-app pull requests | clickhouse (Altinity) | — (BYO-cert) | supplied cert mount | n/a | no | no (not in TLS series yet) | mounts a supplied cert; no native issuance | | kubernetes (Kamaji) | Kamaji | Kamaji-owned, not swappable | n/a | n/a | no — out of model | control-plane CA, kubeconfig-pinned, non-swappable | -The takeaway: only **kafka** (native) and **redis** (forked operator) are at the target today — they self-publish a key-free `ca.crt`. **Every other engine has a key-bearing CA Secret and delivers no `ca.crt` to the tenant** — one uniform delivery gap. The gap does **not** split along mint-versus-consume: postgres and mongodb are operator-owned, yet their CA Secret is key-bearing (and operator-created, hence asynchronous), so they need the same key-free projection as the cert-manager-minting engines. The cert-manager-minting engines additionally never reach the tenant at all. +The takeaway: only **kafka** (native) and **redis** (forked operator) are at the target today — they self-publish a key-free `ca.crt`. **Every other engine has a key-bearing CA Secret and delivers no `ca.crt` to the tenant** — one uniform delivery gap. The gap does **not** split along mint-versus-consume: postgres and mongodb are operator-owned, yet their CA Secret is key-bearing (and operator-created, hence asynchronous), so they need the same key-free projection as the cert-manager-minting engines. The cert-manager-minting engines additionally never reach the tenant at all. Kamaji anchors the other end of the table: a PKI owner so absolute that the only sane unification is the one this contract chooses — leave the CA alone, unify how trust reaches the client. ### 4. 
The uniform consume contract @@ -153,7 +154,7 @@ The controller is deliberately **engine-agnostic**: it does not branch on cert-m **(b) A small extraction controller.** Its watch/upsert skeleton can follow the wildcard-secret reconciler from PR `cozystack/cozystack#2990` (`internal/controller/wildcardsecret/reconciler.go`), but it carries none of that reconciler's copy-marking or prune logic — lineage provides those (part (c)) — and, being **intra-namespace**, it can do something that reconciler cannot. For each label-selected source Secret it upserts a `type: Opaque` Secret named `-ca-cert` containing **only** `ca.crt`, re-copying on every source change so a CA rotation propagates without a chart re-render. It does four security-load-bearing things and nothing more: - **Key off the label and the content, never the name.** Select sources by the `publish-ca-cert` label; before writing, verify the lifted value is a CA certificate and carries no `-----BEGIN … PRIVATE KEY-----` header. This is what lets one controller serve a key-free redis source it should ignore and a key-bearing PSMDB source it must strip, both named `-ca-cert`-adjacent, without confusion. -- **Tolerate operator-created, asynchronous sources.** The source may be created by the CNPG or PSMDB operator after the chart renders (not just by cert-manager). The controller waits on the watch event for the labelled source; it does not assume the chart owns the Secret and does not error or busy-loop before the source exists. +- **Tolerate operator-created, asynchronous sources.** The source may be created by the CNPG or PSMDB operator after the chart renders (not just by cert-manager). The controller waits on the watch event for the labeled source; it does not assume the chart owns the Secret and does not error or busy-loop before the source exists. 
- **Sanitize at write time, not just render time.** The `cozy-lib.tls.caCertSecret` helper's fail-closed guard runs at *chart-render* time; this controller writes at *runtime*, so it must itself copy only the single `ca.crt` key (an explicit whitelist) and re-assert the no-private-key check on every write. It never copies the whole `Data`. - **Owner-ref the projected Secret to the application instance CR**, resolved from the `app.kubernetes.io/instance` label already on the source Secret. This is where intra-namespace beats the `cozystack/cozystack#2990` reconciler, which tracks its replicas by a management label plus a back-reference annotation *because* a cross-namespace replica cannot carry a valid `ownerReference`. Our projection is same-namespace, so a real owner-reference is valid: Kubernetes garbage-collects the `-ca-cert` on app deletion for free, and the controller needs no prune logic of its own. (It does not owner-ref the *source* `-ca`: cert-manager and the DB operators do not own their output Secrets, so a walk from the source would dead-end before the app.) @@ -163,7 +164,7 @@ What this *reuses* rather than rebuilds: the label selector in `ApplicationDefin ### 6. Per-engine application order -`postgres` goes first (under the tracking issue `cozystack/cozystack#2814`): it is the engine the epic most wants converged, and it exercises the controller against the hardest input — an **operator-created, asynchronous** CA Secret — so validating it there validates the mechanism everywhere. Convergence for postgres means publishing a key-free `-ca-cert` extracted from CNPG's key-bearing `-ca`, while leaving the passwords-only `-credentials` untouched. `kafka` is already at the target and needs only documentation. `redis` (`cozystack/cozystack#2729`) converges by shape adaptation: its forked operator self-publishes a key-free `-ca-cert` via `caCertSecretName`, so it does not stamp the source label and the controller leaves it alone. 
`mongodb`, then the cert-manager-minting engines (nats, qdrant, and then rabbitmq, mariadb, opensearch) adopt the controller via the source label. `kubernetes` (Kamaji) is explicitly out. +`postgres` goes first (under the tracking issue `cozystack/cozystack#2814`): it is the engine the epic most wants to see converged, and it exercises the controller against the hardest input — an **operator-created, asynchronous** CA Secret — so validating it there validates the mechanism everywhere. Convergence for postgres means publishing a key-free `-ca-cert` extracted from CNPG's key-bearing `-ca`, while leaving the passwords-only `-credentials` untouched. `kafka` is already at the target and needs only documentation. `redis` (`cozystack/cozystack#2729`) converges by shape adaptation: its forked operator self-publishes a key-free `-ca-cert` via `caCertSecretName`, so it does not stamp the source label and the controller leaves it alone. `mongodb`, then the cert-manager-minting engines (nats, qdrant, and then rabbitmq, mariadb, opensearch) adopt the controller via the source label. `kubernetes` (Kamaji) is explicitly out. ## User-facing changes @@ -175,7 +176,7 @@ This document changes nothing at runtime; it records a target. The pull requests ## Security -The trust boundary is precise: a tenant receives `ca.crt` and never receives `tls.key` or `ca.key`. The label-filtered, full-object nature of the `tenantsecrets` projection makes this a **preventive invariant** rather than the fix of a live leak: because any labelled Secret is delivered in full, the platform's standing rule must be that no key-bearing Secret is ever labelled tenant-facing — and the way to honour it is to remove the key from the *object* (a separate `ca.crt`-only Secret), not to rely on field filtering at projection time. +The trust boundary is precise: a tenant receives `ca.crt` and never receives `tls.key` or `ca.key`. 
The label-filtered, full-object nature of the `tenantsecrets` projection makes this a **preventive invariant** rather than the fix of a live leak: because any labeled Secret is delivered in full, the platform's standing rule must be that no key-bearing Secret is ever labeled tenant-facing — and the way to honor it is to remove the key from the *object* (a separate `ca.crt`-only Secret), not to rely on field filtering at projection time. No merged engine leaks a private key to a tenant today. The merged charts withhold their key-bearing Secrets, and the tenant-facing objects (postgres `-credentials`, the nats/qdrant credentials Secrets) are passwords-only. The live instance of the risk is not in `main` — it is in the **in-flight per-app PRs**, several of which currently propose labelling a key-bearing Secret to tenants: mariadb (`cozystack/cozystack#2680`) projects the CA private key `-ca-tls`, mongodb (`cozystack/cozystack#2692`) projects the key-bearing `-ca-cert`, and rabbitmq (`cozystack/cozystack#2683`) and opensearch (`cozystack/cozystack#2682`) project the leaf key. This contract exists to stop them landing that way; the consume object and the extraction controller are what let those PRs deliver `ca.crt` without the key. @@ -185,8 +186,8 @@ Two consequences for the controller. First, it writes at runtime, after chart-re - The helper's input PEM contains a private-key header → render fails closed; the chart does not deploy a key-bearing Secret. - The source `ca.crt` somehow carries a private-key header at runtime → the controller refuses to write the copy (runtime whitelist plus guard); no key-bearing Secret is ever projected. -- The labelled source is a key-free Secret that should be left alone (a self-publishing engine mis-stamped) → the content check sees no private key to strip and the controller still writes only `ca.crt`; no key is ever exposed. 
-- The per-release CA Secret does not exist yet (operator-created, asynchronous) → the controller waits for the watch event on the labelled source; it does not error or busy-loop. +- The labeled source is a key-free Secret that should be left alone (a self-publishing engine mis-stamped) → the content check sees no private key to strip and the controller still writes only `ca.crt`; no key is ever exposed. +- The per-release CA Secret does not exist yet (operator-created, asynchronous) → the controller waits for the watch event on the labeled source; it does not error or busy-loop. - A foreign Secret already occupies the target name → the controller leaves it untouched (management-label guard) and emits a Warning Event on the application, so the operator sees the collision. - The CA rotates → the controller re-copies `ca.crt` on the next source change; no chart re-render is required. - The application is deleted → the projected `-ca-cert` is garbage-collected by Kubernetes via its owner reference to the application instance; the controller needs no delete path. @@ -194,7 +195,7 @@ Two consequences for the controller. First, it writes at runtime, after chart-re ## Testing - The helper is already covered by `packages/tests/cozy-lib-tests/tests/tls_cacert_test.yaml`, including the fail-closed assertions. -- The extraction controller gets an envtest/Ginkgo suite (its skeleton mirrors `internal/controller/wildcardsecret/reconciler_test.go`): a labelled source appearing after the consumer (cert-manager and operator-created), a source whose `ca.crt` is swapped (rotation), a key-free source that must be left byte-for-byte (no spurious rewrite), a key-bearing source that must be stripped to `ca.crt`, a foreign-name collision (asserting the Secret is untouched and a Warning Event is emitted), a source value that smuggles a private-key header (asserting the controller refuses to write), and owner-reference-driven garbage collection on app deletion. 
+- The extraction controller gets an envtest/Ginkgo suite (its skeleton mirrors `internal/controller/wildcardsecret/reconciler_test.go`): a labeled source appearing after the consumer (cert-manager and operator-created), a source whose `ca.crt` is swapped (rotation), a key-free source that must be left byte-for-byte (no spurious rewrite), a key-bearing source that must be stripped to `ca.crt`, a foreign-name collision (asserting the Secret is untouched and a Warning Event is emitted), a source value that smuggles a private-key header (asserting the controller refuses to write), and owner-reference-driven garbage collection on app deletion. - Each per-app pull request adds helm-unittest fixtures asserting the `-ca-cert` shape and label, plus an end-to-end check under `hack/e2e-apps/` that a tenant can read `ca.crt`, cannot read any object carrying `tls.key`, and can verify the server. ## Rollout @@ -217,10 +218,10 @@ Two consequences for the controller. First, it writes at runtime, after chart-re - **A custom `valuesFrom` pointing at `-ca`.** Rejected: `expectedValuesFrom()` pins every application HelmRelease to the single `cozystack-values` Secret and overwrites drift. - **Forking each upstream operator (the redis path) to emit a key-free CA.** Rejected as the general mechanism: it works for redis because the fork is small, but applying it to CloudNativePG, PSMDB, mariadb-operator, opster, and the rabbitmq cluster-operator is far more surface to own than one engine-agnostic controller. - **cert-manager `trust-manager`.** Considered and rejected as the mechanism, recorded as the validating prior art. trust-manager is the canonical "watch a CA source, materialize a key-free copy, never touch `tls.key`" operator and confirms the copy-model is the mainstream choice for trust-anchor distribution. 
But it does not fit operationally: its `Bundle`/`ClusterBundle` is cluster-scoped and fans out to namespaces by selector (not one object per release); it reads its sources from its own trust namespace, not from an arbitrary per-release tenant namespace; and it targets ConfigMaps first (Secret targets are opt-in). Modelling per-release, intra-namespace extraction on it would mean one cluster-scoped Bundle per database release reading a source it cannot natively see — against the grain. The small intra-namespace controller is a better fit. -- **A name-convention CA-distribution controller** (the controller watches `-ca` because its name is deterministic, and carries its own marking, garbage-collection, and collision guard). Rejected in favour of §5's explicit source label plus lineage reuse: the CA Secret names are non-uniform and `-ca-cert` is overloaded across engines, so a name convention would mis-handle at least one; and the marking/GC it re-implements is already provided by `spec.secrets` and owner references. -- **A field filter on the projection itself (the principled root fix — deferred to its own proposal).** Today `tenantsecrets` delivers the whole Secret `Data`, which is the only reason a key-free *copy* must be materialised at all. A per-key field filter on the projection (or the `spec.secrets` selector) would let a single `ca.crt` key be projected straight out of a key-bearing Secret — no controller, no second object — and would generalise to the operator-owned engines. It is **not** a local change, which is why it is deferred rather than adopted here: the projection is **writable** at the registry level (`pkg/registry/core/tenantsecret/rest.go` implements Create/Update/Patch/Delete), and the write path (`tenantToSecret`) replaces the underlying Secret's `Data` **wholesale** (`out.Data = ts.Data`). 
A read-side key filter without a matching write-side filter would, the moment any principal with write access used the filtered view, silently drop the keys it could not see. Tenant roles grant only `get/list/watch` today, but that read-only posture lives in a different package (`packages/system/cozystack-basics` clusterroles), so a field filter's safety is entangled with a write path and an RBAC posture defined elsewhere. That makes it a redesign of the tenant-secret API with its own blast-radius analysis — a separate design proposal — not a rider on the TLS rollout. Recorded here as the future simplification the extraction controller could later retire, explicitly **not** a dependency of this proposal. +- **A name-convention CA-distribution controller** (the controller watches `-ca` because its name is deterministic, and carries its own marking, garbage-collection, and collision guard). Rejected in favor of §5's explicit source label plus lineage reuse: the CA Secret names are non-uniform and `-ca-cert` is overloaded across engines, so a name convention would mis-handle at least one; and the marking/GC it re-implements is already provided by `spec.secrets` and owner references. +- **A field filter on the projection itself (the principled root fix — deferred to its own proposal).** Today `tenantsecrets` delivers the whole Secret `Data`, which is the only reason a key-free *copy* must be materialized at all. A per-key field filter on the projection (or the `spec.secrets` selector) would let a single `ca.crt` key be projected straight out of a key-bearing Secret — no controller, no second object — and would generalize to the operator-owned engines. 
It is **not** a local change, which is why it is deferred rather than adopted here: the projection is **writable** at the registry level (`pkg/registry/core/tenantsecret/rest.go` implements Create/Update/Patch/Delete), and the write path (`tenantToSecret`) replaces the underlying Secret's `Data` **wholesale** (`out.Data = ts.Data`). A read-side key filter without a matching write-side filter would, the moment any principal with write access used the filtered view, silently drop the keys it could not see. Tenant roles grant only `get/list/watch` today, but that read-only posture lives in a different package (`packages/system/cozystack-basics` clusterroles), so a field filter's safety is entangled with a write path and an RBAC posture defined elsewhere. That makes it a redesign of the tenant-secret API with its own blast-radius analysis — a separate design proposal — not a rider on the TLS rollout. Recorded here as the future simplification the extraction controller could later retire, explicitly **not** a dependency of this proposal. - **A native `ClusterTrustBundle` (KEP-3257).** Considered, recorded as the direction Kubernetes itself is taking for trust anchors (cluster-scoped, world-readable, public-only by construction — the API server rejects PEM with a private key — with central rotation). It does not fit this use case: it is consumed through the pod-mount `clusterTrustBundle` volume projection, whereas a Cozystack tenant reads `-ca-cert` as a named Secret object through the dashboard and `tenantsecrets`. Different consumer model; a copy object is what an object-by-name consumer needs. -- **A general-purpose cluster-secret replication operator / a copy-issuer webhook.** Rejected during the edge work (`cozystack/cozystack#2990`, `cozystack/cozystack#2812`) in favour of native references and a purpose-built reconciler with a tight ownership guard that move only the Secret name, never key material. 
+- **A general-purpose cluster-secret replication operator / a copy-issuer webhook.** Rejected during the edge work (`cozystack/cozystack#2990`, `cozystack/cozystack#2812`) in favor of native references and a purpose-built reconciler with a tight ownership guard that move only the Secret name, never key material. --- From b7dc96a1983f51e73fe823c3193d76e7338f163c Mon Sep 17 00:00:00 2001 From: Aleksei Sviridkin Date: Thu, 2 Jul 2026 15:46:25 +0300 Subject: [PATCH 5/5] docs(unified-tls-pki): weigh External Secrets Operator as a delivery alternative The ESO kubernetes provider is the closest prior art for the ca.crt extraction step: a per-release ExternalSecret selecting a single remoteRef property materializes the key-free object, the chart renders a declaration rather than data (sidestepping the async render-time wall), and the lineage ownership walk resolves through ESO's owner reference to the chart-rendered ExternalSecret, with garbage collection following. Record it in Alternatives with the reasons it is not the default mechanism: the package is opt-in behind bundles.enabledPackages (empty by default), so the trust-anchor contract would depend on a component an operator may not run; it adds a per-namespace SecretStore and secret-read ServiceAccount fan-out; and its poll-based refresh is weaker than a watch-driven controller for rotation propagation. Noted as the controller's natural retirement path if ESO ever becomes a default platform component. Assisted-By: Claude Signed-off-by: Aleksei Sviridkin --- design-proposals/unified-tls-pki/README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/design-proposals/unified-tls-pki/README.md b/design-proposals/unified-tls-pki/README.md index 473a6e9..9d1c1bf 100644 --- a/design-proposals/unified-tls-pki/README.md +++ b/design-proposals/unified-tls-pki/README.md @@ -218,6 +218,7 @@ Two consequences for the controller. 
First, it writes at runtime, after chart-re - **A custom `valuesFrom` pointing at `-ca`.** Rejected: `expectedValuesFrom()` pins every application HelmRelease to the single `cozystack-values` Secret and overwrites drift. - **Forking each upstream operator (the redis path) to emit a key-free CA.** Rejected as the general mechanism: it works for redis because the fork is small, but applying it to CloudNativePG, PSMDB, mariadb-operator, opster, and the rabbitmq cluster-operator is far more surface to own than one engine-agnostic controller. - **cert-manager `trust-manager`.** Considered and rejected as the mechanism, recorded as the validating prior art. trust-manager is the canonical "watch a CA source, materialize a key-free copy, never touch `tls.key`" operator and confirms the copy-model is the mainstream choice for trust-anchor distribution. But it does not fit operationally: its `Bundle`/`ClusterBundle` is cluster-scoped and fans out to namespaces by selector (not one object per release); it reads its sources from its own trust namespace, not from an arbitrary per-release tenant namespace; and it targets ConfigMaps first (Secret targets are opt-in). Modelling per-release, intra-namespace extraction on it would mean one cluster-scoped Bundle per database release reading a source it cannot natively see — against the grain. The small intra-namespace controller is a better fit. +- **External Secrets Operator (kubernetes provider).** The closest prior art of the copy-model options, and mechanically sound end-to-end: a `SecretStore` with `provider.kubernetes` plus a per-release `ExternalSecret` selecting a single key (`data[].remoteRef.property: ca.crt`, not `dataFrom`) materializes exactly the key-free `-ca-cert` object. 
Selecting one property is a structural no-private-key whitelist, `target.template` stamps the marker label, `refreshInterval` tolerates the operator-created asynchronous source, and the chart renders a *declaration* of the projection rather than its data — sidestepping the async/render-time wall the same way the controller does. The lineage chain resolves too: ESO owner-refs its output Secret to the `ExternalSecret`, itself a chart-rendered object, so the ownership walk reaches the owning application the way it does for any chart object, and garbage collection follows. Not adopted as the mechanism for one operational reason and two costs: ESO is packaged but **opt-in** — `cozystack.external-secrets-operator` renders only when listed in `bundles.enabledPackages`, which defaults to empty (`packages/core/platform/values.yaml`) — so the platform's trust-anchor contract would hinge on a component an operator may not run, and adopting it means first promoting ESO to a required, always-on platform dependency; each consuming namespace needs a `SecretStore` plus a ServiceAccount with read access to source Secrets (a per-namespace privileged-read fan-out the intra-namespace controller does not add); and refresh is poll-based (`refreshInterval`) where the controller is watch-driven, so rotation propagation is bounded by the polling period. If ESO later graduates to a default platform component, it is the natural retirement path for the extraction controller — recorded with the same status as the field-filter below. - **A name-convention CA-distribution controller** (the controller watches `-ca` because its name is deterministic, and carries its own marking, garbage-collection, and collision guard). 
Rejected in favor of §5's explicit source label plus lineage reuse: the CA Secret names are non-uniform and `-ca-cert` is overloaded across engines, so a name convention would mis-handle at least one; and the marking/GC it re-implements is already provided by `spec.secrets` and owner references. - **A field filter on the projection itself (the principled root fix — deferred to its own proposal).** Today `tenantsecrets` delivers the whole Secret `Data`, which is the only reason a key-free *copy* must be materialized at all. A per-key field filter on the projection (or the `spec.secrets` selector) would let a single `ca.crt` key be projected straight out of a key-bearing Secret — no controller, no second object — and would generalize to the operator-owned engines. It is **not** a local change, which is why it is deferred rather than adopted here: the projection is **writable** at the registry level (`pkg/registry/core/tenantsecret/rest.go` implements Create/Update/Patch/Delete), and the write path (`tenantToSecret`) replaces the underlying Secret's `Data` **wholesale** (`out.Data = ts.Data`). A read-side key filter without a matching write-side filter would, the moment any principal with write access used the filtered view, silently drop the keys it could not see. Tenant roles grant only `get/list/watch` today, but that read-only posture lives in a different package (`packages/system/cozystack-basics` clusterroles), so a field filter's safety is entangled with a write path and an RBAC posture defined elsewhere. That makes it a redesign of the tenant-secret API with its own blast-radius analysis — a separate design proposal — not a rider on the TLS rollout. Recorded here as the future simplification the extraction controller could later retire, explicitly **not** a dependency of this proposal. 
- **A native `ClusterTrustBundle` (KEP-3257).** Considered, recorded as the direction Kubernetes itself is taking for trust anchors (cluster-scoped, world-readable, public-only by construction — the API server rejects PEM with a private key — with central rotation). It does not fit this use case: it is consumed through the pod-mount `clusterTrustBundle` volume projection, whereas a Cozystack tenant reads `-ca-cert` as a named Secret object through the dashboard and `tenantsecrets`. Different consumer model; a copy object is what an object-by-name consumer needs.
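
For orientation, the External Secrets Operator alternative weighed above could look roughly like the pair of manifests below. This is a minimal sketch under stated assumptions, not the proposal's mechanism: every object name, the namespace, the ServiceAccount, the refresh period, and the marker label key are hypothetical placeholders, and the `external-secrets.io/v1beta1` API version is assumed (newer ESO releases also serve `external-secrets.io/v1`). It only illustrates the shape the bullet describes: a namespaced `SecretStore` using `provider.kubernetes`, and an `ExternalSecret` that selects the single `ca.crt` property instead of using `dataFrom`.

```yaml
# Sketch only: all names, the namespace, and the label key are illustrative assumptions.
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: example-db-ca-store            # hypothetical
  namespace: tenant-example            # hypothetical tenant namespace
spec:
  provider:
    kubernetes:
      remoteNamespace: tenant-example  # source Secret lives in the same namespace
      server:
        caProvider:                    # CA used to verify the API server
          type: ConfigMap
          name: kube-root-ca.crt
          key: ca.crt
      auth:
        serviceAccount:
          name: example-db-ca-reader   # hypothetical; needs read access to the source Secret
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: example-db-ca-cert             # hypothetical; rendered by the chart as a declaration
  namespace: tenant-example
spec:
  refreshInterval: 5m                  # poll period; bounds how fast a rotation propagates
  secretStoreRef:
    kind: SecretStore
    name: example-db-ca-store
  target:
    name: example-db-ca-cert           # the key-free consume object
    template:
      metadata:
        labels:
          example.cozystack.io/tenant-facing: "true"  # placeholder for the real marker label
  data:
    - secretKey: ca.crt                # the only key written to the target
      remoteRef:
        key: example-db-ca             # hypothetical key-bearing source Secret
        property: ca.crt               # single-property selection; tls.key is never read
```

Because only one `remoteRef.property` is listed, the target Secret can carry nothing but `ca.crt`, which is the structural whitelist the bullet refers to. The per-namespace `SecretStore` and reader ServiceAccount in the sketch are the fan-out cost named above, and the `refreshInterval` is the polling bound on rotation.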