# ADR-0002 — Step-CA-issued TLS certificates for Redis, Postgres, and InfluxDB

**Status.** Proposed, 2026-05-08.

**Related.** ccatobs/system-integration#95 (PRD).

**Supersedes.** Nothing on disk; the homegrown PKI under `ansible/roles/redis_certs/` and `redis//certs/` is what this retires.

## Context

Today three datastores in the stack use ad-hoc or absent TLS:

- **Redis** — homegrown CA per environment variant (`ansible/roles/redis_certs/`, four self-signed CAs under `redis/{main,ccat,develop,develop-ccat}/certs/`). mTLS is configured but the CA story is bespoke and cannot be audited as part of the rest of the CCAT trust chain.
- **Postgres** — TLS not enforced; client traffic in the clear or with trust-on-first-use.
- **InfluxDB** — fronted today on plain HTTP. The Grafana datasource at `grafana/provisioning/production/datasources/influxdb-datasource.yaml` literally points at `http://data.ccat.uni-koeln.de:8086` with `tlsSkipVerify: true`. The same pattern lives in `grafana/provisioning/staging/datasources/`.

The CCAT step-ca endpoint already issues:

- the SSH user-cert plane (`ansible/roles/ssh_service_cert/`),
- the public TLS cert for the CA's own vhost (`step-ca/issue-vhost-cert.sh`, `step-ca/renew-vhost-cert.sh`, `step_ca_vhost_cert.timer` every 12h).

PRD #95 proposes routing the three datastores onto the same step-ca-issued path: a single root of trust, predictable lifetimes, and the same operator muscle memory.

A senior-architect review of #95 blocked the PRD on this ADR existing. The PRD names the headline decisions ("JWK", "hard cutover", "multi-SAN no IP") but defers the reasoning to here. This ADR is that reasoning.

### Important PRD correction

The PRD claims it copies an existing `step_ca_vhost_cert` *Ansible role* verbatim.
**There is no such role on disk.** The prior art is:

- the **scripts** at `step-ca/issue-vhost-cert.sh` (one-shot issuance) and `step-ca/renew-vhost-cert.sh` (PRE_MTIME / `step ca renew` / POST_MTIME / conditional reload),
- the **per-container password-staging convention** in `ansible/roles/ssh_service_cert/tasks/_per_container.yml` (vault → 0400 host tmpfile → unlink in an `always:` block; or stdin-only via `docker_container_exec` for the in-container case).

The implementation must build the Ansible role **from these patterns**, not from a role that does not exist. Any reader of this ADR or the PRD should not waste time grepping for `step_ca_vhost_cert/` under `ansible/roles/`.

## Decision

Issue TLS certs for Redis, Postgres, and InfluxDB from the CCAT step-ca via the **JWK** provisioner, using cert-as-auth (`step ca renew`) on a 12h timer cadence. Cut over hard, no dual-trust soak. Implement as a single parameterised Ansible role (`step_ca_vhost_cert`) plus four pluggable reload-strategy adapters (one per service, plus `noop`) — not as one deep module pretending all three services are the same shape.

Per-decision detail follows.

---

## Decision: Provisioner choice — JWK over ACME and X5C

### Context

step-ca offers three provisioner classes for non-interactive cert flows: ACME (HTTP-01, DNS-01, TLS-ALPN-01), X5C (cert-presented-as-auth, but chained to an external trust root), and JWK (password-or-key-protected provisioner credentials).

### Decision

Use **JWK**, with `step ca renew` (cert-as-auth) for steady-state renewal. Initial issuance presents the JWK provisioner password; every renewal thereafter authorises with the cert's own private key, so the provisioner password never has to live on the renewing host past the one-shot issuance step.

Working precedent: `step-ca/renew-vhost-cert.sh` does exactly this for the `ca.ccat.uni-koeln.de` vhost cert.
`step ca renew` only contacts the CA inside the renewal window (last 1/3 of lifetime by default), so a 12h timer is benign — most fires are no-ops.

### Alternatives considered

- **ACME HTTP-01.** Would require the CA reach the requesting service on port 80. In our topology Redis on input-b is firewalled off the public internet; Postgres on input-a is internal; InfluxDB has its own vhost path. Opening HTTP-01 challenge paths through the proxy for three more vhosts adds a brittle coupling between CA, proxy config, and ACME challenge timing — and is the operational class of problem that ADR-0001 already had to navigate to get the CA's own vhost cert working.
- **ACME DNS-01.** Would require the CA orchestrate DNS records in the Uni-Köln DNS zone. We do not control that zone programmatically; a manual record-flip per renewal is unacceptable on a 12h cadence.
- **ACME TLS-ALPN-01.** Same firewall constraint as HTTP-01, plus the Redis/Postgres/Influx daemons are not HTTP servers and cannot serve the challenge.
- **X5C.** Would require us to bootstrap a separate trust root just to authorise these provisioners, then maintain it. It does not solve a problem JWK doesn't already solve; it adds a parallel trust path we'd then have to monitor.

### Consequences

- The JWK provisioner password is in vault (`vault_step_ca_prov_*_password`) and only reaches the issuing host via Ansible's vault → 0400 tmpfile → unlink pattern from `roles/ssh_service_cert/tasks/_per_container.yml`. Steady-state renewals do not touch the password at all.
- All renewals share one well-trodden path (`step ca renew`) so an operator who has debugged the vhost cert renewal already knows how to debug a Redis cert renewal.
- An open question (see below): does `step ca renew` succeed against an already-expired auth cert? If not, an HSM outage that exceeds the renewal budget *plus* the window between renewal and expiry forces a fall-back to the JWK-password path.
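
The PRE_MTIME / `step ca renew` / POST_MTIME / conditional-reload pattern that this decision inherits from `step-ca/renew-vhost-cert.sh` can be sketched as follows. This is a minimal illustration in Python rather than the shell of the existing script; the function name `renew_and_maybe_reload` and the two callables are hypothetical, and the real renewal step would shell out to `step ca renew`.

```python
import os
from typing import Callable


def renew_and_maybe_reload(
    cert_path: str,
    renew: Callable[[], None],
    reload_service: Callable[[], None],
) -> bool:
    """Run one timer fire; return True if the service was reloaded.

    `renew` and `reload_service` are hypothetical hooks: in a real script
    `renew` would invoke `step ca renew` and `reload_service` would be the
    service-specific reload step.
    """
    pre_mtime = os.stat(cert_path).st_mtime_ns   # PRE_MTIME
    renew()                                      # no-op outside the renewal window
    post_mtime = os.stat(cert_path).st_mtime_ns  # POST_MTIME
    if post_mtime == pre_mtime:
        # The cert file was not rewritten, so there is nothing new to serve.
        return False
    reload_service()
    return True
```

Comparing modification times keeps the timer cheap: a fire that lands outside the renewal window leaves the file untouched and therefore never triggers a reload.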

---

## Decision: Migration style — hard cutover, no dual-trust soak

### Context

The architect's default recommendation for any TLS migration is a two-week dual-trust soak (old CA + new CA both accepted, then flip). The PRD instead proposes a hard cutover for all three services.

### Decision

**Hard cutover.** This is consistent with the existing TLS-hard-cutover-policy ADR captured in project memory (2026-05-07): step-ca trust + DB certs roll out via deploy-time restart, not dual-trust.

### Why this stands here, even though architects would normally object

Production is currently in **setup mode**: there are no end users on the operations DB, no live data streams flowing through the transfer pipeline, no externally consumed Grafana dashboards depending on the InfluxDB datasource. A two-week soak buys nothing because the "availability we'd be protecting" doesn't exist yet. The cost of a soak (double-config, more code paths, more places for a misconfigured client to silently fall back to the old trust path) is real today; the benefit is zero today.

### Time-bound — read this before reusing this precedent

The above is **only** true while production is unpopulated. Once the operations DB carries real observation records, once the data-transfer pipeline is moving live telescope data, once Grafana dashboards are being watched by humans on call — the calculus flips. **Any future similar migration on a populated production stack must use a soak.** Do not point at this ADR as precedent for skipping a soak on a live system. The precedent is "skip soak when there are no users", not "skip soak in general".

### Alternatives considered

- **2-week F→G dual-trust soak.** Standard playbook. Rejected on the cost/benefit argument above, time-bound to the empty-production state.
- **Service-by-service phased cutover** (Redis first, then Postgres, then Influx).
Rejected as not actually safer in the current state — each service still hard-cuts when its turn comes; the phasing only spreads operator attention thinner. We will sequence by readiness of the reload adapter (probably Postgres first because `pg_reload_conf()` is the cheapest), not by risk-mitigation.

### Consequences

- A failed cutover is a service outage on whichever datastore failed. Mitigation: rehearse on staging first; the staging environment uses the same step-ca and the same role.
- This ADR must be revisited (and likely rewritten) before the next TLS migration on a populated production stack. Add a checkbox to the production-readiness review.

---

## Decision: Renewal architecture — one role, three reload adapters

### Context

The PRD as drafted proposed a single Ansible module that takes a `cert_spec` dict (name, SANs, lifetime, owner, mode, reload-command) and handles Redis (mTLS + `redis-cli CONFIG SET`), Postgres (server-only + `pg_reload_conf()`), and InfluxDB (server-only + container restart) through that one shape.

The architect review pushed back: a single dict that has to fork on `if redis else if postgres else if influx` inside the module is a deep-module fiction — the fork is inherent to the problem and pretending it isn't makes the module's interface lie.

### Decision

Build **one parameterised role** (`step_ca_vhost_cert`) that handles:

- issuance via JWK provisioner,
- on-disk cert layout, ownership, mode,
- the renewal timer/script (modelled on `step-ca/renew-vhost-cert.sh` with PRE_MTIME / POST_MTIME conditional reload),
- trust-anchor consumption from `roles/ca_trust/`.

…and expose a **pluggable reload-strategy interface** with four adapter implementations:

| Adapter | Service | Reload mechanism | Downtime |
|---|---|---|---|
| `runtime_redis` | Redis | `redis-cli CONFIG SET tls-cert-file ...; CONFIG SET tls-key-file ...` | zero |
| `runtime_postgres` | Postgres | `SELECT pg_reload_conf();` (or `pg_ctl reload`) | zero |
| `restart_influx` | InfluxDB | `docker restart influxdb` | ~30s |
| `noop` | (canary or no-service-attached cert) | nothing — write files, exit 0 | n/a |

The role takes a `reload_strategy` parameter that selects one of these four; the adapter's contract is "given a cert that was just renewed, make the running service serve it" (or, for `noop`, "verify the new files exist and exit"). Anything that doesn't fit one of these adapters is an implementation surprise that deserves a new adapter, not a special case inside the existing ones.

`noop` is the fourth adapter; it exists for certs that have no service to reload — the x509 canary on `input-c.staging` (see "Decision: x509 canary") is its first user. A future cert that participates in the trust chain but is read by external tooling rather than a running service (e.g., a public-facing inspection endpoint) can also use it.

### Why four adapters is honest deep-module design

Ousterhout's "deep module" guidance is *narrow interface, broad implementation* — emphatically not "one interface that secretly does four different things". The reload mechanism is genuinely different across the four cases (CONFIG SET vs SQL function call vs container restart vs no-op) and the operational consequences differ (zero vs zero vs 30s downtime vs none). Forcing them into one cert-spec dict makes the caller's mental model wrong: they think they have one knob, they actually have four with different blast radii.

The pluggable adapter makes the asymmetry visible at the call site:

```yaml
- role: step_ca_vhost_cert
  vars:
    cert_spec: { ... }
    reload_strategy: restart_influx   # explicit: this one restarts
```

### Alternatives considered

- **One module, fork-on-service inside.** Rejected per above — hides the asymmetry from the caller.
- **Three independent roles** (`redis_step_cert`, `postgres_step_cert`, `influx_step_cert`). Rejected because the issuance + on-disk + renewal-timer machinery would be duplicated three ways. The whole *point* of the consolidation in #95 is to retire bespoke per-service PKI plumbing.
- **One module, reload-command as a literal shell string parameter.** Rejected because the contract for "reload after renewal" is more than one shell line: it includes idempotency (no reload on no-op renewal), error handling (a failed reload should *not* leave the cert file half-installed), and in the InfluxDB case a wait-for-healthy step. That logic belongs in named adapters, not in free-form shell.

### Consequences

- Adding a fourth datastore later (e.g. MinIO, Loki) is "write a fifth adapter", not "extend the cert-spec dict".
- The role's interface stays narrow (`cert_spec` + `reload_strategy`) while the implementation is honest about the four-way fork.
- Tests can target each adapter independently — important because the InfluxDB adapter is the only one with downtime semantics and needs different verification.

---

## Decision: Reload mechanisms (per service)

This is the per-service detail behind the table in the previous section.

### Redis — `CONFIG SET`, zero downtime

Redis 6+ accepts runtime updates of `tls-cert-file` / `tls-key-file` / `tls-ca-cert-file` via `CONFIG SET`. The connection pool isn't churned; existing TLS sessions live out their natural deaths and new sessions pick up the new material.

Failure mode to test: if `CONFIG SET` succeeds but the new files are unreadable by the redis user (UID 999 in our containers), Redis logs the error and keeps using the old in-memory cert.
The renewal script must verify post-CONFIG-SET that the active cert serial matches the on-disk cert serial.

### Postgres — `pg_reload_conf()`, zero downtime

`SELECT pg_reload_conf();` re-reads `postgresql.conf`, including `ssl_cert_file` and `ssl_key_file`. Existing connections keep their TLS context; new connections get the new cert. Same caveat as Redis: verify the postmaster actually picked up the new cert; a typo in the config path is a silent fallback.

### InfluxDB — `docker restart`, ~30s downtime

InfluxDB OSS does not have a runtime reload for TLS material. We accept the restart. The 30s window is acceptable on the InfluxDB role: it ingests metrics from telegraf, which buffers locally, and serves Grafana dashboards, which retry. No write path depends on InfluxDB being up second-by-second.

The restart adapter must:

- pre-flight that the new cert is syntactically valid (`openssl x509 -noout -text`) before bouncing the container,
- `docker restart` (not `docker stop && docker start` — the former preserves the container's IP / aliases on the user-defined network),
- wait for `/health` to return 200 before declaring success.

---

## Decision: HSM blast radius / soft-offline budget

### Math

The CCAT root CA lives on an HSM. If the HSM is offline for any reason (physical access loss, ceremony in progress, hardware fault), the CA cannot issue or renew. Every cert lives until its `notAfter`; the "soft-offline budget" is how long the HSM can be offline before something starts hard-failing.
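
The budget arithmetic can be reproduced with a few lines of Python. This is a sketch under two assumptions taken from elsewhere in this ADR: the renewal window is the last third of the cert lifetime, and the timer fires every 12h. The helper name `soft_offline_budget` is hypothetical.

```python
from datetime import timedelta
from typing import Tuple


def soft_offline_budget(
    lifetime: timedelta,
    timer: timedelta = timedelta(hours=12),
) -> Tuple[timedelta, int]:
    """Return (budget, timer fires inside the budget) for one cert lifetime.

    Assumes the renewal window opens at 2/3 of the lifetime, so the budget
    is the remaining third.
    """
    budget = lifetime / 3    # time between window opening and notAfter
    fires = budget // timer  # renewal attempts that fit into that time
    return budget, fires


for days in (90, 30, 45):
    budget, fires = soft_offline_budget(timedelta(days=days))
    print(f"{days}d lifetime: budget {budget.days}d, {fires} fires")
```

For the three lifetimes discussed in this decision (90d, 30d, 45d) this yields budgets of 30d, 10d, and 15d respectively.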

| Environment | Cert lifetime | `step ca renew` window opens at | Renewal cadence | Soft-offline budget |
|---|---|---|---|---|
| Production | 90d | day 60 (2/3 lifetime) | 12h timer = 60 fires before expiry | 30d / 60 fires |
| Staging (PRD draft) | 30d | day 20 | 12h timer = 20 fires before expiry | 10d / 20 fires |
| Staging (revised) | 45d | day 30 | 12h timer = 30 fires before expiry | 15d / 30 fires |

### Decision

Production stays at 90d / 30d budget — comfortable headroom for an HSM ceremony (typically 1-2 days) plus one weekend of bad luck.

**Architect-mandated change to the PRD:** staging at 30d / 10d budget is too tight. A long weekend plus a sick on-call plus a stuck CI run eats most of the budget. **Extend staging cert lifetime to 45d** (budget 15d / 30 fires).

### Alternatives considered

- **Document the operational acceptance of 10d on staging.** Available if anyone has a strong reason for keeping cert lifetimes short-on-staging (often "make rotation visible in CI cadence"). Rejected because staging exists to rehearse production failure modes, and a tighter-than-production budget makes staging a worse rehearsal, not a better one.
- **Match staging to production at 90d.** Rejected because we *do* want staging to exercise the renewal path more frequently than production; 45d gives us that without making the budget uncomfortable.

### Consequences

- One more variable to keep aligned across the three services on staging. The role's `cert_spec.lifetime` parameter handles this.
- The PRD's table needs a one-line edit; flag for the implementation PR.

### Open question to pin before implementation

**Does `step ca renew` succeed against an already-expired authenticating cert?** If yes, the budget math above is straightforwardly correct: lose the HSM for 30d, recover, every host catches up on the next timer fire.
If no, then once a host's cert expires we drop back to the JWK-password path for that host, which means the password file has to be ready to materialise on demand.

This is testable in staging with a deliberately back-dated cert. **Do this test before merging the implementation.**

Decision below assumes the answer is "no" until confirmed; the role's renewal script will fall back to JWK-password issuance if `step ca renew` fails for an expired-cert reason.

---

## Decision: SAN policy

### Decision

Each cert carries multiple DNS SANs:

- the docker-network alias the service is reached at (e.g. `redis`, `postgres`, `influxdb`),
- the public FQDN (e.g. `redis.ccat.uni-koeln.de`),
- the host FQDN (e.g. `input-b.ccat.uni-koeln.de`).

**No wildcards. No IP SANs.**

### Reasoning

- **No wildcards:** a leaked `*.ccat.uni-koeln.de` cert grants the attacker every vhost we've ever named under that domain. Multi-SAN per cert keeps the leak blast radius to "this one service".
- **No IP SANs:** IP SANs make the cert tied to a specific deployment topology. Move the service to a different host and the cert silently mis-matches. DNS-only SANs decouple identity from placement; renumbering the IP plan stays a DNS-only operation. The redis_certs precedent included an IP SAN (`redis-certs_staging.conf` lists `IP:134.95.40.103`); we are retiring that.
- **Multi-SAN per cert** instead of "one cert per SAN": one renewal path per service, one cert file in one place. The reload adapters don't have to juggle three cert files for the same daemon.

### Consequences

- Adding a new alias to a service is a re-issuance, not a config edit. Acceptable because aliases change rarely and the role makes re-issuance trivial.
- The cert will list multiple SANs under `Subject Alternative Name` in `openssl x509 -noout -text` — do not treat this as a misconfiguration in inspection scripts.

---

## Decision: mTLS scope asymmetry

### Decision

- **Redis: keep mTLS.** Both server and client present certs.
- **Postgres: server-auth-only.** Server presents a cert; client authenticates with username + password as today.
- **InfluxDB: server-auth-only.** Server presents a cert; client authenticates with API token as today.

### Reasoning — and being honest about it

Redis stays mTLS because it's already mTLS today (homegrown `redis_certs` role) and because the application clients (data-transfer workers, ops-db-api, etc.) already know how to present client certs. Migrating Redis off mTLS at the same time as moving its trust root is two changes at once. We are not doing two changes at once.

This is **inertia, not principle.** A clean-sheet design might well land all three on server-auth-only-with-password/token; mTLS for Redis buys us a marginal extra layer (compromise of the Redis password isn't enough; you'd also need the client cert) but at the cost of distributing client material to every Redis-using service.

**Revisit:** when data-transfer or ops-db-api next has a credentials refactor, evaluate whether Redis mTLS is still pulling its weight or whether server-auth-only-with-password is enough. Track this as a follow-up; do not block #95 on resolving it.

### Consequences

- `runtime_redis` reload adapter has to manage three files (`tls-cert-file`, `tls-key-file`, `tls-ca-cert-file`) — the CA file is what lets the server validate client certs. The other two adapters manage two files (cert + key only).
- Client-side trust distribution is asymmetric: Redis clients need *both* the CCAT root (to validate the server) *and* a client cert+key (to be validated by the server). Postgres/Influx clients only need the CCAT root. The `ca_trust` role already drops the root at `/etc/pki/ca-trust/source/anchors/ccat-root-ca.crt`; client cert distribution stays where it is (per-service, today via redis_certs) for now.
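
The two-files-versus-three-files asymmetry can be expressed as a small helper. This is an illustrative sketch only: the field names (`cert_path`, `key_path`, `ca_path`, `mtls`) follow the cert-spec schema used in this ADR, while the function name `managed_files` and the choice to raise `ValueError` are assumptions.

```python
from typing import Dict, List


def managed_files(spec: Dict[str, object]) -> List[str]:
    """List the files a reload adapter has to manage for one cert-spec."""
    files = [str(spec["cert_path"]), str(spec["key_path"])]
    ca_path = spec.get("ca_path")
    if spec.get("mtls"):
        if not ca_path:
            # An mTLS server needs the CA bundle to validate client certs.
            raise ValueError("mtls is true but ca_path is missing")
        files.append(str(ca_path))
    elif ca_path:
        # Server-auth-only specs should not carry a client-validation CA.
        raise ValueError("ca_path is set but mtls is false")
    return files


redis_spec = {
    "cert_path": "/opt/redis-certs/staging/server.crt",
    "key_path": "/opt/redis-certs/staging/server.key",
    "ca_path": "/opt/redis-certs/staging/ca.crt",
    "mtls": True,
}
print(managed_files(redis_spec))
```

Failing loudly on a mismatched `mtls` / `ca_path` pair keeps a misconfigured spec from silently downgrading Redis to server-auth-only.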

---

## Decision: Cert-spec schema — parameterised UIDs, no defaults baked in

### Context

The PRD as drafted hardcoded the container UIDs (Redis 999, Postgres 999, InfluxDB 1000) as constants inside the role. The architect review pushed back: upstream image rebases historically shift UIDs without major-version bumps, so a baked-in constant is a silent foot-gun. Validation runbook Check 5 (2026-05-08) confirmed the values on `input-b.staging` are 999/999/1000 today, but it also confirmed `influxdb:latest` is the only unpinned image in scope — exactly the drift candidate.

The role needs a schema that (a) takes UID as a per-cert parameter with no role-level default, (b) sources the value from a per-host fact so different hosts can have different UIDs without code changes, (c) is the same schema the runtime drift-detection step (TODO 7) reads at renewal time.

This section follows the same shape as the SSH-cert plane's `ansible/roles/ssh_service_cert/defaults/main.yml` schema — same pattern of "list of cert-spec dicts in `host_vars`, role is a no-op when the list is empty".

### Decision

The role (working name `step_ca_vhost_cert`, modelled on `step-ca/issue-vhost-cert.sh` + `step-ca/renew-vhost-cert.sh` plus `roles/ssh_service_cert/`) takes a list of cert-spec dicts called `service_tls_certs`. Per-host enable lives in `ansible/host_vars//vars_step_ca_vhost_cert.yml`. The role defaults file declares `service_tls_certs: []` so the role is a no-op on hosts where the list is undefined.

```yaml
# ansible/host_vars/input-b.staging/vars_step_ca_vhost_cert.yml
host_container_uids:
  postgres: 999
  redis: 999
  influxdb: 1000

service_tls_certs:
  # ────────────────────────────── postgres ──────────────────────────────
  - service: postgres-main
    sans:
      - postgres
      - postgres.staging.data.ccat.uni-koeln.de
      - input-b.staging.data.ccat.uni-koeln.de
    cert_path: /etc/postgres-certs/server.crt
    key_path: /etc/postgres-certs/server.key
    owner_uid: "{{ host_container_uids.postgres | mandatory }}"
    owner_gid: "{{ host_container_uids.postgres | mandatory }}"
    cert_mode: "0644"
    key_mode: "0600"
    lifetime: "{{ step_ca_x509_cert_lifetime }}"
    reload_strategy: runtime_postgres
    container: system-integration-postgres-1
    provisioner: "{{ step_ca_x509_provisioner }}"
    vault_var_name: vault_step_ca_prov_staging_services_password
    mtls: false

  # ──────────────────────────────── redis ───────────────────────────────
  - service: redis-main
    sans:
      - redis
      - redis.staging.data.ccat.uni-koeln.de
      - input-b.staging.data.ccat.uni-koeln.de
    cert_path: /opt/redis-certs/staging/server.crt
    key_path: /opt/redis-certs/staging/server.key
    ca_path: /opt/redis-certs/staging/ca.crt   # mtls=true only
    owner_uid: "{{ host_container_uids.redis | mandatory }}"
    owner_gid: "{{ host_container_uids.redis | mandatory }}"
    cert_mode: "0644"
    key_mode: "0600"
    lifetime: "{{ step_ca_x509_cert_lifetime }}"
    reload_strategy: runtime_redis
    container: system-integration-redis-1
    provisioner: "{{ step_ca_x509_provisioner }}"
    vault_var_name: vault_step_ca_prov_staging_services_password
    mtls: true   # mTLS scope asymmetry

  # ────────────────────────────── influxdb ──────────────────────────────
  - service: influxdb-main
    sans:
      - influxdb
      - influxdb.staging.data.ccat.uni-koeln.de
      - input-b.staging.data.ccat.uni-koeln.de
    cert_path: /etc/influxdb-certs/server.crt
    key_path: /etc/influxdb-certs/server.key
    owner_uid: "{{ host_container_uids.influxdb | mandatory }}"
    owner_gid: "{{ host_container_uids.influxdb | mandatory }}"
    cert_mode: "0644"
    key_mode: "0600"
    lifetime: "{{ step_ca_x509_cert_lifetime }}"
    reload_strategy: restart_influx
    container: system-integration-influxdb-1
    provisioner: "{{ step_ca_x509_provisioner }}"
    vault_var_name: vault_step_ca_prov_staging_services_password
    mtls: false
```

### Field reference

| Field | Required | Description |
|---|---|---|
| `service` | ✓ | Canonical service name. Used in metric labels (Tier 1+3 of the alert substrate), filename suffixes, and journald audit-log lines. Format: `-` (e.g. `redis-main`, `redis-ccat`, `postgres-main`). |
| `sans` | ✓ | DNS SAN list per the SAN-policy decision (no IP, no wildcard, multi-SAN). At minimum: docker-network alias + public FQDN + host FQDN. |
| `cert_path`, `key_path` | ✓ | On-host filesystem paths. The cert is bind-mounted into the container by the per-machine compose file. |
| `ca_path` | mtls=true only | Path to the CA bundle the server will use to validate client certs. Set only when `mtls: true`. |
| `owner_uid`, `owner_gid` | ✓ | **No role-level default.** Must be sourced from `host_container_uids.` (or another per-host fact). The `\| mandatory` filter forces an explicit failure if the host fact is missing — fail-loud on misconfiguration is the desired property. |
| `cert_mode`, `key_mode` | ✓ | Typically `0644` and `0600`. Postgres rejects keys with group/world access (validation runbook Check 9 pinned the FATAL phrasing); the role asserts these post-issuance and aborts the play before reload if they drift. |
| `lifetime` | ✓ | Sourced from a role-level default (`step_ca_x509_cert_lifetime`) which the per-environment vars file overrides — `90d` for production, `45d` for staging (per HSM blast-radius decision). |
| `reload_strategy` | ✓ | One of `runtime_redis`, `runtime_postgres`, `restart_influx`, `noop` (per renewal-architecture decision). Adding a fourth datastore = adding a fifth adapter, not extending the cert-spec. |
| `container` | ✓ | **Compose-namespaced container name** (e.g. `system-integration-postgres-1`), not the bare service name — runbook Check 5 captured this gotcha. The reload adapter and the runtime UID-drift probe (TODO 7) both `docker exec` into this container. |
| `provisioner` | ✓ | Name of the JWK provisioner on step-ca that issues this cert. Sourced from a role-level default (`step_ca_x509_provisioner`) — `staging-services` or `prod-services`. |
| `vault_var_name` | ✓ | Name of the Ansible vault variable holding the provisioner password. Used at issuance only; renewals are cert-as-auth (`step ca renew`) and never touch the password. Same vault → 0400 host tmpfile → unlink convention as `roles/ssh_service_cert/tasks/_per_container.yml`. |
| `mtls` | ✓ | `true` for Redis (server validates client certs); `false` for Postgres and InfluxDB (server-auth-only) per the mTLS scope asymmetry decision. Controls whether `ca_path` is written and whether the reload adapter manages two files (cert+key) or three (cert+key+ca). |

### Per-host UID fact

`host_container_uids` is a separate dict in the same `host_vars` file. Two reasons:

1. **Reuse for TODO 7 runtime drift detection.** The renewal script reads `host_container_uids.` and compares to `docker exec cat /proc/1/status` (PID 1's effective UID — `docker exec ... id` defaults to root and is the wrong probe; runbook Check 5 captured this gotcha). Mismatch = non-zero exit + Tier-1 + Tier-2 alert.
2. **Single source of truth per host.** A future change that pins `influxdb:2.7-rootless` (UID 1001) is a one-line edit in `host_container_uids` rather than a hunt across multiple cert-spec entries.

### Acceptance against TODO 3 clauses

1. ✓ The role takes `owner_uid` as a per-cert-spec parameter — no defaults baked into the role. Defaults file sets `service_tls_certs: []`; cert-specs supply UIDs explicitly.
2. ✓ Per-host vars under `ansible/host_vars//vars_step_ca_vhost_cert.yml` set the UIDs via `host_container_uids` and reference them in cert-specs.
3. ✓ Actual UIDs as observed via `docker exec cat /proc/1/status` are recorded in validation runbook Check 5 (2026-05-08, `input-b.staging`): postgres=999, redis=999, influxdb=1000. Cross-referenced from this section.
4. → TODO 7 (runtime drift detection): the same `host_container_uids` dict is the source of truth at renewal time.

### Alternatives considered

- **One flat dict per cert-spec mixing UID with everything else.** Rejected: makes UID drift harder to reason about, and the runtime drift script would have to walk the cert-spec list to find the value rather than reading the UID dict directly.
- **Role-level UID defaults** (e.g., `default_postgres_uid: 999` in the role's `defaults/main.yml`). Rejected: defeats TODO 3's purpose. A future operator adding a host with non-default UIDs has to remember to override the default. Better to fail-loud than to silently use a wrong default.
- **Per-environment cert-spec lists in `group_vars//`** rather than per-host. Rejected because UIDs are a host-level fact (different hosts can run different image variants); SANs and paths are also host-level. Putting them in `group_vars` would force every host in the group to have identical UIDs, which is exactly the drift-foot-gun we are avoiding.

### Consequences

- **The schema is the contract** between the role and the TODO 7 runtime-drift script and the TODO 4 alert substrate scripts. Field renames are role-version-bump events.
- **Cert-spec count grows linearly with services × variants.** Today: 3 services × 2 environments × {main, ccat} variant where applicable = ~6-9 cert-specs across all hosts. Manageable.
- **`influxdb:latest` UID drift risk** stays open (no version pin), but is now a one-line `host_container_uids.influxdb` edit if it shifts. TODO 7's runtime probe catches the shift at the next renewal fire and refuses to write the new key — fail-closed before the reload would brick InfluxDB.
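
The runtime UID-drift probe described above reads PID 1's effective UID from `/proc/1/status`. A minimal sketch of the parsing and comparison step follows, assuming the status text has already been captured from the container; the function names `effective_uid` and `uid_drifted` are hypothetical and not part of any existing script.

```python
def effective_uid(proc_status: str) -> int:
    """Extract the effective UID from the text of /proc/<pid>/status.

    The `Uid:` line carries four fields in order:
    real, effective, saved-set, and filesystem UID.
    """
    for line in proc_status.splitlines():
        if line.startswith("Uid:"):
            return int(line.split()[2])  # index 0 is the "Uid:" label
    raise ValueError("no 'Uid:' line found in status text")


def uid_drifted(expected_uid: int, proc_status: str) -> bool:
    """True when the container's PID 1 no longer runs as the recorded UID."""
    return effective_uid(proc_status) != expected_uid


sample = "Name:\tpostgres\nUid:\t999\t999\t999\t999\nGid:\t999\t999\t999\t999\n"
print(uid_drifted(999, sample))   # recorded UID matches
print(uid_drifted(1000, sample))  # recorded UID differs
```

A renewal script built on this would exit non-zero on drift, before writing the new key, so the failure stays closed.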

---

## Decision: x509 canary on `input-c.staging` — leading-indicator for the cert plane

### Context

Option A on `allowRenewalAfterExpiry` (Resolved, 2026-05-08) makes the protection contingent on detection: a cert+key snapshot leak auto-bounds at `notAfter` only if the operator notices the renewal chain has been broken before then. The SSH-cert plane already runs 24h user certs that act as an HSM/CA-health canary for the SSH side; the x509 plane has no equivalent today. Service certs are 90d (production) / 45d (staging) and only renew in the last 1/3 of lifetime, so a stuck renewal gives the alert substrate days-of-warning *if it works*.

A 24h x509 canary fails within hours of any HSM/CA breakage on the x509 plane — long before any production cert is at risk. It is the *leading indicator* that proves the alert path is alive, and the smoke test for the JWK provisioner cert-as-auth flow specifically.

### Decision

Issue a 24h-lifetime x509 cert from the `staging-services` JWK provisioner to a non-prod host. **Target host: `input-c.staging`** — deliberately not `input-b.staging` so the canary does not share fate with the CA host itself.

Cert-spec entry (lives in `ansible/host_vars/input-c.staging/vars_step_ca_vhost_cert.yml`):

```yaml
host_container_uids: {}   # canary has no container; no UID needed

service_tls_certs:
  - service: x509-canary
    sans:
      - x509-canary.input-c.staging.data.ccat.uni-koeln.de
      - input-c.staging.data.ccat.uni-koeln.de
    cert_path: /opt/x509-canary/canary.crt
    key_path: /opt/x509-canary/canary.key
    owner_uid: 0
    owner_gid: 0
    cert_mode: "0644"
    key_mode: "0600"
    lifetime: "24h"
    reload_strategy: noop   # new fourth adapter, see below
    container: ""           # no container; canary is host-only
    provisioner: staging-services
    vault_var_name: vault_step_ca_prov_staging_services_password
    mtls: false
```

This relies on a **fourth reload adapter, `noop`**, listed in the Renewal architecture decision: "renew the cert, write the new files, do nothing else".
No reload — nothing reads the canary at runtime. The cert exists only for its own lifecycle metrics. `noop` is general-purpose (the canary is its first user, but a future cert that doesn't need a service reload — e.g., a dual-purpose cert inspected by external tooling — can use it too) and is exempt from the operational-consequence asymmetry argument because there is no service to reload.

### Renewal cadence and failure semantics

- **Cert lifetime:** 24h.
- **Timer cadence:** 12h (matches production cert plan, so the canary exercises the same code path as production renewals).
- **`step ca renew --expires-in` threshold:** 18h. Below that threshold a renew attempt actually contacts the CA; above it the timer fire is a no-op (same gate the production timers will use, just with smaller numbers).
- **Failure threshold for paging:** failure to successfully renew within 18h of `notAfter` = Tier 1 + Tier 2 alert. The 6h gap between the renewal threshold and the page threshold gives one natural retry without paging.

If the canary cert expires (no successful renewal for >24h after notAfter), the alert substrate is itself broken — Tier 2 mail is the canary on the canary.

### Acceptance against TODO 15 clauses

1. ✓ 24h-lifetime x509 cert from `staging-services` JWK on a non-prod, non-CA host (`input-c.staging`).
2. ✓ Renewal timer fires at 12h cadence; failure to renew within 18h of `notAfter` triggers a Tier 1 + Tier 2 page on the substrate from TODO 4 (alert substrate decision).
3. ✓ `step_x509_cert{service=x509-canary} seconds_to_expiry` and `step_x509_cert_last_renewal_success{service=x509-canary}` are wired into the alert substrate **as the first metrics** — end-to-end verification (Tier 1 alert visible in Grafana, Tier 2 mail actually delivered) happens *before* any production cert is enrolled. This is also Check 8 (page-path E2E) in the validation runbook.
4. ✓ The cert-spec entry above is the canary configuration; the `noop` adapter is the implementation.
Single artifact for both purposes. ### Why a non-CA host The canary is supposed to fail fast when the HSM is unreachable. If it lives on `input-b.staging` (which hosts the CA), an `input-b` outage takes down the CA *and* the canary together — the canary's failure is then ambiguous between "CA is down" and "input-b is down and the CA might be fine". Hosting the canary on `input-c.staging` removes that ambiguity: a canary failure with `input-c.staging` up means the CA is unreachable from a peer host, which is exactly the condition the canary exists to detect. ### Consequences - **Phase A scope adds the `noop` adapter** (fourth adapter; trivial — write files, exit 0, emit metrics). Phase A scaffolding gains one cert-spec on `input-c.staging`. - **The canary is the validation-runbook Check 8 target.** Check 8 is currently BLOCKED on TODO 15 (and on Phase A producing the role). Closing TODO 15 design unblocks Check 8 once Phase A lands. - **A new operational duty:** if the canary alerts but no production cert has alerted, the operator's first move is "is the CA reachable from `input-c.staging`?" — `step ca health`, `nc -zv ca.ccat.uni-koeln.de 443` from input-c. Document in the on-call runbook (when one exists). --- ## Decision: Revocation stance — lifetime-as-revocation, no CRL/OCSP ### Decision We do not stand up a CRL or OCSP responder. Compromised certs are handled by **rotating the secret material and waiting for the cert to expire** (90d production, 45d staging). For acute compromises, the runbook below is the response. ### Trade-offs - **CRL.** Operationally simple to publish, but every client has to fetch and trust it. Adding a fetch-and-trust step to telegraf, Grafana, three Celery worker fleets, and ops-db-api is real work for a threat model where we can already roll the underlying secret. - **OCSP.** Real-time but adds a hard dependency on the CA being reachable from every TLS handshake. 
We just spent ADR-0001 (`docs/source/adr/0001-ca-per-vhost-cert-split.md`) carefully containing the CA's reachability surface; OCSP would re-expand it.
- **Lifetime-as-revocation.** The 90d ceiling means a compromised cert is automatically not-trusted within 90d without operator action. For acute compromise we roll the secret immediately; the cert remains technically valid until it expires but the secret it protected is already changed.

### Compromise modes — runbook headlines

Full runbook: see the threat-model document (TODO: link when written).

| Mode | Headline response |
|---|---|
| **Server key leaked** (Redis/Postgres/Influx host private key on disk readable by attacker) | Re-issue the cert with the role (`ccat rotate `), reload via the adapter. Old cert remains valid until `notAfter` but no longer protects anything. |
| **Client key leaked** (Redis client cert on a compromised app host) | Rotate the client cert via the redis_certs successor flow. Same lifetime caveat. |
| **HSM key leaked** (root CA private key compromised) | Stop the CA; cut a new root via ceremony; redistribute via `ca_trust` role; re-issue every leaf cert. This is the catastrophic case and is what `step-ca/ceremony-playbook.pdf` exists for. |

### Consequences

- Operators need to internalise "rolling the secret + waiting for expiry" as the revocation primitive. This is documented at the runbook level, not on every `ccat` CLI invocation.
- A future regulatory audit that asks for "CRL endpoint" gets the answer "no CRL; lifetime ceiling and operator-led rotation". Be prepared to defend that.

---

## Decision: Trust distribution — bind-mount + env vars, not image rebuild

### Decision

The CCAT root CA is distributed to containers via a **bind-mount** of `/etc/pki/ca-trust/source/anchors/ccat-root-ca.crt` (placed there by `roles/ca_trust/`) and an env var pointing each application's TLS library at it (e.g. `SSL_CERT_FILE`, `PGSSLROOTCERT`, etc).
We do **not** bake the CA root into the application container images.

### Reasoning

Baking the root into the image couples root-rotation cadence to CI build cadence: every root rotation triggers a rebuild and redeploy of every image. With bind-mount + env var, root rotation is "update one file on disk via the `ca_trust` role, restart consumers" — independent of CI. This is the same separation already in effect for the SSH cert plane (`roles/ssh_service_cert/` mounts `~/.ssh` from the host into spawned agents, see commit `ce87baa`).

### Consequences

- Container images stay smaller and rebuild less often.
- The `ca_trust` role is now a hard dependency for every host that hosts a TLS-consuming container. This is already true today.
- A misconfigured bind-mount path silently turns into "no CA root" at the container level. The role must verify post-mount that the expected fingerprint is present.

---

## Decision: Compose layering — trust anchor in shared file, layered per context

### Context

Validation runbook Check 6 (2026-05-08) surfaced a structural fact while inventorying the per-machine compose files: every staging-input, prod-input, and chile context deploys with a **single self-contained per-machine compose file**. There is no shared `docker-compose.yml` base in the layering for those contexts. From `src/ccat_dc/_constants.py`:

```python
CONTEXT_COMPOSE: dict[str, list[str]] = {
    ...
    "staging-input-a": ["docker-compose.staging.input-a.yml"],
    "staging-input-b": ["docker-compose.staging.input-b.yml"],
    "staging-input-c": ["docker-compose.staging.input-c.yml"],
    "prod-input-a": ["docker-compose.production.input-a.yml"],
    "prod-input-b": ["docker-compose.production.input-b.yml"],
    "prod-input-c": ["docker-compose.production.input-c.yml"],
}
```

This invalidates the implicit assumption in #95 that an `x-ccat-trust:` YAML anchor could live in a single base file and merge into each app service via `<<: *ccat-trust`.
There is no single base for the contexts that matter; YAML anchors only resolve within a single file. So the anchor cannot be "defined once, merged everywhere" by accident — it needs an explicit wiring decision.

### Decision

Define the anchor and the per-service merge entries in a new file `docker-compose.trust.yml`, and layer it into every applicable context via `CONTEXT_COMPOSE`. The trust file is the **single source of truth** for "which services get the trust bundle bind-mount":

```yaml
# docker-compose.trust.yml (sketch)
x-ccat-trust: &ccat-trust
  volumes:
    - /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem:/etc/ssl/certs/ca-bundle.crt:ro

services:
  postgres:   { <<: *ccat-trust }
  redis:      { <<: *ccat-trust }
  influxdb:   { <<: *ccat-trust }
  ops-db-api: { <<: *ccat-trust }
  ops-db-ui:  { <<: *ccat-trust }
  grafana:    { <<: *ccat-trust }
  pgadmin:    { <<: *ccat-trust }
  db-backup:  { <<: *ccat-trust }
  # ...one line per app service across all 7 per-machine files...
```

Plus the corresponding `_constants.py` change:

```python
"staging-input-a": ["docker-compose.staging.input-a.yml", "docker-compose.trust.yml"],
"staging-input-b": ["docker-compose.staging.input-b.yml", "docker-compose.trust.yml"],
"staging-input-c": ["docker-compose.staging.input-c.yml", "docker-compose.trust.yml"],
"prod-input-a": ["docker-compose.production.input-a.yml", "docker-compose.trust.yml"],
"prod-input-b": ["docker-compose.production.input-b.yml", "docker-compose.trust.yml"],
"prod-input-c": ["docker-compose.production.input-c.yml", "docker-compose.trust.yml"],
```

The chile production context (`docker-compose.production.chile.yml`) is not in `CONTEXT_COMPOSE` today; its layering choice mirrors the input nodes when it is added.

On deploy, `docker compose -f -f docker-compose.trust.yml` merges each per-service entry from the trust file into the same-named service in the per-machine file.
Adding a new app service that needs trust is a one-line addition to `docker-compose.trust.yml`, not a touch in seven separate per-machine files.

### Alternatives considered

- **Option A — duplicate the anchor in each per-machine file.** Concrete, no `_constants.py` wiring change. Rejected because it creates a 7-file sync target — adding a new app service that needs trust means remembering to annotate it in whichever per-machine file it lands in, and the failure mode of forgetting is a silent TLS rejection at runtime. Validation runbook Check 6's per-service `has_trust: true` spot-check catches the oversight, but only if it is actually run.
- **Option C — bind-mount the parent directory** (`/etc/pki/ca-trust/extracted/pem/`) instead of the single bundle file. Would also resolve TODO 14 (single-file bind-mount staleness on rotation) because directory bind-mounts re-resolve dentries on lookup. Rejected from this section because it changes the *what* of the bind-mount; the layering question is the *how-to-wire*. Track the directory-vs-file question under TODO 14; if Option C is chosen there, this layering decision is unaffected (the trust file's volume entry just changes shape).

### Why Option B (the shared trust file chosen above) wins

- **One source of truth.** The list of services-that-need-trust lives in `docker-compose.trust.yml`, not scattered across 7 per-machine files.
- **Auditable PR review.** A Phase B PR diff is one file plus a 6-line `_constants.py` change. Reviewers do not have to grep across 7 compose files to verify coverage.
- **Adding a new context** (e.g., a future input-d) is "add the per-machine compose, append trust to its `CONTEXT_COMPOSE` entry" — a known shape, not a new sync rule.
- **Dev/local opt-out is explicit.** The dev contexts (`dev`, `localdev`, `local`) keep self-signed certs per the TLS-hard-cutover-policy convention (2026-05-07 memory). Not adding the trust file to those entries in `CONTEXT_COMPOSE` is the explicit opt-out, visible in the diff and reviewable.
- **YAML anchor mechanics are unchanged.** The `<<: *ccat-trust` merge happens within `docker-compose.trust.yml` itself; no cross-file anchor references are required (docker-compose does not resolve anchors across files anyway).

### Consequences

- **`docker-compose.trust.yml` becomes a hard dependency** for every context that lists it. A missing or malformed trust file fails the deploy at compose-render time, before any container starts — fail-closed, which is the desired property.
- **`_constants.py` is the source of truth for context wiring.** Future ADRs that touch deployment topology should reference it.
- **Phase B PR shape**: one new file (`docker-compose.trust.yml`, ~70 lines including the anchor and 60 service entries), one `_constants.py` patch (6 lines), no per-machine compose edits. Validation runbook Check 6's deferred per-service spot-check becomes `docker compose -f ... config | yq` against the merged render and lands in the same PR.
- **Service inventory (validation runbook Check 6, 2026-05-08)** — this is the input set for `docker-compose.trust.yml`'s `services:` mapping:

  | File | Total | needs trust | exempt |
  |---|---|---|---|
  | `production.input-a.yml` | 12 | 11 | promtail |
  | `production.input-b.yml` | 10 | 8 | loki, promtail |
  | `production.input-c.yml` | 8 | 7 | promtail |
  | `production.chile.yml` | 9 | 8 | promtail |
  | `staging.input-a.yml` | 13 | 11 | loki, promtail |
  | `staging.input-b.yml` | 10 | 8 | loki, promtail |
  | `staging.input-c.yml` | 8 | 7 | promtail |

  Total: **60 app-service annotations** across the 7 files.
- **Exemptions are deliberate, not oversights.** `promtail` ships logs to Loki via plain HTTP and has no DB connection; `loki` is a log store with no DB clients in the compose graph. Both are defence-in-depth candidates if a future change makes them speak to a step-ca-issued vhost — at which point they become a one-line addition to `docker-compose.trust.yml`. Not load-bearing for #95.
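
The per-service `has_trust` spot-check referred to in this decision could look roughly like the Python below. It is a minimal sketch, not the runbook's actual check. It assumes the merged render has already been obtained as JSON (for example with `docker compose ... config --format json`) and parsed into a dict. The names `has_trust`, `missing_trust`, `TRUST_TARGET`, and `EXEMPT` are hypothetical; the mount target comes from the trust-file sketch and the exemption set from the service inventory table.

```python
# Sketch only: function and constant names are illustrative, not existing code.

# Assumption: container-side path of the bundle, as in the trust-file sketch.
TRUST_TARGET = "/etc/ssl/certs/ca-bundle.crt"
# Assumption: exemptions as listed in the service inventory table.
EXEMPT = {"promtail", "loki"}


def has_trust(service: dict) -> bool:
    """Return True if the service mounts something read-only at TRUST_TARGET."""
    for vol in service.get("volumes") or []:
        if isinstance(vol, str):
            # Short syntax: "source:target[:mode]", mode may be e.g. "ro" or "ro,z".
            parts = vol.split(":")
            mode = parts[2] if len(parts) > 2 else ""
            if len(parts) >= 2 and parts[1] == TRUST_TARGET and "ro" in mode.split(","):
                return True
        elif vol.get("target") == TRUST_TARGET and vol.get("read_only", False):
            # Long syntax, which is what a rendered config normally contains.
            return True
    return False


def missing_trust(rendered: dict) -> list[str]:
    """List non-exempt services in a rendered compose config without the mount."""
    services = rendered.get("services") or {}
    return sorted(
        name
        for name, svc in services.items()
        if name not in EXEMPT and not has_trust(svc or {})
    )
```

A caller would parse the rendered JSON (for instance with `json.load(sys.stdin)`), pass the result to `missing_trust`, and fail the check when the returned list is non-empty. Keeping the exemption set explicit in the check mirrors the point that exemptions are deliberate rather than oversights.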
--- ## Decision: Alert substrate — tiered, with a TLS-independent backstop ### Context The PRD draft proposed renewal-failure alerts flowing telegraf → InfluxDB → Grafana → ops chat. The architect review caught a circular dependency: that alert path itself depends on the TLS trust chain we're trying to monitor. If the trust chain breaks, the alert telling us so is silenced by the same break. An earlier draft of this ADR section recommended "piggyback on the SSH-cert plane" on the premise that the SSH-cert plane's failure-notification path is *by construction independent* of the database TLS chain. **That premise was wrong.** Inspection of `ansible/roles/ssh_service_cert/templates/step-cert-monitor.sh.j2` plus `ansible/roles/system_setup/files/telegraf.conf:960` shows the SSH-cert plane emits `step_cert` and `step_renew_failed` measurements via Telegraf `[[inputs.exec]]`, and Telegraf's `[[outputs.influxdb_v2]]` writes to `http://db.data.ccat.uni-koeln.de:8086` — the same InfluxDB on input-b that this PRD is hardening. The SSH-cert plane shares fate with the database TLS chain. Piggybacking on it does not break the circular dependency; it just inherits it under a different name. The fix is not to rebuild on a different single substrate — it is to accept that any single substrate convenient enough to use day-to-day will share fate with *something* in the stack. We need a backstop tier that is genuinely independent. ### Decision **Tiered substrate, three independent paths:** #### Tier 1 — Primary (visibility + everyday paging) Telegraf `[[inputs.exec]]` on every cert host emits, mirroring the existing SSH-cert plane's `step-cert-monitor.sh.j2`: - `step_x509_cert,service=...,host=... seconds_to_expiry=Ni` - `step_x509_renew_failed,service=...,unit=... value=0|1` - `step_x509_cert_last_renewal_success,service=... seconds_ago=Ni` Telegraf → InfluxDB → Grafana → Matrix room (page channel: `#ccat-ops:matrix.data.ccat.uni-koeln.de`). 
Catches single-service renewal failures, perms drift, image UID drift (TODO 7) — anything that doesn't take down InfluxDB or Grafana itself. Tier 1 *does* transit step-ca-issued TLS once Phase E lands (Telegraf → InfluxDB will use the new server cert). This is acknowledged, not denied. It is the "convenience" tier; it is not load-bearing for the catastrophic case. #### Tier 2 — Backstop (TLS-independent, catches catastrophic failures) Two host-local mechanisms, both calling `mailx` to the existing `admin_email_addresses` alias (already configured by `ansible/roles/system_setup/tasks/sendmail.yml` — root → admin alias is in place via `/etc/aliases`): - **`OnFailure=`** unit on every renewal systemd timer. Fires immediately when a renewal unit reports `failed`. Mail body includes hostname, service, unit name, last 20 lines of `journalctl -u `. - **Daily heartbeat cron** at 06:00 UTC sends mail "all certs OK on $HOSTNAME, soonest-expiry=Nd, issuance-events-today=N" with one line per cert. **Absence of mail for 36h on any host = problem**, even if no specific failure was detected. The mail path goes via the host MTA (sendmail) → uni-köln SMTP relay → admin inbox. **This tier does not transit any step-ca-issued TLS cert.** It is the only path that survives: - HSM offline (no new certs issuable). - InfluxDB down on input-b (Tier 1 metrics black-hole). - Grafana down on input-b. - Matrix homeserver down on input-b. - Network partition between input-a/c and input-b. The only break-conditions for Tier 2 are host network down or the external SMTP relay down — both known operational classes that are not silently coupled to step-ca. 
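
To make the Tier 2 heartbeat concrete, the following Python sketch composes the mail body and evaluates the 36h absence rule. It is illustrative only: the ADR specifies the headline format and "one line per cert", while the `CertStatus` type, the function names, and the exact per-cert line layout are assumptions. Sending the mail (via `mailx`) and reading expiry dates from the on-disk certs are deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# From the Tier 2 description: absence of mail for 36h on any host = problem.
HEARTBEAT_MAX_SILENCE = timedelta(hours=36)


@dataclass(frozen=True)
class CertStatus:
    """Hypothetical per-cert record; not an existing type in the repo."""
    service: str
    not_after: datetime  # expiry; same timezone-awareness as `now`


def heartbeat_body(hostname: str, certs: list[CertStatus],
                   issuance_events_today: int, now: datetime) -> str:
    """Build the daily heartbeat text: one headline plus one line per cert."""
    if not certs:
        raise ValueError("heartbeat needs at least one cert-spec entry")
    days_left = {c.service: (c.not_after - now).days for c in certs}
    headline = (
        f"all certs OK on {hostname}, "
        f"soonest-expiry={min(days_left.values())}d, "
        f"issuance-events-today={issuance_events_today}"
    )
    # Assumption: per-cert line layout; the ADR only asks for one line per cert.
    per_cert = [f"  {svc}: expires-in={d}d" for svc, d in sorted(days_left.items())]
    return "\n".join([headline, *per_cert])


def heartbeat_overdue(last_mail_at: datetime, now: datetime) -> bool:
    """True when the last heartbeat from a host is 36h old or older."""
    return now - last_mail_at >= HEARTBEAT_MAX_SILENCE
```

The absence check has to run somewhere other than the host being watched (an operator's mail filter, for example), because a host that cannot send mail cannot report its own silence.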
#### Tier 3 — Issuance audit (anomaly detection)

Every JWK-password-using `step ca certificate` invocation in the role wraps its call site in a logger trap that writes a structured journald line:

```
ccat-step-issuance: host=$HOSTNAME service=$SVC ts=$ISO triggered_by=$USER
```

promtail ships journald to Loki; a Grafana alert fires when issuance-events-per-week exceeds the expected baseline (production: ~6/year per service after Phase A; staging: ~12/year per service). Mostly Tier-1 plumbing, but the daily heartbeat mail (Tier 2) also includes `issuance-events-today=N` — so an attacker who silences Loki and InfluxDB still has to silence the host MTA path to hide issuance events.

### How this maps to the seven acceptance clauses

| # | Clause | Tier 1 | Tier 2 | Tier 3 |
|---|---|---|---|---|
| 1 | Substrate | Telegraf+Grafana+Matrix | cron+`OnFailure`+`mailx` | journald+Loki+Grafana |
| 2 | Page channel | Matrix `#ccat-ops` | email to `admin_email_addresses` | (rolls up into 1+2) |
| 3 | On-call contract | **deferred to a follow-up — see "Out-of-scope" below** | | |
| 4 | No-step-ca-TLS statement | acknowledged transits step-ca TLS | **does not transit** | partial (Loki not on step-ca today) + Tier-2 backstop |
| 5 | Renewal-failure alert | `step_x509_renew_failed > 0` → Grafana alert | systemd `OnFailure=` mail | n/a |
| 6 | Renewal-success heartbeat | `last_renewal_success_seconds_ago > 24h` alert | daily 06:00 mail; absence ≥ 36h = problem | n/a |
| 7 | Issuance audit log | n/a | "issuance-events-today=N" line in heartbeat mail | structured journald + Loki alert on >2σ above 30d baseline |

### What this changes in the existing infrastructure

- **New Telegraf input-exec script** templated by the `step_ca_vhost_cert` role, parallel to `roles/ssh_service_cert/templates/step-cert-monitor.sh.j2` — same pattern, x509 measurements instead of SSH ones. ~1 small PR.
- **Renewal systemd timer template** gains `OnFailure= ccat-cert-mail@%i.service` and a sibling `ccat-cert-mail@.service` unit that calls `mailx`. ~1 small PR. - **Daily heartbeat cron entry** (`/etc/cron.d/ccat-cert-heartbeat`) templated per host from the cert-spec list. ~1 small PR. - **Issuance audit log** is a one-line `logger -t ccat-step-issuance` wrapper around the issuance script and a Grafana/Loki alert rule. Folded into the role's issuance task. Three small Phase A PRs, parallelisable with the role itself; none depends on the cert-issuance role being complete first. ### Alternatives considered (why this won over the original options) - **Single-substrate, "use what exists and is independent" (original recommendation: SSH-plane piggyback).** Rejected because the SSH-plane is not actually independent — it shares the same Telegraf → InfluxDB pipe. Single-substrate framing is the bug. - **cron + mailx as the *only* path.** Robust, but operationally thin: no silencing, no ack, no per-service severity, no dashboard. Acceptable as a backstop; not enough as the everyday path. - **Pushgateway + Alertmanager over plain HTTP on a private docker network.** Adds two new components for one alert class. Yagni until we have at least three substrates that would benefit from a unified alerting layer. Revisit when the alerting story is mature enough to consolidate. ### Why a tiered design is the honest answer A single substrate convenient enough to be the daily path will share fate with something. The architect's worry was real; the fix is not "find a magically-independent single path" (no such path exists at this scale of infrastructure) but "have a backstop that is deliberately inconvenient — mail to a mailing list — so it is actually independent". Tier 2 is operationally annoying on purpose: mail is not a great paging UX, but it is a great *backstop UX* because it doesn't transit any of the things we are trying to alert on. 
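
For orientation, the three Tier 1 measurements named in this decision can be rendered as InfluxDB line protocol with a few lines of Python. This is a sketch under assumptions, not the planned exec script: the existing monitor is a shell template, the function names here are hypothetical, and the unit name in the usage note is invented for illustration. Integer fields carry the `i` suffix as in the ADR's notation, and `value` is written without a suffix, mirroring `value=0|1` above.

```python
def _esc(tag_value: str) -> str:
    """Escape commas, equals signs, and spaces, as line protocol requires in tags."""
    for ch in (",", "=", " "):
        tag_value = tag_value.replace(ch, "\\" + ch)
    return tag_value


def cert_lines(service: str, host: str, unit: str,
               seconds_to_expiry: int, renew_failed: bool,
               last_success_seconds_ago: int) -> list[str]:
    """Return one line-protocol record per Tier 1 measurement for a single cert."""
    s, h, u = _esc(service), _esc(host), _esc(unit)
    return [
        f"step_x509_cert,service={s},host={h} "
        f"seconds_to_expiry={int(seconds_to_expiry)}i",
        f"step_x509_renew_failed,service={s},unit={u} "
        f"value={1 if renew_failed else 0}",
        f"step_x509_cert_last_renewal_success,service={s} "
        f"seconds_ago={int(last_success_seconds_ago)}i",
    ]
```

A script run by Telegraf `[[inputs.exec]]` with `data_format = "influx"` would print these lines to stdout, for example `print("\n".join(cert_lines("x509-canary", "input-c.staging", "example-renew.service", 43200, False, 3600)))`, where `example-renew.service` is a placeholder unit name.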
### Consequences - **Three artifacts to maintain** instead of one. Worth it for the load-bearing independence guarantee. - **`admin_email_addresses` is now load-bearing.** Document the alias contents and the SMTP relay path in the on-call runbook (when one exists). Test the path during Phase A by deliberately failing a staging renewal and confirming the mail arrives. - **Grafana / InfluxDB / Matrix outage scenarios are now page-quiet on Tier 1 by design.** Operators must internalise that "no Tier 1 alert" means "Tier 1 is up", not "all is well". The daily Tier 2 heartbeat is the positive-confirmation signal. - **Future consolidation** (e.g., Alertmanager) replaces Tier 1 without disturbing Tier 2. Tier 2 is the architectural floor. --- ## Open questions - **Is `runtime_redis` (CONFIG SET) sufficient on Redis 7 with TLS-only listeners?** The `tls-port` directive isn't reloadable via CONFIG SET in some Redis versions; verify on the version we ship. If not, `runtime_redis` degrades to a `restart_redis` adapter and Redis joins InfluxDB in the 30s-downtime club. - **mTLS asymmetry follow-up.** Schedule a review at the next data-transfer credentials refactor. Don't block #95. - **Threat-model document link.** The full leak-response runbook lives there; this ADR carries the headlines. Link when written. ### Resolved - **(2026-05-08) Does `step ca renew` succeed against an already-expired authenticating cert?** Resolved by configuration inspection (validation runbook Check 4). `step ca provisioner add --allow-renewal-after-expiry` exists as a flag; `step-ca/provisioners-add.sh` does NOT pass it on `prod-services` or `staging-services`. Default is `false`. Therefore `step ca renew` on an expired cert is refused under the current CA config. **Decision: keep the strict default (`allowRenewalAfterExpiry: false`, i.e. 
Option A).** Threat-model trade-off: - Service-host snapshot leak (cert+key only): the JWK provisioner password is NOT on service hosts in steady state — vault-staged as a 0400 host tmpfile during issuance, unlinked in the `always:` block of the issuance play. Steady-state renewal uses cert-as-auth and needs no password. So a snapshot leak gives the attacker cert+key but not the password, and Option A's "expired = denied" semantic auto-bounds the leak at `notAfter` *if the attacker fails to renew in time*. Detection-then-host-rotation breaks the renewal chain. - Controller compromise (saiyajin / Jenkins-on-input-b): both options are equally lost. Vault key lives there. - Persistent service-host compromise spanning an issuance window: attacker eventually grabs the 0400 tmpfile. Both options equally lost. - Operational cost of Option A: HSM offline > 30d production budget (15d staging) requires manual re-issuance ceremony — vault → 0400 tmpfile → run issuance script. Same pattern as today's vhost cert and `ssh_service_cert/_per_container.yml`. Option A's protection is contingent on detection. Therefore monitoring + canary become load-bearing (TODO 15 in the pre-implementation TODO list, plus expanded acceptance for the alert substrate in TODO 4). - **(2026-05-08) `update-ca-trust extract` atomicity.** Resolved by validation runbook Check 3. `update-ca-trust` swaps the bundle via atomic rename on RHEL 10.1 (inode change verified). No partial-read window on the host filesystem. Downstream nuance: Linux single-file bind-mounts pin the source inode, so atomic rename on the host means containers see the *old* bundle until restart — tracked as TODO 14, not a blocker for this ADR. - **(2026-05-08) Trust-anchor compose layering.** Resolved by validation runbook Check 6 + this ADR's "Decision: Compose layering" section. New `docker-compose.trust.yml` is the single source of truth for service-needs-trust; layered into each applicable context via `CONTEXT_COMPOSE`. 
TODO 16 closed on this ADR section landing. - **(2026-05-08) Break-glass SSH access during HSM-down >24h.** Resolved by static review of existing infrastructure rather than by adding a new artifact. The architect's concern presumed step-ca-issued user certs are the only operator auth path; `ansible/roles/system_setup/tasks/nitrokey_ssh.yml` applies per-operator FIDO2 hardware-key pubkeys to plain `authorized_keys` on every managed host (outside the `AuthorizedPrincipalsFile` cert path), and out-of-band hardware consoles cover hardware-level recovery. The Nitrokey path survives any step-ca outage by construction. TODO 5 dropped; Check 11 signed off as N/A. See "Operational notes" for the role-split rationale (Nitrokey for core admins, step-ca SSH certs for remote admins). - **(2026-05-08) x509 canary on `input-c.staging`.** Resolved by this ADR's "Decision: x509 canary on `input-c.staging`" section. 24h cert from `staging-services` JWK on a non-CA host (`input-c.staging`); 12h timer cadence; failure to renew within 18h of `notAfter` triggers Tier 1 + Tier 2 alert. Adds a fourth `noop` reload adapter (general-purpose; canary is its first user). Doubles as validation runbook Check 8 (page-path E2E). TODO 15 closed on this ADR section landing. - **(2026-05-08) Cert-spec schema and UID parameterisation.** Resolved by this ADR's "Decision: Cert-spec schema — parameterised UIDs, no defaults baked in" section. UIDs are per-host facts (`host_container_uids` dict) referenced from cert-specs via `\| mandatory` so a missing fact fails the play loudly. The schema is the shared contract for the role, the TODO 7 runtime-drift script, and the TODO 4 alert substrate's service labels. TODO 3 closed on this ADR section landing. - **(2026-05-08) Alert substrate.** Resolved by replacing the single-substrate framing (SSH-plane piggyback) with a tiered design — see "Decision: Alert substrate — tiered, with a TLS-independent backstop". 
The SSH-plane piggyback recommendation in an earlier draft of this ADR was based on a wrong premise (the SSH plane shares the same Telegraf → InfluxDB pipe and so shares fate with the database TLS chain it was supposed to monitor). The tiered fix: Tier 1 Telegraf+Grafana+Matrix for everyday paging, Tier 2 cron+`OnFailure`+`mailx` to `admin_email_addresses` as the load-bearing TLS-independent backstop, Tier 3 journald+Loki+Grafana for issuance-frequency anomaly detection. TODO 4 closed on this ADR section landing. The on-call hand-off contract clause is explicitly deferred to a follow-up — channels exist; rotation contract is a team-structure decision for when the rotation exists.

---

## Out-of-scope

Things this PRD and ADR explicitly do not address. Each item is here because someone has asked or might reasonably ask, and the answer is "not in this rollout":

- **2-week F→G dual-trust soak.** Waived under the time-bound setup-mode argument in "Decision: Migration style". Revisit if production becomes populated before Phase G ships. Do not cite this ADR as precedent for skipping a soak on a populated production stack.
- **Migrating Redis off mTLS to server-auth-only-with-password.** Inertia, not principle (see "Decision: mTLS scope asymmetry"). Revisit at the next data-transfer or ops-db-api credentials refactor.
- **CRL or OCSP infrastructure.** Lifetime-as-revocation only (see "Decision: Revocation stance"). A regulatory ask for a CRL endpoint is a future ADR.
- **`ops-db-api` inbound TLS** (`nginx-proxy → ops-db-api`). Currently undecided (TODO 11). Once chosen, the answer goes into "Operational notes" if in-scope, or remains here if explicitly out-of-scope, or moves to its own ADR.
- **Cert-transparency / public-log integration.** Step-ca is a private CA; not applicable.
- **Baking the CCAT root into application images.** Trust distribution decision: bind-mount + env var, not image rebuild.
- **A unified CLI surface upfront** (`ccat tls rotate`, `ccat tls status`). YAGNI; design after the role works. - **Renewal-job log retention beyond `journalctl`.** Covered by the general logging / Loki policy, not this PRD. - **Backup-as-cert-recovery-path.** Backup coverage of service-cert directories is not confirmed by ITCC (TODO 17). The role re-applying after a host reinstall is the recovery path; backups are best-effort defence in depth, not load-bearing. - **F→G soak in any future TLS migration on populated production.** See "Decision: Migration style" → Time-bound. Future migrations on a populated stack must use a soak. - **On-call hand-off contract for the alert substrate** (who acks, escalation timeout, expected MTTR). Channels exist (Tier 1: Matrix `#ccat-ops`, Tier 2: `admin_email_addresses` mail). The rotation/ack/MTTR contract is a team-structure decision deferred until a real on-call rotation exists. Track as a follow-up; not blocking PRD #95. --- ## Operational notes This section consolidates the operational concerns surfaced through the per-decision sections above into one place for on-call. Each item points back to where the rationale lives. - **HSM offline budget.** 30d production / 15d staging (HSM blast radius decision). Beyond budget: manual re-issuance ceremony via JWK provisioner password (vault → 0400 host tmpfile → unlink in an `always:` block of the issuance play). This is not auto-recovery — it requires an operator with vault access to run the issuance script. - **Renewal cadence.** 12h timer per host, modelled on `step-ca/renew-vhost-cert.sh`. Most fires are no-ops because `step ca renew` only contacts the CA in the last 1/3 of cert lifetime. A misconfigured timer (or a `--force` storm during rollout) is a CA-DoS risk; throttle / serialise mass issuance during phase rollouts (TODO 6). - **Trust-anchor rotation requires container restart.** Single-file bind-mounts pin the source inode (TODO 14). 
Any change to `/etc/pki/ca-trust/source/anchors/` followed by `update-ca-trust extract` REQUIRES a rolling restart of every container that bind-mounts the trust bundle. The host gets the new file atomically; running containers do not. Either accept this and document the restart in the rotation procedure, or move to a directory bind-mount (TODO 14 alternative).
- **Postgres replica during rotation.** Primary and replica must not renew simultaneously while replication is mid-write (TODO 10). The chosen ordering — primary-first with a wait gate, or replica-first, or a coordination lock — is recorded under TODO 10 acceptance and migrates here once decided.
- **Alert path independence — tiered substrate.** Three paths (alert substrate decision, TODO 4 closed). Tier 1 (Telegraf → InfluxDB → Grafana → Matrix `#ccat-ops`) is the everyday paging path and shares fate with input-b services. Tier 2 (systemd `OnFailure=` + daily 06:00 cron heartbeat → `mailx` to `admin_email_addresses`) is the load-bearing TLS-independent backstop — does not transit any step-ca-issued cert. Tier 3 (journald → promtail → Loki → Grafana) is the issuance-anomaly audit. **Operational rule for on-call:** "no Tier 1 alert" means "Tier 1 is up", not "all is well"; the daily Tier 2 heartbeat mail is the positive-confirmation signal — absence ≥ 36h on any host is itself a problem.
- **Container UIDs are per-host parameters.** Cert-spec UIDs (TODO 3) are parameterised, not hardcoded; runtime UID drift is detected by the renewal script (TODO 7) by reading `/proc/1/status` from PID 1 inside each container (`docker exec ... id` defaults to root and is the wrong probe — runbook Check 5 captured this gotcha). Today's values, observed on `input-b.staging` 2026-05-08: Redis 999, Postgres 999, InfluxDB 1000. `influxdb:latest` is the only unpinned image in the stack — drift risk concentrates there.
- **Backup is not the cert recovery path.** Service-cert directories may not be in the central Commvault policy (TODO 17, ITCC ticket pending). Recovery on host reinstall is "re-run the `step_ca_vhost_cert` role". Document this in the role README. - **Break-glass SSH already provided by existing infrastructure.** The architect's worry — "if step-ca is down >24h, every operator's SSH cert expires and nobody can SSH in to fix it" — assumed step-ca-issued user certs are the only operator auth path. They are not. `ansible/roles/system_setup/tasks/nitrokey_ssh.yml` applies per-operator FIDO2 hardware-key pubkeys (`roles/system_setup/files/pubkeys//*.pub`) directly to `authorized_keys` on every managed host, outside the `AuthorizedPrincipalsFile` cert path. Out-of-band hardware access (iDRAC / hypervisor console) provides the second tier for hardware-level recovery. The role split is: Nitrokey for **core admins** (physically present, hardware key in pocket), step-ca SSH certs for **out-of-core / remote admins** where shipping a hardware key is impractical. TODO 5 is dropped on this basis; Check 11 signed off as N/A. - **Compose layering is anchored in `docker-compose.trust.yml`.** Single source of truth for the service-trust bind-mount matrix (compose-layering decision). Validation runbook Check 6 inventory is the input set; per-service `has_trust: true` spot-check lands in the same Phase B PR as the trust file itself. --- ## Consequences — overall **What becomes easier:** - One trust root for the whole CCAT stack (SSH plane, vhost cert, three databases). Operators need to know one CA, one root file path, one renewal model. - Retiring `roles/redis_certs/` and `redis//certs/` removes a homegrown PKI with four parallel CAs that nobody outside this team can audit. - Adding a fourth TLS-consuming datastore later is "add a reload-strategy adapter", not "build new PKI". **What becomes harder:** - The CCAT root CA / HSM is now load-bearing for *more* things. 
HSM ceremony cadence and HSM availability matter more than they did. The soft-offline budget gives us 30d production / 15d staging headroom but the calculus is now "how long can the HSM be offline" not "how long can the redis-certs CA be offline" (which was effectively infinite because that CA was a file on input-b's disk).
- The pluggable-adapter design means the role has three test surfaces, not one. Plan for that in the test plan.

**New operational duties:**

- Watch the SSH-cert-plane notification stream for DB cert renewal failures (decision-section: alert substrate).
- Maintain the schema entry for any new `vault_step_ca_prov_*` passwords (lines up with the existing vault schema work in `data-center-computer-setup/vars_application_schema.yml`).
- The `ccat redis-certs` CLI commands (currently in `ctl`) get superseded; plan a CLI surface for the new role (`ccat tls rotate `, `ccat tls status`). Don't build it before the role works; YAGNI.

---

## References

Files verified to exist in the repo at the time of writing:

- `step-ca/issue-vhost-cert.sh` — one-shot issuance pattern (JWK provisioner password via `--password-file`, atomic `.new` install, docker exec reload).
- `step-ca/renew-vhost-cert.sh` — `step ca renew` cert-as-auth pattern, PRE_MTIME/POST_MTIME conditional reload, 12h timer cadence.
- `ansible/roles/ssh_service_cert/tasks/_per_container.yml` — password-staging-from-vault → 0400 host tmpfile → unlink convention, `community.docker.docker_container_exec` with stdin-only password delivery.
- `ansible/roles/ca_trust/` — RHEL system-anchor distribution for the CCAT root.
- `ansible/roles/redis_certs/` — homegrown PKI being retired by this ADR.
- `redis/{main,ccat,develop,develop-ccat}/certs/` — per-variant CAs being sunset.
- `grafana/provisioning/{production,staging}/datasources/influxdb-datasource.yaml` — current `tlsSkipVerify: true` lines, plain HTTP datasource URL.
- `docs/source/adr/0001-ca-per-vhost-cert-split.md` — prior ADR on the CA's own vhost cert; format and reasoning style mirrored here.
- PRD: ccatobs/system-integration#95 — defers full decision tree to this document.
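
For implementers of the container-UID drift check noted above (renewal script, TODO 7), here is a minimal sketch of the PID 1 probe. It is not the renewal script itself. It assumes the text of `/proc/1/status` has already been captured from inside the container, for example with `docker exec <container> cat /proc/1/status`. The function names and the service keys `redis`, `postgres`, and `influxdb` are hypothetical. The expected UIDs are the values this ADR records for `input-b.staging`; in the role they would be per-host parameters.

```python
from typing import Dict, Optional

# Assumption: values observed on input-b.staging (2026-05-08) per this ADR.
# In the real role these would be per-host parameters, not constants.
EXPECTED_UIDS: Dict[str, int] = {"redis": 999, "postgres": 999, "influxdb": 1000}


def parse_real_uid(status_text: str) -> int:
    """Return the real UID from the contents of a /proc/<pid>/status file."""
    for line in status_text.splitlines():
        if line.startswith("Uid:"):
            # Fields after the label: real, effective, saved-set, filesystem.
            return int(line.split()[1])
    raise ValueError("no 'Uid:' line found in status text")


def uid_drift(
    service: str,
    status_text: str,
    expected: Dict[str, int] = EXPECTED_UIDS,
) -> Optional[str]:
    """Return a drift message, or None when PID 1 runs as the expected UID."""
    actual = parse_real_uid(status_text)
    want = expected[service]
    if actual == want:
        return None
    return f"{service}: PID 1 runs as UID {actual}, cert spec expects {want}"


if __name__ == "__main__":
    # Hypothetical status text, as if an unpinned image had changed its user.
    sample = "Name:\tinfluxd\nUid:\t1500\t1500\t1500\t1500\nGid:\t1000\t1000\t1000\t1000\n"
    print(uid_drift("influxdb", sample))
```

Parsing the status file, rather than calling `id` through `docker exec`, reports the user that PID 1 actually runs as, which is the user that must be able to read the issued key material.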