ADR-0002 — Step-CA-issued TLS certificates for Redis, Postgres, and InfluxDB#

Status. Proposed, 2026-05-08.

Related. ccatobs/system-integration#95 (PRD).

Supersedes. No prior ADR; what this retires is the homegrown PKI under ansible/roles/redis_certs/ and redis/<variant>/certs/.

Context#

Today three datastores in the stack use ad-hoc or absent TLS:

  • Redis — homegrown CA per environment variant (ansible/roles/redis_certs/, four self-signed CAs under redis/{main,ccat,develop,develop-ccat}/certs/). mTLS is configured but the CA story is bespoke and cannot be audited as part of the rest of the CCAT trust chain.

  • Postgres — TLS not enforced; client traffic in the clear or with trust-on-first-use.

  • InfluxDB — fronted today on plain HTTP. The Grafana datasource at grafana/provisioning/production/datasources/influxdb-datasource.yaml literally points at http://data.ccat.uni-koeln.de:8086 with tlsSkipVerify: true. The same pattern lives in grafana/provisioning/staging/datasources/.

The CCAT step-ca endpoint already issues:

  • the SSH user-cert plane (ansible/roles/ssh_service_cert/),

  • the public TLS cert for the CA’s own vhost (step-ca/issue-vhost-cert.sh, step-ca/renew-vhost-cert.sh, step_ca_vhost_cert.timer every 12h).

PRD #95 proposes routing the three datastores onto the same step-ca-issued path: a single root of trust, predictable lifetimes, and the same operator muscle memory.

A senior-architect review blocked PRD #95 pending this ADR. The PRD names the headline decisions (“JWK”, “hard cutover”, “multi-SAN no IP”) but defers the reasoning to here. This ADR is that reasoning.

Important PRD correction#

The PRD claims it copies an existing step_ca_vhost_cert Ansible role verbatim. There is no such role on disk. The prior art is:

  • the scripts at step-ca/issue-vhost-cert.sh (one-shot issuance) and step-ca/renew-vhost-cert.sh (PRE_MTIME / step ca renew / POST_MTIME / conditional reload),

  • the per-container password-staging convention in ansible/roles/ssh_service_cert/tasks/_per_container.yml (vault → 0400 host tmpfile → unlink in an always: block; or stdin-only via docker_container_exec for the in-container case).

The implementation must build the Ansible role from these patterns, not from a role that does not exist. Any reader of this ADR or the PRD should not waste time grepping for step_ca_vhost_cert/ under ansible/roles/.

Decision#

Issue TLS certs for Redis, Postgres, and InfluxDB from the CCAT step-ca via the JWK provisioner, using cert-as-auth (step ca renew) on a 12h timer cadence. Cut over hard, no dual-trust soak. Implement as a single parameterised Ansible role (step_ca_vhost_cert) plus three pluggable reload-strategy adapters — not as one deep module pretending all three services are the same shape.

Per-decision detail follows.


Decision: Provisioner choice — JWK over ACME and X5C#

Context#

step-ca offers three provisioner classes for non-interactive cert flows: ACME (HTTP-01, DNS-01, TLS-ALPN-01), X5C (cert-presented-as-auth, but chained to an external trust root), and JWK (password-or-key-protected provisioner credentials).

Decision#

Use JWK, with step ca renew (cert-as-auth) for steady-state renewal. Initial issuance presents the JWK provisioner password; every renewal thereafter authorises with the cert’s own private key, so the provisioner password never has to live on the renewing host past the one-shot issuance step.

Working precedent: step-ca/renew-vhost-cert.sh does exactly this for the ca.ccat.uni-koeln.de vhost cert. step ca renew only contacts the CA inside the renewal window (last 1/3 of lifetime by default), so a 12h timer is benign — most fires are no-ops.
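
For concreteness, a minimal sketch of that script shape, generalised from the vhost cert to an arbitrary service cert. The paths, the 720h window value, and the reload_service hook are illustrative placeholders, not the shipped script:

```bash
#!/usr/bin/env bash
# Sketch of the PRE_MTIME / step ca renew / POST_MTIME pattern from
# step-ca/renew-vhost-cert.sh. Example paths; 720h (30d) suits a 90d cert.
set -euo pipefail

CRT=/etc/postgres-certs/server.crt
KEY=/etc/postgres-certs/server.key

PRE_MTIME=$(stat -c %Y "$CRT")

# Cert-as-auth: authorises with the cert's own key, no provisioner password.
# --expires-in makes the timer fire a no-op outside the renewal window.
step ca renew --force --expires-in 720h "$CRT" "$KEY"

POST_MTIME=$(stat -c %Y "$CRT")

# Reload the consuming service only if the cert actually changed on disk.
if [ "$POST_MTIME" != "$PRE_MTIME" ]; then
    reload_service   # adapter hook; see the reload-strategy decision below
fi
```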

Alternatives considered#

  • ACME HTTP-01. Would require the CA reach the requesting service on port 80. In our topology Redis on input-b is firewalled off the public internet; Postgres on input-a is internal; InfluxDB has its own vhost path. Opening HTTP-01 challenge paths through the proxy for three more vhosts adds a brittle coupling between CA, proxy config, and ACME challenge timing — and is the operational class of problem that ADR-0001 already had to navigate to get the CA’s own vhost cert working.

  • ACME DNS-01. Would require the CA orchestrate DNS records in the Uni-Köln DNS zone. We do not control that zone programmatically; a manual record-flip per renewal is unacceptable on a 12h cadence.

  • ACME TLS-ALPN-01. Same firewall constraint as HTTP-01, plus the Redis/Postgres/Influx daemons are not HTTP servers and cannot serve the challenge.

  • X5C. Would require us to bootstrap a separate trust root just to authorise these provisioners, then maintain it. It does not solve a problem JWK doesn’t already solve; it adds a parallel trust path we’d then have to monitor.

Consequences#

  • The JWK provisioner password is in vault (vault_step_ca_prov_*_password) and only reaches the issuing host via Ansible’s vault → 0400 tmpfile → unlink pattern from roles/ssh_service_cert/tasks/_per_container.yml. Steady-state renewals do not touch the password at all.

  • All renewals share one well-trodden path (step ca renew) so an operator who has debugged the vhost cert renewal already knows how to debug a Redis cert renewal.

  • An open question (see below): does step ca renew succeed against an already-expired auth cert? If not, an HSM outage that exceeds the renewal budget plus the window between renewal and expiry forces a fall-back to the JWK-password path.


Decision: Migration style — hard cutover, no dual-trust soak#

Context#

The architect’s default recommendation for any TLS migration is a two-week dual-trust soak (old CA + new CA both accepted, then flip). The PRD instead proposes a hard cutover for all three services.

Decision#

Hard cutover. This is consistent with the existing TLS-hard-cutover-policy ADR captured in project memory (2026-05-07): step-ca trust + DB certs roll out via deploy-time restart, not dual-trust.

Why this stands here, even though architects would normally object#

Production is currently in setup mode: there are no end users on the operations DB, no live data streams flowing through the transfer pipeline, no externally consumed Grafana dashboards depending on the InfluxDB datasource. A two-week soak buys nothing because the “availability we’d be protecting” doesn’t exist yet. The cost of a soak (double-config, more code paths, more places for a misconfigured client to silently fall back to the old trust path) is real today; the benefit is zero today.

Time-bound — read this before reusing this precedent#

The above is only true while production is unpopulated. Once the operations DB carries real observation records, once the data-transfer pipeline is moving live telescope data, once Grafana dashboards are being watched by humans on call — the calculus flips. Any future similar migration on a populated production stack must use a soak. Do not point at this ADR as precedent for skipping a soak on a live system. The precedent is “skip soak when there are no users”, not “skip soak in general”.

Alternatives considered#

  • 2-week F→G dual-trust soak. Standard playbook. Rejected on the cost/benefit argument above, time-bound to the empty-production state.

  • Service-by-service phased cutover (Redis first, then Postgres, then Influx). Rejected as not actually safer in the current state — each service still hard-cuts when its turn comes; the phasing only spreads operator attention thinner. We will sequence by readiness of the reload adapter (probably Postgres first because pg_reload_conf() is the cheapest), not by risk-mitigation.

Consequences#

  • A failed cutover is a service outage on whichever datastore failed. Mitigation: rehearse on staging first; the staging environment uses the same step-ca and the same role.

  • This ADR must be revisited (and likely rewritten) before the next TLS migration on a populated production stack. Add a checkbox to the production-readiness review.


Decision: Renewal architecture — one role, three reload adapters#

Context#

The PRD as drafted proposed a single Ansible module that takes a cert_spec dict (name, SANs, lifetime, owner, mode, reload-command) and handles Redis (mTLS + redis-cli CONFIG SET), Postgres (server-only + pg_reload_conf()), and InfluxDB (server-only + container restart) through that one shape. The architect review pushed back: a single dict that has to fork on if redis else if postgres else if influx inside the module is a deep-module fiction — the fork is inherent to the problem and pretending it isn’t makes the module’s interface lie.

Decision#

Build one parameterised role (step_ca_vhost_cert) that handles:

  • issuance via JWK provisioner,

  • on-disk cert layout, ownership, mode,

  • the renewal timer/script (modelled on step-ca/renew-vhost-cert.sh with PRE_MTIME / POST_MTIME conditional reload),

  • trust-anchor consumption from roles/ca_trust/.

…and expose a pluggable reload-strategy interface with four adapter implementations:

| Adapter | Service | Reload mechanism | Downtime |
|---|---|---|---|
| runtime_redis | Redis | redis-cli CONFIG SET tls-cert-file ...; CONFIG SET tls-key-file ... | zero |
| runtime_postgres | Postgres | SELECT pg_reload_conf(); (or pg_ctl reload) | zero |
| restart_influx | InfluxDB | docker restart influxdb | ~30s |
| noop | (canary or no-service-attached cert) | nothing — write files, exit 0 | n/a |

The role takes a reload_strategy parameter that selects one of these four; the adapter’s contract is “given a cert that was just renewed, make the running service serve it” (or, for noop, “verify the new files exist and exit”). Anything that doesn’t fit one of these adapters is an implementation surprise that deserves a new adapter, not a special case inside the existing ones.

noop is the fourth adapter; it exists for certs that have no service to reload — the x509 canary on input-c.staging (see “Decision: x509 canary”) is its first user. A future cert that participates in the trust chain but is read by external tooling rather than a running service (e.g., a public-facing inspection endpoint) can also use it.
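
From the renewal script’s side the dispatch is small; a sketch, with adapter names per the table above and function bodies elided:

```bash
# Hypothetical dispatch inside the templated renewal script. The four
# adapter names are the role's contract; the shell shape is illustrative.
reload_service() {
    case "$RELOAD_STRATEGY" in
        runtime_redis)    reload_redis ;;      # CONFIG SET, zero downtime
        runtime_postgres) reload_postgres ;;   # pg_reload_conf(), zero downtime
        restart_influx)   restart_influx ;;    # container restart, ~30s
        noop)             return 0 ;;          # files on disk are the whole contract
        *)
            echo "unknown reload_strategy: ${RELOAD_STRATEGY}" >&2
            return 1
            ;;
    esac
}
```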

Why four adapters is honest deep-module design#

Ousterhout’s “deep module” guidance is narrow interface, broad implementation — emphatically not “one interface that secretly does four different things”. The reload mechanism is genuinely different across the four cases (CONFIG SET vs SQL function call vs container restart vs no-op) and the operational consequences differ (zero vs zero vs 30s downtime vs none). Forcing them into one cert-spec dict makes the caller’s mental model wrong: they think they have one knob, they actually have four with different blast radii. The pluggable adapter makes the asymmetry visible at the call site:

```yaml
- role: step_ca_vhost_cert
  vars:
    cert_spec: { ... }
    reload_strategy: restart_influx   # explicit: this one restarts
```

Alternatives considered#

  • One module, fork-on-service inside. Rejected per above — hides the asymmetry from the caller.

  • Three independent roles (redis_step_cert, postgres_step_cert, influx_step_cert). Rejected because the issuance + on-disk + renewal-timer machinery would be duplicated three ways. The whole point of the consolidation in #95 is to retire bespoke per-service PKI plumbing.

  • One module, reload-command as a literal shell string parameter. Rejected because the contract for “reload after renewal” is more than one shell line: it includes idempotency (no reload on no-op renewal), error handling (a failed reload should not leave the cert file half-installed), and in the InfluxDB case a wait-for-healthy step. That logic belongs in named adapters, not in free-form shell.

Consequences#

  • Adding a fifth datastore later (e.g. MinIO, Loki) is “write a fifth adapter”, not “extend the cert-spec dict”.

  • The role’s interface stays narrow (cert_spec + reload_strategy) while the implementation is honest about the four-way fork.

  • Tests can target each adapter independently — important because the InfluxDB adapter is the only one with downtime semantics and needs different verification.


Decision: Reload mechanisms (per service)#

This is the per-service detail behind the table in the previous section.

Redis — CONFIG SET, zero downtime#

Redis 6+ accepts runtime updates of tls-cert-file / tls-key-file / tls-ca-cert-file via CONFIG SET. The connection pool isn’t churned; existing TLS sessions live out their natural deaths and new sessions pick up the new material.

Failure mode to test: if CONFIG SET succeeds but the new files are unreadable by the redis user (UID 999 in our containers), Redis logs the error and keeps using the old in-memory cert. The renewal script must verify post-CONFIG-SET that the active cert serial matches the on-disk cert serial.
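
A hedged sketch of that verification, using the staging paths from the cert-spec section. Auth and client-cert flags on redis-cli are elided, and the redis hostname assumes execution on the compose network:

```bash
# runtime_redis adapter: push the renewed material, then prove Redis is
# actually serving it by comparing serial numbers.
# (--cacert/--cert/--key flags elided for brevity.)
docker exec system-integration-redis-1 \
    redis-cli --tls CONFIG SET tls-cert-file /opt/redis-certs/staging/server.crt
docker exec system-integration-redis-1 \
    redis-cli --tls CONFIG SET tls-key-file /opt/redis-certs/staging/server.key

disk_serial=$(openssl x509 -noout -serial -in /opt/redis-certs/staging/server.crt)
# The server cert is sent before client auth, so s_client can read it even
# under mTLS (the handshake failing afterwards is fine for this probe).
live_serial=$(openssl s_client -connect redis:6379 </dev/null 2>/dev/null |
                  openssl x509 -noout -serial)

if [ "$disk_serial" != "$live_serial" ]; then
    echo "Redis accepted CONFIG SET but still serves the old cert" >&2
    exit 1
fi
```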

Postgres — pg_reload_conf(), zero downtime#

SELECT pg_reload_conf(); re-reads postgresql.conf, including ssl_cert_file and ssl_key_file. Existing connections keep their TLS context; new connections get the new cert. Same caveat as Redis: verify the postmaster actually picked up the new cert; a typo in the config path is a silent fallback.
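
The same verify step for Postgres, sketched with the compose container name from the cert-spec section; openssl’s -starttls postgres handles the protocol preamble:

```bash
# runtime_postgres adapter: reload config, then compare served vs on-disk serial.
docker exec -u postgres system-integration-postgres-1 \
    psql -tc "SELECT pg_reload_conf();"

disk_serial=$(openssl x509 -noout -serial -in /etc/postgres-certs/server.crt)
live_serial=$(openssl s_client -starttls postgres -connect postgres:5432 \
                  </dev/null 2>/dev/null | openssl x509 -noout -serial)

if [ "$disk_serial" != "$live_serial" ]; then
    echo "pg_reload_conf() ran but Postgres still serves the old cert" >&2
    exit 1
fi
```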

InfluxDB — docker restart, ~30s downtime#

InfluxDB OSS does not have a runtime reload for TLS material. We accept the restart. The 30s window is acceptable given InfluxDB’s role in the stack: it ingests metrics from telegraf, which buffers locally, and serves Grafana dashboards, which retry. No write path depends on InfluxDB being up second-by-second.

The restart adapter must (sketched after this list):

  • pre-flight that the new cert is syntactically valid (openssl x509 -noout -text) before bouncing the container,

  • docker restart (not docker stop && docker start — the former preserves the container’s IP / aliases on the user-defined network),

  • wait for /health to return 200 before declaring success.
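
A sketch of those three steps, with the staging container name from the cert-spec section and the /health endpoint named above:

```bash
# restart_influx adapter: validate, restart, wait for health.
openssl x509 -noout -text -in /etc/influxdb-certs/server.crt >/dev/null ||
    { echo "renewed cert does not parse; refusing to restart" >&2; exit 1; }

docker restart system-integration-influxdb-1   # keeps network aliases, unlike stop+start

for _ in $(seq 1 30); do
    curl -fsk https://influxdb:8086/health >/dev/null 2>&1 && exit 0
    sleep 2
done
echo "InfluxDB not healthy 60s after restart" >&2
exit 1
```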


Decision: HSM blast radius / soft-offline budget#

Math#

The CCAT root CA lives on an HSM. If the HSM is offline for any reason (physical access loss, ceremony in progress, hardware fault), the CA cannot issue or renew. Every cert lives until its notAfter; the “soft-offline budget” is how long the HSM can be offline before something starts hard-failing.

Environment

Cert lifetime

step ca renew window opens at

Renewal cadence

Soft-offline budget

Production

90d

day 60 (2/3 lifetime)

12h timer = 60 fires before expiry

30d / 60 fires

Staging (PRD draft)

30d

day 20

12h timer = 20 fires before expiry

10d / 20 fires

Staging (revised)

45d

day 30

12h timer = 30 fires before expiry

15d / 30 fires

Decision#

Production stays at 90d / 30d budget — comfortable headroom for an HSM ceremony (typically 1-2 days) plus one weekend of bad luck.

Architect-mandated change to the PRD: staging at 30d / 10d budget is too tight. A long weekend plus a sick on-call plus a stuck CI run eats most of the budget. Extend staging cert lifetime to 45d (budget 15d / 30 fires).

Alternatives considered#

  • Document the operational acceptance of 10d on staging. Available if anyone has a strong reason for keeping staging cert lifetimes short (often “make rotation visible in CI cadence”). Rejected because staging exists to rehearse production failure modes, and a tighter-than-production budget makes staging a worse rehearsal, not a better one.

  • Match staging to production at 90d. Rejected because we do want staging to exercise the renewal path more frequently than production; 45d gives us that without making the budget uncomfortable.

Consequences#

  • One more variable to keep aligned across the three services on staging. The role’s cert_spec.lifetime parameter handles this.

  • The PRD’s table needs a one-line edit; flag for the implementation PR.

Open question to pin before implementation#

Does step ca renew succeed against an already-expired authenticating cert? If yes, the budget math above is straightforwardly correct: lose the HSM for 30d, recover, every host catches up on the next timer fire. If no, then once a host’s cert expires we drop back to the JWK-password path for that host, which means the password file has to be ready to materialise on demand.

This is testable in staging with a deliberately back-dated cert. Do this test before merging the implementation. The decision below assumed the answer was “no” until confirmed; configuration inspection has since confirmed exactly that (see “Resolved” below). The role’s renewal script will fall back to JWK-password issuance if step ca renew fails for an expired-cert reason.
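
One shape for that staging test, hedged: step-ca’s default policy floors TLS cert lifetime at 5m, and the provisioner-password plumbing is elided:

```bash
# Issue a deliberately short-lived cert, let it expire, then check whether
# cert-as-auth renewal is refused. Subject name is illustrative.
step ca certificate test-expiry.input-c.staging.data.ccat.uni-koeln.de \
    /tmp/test.crt /tmp/test.key \
    --provisioner staging-services --not-after 5m

sleep 330   # cert is now expired

if step ca renew --force /tmp/test.crt /tmp/test.key; then
    echo "renew succeeded on an expired cert"
else
    echo "renew refused; matches the strict default confirmed under Resolved"
fi
```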


Decision: SAN policy#

Decision#

Each cert carries multiple DNS SANs:

  • the docker-network alias the service is reached at (e.g. redis, postgres, influxdb),

  • the public FQDN (e.g. redis.ccat.uni-koeln.de),

  • the host FQDN (e.g. input-b.ccat.uni-koeln.de).

No wildcards. No IP SANs.

Reasoning#

  • No wildcards: a leaked *.ccat.uni-koeln.de cert grants the attacker every vhost we’ve ever named under that domain. Multi-SAN per cert keeps the leak blast radius to “this one service”.

  • No IP SANs: IP SANs make the cert tied to a specific deployment topology. Move the service to a different host and the cert silently mis-matches. DNS-only SANs decouple identity from placement; renumbering the IP plan stays a DNS-only operation. The redis_certs precedent included an IP SAN (redis-certs_staging.conf lists IP:134.95.40.103); we are retiring that.

  • Multi-SAN per cert instead of “one cert per SAN”: one renewal path per service, one cert file in one place. The reload adapters don’t have to juggle three cert files for the same daemon.

Consequences#

  • Adding a new alias to a service is a re-issuance, not a config edit. Acceptable because aliases change rarely and the role makes re-issuance trivial.

  • The cert will list multiple SANs under Subject Alternative Name in openssl x509 -noout -text — do not treat this as a misconfiguration in inspection scripts.


Decision: mTLS scope asymmetry#

Decision#

  • Redis: keep mTLS. Both server and client present certs.

  • Postgres: server-auth-only. Server presents a cert; client authenticates with username + password as today.

  • InfluxDB: server-auth-only. Server presents a cert; client authenticates with API token as today.

Reasoning — and being honest about it#

Redis stays mTLS because it’s already mTLS today (homegrown redis_certs role) and because the application clients (data-transfer workers, ops-db-api, etc.) already know how to present client certs. Migrating Redis off mTLS at the same time as moving its trust root is two changes at once. We are not doing two changes at once.

This is inertia, not principle. A clean-sheet design might well land all three on server-auth-only-with-password/token; mTLS for Redis buys us a marginal extra layer (compromise of the Redis password isn’t enough; you’d also need the client cert) but at the cost of distributing client material to every Redis-using service.

Revisit: when data-transfer or ops-db-api next has a credentials refactor, evaluate whether Redis mTLS is still pulling its weight or whether server-auth-only-with-password is enough. Track this as a follow-up; do not block #95 on resolving it.

Consequences#

  • runtime_redis reload adapter has to manage three files (tls-cert-file, tls-key-file, tls-ca-cert-file) — the CA file is what lets the server validate client certs. The other two adapters manage two files (cert + key only).

  • Client-side trust distribution is asymmetric: Redis clients need both the CCAT root (to validate the server) and a client cert+key (to be validated by the server). Postgres/Influx clients only need the CCAT root. The ca_trust role already drops the root at /etc/pki/ca-trust/source/anchors/ccat-root-ca.crt; client cert distribution stays where it is (per-service, today via redis_certs) for now.


Decision: Cert-spec schema — parameterised UIDs, no defaults baked in#

Context#

The PRD as drafted hardcoded the container UIDs (Redis 999, Postgres 999, InfluxDB 1000) as constants inside the role. The architect review pushed back: upstream image rebases historically shift UIDs without major-version bumps, so a baked-in constant is a silent foot-gun. Validation runbook Check 5 (2026-05-08) confirmed the values on input-b.staging are 999/999/1000 today, but it also confirmed influxdb:latest is the only unpinned image in scope — exactly the drift candidate.

The role needs a schema that (a) takes UID as a per-cert parameter with no role-level default, (b) sources the value from a per-host fact so different hosts can have different UIDs without code changes, (c) is the same schema the runtime drift-detection step (TODO 7) reads at renewal time.

This section follows the same shape as the SSH-cert plane’s ansible/roles/ssh_service_cert/defaults/main.yml schema — same pattern of “list of cert-spec dicts in host_vars, role is a no-op when the list is empty”.

Decision#

The role (working name step_ca_vhost_cert, modelled on step-ca/issue-vhost-cert.sh + step-ca/renew-vhost-cert.sh plus roles/ssh_service_cert/) takes a list of cert-spec dicts called service_tls_certs. Per-host enable lives in ansible/host_vars/<host>/vars_step_ca_vhost_cert.yml. The role defaults file declares service_tls_certs: [] so the role is a no-op on hosts where the list is undefined.

```yaml
# ansible/host_vars/input-b.staging/vars_step_ca_vhost_cert.yml
host_container_uids:
  postgres: 999
  redis:    999
  influxdb: 1000

service_tls_certs:
  # ────────────────────────────── postgres ──────────────────────────────
  - service: postgres-main
    sans:
      - postgres
      - postgres.staging.data.ccat.uni-koeln.de
      - input-b.staging.data.ccat.uni-koeln.de
    cert_path: /etc/postgres-certs/server.crt
    key_path:  /etc/postgres-certs/server.key
    owner_uid: "{{ host_container_uids.postgres | mandatory }}"
    owner_gid: "{{ host_container_uids.postgres | mandatory }}"
    cert_mode: "0644"
    key_mode:  "0600"
    lifetime:  "{{ step_ca_x509_cert_lifetime }}"
    reload_strategy: runtime_postgres
    container: system-integration-postgres-1
    provisioner:    "{{ step_ca_x509_provisioner }}"
    vault_var_name: vault_step_ca_prov_staging_services_password
    mtls: false

  # ──────────────────────────────── redis ───────────────────────────────
  - service: redis-main
    sans:
      - redis
      - redis.staging.data.ccat.uni-koeln.de
      - input-b.staging.data.ccat.uni-koeln.de
    cert_path: /opt/redis-certs/staging/server.crt
    key_path:  /opt/redis-certs/staging/server.key
    ca_path:   /opt/redis-certs/staging/ca.crt    # mtls=true only
    owner_uid: "{{ host_container_uids.redis | mandatory }}"
    owner_gid: "{{ host_container_uids.redis | mandatory }}"
    cert_mode: "0644"
    key_mode:  "0600"
    lifetime:  "{{ step_ca_x509_cert_lifetime }}"
    reload_strategy: runtime_redis
    container: system-integration-redis-1
    provisioner:    "{{ step_ca_x509_provisioner }}"
    vault_var_name: vault_step_ca_prov_staging_services_password
    mtls: true                                    # mTLS scope asymmetry

  # ────────────────────────────── influxdb ──────────────────────────────
  - service: influxdb-main
    sans:
      - influxdb
      - influxdb.staging.data.ccat.uni-koeln.de
      - input-b.staging.data.ccat.uni-koeln.de
    cert_path: /etc/influxdb-certs/server.crt
    key_path:  /etc/influxdb-certs/server.key
    owner_uid: "{{ host_container_uids.influxdb | mandatory }}"
    owner_gid: "{{ host_container_uids.influxdb | mandatory }}"
    cert_mode: "0644"
    key_mode:  "0600"
    lifetime:  "{{ step_ca_x509_cert_lifetime }}"
    reload_strategy: restart_influx
    container: system-integration-influxdb-1
    provisioner:    "{{ step_ca_x509_provisioner }}"
    vault_var_name: vault_step_ca_prov_staging_services_password
    mtls: false
```

Field reference#

| Field | Description |
|---|---|
| service | Canonical service name. Used in metric labels (Tier 1+3 of the alert substrate), filename suffixes, and journald audit-log lines. Format: <service>-<variant> (e.g. redis-main, redis-ccat, postgres-main). |
| sans | DNS SAN list per the SAN-policy decision (no IP, no wildcard, multi-SAN). At minimum: docker-network alias + public FQDN + host FQDN. |
| cert_path, key_path | On-host filesystem paths. The cert is bind-mounted into the container by the per-machine compose file. |
| ca_path | Path to the CA bundle the server will use to validate client certs. Set only when mtls: true. |
| owner_uid, owner_gid | No role-level default. Must be sourced from host_container_uids.<service> (or another per-host fact). The \| mandatory filter forces an explicit failure if the host fact is missing — fail-loud on misconfiguration is the desired property. |
| cert_mode, key_mode | Typically 0644 and 0600. Postgres rejects keys with group/world access (validation runbook Check 9 pinned the FATAL phrasing); the role asserts these post-issuance and aborts the play before reload if they drift. |
| lifetime | Sourced from a role-level default (step_ca_x509_cert_lifetime) which the per-environment vars file overrides — 90d for production, 45d for staging (per HSM blast-radius decision). |
| reload_strategy | One of runtime_redis, runtime_postgres, restart_influx, noop (per renewal-architecture decision). Adding a new datastore = adding a new adapter, not extending the cert-spec. |
| container | Compose-namespaced container name (e.g. system-integration-postgres-1), not the bare service name — runbook Check 5 captured this gotcha. The reload adapter and the runtime UID-drift probe (TODO 7) both docker exec into this container. |
| provisioner | Name of the JWK provisioner on step-ca that issues this cert. Sourced from a role-level default (step_ca_x509_provisioner) — staging-services or prod-services. |
| vault_var_name | Name of the Ansible vault variable holding the provisioner password. Used at issuance only; renewals are cert-as-auth (step ca renew) and never touch the password. Same vault → 0400 host tmpfile → unlink convention as roles/ssh_service_cert/tasks/_per_container.yml. |
| mtls | true for Redis (server validates client certs); false for Postgres and InfluxDB (server-auth-only) per the mTLS scope asymmetry decision. Controls whether ca_path is written and whether the reload adapter manages two files (cert+key) or three (cert+key+ca). |

Per-host UID fact#

host_container_uids is a separate dict in the same host_vars file. Two reasons:

  1. Reuse for TODO 7 runtime drift detection. The renewal script reads host_container_uids.<service> and compares to docker exec <container> cat /proc/1/status (PID 1’s effective UID — docker exec ... id defaults to root and is the wrong probe; runbook Check 5 captured this gotcha). Mismatch = non-zero exit + Tier-1 + Tier-2 alert. A sketch of this probe follows the list.

  2. Single source of truth per host. A future change that pins influxdb:2.7-rootless (UID 1001) is a one-line edit in host_container_uids rather than a hunt across multiple cert-spec entries.
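
A sketch of the probe from point 1, with the declared value templated in by the role (999 is the Check 5 value for postgres, used here as the example):

```bash
# TODO 7 drift probe: declared per-host fact vs PID 1's effective UID.
declared=999   # templated from host_container_uids.postgres
actual=$(docker exec system-integration-postgres-1 cat /proc/1/status |
             awk '/^Uid:/ {print $3}')   # field 3 of the Uid: line = effective UID

if [ "$actual" != "$declared" ]; then
    echo "UID drift on postgres: declared=$declared actual=$actual" >&2
    exit 1   # fail-closed: refuse to write the new key
fi
```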

Acceptance against TODO 3 clauses#

  1. ✓ The role takes owner_uid as a per-cert-spec parameter — no defaults baked into the role. Defaults file sets service_tls_certs: []; cert-specs supply UIDs explicitly.

  2. ✓ Per-host vars under ansible/host_vars/<host>/vars_step_ca_vhost_cert.yml set the UIDs via host_container_uids and reference them in cert-specs.

  3. ✓ Actual UIDs as observed via docker exec <container> cat /proc/1/status are recorded in validation runbook Check 5 (2026-05-08, input-b.staging): postgres=999, redis=999, influxdb=1000. Cross-referenced from this section.

  4. → TODO 7 (runtime drift detection): the same host_container_uids dict is the source of truth at renewal time.

Alternatives considered#

  • One flat dict per cert-spec mixing UID with everything else. Rejected: makes UID drift harder to reason about, and the runtime drift script would have to walk the cert-spec list to find the value rather than reading the UID dict directly.

  • Role-level UID defaults (e.g., default_postgres_uid: 999 in the role’s defaults/main.yml). Rejected: defeats TODO 3’s purpose. A future operator adding a host with non-default UIDs has to remember to override the default. Better to fail-loud than to silently use a wrong default.

  • Per-environment cert-spec lists in group_vars/<env>/ rather than per-host. Rejected because UIDs are a host-level fact (different hosts can run different image variants); SANs and paths are also host-level. Putting them in group_vars would force every host in the group to have identical UIDs, which is exactly the drift-foot-gun we are avoiding.

Consequences#

  • The schema is the contract between the role and the TODO 7 runtime-drift script and the TODO 4 alert substrate scripts. Field renames are role-version-bump events.

  • Cert-spec count grows linearly with services × variants. Today: 3 services × 2 environments × {main, ccat} variant where applicable = ~6-9 cert-specs across all hosts. Manageable.

  • influxdb:latest UID drift risk stays open (no version pin), but is now a one-line host_container_uids.influxdb edit if it shifts. TODO 7’s runtime probe catches the shift at the next renewal fire and refuses to write the new key — fail-closed before the reload would brick InfluxDB.


Decision: x509 canary on input-c.staging — leading-indicator for the cert plane#

Context#

Option A on allowRenewalAfterExpiry (Resolved, 2026-05-08) makes the protection contingent on detection: a cert+key snapshot leak auto-bounds at notAfter only if the operator notices the renewal chain has been broken before then. The SSH-cert plane already runs 24h user certs that act as an HSM/CA-health canary for the SSH side; the x509 plane has no equivalent today.

Service certs are 90d (production) / 45d (staging) and only renew in the last 1/3 of lifetime, so a stuck renewal gives the alert substrate days-of-warning if it works. A 24h x509 canary fails within hours of any HSM/CA breakage on the x509 plane — long before any production cert is at risk. It is the leading indicator that proves the alert path is alive, and the smoke test for the JWK provisioner cert-as-auth flow specifically.

Decision#

Issue a 24h-lifetime x509 cert from the staging-services JWK provisioner to a non-prod host. Target host: input-c.staging — deliberately not input-b.staging so the canary does not share fate with the CA host itself.

Cert-spec entry (lives in ansible/host_vars/input-c.staging/vars_step_ca_vhost_cert.yml):

```yaml
host_container_uids: {}   # canary has no container; no UID needed

service_tls_certs:
  - service: x509-canary
    sans:
      - x509-canary.input-c.staging.data.ccat.uni-koeln.de
      - input-c.staging.data.ccat.uni-koeln.de
    cert_path: /opt/x509-canary/canary.crt
    key_path:  /opt/x509-canary/canary.key
    owner_uid: 0
    owner_gid: 0
    cert_mode: "0644"
    key_mode:  "0600"
    lifetime:  "24h"
    reload_strategy: noop    # new fourth adapter, see below
    container: ""                   # no container; canary is host-only
    provisioner: staging-services
    vault_var_name: vault_step_ca_prov_staging_services_password
    mtls: false
```

This relies on a fourth reload adapter, noop, listed in the Renewal architecture decision: “renew the cert, write the new files, do nothing else”. No reload — nothing reads the canary at runtime. The cert exists only for its own lifecycle metrics. noop is general-purpose (the canary is its first user, but a future cert that doesn’t need a service reload — e.g., a dual-purpose cert inspected by external tooling — can use it too) and is exempt from the operational-consequence asymmetry argument because there is no service to reload.

Renewal cadence and failure semantics#

  • Cert lifetime: 24h.

  • Timer cadence: 12h (matches production cert plan, so the canary exercises the same code path as production renewals).

  • step ca renew --expires-in threshold: 18h. Below that threshold a renew attempt actually contacts the CA; above it the timer fire is a no-op (same gate the production timers will use, just with smaller numbers).

  • Failure threshold for paging: failure to successfully renew within 18h of notAfter = Tier 1 + Tier 2 alert. The 6h gap between the renewal threshold and the page threshold gives one natural retry without paging.

If the canary cert expires (no successful renewal for >24h after notAfter), the alert substrate is itself broken — Tier 2 mail is the canary on the canary.
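
Sketched with the numbers above (24h lifetime, 18h threshold); the logger tag is illustrative:

```bash
# Canary renew, fired by the 12h timer. --expires-in 18h makes fires outside
# the window no-ops; a failed renew exits non-zero and trips the OnFailure=
# mail unit from the alert-substrate decision.
CRT=/opt/x509-canary/canary.crt
KEY=/opt/x509-canary/canary.key

if ! step ca renew --force --expires-in 18h "$CRT" "$KEY"; then
    logger -t ccat-x509-canary "renewal failed host=$(hostname)"
    exit 1
fi
```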

Acceptance against TODO 15 clauses#

  1. ✓ 24h-lifetime x509 cert from staging-services JWK on a non-prod, non-CA host (input-c.staging).

  2. ✓ Renewal timer fires at 12h cadence; failure to renew within 18h of notAfter triggers a Tier 1 + Tier 2 page on the substrate from TODO 4 (alert substrate decision).

  3. step_x509_cert{service=x509-canary} seconds_to_expiry and step_x509_cert_last_renewal_success{service=x509-canary} are wired into the alert substrate as the first metrics — end-to-end verification (Tier 1 alert visible in Grafana, Tier 2 mail actually delivered) happens before any production cert is enrolled. This is also Check 8 (page-path E2E) in the validation runbook.

  4. ✓ The cert-spec entry above is the canary configuration; the noop adapter is the implementation. Single artifact for both purposes.

Why a non-CA host#

The canary is supposed to fail fast when the HSM is unreachable. If it lives on input-b.staging (which hosts the CA), an input-b outage takes down the CA and the canary together — the canary’s failure is then ambiguous between “CA is down” and “input-b is down and the CA might be fine”. Hosting the canary on input-c.staging removes that ambiguity: a canary failure with input-c.staging up means the CA is unreachable from a peer host, which is exactly the condition the canary exists to detect.

Consequences#

  • Phase A scope adds the noop adapter (fourth adapter; trivial — write files, exit 0, emit metrics). Phase A scaffolding gains one cert-spec on input-c.staging.

  • The canary is the validation-runbook Check 8 target. Check 8 is currently BLOCKED on TODO 15 (and on Phase A producing the role). Closing TODO 15 design unblocks Check 8 once Phase A lands.

  • A new operational duty: if the canary alerts but no production cert has alerted, the operator’s first move is “is the CA reachable from input-c.staging?” — step ca health, nc -zv ca.ccat.uni-koeln.de 443 from input-c. Document in the on-call runbook (when one exists).


Decision: Revocation stance — lifetime-as-revocation, no CRL/OCSP#

Decision#

We do not stand up a CRL or OCSP responder. Compromised certs are handled by rotating the secret material and waiting for the cert to expire (90d production, 45d staging). For acute compromises, the runbook below is the response.

Trade-offs#

  • CRL. Operationally simple to publish, but every client has to fetch and trust it. Adding a fetch-and-trust step to telegraf, Grafana, three Celery worker fleets, and ops-db-api is real work for a threat model where we can already roll the underlying secret.

  • OCSP. Real-time but adds a hard dependency on the CA being reachable from every TLS handshake. We just spent ADR-0001 (docs/source/adr/0001-ca-per-vhost-cert-split.md) carefully containing the CA’s reachability surface; OCSP would re-expand it.

  • Lifetime-as-revocation. The 90d ceiling means a compromised cert is automatically not-trusted within 90d without operator action. For acute compromise we roll the secret immediately; the cert remains technically valid until it expires but the secret it protected is already changed.

Compromise modes — runbook headlines#

Full runbook: see the threat-model document (TODO: link when written).

| Mode | Headline response |
|---|---|
| Server key leaked (Redis/Postgres/Influx host private key on disk readable by attacker) | Re-issue the cert with the role (ccat <something> rotate <service>), reload via the adapter. Old cert remains valid until notAfter but no longer protects anything. |
| Client key leaked (Redis client cert on a compromised app host) | Rotate the client cert via the redis_certs successor flow. Same lifetime caveat. |
| HSM key leaked (root CA private key compromised) | Stop the CA; cut a new root via ceremony; redistribute via ca_trust role; re-issue every leaf cert. This is the catastrophic case and is what step-ca/ceremony-playbook.pdf exists for. |

Consequences#

  • Operators need to internalise “rolling the secret + waiting for expiry” as the revocation primitive. This is documented at the runbook level, not on every ccat CLI invocation.

  • A future regulatory audit that asks for “CRL endpoint” gets the answer “no CRL; lifetime ceiling and operator-led rotation”. Be prepared to defend that.


Decision: Trust distribution — bind-mount + env vars, not image rebuild#

Decision#

The CCAT root CA is distributed to containers via a bind-mount of /etc/pki/ca-trust/source/anchors/ccat-root-ca.crt (placed there by roles/ca_trust/) and an env var pointing each application’s TLS library at it (e.g. SSL_CERT_FILE, PGSSLROOTCERT, etc.).

We do not bake the CA root into the application container images.

Reasoning#

Baking the root into the image couples root-rotation cadence to CI build cadence: every root rotation triggers a rebuild and redeploy of every image. With bind-mount + env var, root rotation is “update one file on disk via the ca_trust role, restart consumers” — independent of CI.

This is the same separation already in effect for the SSH cert plane (roles/ssh_service_cert/ mounts ~/.ssh from the host into spawned agents, see commit ce87baa).

Consequences#

  • Container images stay smaller and rebuild less often.

  • The ca_trust role is now a hard dependency for every host that hosts a TLS-consuming container. This is already true today.

  • A misconfigured bind-mount path silently turns into “no CA root” at the container level. The role must verify post-mount that the expected fingerprint is present.
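
A deliberately crude sketch of that check: grep for a line of the root’s PEM body inside the container’s bundle. A per-certificate fingerprint walk is the robust version; this catches the common failure (wrong path, mount missing entirely):

```bash
# Verify the CCAT root actually made it through the bind-mount.
marker=$(awk 'NR==2' /etc/pki/ca-trust/source/anchors/ccat-root-ca.crt)

docker exec system-integration-postgres-1 \
    grep -q "$marker" /etc/ssl/certs/ca-bundle.crt ||
    { echo "CCAT root missing from container trust bundle" >&2; exit 1; }
```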


Decision: Compose layering — trust anchor in shared file, layered per context#

Context#

Validation runbook Check 6 (2026-05-08) surfaced a structural fact while inventorying the per-machine compose files: every staging-input, prod-input, and chile context deploys with a single self-contained per-machine compose file. There is no shared docker-compose.yml base in the layering for those contexts. From src/ccat_dc/_constants.py:

```python
CONTEXT_COMPOSE: dict[str, list[str]] = {
    ...
    "staging-input-a":   ["docker-compose.staging.input-a.yml"],
    "staging-input-b":   ["docker-compose.staging.input-b.yml"],
    "staging-input-c":   ["docker-compose.staging.input-c.yml"],
    "prod-input-a":      ["docker-compose.production.input-a.yml"],
    "prod-input-b":      ["docker-compose.production.input-b.yml"],
    "prod-input-c":      ["docker-compose.production.input-c.yml"],
}
```

This invalidates the implicit assumption in #95 that an x-ccat-trust: YAML anchor could live in a single base file and merge into each app service via <<: *ccat-trust. There is no single base for the contexts that matter; YAML anchors only resolve within a single file. So the anchor cannot be “defined once, merged everywhere” by accident — it needs an explicit wiring decision.

Decision#

Define the anchor and the per-service merge entries in a new file docker-compose.trust.yml, and layer it into every applicable context via CONTEXT_COMPOSE. The trust file is the single source of truth for “which services get the trust bundle bind-mount”:

```yaml
# docker-compose.trust.yml (sketch)
x-ccat-trust: &ccat-trust
  volumes:
    - /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem:/etc/ssl/certs/ca-bundle.crt:ro

services:
  postgres:        { <<: *ccat-trust }
  redis:           { <<: *ccat-trust }
  influxdb:        { <<: *ccat-trust }
  ops-db-api:      { <<: *ccat-trust }
  ops-db-ui:       { <<: *ccat-trust }
  grafana:         { <<: *ccat-trust }
  pgadmin:         { <<: *ccat-trust }
  db-backup:       { <<: *ccat-trust }
  # ...one line per app service across all 7 per-machine files...
```

Plus the corresponding _constants.py change:

"staging-input-a": ["docker-compose.staging.input-a.yml", "docker-compose.trust.yml"],
"staging-input-b": ["docker-compose.staging.input-b.yml", "docker-compose.trust.yml"],
"staging-input-c": ["docker-compose.staging.input-c.yml", "docker-compose.trust.yml"],
"prod-input-a":    ["docker-compose.production.input-a.yml", "docker-compose.trust.yml"],
"prod-input-b":    ["docker-compose.production.input-b.yml", "docker-compose.trust.yml"],
"prod-input-c":    ["docker-compose.production.input-c.yml", "docker-compose.trust.yml"],

The chile production context (docker-compose.production.chile.yml) is not in CONTEXT_COMPOSE today; its layering choice mirrors the input nodes when it is added.

On deploy, docker compose -f <per-machine> -f docker-compose.trust.yml merges each per-service entry from the trust file into the same-named service in the per-machine file. Adding a new app service that needs trust is a one-line addition to docker-compose.trust.yml, not a touch in seven separate per-machine files.
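
The coverage check over the merged render (Check 6’s deferred spot-check) then runs against the composed config. jq and compose’s --format json are assumed conveniences here, not new dependency decisions:

```bash
# Render the merged config and report has_trust per service.
docker compose -f docker-compose.staging.input-b.yml -f docker-compose.trust.yml \
    config --format json |
  jq -r '.services | to_entries[]
         | "\(.key) has_trust=\((.value.volumes // []) | tostring | contains("tls-ca-bundle"))"'
```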

Alternatives considered#

  • Option A — duplicate the anchor in each per-machine file. Concrete, no _constants.py wiring change. Rejected because it creates a 7-file sync target — adding a new app service that needs trust means remembering to annotate it in whichever per-machine file it lands in, and the failure mode of forgetting is a silent TLS rejection at runtime. Validation runbook Check 6’s per-service has_trust: true spot-check catches the oversight, but only if it is actually run.

  • Option C — bind-mount the parent directory (/etc/pki/ca-trust/extracted/pem/) instead of the single bundle file. Would also resolve TODO 14 (single-file bind-mount staleness on rotation) because directory bind-mounts re-resolve dentries on lookup. Rejected from this section because it changes the what of the bind-mount; the layering question is the how-to-wire. Track the directory-vs-file question under TODO 14; if option C is chosen there, this layering decision is unaffected (the trust file’s volume entry just changes shape).

Why option B (the shared trust file) wins#

  • One source of truth. The list of services-that-need-trust lives in docker-compose.trust.yml, not scattered across 7 per-machine files.

  • Auditable PR review. A Phase B PR diff is one file plus a 6-line _constants.py change. Reviewers do not have to grep across 7 compose files to verify coverage.

  • Adding a new context (e.g., a future input-d) is “add the per-machine compose, append trust to its CONTEXT_COMPOSE entry” — a known shape, not a new sync rule.

  • Dev/local opt-out is explicit. The dev contexts (dev, localdev, local) keep self-signed certs per the TLS-hard-cutover-policy convention (2026-05-07 memory). Not adding the trust file to those entries in CONTEXT_COMPOSE is the explicit opt-out, visible in the diff and reviewable.

  • YAML anchor mechanics are unchanged. The <<: *ccat-trust merge happens within docker-compose.trust.yml itself; no cross-file anchor references are required (docker-compose does not resolve anchors across files anyway).

Consequences#

  • docker-compose.trust.yml becomes a hard dependency for every context that lists it. A missing or malformed trust file fails the deploy at compose-render time, before any container starts — fail-closed, which is the desired property.

  • _constants.py is the source of truth for context wiring. Future ADRs that touch deployment topology should reference here.

  • Phase B PR shape: one new file (docker-compose.trust.yml, ~70 lines including the anchor and 60 service entries), one _constants.py patch (6 lines), no per-machine compose edits. Validation runbook Check 6’s deferred per-service spot-check becomes docker compose -f ... config | yq against the merged render and lands in the same PR.

  • Service inventory (validation runbook Check 6, 2026-05-08) — this is the input set for docker-compose.trust.yml’s services: mapping:

    | File | Total | needs trust | exempt |
    |---|---|---|---|
    | production.input-a.yml | 12 | 11 | promtail |
    | production.input-b.yml | 10 | 8 | loki, promtail |
    | production.input-c.yml | 8 | 7 | promtail |
    | production.chile.yml | 9 | 8 | promtail |
    | staging.input-a.yml | 13 | 11 | loki, promtail |
    | staging.input-b.yml | 10 | 8 | loki, promtail |
    | staging.input-c.yml | 8 | 7 | promtail |

    Total: 60 app-service annotations across the 7 files.

  • Exemptions are deliberate, not oversights. promtail ships logs to Loki via plain HTTP and has no DB connection; loki is a log store with no DB clients in the compose graph. Both are defence-in-depth candidates if a future change makes them speak to a step-ca-issued vhost — at which point they become a one-line addition to docker-compose.trust.yml. Not load-bearing for #95.


Decision: Alert substrate — tiered, with a TLS-independent backstop#

Context#

The PRD draft proposed renewal-failure alerts flowing telegraf → InfluxDB → Grafana → ops chat. The architect review caught a circular dependency: that alert path itself depends on the TLS trust chain we’re trying to monitor. If the trust chain breaks, the alert telling us so is silenced by the same break.

An earlier draft of this ADR section recommended “piggyback on the SSH-cert plane” on the premise that the SSH-cert plane’s failure-notification path is by construction independent of the database TLS chain. That premise was wrong. Inspection of ansible/roles/ssh_service_cert/templates/step-cert-monitor.sh.j2 plus ansible/roles/system_setup/files/telegraf.conf:960 shows the SSH-cert plane emits step_cert and step_renew_failed measurements via Telegraf [[inputs.exec]], and Telegraf’s [[outputs.influxdb_v2]] writes to http://db.data.ccat.uni-koeln.de:8086 — the same InfluxDB on input-b that this PRD is hardening. The SSH-cert plane shares fate with the database TLS chain. Piggybacking on it does not break the circular dependency; it just inherits it under a different name.

The fix is not to rebuild on a different single substrate — it is to accept that any single substrate convenient enough to use day-to-day will share fate with something in the stack. We need a backstop tier that is genuinely independent.

Decision#

Tiered substrate, three independent paths:

Tier 1 — Primary (visibility + everyday paging)#

Telegraf [[inputs.exec]] on every cert host emits, mirroring the existing SSH-cert plane’s step-cert-monitor.sh.j2:

  • step_x509_cert,service=...,host=... seconds_to_expiry=Ni

  • step_x509_renew_failed,service=...,unit=... value=0|1

  • step_x509_cert_last_renewal_success,service=... seconds_ago=Ni

Telegraf → InfluxDB → Grafana → Matrix room (page channel: #ccat-ops:matrix.data.ccat.uni-koeln.de). Catches single-service renewal failures, perms drift, image UID drift (TODO 7) — anything that doesn’t take down InfluxDB or Grafana itself.
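
A sketch of the emitter for a single cert, in InfluxDB line protocol as inputs.exec expects; the real script is templated over the cert-spec list, one stanza per entry:

```bash
# Emit step_x509_cert for one cert-spec entry (example paths/names).
now=$(date +%s)
crt=/etc/postgres-certs/server.crt
svc=postgres-main

not_after=$(date -d "$(openssl x509 -noout -enddate -in "$crt" | cut -d= -f2)" +%s)

echo "step_x509_cert,service=${svc},host=$(hostname) seconds_to_expiry=$((not_after - now))i"
```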

Tier 1 does transit step-ca-issued TLS once Phase E lands (Telegraf → InfluxDB will use the new server cert). This is acknowledged, not denied. It is the “convenience” tier; it is not load-bearing for the catastrophic case.

Tier 2 — Backstop (TLS-independent, catches catastrophic failures)#

Two host-local mechanisms, both calling mailx to the existing admin_email_addresses alias (already configured by ansible/roles/system_setup/tasks/sendmail.yml — root → admin alias is in place via /etc/aliases):

  • OnFailure= unit on every renewal systemd timer. Fires immediately when a renewal unit reports failed. Mail body includes hostname, service, unit name, last 20 lines of journalctl -u <unit>.

  • Daily heartbeat cron at 06:00 UTC sends mail “all certs OK on $HOSTNAME, soonest-expiry=Nd, issuance-events-today=N” with one line per cert. Absence of mail for 36h on any host = problem, even if no specific failure was detected.
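
A sketch of the heartbeat body, assuming mailx and the root → admin alias already in place; the soonest-expiry calculation is shown for one cert only:

```bash
# Daily 06:00 UTC heartbeat; absence of this mail for 36h is itself the signal.
soonest=$(openssl x509 -noout -enddate -in /etc/postgres-certs/server.crt | cut -d= -f2)
issued_today=$(journalctl -q -t ccat-step-issuance --since today | wc -l)

printf 'all certs OK on %s, soonest-expiry=%s, issuance-events-today=%s\n' \
    "$(hostname)" "$soonest" "$issued_today" |
  mailx -s "cert heartbeat: $(hostname)" root
```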

The mail path goes via the host MTA (sendmail) → Uni-Köln SMTP relay → admin inbox. This tier does not transit any step-ca-issued TLS cert. It is the only path that survives:

  • HSM offline (no new certs issuable).

  • InfluxDB down on input-b (Tier 1 metrics black-hole).

  • Grafana down on input-b.

  • Matrix homeserver down on input-b.

  • Network partition between input-a/c and input-b.

The only break-conditions for Tier 2 are host network down or the external SMTP relay down — both known operational classes that are not silently coupled to step-ca.

Tier 3 — Issuance audit (anomaly detection)#

Every JWK-password-using step ca certificate invocation in the role wraps its call site in a logger trap that writes a structured journald line:

```text
ccat-step-issuance: host=$HOSTNAME service=$SVC ts=$ISO triggered_by=$USER
```

promtail ships journald to Loki; a Grafana alert fires when issuance-events-per-week exceeds the expected baseline (production: ~6/year per service after Phase A; staging: ~12/year per service).

Mostly Tier-1 plumbing, but the daily heartbeat mail (Tier 2) also includes issuance-events-today=N — so an attacker who silences Loki and InfluxDB still has to silence the host MTA path to hide issuance events.
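
A sketch of the wrapper at the issuance call site; the trap fires whether issuance succeeds or fails, so even failed password-path attempts leave an audit line. Variable names are the role’s to choose:

```bash
# Issuance audit wrapper (variables supplied by the role's issuance task).
log_issuance() {
    logger -t ccat-step-issuance \
        "host=${HOSTNAME} service=${SVC} ts=$(date -Is) triggered_by=${USER}"
}
trap log_issuance EXIT

step ca certificate "$PRIMARY_SAN" "$CRT" "$KEY" \
    --provisioner "$PROVISIONER" \
    --provisioner-password-file "$PW_FILE"
```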

How this maps to the seven acceptance clauses#

| # | Clause | Tier 1 | Tier 2 | Tier 3 |
|---|---|---|---|---|
| 1 | Substrate | Telegraf+Grafana+Matrix | cron+OnFailure+mailx | journald+Loki+Grafana |
| 2 | Page channel | Matrix #ccat-ops | email to admin_email_addresses | (rolls up into 1+2) |
| 3 | On-call contract | deferred to a follow-up — see “Out-of-scope” below (applies across all tiers) | | |
| 4 | No-step-ca-TLS statement | acknowledged: transits step-ca TLS | does not transit | partial (Loki not on step-ca today) + Tier-2 backstop |
| 5 | Renewal-failure alert | step_x509_renew_failed > 0 → Grafana alert | systemd OnFailure= mail | n/a |
| 6 | Renewal-success heartbeat | last_renewal_success_seconds_ago > 24h alert | daily 06:00 mail; absence ≥ 36h = problem | n/a |
| 7 | Issuance audit log | n/a | “issuance-events-today=N” line in heartbeat mail | structured journald + Loki alert on >2σ above 30d baseline |

What this changes in the existing infrastructure#

  • New Telegraf input-exec script templated by the step_ca_vhost_cert role, parallel to roles/ssh_service_cert/templates/step-cert-monitor.sh.j2 — same pattern, x509 measurements instead of SSH ones. ~1 small PR.

  • Renewal systemd timer template gains OnFailure= ccat-cert-mail@%i.service and a sibling ccat-cert-mail@.service unit that calls mailx. ~1 small PR.

  • Daily heartbeat cron entry (/etc/cron.d/ccat-cert-heartbeat) templated per host from the cert-spec list. ~1 small PR.

  • Issuance audit log is a one-line logger -t ccat-step-issuance wrapper around the issuance script and a Grafana/Loki alert rule. Folded into the role’s issuance task.

Three small Phase A PRs, parallelisable with the role itself; none depends on the cert-issuance role being complete first.

Alternatives considered (why this won over the original options)#

  • Single-substrate, “use what exists and is independent” (original recommendation: SSH-plane piggyback). Rejected because the SSH-plane is not actually independent — it shares the same Telegraf → InfluxDB pipe. Single-substrate framing is the bug.

  • cron + mailx as the only path. Robust, but operationally thin: no silencing, no ack, no per-service severity, no dashboard. Acceptable as a backstop; not enough as the everyday path.

  • Pushgateway + Alertmanager over plain HTTP on a private docker network. Adds two new components for one alert class. YAGNI until we have at least three substrates that would benefit from a unified alerting layer. Revisit when the alerting story is mature enough to consolidate.

Why a tiered design is the honest answer#

A single substrate convenient enough to be the daily path will share fate with something. The architect’s worry was real; the fix is not “find a magically-independent single path” (no such path exists at this scale of infrastructure) but “have a backstop that is deliberately inconvenient — mail to a mailing list — so it is actually independent”. Tier 2 is operationally annoying on purpose: mail is not a great paging UX, but it is a great backstop UX because it doesn’t transit any of the things we are trying to alert on.

Consequences#

  • Three artifacts to maintain instead of one. Worth it for the load-bearing independence guarantee.

  • admin_email_addresses is now load-bearing. Document the alias contents and the SMTP relay path in the on-call runbook (when one exists). Test the path during Phase A by deliberately failing a staging renewal and confirming the mail arrives.

  • Grafana / InfluxDB / Matrix outage scenarios are now page-quiet on Tier 1 by design. Operators must internalise that “no Tier 1 alert” means “Tier 1 is up”, not “all is well”. The daily Tier 2 heartbeat is the positive-confirmation signal.

  • Future consolidation (e.g., Alertmanager) replaces Tier 1 without disturbing Tier 2. Tier 2 is the architectural floor.


Open questions#

  • Is runtime_redis (CONFIG SET) sufficient on Redis 7 with TLS-only listeners? The tls-port directive isn’t reloadable via CONFIG SET in some Redis versions; verify on the version we ship. If not, runtime_redis degrades to a restart_redis adapter and Redis joins InfluxDB in the 30s-downtime club.

  • mTLS asymmetry follow-up. Schedule a review at the next data-transfer credentials refactor. Don’t block #95.

  • Threat-model document link. The full leak-response runbook lives there; this ADR carries the headlines. Link when written.

Resolved#

  • (2026-05-08) Does step ca renew succeed against an already-expired authenticating cert? Resolved by configuration inspection (validation runbook Check 4). step ca provisioner add --allow-renewal-after-expiry exists as a flag; step-ca/provisioners-add.sh does NOT pass it on prod-services or staging-services. Default is false. Therefore step ca renew on an expired cert is refused under the current CA config.

    Decision: keep the strict default (allowRenewalAfterExpiry: false, i.e. Option A). Threat-model trade-off:

    • Service-host snapshot leak (cert+key only): the JWK provisioner password is NOT on service hosts in steady state — vault-staged as a 0400 host tmpfile during issuance, unlinked in the always: block of the issuance play. Steady-state renewal uses cert-as-auth and needs no password. So a snapshot leak gives the attacker cert+key but not the password, and Option A’s “expired = denied” semantic auto-bounds the leak at notAfter if the attacker fails to renew in time. Detection-then-host-rotation breaks the renewal chain.

    • Controller compromise (saiyajin / Jenkins-on-input-b): both options are equally lost. Vault key lives there.

    • Persistent service-host compromise spanning an issuance window: attacker eventually grabs the 0400 tmpfile. Both options equally lost.

    • Operational cost of Option A: HSM offline > 30d production budget (15d staging) requires manual re-issuance ceremony — vault → 0400 tmpfile → run issuance script. Same pattern as today’s vhost cert and ssh_service_cert/_per_container.yml.

    Option A’s protection is contingent on detection. Therefore monitoring + canary become load-bearing (TODO 15 in the pre-implementation TODO list, plus expanded acceptance for the alert substrate in TODO 4).

  • (2026-05-08) update-ca-trust extract atomicity. Resolved by validation runbook Check 3. update-ca-trust swaps the bundle via atomic rename on RHEL 10.1 (inode change verified). No partial-read window on the host filesystem. Downstream nuance: Linux single-file bind-mounts pin the source inode, so atomic rename on the host means containers see the old bundle until restart — tracked as TODO 14, not a blocker for this ADR.

  • (2026-05-08) Trust-anchor compose layering. Resolved by validation runbook Check 6 + this ADR’s “Decision: Compose layering” section. New docker-compose.trust.yml is the single source of truth for service-needs-trust; layered into each applicable context via CONTEXT_COMPOSE. TODO 16 closed on this ADR section landing.

  • (2026-05-08) Break-glass SSH access during HSM-down >24h. Resolved by static review of existing infrastructure rather than by adding a new artifact. The architect’s concern presumed step-ca-issued user certs are the only operator auth path; ansible/roles/system_setup/tasks/nitrokey_ssh.yml applies per-operator FIDO2 hardware-key pubkeys to plain authorized_keys on every managed host (outside the AuthorizedPrincipalsFile cert path), and out-of-band hardware consoles cover hardware-level recovery. The Nitrokey path survives any step-ca outage by construction. TODO 5 dropped; Check 11 signed off as N/A. See “Operational notes” for the role-split rationale (Nitrokey for core admins, step-ca SSH certs for remote admins).

  • (2026-05-08) x509 canary on input-c.staging. Resolved by this ADR’s “Decision: x509 canary on input-c.staging” section. 24h cert from staging-services JWK on a non-CA host (input-c.staging); 12h timer cadence; failure to renew within 18h of notAfter triggers Tier 1 + Tier 2 alert. Adds a fourth noop reload adapter (general-purpose; canary is its first user). Doubles as validation runbook Check 8 (page-path E2E). TODO 15 closed on this ADR section landing.

  • (2026-05-08) Cert-spec schema and UID parameterisation. Resolved by this ADR’s “Decision: Cert-spec schema — parameterised UIDs, no defaults baked in” section. UIDs are per-host facts (host_container_uids dict) referenced from cert-specs via | mandatory so a missing fact fails the play loudly. The schema is the shared contract for the role, the TODO 7 runtime-drift script, and the TODO 4 alert substrate’s service labels. TODO 3 closed on this ADR section landing. A sketch of one spec entry follows.
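
    A hedged sketch of a single cert-spec entry (key names are illustrative; the authoritative schema ships with the role). The point is the | mandatory guard on the per-host UID fact:

    ```yaml
    # One entry in the role's cert-spec list (names assumed, not repo code).
    cert_specs:
      - service: influxdb
        sans: [data.ccat.uni-koeln.de]
        cert_dir: /srv/influxdb/tls                                    # path assumed
        owner_uid: "{{ host_container_uids['influxdb'] | mandatory }}" # missing fact -> play fails loudly
        reload_adapter: influxdb                                       # one of the three (plus noop) adapters
    ```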

  • (2026-05-08) Alert substrate. Resolved by replacing the single-substrate framing (SSH-plane piggyback) with a tiered design — see “Decision: Alert substrate — tiered, with a TLS-independent backstop”. The SSH-plane piggyback recommendation in an earlier draft of this ADR rested on a wrong premise: the SSH plane shares the same Telegraf → InfluxDB pipe and so shares fate with the database TLS chain it was supposed to monitor. The tiered fix: Tier 1 (Telegraf + Grafana + Matrix) for everyday paging, Tier 2 (cron + OnFailure= + mailx to admin_email_addresses) as the load-bearing TLS-independent backstop, Tier 3 (journald + Loki + Grafana) for issuance-frequency anomaly detection. TODO 4 closed on this ADR section landing. The on-call hand-off contract clause is explicitly deferred to a follow-up — channels exist; the rotation contract is a team-structure decision for when a rotation exists.


Out-of-scope#

Things this PRD and ADR explicitly do not address. Each item is here because someone has asked or might reasonably ask, and the answer is “not in this rollout”:

  • 2-week F→G dual-trust soak. Waived under the time-bound setup-mode argument in “Decision: Migration style”. Revisit if production becomes populated before Phase G ships. Do not cite this ADR as precedent for skipping a soak on a populated production stack.

  • Migrating Redis off mTLS to server-auth-only-with-password. Inertia, not principle (see “Decision: mTLS scope asymmetry”). Revisit at the next data-transfer or ops-db-api credentials refactor.

  • CRL or OCSP infrastructure. Lifetime-as-revocation only (see “Decision: Revocation stance”). A regulatory ask for a CRL endpoint is a future ADR.

  • ops-db-api inbound TLS (nginx-proxy ops-db-api). Currently undecided (TODO 11). Once chosen, the answer goes into “Operational notes” if in-scope, or remains here if explicitly out-of-scope, or moves to its own ADR.

  • Cert-transparency / public-log integration. Step-ca is a private CA; not applicable.

  • Baking the CCAT root into application images. Trust distribution decision: bind-mount + env var, not image rebuild.

  • A unified CLI surface upfront (ccat tls rotate, ccat tls status). YAGNI; design after the role works.

  • Renewal-job log retention beyond journalctl. Covered by the general logging / Loki policy, not this PRD.

  • Backup-as-cert-recovery-path. Backup coverage of service-cert directories is not confirmed by ITCC (TODO 17). The role re-applying after a host reinstall is the recovery path; backups are best-effort defence in depth, not load-bearing.

  • F→G soak in any future TLS migration on populated production. See “Decision: Migration style” → Time-bound. The waiver above applies to setup mode only; future migrations on a populated stack must use a soak.

  • On-call hand-off contract for the alert substrate (who acks, escalation timeout, expected MTTR). Channels exist (Tier 1: Matrix #ccat-ops, Tier 2: admin_email_addresses mail). The rotation/ack/MTTR contract is a team-structure decision deferred until a real on-call rotation exists. Track as a follow-up; not blocking PRD #95.


Operational notes#

This section consolidates the operational concerns surfaced through the per-decision sections above into one place for on-call. Each item points back to where the rationale lives.

  • HSM offline budget. 30d production / 15d staging (HSM blast radius decision). Beyond budget: manual re-issuance ceremony via JWK provisioner password (vault → 0400 host tmpfile → unlink in an always: block of the issuance play). This is not auto-recovery — it requires an operator with vault access to run the issuance script.
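
    A hedged sketch of the ceremony’s shape in Ansible terms, mirroring the _per_container.yml convention (variable and path names assumed; the vault_step_ca_prov_* naming follows the schema note under “Consequences”):

    ```yaml
    # Manual re-issuance: stage the JWK provisioner password, issue, always unlink.
    - name: Re-issue service cert after HSM budget exceeded
      block:
        - name: Stage provisioner password as 0400 host tmpfile
          ansible.builtin.copy:
            content: "{{ vault_step_ca_prov_services }}"   # vault var name assumed
            dest: /run/step-prov-pass                      # staging path assumed
            owner: root
            group: root
            mode: "0400"
          no_log: true

        - name: One-shot issuance; cert-as-auth renewal takes over afterwards
          ansible.builtin.command:
            # Script path is illustrative; --password-file matches the
            # issue-vhost-cert.sh pattern cited in References.
            argv: [/usr/local/sbin/issue-cert.sh, --password-file, /run/step-prov-pass]

      always:
        - name: Unlink the staged password, success or failure
          ansible.builtin.file:
            path: /run/step-prov-pass
            state: absent
    ```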

  • Renewal cadence. 12h timer per host, modelled on step-ca/renew-vhost-cert.sh. Most fires are no-ops because step ca renew only contacts the CA in the last 1/3 of cert lifetime. A misconfigured timer (or a --force storm during rollout) is a CA-DoS risk; throttle / serialise mass issuance during phase rollouts (TODO 6).
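
    For reference, a condensed sketch of the renew pattern the timer runs, paraphrasing step-ca/renew-vhost-cert.sh rather than quoting it (filenames and the reload command are placeholders; real reload logic lives in the per-service adapters):

    ```sh
    # PRE_MTIME / step ca renew / POST_MTIME / conditional reload.
    CRT=/etc/step/certs/site.crt
    KEY=/etc/step/certs/site.key
    PRE_MTIME=$(stat -c %Y "$CRT")
    # Per this ADR, the CA is contacted only inside the renewal window
    # (last 1/3 of lifetime by default); outside it this fire is a no-op.
    step ca renew "$CRT" "$KEY" || exit 1    # failure propagates to OnFailure=
    POST_MTIME=$(stat -c %Y "$CRT")
    if [ "$POST_MTIME" != "$PRE_MTIME" ]; then
      docker exec some-service kill -HUP 1   # placeholder; adapter-specific in practice
    fi
    ```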

  • Trust-anchor rotation requires container restart. Single-file bind-mounts pin the source inode (TODO 14). Any change to /etc/pki/ca-trust/source/anchors/ followed by update-ca-trust extract REQUIRES a rolling restart of every container that bind-mounts the trust bundle. The host gets the new file atomically; running containers do not. Either accept this and document the restart in the rotation procedure, or move to a directory bind-mount (TODO 14 alternative).
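
    The rotation procedure in sketch form (anchor filename and container names illustrative; the restart loop is the step the inode-pinning makes mandatory):

    ```sh
    # Rotate the CCAT root anchor, then restart every trust-mounting container.
    install -m 0644 ccat-root-new.pem /etc/pki/ca-trust/source/anchors/ccat-root.pem
    update-ca-trust extract               # host bundle swaps atomically (Check 3)
    for c in redis postgres influxdb; do  # real list = the docker-compose.trust.yml matrix
      docker restart "$c"                 # bind-mounted bundle is re-resolved only on restart
    done
    ```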

  • Postgres replica during rotation. Primary and replica must not renew simultaneously while replication is mid-write (TODO 10). The chosen ordering — primary-first with a wait gate, or replica-first, or a coordination lock — is recorded under TODO 10 acceptance and migrates here once decided.

  • Alert path independence — tiered substrate. Three paths (alert substrate decision, TODO 4 closed). Tier 1 (Telegraf → InfluxDB → Grafana → Matrix #ccat-ops) is the everyday paging path and shares fate with input-b services. Tier 2 (systemd OnFailure= + daily 06:00 cron heartbeat → mailx to admin_email_addresses) is the load-bearing TLS-independent backstop — it does not transit any step-ca-issued cert. Tier 3 (journald → promtail → Loki → Grafana) is the issuance-anomaly audit. Operational rule for on-call: Tier 1 silence does not mean all is well, because Tier 1 shares fate with what it watches. The daily Tier 2 heartbeat mail is the positive-confirmation signal; its absence for ≥ 36h from any host is itself a problem.
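
    A hedged sketch of the Tier 2 heartbeat (script path, cert glob, and recipient are assumptions); the design point is that it mails unconditionally, so silence is the alarm:

    ```sh
    #!/bin/sh
    # /usr/local/sbin/cert-heartbeat.sh (path assumed), run from cron at 06:00:
    #   0 6 * * * root /usr/local/sbin/cert-heartbeat.sh
    set -eu
    HOST=$(hostname -f)
    # Report notAfter for every managed cert; send even when everything is healthy,
    # because absence of this mail >= 36h is the on-call signal.
    for crt in /etc/step/certs/*/site.crt; do        # glob assumed
      printf '%s %s\n' "$crt" "$(openssl x509 -enddate -noout -in "$crt")"
    done | mailx -s "[cert-heartbeat] $HOST $(date -Iseconds)" admins@example.org
    # recipient is a placeholder; templated from admin_email_addresses in practice
    ```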

  • Container UIDs are per-host parameters. Cert-spec UIDs (TODO 3) are parameterised, not hardcoded; runtime UID drift is detected by the renewal script (TODO 7) by reading /proc/1/status inside each container (docker exec ... id defaults to root and is the wrong probe; runbook Check 5 captured this gotcha — sketch below). Today’s values, observed on input-b.staging 2026-05-08: Redis 999, Postgres 999, InfluxDB 1000. influxdb:latest is the only unpinned image in the stack — drift risk concentrates there.
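
    The wrong and right probes side by side (container names from this stack; the grep/awk split keeps only grep running inside the image):

    ```sh
    # WRONG: reports the exec user (root unless the image sets USER), not the service UID.
    docker exec redis id -u                  # -> 0 on a root-exec'd container

    # RIGHT: the real UID of the container's PID 1, straight from the kernel.
    for c in redis postgres influxdb; do
      uid=$(docker exec "$c" grep '^Uid:' /proc/1/status | awk '{print $2}')
      echo "$c pid1 uid=$uid"                # expect 999 / 999 / 1000 per the note above
    done
    ```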

  • Backup is not the cert recovery path. Service-cert directories may not be in the central Commvault policy (TODO 17, ITCC ticket pending). Recovery on host reinstall is “re-run the step_ca_vhost_cert role”. Document this in the role README.

  • Break-glass SSH already provided by existing infrastructure. The architect’s worry — “if step-ca is down >24h, every operator’s SSH cert expires and nobody can SSH in to fix it” — assumed step-ca-issued user certs are the only operator auth path. They are not. ansible/roles/system_setup/tasks/nitrokey_ssh.yml applies per-operator FIDO2 hardware-key pubkeys (roles/system_setup/files/pubkeys/<username>/*.pub) directly to authorized_keys on every managed host, outside the AuthorizedPrincipalsFile cert path. Out-of-band hardware access (iDRAC / hypervisor console) provides the second tier for hardware-level recovery. The role split is: Nitrokey for core admins (physically present, hardware key in pocket), step-ca SSH certs for out-of-core / remote admins where shipping a hardware key is impractical. TODO 5 is dropped on this basis; Check 11 signed off as N/A.

  • Compose layering is anchored in docker-compose.trust.yml. Single source of truth for the service-trust bind-mount matrix (compose-layering decision). Validation runbook Check 6 inventory is the input set; per-service has_trust: true spot-check lands in the same Phase B PR as the trust file itself.
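
    A hedged sketch of the trust layer’s shape (mount source is the standard RHEL extracted bundle; the in-container target path and env var are assumptions, per the bind-mount + env var decision above):

    ```yaml
    # docker-compose.trust.yml — trust bind-mount matrix, one stanza per service.
    # Layered in via CONTEXT_COMPOSE, conceptually:
    #   docker compose -f docker-compose.yml -f docker-compose.trust.yml up -d
    services:
      influxdb:
        volumes:
          - /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem:/etc/ssl/certs/ccat-root.pem:ro
        environment:
          SSL_CERT_FILE: /etc/ssl/certs/ccat-root.pem   # env-var convention assumed
    ```

    The single-file mount is exactly what makes the TODO 14 restart-on-rotation rule above apply.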


Consequences — overall#

What becomes easier:

  • One trust root for the whole CCAT stack (SSH plane, vhost cert, three databases). Operators need to know one CA, one root file path, one renewal model.

  • Retiring roles/redis_certs/ and redis/<variant>/certs/ removes a homegrown PKI with four parallel CAs that nobody outside this team can audit.

  • Adding a fourth TLS-consuming datastore later is “add a reload-strategy adapter”, not “build new PKI”.

What becomes harder:

  • The CCAT root CA / HSM is now load-bearing for more things. HSM ceremony cadence and HSM availability matter more than they did. The soft-offline budget gives us 30d production / 15d staging headroom but the calculus is now “how long can the HSM be offline” not “how long can the redis-certs CA be offline” (which was effectively infinite because that CA was a file on input-b’s disk).

  • The pluggable-adapter design means the role has three test surfaces, not one. Plan for that in the test plan.

New operational duties:

  • Watch the tiered alert substrate for DB cert renewal failures: Tier 1 Matrix pages for everyday failures, plus the daily Tier 2 heartbeat mail as positive confirmation (decision section: alert substrate).

  • Maintain the schema entry for any new vault_step_ca_prov_* passwords (lines up with the existing vault schema work in data-center-computer-setup/vars_application_schema.yml).

  • The ccat redis-certs CLI commands (currently in ctl) get superseded; plan a CLI surface for the new role (ccat tls rotate <service>, ccat tls status). Don’t build it before the role works; YAGNI.


References#

Files verified to exist in the repo at the time of writing:

  • step-ca/issue-vhost-cert.sh — one-shot issuance pattern (JWK provisioner password via --password-file, atomic .new install, docker exec reload).

  • step-ca/renew-vhost-cert.sh — step ca renew cert-as-auth pattern, PRE_MTIME/POST_MTIME conditional reload, 12h timer cadence.

  • ansible/roles/ssh_service_cert/tasks/_per_container.yml — password-staging-from-vault → 0400 host tmpfile → unlink convention, community.docker.docker_container_exec with stdin-only password delivery.

  • ansible/roles/ca_trust/ — RHEL system-anchor distribution for the CCAT root.

  • ansible/roles/redis_certs/ — homegrown PKI being retired by this ADR.

  • redis/{main,ccat,develop,develop-ccat}/certs/ — per-variant CAs being sunset.

  • grafana/provisioning/{production,staging}/datasources/influxdb-datasource.yaml — current tlsSkipVerify: true lines, plain HTTP datasource URL.

  • docs/source/adr/0001-ca-per-vhost-cert-split.md — prior ADR on the CA’s own vhost cert; format and reasoning style mirrored here.

  • PRD: ccatobs/system-integration#95 — defers full decision tree to this document.