# ADR-0002 — Step-CA-issued TLS certificates for Redis, Postgres, and InfluxDB

**Status.** Proposed, 2026-05-08.
**Related.** ccatobs/system-integration#95 (PRD).
**Supersedes.** Nothing on disk; the homegrown PKI under `ansible/roles/redis_certs/` and `redis/<variant>/certs/` is what this retires.
## Context

Today three datastores in the stack use ad-hoc or absent TLS:

- Redis — homegrown CA per environment variant (`ansible/roles/redis_certs/`, four self-signed CAs under `redis/{main,ccat,develop,develop-ccat}/certs/`). mTLS is configured, but the CA story is bespoke and cannot be audited as part of the rest of the CCAT trust chain.
- Postgres — TLS not enforced; client traffic in the clear or with trust-on-first-use.
- InfluxDB — fronted today on plain HTTP. The Grafana datasource at `grafana/provisioning/production/datasources/influxdb-datasource.yaml` literally points at `http://data.ccat.uni-koeln.de:8086` with `tlsSkipVerify: true`. The same pattern lives in `grafana/provisioning/staging/datasources/`.
The CCAT step-ca endpoint already issues:

- the SSH user-cert plane (`ansible/roles/ssh_service_cert/`),
- the public TLS cert for the CA's own vhost (`step-ca/issue-vhost-cert.sh`, `step-ca/renew-vhost-cert.sh`, `step_ca_vhost_cert.timer` every 12h).
PRD #95 proposes routing the three datastores onto the same step-ca-issued path: a single root of trust, predictable lifetimes, and the same operator muscle memory.
A senior-architect review of #95 blocked the PRD on this ADR existing. The PRD names the headline decisions (“JWK”, “hard cutover”, “multi-SAN no IP”) but defers the reasoning to here. This ADR is that reasoning.
## Important PRD correction

The PRD claims it copies an existing `step_ca_vhost_cert` Ansible role verbatim. There is no such role on disk. The prior art is:

- the scripts at `step-ca/issue-vhost-cert.sh` (one-shot issuance) and `step-ca/renew-vhost-cert.sh` (PRE_MTIME / `step ca renew` / POST_MTIME / conditional reload),
- the per-container password-staging convention in `ansible/roles/ssh_service_cert/tasks/_per_container.yml` (vault → 0400 host tmpfile → unlink in an `always:` block; or stdin-only via `docker_container_exec` for the in-container case).

The implementation must build the Ansible role from these patterns, not from a role that does not exist. Any reader of this ADR or the PRD should not waste time grepping for `step_ca_vhost_cert/` under `ansible/roles/`.
## Decision

Issue TLS certs for Redis, Postgres, and InfluxDB from the CCAT step-ca via the JWK provisioner, using cert-as-auth (`step ca renew`) on a 12h timer cadence. Cut over hard, no dual-trust soak. Implement as a single parameterised Ansible role (`step_ca_vhost_cert`) plus four pluggable reload-strategy adapters — not as one deep module pretending all three services are the same shape.
Per-decision detail follows.
## Decision: Provisioner choice — JWK over ACME and X5C

### Context
step-ca offers three provisioner classes for non-interactive cert flows: ACME (HTTP-01, DNS-01, TLS-ALPN-01), X5C (cert-presented-as-auth, but chained to an external trust root), and JWK (password-or-key-protected provisioner credentials).
### Decision
Use JWK, with step ca renew (cert-as-auth) for steady-state
renewal. Initial issuance presents the JWK provisioner password; every
renewal thereafter authorises with the cert’s own private key, so the
provisioner password never has to live on the renewing host past the
one-shot issuance step.
Working precedent: step-ca/renew-vhost-cert.sh does exactly this for
the ca.ccat.uni-koeln.de vhost cert. step ca renew only contacts
the CA inside the renewal window (last 1/3 of lifetime by default), so
a 12h timer is benign — most fires are no-ops.
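That mtime gate can be sketched as a minimal sh snippet. The `step ca renew` call is stubbed out here so the gate logic runs standalone; the real script calls the CLI at the marked line:

```shell
CRT=$(mktemp)   # stand-in for the real cert path

renew() {
  # Real script: step ca renew --force --expires-in 720h server.crt server.key
  # (contacts the CA only inside the renewal window; no-op otherwise).
  # Stub: pretend the CA issued a new cert.
  sleep 1 && touch "$CRT"
}

PRE_MTIME=$(stat -c %Y "$CRT")
renew
POST_MTIME=$(stat -c %Y "$CRT")

# Reload only when the cert file actually changed; most 12h timer
# fires leave the mtime untouched and fall through as no-ops.
if [ "$PRE_MTIME" != "$POST_MTIME" ]; then
  echo "reload"
else
  echo "no-op"
fi
```

The `--expires-in 720h` value is the production window (last third of a 90d lifetime); staging would use its own threshold.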
### Alternatives considered

- ACME HTTP-01. Would require the CA to reach the requesting service on port 80. In our topology Redis on input-b is firewalled off the public internet; Postgres on input-a is internal; InfluxDB has its own vhost path. Opening HTTP-01 challenge paths through the proxy for three more vhosts adds a brittle coupling between CA, proxy config, and ACME challenge timing — the same operational class of problem ADR-0001 already had to navigate to get the CA's own vhost cert working.
- ACME DNS-01. Would require the CA to orchestrate DNS records in the Uni-Köln DNS zone. We do not control that zone programmatically; a manual record-flip per renewal is unacceptable on a 12h cadence.
- ACME TLS-ALPN-01. Same firewall constraint as HTTP-01, plus the Redis/Postgres/Influx daemons are not HTTP servers and cannot serve the challenge.
- X5C. Would require us to bootstrap a separate trust root just to authorise these provisioners, then maintain it. It does not solve a problem JWK doesn't already solve; it adds a parallel trust path we'd then have to monitor.
### Consequences

- The JWK provisioner password is in vault (`vault_step_ca_prov_*_password`) and only reaches the issuing host via Ansible's vault → 0400 tmpfile → unlink pattern from `roles/ssh_service_cert/tasks/_per_container.yml`. Steady-state renewals do not touch the password at all.
- All renewals share one well-trodden path (`step ca renew`), so an operator who has debugged the vhost cert renewal already knows how to debug a Redis cert renewal.
- An open question (see below): does `step ca renew` succeed against an already-expired auth cert? If not, an HSM outage that exceeds the renewal budget plus the window between renewal and expiry forces a fall-back to the JWK-password path.
## Decision: Migration style — hard cutover, no dual-trust soak

### Context
The architect’s default recommendation for any TLS migration is a two-week dual-trust soak (old CA + new CA both accepted, then flip). The PRD instead proposes a hard cutover for all three services.
### Decision
Hard cutover. This is consistent with the existing TLS-hard-cutover-policy ADR captured in project memory (2026-05-07): step-ca trust + DB certs roll out via deploy-time restart, not dual-trust.
### Why this stands here, even though architects would normally object
Production is currently in setup mode: there are no end users on the operations DB, no live data streams flowing through the transfer pipeline, no externally consumed Grafana dashboards depending on the InfluxDB datasource. A two-week soak buys nothing because the “availability we’d be protecting” doesn’t exist yet. The cost of a soak (double-config, more code paths, more places for a misconfigured client to silently fall back to the old trust path) is real today; the benefit is zero today.
### Time-bound — read this before reusing this precedent
The above is only true while production is unpopulated. Once the operations DB carries real observation records, once the data-transfer pipeline is moving live telescope data, once Grafana dashboards are being watched by humans on call — the calculus flips. Any future similar migration on a populated production stack must use a soak. Do not point at this ADR as precedent for skipping a soak on a live system. The precedent is “skip soak when there are no users”, not “skip soak in general”.
### Alternatives considered

- 2-week F→G dual-trust soak. Standard playbook. Rejected on the cost/benefit argument above, time-bound to the empty-production state.
- Service-by-service phased cutover (Redis first, then Postgres, then Influx). Rejected as not actually safer in the current state — each service still hard-cuts when its turn comes; the phasing only spreads operator attention thinner. We will sequence by readiness of the reload adapter (probably Postgres first because `pg_reload_conf()` is the cheapest), not by risk-mitigation.
### Consequences

- A failed cutover is a service outage on whichever datastore failed. Mitigation: rehearse on staging first; the staging environment uses the same step-ca and the same role.
- This ADR must be revisited (and likely rewritten) before the next TLS migration on a populated production stack. Add a checkbox to the production-readiness review.
## Decision: Renewal architecture — one role, four reload adapters

### Context
The PRD as drafted proposed a single Ansible module that takes a `cert_spec` dict (name, SANs, lifetime, owner, mode, reload-command) and handles Redis (mTLS + redis-cli CONFIG SET), Postgres (server-only `pg_reload_conf()`), and InfluxDB (server-only + container restart) through that one shape. The architect review pushed back: a single dict that has to fork on `if redis else if postgres else if influx` inside the module is a deep-module fiction — the fork is inherent to the problem and pretending it isn't makes the module's interface lie.
### Decision

Build one parameterised role (`step_ca_vhost_cert`) that handles:

- issuance via the JWK provisioner,
- on-disk cert layout, ownership, mode,
- the renewal timer/script (modelled on `step-ca/renew-vhost-cert.sh` with PRE_MTIME / POST_MTIME conditional reload),
- trust-anchor consumption from `roles/ca_trust/`.
…and expose a pluggable reload-strategy interface with four adapter implementations:
| Adapter | Service | Reload mechanism | Downtime |
|---|---|---|---|
| `runtime_redis` | Redis | `redis-cli CONFIG SET` of the TLS file paths | zero |
| `runtime_postgres` | Postgres | `SELECT pg_reload_conf();` | zero |
| `restart_influx` | InfluxDB | `docker restart` + wait-for-healthy | ~30s |
| `noop` | (canary or no-service-attached cert) | nothing — write files, exit 0 | n/a |
The role takes a `reload_strategy` parameter that selects one of these four; the adapter's contract is "given a cert that was just renewed, make the running service serve it" (or, for `noop`, "verify the new files exist and exit"). Anything that doesn't fit one of these adapters is an implementation surprise that deserves a new adapter, not a special case inside the existing ones.

`noop` is the fourth adapter; it exists for certs that have no service to reload — the x509 canary on `input-c.staging` (see "Decision: x509 canary") is its first user. A future cert that participates in the trust chain but is read by external tooling rather than a running service (e.g., a public-facing inspection endpoint) can also use it.
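A hypothetical sketch of the dispatch, assuming an `include_tasks`-per-adapter layout (task-file names are illustrative, not the real role's):

```yaml
# tasks/main.yml (hypothetical layout)
- name: Issue or renew each cert, staging files on disk
  include_tasks: renew_cert.yml
  loop: "{{ service_tls_certs }}"

- name: Apply the service-specific reload strategy
  include_tasks: "reload_{{ item.reload_strategy }}.yml"
  loop: "{{ service_tls_certs }}"
  # Resolves to reload_runtime_redis.yml, reload_runtime_postgres.yml,
  # reload_restart_influx.yml, or reload_noop.yml. Adding a datastore
  # means adding a file, not editing a fork.
```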
### Why four adapters is honest deep-module design
Ousterhout’s “deep module” guidance is narrow interface, broad implementation — emphatically not “one interface that secretly does four different things”. The reload mechanism is genuinely different across the four cases (CONFIG SET vs SQL function call vs container restart vs no-op) and the operational consequences differ (zero vs zero vs 30s downtime vs none). Forcing them into one cert-spec dict makes the caller’s mental model wrong: they think they have one knob, they actually have four with different blast radii. The pluggable adapter makes the asymmetry visible at the call site:
```yaml
- role: step_ca_vhost_cert
  vars:
    cert_spec: { ... }
    reload_strategy: restart_influx   # explicit: this one restarts
```
### Alternatives considered

- One module, fork-on-service inside. Rejected per above — hides the asymmetry from the caller.
- Three independent roles (`redis_step_cert`, `postgres_step_cert`, `influx_step_cert`). Rejected because the issuance + on-disk + renewal-timer machinery would be duplicated three ways. The whole point of the consolidation in #95 is to retire bespoke per-service PKI plumbing.
- One module, reload-command as a literal shell string parameter. Rejected because the contract for "reload after renewal" is more than one shell line: it includes idempotency (no reload on no-op renewal), error handling (a failed reload should not leave the cert file half-installed), and in the InfluxDB case a wait-for-healthy step. That logic belongs in named adapters, not in free-form shell.
### Consequences

- Adding a fifth datastore later (e.g. MinIO, Loki) is "write a fifth adapter", not "extend the cert-spec dict".
- The role's interface stays narrow (`cert_spec` + `reload_strategy`) while the implementation is honest about the four-way fork.
- Tests can target each adapter independently — important because the InfluxDB adapter is the only one with downtime semantics and needs different verification.
## Decision: Reload mechanisms (per service)

This is the per-service detail behind the table in the previous section.

### Redis — CONFIG SET, zero downtime
Redis 6+ accepts runtime updates of tls-cert-file / tls-key-file /
tls-ca-cert-file via CONFIG SET. The connection pool isn’t churned;
existing TLS sessions live out their natural deaths and new sessions
pick up the new material.
Failure mode to test: if CONFIG SET succeeds but the new files are
unreadable by the redis user (UID 999 in our containers), Redis logs
the error and keeps using the old in-memory cert. The renewal script
must verify post-CONFIG-SET that the active cert serial matches the
on-disk cert serial.
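That serial check could look like the following sketch (paths, endpoint, and the client-cert arguments are illustrative):

```shell
# Serial of the freshly written on-disk cert.
on_disk_serial() {
  openssl x509 -noout -serial -in "$1"
}

# Serial of the cert the live daemon is actually serving.
# Args: host:port, CA file, client cert, client key
# (the server is mTLS, so the probe must present a client cert).
live_serial() {
  openssl s_client -connect "$1" -CAfile "$2" -cert "$3" -key "$4" \
      </dev/null 2>/dev/null | openssl x509 -noout -serial
}
```

The adapter fails loudly when `on_disk_serial server.crt` and `live_serial redis:6379 ca.crt client.crt client.key` disagree, which is exactly the "CONFIG SET succeeded but files unreadable" case.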
### Postgres — pg_reload_conf(), zero downtime

`SELECT pg_reload_conf();` re-reads `postgresql.conf`, including `ssl_cert_file` and `ssl_key_file`. Existing connections keep their TLS context; new connections get the new cert. Same caveat as Redis: verify the postmaster actually picked up the new cert; a typo in the config path is a silent fallback.
### InfluxDB — docker restart, ~30s downtime

InfluxDB OSS does not have a runtime reload for TLS material. We accept the restart. The 30s window is acceptable for the InfluxDB role: it ingests metrics from telegraf, which buffers locally, and serves Grafana dashboards, which retry. No write path depends on InfluxDB being up second-by-second.

The restart adapter must:

- pre-flight that the new cert is syntactically valid (`openssl x509 -noout -text`) before bouncing the container,
- use `docker restart` (not `docker stop && docker start` — the former preserves the container's IP / aliases on the user-defined network),
- wait for `/health` to return 200 before declaring success.
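A sketch of those three steps as shell functions; the container name, cert path, and health URL are illustrative arguments:

```shell
# 1. Never bounce the container onto a cert that doesn't even parse.
preflight() {
  openssl x509 -noout -text -in "$1" >/dev/null 2>&1
}

# 3. Bounded wait for the health endpoint after the bounce (~60s cap).
wait_healthy() {
  for _ in $(seq 1 30); do
    [ "$(curl -sk -o /dev/null -w '%{http_code}' "$1")" = "200" ] && return 0
    sleep 2
  done
  return 1
}

# The adapter itself: pre-flight, restart (docker restart keeps the
# container's IP/aliases on the user-defined network), wait for health.
restart_influx() {
  preflight "$1" || return 1
  docker restart "$2" >/dev/null || return 1
  wait_healthy "$3"
}
```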
## Decision: HSM blast radius / soft-offline budget

### Math
The CCAT root CA lives on an HSM. If the HSM is offline for any reason
(physical access loss, ceremony in progress, hardware fault), the CA
cannot issue or renew. Every cert lives until its notAfter; the
“soft-offline budget” is how long the HSM can be offline before
something starts hard-failing.
| Environment | Cert lifetime | Renewal starts | Renewal cadence | Soft-offline budget |
|---|---|---|---|---|
| Production | 90d | day 60 (2/3 lifetime) | 12h timer = 60 fires before expiry | 30d / 60 fires |
| Staging (PRD draft) | 30d | day 20 | 12h timer = 20 fires before expiry | 10d / 20 fires |
| Staging (revised) | 45d | day 30 | 12h timer = 30 fires before expiry | 15d / 30 fires |
### Decision
Production stays at 90d / 30d budget — comfortable headroom for an HSM ceremony (typically 1-2 days) plus one weekend of bad luck.
Architect-mandated change to the PRD: staging at 30d / 10d budget is too tight. A long weekend plus a sick on-call plus a stuck CI run eats most of the budget. Extend staging cert lifetime to 45d (budget 15d / 30 fires).
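The table's production row, spelled out as arithmetic (the staging-revised row follows the same formula with a 45d lifetime, giving 15d / 30 fires):

```shell
LIFETIME_D=90
RENEW_OPENS_D=$((LIFETIME_D * 2 / 3))      # day 60: last third of lifetime begins
BUDGET_D=$((LIFETIME_D - RENEW_OPENS_D))   # 30d soft-offline budget
FIRES=$((BUDGET_D * 24 / 12))              # 12h timer: 60 fires before expiry
echo "renew opens day ${RENEW_OPENS_D}, budget ${BUDGET_D}d / ${FIRES} fires"
```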
### Alternatives considered

- Document the operational acceptance of 10d on staging. Available if anyone has a strong reason for keeping staging cert lifetimes short (often "make rotation visible in CI cadence"). Rejected because staging exists to rehearse production failure modes, and a tighter-than-production budget makes staging a worse rehearsal, not a better one.
- Match staging to production at 90d. Rejected because we do want staging to exercise the renewal path more frequently than production; 45d gives us that without making the budget uncomfortable.
### Consequences

- One more variable to keep aligned across the three services on staging. The role's `cert_spec.lifetime` parameter handles this.
- The PRD's table needs a one-line edit; flag for the implementation PR.
### Open question to pin before implementation
Does step ca renew succeed against an already-expired
authenticating cert? If yes, the budget math above is
straightforwardly correct: lose the HSM for 30d, recover, every host
catches up on the next timer fire. If no, then once a host’s cert
expires we drop back to the JWK-password path for that host, which
means the password file has to be ready to materialise on demand.
This is testable in staging with a deliberately back-dated cert. Do
this test before merging the implementation. Decision below assumes
the answer is “no” until confirmed; the role’s renewal script will
fall back to JWK-password issuance if step ca renew fails for an
expired-cert reason.
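Under the assumed "no" answer, the fallback chain could be sketched like this; the subject, the provisioner name, and the `fetch_provisioner_password` vault hook are illustrative:

```shell
renew_or_reissue() {
  crt=$1; key=$2
  # Normal path: cert-as-auth renewal, no password involved.
  if step ca renew --force "$crt" "$key"; then
    return 0
  fi
  # Renewal refused (e.g. the auth cert is already expired): fall back
  # to one-shot JWK-password issuance. The password is materialised
  # 0400 and unlinked afterwards, mirroring the ssh_service_cert
  # vault -> tmpfile -> unlink pattern.
  pw=$(mktemp)
  fetch_provisioner_password > "$pw" && chmod 0400 "$pw"   # hypothetical vault hook
  step ca certificate "$(hostname -f)" "$crt" "$key" \
      --provisioner staging-services \
      --provisioner-password-file "$pw" --force
  rc=$?
  rm -f "$pw"                      # never leave the password on disk
  return $rc
}
```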
## Decision: SAN policy

### Decision

Each cert carries multiple DNS SANs:

- the docker-network alias the service is reached at (e.g. `redis`, `postgres`, `influxdb`),
- the public FQDN (e.g. `redis.ccat.uni-koeln.de`),
- the host FQDN (e.g. `input-b.ccat.uni-koeln.de`).
No wildcards. No IP SANs.
### Reasoning

- No wildcards: a leaked `*.ccat.uni-koeln.de` cert grants the attacker every vhost we've ever named under that domain. Multi-SAN per cert keeps the leak blast radius to "this one service".
- No IP SANs: IP SANs tie the cert to a specific deployment topology. Move the service to a different host and the cert silently mismatches. DNS-only SANs decouple identity from placement; renumbering the IP plan stays a DNS-only operation. The redis_certs precedent included an IP SAN (`redis-certs_staging.conf` lists `IP:134.95.40.103`); we are retiring that.
- Multi-SAN per cert instead of "one cert per SAN": one renewal path per service, one cert file in one place. The reload adapters don't have to juggle three cert files for the same daemon.
### Consequences

- Adding a new alias to a service is a re-issuance, not a config edit. Acceptable because aliases change rarely and the role makes re-issuance trivial.
- The cert will list multiple SANs under `Subject Alternative Name` in `openssl x509 -noout -text` — do not treat this as a misconfiguration in inspection scripts.
## Decision: mTLS scope asymmetry

### Decision

- Redis: keep mTLS. Both server and client present certs.
- Postgres: server-auth-only. Server presents a cert; client authenticates with username + password as today.
- InfluxDB: server-auth-only. Server presents a cert; client authenticates with API token as today.
### Reasoning — and being honest about it
Redis stays mTLS because it’s already mTLS today (homegrown
redis_certs role) and because the application clients (data-transfer
workers, ops-db-api, etc.) already know how to present client certs.
Migrating Redis off mTLS at the same time as moving its trust root is
two changes at once. We are not doing two changes at once.
This is inertia, not principle. A clean-sheet design might well land all three on server-auth-only-with-password/token; mTLS for Redis buys us a marginal extra layer (compromise of the Redis password isn’t enough; you’d also need the client cert) but at the cost of distributing client material to every Redis-using service.
Revisit: when data-transfer or ops-db-api next has a credentials refactor, evaluate whether Redis mTLS is still pulling its weight or whether server-auth-only-with-password is enough. Track this as a follow-up; do not block #95 on resolving it.
### Consequences

- The `runtime_redis` reload adapter has to manage three files (`tls-cert-file`, `tls-key-file`, `tls-ca-cert-file`) — the CA file is what lets the server validate client certs. The other two adapters manage two files (cert + key only).
- Client-side trust distribution is asymmetric: Redis clients need both the CCAT root (to validate the server) and a client cert+key (to be validated by the server). Postgres/Influx clients only need the CCAT root. The `ca_trust` role already drops the root at `/etc/pki/ca-trust/source/anchors/ccat-root-ca.crt`; client cert distribution stays where it is (per-service, today via redis_certs) for now.
## Decision: Cert-spec schema — parameterised UIDs, no defaults baked in

### Context
The PRD as drafted hardcoded the container UIDs (Redis 999,
Postgres 999, InfluxDB 1000) as constants inside the role. The
architect review pushed back: upstream image rebases historically
shift UIDs without major-version bumps, so a baked-in constant is a
silent foot-gun. Validation runbook Check 5 (2026-05-08) confirmed
the values on input-b.staging are 999/999/1000 today, but it also
confirmed influxdb:latest is the only unpinned image in scope —
exactly the drift candidate.
The role needs a schema that (a) takes UID as a per-cert parameter with no role-level default, (b) sources the value from a per-host fact so different hosts can have different UIDs without code changes, (c) is the same schema the runtime drift-detection step (TODO 7) reads at renewal time.
This section follows the same shape as the SSH-cert plane’s
ansible/roles/ssh_service_cert/defaults/main.yml schema — same
pattern of “list of cert-spec dicts in host_vars, role is a no-op
when the list is empty”.
### Decision
The role (working name `step_ca_vhost_cert`, modelled on `step-ca/issue-vhost-cert.sh` + `step-ca/renew-vhost-cert.sh` plus `roles/ssh_service_cert/`) takes a list of cert-spec dicts called `service_tls_certs`. Per-host enablement lives in `ansible/host_vars/<host>/vars_step_ca_vhost_cert.yml`. The role defaults file declares `service_tls_certs: []` so the role is a no-op on hosts where the list is undefined.
```yaml
# ansible/host_vars/input-b.staging/vars_step_ca_vhost_cert.yml
host_container_uids:
  postgres: 999
  redis: 999
  influxdb: 1000

service_tls_certs:
  # ────────────────────────────── postgres ──────────────────────────────
  - service: postgres-main
    sans:
      - postgres
      - postgres.staging.data.ccat.uni-koeln.de
      - input-b.staging.data.ccat.uni-koeln.de
    cert_path: /etc/postgres-certs/server.crt
    key_path: /etc/postgres-certs/server.key
    owner_uid: "{{ host_container_uids.postgres | mandatory }}"
    owner_gid: "{{ host_container_uids.postgres | mandatory }}"
    cert_mode: "0644"
    key_mode: "0600"
    lifetime: "{{ step_ca_x509_cert_lifetime }}"
    reload_strategy: runtime_postgres
    container: system-integration-postgres-1
    provisioner: "{{ step_ca_x509_provisioner }}"
    vault_var_name: vault_step_ca_prov_staging_services_password
    mtls: false

  # ──────────────────────────────── redis ───────────────────────────────
  - service: redis-main
    sans:
      - redis
      - redis.staging.data.ccat.uni-koeln.de
      - input-b.staging.data.ccat.uni-koeln.de
    cert_path: /opt/redis-certs/staging/server.crt
    key_path: /opt/redis-certs/staging/server.key
    ca_path: /opt/redis-certs/staging/ca.crt   # mtls=true only
    owner_uid: "{{ host_container_uids.redis | mandatory }}"
    owner_gid: "{{ host_container_uids.redis | mandatory }}"
    cert_mode: "0644"
    key_mode: "0600"
    lifetime: "{{ step_ca_x509_cert_lifetime }}"
    reload_strategy: runtime_redis
    container: system-integration-redis-1
    provisioner: "{{ step_ca_x509_provisioner }}"
    vault_var_name: vault_step_ca_prov_staging_services_password
    mtls: true   # mTLS scope asymmetry

  # ────────────────────────────── influxdb ──────────────────────────────
  - service: influxdb-main
    sans:
      - influxdb
      - influxdb.staging.data.ccat.uni-koeln.de
      - input-b.staging.data.ccat.uni-koeln.de
    cert_path: /etc/influxdb-certs/server.crt
    key_path: /etc/influxdb-certs/server.key
    owner_uid: "{{ host_container_uids.influxdb | mandatory }}"
    owner_gid: "{{ host_container_uids.influxdb | mandatory }}"
    cert_mode: "0644"
    key_mode: "0600"
    lifetime: "{{ step_ca_x509_cert_lifetime }}"
    reload_strategy: restart_influx
    container: system-integration-influxdb-1
    provisioner: "{{ step_ca_x509_provisioner }}"
    vault_var_name: vault_step_ca_prov_staging_services_password
    mtls: false
```
### Field reference

| Field | Required | Description |
|---|---|---|
| `service` | ✓ | Canonical service name. Used in metric labels (Tier 1+3 of the alert substrate), filename suffixes, and journald audit-log lines. Format: `<service>-<variant>` (e.g. `postgres-main`). |
| `sans` | ✓ | DNS SAN list per the SAN-policy decision (no IP, no wildcard, multi-SAN). At minimum: docker-network alias + public FQDN + host FQDN. |
| `cert_path` / `key_path` | ✓ | On-host filesystem paths. The cert is bind-mounted into the container by the per-machine compose file. |
| `ca_path` | mtls=true only | Path to the CA bundle the server will use to validate client certs. Set only when `mtls: true`. |
| `owner_uid` / `owner_gid` | ✓ | No role-level default. Must be sourced from `host_container_uids.<service>`. |
| `cert_mode` / `key_mode` | ✓ | Typically `"0644"` / `"0600"`. |
| `lifetime` | ✓ | Sourced from a role-level default (`step_ca_x509_cert_lifetime`). |
| `reload_strategy` | ✓ | One of `runtime_redis`, `runtime_postgres`, `restart_influx`, `noop`. |
| `container` | ✓ | Compose-namespaced container name (e.g. `system-integration-redis-1`). |
| `provisioner` | ✓ | Name of the JWK provisioner on step-ca that issues this cert. Sourced from a role-level default (`step_ca_x509_provisioner`). |
| `vault_var_name` | ✓ | Name of the Ansible vault variable holding the provisioner password. Used at issuance only; renewals are cert-as-auth (`step ca renew`). |
| `mtls` | ✓ | Whether the server validates client certs. `true` for Redis only, per the mTLS scope asymmetry decision. |
### Per-host UID fact

`host_container_uids` is a separate dict in the same host_vars file. Two reasons:

- Reuse for TODO 7 runtime drift detection. The renewal script reads `host_container_uids.<service>` and compares to `docker exec <container> cat /proc/1/status` (PID 1's effective UID — `docker exec ... id` defaults to root and is the wrong probe; runbook Check 5 captured this gotcha). Mismatch = non-zero exit + Tier-1 + Tier-2 alert.
- Single source of truth per host. A future change that pins `influxdb:2.7-rootless` (UID 1001) is a one-line edit in `host_container_uids` rather than a hunt across multiple cert-spec entries.
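The parsing half of that probe can be sketched (and exercised) without a container. The `Uid:` line of `/proc/<pid>/status` carries real, effective, saved, and filesystem UIDs, in that order:

```shell
# Extract the effective UID (second value on the "Uid:" line)
# from a /proc/<pid>/status stream.
euid_from_status() {
  awk '/^Uid:/ {print $3}'
}

# Renewal-time probe: PID 1's effective UID inside the container.
# (docker exec ... id would report root, which is the wrong answer.)
pid1_euid() {
  docker exec "$1" cat /proc/1/status | euid_from_status
}
```

A mismatch between `pid1_euid <container>` and `host_container_uids.<service>` exits non-zero and raises the Tier-1 + Tier-2 alert.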
### Acceptance against TODO 3 clauses

- ✓ The role takes `owner_uid` as a per-cert-spec parameter — no defaults baked into the role. Defaults file sets `service_tls_certs: []`; cert-specs supply UIDs explicitly.
- ✓ Per-host vars under `ansible/host_vars/<host>/vars_step_ca_vhost_cert.yml` set the UIDs via `host_container_uids` and reference them in cert-specs.
- ✓ Actual UIDs as observed via `docker exec <container> cat /proc/1/status` are recorded in validation runbook Check 5 (2026-05-08, `input-b.staging`): postgres=999, redis=999, influxdb=1000. Cross-referenced from this section.
- → TODO 7 (runtime drift detection): the same `host_container_uids` dict is the source of truth at renewal time.
### Alternatives considered

- One flat dict per cert-spec mixing UID with everything else. Rejected: makes UID drift harder to reason about, and the runtime drift script would have to walk the cert-spec list to find the value rather than reading the UID dict directly.
- Role-level UID defaults (e.g., `default_postgres_uid: 999` in the role's `defaults/main.yml`). Rejected: defeats TODO 3's purpose. A future operator adding a host with non-default UIDs has to remember to override the default. Better to fail loud than to silently use a wrong default.
- Per-environment cert-spec lists in `group_vars/<env>/` rather than per-host. Rejected because UIDs are a host-level fact (different hosts can run different image variants); SANs and paths are also host-level. Putting them in `group_vars` would force every host in the group to have identical UIDs, which is exactly the drift foot-gun we are avoiding.
### Consequences

- The schema is the contract between the role, the TODO 7 runtime-drift script, and the TODO 4 alert-substrate scripts. Field renames are role-version-bump events.
- Cert-spec count grows linearly with services × variants. Today: 3 services × 2 environments × {main, ccat} variant where applicable = ~6-9 cert-specs across all hosts. Manageable.
- `influxdb:latest` UID drift risk stays open (no version pin), but is now a one-line `host_container_uids.influxdb` edit if it shifts. TODO 7's runtime probe catches the shift at the next renewal fire and refuses to write the new key — fail-closed before the reload would brick InfluxDB.
## Decision: x509 canary on input-c.staging — leading indicator for the cert plane

### Context
Option A on allowRenewalAfterExpiry (Resolved, 2026-05-08) makes
the protection contingent on detection: a cert+key snapshot leak
auto-bounds at notAfter only if the operator notices the renewal
chain has been broken before then. The SSH-cert plane already runs
24h user certs that act as an HSM/CA-health canary for the SSH
side; the x509 plane has no equivalent today.
Service certs are 90d (production) / 45d (staging) and only renew in the last 1/3 of lifetime, so a stuck renewal gives the alert substrate days-of-warning if it works. A 24h x509 canary fails within hours of any HSM/CA breakage on the x509 plane — long before any production cert is at risk. It is the leading indicator that proves the alert path is alive, and the smoke test for the JWK provisioner cert-as-auth flow specifically.
### Decision
Issue a 24h-lifetime x509 cert from the staging-services JWK
provisioner to a non-prod host. Target host: input-c.staging —
deliberately not input-b.staging so the canary does not share fate
with the CA host itself.
Cert-spec entry (lives in `ansible/host_vars/input-c.staging/vars_step_ca_vhost_cert.yml`):

```yaml
host_container_uids: {}   # canary has no container; no UID needed

service_tls_certs:
  - service: x509-canary
    sans:
      - x509-canary.input-c.staging.data.ccat.uni-koeln.de
      - input-c.staging.data.ccat.uni-koeln.de
    cert_path: /opt/x509-canary/canary.crt
    key_path: /opt/x509-canary/canary.key
    owner_uid: 0
    owner_gid: 0
    cert_mode: "0644"
    key_mode: "0600"
    lifetime: "24h"
    reload_strategy: noop   # new fourth adapter, see below
    container: ""           # no container; canary is host-only
    provisioner: staging-services
    vault_var_name: vault_step_ca_prov_staging_services_password
    mtls: false
```
This relies on a fourth reload adapter, noop, listed in the
Renewal architecture decision: “renew the cert, write the new
files, do nothing else”. No reload — nothing reads the canary at
runtime. The cert exists only for its own lifecycle metrics. noop
is general-purpose (the canary is its first user, but a future cert
that doesn’t need a service reload — e.g., a dual-purpose cert
inspected by external tooling — can use it too) and is exempt from
the operational-consequence asymmetry argument because there is no
service to reload.
### Renewal cadence and failure semantics

- Cert lifetime: 24h.
- Timer cadence: 12h (matches the production cert plan, so the canary exercises the same code path as production renewals).
- `step ca renew --expires-in` threshold: 18h. Below that threshold a renew attempt actually contacts the CA; above it the timer fire is a no-op (same gate the production timers will use, just with smaller numbers).
- Failure threshold for paging: failure to successfully renew within 18h of `notAfter` = Tier 1 + Tier 2 alert. The 6h gap between the renewal threshold and the page threshold gives one natural retry without paging.
If the canary cert expires (no successful renewal for >24h after notAfter), the alert substrate is itself broken — Tier 2 mail is the canary on the canary.
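The per-fire gate, sketched with the canary's numbers; `fire` is an illustrative helper, and the threshold mirrors what `step ca renew --expires-in 18h` decides internally:

```shell
RENEW_THRESHOLD=$((18 * 3600))   # contact the CA below 18h to expiry

fire() {   # $1 = seconds_to_expiry at the moment the timer fires
  if [ "$1" -lt "$RENEW_THRESHOLD" ]; then
    echo "renew"    # attempt actually contacts the CA
  else
    echo "no-op"    # above threshold: timer fire does nothing
  fi
}
```

A fire at 12h-to-expiry renews; one at 20h-to-expiry is a no-op.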
### Acceptance against TODO 15 clauses

- ✓ 24h-lifetime x509 cert from the `staging-services` JWK provisioner on a non-prod, non-CA host (`input-c.staging`).
- ✓ Renewal timer fires at 12h cadence; failure to renew within 18h of `notAfter` triggers a Tier 1 + Tier 2 page on the substrate from TODO 4 (alert substrate decision).
- ✓ `step_x509_cert_seconds_to_expiry{service="x509-canary"}` and `step_x509_cert_last_renewal_success{service="x509-canary"}` are wired into the alert substrate as the first metrics — end-to-end verification (Tier 1 alert visible in Grafana, Tier 2 mail actually delivered) happens before any production cert is enrolled. This is also Check 8 (page-path E2E) in the validation runbook.
- ✓ The cert-spec entry above is the canary configuration; the `noop` adapter is the implementation. Single artifact for both purposes.
### Why a non-CA host
The canary is supposed to fail fast when the HSM is unreachable. If
it lives on input-b.staging (which hosts the CA), an input-b
outage takes down the CA and the canary together — the canary’s
failure is then ambiguous between “CA is down” and “input-b is down
and the CA might be fine”. Hosting the canary on input-c.staging
removes that ambiguity: a canary failure with input-c.staging up
means the CA is unreachable from a peer host, which is exactly the
condition the canary exists to detect.
### Consequences

- Phase A scope adds the `noop` adapter (fourth adapter; trivial — write files, exit 0, emit metrics). Phase A scaffolding gains one cert-spec on `input-c.staging`.
- The canary is the validation-runbook Check 8 target. Check 8 is currently BLOCKED on TODO 15 (and on Phase A producing the role). Closing TODO 15 design unblocks Check 8 once Phase A lands.
- A new operational duty: if the canary alerts but no production cert has alerted, the operator's first move is "is the CA reachable from `input-c.staging`?" — `step ca health`, `nc -zv ca.ccat.uni-koeln.de 443` from input-c. Document in the on-call runbook (when one exists).
## Decision: Revocation stance — lifetime-as-revocation, no CRL/OCSP

### Decision
We do not stand up a CRL or OCSP responder. Compromised certs are handled by rotating the secret material and waiting for the cert to expire (90d production, 45d staging). For acute compromises, the runbook below is the response.
Trade-offs#
CRL. Operationally simple to publish, but every client has to fetch and trust it. Adding a fetch-and-trust step to telegraf, Grafana, three Celery worker fleets, and ops-db-api is real work for a threat model where we can already roll the underlying secret.
OCSP. Real-time but adds a hard dependency on the CA being reachable from every TLS handshake. We just spent ADR-0001 (`docs/source/adr/0001-ca-per-vhost-cert-split.md`) carefully containing the CA's reachability surface; OCSP would re-expand it.
Lifetime-as-revocation. The 90d ceiling means a compromised cert is automatically not-trusted within 90d without operator action. For acute compromise we roll the secret immediately; the cert remains technically valid until it expires, but the secret it protected is already changed.
Compromise modes — runbook headlines#
Full runbook: see the threat-model document (TODO: link when written).
| Mode | Headline response |
|---|---|
| Server key leaked (Redis/Postgres/Influx host private key on disk readable by attacker) | Re-issue the cert with the role ( |
| Client key leaked (Redis client cert on a compromised app host) | Rotate the client cert via the `redis_certs` successor flow. Same lifetime caveat. |
| HSM key leaked (root CA private key compromised) | Stop the CA; cut a new root via ceremony; redistribute via |
Consequences#
- Operators need to internalise "rolling the secret + waiting for expiry" as the revocation primitive. This is documented at the runbook level, not on every `ccat` CLI invocation.
- A future regulatory audit that asks for a "CRL endpoint" gets the answer "no CRL; lifetime ceiling and operator-led rotation". Be prepared to defend that.
Decision: Trust distribution — bind-mount + env vars, not image rebuild#
Decision#
The CCAT root CA is distributed to containers via a bind-mount of
`/etc/pki/ca-trust/source/anchors/ccat-root-ca.crt` (placed there by
`roles/ca_trust/`) and an env var pointing each application's TLS
library at it (e.g. `SSL_CERT_FILE`, `PGSSLROOTCERT`, etc.).
We do not bake the CA root into the application container images.
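A minimal compose sketch of the pattern, assuming illustrative service names and in-container paths (the real bind-mount matrix lives in `docker-compose.trust.yml`):

```yaml
# Illustrative fragment — not the actual docker-compose.trust.yml.
services:
  influxdb:
    volumes:
      # Host anchor placed by roles/ca_trust/, mounted read-only.
      - /etc/pki/ca-trust/source/anchors/ccat-root-ca.crt:/etc/ssl/ccat-root-ca.crt:ro
    environment:
      # Generic OpenSSL override; libpq clients would use PGSSLROOTCERT instead.
      SSL_CERT_FILE: /etc/ssl/ccat-root-ca.crt
```

Rotating the root then touches only the host file and a container restart, never a CI pipeline.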
Reasoning#
Baking the root into the image couples root-rotation cadence to CI
build cadence: every root rotation triggers a rebuild and redeploy of
every image. With bind-mount + env var, root rotation is “update one
file on disk via the ca_trust role, restart consumers” — independent
of CI.
This is the same separation already in effect for the SSH cert plane
(roles/ssh_service_cert/ mounts ~/.ssh from the host into spawned
agents, see commit ce87baa).
Consequences#
- Container images stay smaller and rebuild less often.
- The `ca_trust` role is now a hard dependency for every host that hosts a TLS-consuming container. This is already true today.
- A misconfigured bind-mount path silently turns into "no CA root" at the container level. The role must verify post-mount that the expected fingerprint is present.
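The post-mount fingerprint check might look like the following sketch (the function name and how the pinned value is sourced are assumptions; only the requirement itself comes from the text above):

```shell
#!/bin/sh
# Verify a mounted CA anchor against a pinned SHA-256 fingerprint.
# The pin would be carried as a role variable; here it is a parameter.
verify_anchor() {
  anchor=$1; expected=$2
  actual=$(openssl x509 -in "$anchor" -noout -fingerprint -sha256 | cut -d= -f2)
  if [ "$actual" != "$expected" ]; then
    echo "CA anchor fingerprint mismatch: got $actual" >&2
    return 1
  fi
  echo "CA anchor OK"
}
```

Failing loudly here converts the silent "no CA root" state into an actionable play failure.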
Decision: Alert substrate — tiered, with a TLS-independent backstop#
Context#
The PRD draft proposed renewal-failure alerts flowing telegraf → InfluxDB → Grafana → ops chat. The architect review caught a circular dependency: that alert path itself depends on the TLS trust chain we’re trying to monitor. If the trust chain breaks, the alert telling us so is silenced by the same break.
An earlier draft of this ADR section recommended “piggyback on the
SSH-cert plane” on the premise that the SSH-cert plane’s
failure-notification path is by construction independent of the
database TLS chain. That premise was wrong. Inspection of
ansible/roles/ssh_service_cert/templates/step-cert-monitor.sh.j2
plus ansible/roles/system_setup/files/telegraf.conf:960 shows the
SSH-cert plane emits step_cert and step_renew_failed measurements
via Telegraf [[inputs.exec]], and Telegraf’s
[[outputs.influxdb_v2]] writes to
http://db.data.ccat.uni-koeln.de:8086 — the same InfluxDB on
input-b that this PRD is hardening. The SSH-cert plane shares fate
with the database TLS chain. Piggybacking on it does not break the
circular dependency; it just inherits it under a different name.
The fix is not to rebuild on a different single substrate — it is to accept that any single substrate convenient enough to use day-to-day will share fate with something in the stack. We need a backstop tier that is genuinely independent.
Decision#
Tiered substrate, three independent paths:
Tier 1 — Primary (visibility + everyday paging)#
Telegraf `[[inputs.exec]]` on every cert host emits, mirroring the
existing SSH-cert plane's `step-cert-monitor.sh.j2`:

- `step_x509_cert,service=...,host=... seconds_to_expiry=Ni`
- `step_x509_renew_failed,service=...,unit=... value=0|1`
- `step_x509_cert_last_renewal_success,service=... seconds_ago=Ni`
Telegraf → InfluxDB → Grafana → Matrix room (page channel:
#ccat-ops:matrix.data.ccat.uni-koeln.de). Catches single-service
renewal failures, perms drift, image UID drift (TODO 7) — anything
that doesn’t take down InfluxDB or Grafana itself.
Tier 1 does transit step-ca-issued TLS once Phase E lands (Telegraf → InfluxDB will use the new server cert). This is acknowledged, not denied. It is the “convenience” tier; it is not load-bearing for the catastrophic case.
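A sketch of the `[[inputs.exec]]` emitter for the first measurement, in the spirit of `step-cert-monitor.sh.j2` (only the measurement, tag, and field names come from the list above; the function name and cert handling are illustrative):

```shell
#!/bin/sh
# Emits one InfluxDB line-protocol record per cert; Telegraf's
# [[inputs.exec]] collects stdout. Only seconds_to_expiry is shown.
emit_x509_expiry() {
  svc=$1; cert=$2
  end=$(openssl x509 -in "$cert" -noout -enddate | cut -d= -f2)  # "May  8 12:00:00 2026 GMT"
  not_after=$(date -d "$end" +%s)                                 # GNU date on RHEL
  printf 'step_x509_cert,service=%s,host=%s seconds_to_expiry=%di\n' \
    "$svc" "$(hostname)" "$(( not_after - $(date +%s) ))"
}
```

The trailing `i` marks the field as an InfluxDB integer, matching the `Ni` notation in the measurement list.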
Tier 2 — Backstop (TLS-independent, catches catastrophic failures)#
Two host-local mechanisms, both calling `mailx` to the existing
`admin_email_addresses` alias (already configured by
`ansible/roles/system_setup/tasks/sendmail.yml` — the root → admin alias
is in place via `/etc/aliases`):

- `OnFailure=` unit on every renewal systemd timer. Fires immediately when a renewal unit reports `failed`. Mail body includes hostname, service, unit name, last 20 lines of `journalctl -u <unit>`.
- Daily heartbeat cron at 06:00 UTC sends mail "all certs OK on $HOSTNAME, soonest-expiry=Nd, issuance-events-today=N" with one line per cert. Absence of mail for 36h on any host = problem, even if no specific failure was detected.
The mail path goes via the host MTA (sendmail) → uni-köln SMTP relay → admin inbox. This tier does not transit any step-ca-issued TLS cert. It is the only path that survives:
- HSM offline (no new certs issuable).
- InfluxDB down on input-b (Tier 1 metrics black-hole).
- Grafana down on input-b.
- Matrix homeserver down on input-b.
- Network partition between input-a/c and input-b.
The only break-conditions for Tier 2 are host network down or the external SMTP relay down — both known operational classes that are not silently coupled to step-ca.
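A sketch of the Tier-2 hook, assuming the `ccat-cert-mail@.service` naming that appears later in this document (the exact template is Phase A work; `%H` is systemd's hostname specifier, and mail to root reaches the admin alias via `/etc/aliases`):

```ini
# ccat-cert-mail@.service — illustrative template unit.
[Unit]
Description=Mail cert-renewal failure for %i

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'journalctl -u "%i" -n 20 --no-pager | mailx -s "[ccat-cert] renewal FAILED: %i on %H" root'

# In each renewal timer's service unit:
# [Unit]
# OnFailure=ccat-cert-mail@%i.service
```

Nothing in this path touches TLS, InfluxDB, Grafana, or Matrix; it is journalctl, the host MTA, and the SMTP relay only.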
Tier 3 — Issuance audit (anomaly detection)#
Every JWK-password-using step ca certificate invocation in the
role wraps its call site in a logger trap that writes a structured
journald line:
`ccat-step-issuance: host=$HOSTNAME service=$SVC ts=$ISO triggered_by=$USER`
promtail ships journald to Loki; a Grafana alert fires when issuance-events-per-week exceeds the expected baseline (production: ~6/year per service after Phase A; staging: ~12/year per service).
Mostly Tier-1 plumbing, but the daily heartbeat mail (Tier 2) also
includes issuance-events-today=N — so an attacker who silences
Loki and InfluxDB still has to silence the host MTA path to hide
issuance events.
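The Tier-3 call-site trap could be sketched as a wrapper that logs and then executes the issuance command (the function name is illustrative; the `logger -t ccat-step-issuance` tag and field layout come from the line format above):

```shell
#!/bin/sh
# Writes the structured journald line, echoes it for the caller's own log,
# then runs the wrapped command (e.g. the step ca certificate invocation).
audit_issuance() {
  svc=$1; shift
  line="ccat-step-issuance: host=$(hostname) service=$svc ts=$(date -u +%FT%TZ) triggered_by=${SUDO_USER:-${USER:-unknown}}"
  logger -t ccat-step-issuance "$line" 2>/dev/null || true  # journald -> promtail -> Loki
  echo "$line"
  "$@"                                                      # exit status of the wrapped command
}
```

Usage would look like `audit_issuance redis step ca certificate ...` so every JWK-password-using call site pays one line of overhead.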
How this maps to the seven acceptance clauses#
| # | Clause | Tier 1 | Tier 2 | Tier 3 |
|---|---|---|---|---|
| 1 | Substrate | Telegraf+Grafana+Matrix | cron+ | journald+Loki+Grafana |
| 2 | Page channel | Matrix | email to | (rolls up into 1+2) |
| 3 | On-call contract | deferred to a follow-up — see "Out-of-scope" below | | |
| 4 | No-step-ca-TLS statement | acknowledged: transits step-ca TLS | does not transit | partial (Loki not on step-ca today) + Tier-2 backstop |
| 5 | Renewal-failure alert | | systemd | n/a |
| 6 | Renewal-success heartbeat | | daily 06:00 mail; absence ≥ 36h = problem | n/a |
| 7 | Issuance audit log | n/a | "issuance-events-today=N" line in heartbeat mail | structured journald + Loki alert on >2σ above 30d baseline |
What this changes in the existing infrastructure#
- New Telegraf input-exec script templated by the `step_ca_vhost_cert` role, parallel to `roles/ssh_service_cert/templates/step-cert-monitor.sh.j2` — same pattern, x509 measurements instead of SSH ones. ~1 small PR.
- Renewal systemd timer template gains `OnFailure=ccat-cert-mail@%i.service` and a sibling `ccat-cert-mail@.service` unit that calls `mailx`. ~1 small PR.
- Daily heartbeat cron entry (`/etc/cron.d/ccat-cert-heartbeat`) templated per host from the cert-spec list. ~1 small PR.
- Issuance audit log is a one-line `logger -t ccat-step-issuance` wrapper around the issuance script and a Grafana/Loki alert rule. Folded into the role's issuance task.
Three small Phase A PRs, parallelisable with the role itself; none depends on the cert-issuance role being complete first.
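The heartbeat entry might be sketched as (the script path and subject wording are assumptions; only the 06:00 UTC cadence and the root-to-admin-alias mail convention come from the text):

```
# /etc/cron.d/ccat-cert-heartbeat — illustrative. Mail to root reaches the
# admin alias via /etc/aliases; absence of this mail for 36h is the signal.
0 6 * * * root /usr/local/bin/ccat-cert-heartbeat.sh 2>&1 | mailx -s "all certs OK on $(hostname)" root
```

The per-cert body lines (soonest-expiry, issuance-events-today) would come from the templated cert-spec list.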
Alternatives considered (why this won over the original options)#
Single-substrate, “use what exists and is independent” (original recommendation: SSH-plane piggyback). Rejected because the SSH-plane is not actually independent — it shares the same Telegraf → InfluxDB pipe. Single-substrate framing is the bug.
cron + mailx as the only path. Robust, but operationally thin: no silencing, no ack, no per-service severity, no dashboard. Acceptable as a backstop; not enough as the everyday path.
Pushgateway + Alertmanager over plain HTTP on a private docker network. Adds two new components for one alert class. Yagni until we have at least three substrates that would benefit from a unified alerting layer. Revisit when the alerting story is mature enough to consolidate.
Why a tiered design is the honest answer#
A single substrate convenient enough to be the daily path will share fate with something. The architect’s worry was real; the fix is not “find a magically-independent single path” (no such path exists at this scale of infrastructure) but “have a backstop that is deliberately inconvenient — mail to a mailing list — so it is actually independent”. Tier 2 is operationally annoying on purpose: mail is not a great paging UX, but it is a great backstop UX because it doesn’t transit any of the things we are trying to alert on.
Consequences#
- Three artifacts to maintain instead of one. Worth it for the load-bearing independence guarantee.
- `admin_email_addresses` is now load-bearing. Document the alias contents and the SMTP relay path in the on-call runbook (when one exists). Test the path during Phase A by deliberately failing a staging renewal and confirming the mail arrives.
- Grafana / InfluxDB / Matrix outage scenarios are now page-quiet on Tier 1 by design. Operators must internalise that "no Tier 1 alert" means "Tier 1 is up", not "all is well". The daily Tier 2 heartbeat is the positive-confirmation signal.
- Future consolidation (e.g., Alertmanager) replaces Tier 1 without disturbing Tier 2. Tier 2 is the architectural floor.
Open questions#
- Is `runtime_redis` (CONFIG SET) sufficient on Redis 7 with TLS-only listeners? The `tls-port` directive isn't reloadable via CONFIG SET in some Redis versions; verify on the version we ship. If not, `runtime_redis` degrades to a `restart_redis` adapter and Redis joins InfluxDB in the 30s-downtime club.
- mTLS asymmetry follow-up. Schedule a review at the next data-transfer credentials refactor. Don't block #95.
- Threat-model document link. The full leak-response runbook lives there; this ADR carries the headlines. Link when written.
Resolved#
- (2026-05-08) Does `step ca renew` succeed against an already-expired authenticating cert? Resolved by configuration inspection (validation runbook Check 4). `step ca provisioner add --allow-renewal-after-expiry` exists as a flag; `step-ca/provisioners-add.sh` does NOT pass it on `prod-services` or `staging-services`. The default is `false`. Therefore `step ca renew` on an expired cert is refused under the current CA config. Decision: keep the strict default (`allowRenewalAfterExpiry: false`, i.e. Option A). Threat-model trade-off:
  - Service-host snapshot leak (cert+key only): the JWK provisioner password is NOT on service hosts in steady state — vault-staged as a 0400 host tmpfile during issuance, unlinked in the `always:` block of the issuance play. Steady-state renewal uses cert-as-auth and needs no password. So a snapshot leak gives the attacker cert+key but not the password, and Option A's "expired = denied" semantic auto-bounds the leak at `notAfter` if the attacker fails to renew in time. Detection-then-host-rotation breaks the renewal chain.
  - Controller compromise (saiyajin / Jenkins-on-input-b): both options are equally lost. The vault key lives there.
  - Persistent service-host compromise spanning an issuance window: the attacker eventually grabs the 0400 tmpfile. Both options equally lost.
  - Operational cost of Option A: HSM offline beyond the 30d production budget (15d staging) requires a manual re-issuance ceremony — vault → 0400 tmpfile → run the issuance script. Same pattern as today's vhost cert and `ssh_service_cert/_per_container.yml`.
  - Option A's protection is contingent on detection. Therefore monitoring + canary become load-bearing (TODO 15 in the pre-implementation TODO list, plus expanded acceptance for the alert substrate in TODO 4).
- (2026-05-08) `update-ca-trust extract` atomicity. Resolved by validation runbook Check 3. `update-ca-trust` swaps the bundle via atomic rename on RHEL 10.1 (inode change verified). No partial-read window on the host filesystem. Downstream nuance: Linux single-file bind-mounts pin the source inode, so atomic rename on the host means containers see the old bundle until restart — tracked as TODO 14, not a blocker for this ADR.
- (2026-05-08) Trust-anchor compose layering. Resolved by validation runbook Check 6 + this ADR's "Decision: Compose layering" section. The new `docker-compose.trust.yml` is the single source of truth for service-needs-trust; layered into each applicable context via `CONTEXT_COMPOSE`. TODO 16 closed on this ADR section landing.
- (2026-05-08) Break-glass SSH access during HSM-down >24h. Resolved by static review of existing infrastructure rather than by adding a new artifact. The architect's concern presumed step-ca-issued user certs are the only operator auth path; `ansible/roles/system_setup/tasks/nitrokey_ssh.yml` applies per-operator FIDO2 hardware-key pubkeys to plain `authorized_keys` on every managed host (outside the `AuthorizedPrincipalsFile` cert path), and out-of-band hardware consoles cover hardware-level recovery. The Nitrokey path survives any step-ca outage by construction. TODO 5 dropped; Check 11 signed off as N/A. See "Operational notes" for the role-split rationale (Nitrokey for core admins, step-ca SSH certs for remote admins).
- (2026-05-08) x509 canary on `input-c.staging`. Resolved by this ADR's "Decision: x509 canary on `input-c.staging`" section. 24h cert from the `staging-services` JWK on a non-CA host (`input-c.staging`); 12h timer cadence; failure to renew within 18h of `notAfter` triggers Tier 1 + Tier 2 alerts. Adds a fourth `noop` reload adapter (general-purpose; the canary is its first user). Doubles as validation runbook Check 8 (page-path E2E). TODO 15 closed on this ADR section landing.
- (2026-05-08) Cert-spec schema and UID parameterisation. Resolved by this ADR's "Decision: Cert-spec schema — parameterised UIDs, no defaults baked in" section. UIDs are per-host facts (a `host_container_uids` dict) referenced from cert-specs via `| mandatory` so a missing fact fails the play loudly. The schema is the shared contract for the role, the TODO 7 runtime-drift script, and the TODO 4 alert substrate's service labels. TODO 3 closed on this ADR section landing.
- (2026-05-08) Alert substrate. Resolved by replacing the single-substrate framing (SSH-plane piggyback) with a tiered design — see "Decision: Alert substrate — tiered, with a TLS-independent backstop". The SSH-plane piggyback recommendation in an earlier draft of this ADR was based on a wrong premise (the SSH plane shares the same Telegraf → InfluxDB pipe and so shares fate with the database TLS chain it was supposed to monitor). The tiered fix: Tier 1 Telegraf+Grafana+Matrix for everyday paging, Tier 2 cron+`OnFailure`+`mailx` to `admin_email_addresses` as the load-bearing TLS-independent backstop, Tier 3 journald+Loki+Grafana for issuance-frequency anomaly detection. TODO 4 closed on this ADR section landing. The on-call hand-off contract clause is explicitly deferred to a follow-up — channels exist; the rotation contract is a team-structure decision for when the rotation exists.
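The cert-spec UID parameterisation resolved above might be sketched as plain Ansible vars (every key except `host_container_uids` and the `| mandatory` filter is illustrative; the UID values are the 2026-05-08 observations):

```yaml
# host_vars/input-b.staging.yml — per-host facts, never role defaults.
host_container_uids:
  redis: 999
  postgres: 999
  influxdb: 1000

# A cert-spec entry referencing the fact; a host missing the fact fails
# the play loudly via the `mandatory` filter rather than issuing a cert
# owned by the wrong UID.
cert_specs:
  - service: redis
    owner_uid: "{{ host_container_uids['redis'] | mandatory }}"
```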
Out-of-scope#
Things this PRD and ADR explicitly do not address. Each item is here because someone has asked or might reasonably ask, and the answer is “not in this rollout”:
- 2-week F→G dual-trust soak. Waived under the time-bound setup-mode argument in "Decision: Migration style". Revisit if production becomes populated before Phase G ships. Do not cite this ADR as precedent for skipping a soak on a populated production stack.
- Migrating Redis off mTLS to server-auth-only-with-password. Inertia, not principle (see "Decision: mTLS scope asymmetry"). Revisit at the next data-transfer or ops-db-api credentials refactor.
- CRL or OCSP infrastructure. Lifetime-as-revocation only (see "Decision: Revocation stance"). A regulatory ask for a CRL endpoint is a future ADR.
- `ops-db-api` inbound TLS (nginx-proxy → ops-db-api). Currently undecided (TODO 11). Once chosen, the answer goes into "Operational notes" if in-scope, remains here if explicitly out-of-scope, or moves to its own ADR.
- Cert-transparency / public-log integration. Step-ca is a private CA; not applicable.
- Baking the CCAT root into application images. Trust distribution decision: bind-mount + env var, not image rebuild.
- A unified CLI surface upfront (`ccat tls rotate`, `ccat tls status`). YAGNI; design after the role works.
- Renewal-job log retention beyond `journalctl`. Covered by the general logging / Loki policy, not this PRD.
- Backup-as-cert-recovery-path. Backup coverage of service-cert directories is not confirmed by ITCC (TODO 17). The role re-applying after a host reinstall is the recovery path; backups are best-effort defence in depth, not load-bearing.
- F→G soak in any future TLS migration on populated production. See "Decision: Migration style" → Time-bound. Future migrations on a populated stack must use a soak.
- On-call hand-off contract for the alert substrate (who acks, escalation timeout, expected MTTR). Channels exist (Tier 1: Matrix `#ccat-ops`, Tier 2: `admin_email_addresses` mail). The rotation/ack/MTTR contract is a team-structure decision deferred until a real on-call rotation exists. Track as a follow-up; not blocking PRD #95.
Operational notes#
This section consolidates the operational concerns surfaced through the per-decision sections above into one place for on-call. Each item points back to where the rationale lives.
- HSM offline budget. 30d production / 15d staging (HSM blast-radius decision). Beyond budget: manual re-issuance ceremony via the JWK provisioner password (vault → 0400 host tmpfile → unlink in an `always:` block of the issuance play). This is not auto-recovery — it requires an operator with vault access to run the issuance script.
- Renewal cadence. 12h timer per host, modelled on `step-ca/renew-vhost-cert.sh`. Most fires are no-ops because `step ca renew` only contacts the CA in the last 1/3 of cert lifetime. A misconfigured timer (or a `--force` storm during rollout) is a CA-DoS risk; throttle / serialise mass issuance during phase rollouts (TODO 6).
- Trust-anchor rotation requires container restart. Single-file bind-mounts pin the source inode (TODO 14). Any change to `/etc/pki/ca-trust/source/anchors/` followed by `update-ca-trust extract` REQUIRES a rolling restart of every container that bind-mounts the trust bundle. The host gets the new file atomically; running containers do not. Either accept this and document the restart in the rotation procedure, or move to a directory bind-mount (TODO 14 alternative).
- Postgres replica during rotation. Primary and replica must not renew simultaneously while replication is mid-write (TODO 10). The chosen ordering — primary-first with a wait gate, replica-first, or a coordination lock — is recorded under TODO 10 acceptance and migrates here once decided.
- Alert path independence — tiered substrate. Three paths (alert-substrate decision, TODO 4 closed). Tier 1 (Telegraf → InfluxDB → Grafana → Matrix `#ccat-ops`) is the everyday paging path and shares fate with input-b services. Tier 2 (systemd `OnFailure=` + daily 06:00 cron heartbeat → `mailx` to `admin_email_addresses`) is the load-bearing TLS-independent backstop — it does not transit any step-ca-issued cert. Tier 3 (journald → promtail → Loki → Grafana) is the issuance-anomaly audit. Operational rule for on-call: "no Tier 1 alert" means "Tier 1 is up", not "all is well"; the daily Tier 2 heartbeat mail is the positive-confirmation signal — absence ≥ 36h on any host is itself a problem.
- Container UIDs are per-host parameters. Cert-spec UIDs (TODO 3) are parameterised, not hardcoded; runtime UID drift is detected by the renewal script (TODO 7) by reading `/proc/1/status` from PID 1 inside each container (`docker exec ... id` defaults to root and is the wrong probe — runbook Check 5 captured this gotcha). Today's values, observed on `input-b.staging` 2026-05-08: Redis 999, Postgres 999, InfluxDB 1000. `influxdb:latest` is the only unpinned image in the stack — drift risk concentrates there.
- Backup is not the cert recovery path. Service-cert directories may not be in the central Commvault policy (TODO 17, ITCC ticket pending). Recovery on host reinstall is "re-run the `step_ca_vhost_cert` role". Document this in the role README.
- Break-glass SSH already provided by existing infrastructure. The architect's worry — "if step-ca is down >24h, every operator's SSH cert expires and nobody can SSH in to fix it" — assumed step-ca-issued user certs are the only operator auth path. They are not. `ansible/roles/system_setup/tasks/nitrokey_ssh.yml` applies per-operator FIDO2 hardware-key pubkeys (`roles/system_setup/files/pubkeys/<username>/*.pub`) directly to `authorized_keys` on every managed host, outside the `AuthorizedPrincipalsFile` cert path. Out-of-band hardware access (iDRAC / hypervisor console) provides the second tier for hardware-level recovery. The role split is: Nitrokey for core admins (physically present, hardware key in pocket), step-ca SSH certs for out-of-core / remote admins where shipping a hardware key is impractical. TODO 5 is dropped on this basis; Check 11 signed off as N/A.
- Compose layering is anchored in `docker-compose.trust.yml`. Single source of truth for the service-trust bind-mount matrix (compose-layering decision). The validation runbook Check 6 inventory is the input set; the per-service `has_trust: true` spot-check lands in the same Phase B PR as the trust file itself.
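The PID-1 UID probe from the notes above can be sketched in two parts, keeping the parsing separable from docker (function names are illustrative):

```shell
#!/bin/sh
# Parse the real UID (second field of the "Uid:" line) from a proc status file.
uid_from_status() {
  awk '/^Uid:/{print $2; exit}' "$1"
}

# Probe PID 1 inside a container. This avoids the `docker exec ... id` trap,
# which reports the exec user rather than the UID the entrypoint dropped to.
pid1_uid() {
  docker exec "$1" cat /proc/1/status | awk '/^Uid:/{print $2; exit}'
}
```

Comparing `pid1_uid <container>` against the cert-spec's `host_container_uids` value is the drift check.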
Consequences — overall#
What becomes easier:
- One trust root for the whole CCAT stack (SSH plane, vhost cert, three databases). Operators need to know one CA, one root file path, one renewal model.
- Retiring `roles/redis_certs/` and `redis/<variant>/certs/` removes a homegrown PKI with four parallel CAs that nobody outside this team can audit.
- Adding a fourth TLS-consuming datastore later is "add a reload-strategy adapter", not "build new PKI".
What becomes harder:
- The CCAT root CA / HSM is now load-bearing for more things. HSM ceremony cadence and HSM availability matter more than they did. The soft-offline budget gives us 30d production / 15d staging headroom, but the calculus is now "how long can the HSM be offline", not "how long can the redis-certs CA be offline" (which was effectively infinite, because that CA was a file on input-b's disk).
- The pluggable-adapter design means the role has three test surfaces, not one. Plan for that in the test plan.
New operational duties:
- Watch the tiered alert substrate for DB cert renewal failures — the Tier 1 Matrix page channel plus the Tier 2 daily heartbeat mail (decision-section: alert substrate).
- Maintain the schema entry for any new `vault_step_ca_prov_*` passwords (lines up with the existing vault schema work in `data-center-computer-setup/vars_application_schema.yml`).
- The `ccat redis-certs` CLI commands (currently in `ctl`) get superseded; plan a CLI surface for the new role (`ccat tls rotate <service>`, `ccat tls status`). Don't build it before the role works; YAGNI.
References#
Files verified to exist in the repo at the time of writing:
- `step-ca/issue-vhost-cert.sh` — one-shot issuance pattern (JWK provisioner password via `--password-file`, atomic `.new` install, docker exec reload).
- `step-ca/renew-vhost-cert.sh` — `step ca renew` cert-as-auth pattern, PRE_MTIME/POST_MTIME conditional reload, 12h timer cadence.
- `ansible/roles/ssh_service_cert/tasks/_per_container.yml` — password-staging-from-vault → 0400 host tmpfile → unlink convention, `community.docker.docker_container_exec` with stdin-only password delivery.
- `ansible/roles/ca_trust/` — RHEL system-anchor distribution for the CCAT root.
- `ansible/roles/redis_certs/` — homegrown PKI being retired by this ADR.
- `redis/{main,ccat,develop,develop-ccat}/certs/` — per-variant CAs being sunset.
- `grafana/provisioning/{production,staging}/datasources/influxdb-datasource.yaml` — current `tlsSkipVerify: true` lines, plain HTTP datasource URL.
- `docs/source/adr/0001-ca-per-vhost-cert-split.md` — prior ADR on the CA's own vhost cert; format and reasoning style mirrored here.
- PRD: ccatobs/system-integration#95 — defers the full decision tree to this document.