# ADR-0002 — Step-CA-issued TLS certificates for Redis, Postgres, and InfluxDB

**Status.** Proposed, 2026-05-08.

**Related.** ccatobs/system-integration#95 (PRD).

**Supersedes.** Nothing on disk; the homegrown PKI under `ansible/roles/redis_certs/` and `redis/<variant>/certs/` is what this retires.

## Context

Today three datastores in the stack use ad-hoc or absent TLS:

- **Redis** — homegrown CA per environment variant
  (`ansible/roles/redis_certs/`, four self-signed CAs under
  `redis/{main,ccat,develop,develop-ccat}/certs/`). mTLS is configured
  but the CA story is bespoke and cannot be audited as part of the rest
  of the CCAT trust chain.
- **Postgres** — TLS not enforced; client traffic in the clear or with
  trust-on-first-use.
- **InfluxDB** — fronted today on plain HTTP. The Grafana datasource at
  `grafana/provisioning/production/datasources/influxdb-datasource.yaml`
  literally points at `http://data.ccat.uni-koeln.de:8086` with
  `tlsSkipVerify: true`. The same pattern lives in
  `grafana/provisioning/staging/datasources/`.

The CCAT step-ca endpoint already issues:

- the SSH user-cert plane (`ansible/roles/ssh_service_cert/`),
- the public TLS cert for the CA's own vhost
  (`step-ca/issue-vhost-cert.sh`, `step-ca/renew-vhost-cert.sh`,
  `step_ca_vhost_cert.timer` every 12h).

PRD #95 proposes routing the three datastores onto the same
step-ca-issued path: a single root of trust, predictable lifetimes, and
the same operator muscle memory.

A senior-architect review of #95 blocked the PRD on this ADR existing.
The PRD names the headline decisions ("JWK", "hard cutover", "multi-SAN
no IP") but defers the reasoning to here. This ADR is that reasoning.

### Important PRD correction

The PRD claims it copies an existing `step_ca_vhost_cert` *Ansible role*
verbatim. **There is no such role on disk.** The prior art is:

- the **scripts** at `step-ca/issue-vhost-cert.sh` (one-shot issuance)
  and `step-ca/renew-vhost-cert.sh` (PRE_MTIME / `step ca renew` /
  POST_MTIME / conditional reload),
- the **per-container password-staging convention** in
  `ansible/roles/ssh_service_cert/tasks/_per_container.yml` (vault →
  0400 host tmpfile → unlink in an `always:` block; or stdin-only via
  `docker_container_exec` for the in-container case).

The implementation must build the Ansible role **from these patterns**,
not from a role that does not exist. Any reader of this ADR or the PRD
should not waste time grepping for `step_ca_vhost_cert/` under
`ansible/roles/`.
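
The password-staging convention named above (vault → 0400 host tmpfile → unlink in an `always:` block) can be illustrated outside Ansible as a context manager. This is a minimal sketch of the idea only, not the role's implementation; `staged_password` is a hypothetical helper name, and the real role would express the same steps as Ansible tasks.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def staged_password(secret: str, directory=None):
    """Write `secret` to a 0400 temp file, yield its path, always unlink it.

    Hypothetical helper mirroring the vault -> 0400 tmpfile -> unlink idea.
    """
    # mkstemp creates the file with mode 0600, owned by the current user.
    fd, path = tempfile.mkstemp(prefix="prov-pass-", dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(secret)
        os.chmod(path, 0o400)  # tighten to owner read-only before use
        yield path
    finally:
        # Equivalent of the `always:` block: runs on success and on failure.
        if os.path.exists(path):
            os.unlink(path)
```

The yielded path is what a one-shot issuance step would read the provisioner password from; once the `with` block exits, normally or through an exception, the file is gone.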

## Decision

Issue TLS certs for Redis, Postgres, and InfluxDB from the CCAT step-ca
via the **JWK** provisioner, using cert-as-auth (`step ca renew`) on a
12h timer cadence. Cut over hard, no dual-trust soak. Implement as a
single parameterised Ansible role (`step_ca_vhost_cert`) plus four
pluggable reload-strategy adapters (one per datastore, plus `noop`),
not as one deep module pretending all three services are the same shape.

Per-decision detail follows.

---

## Decision: Provisioner choice — JWK over ACME and X5C

### Context

step-ca offers three provisioner classes for non-interactive cert flows:
ACME (HTTP-01, DNS-01, TLS-ALPN-01), X5C (cert-presented-as-auth, but
chained to an external trust root), and JWK (password-or-key-protected
provisioner credentials).

### Decision

Use **JWK**, with `step ca renew` (cert-as-auth) for steady-state
renewal. Initial issuance presents the JWK provisioner password; every
renewal thereafter authorises with the cert's own private key, so the
provisioner password never has to live on the renewing host past the
one-shot issuance step.

Working precedent: `step-ca/renew-vhost-cert.sh` does exactly this for
the `ca.ccat.uni-koeln.de` vhost cert. `step ca renew`, gated with
`--expires-in`, only contacts the CA inside the renewal window (the
last 1/3 of lifetime in this design), so a 12h timer is benign: most
fires are no-ops.
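
The claim that most timer fires are no-ops can be checked with a small simulation. This is a sketch under stated assumptions: the renewal window is the last third of the lifetime, the first timer fire coincides with issuance, and the CA is reachable. The function names are illustrative, not part of any existing script.

```python
from datetime import datetime, timedelta


def renewal_due(not_before: datetime, not_after: datetime, now: datetime) -> bool:
    """True once `now` is inside the last third of the cert lifetime."""
    lifetime = not_after - not_before
    window_opens = not_before + (lifetime * 2) / 3
    return now >= window_opens


def fires_until_renewal(lifetime: timedelta, timer: timedelta) -> int:
    """Count timer fires from issuance up to and including the renewing one.

    Assumes the first fire coincides with issuance and the CA is reachable,
    so the first fire inside the window renews the cert.
    """
    issued = datetime(2026, 1, 1)  # arbitrary illustrative start
    expires = issued + lifetime
    now, fires = issued, 0
    while now < expires:
        fires += 1
        if renewal_due(issued, expires, now):
            return fires
        now += timer
    raise RuntimeError("cert expired before any timer fire hit the window")
```

Under these assumptions a 90d cert on a 12h timer sees 121 fires per cycle, and only the last one contacts the CA.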

### Alternatives considered

- **ACME HTTP-01.** Would require the CA reach the requesting service
  on port 80. In our topology Redis on input-b is firewalled off the
  public internet; Postgres on input-a is internal; InfluxDB has its
  own vhost path. Opening HTTP-01 challenge paths through the proxy
  for three more vhosts adds a brittle coupling between CA, proxy
  config, and ACME challenge timing — and is the operational class of
  problem that ADR-0001 already had to navigate to get the CA's own
  vhost cert working.
- **ACME DNS-01.** Would require the CA orchestrate DNS records in the
  Uni-Köln DNS zone. We do not control that zone programmatically; a
  manual record-flip per renewal is unacceptable on a 12h cadence.
- **ACME TLS-ALPN-01.** Same firewall constraint as HTTP-01, plus the
  Redis/Postgres/Influx daemons are not HTTP servers and cannot serve
  the challenge.
- **X5C.** Would require us to bootstrap a separate trust root just
  to authorise these provisioners, then maintain it. It does not
  solve a problem JWK doesn't already solve; it adds a parallel trust
  path we'd then have to monitor.

### Consequences

- The JWK provisioner password is in vault
  (`vault_step_ca_prov_*_password`) and only reaches the issuing host
  via Ansible's vault → 0400 tmpfile → unlink pattern from
  `roles/ssh_service_cert/tasks/_per_container.yml`. Steady-state
  renewals do not touch the password at all.
- All renewals share one well-trodden path (`step ca renew`) so an
  operator who has debugged the vhost cert renewal already knows how
  to debug a Redis cert renewal.
- An open question (see below): does `step ca renew` succeed against
  an already-expired auth cert? If not, an HSM outage that outlasts the
  soft-offline budget (the window between the renewal window opening
  and expiry) forces a fall-back to the JWK-password path.

---

## Decision: Migration style — hard cutover, no dual-trust soak

### Context

The architect's default recommendation for any TLS migration is a
two-week dual-trust soak (old CA + new CA both accepted, then flip). The
PRD instead proposes a hard cutover for all three services.

### Decision

**Hard cutover.** This is consistent with the existing
TLS-hard-cutover-policy ADR captured in project memory
(2026-05-07): step-ca trust + DB certs roll out via deploy-time restart,
not dual-trust.

### Why this stands here, even though architects would normally object

Production is currently in **setup mode**: there are no end users on
the operations DB, no live data streams flowing through the transfer
pipeline, no externally consumed Grafana dashboards depending on the
InfluxDB datasource. A two-week soak buys nothing because the
"availability we'd be protecting" doesn't exist yet. The cost of a soak
(double-config, more code paths, more places for a misconfigured client
to silently fall back to the old trust path) is real today; the benefit
is zero today.

### Time-bound — read this before reusing this precedent

The above is **only** true while production is unpopulated. Once the
operations DB carries real observation records, once the data-transfer
pipeline is moving live telescope data, once Grafana dashboards are
being watched by humans on call — the calculus flips. **Any future
similar migration on a populated production stack must use a soak.**
Do not point at this ADR as precedent for skipping a soak on a live
system. The precedent is "skip soak when there are no users", not
"skip soak in general".

### Alternatives considered

- **2-week F→G dual-trust soak.** Standard playbook. Rejected on the
  cost/benefit argument above, time-bound to the empty-production
  state.
- **Service-by-service phased cutover** (Redis first, then Postgres,
  then Influx). Rejected as not actually safer in the current state —
  each service still hard-cuts when its turn comes; the phasing only
  spreads operator attention thinner. We will sequence by readiness
  of the reload adapter (probably Postgres first because
  `pg_reload_conf()` is the cheapest), not by risk-mitigation.

### Consequences

- A failed cutover is a service outage on whichever datastore failed.
  Mitigation: rehearse on staging first; the staging environment uses
  the same step-ca and the same role.
- This ADR must be revisited (and likely rewritten) before the next
  TLS migration on a populated production stack. Add a checkbox to
  the production-readiness review.

---

## Decision: Renewal architecture — one role, four reload adapters

### Context

The PRD as drafted proposed a single Ansible module that takes a
`cert_spec` dict (name, SANs, lifetime, owner, mode, reload-command)
and handles Redis (mTLS + `redis-cli CONFIG SET`), Postgres (server-only
+ `pg_reload_conf()`), and InfluxDB (server-only + container restart)
through that one shape. The architect review pushed back: a single
dict that has to fork on `if redis else if postgres else if influx`
inside the module is a deep-module fiction — the fork is inherent to
the problem and pretending it isn't makes the module's interface lie.

### Decision

Build **one parameterised role** (`step_ca_vhost_cert`) that handles:

- issuance via JWK provisioner,
- on-disk cert layout, ownership, mode,
- the renewal timer/script (modelled on
  `step-ca/renew-vhost-cert.sh` with PRE_MTIME / POST_MTIME
  conditional reload),
- trust-anchor consumption from `roles/ca_trust/`.

…and expose a **pluggable reload-strategy interface** with four
adapter implementations:

| Adapter | Service | Reload mechanism | Downtime |
|---|---|---|---|
| `runtime_redis` | Redis | `redis-cli CONFIG SET tls-cert-file ...; CONFIG SET tls-key-file ...` | zero |
| `runtime_postgres` | Postgres | `SELECT pg_reload_conf();` (or `pg_ctl reload`) | zero |
| `restart_influx` | InfluxDB | `docker restart influxdb` | ~30s |
| `noop` | (canary or no-service-attached cert) | nothing — write files, exit 0 | n/a |

The role takes a `reload_strategy` parameter that selects one of
these four; the adapter's contract is "given a cert that was just
renewed, make the running service serve it" (or, for `noop`, "verify
the new files exist and exit"). Anything that doesn't fit one of
these adapters is an implementation surprise that deserves a new
adapter, not a special case inside the existing ones.

`noop` is the fourth adapter; it exists for certs that have no
service to reload — the x509 canary on `input-c.staging` (see
"Decision: x509 canary") is its first user. A future cert that
participates in the trust chain but is read by external tooling
rather than a running service (e.g., a public-facing inspection
endpoint) can also use it.
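
The adapter contract and the PRE_MTIME / POST_MTIME conditional reload can be sketched together. This is an illustration in Python rather than the shell or Ansible the role would actually use; the `renew` callable is a stand-in for the real renewal command, and the function names are hypothetical.

```python
import os
from typing import Callable, Dict

# Hypothetical adapter signature: takes the cert-spec dict, reloads the service.
ReloadAdapter = Callable[[dict], None]


def noop(spec: dict) -> None:
    """Verify the renewed files exist; there is no service to reload."""
    for key in ("cert_path", "key_path"):
        if not os.path.isfile(spec[key]):
            raise FileNotFoundError(spec[key])


def renew_and_reload(spec: dict, renew: Callable[[dict], None],
                     adapters: Dict[str, ReloadAdapter]) -> bool:
    """Run `renew`, then reload only if the cert file's mtime changed.

    Returns True when a reload was triggered, False for a no-op fire.
    """
    adapter = adapters[spec["reload_strategy"]]  # KeyError = unknown strategy
    pre_mtime = os.stat(spec["cert_path"]).st_mtime_ns
    renew(spec)  # stand-in for the real renewal command
    post_mtime = os.stat(spec["cert_path"]).st_mtime_ns
    if post_mtime == pre_mtime:
        return False  # nothing was renewed, so nothing is reloaded
    adapter(spec)
    return True
```

Looking the adapter up before renewing means an unknown `reload_strategy` fails before any file on disk is touched.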

### Why four adapters is honest deep-module design

Ousterhout's "deep module" guidance is *narrow interface, broad
implementation* — emphatically not "one interface that secretly does
four different things". The reload mechanism is genuinely different
across the four cases (CONFIG SET vs SQL function call vs container
restart vs no-op) and the operational consequences differ (zero vs
zero vs 30s downtime vs none). Forcing them into one cert-spec dict
makes the caller's mental model wrong: they think they have one
knob, they actually have four with different blast radii. The
pluggable adapter makes the asymmetry visible at the call site:

```yaml
- role: step_ca_vhost_cert
  vars:
    cert_spec: { ... }
    reload_strategy: restart_influx   # explicit: this one restarts
```

### Alternatives considered

- **One module, fork-on-service inside.** Rejected per above —
  hides the asymmetry from the caller.
- **Three independent roles** (`redis_step_cert`, `postgres_step_cert`,
  `influx_step_cert`). Rejected because the issuance + on-disk +
  renewal-timer machinery would be duplicated three ways. The whole
  *point* of the consolidation in #95 is to retire bespoke per-service
  PKI plumbing.
- **One module, reload-command as a literal shell string parameter.**
  Rejected because the contract for "reload after renewal" is more
  than one shell line: it includes idempotency (no reload on no-op
  renewal), error handling (a failed reload should *not* leave the
  cert file half-installed), and in the InfluxDB case a wait-for-
  healthy step. That logic belongs in named adapters, not in
  free-form shell.

### Consequences

- Adding a fourth datastore later (e.g. MinIO, Loki) is "write a
  fifth adapter", not "extend the cert-spec dict".
- The role's interface stays narrow (`cert_spec` + `reload_strategy`)
  while the implementation is honest about the four-way fork.
- Tests can target each adapter independently — important because
  the InfluxDB adapter is the only one with downtime semantics and
  needs different verification.

---

## Decision: Reload mechanisms (per service)

This is the per-service detail behind the table in the previous
section.

### Redis — `CONFIG SET`, zero downtime

Redis 6+ accepts runtime updates of `tls-cert-file` / `tls-key-file` /
`tls-ca-cert-file` via `CONFIG SET`. The connection pool isn't churned;
existing TLS sessions live out their natural deaths and new sessions
pick up the new material.

Failure mode to test: if `CONFIG SET` succeeds but the new files are
unreadable by the redis user (UID 999 in our containers), Redis logs
the error and keeps using the old in-memory cert. The renewal script
must verify post-CONFIG-SET that the active cert serial matches the
on-disk cert serial.
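
The post-`CONFIG SET` check can be sketched as two small pieces: the list of `CONFIG SET` commands for one cert-spec, and a serial comparison that tolerates formatting differences. This is a sketch only; how the served serial is obtained (for example, by piping `openssl s_client` output into `openssl x509 -noout -serial`) is left out, and the function names are hypothetical.

```python
def config_set_commands(spec: dict) -> list:
    """Build one CONFIG SET command per TLS file for a Redis cert-spec."""
    commands = [
        ["CONFIG", "SET", "tls-cert-file", spec["cert_path"]],
        ["CONFIG", "SET", "tls-key-file", spec["key_path"]],
    ]
    if spec.get("mtls"):
        # The CA file is only managed when the server validates client certs.
        commands.append(["CONFIG", "SET", "tls-ca-cert-file", spec["ca_path"]])
    return commands


def normalize_serial(raw: str) -> str:
    """Reduce a serial such as 'serial=0A:1B' or '0x0a1b' to bare lowercase hex."""
    text = raw.strip()
    if text.lower().startswith("serial="):
        text = text[len("serial="):]
    text = text.replace(":", "").lower()
    if text.startswith("0x"):
        text = text[2:]
    return text.lstrip("0") or "0"


def serials_match(on_disk: str, served: str) -> bool:
    """True when the on-disk and served serials are the same number."""
    return normalize_serial(on_disk) == normalize_serial(served)
```

A renewal script following this shape would treat `serials_match(...) is False` as a failed reload, even though every `CONFIG SET` call returned OK.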

### Postgres — `pg_reload_conf()`, zero downtime

`SELECT pg_reload_conf();` re-reads `postgresql.conf`, including
`ssl_cert_file` and `ssl_key_file`. Existing connections keep their
TLS context; new connections get the new cert. Same caveat as Redis:
verify the postmaster actually picked up the new cert; a typo in the
config path is a silent fallback.

### InfluxDB — `docker restart`, ~30s downtime

InfluxDB OSS does not have a runtime reload for TLS material. We accept
the restart. The 30s window is acceptable on the InfluxDB role: it
ingests metrics from telegraf, which buffers locally, and serves
Grafana dashboards, which retry. No write path depends on InfluxDB
being up second-by-second.

The restart adapter must:

- pre-flight that the new cert is syntactically valid (`openssl x509
  -noout -text`) before bouncing the container,
- `docker restart` (not `docker stop && docker start` — the former
  preserves the container's IP / aliases on the user-defined network),
- wait for `/health` to return 200 before declaring success.
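
The three obligations above compose into a simple gate. The sketch below injects the pre-flight check, the restart, and the health probe as callables so the ordering logic can be tested without Docker or InfluxDB; the function name, parameter names, and timeout values are assumptions, not taken from the PRD.

```python
import time
from typing import Callable


def restart_with_health_gate(
    cert_is_valid: Callable[[], bool],
    restart: Callable[[], None],
    is_healthy: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Pre-flight the cert, restart, then block until healthy or raise."""
    if not cert_is_valid():
        # Never bounce the container onto a cert that fails pre-flight.
        raise ValueError("new cert failed pre-flight; container not restarted")
    restart()
    deadline = clock() + timeout_s
    while True:
        if is_healthy():
            return
        if clock() >= deadline:
            raise TimeoutError("service did not become healthy in time")
        sleep(interval_s)
```

In a real adapter, `cert_is_valid` would wrap the `openssl x509 -noout -text` check, `restart` would wrap `docker restart`, and `is_healthy` would probe `/health` for a 200.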

---

## Decision: HSM blast radius / soft-offline budget

### Math

The CCAT root CA lives on an HSM. If the HSM is offline for any reason
(physical access loss, ceremony in progress, hardware fault), the CA
cannot issue or renew. Every cert lives until its `notAfter`; the
"soft-offline budget" is how long the HSM can be offline before
something starts hard-failing.

| Environment | Cert lifetime | `step ca renew` window opens at | Renewal cadence | Soft-offline budget |
|---|---|---|---|---|
| Production | 90d | day 60 (2/3 lifetime) | 12h timer = 60 fires before expiry | 30d / 60 fires |
| Staging (PRD draft) | 30d | day 20 | 12h timer = 20 fires before expiry | 10d / 20 fires |
| Staging (revised) | 45d | day 30 | 12h timer = 30 fires before expiry | 15d / 30 fires |
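
The arithmetic behind the table can be reproduced in a few lines. This sketch assumes, as the table does, that the renewal window is the last third of the lifetime and that the timer fires every 12h; `soft_offline_budget` is an illustrative name.

```python
def soft_offline_budget(lifetime_days: int, timer_hours: int = 12):
    """Return (window_opens_day, budget_days, timer_fires_in_budget).

    Assumes the renewal window is the last third of the lifetime and that
    `lifetime_days` is divisible by 3, as in the table above.
    """
    if lifetime_days % 3 != 0:
        raise ValueError("lifetime_days must be divisible by 3 for this sketch")
    budget_days = lifetime_days // 3
    window_opens_day = lifetime_days - budget_days
    fires = budget_days * 24 // timer_hours
    return window_opens_day, budget_days, fires


for label, days in (("production", 90), ("staging draft", 30), ("staging revised", 45)):
    opens, budget, fires = soft_offline_budget(days)
    print(f"{label}: window opens day {opens}, budget {budget}d / {fires} fires")
```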

### Decision

Production stays at 90d / 30d budget — comfortable headroom for an
HSM ceremony (typically 1-2 days) plus one weekend of bad luck.

**Architect-mandated change to the PRD:** staging at 30d / 10d budget
is too tight. A long weekend plus a sick on-call plus a stuck CI run
eats most of the budget. **Extend staging cert lifetime to 45d**
(budget 15d / 30 fires).

### Alternatives considered

- **Document the operational acceptance of 10d on staging.** Available
  if anyone has a strong reason for keeping cert lifetimes
  short-on-staging (often "make rotation visible in CI cadence").
  Rejected because staging exists to rehearse production failure
  modes, and a tighter-than-production budget makes staging a worse
  rehearsal, not a better one.
- **Match staging to production at 90d.** Rejected because we *do*
  want staging to exercise the renewal path more frequently than
  production; 45d gives us that without making the budget
  uncomfortable.

### Consequences

- One more variable to keep aligned across the three services on
  staging. The role's `cert_spec.lifetime` parameter handles this.
- The PRD's table needs a one-line edit; flag for the implementation
  PR.

### Open question to pin before implementation

**Does `step ca renew` succeed against an already-expired
authenticating cert?** If yes, the budget math above is
straightforwardly correct: lose the HSM for 30d, recover, every host
catches up on the next timer fire. If no, then once a host's cert
expires we drop back to the JWK-password path for that host, which
means the password file has to be ready to materialise on demand.

This is testable in staging with a deliberately back-dated cert. **Do
this test before merging the implementation.** Decision below assumes
the answer is "no" until confirmed; the role's renewal script will
fall back to JWK-password issuance if `step ca renew` fails for an
expired-cert reason.

---

## Decision: SAN policy

### Decision

Each cert carries multiple DNS SANs:

- the docker-network alias the service is reached at (e.g. `redis`,
  `postgres`, `influxdb`),
- the public FQDN (e.g. `redis.ccat.uni-koeln.de`),
- the host FQDN (e.g. `input-b.ccat.uni-koeln.de`).

**No wildcards. No IP SANs.**

### Reasoning

- **No wildcards:** a leaked `*.ccat.uni-koeln.de` cert grants the
  attacker every vhost we've ever named under that domain. Multi-SAN
  per cert keeps the leak blast radius to "this one service".
- **No IP SANs:** IP SANs tie the cert to a specific deployment
  topology. Move the service to a different host and the cert
  silently mis-matches. DNS-only SANs decouple identity from
  placement; renumbering the IP plan stays a DNS-only operation.
  The redis_certs precedent included an IP SAN
  (`redis-certs_staging.conf` lists `IP:134.95.40.103`); we are
  retiring that.
- **Multi-SAN per cert** instead of "one cert per SAN": one renewal
  path per service, one cert file in one place. The reload adapters
  don't have to juggle three cert files for the same daemon.
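
The policy is mechanical enough to enforce before issuance. The following is a minimal sketch of such a check, not an existing part of the role; `san_policy_violations` is a hypothetical name.

```python
import ipaddress


def san_policy_violations(sans) -> list:
    """Return human-readable policy violations; an empty list means compliant."""
    problems = []
    if not sans:
        problems.append("at least one DNS SAN is required")
    for san in sans:
        if "*" in san:
            problems.append(f"wildcard SAN not allowed: {san}")
            continue
        # Accept both bare literals and the OpenSSL-style 'IP:' prefix.
        candidate = san[3:] if san.upper().startswith("IP:") else san
        try:
            ipaddress.ip_address(candidate)
        except ValueError:
            continue  # not an IP literal, so it is treated as a DNS name
        problems.append(f"IP SAN not allowed: {san}")
    return problems
```

Run against the retired precedent, `san_policy_violations(["IP:134.95.40.103"])` reports one IP SAN violation, while the three DNS names listed in the decision pass cleanly.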

### Consequences

- Adding a new alias to a service is a re-issuance, not a config edit.
  Acceptable because aliases change rarely and the role makes
  re-issuance trivial.
- The cert will list multiple SANs under `Subject Alternative Name` in
  `openssl x509 -noout -text` — do not treat this as a misconfiguration
  in inspection scripts.

---

## Decision: mTLS scope asymmetry

### Decision

- **Redis: keep mTLS.** Both server and client present certs.
- **Postgres: server-auth-only.** Server presents a cert; client
  authenticates with username + password as today.
- **InfluxDB: server-auth-only.** Server presents a cert; client
  authenticates with API token as today.

### Reasoning — and being honest about it

Redis stays mTLS because it's already mTLS today (homegrown
`redis_certs` role) and because the application clients (data-transfer
workers, ops-db-api, etc.) already know how to present client certs.
Migrating Redis off mTLS at the same time as moving its trust root is
two changes at once. We are not doing two changes at once.

This is **inertia, not principle.** A clean-sheet design might well
land all three on server-auth-only-with-password/token; mTLS for Redis
buys us a marginal extra layer (compromise of the Redis password isn't
enough; you'd also need the client cert) but at the cost of
distributing client material to every Redis-using service.

**Revisit:** when data-transfer or ops-db-api next has a credentials
refactor, evaluate whether Redis mTLS is still pulling its weight or
whether server-auth-only-with-password is enough. Track this as a
follow-up; do not block #95 on resolving it.

### Consequences

- The `runtime_redis` reload adapter has to manage three files
  (`tls-cert-file`, `tls-key-file`, `tls-ca-cert-file`); the CA file
  is what lets the server validate client certs. The
  `runtime_postgres` and `restart_influx` adapters manage two files
  (cert + key only).
- Client-side trust distribution is asymmetric: Redis clients need
  *both* the CCAT root (to validate the server) *and* a client
  cert+key (to be validated by the server). Postgres/Influx clients
  only need the CCAT root. The `ca_trust` role already drops the root
  at `/etc/pki/ca-trust/source/anchors/ccat-root-ca.crt`; client cert
  distribution stays where it is (per-service, today via redis_certs)
  for now.

---

## Decision: Cert-spec schema — parameterised UIDs, no defaults baked in

### Context

The PRD as drafted hardcoded the container UIDs (Redis 999,
Postgres 999, InfluxDB 1000) as constants inside the role. The
architect review pushed back: upstream image rebases historically
shift UIDs without major-version bumps, so a baked-in constant is a
silent foot-gun. Validation runbook Check 5 (2026-05-08) confirmed
the values on `input-b.staging` are 999/999/1000 today, but it also
confirmed `influxdb:latest` is the only unpinned image in scope —
exactly the drift candidate.

The role needs a schema that (a) takes UID as a per-cert parameter
with no role-level default, (b) sources the value from a per-host
fact so different hosts can have different UIDs without code changes,
(c) is the same schema the runtime drift-detection step (TODO 7)
reads at renewal time.

This section follows the same shape as the SSH-cert plane's
`ansible/roles/ssh_service_cert/defaults/main.yml` schema — same
pattern of "list of cert-spec dicts in `host_vars`, role is a no-op
when the list is empty".

### Decision

The role (working name `step_ca_vhost_cert`, modelled on
`step-ca/issue-vhost-cert.sh` + `step-ca/renew-vhost-cert.sh` plus
`roles/ssh_service_cert/`) takes a list of cert-spec dicts called
`service_tls_certs`. Per-host enable lives in
`ansible/host_vars/<host>/vars_step_ca_vhost_cert.yml`. The role
defaults file declares `service_tls_certs: []` so the role is a
no-op on hosts where the list is undefined.

```yaml
# ansible/host_vars/input-b.staging/vars_step_ca_vhost_cert.yml
host_container_uids:
  postgres: 999
  redis:    999
  influxdb: 1000

service_tls_certs:
  # ────────────────────────────── postgres ──────────────────────────────
  - service: postgres-main
    sans:
      - postgres
      - postgres.staging.data.ccat.uni-koeln.de
      - input-b.staging.data.ccat.uni-koeln.de
    cert_path: /etc/postgres-certs/server.crt
    key_path:  /etc/postgres-certs/server.key
    owner_uid: "{{ host_container_uids.postgres | mandatory }}"
    owner_gid: "{{ host_container_uids.postgres | mandatory }}"
    cert_mode: "0644"
    key_mode:  "0600"
    lifetime:  "{{ step_ca_x509_cert_lifetime }}"
    reload_strategy: runtime_postgres
    container: system-integration-postgres-1
    provisioner:    "{{ step_ca_x509_provisioner }}"
    vault_var_name: vault_step_ca_prov_staging_services_password
    mtls: false

  # ──────────────────────────────── redis ───────────────────────────────
  - service: redis-main
    sans:
      - redis
      - redis.staging.data.ccat.uni-koeln.de
      - input-b.staging.data.ccat.uni-koeln.de
    cert_path: /opt/redis-certs/staging/server.crt
    key_path:  /opt/redis-certs/staging/server.key
    ca_path:   /opt/redis-certs/staging/ca.crt    # mtls=true only
    owner_uid: "{{ host_container_uids.redis | mandatory }}"
    owner_gid: "{{ host_container_uids.redis | mandatory }}"
    cert_mode: "0644"
    key_mode:  "0600"
    lifetime:  "{{ step_ca_x509_cert_lifetime }}"
    reload_strategy: runtime_redis
    container: system-integration-redis-1
    provisioner:    "{{ step_ca_x509_provisioner }}"
    vault_var_name: vault_step_ca_prov_staging_services_password
    mtls: true                                    # mTLS scope asymmetry

  # ────────────────────────────── influxdb ──────────────────────────────
  - service: influxdb-main
    sans:
      - influxdb
      - influxdb.staging.data.ccat.uni-koeln.de
      - input-b.staging.data.ccat.uni-koeln.de
    cert_path: /etc/influxdb-certs/server.crt
    key_path:  /etc/influxdb-certs/server.key
    owner_uid: "{{ host_container_uids.influxdb | mandatory }}"
    owner_gid: "{{ host_container_uids.influxdb | mandatory }}"
    cert_mode: "0644"
    key_mode:  "0600"
    lifetime:  "{{ step_ca_x509_cert_lifetime }}"
    reload_strategy: restart_influx
    container: system-integration-influxdb-1
    provisioner:    "{{ step_ca_x509_provisioner }}"
    vault_var_name: vault_step_ca_prov_staging_services_password
    mtls: false
```

### Field reference

| Field | Required | Description |
|---|---|---|
| `service` | ✓ | Canonical service name. Used in metric labels (Tier 1+3 of the alert substrate), filename suffixes, and journald audit-log lines. Format: `<service>-<variant>` (e.g. `redis-main`, `redis-ccat`, `postgres-main`). |
| `sans` | ✓ | DNS SAN list per the SAN-policy decision (no IP, no wildcard, multi-SAN). At minimum: docker-network alias + public FQDN + host FQDN. |
| `cert_path`, `key_path` | ✓ | On-host filesystem paths. The cert is bind-mounted into the container by the per-machine compose file. |
| `ca_path` | mtls=true only | Path to the CA bundle the server will use to validate client certs. Set only when `mtls: true`. |
| `owner_uid`, `owner_gid` | ✓ | **No role-level default.** Must be sourced from `host_container_uids.<service>` (or another per-host fact). The `\| mandatory` filter forces an explicit failure if the host fact is missing — fail-loud on misconfiguration is the desired property. |
| `cert_mode`, `key_mode` | ✓ | Typically `0644` and `0600`. Postgres rejects keys with group/world access (validation runbook Check 9 pinned the FATAL phrasing); the role asserts these post-issuance and aborts the play before reload if they drift. |
| `lifetime` | ✓ | Sourced from a role-level default (`step_ca_x509_cert_lifetime`) which the per-environment vars file overrides — `90d` for production, `45d` for staging (per HSM blast-radius decision). |
| `reload_strategy` | ✓ | One of `runtime_redis`, `runtime_postgres`, `restart_influx`, `noop` (per renewal-architecture decision). Adding a new datastore = adding a new adapter, not extending the cert-spec. |
| `container` | ✓ | **Compose-namespaced container name** (e.g. `system-integration-postgres-1`), not the bare service name (runbook Check 5 captured this gotcha). The reload adapter and the runtime UID-drift probe (TODO 7) both `docker exec` into this container. Set to `""` for host-only certs that use the `noop` adapter, such as the x509 canary. |
| `provisioner` | ✓ | Name of the JWK provisioner on step-ca that issues this cert. Sourced from a role-level default (`step_ca_x509_provisioner`) — `staging-services` or `prod-services`. |
| `vault_var_name` | ✓ | Name of the Ansible vault variable holding the provisioner password. Used at issuance only; renewals are cert-as-auth (`step ca renew`) and never touch the password. Same vault → 0400 host tmpfile → unlink convention as `roles/ssh_service_cert/tasks/_per_container.yml`. |
| `mtls` | ✓ | `true` for Redis (server validates client certs); `false` for Postgres and InfluxDB (server-auth-only) per the mTLS scope asymmetry decision. Controls whether `ca_path` is written and whether the reload adapter manages two files (cert+key) or three (cert+key+ca). |
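
The field reference translates directly into a pre-issuance check. The sketch below is one possible shape for it, assuming the cert-spec has already been templated into a plain dict; the function name and error strings are illustrative, and the key-mode rule is deliberately strict (any group or world bit is rejected).

```python
REQUIRED_FIELDS = (
    "service", "sans", "cert_path", "key_path", "owner_uid", "owner_gid",
    "cert_mode", "key_mode", "lifetime", "reload_strategy", "container",
    "provisioner", "vault_var_name", "mtls",
)
RELOAD_STRATEGIES = {"runtime_redis", "runtime_postgres", "restart_influx", "noop"}


def cert_spec_errors(spec: dict) -> list:
    """Return a list of schema problems; an empty list means the spec is usable."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in spec]
    if errors:
        return errors  # later checks need every required field present
    if spec["reload_strategy"] not in RELOAD_STRATEGIES:
        errors.append(f"unknown reload_strategy: {spec['reload_strategy']}")
    if spec["mtls"] and not spec.get("ca_path"):
        errors.append("mtls is true but ca_path is not set")
    if not spec["mtls"] and spec.get("ca_path"):
        errors.append("ca_path is set but mtls is false")
    if int(str(spec["key_mode"]), 8) & 0o077:
        errors.append("key_mode grants group/world access")
    return errors
```

Failing on the first missing field, rather than defaulting it, matches the fail-loud intent of the `| mandatory` filter on the UID fields.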

### Per-host UID fact

`host_container_uids` is a separate dict in the same `host_vars`
file. Two reasons:

1. **Reuse for TODO 7 runtime drift detection.** The renewal
   script reads `host_container_uids.<service>` and compares to
   `docker exec <container> cat /proc/1/status` (PID 1's effective
   UID — `docker exec ... id` defaults to root and is the wrong
   probe; runbook Check 5 captured this gotcha). Mismatch =
   non-zero exit + Tier-1 + Tier-2 alert.
2. **Single source of truth per host.** A future change that
   pins `influxdb:2.7-rootless` (UID 1001) is a one-line edit in
   `host_container_uids` rather than a hunt across multiple
   cert-spec entries.
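
The drift probe in point 1 boils down to parsing one line of `/proc/1/status`. The sketch below shows that parsing step only, assuming the file contents have already been captured from the container; the function names are hypothetical. On Linux the `Uid:` line lists the real, effective, saved-set, and filesystem UIDs in that order.

```python
def effective_uid_from_status(status_text: str) -> int:
    """Extract PID 1's effective UID from the text of /proc/1/status."""
    for line in status_text.splitlines():
        if line.startswith("Uid:"):
            fields = line.split()[1:]  # real, effective, saved-set, filesystem
            return int(fields[1])
    raise ValueError("no Uid: line found in status text")


def uid_drifted(expected_uid: int, status_text: str) -> bool:
    """True when the running container's UID differs from the host fact."""
    return effective_uid_from_status(status_text) != expected_uid
```

A renewal script using this shape would compare against `host_container_uids.<service>` and exit non-zero when `uid_drifted` returns `True`.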

### Acceptance against TODO 3 clauses

1. ✓ The role takes `owner_uid` as a per-cert-spec parameter — no
   defaults baked into the role. Defaults file sets
   `service_tls_certs: []`; cert-specs supply UIDs explicitly.
2. ✓ Per-host vars under
   `ansible/host_vars/<host>/vars_step_ca_vhost_cert.yml` set the
   UIDs via `host_container_uids` and reference them in cert-specs.
3. ✓ Actual UIDs as observed via `docker exec <container> cat
   /proc/1/status` are recorded in validation runbook Check 5
   (2026-05-08, `input-b.staging`): postgres=999, redis=999,
   influxdb=1000. Cross-referenced from this section.
4. → TODO 7 (runtime drift detection): the same `host_container_uids`
   dict is the source of truth at renewal time.

### Alternatives considered

- **One flat dict per cert-spec mixing UID with everything else.**
  Rejected: makes UID drift harder to reason about, and the
  runtime drift script would have to walk the cert-spec list to
  find the value rather than reading the UID dict directly.
- **Role-level UID defaults** (e.g., `default_postgres_uid: 999`
  in the role's `defaults/main.yml`). Rejected: defeats TODO 3's
  purpose. A future operator adding a host with non-default UIDs
  has to remember to override the default. Better to fail-loud
  than to silently use a wrong default.
- **Per-environment cert-spec lists in `group_vars/<env>/`**
  rather than per-host. Rejected because UIDs are a host-level
  fact (different hosts can run different image variants); SANs
  and paths are also host-level. Putting them in `group_vars`
  would force every host in the group to have identical UIDs,
  which is exactly the drift-foot-gun we are avoiding.

### Consequences

- **The schema is the contract** between the role and the
  TODO 7 runtime-drift script and the TODO 4 alert substrate
  scripts. Field renames are role-version-bump events.
- **Cert-spec count grows linearly with services × variants.**
  Today: 3 services × 2 environments × {main, ccat} variant where
  applicable = ~6-9 cert-specs across all hosts. Manageable.
- **`influxdb:latest` UID drift risk** stays open (no version
  pin), but is now a one-line `host_container_uids.influxdb`
  edit if it shifts. TODO 7's runtime probe catches the shift
  at the next renewal fire and refuses to write the new key —
  fail-closed before the reload would brick InfluxDB.

---

## Decision: x509 canary on `input-c.staging` — leading-indicator for the cert plane

### Context

Option A on `allowRenewalAfterExpiry` (Resolved, 2026-05-08) makes
the protection contingent on detection: a cert+key snapshot leak
auto-bounds at `notAfter` only if the operator notices the renewal
chain has been broken before then. The SSH-cert plane already runs
24h user certs that act as an HSM/CA-health canary for the SSH
side; the x509 plane has no equivalent today.

Service certs are 90d (production) / 45d (staging) and only renew
in the last 1/3 of lifetime, so a stuck renewal gives the alert
substrate days-of-warning *if it works*. A 24h x509 canary fails
within hours of any HSM/CA breakage on the x509 plane — long before
any production cert is at risk. It is the *leading indicator* that
proves the alert path is alive, and the smoke test for the JWK
provisioner cert-as-auth flow specifically.

### Decision

Issue a 24h-lifetime x509 cert from the `staging-services` JWK
provisioner to a non-prod host. **Target host: `input-c.staging`** —
deliberately not `input-b.staging` so the canary does not share fate
with the CA host itself.

Cert-spec entry (lives in
`ansible/host_vars/input-c.staging/vars_step_ca_vhost_cert.yml`):

```yaml
host_container_uids: {}   # canary has no container; no UID needed

service_tls_certs:
  - service: x509-canary
    sans:
      - x509-canary.input-c.staging.data.ccat.uni-koeln.de
      - input-c.staging.data.ccat.uni-koeln.de
    cert_path: /opt/x509-canary/canary.crt
    key_path:  /opt/x509-canary/canary.key
    owner_uid: 0
    owner_gid: 0
    cert_mode: "0644"
    key_mode:  "0600"
    lifetime:  "24h"
    reload_strategy: noop    # fourth adapter, see Renewal architecture decision
    container: ""                   # no container; canary is host-only
    provisioner: staging-services
    vault_var_name: vault_step_ca_prov_staging_services_password
    mtls: false
```

This relies on a **fourth reload adapter, `noop`**, listed in the
Renewal architecture decision: "renew the cert, write the new
files, do nothing else". No reload — nothing reads the canary at
runtime. The cert exists only for its own lifecycle metrics. `noop`
is general-purpose (the canary is its first user, but a future cert
that doesn't need a service reload — e.g., a dual-purpose cert
inspected by external tooling — can use it too) and is exempt from
the operational-consequence asymmetry argument because there is no
service to reload.

### Renewal cadence and failure semantics

- **Cert lifetime:** 24h.
- **Timer cadence:** 12h (matches production cert plan, so the
  canary exercises the same code path as production renewals).
- **`step ca renew --expires-in` threshold:** 18h. Below that
  threshold a renew attempt actually contacts the CA; above it the
  timer fire is a no-op (same gate the production timers will use,
  just with smaller numbers).
- **Failure threshold for paging:** failure to successfully renew
  within 18h of `notAfter` = Tier 1 + Tier 2 alert. The 6h gap
  between the renewal threshold and the page threshold gives one
  natural retry without paging.

If the canary cert actually expires (no successful renewal in the 24h
since issuance), the alert substrate is itself broken: Tier 2 mail is
the canary on the canary.
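The interaction between the 12h timer and the 18h `--expires-in` gate can be sketched as a small simulation. This is an illustration under stated assumptions: the gate is modelled as "remaining lifetime at or below the threshold", any jitter applied by the CLI is ignored, and `offset_h` (how long after issuance the first timer fire lands) is a hypothetical parameter.

```python
def should_renew(remaining_h: float, threshold_h: float = 18.0) -> bool:
    """Model of the gate: only act once the cert is within the threshold of expiry."""
    return remaining_h <= threshold_h


def first_renewing_fire(lifetime_h: float = 24.0, cadence_h: float = 12.0,
                        threshold_h: float = 18.0, offset_h: float = 0.0):
    """Hours after issuance of the first timer fire that would contact the CA."""
    t = offset_h
    while t < lifetime_h:
        if should_renew(lifetime_h - t, threshold_h):
            return t
        t += cadence_h
    return None  # the cert would expire before any fire passed the gate


print(first_renewing_fire())              # timer aligned with issuance
print(first_renewing_fire(offset_h=5.0))  # timer fires 5h after issuance
```

Running it with different offsets shows how much slack remains after the first gated fire, which is the quantity the page threshold has to be chosen against.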

### Acceptance against TODO 15 clauses

1. ✓ 24h-lifetime x509 cert from `staging-services` JWK on a
   non-prod, non-CA host (`input-c.staging`).
2. ✓ Renewal timer fires at 12h cadence; failure to renew within
   18h of `notAfter` triggers a Tier 1 + Tier 2 page on the
   substrate from TODO 4 (alert substrate decision).
3. ✓ `step_x509_cert{service=x509-canary} seconds_to_expiry` and
   `step_x509_cert_last_renewal_success{service=x509-canary}` are
   wired into the alert substrate **as the first metrics** —
   end-to-end verification (Tier 1 alert visible in Grafana, Tier 2
   mail actually delivered) happens *before* any production cert is
   enrolled. This is also Check 8 (page-path E2E) in the validation
   runbook.
4. ✓ The cert-spec entry above is the canary configuration; the
   `noop` adapter is the implementation. Single artifact
   for both purposes.

### Why a non-CA host

The canary is supposed to fail fast when the HSM is unreachable. If
it lives on `input-b.staging` (which hosts the CA), an `input-b`
outage takes down the CA *and* the canary together — the canary's
failure is then ambiguous between "CA is down" and "input-b is down
and the CA might be fine". Hosting the canary on `input-c.staging`
removes that ambiguity: a canary failure with `input-c.staging` up
means the CA is unreachable from a peer host, which is exactly the
condition the canary exists to detect.

### Consequences

- **Phase A scope adds the `noop` adapter** (fourth
  adapter; trivial — write files, exit 0, emit metrics). Phase A
  scaffolding gains one cert-spec on `input-c.staging`.
- **The canary is the validation-runbook Check 8 target.** Check 8
  is currently BLOCKED on TODO 15 (and on Phase A producing the
  role). Closing TODO 15 design unblocks Check 8 once Phase A
  lands.
- **A new operational duty:** if the canary alerts but no
  production cert has alerted, the operator's first move is "is
  the CA reachable from `input-c.staging`?" — `step ca health`,
  `nc -zv ca.ccat.uni-koeln.de 443` from input-c. Document in
  the on-call runbook (when one exists).

---

## Decision: Revocation stance — lifetime-as-revocation, no CRL/OCSP

### Decision

We do not stand up a CRL or OCSP responder. Compromised certs are
handled by **rotating the secret material and waiting for the cert to
expire** (90d production, 45d staging). For acute compromises, the
runbook below is the response.

### Trade-offs

- **CRL.** Operationally simple to publish, but every client has to
  fetch and trust it. Adding a fetch-and-trust step to telegraf,
  Grafana, three Celery worker fleets, and ops-db-api is real work
  for a threat model where we can already roll the underlying secret.
- **OCSP.** Real-time but adds a hard dependency on the CA being
  reachable from every TLS handshake. We just spent ADR-0001
  (`docs/source/adr/0001-ca-per-vhost-cert-split.md`) carefully
  containing the CA's reachability surface; OCSP would re-expand it.
- **Lifetime-as-revocation.** The 90d ceiling means a compromised
  cert is automatically not-trusted within 90d without operator
  action. For acute compromise we roll the secret immediately; the
  cert remains technically valid until it expires but the secret it
  protected is already changed.

### Compromise modes — runbook headlines

Full runbook: see the threat-model document (TODO: link when
written).

| Mode | Headline response |
|---|---|
| **Server key leaked** (Redis/Postgres/Influx host private key on disk readable by attacker) | Re-issue the cert with the role (`ccat <something> rotate <service>`), reload via the adapter. Old cert remains valid until `notAfter` but no longer protects anything. |
| **Client key leaked** (Redis client cert on a compromised app host) | Rotate the client cert via the redis_certs successor flow. Same lifetime caveat. |
| **HSM key leaked** (root CA private key compromised) | Stop the CA; cut a new root via ceremony; redistribute via `ca_trust` role; re-issue every leaf cert. This is the catastrophic case and is what `step-ca/ceremony-playbook.pdf` exists for. |

### Consequences

- Operators need to internalise "rolling the secret + waiting for
  expiry" as the revocation primitive. This is documented at the
  runbook level, not on every `ccat` CLI invocation.
- A future regulatory audit that asks for "CRL endpoint" gets the
  answer "no CRL; lifetime ceiling and operator-led rotation". Be
  prepared to defend that.

---

## Decision: Trust distribution — bind-mount + env vars, not image rebuild

### Decision

The CCAT root CA is distributed to containers via a **bind-mount** of
`/etc/pki/ca-trust/source/anchors/ccat-root-ca.crt` (placed there by
`roles/ca_trust/`) and an env var pointing each application's TLS
library at it (e.g. `SSL_CERT_FILE`, `PGSSLROOTCERT`).

We do **not** bake the CA root into the application container images.

### Reasoning

Baking the root into the image couples root-rotation cadence to CI
build cadence: every root rotation triggers a rebuild and redeploy of
every image. With bind-mount + env var, root rotation is "update one
file on disk via the `ca_trust` role, restart consumers" — independent
of CI.

This is the same separation already in effect for the SSH cert plane
(`roles/ssh_service_cert/` mounts `~/.ssh` from the host into spawned
agents, see commit `ce87baa`).

### Consequences

- Container images stay smaller and rebuild less often.
- The `ca_trust` role is now a hard dependency for every host that
  hosts a TLS-consuming container. This is already true today.
- A misconfigured bind-mount path silently turns into "no CA root" at
  the container level. The role must verify post-mount that the
  expected fingerprint is present.
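One way the post-mount fingerprint check could look, as a minimal sketch. It assumes the mounted file holds exactly one PEM certificate and that the expected SHA-256 fingerprint is supplied out of band; `pem_fingerprint` and `anchor_present` are hypothetical names, not existing role code.

```python
import hashlib
import ssl


def pem_fingerprint(pem_text: str) -> str:
    """SHA-256 fingerprint (lowercase hex) of the DER bytes of one PEM certificate."""
    der = ssl.PEM_cert_to_DER_cert(pem_text)
    return hashlib.sha256(der).hexdigest()


def anchor_present(pem_text: str, expected_sha256: str) -> bool:
    """Compare against an expected fingerprint, tolerating colons and upper case."""
    normalised = expected_sha256.replace(":", "").lower()
    return pem_fingerprint(pem_text) == normalised
```

A caller would read the file from inside the container's view of the mount and fail the play when `anchor_present` returns `False`, so an empty or wrong mount cannot pass silently.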

---

## Decision: Compose layering — trust anchor in shared file, layered per context

### Context

Validation runbook Check 6 (2026-05-08) surfaced a structural fact while
inventorying the per-machine compose files: every staging-input,
prod-input, and chile context deploys with a **single self-contained
per-machine compose file**. There is no shared `docker-compose.yml` base
in the layering for those contexts. From `src/ccat_dc/_constants.py`:

```python
CONTEXT_COMPOSE: dict[str, list[str]] = {
    ...
    "staging-input-a":   ["docker-compose.staging.input-a.yml"],
    "staging-input-b":   ["docker-compose.staging.input-b.yml"],
    "staging-input-c":   ["docker-compose.staging.input-c.yml"],
    "prod-input-a":      ["docker-compose.production.input-a.yml"],
    "prod-input-b":      ["docker-compose.production.input-b.yml"],
    "prod-input-c":      ["docker-compose.production.input-c.yml"],
}
```

This invalidates the implicit assumption in #95 that an `x-ccat-trust:`
YAML anchor could live in a single base file and merge into each app
service via `<<: *ccat-trust`. There is no single base for the contexts
that matter; YAML anchors only resolve within a single file. So the
anchor cannot be "defined once, merged everywhere" by accident — it
needs an explicit wiring decision.

### Decision

Define the anchor and the per-service merge entries in a new file
`docker-compose.trust.yml`, and layer it into every applicable context
via `CONTEXT_COMPOSE`. The trust file is the **single source of truth**
for "which services get the trust bundle bind-mount":

```yaml
# docker-compose.trust.yml (sketch)
x-ccat-trust: &ccat-trust
  volumes:
    - /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem:/etc/ssl/certs/ca-bundle.crt:ro

services:
  postgres:        { <<: *ccat-trust }
  redis:           { <<: *ccat-trust }
  influxdb:        { <<: *ccat-trust }
  ops-db-api:      { <<: *ccat-trust }
  ops-db-ui:       { <<: *ccat-trust }
  grafana:         { <<: *ccat-trust }
  pgadmin:         { <<: *ccat-trust }
  db-backup:       { <<: *ccat-trust }
  # ...one line per app service across all 7 per-machine files...
```

Plus the corresponding `_constants.py` change:

```python
"staging-input-a": ["docker-compose.staging.input-a.yml", "docker-compose.trust.yml"],
"staging-input-b": ["docker-compose.staging.input-b.yml", "docker-compose.trust.yml"],
"staging-input-c": ["docker-compose.staging.input-c.yml", "docker-compose.trust.yml"],
"prod-input-a":    ["docker-compose.production.input-a.yml", "docker-compose.trust.yml"],
"prod-input-b":    ["docker-compose.production.input-b.yml", "docker-compose.trust.yml"],
"prod-input-c":    ["docker-compose.production.input-c.yml", "docker-compose.trust.yml"],
```

The chile production context (`docker-compose.production.chile.yml`) is
not in `CONTEXT_COMPOSE` today; its layering choice mirrors the input
nodes when it is added.

On deploy, `docker compose -f <per-machine> -f docker-compose.trust.yml`
merges each per-service entry from the trust file into the same-named
service in the per-machine file. Adding a new app service that needs
trust is a one-line addition to `docker-compose.trust.yml`, not a touch
in seven separate per-machine files.
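How a context's file list turns into the layered `docker compose` invocation can be sketched as follows. The two `CONTEXT_COMPOSE` entries are copied from the proposed change above; `compose_command` is a hypothetical helper and may not match how `ccat_dc` actually builds its command line.

```python
CONTEXT_COMPOSE: dict[str, list[str]] = {
    "staging-input-c": ["docker-compose.staging.input-c.yml", "docker-compose.trust.yml"],
    "prod-input-a": ["docker-compose.production.input-a.yml", "docker-compose.trust.yml"],
}


def compose_command(context: str, *args: str) -> list[str]:
    """Build the argv for a context: one -f per file, in layering order."""
    cmd = ["docker", "compose"]
    for compose_file in CONTEXT_COMPOSE[context]:
        cmd += ["-f", compose_file]
    return cmd + list(args)


print(" ".join(compose_command("staging-input-c", "config")))
```

Order matters here: the trust file comes last so that its per-service entries are merged onto the services already defined by the per-machine file.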

### Alternatives considered

- **Option A — duplicate the anchor in each per-machine file.**
  Concrete, no `_constants.py` wiring change. Rejected because it
  creates a 7-file sync target — adding a new app service that needs
  trust means remembering to annotate it in whichever per-machine file
  it lands in, and the failure mode of forgetting is a silent TLS
  rejection at runtime. Validation runbook Check 6's per-service
  `has_trust: true` spot-check catches the oversight, but only if it
  is actually run.
- **Option C — bind-mount the parent directory** (`/etc/pki/ca-trust/extracted/pem/`)
  instead of the single bundle file. Would also resolve TODO 14 (single-
  file bind-mount staleness on rotation) because directory bind-mounts
  re-resolve dentries on lookup. Rejected from this section because
  it changes the *what* of the bind-mount; the layering question is the
  *how-to-wire*. Track the directory-vs-file question under TODO 14;
  if Option C is chosen there, this layering decision is unaffected
  (the trust file's volume entry just changes shape).

### Why Option B (the layered trust file) wins

- **One source of truth.** The list of services-that-need-trust lives
  in `docker-compose.trust.yml`, not scattered across 7 per-machine
  files.
- **Auditable PR review.** A Phase B PR diff is one file plus a
  6-line `_constants.py` change. Reviewers do not have to grep across
  7 compose files to verify coverage.
- **Adding a new context** (e.g., a future input-d) is "add the
  per-machine compose, append trust to its `CONTEXT_COMPOSE` entry" —
  a known shape, not a new sync rule.
- **Dev/local opt-out is explicit.** The dev contexts (`dev`,
  `localdev`, `local`) keep self-signed certs per the
  TLS-hard-cutover-policy convention (2026-05-07 memory). Not adding
  the trust file to those entries in `CONTEXT_COMPOSE` is the explicit
  opt-out, visible in the diff and reviewable.
- **YAML anchor mechanics are unchanged.** The `<<: *ccat-trust`
  merge happens within `docker-compose.trust.yml` itself; no
  cross-file anchor references are required (docker-compose does not
  resolve anchors across files anyway).

### Consequences

- **`docker-compose.trust.yml` becomes a hard dependency** for every
  context that lists it. A missing or malformed trust file fails the
  deploy at compose-render time, before any container starts —
  fail-closed, which is the desired property.
- **`_constants.py` is the source of truth for context wiring.**
  Future ADRs that touch deployment topology should reference here.
- **Phase B PR shape**: one new file (`docker-compose.trust.yml`,
  ~70 lines including the anchor and 60 service entries), one
  `_constants.py` patch (6 lines), no per-machine compose edits.
  Validation runbook Check 6's deferred per-service spot-check
  becomes `docker compose -f ... config | yq` against the merged
  render and lands in the same PR.
- **Service inventory (validation runbook Check 6, 2026-05-08)** —
  this is the input set for `docker-compose.trust.yml`'s `services:`
  mapping:

  | File | Total | Needs trust | Exempt |
  |---|---|---|---|
  | `production.input-a.yml` | 12 | 11 | promtail |
  | `production.input-b.yml` | 10 | 8 | loki, promtail |
  | `production.input-c.yml` | 8 | 7 | promtail |
  | `production.chile.yml` | 9 | 8 | promtail |
  | `staging.input-a.yml` | 13 | 11 | loki, promtail |
  | `staging.input-b.yml` | 10 | 8 | loki, promtail |
  | `staging.input-c.yml` | 8 | 7 | promtail |

  Total: **60 app-service annotations** across the 7 files.
- **Exemptions are deliberate, not oversights.** `promtail` ships
  logs to Loki via plain HTTP and has no DB connection; `loki` is a
  log store with no DB clients in the compose graph. Both are
  defence-in-depth candidates if a future change makes them speak to
  a step-ca-issued vhost — at which point they become a one-line
  addition to `docker-compose.trust.yml`. Not load-bearing for #95.
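The per-service spot-check against the merged render could be sketched like this. It is an illustration only: it assumes the render has been parsed into a dict (for example from JSON output of the `config` subcommand), accepts both short-form and long-form volume entries, and uses the bundle target path and exemption list from this section. `has_trust` and `missing_trust` are hypothetical names.

```python
TRUST_TARGET = "/etc/ssl/certs/ca-bundle.crt"
EXEMPT = {"promtail", "loki"}  # deliberate exemptions listed above


def has_trust(service: dict) -> bool:
    """True when the service mounts the trust bundle read-only at TRUST_TARGET."""
    for vol in service.get("volumes", []):
        if isinstance(vol, dict):  # long-form entry
            target, read_only = vol.get("target"), vol.get("read_only", False)
        else:  # short-form "source:target[:options]"
            parts = vol.split(":")
            target = parts[1] if len(parts) > 1 else None
            read_only = len(parts) > 2 and "ro" in parts[2].split(",")
        if target == TRUST_TARGET and read_only:
            return True
    return False


def missing_trust(rendered: dict) -> list[str]:
    """Names of non-exempt services in a merged render that lack the trust mount."""
    return sorted(
        name for name, svc in rendered.get("services", {}).items()
        if name not in EXEMPT and not has_trust(svc)
    )
```

An empty result from `missing_trust` for every context is the pass condition; any name in the list is a service that would hit a TLS rejection at runtime.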

---

## Decision: Alert substrate — tiered, with a TLS-independent backstop

### Context

The PRD draft proposed renewal-failure alerts flowing telegraf →
InfluxDB → Grafana → ops chat. The architect review caught a
circular dependency: that alert path itself depends on the TLS trust
chain we're trying to monitor. If the trust chain breaks, the alert
telling us so is silenced by the same break.

An earlier draft of this ADR section recommended "piggyback on the
SSH-cert plane" on the premise that the SSH-cert plane's
failure-notification path is *by construction independent* of the
database TLS chain. **That premise was wrong.** Inspection of
`ansible/roles/ssh_service_cert/templates/step-cert-monitor.sh.j2`
plus `ansible/roles/system_setup/files/telegraf.conf:960` shows the
SSH-cert plane emits `step_cert` and `step_renew_failed` measurements
via Telegraf `[[inputs.exec]]`, and Telegraf's
`[[outputs.influxdb_v2]]` writes to
`http://db.data.ccat.uni-koeln.de:8086` — the same InfluxDB on
input-b that this PRD is hardening. The SSH-cert plane shares fate
with the database TLS chain. Piggybacking on it does not break the
circular dependency; it just inherits it under a different name.

The fix is not to rebuild on a different single substrate — it is to
accept that any single substrate convenient enough to use day-to-day
will share fate with *something* in the stack. We need a backstop
tier that is genuinely independent.

### Decision

**Tiered substrate, three paths, of which only Tier 2 is fully independent of step-ca-issued TLS:**

#### Tier 1 — Primary (visibility + everyday paging)

Telegraf `[[inputs.exec]]` on every cert host emits, mirroring the
existing SSH-cert plane's `step-cert-monitor.sh.j2`:

- `step_x509_cert,service=...,host=... seconds_to_expiry=Ni`
- `step_x509_renew_failed,service=...,unit=... value=0|1`
- `step_x509_cert_last_renewal_success,service=... seconds_ago=Ni`

Telegraf → InfluxDB → Grafana → Matrix room (page channel:
`#ccat-ops:matrix.data.ccat.uni-koeln.de`). Catches single-service
renewal failures, perms drift, image UID drift (TODO 7) — anything
that doesn't take down InfluxDB or Grafana itself.
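The shape of the lines the exec script would print can be sketched as below. This is a minimal formatter assuming integer fields and tag values free of spaces, commas, and equals signs (no escaping is implemented); `line_protocol` is a hypothetical helper, not the templated monitor script.

```python
def line_protocol(measurement: str, tags: dict[str, str], fields: dict[str, int]) -> str:
    """Format one InfluxDB line-protocol record with integer fields."""
    tag_part = "".join(f",{key}={value}" for key, value in sorted(tags.items()))
    # The trailing "i" marks the value as an integer in line protocol.
    field_part = ",".join(f"{key}={value}i" for key, value in fields.items())
    return f"{measurement}{tag_part} {field_part}"


print(line_protocol(
    "step_x509_cert",
    {"service": "x509-canary", "host": "input-c"},
    {"seconds_to_expiry": 43200},
))
# step_x509_cert,host=input-c,service=x509-canary seconds_to_expiry=43200i
```

Sorting the tags keeps output stable between runs, which makes the script's output easy to diff when debugging.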

Tier 1 *does* transit step-ca-issued TLS once Phase E lands (Telegraf
→ InfluxDB will use the new server cert). This is acknowledged, not
denied. It is the "convenience" tier; it is not load-bearing for the
catastrophic case.

#### Tier 2 — Backstop (TLS-independent, catches catastrophic failures)

Two host-local mechanisms, both calling `mailx` to the existing
`admin_email_addresses` alias (already configured by
`ansible/roles/system_setup/tasks/sendmail.yml` — root → admin alias
is in place via `/etc/aliases`):

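- **`OnFailure=`** on the service unit that each renewal systemd
  timer triggers (the directive has to sit on the service, because
  the service, not the timer, is the unit that enters `failed`).
  Fires immediately when a renewal unit reports `failed`. Mail body
  includes hostname, service, unit name, last 20 lines of
  `journalctl -u <unit>`.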
- **Daily heartbeat cron** at 06:00 UTC sends mail "all certs OK on
  $HOSTNAME, soonest-expiry=Nd, issuance-events-today=N" with one
  line per cert. **Absence of mail for 36h on any host = problem**,
  even if no specific failure was detected.

The mail path goes via the host MTA (sendmail) → uni-köln SMTP relay
→ admin inbox. **This tier does not transit any step-ca-issued TLS
cert.** It is the only path that survives:

- HSM offline (no new certs issuable).
- InfluxDB down on input-b (Tier 1 metrics black-hole).
- Grafana down on input-b.
- Matrix homeserver down on input-b.
- Network partition between input-a/c and input-b.

The only break-conditions for Tier 2 are host network down or the
external SMTP relay down — both known operational classes that are
not silently coupled to step-ca.
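The heartbeat content and the 36h absence rule can be sketched as follows. This is an illustration under assumptions: the per-cert line format (`service: Nd`) and both helper names are hypothetical, and the actual sending through `mailx` is left out.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_mail(hostname: str, days_to_expiry: dict[str, int],
                   issuance_events_today: int) -> tuple[str, str]:
    """Return (summary line, body) with one body line per cert."""
    soonest = min(days_to_expiry.values())
    summary = (f"all certs OK on {hostname}, soonest-expiry={soonest}d, "
               f"issuance-events-today={issuance_events_today}")
    body = "\n".join(f"{svc}: {days}d" for svc, days in sorted(days_to_expiry.items()))
    return summary, body


def heartbeat_overdue(last_mail: datetime, now: datetime,
                      limit: timedelta = timedelta(hours=36)) -> bool:
    """True when no heartbeat has arrived from a host within the limit."""
    return now - last_mail > limit


summary, body = heartbeat_mail("input-c", {"redis": 40, "postgres": 61}, 1)
print(summary)
print(body)
```

The absence check has to run somewhere that receives the mail (a person or a mailbox rule), since a host that cannot send mail cannot report its own silence.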

#### Tier 3 — Issuance audit (anomaly detection)

Every JWK-password-using `step ca certificate` invocation in the
role wraps its call site in a logger trap that writes a structured
journald line:

```
ccat-step-issuance: host=$HOSTNAME service=$SVC ts=$ISO triggered_by=$USER
```

promtail ships journald to Loki; a Grafana alert fires when
issuance-events-per-week exceeds the expected baseline (production:
~6/year per service after Phase A; staging: ~12/year per service).

Mostly Tier-1 plumbing, but the daily heartbeat mail (Tier 2) also
includes `issuance-events-today=N` — so an attacker who silences
Loki and InfluxDB still has to silence the host MTA path to hide
issuance events.
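The audit line and a simple baseline test can be sketched as below. The line format follows the journald example above; the two-sigma rule mirrors the ">2σ above 30d baseline" wording in the acceptance table, but the statistics used here (population mean and standard deviation over per-period counts) are an assumption, and all three helper names are hypothetical.

```python
import statistics

PREFIX = "ccat-step-issuance: "


def issuance_line(host: str, service: str, ts: str, triggered_by: str) -> str:
    """Format the structured audit line written at each issuance."""
    return f"{PREFIX}host={host} service={service} ts={ts} triggered_by={triggered_by}"


def parse_issuance_line(line: str) -> dict[str, str]:
    """Split an audit line back into its key=value fields."""
    if not line.startswith(PREFIX):
        raise ValueError("not an issuance audit line")
    return dict(pair.split("=", 1) for pair in line[len(PREFIX):].split())


def is_anomalous(count: int, baseline: list[int], sigmas: float = 2.0) -> bool:
    """True when count exceeds the baseline mean by more than `sigmas` deviations."""
    mean = statistics.fmean(baseline)
    spread = statistics.pstdev(baseline)
    return count > mean + sigmas * spread
```

With issuance this rare, most baseline periods are zero, so even a small burst stands out; the threshold multiplier is the knob to tune if routine re-issuance during rollouts proves noisy.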

### How this maps to the seven acceptance clauses

| # | Clause | Tier 1 | Tier 2 | Tier 3 |
|---|---|---|---|---|
| 1 | Substrate | Telegraf+Grafana+Matrix | cron+`OnFailure`+`mailx` | journald+Loki+Grafana |
| 2 | Page channel | Matrix `#ccat-ops` | email to `admin_email_addresses` | (rolls up into 1+2) |
| 3 | On-call contract | **deferred to a follow-up — see "Out-of-scope" below** | | |
| 4 | No-step-ca-TLS statement | acknowledged transits step-ca TLS | **does not transit** | partial (Loki not on step-ca today) + Tier-2 backstop |
| 5 | Renewal-failure alert | `step_x509_renew_failed > 0` → Grafana alert | systemd `OnFailure=` mail | n/a |
| 6 | Renewal-success heartbeat | `last_renewal_success_seconds_ago > 24h` alert | daily 06:00 mail; absence ≥ 36h = problem | n/a |
| 7 | Issuance audit log | n/a | "issuance-events-today=N" line in heartbeat mail | structured journald + Loki alert on >2σ above 30d baseline |

### What this changes in the existing infrastructure

- **New Telegraf input-exec script** templated by the
  `step_ca_vhost_cert` role, parallel to
  `roles/ssh_service_cert/templates/step-cert-monitor.sh.j2` — same
  pattern, x509 measurements instead of SSH ones. ~1 small PR.
- **Renewal systemd unit template** (the service the timer
  triggers) gains `OnFailure=ccat-cert-mail@%i.service` and a
  sibling `ccat-cert-mail@.service` unit that calls `mailx`.
  ~1 small PR.
- **Daily heartbeat cron entry** (`/etc/cron.d/ccat-cert-heartbeat`)
  templated per host from the cert-spec list. ~1 small PR.
- **Issuance audit log** is a one-line
  `logger -t ccat-step-issuance` wrapper around the issuance script
  and a Grafana/Loki alert rule. Folded into the role's issuance task.

Three small Phase A PRs, parallelisable with the role itself; none
depends on the cert-issuance role being complete first.

### Alternatives considered (why this won over the original options)

- **Single-substrate, "use what exists and is independent"
  (original recommendation: SSH-plane piggyback).** Rejected because
  the SSH-plane is not actually independent — it shares the same
  Telegraf → InfluxDB pipe. Single-substrate framing is the bug.
- **cron + mailx as the *only* path.** Robust, but operationally
  thin: no silencing, no ack, no per-service severity, no
  dashboard. Acceptable as a backstop; not enough as the everyday
  path.
- **Pushgateway + Alertmanager over plain HTTP on a private docker
  network.** Adds two new components for one alert class.
  YAGNI until we have at least three substrates that would benefit
  from a unified alerting layer. Revisit when the alerting story is
  mature enough to consolidate.

### Why a tiered design is the honest answer

A single substrate convenient enough to be the daily path will share
fate with something. The architect's worry was real; the fix is not
"find a magically-independent single path" (no such path exists at
this scale of infrastructure) but "have a backstop that is
deliberately inconvenient — mail to a mailing list — so it is
actually independent". Tier 2 is operationally annoying on purpose:
mail is not a great paging UX, but it is a great *backstop UX*
because it doesn't transit any of the things we are trying to alert
on.

### Consequences

- **Three artifacts to maintain** instead of one. Worth it for the
  load-bearing independence guarantee.
- **`admin_email_addresses` is now load-bearing.** Document the alias
  contents and the SMTP relay path in the on-call runbook (when one
  exists). Test the path during Phase A by deliberately failing a
  staging renewal and confirming the mail arrives.
- **Grafana / InfluxDB / Matrix outage scenarios are now
  page-quiet on Tier 1 by design.** Operators must internalise that
  "no Tier 1 alert" does not mean "all is well": it can equally mean
  that Tier 1 itself is down. The daily Tier 2 heartbeat is the
  positive-confirmation signal.
- **Future consolidation** (e.g., Alertmanager) replaces Tier 1
  without disturbing Tier 2. Tier 2 is the architectural floor.

---

## Open questions

- **Is `runtime_redis` (CONFIG SET) sufficient on Redis 7 with
  TLS-only listeners?** The `tls-port` directive isn't reloadable
  via CONFIG SET in some Redis versions; verify on the version we
  ship. If not, `runtime_redis` degrades to a `restart_redis`
  adapter and Redis joins InfluxDB in the 30s-downtime club.
- **mTLS asymmetry follow-up.** Schedule a review at the next
  data-transfer credentials refactor. Don't block #95.
- **Threat-model document link.** The full leak-response runbook
  lives there; this ADR carries the headlines. Link when written.

### Resolved

- **(2026-05-08) Does `step ca renew` succeed against an already-expired
  authenticating cert?**
  Resolved by configuration inspection (validation runbook Check 4).
  `step ca provisioner add --allow-renewal-after-expiry` exists as a
  flag; `step-ca/provisioners-add.sh` does NOT pass it on `prod-services`
  or `staging-services`. Default is `false`. Therefore `step ca renew`
  on an expired cert is refused under the current CA config.

  **Decision: keep the strict default (`allowRenewalAfterExpiry: false`,
  i.e. Option A).** Threat-model trade-off:

  - Service-host snapshot leak (cert+key only): the JWK provisioner
    password is NOT on service hosts in steady state — vault-staged as a
    0400 host tmpfile during issuance, unlinked in the `always:` block
    of the issuance play. Steady-state renewal uses cert-as-auth and
    needs no password. So a snapshot leak gives the attacker cert+key
    but not the password, and Option A's "expired = denied" semantic
    auto-bounds the leak at `notAfter` *if the attacker fails to renew
    in time*. Detection-then-host-rotation breaks the renewal chain.
  - Controller compromise (saiyajin / Jenkins-on-input-b): both options
    are equally lost. Vault key lives there.
  - Persistent service-host compromise spanning an issuance window:
    attacker eventually grabs the 0400 tmpfile. Both options equally
    lost.
  - Operational cost of Option A: HSM offline > 30d production budget
    (15d staging) requires manual re-issuance ceremony — vault →
    0400 tmpfile → run issuance script. Same pattern as today's vhost
    cert and `ssh_service_cert/_per_container.yml`.

  Option A's protection is contingent on detection. Therefore
  monitoring + canary become load-bearing (TODO 15 in the pre-implementation
  TODO list, plus expanded acceptance for the alert substrate in TODO 4).

- **(2026-05-08) `update-ca-trust extract` atomicity.**
  Resolved by validation runbook Check 3. `update-ca-trust` swaps the
  bundle via atomic rename on RHEL 10.1 (inode change verified). No
  partial-read window on the host filesystem. Downstream nuance:
  Linux single-file bind-mounts pin the source inode, so atomic rename
  on the host means containers see the *old* bundle until restart —
  tracked as TODO 14, not a blocker for this ADR.

- **(2026-05-08) Trust-anchor compose layering.**
  Resolved by validation runbook Check 6 + this ADR's "Decision:
  Compose layering" section. New `docker-compose.trust.yml` is the
  single source of truth for service-needs-trust; layered into each
  applicable context via `CONTEXT_COMPOSE`. TODO 16 closed on this
  ADR section landing.

- **(2026-05-08) Break-glass SSH access during HSM-down >24h.**
  Resolved by static review of existing infrastructure rather than
  by adding a new artifact. The architect's concern presumed
  step-ca-issued user certs are the only operator auth path;
  `ansible/roles/system_setup/tasks/nitrokey_ssh.yml` applies
  per-operator FIDO2 hardware-key pubkeys to plain
  `authorized_keys` on every managed host (outside the
  `AuthorizedPrincipalsFile` cert path), and out-of-band hardware
  consoles cover hardware-level recovery. The Nitrokey path
  survives any step-ca outage by construction. TODO 5 dropped;
  Check 11 signed off as N/A. See "Operational notes" for the
  role-split rationale (Nitrokey for core admins, step-ca SSH
  certs for remote admins).

- **(2026-05-08) x509 canary on `input-c.staging`.**
  Resolved by this ADR's "Decision: x509 canary on
  `input-c.staging`" section. 24h cert from `staging-services` JWK
  on a non-CA host (`input-c.staging`); 12h timer cadence; failure
  to renew within 18h of `notAfter` triggers Tier 1 + Tier 2 alert.
  Adds a fourth `noop` reload adapter (general-purpose; canary is
  its first user). Doubles as validation runbook Check 8 (page-path
  E2E). TODO 15 closed on this ADR section landing.

- **(2026-05-08) Cert-spec schema and UID parameterisation.**
  Resolved by this ADR's "Decision: Cert-spec schema —
  parameterised UIDs, no defaults baked in" section. UIDs are
  per-host facts (`host_container_uids` dict) referenced from
  cert-specs via `| mandatory` so a missing fact fails the play
  loudly. The schema is the shared contract for the role, the
  TODO 7 runtime-drift script, and the TODO 4 alert substrate's
  service labels. TODO 3 closed on this ADR section landing.

- **(2026-05-08) Alert substrate.**
  Resolved by replacing the single-substrate framing (SSH-plane
  piggyback) with a tiered design — see "Decision: Alert substrate
  — tiered, with a TLS-independent backstop". The SSH-plane
  piggyback recommendation in an earlier draft of this ADR was
  based on a wrong premise (the SSH plane shares the same Telegraf
  → InfluxDB pipe and so shares fate with the database TLS chain
  it was supposed to monitor). The tiered fix: Tier 1 Telegraf+
  Grafana+Matrix for everyday paging, Tier 2 cron+`OnFailure`+
  `mailx` to `admin_email_addresses` as the load-bearing
  TLS-independent backstop, Tier 3 journald+Loki+Grafana for
  issuance-frequency anomaly detection. TODO 4 closed on this ADR
  section landing. The on-call hand-off contract clause is
  explicitly deferred to a follow-up — channels exist; rotation
  contract is a team-structure decision for when the rotation
  exists.

---

## Out-of-scope

Things this PRD and ADR explicitly do not address. Each item is here
because someone has asked or might reasonably ask, and the answer is
"not in this rollout":

- **2-week F→G dual-trust soak.** Waived under the time-bound
  setup-mode argument in "Decision: Migration style". Revisit if
  production becomes populated before Phase G ships. Do not cite
  this ADR as precedent for skipping a soak on a populated
  production stack.
- **Migrating Redis off mTLS to server-auth-only-with-password.**
  Inertia, not principle (see "Decision: mTLS scope asymmetry").
  Revisit at the next data-transfer or ops-db-api credentials
  refactor.
- **CRL or OCSP infrastructure.** Lifetime-as-revocation only
  (see "Decision: Revocation stance"). A regulatory ask for a CRL
  endpoint is a future ADR.
- **`ops-db-api` inbound TLS** (`nginx-proxy → ops-db-api`).
  Currently undecided (TODO 11). Once chosen, the answer goes into
  "Operational notes" if in-scope, or remains here if explicitly
  out-of-scope, or moves to its own ADR.
- **Cert-transparency / public-log integration.** Step-ca is a
  private CA; not applicable.
- **Baking the CCAT root into application images.** Trust
  distribution decision: bind-mount + env var, not image rebuild.
- **A unified CLI surface upfront** (`ccat tls rotate`,
  `ccat tls status`). YAGNI; design after the role works.
- **Renewal-job log retention beyond `journalctl`.** Covered by
  the general logging / Loki policy, not this PRD.
- **Backup-as-cert-recovery-path.** Backup coverage of service-cert
  directories is not confirmed by ITCC (TODO 17). The role
  re-applying after a host reinstall is the recovery path; backups
  are best-effort defence in depth, not load-bearing.
- **F→G soak in any future TLS migration on populated production.**
  See "Decision: Migration style" → Time-bound. Future migrations on
  a populated stack must use a soak.
- **On-call hand-off contract for the alert substrate** (who acks,
  escalation timeout, expected MTTR). Channels exist (Tier 1: Matrix
  `#ccat-ops`, Tier 2: `admin_email_addresses` mail). The
  rotation/ack/MTTR contract is a team-structure decision deferred
  until a real on-call rotation exists. Track as a follow-up; not
  blocking PRD #95.

---

## Operational notes

This section consolidates the operational concerns surfaced through the
per-decision sections above into one place for on-call. Each item
points back to where the rationale lives.

- **HSM offline budget.** 30d production / 15d staging (HSM blast
  radius decision). Beyond budget: manual re-issuance ceremony via
  JWK provisioner password (vault → 0400 host tmpfile → unlink in an
  `always:` block of the issuance play). This is not auto-recovery —
  it requires an operator with vault access to run the issuance
  script.
- **Renewal cadence.** 12h timer per host, modelled on
  `step-ca/renew-vhost-cert.sh`. Most fires are no-ops because
  `step ca renew` only contacts the CA in the last 1/3 of cert
  lifetime. A misconfigured timer (or a `--force` storm during
  rollout) is a CA-DoS risk; throttle / serialise mass issuance
  during phase rollouts (TODO 6).
- **Trust-anchor rotation requires container restart.** Single-file
  bind-mounts pin the source inode (TODO 14). Any change to
  `/etc/pki/ca-trust/source/anchors/` followed by `update-ca-trust
  extract` REQUIRES a rolling restart of every container that
  bind-mounts the trust bundle. The host gets the new file
  atomically; running containers do not. Either accept this and
  document the restart in the rotation procedure, or move to a
  directory bind-mount (TODO 14 alternative).
- **Postgres replica during rotation.** Primary and replica must not
  renew simultaneously while replication is mid-write (TODO 10).
  The ordering is still open (primary-first with a wait gate,
  replica-first, or a coordination lock); the choice is recorded
  under TODO 10 acceptance and moves here once decided.
- **Alert path independence — tiered substrate.** Three paths (alert
  substrate decision, TODO 4 closed). Tier 1 (Telegraf → InfluxDB →
  Grafana → Matrix `#ccat-ops`) is the everyday paging path and
  shares fate with input-b services. Tier 2 (systemd `OnFailure=` +
  daily 06:00 cron heartbeat → `mailx` to `admin_email_addresses`)
  is the load-bearing TLS-independent backstop: it does not transit
  any step-ca-issued cert. Tier 3 (journald → promtail → Loki →
  Grafana) is the issuance-anomaly audit. **Operational rule for
  on-call:** "no Tier 1 alert" means at most "Tier 1 is up", not
  "all is well"; the daily Tier 2 heartbeat mail is the
  positive-confirmation signal, and its absence for ≥ 36h on any
  host is itself a problem.
- **Container UIDs are per-host parameters.** Cert-spec UIDs (TODO 3)
  are parameterised, not hardcoded; runtime UID drift is detected by
  the renewal script (TODO 7) by reading `/proc/1/status` from PID 1
  inside each container (`docker exec ... id` defaults to root and is
  the wrong probe — runbook Check 5 captured this gotcha). Today's
  values, observed on `input-b.staging` 2026-05-08: Redis 999,
  Postgres 999, InfluxDB 1000. `influxdb:latest` is the only
  unpinned image in the stack — drift risk concentrates there.
- **Backup is not the cert recovery path.** Service-cert directories
  may not be in the central Commvault policy (TODO 17, ITCC ticket
  pending). Recovery on host reinstall is "re-run the
  `step_ca_vhost_cert` role". Document this in the role README.
- **Break-glass SSH already provided by existing infrastructure.**
  The architect's worry — "if step-ca is down >24h, every operator's
  SSH cert expires and nobody can SSH in to fix it" — assumed
  step-ca-issued user certs are the only operator auth path. They
  are not. `ansible/roles/system_setup/tasks/nitrokey_ssh.yml`
  applies per-operator FIDO2 hardware-key pubkeys
  (`ansible/roles/system_setup/files/pubkeys/<username>/*.pub`)
  directly to `authorized_keys` on every managed host, outside the
  `AuthorizedPrincipalsFile` cert path. Out-of-band hardware access
  (iDRAC / hypervisor console) provides the second tier for
  hardware-level recovery. The role split is: Nitrokey for **core
  admins** (physically present, hardware key in pocket), step-ca
  SSH certs for **out-of-core / remote admins** where shipping a
  hardware key is impractical. TODO 5 is dropped on this basis;
  Check 11 signed off as N/A.
- **Compose layering is anchored in `docker-compose.trust.yml`.**
  Single source of truth for the service-trust bind-mount matrix
  (compose-layering decision). Validation runbook Check 6 inventory
  is the input set; per-service `has_trust: true` spot-check lands
  in the same Phase B PR as the trust file itself.
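
The inode-pinning behaviour behind the trust-anchor bullet can be reproduced without Docker. The sketch below is an illustration only: it uses an open file descriptor as a stand-in for a single-file bind-mount (both keep referring to the original inode), and the function names and the `ca-bundle.pem` file name are hypothetical. It assumes the host-side update replaces the file by rename, which is what "atomically" implies above.

```shell
#!/bin/sh
# Illustration only: an open file descriptor stands in for a single-file
# bind-mount. Both keep pointing at the inode that existed at setup time.

# Print the inode number of a path (`ls -i` prints "<inode> <path>").
inode_of() {
    ls -i "$1" | awk '{ print $1 }'
}

demo_inode_pinning() {
    dir=$(mktemp -d) || return 1
    bundle="$dir/ca-bundle.pem"        # hypothetical file name
    printf 'old-root\n' > "$bundle"
    before=$(inode_of "$bundle")

    exec 3< "$bundle"                  # take a reference to the current inode

    # Atomic update: write a sibling file, then rename it over the original.
    printf 'new-root\n' > "$bundle.new"
    mv "$bundle.new" "$bundle"
    after=$(inode_of "$bundle")

    pinned=$(cat <&3)                  # read through the old reference
    exec 3<&-                          # drop the reference
    fresh=$(cat "$bundle")             # read through the path
    rm -rf "$dir"

    if [ "$before" != "$after" ]; then changed=yes; else changed=no; fi
    echo "inode_changed=$changed"
    echo "pinned_view=$pinned"
    echo "host_view=$fresh"
}
```

Calling `demo_inode_pinning` prints `inode_changed=yes`, `pinned_view=old-root` and `host_view=new-root`: the path now resolves to a new inode, while the earlier reference still serves the old bundle. A directory bind-mount avoids this because the file name is resolved inside the mounted directory at open time.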

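The UID probe from the container-UID bullet can be sketched as below. This is a hedged illustration, not the renewal script: `uid_from_status` and `check_container_uid` are hypothetical names, the container name passed in is whatever the compose project uses, and the sketch assumes the image provides `cat`. It reads the real UID of PID 1 from `/proc/1/status`, so it reports the user the service actually runs as rather than the user `docker exec` runs as. For example, `check_container_uid redis 999` (with `redis` as an assumed container name) would return 0 while Redis runs as UID 999, the value observed above, and 1 after drift.

```shell
#!/bin/sh
# Hypothetical helpers; names are illustrative, not taken from the repo.

# Print the real UID from /proc/<pid>/status content supplied on stdin.
# The "Uid:" line carries four fields: real, effective, saved, filesystem.
uid_from_status() {
    awk '$1 == "Uid:" { print $2; exit }'
}

# Compare the UID of PID 1 inside a container with the expected cert-spec UID.
# Returns 0 on match, 1 on drift, 2 if the probe produced no UID at all
# (for example because the container is not running).
check_container_uid() {
    container=$1
    expected=$2
    actual=$(docker exec "$container" cat /proc/1/status | uid_from_status)
    if [ -z "$actual" ]; then
        echo "could not read UID of PID 1 in $container" >&2
        return 2
    fi
    if [ "$actual" != "$expected" ]; then
        echo "UID drift in $container: expected $expected, got $actual" >&2
        return 1
    fi
    return 0
}
```
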
---

## Consequences — overall

**What becomes easier:**

- One trust root for the whole CCAT stack (SSH plane, vhost cert,
  three databases). Operators need to know one CA, one root file
  path, one renewal model.
- Retiring `ansible/roles/redis_certs/` and `redis/<variant>/certs/`
  removes a homegrown PKI with four parallel CAs that nobody outside
  this team can audit.
- Adding a fourth TLS-consuming datastore later is "add a
  reload-strategy adapter", not "build new PKI".

**What becomes harder:**

- The CCAT root CA / HSM is now load-bearing for *more* things. HSM
  ceremony cadence and HSM availability matter more than they did.
  The soft-offline budget gives us 30d production / 15d staging
  headroom but the calculus is now "how long can the HSM be offline"
  not "how long can the redis-certs CA be offline" (which was
  effectively infinite because that CA was a file on input-b's
  disk).
- The pluggable-adapter design means the role has three test
  surfaces, not one. Plan for that in the test plan.

**New operational duties:**

- Watch the SSH-cert-plane notification stream for DB cert renewal
  failures (alert substrate decision).
- Maintain the schema entry for any new `vault_step_ca_prov_*`
  passwords (lines up with the existing vault schema work in
  `data-center-computer-setup/vars_application_schema.yml`).
- The `ccat redis-certs` CLI commands (currently in `ctl`) get
  superseded; plan a CLI surface for the new role
  (`ccat tls rotate <service>`, `ccat tls status`). Don't build it
  before the role works; YAGNI.

---

## References

Files verified to exist in the repo at the time of writing:

- `step-ca/issue-vhost-cert.sh` — one-shot issuance pattern (JWK
  provisioner password via `--password-file`, atomic `.new` install,
  docker exec reload).
- `step-ca/renew-vhost-cert.sh` — `step ca renew` cert-as-auth pattern,
  PRE_MTIME/POST_MTIME conditional reload, 12h timer cadence.
- `ansible/roles/ssh_service_cert/tasks/_per_container.yml` —
  password-staging-from-vault → 0400 host tmpfile → unlink convention,
  `community.docker.docker_container_exec` with stdin-only password
  delivery.
- `ansible/roles/ca_trust/` — RHEL system-anchor distribution for the
  CCAT root.
- `ansible/roles/redis_certs/` — homegrown PKI being retired by this
  ADR.
- `redis/{main,ccat,develop,develop-ccat}/certs/` — per-variant CAs
  being sunset.
- `grafana/provisioning/{production,staging}/datasources/influxdb-datasource.yaml`
  — current `tlsSkipVerify: true` lines, plain HTTP datasource URL.
- `docs/source/adr/0001-ca-per-vhost-cert-split.md` — prior ADR on the
  CA's own vhost cert; format and reasoning style mirrored here.
- PRD: ccatobs/system-integration#95 — defers full decision tree to
  this document.
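
The PRE_MTIME/POST_MTIME conditional reload listed for `step-ca/renew-vhost-cert.sh` is a small pattern worth spelling out. The following is a minimal sketch of the idea, not a copy of that script: `renew_and_maybe_reload`, `mtime_of`, and the two callback arguments are hypothetical, and `stat -c %Y` assumes GNU coreutils. The renew step is passed in as a command so that the gating logic can be exercised without a CA.

```shell
#!/bin/sh
# Sketch of an mtime-gated reload. renew_cmd and reload_cmd are hypothetical
# stand-ins for a `step ca renew` wrapper and a service-specific reload.

# Print the modification time of a file as seconds since the epoch.
# GNU stat syntax; BSD/macOS would need `stat -f %m` instead.
mtime_of() {
    stat -c %Y "$1"
}

# Run the renew command, then reload only if the cert file actually changed.
renew_and_maybe_reload() {
    cert=$1
    renew_cmd=$2
    reload_cmd=$3

    pre_mtime=$(mtime_of "$cert") || return 1
    "$renew_cmd" "$cert" || return 1
    post_mtime=$(mtime_of "$cert") || return 1

    if [ "$pre_mtime" != "$post_mtime" ]; then
        "$reload_cmd"
    else
        echo "cert unchanged; skipping reload"
    fi
}
```

Because the comparison uses whole-second mtimes, a rewrite that lands in the same second as the previous one would go unnoticed; that is harmless at a 12h cadence but worth knowing when testing by hand.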