# CCAT Certificate Authority — Architecture and Design

This page is **explanation**: it describes what the CCAT private CA is
for, why it is shaped the way it is, and the reasoning behind the major
design choices (two tiers, two HSMs, GitHub-team authorization,
short-lived certs, Pattern A service-account renewal). It does **not**
contain runbooks; for those see the how-to siblings listed under "See
also" below.

For PKI fundamentals (what a cert is, how TLS handshakes work, the role
of public vs private keys), see {doc}`tls-and-pki`. This document
assumes you already know those concepts and focuses on *our specific
setup* and the reasons behind it.

```{contents}
:local:
:depth: 2
```

## What the CA is for

Before the CA existed, every TLS or mTLS need across the CCAT stack was
solved with a hand-rolled OpenSSL script. Redis mTLS has a
[`ccat redis-certs generate` workflow](../secrets-management.md) that
runs eight `openssl` commands to produce a per-variant CA and a set of
client/server certs. Postgres replication uses its own cert pair.
Developer SSH access uses static `~/.ssh/authorized_keys` files per
machine. Every service reinvents the cert layer at a slightly
different angle.

The CA replaces that pattern. Instead of "each service has its own
trust root and its own cert-generation script," every service gets its
certs from **one** central authority that issues short-lived certs on
demand. The benefits compound:

- **One trust root.** Clients bootstrap against the CCAT root once and
  trust everything it signs — Redis, Postgres, SSH hosts, internal web
  UIs, bbcp endpoints.
- **Short-lived certs.** 16-hour SSH user certs, 30–90 day TLS certs,
  7-day SSH host certs — no more "rotate these 12-month certs once a
  year and hope nothing breaks." Expiry becomes a health property, not
  a calendar event.
- **Identity-aware issuance.** SSH user certs come from GitHub OAuth
  via Dex, with ccatobs GitHub team membership as the authorization
  gate. No more adding public keys to `authorized_keys` files across
  machines — if you're in the `ccatobs/datacenter` team, you run
  `step ssh login`, get a fresh cert valid for 16 hours, and SSH in.
  Off-boarding is "remove from the GitHub team" — certs expire on their
  own, no key-removal ceremony.
- **Automation paths.** ACME and SSHPOP provisioners let services
  auto-renew their own certs without human touch.

The CA is not a goal in itself — it is how we stop writing
cert-management code.

## Trust architecture: the two-tier, two-HSM model

The CA is **private** — nothing publicly trusted on the wider internet
connects to it. It is used only by CCAT hosts, containers, and
developer laptops that have explicitly bootstrapped against our root.
This shapes the threat model and the choices below.

### Two tiers: root and intermediate

Every X.509 PKI has a notion of a trust hierarchy. We use the standard
two-tier layout:

- **Root CA**: the ultimate trust anchor. All CCAT certs chain back to
  it. Used only to sign the intermediate. Rotated ~never (20-year
  lifetime).
- **Intermediate CA**: the working signing key. Signs every cert
  step-ca issues day-to-day. Rotated every 5–10 years or immediately
  upon suspected compromise.

The reason this separation exists: if the intermediate is compromised,
you can revoke and replace it by pulling the root out of the safe and
doing a controlled ceremony — clients keep trusting the (unchanged)
root. If the **root** is compromised, every client must re-bootstrap
against a new root, which for CCAT means touching every server and
every developer laptop. We optimize for recovery from intermediate
compromise, not for root compromise never happening.

### Two HSMs: offline root, online intermediate

We use two Nitrokey HSM 2 dongles:

| Role | HSM | Physical location | Online? |
|---|---|---|---|
| Root signing key | HSM #1 | Locked safe, off-site ideally | Never |
| Intermediate signing key | HSM #2 | R640 internal USB on input-b | Always |

HSM #1 comes out of the safe only during signing ceremonies (once at
commissioning, then once per intermediate rotation). It is plugged into
an **air-gapped laptop** for those ceremonies, never into input-b.

HSM #2 lives inside the R640 chassis permanently, in the internal
USB-A port that sits on the motherboard. Getting physical access to it
requires pulling the server out of its rack and opening the lid — a
bar high enough that the "someone unplugs the dongle" threat is
effectively closed. This is only a partial protection though: see the
next section.

### What the HSMs actually protect

An HSM protects against **key extraction**, not against **key use**.
The private key is generated on the device and is physically
impossible to export — even root on the host cannot read the key
bytes. But anyone with:

1. access to the machine the HSM is plugged into, **and**
2. the HSM user PIN

can invoke signing operations via PKCS#11. The HSM will happily
comply — that's its job.

This distinction drives several design choices:

- **The intermediate PIN lives in the CCAT vault** (encrypted with
  ansible-vault) and is rendered into `/opt/data-center/system-integration/.env`
  on input-b as `STEP_CA_HSM_PIN`. step-ca's `ca.json` references it
  via `pin-source=/run/secrets/hsm-pin` mounted from a tmpfs. The PIN
  is therefore reachable by any process running as root on input-b.
  An attacker who owns input-b can ask HSM #2 to sign arbitrary certs
  for the duration of their access.

- **Accepting that risk is fine** because when you kick the attacker
  out, the key is still on the dongle. Recovery is "rotate the
  intermediate" — pull HSM #1 from the safe, do a ceremony, install
  the new intermediate on HSM #2 (or a fresh dongle), restart step-ca.
  Clients do not notice because the root is unchanged. Total downtime:
  ~1 hour, mostly ceremony overhead. Compare to a file-on-disk
  intermediate, where the attacker walks off with the key and can
  continue issuing certs even after being expelled — that requires a
  *root* rotation, which is catastrophic.

- **The root HSM is never plugged into anything networked.** Period.
  If you need the root to sign something, you do it on a freshly
  wiped laptop, offline, with pre-printed procedures. The root user
  PIN is never typed on input-b or saved in any vault. It is kept on
  paper in the safe, in a sealed envelope alongside the dongle.

- **HSM #1 failure = emergency root rotation.** We do not maintain
  DKEK backup shares for Nitrokey key recovery. If HSM #1 physically
  fails (electronics die in the safe), we have no way to recover the
  root key, and every CCAT client must re-bootstrap. This is a
  deliberate simplification for our scale (~20 clients) — the
  recovery pain is bounded, and DKEK introduces its own operational
  complexity. Revisit if the observatory grows past ~50 clients.

### Trust architecture: per-vhost cert posture

step-ca's TLS handshake has to present a cert that chains to the
CCAT root, because step-cli explicitly sets `tls.Config.RootCAs`
from the bootstrapped trust file and Go ignores `SystemCertPool` /
`SSL_CERT_FILE` whenever `RootCAs` is non-nil. The CA's TLS chain
therefore has to be CCAT-rooted at the wire — no client-side
trust-bundle plumbing can bridge an LE-fronted endpoint.

We do this with a per-vhost cert split inside the same nginx-proxy:

| vhost | TLS signed by | Trusted by |
|---|---|---|
| `ca.ccat.uni-koeln.de` (:443) | **CCAT root** (issued by step-ca itself, `prod-services` JWK) | step-cli clients via `step ca bootstrap` |
| `auth.ccat.uni-koeln.de` (:443) | Let's Encrypt | browsers (Dex login flow), GitHub OAuth |
| `grafana.ccat.uni-koeln.de` (:443) | Let's Encrypt | browsers |
| `docs.ccat.uni-koeln.de` (:443) | Let's Encrypt | browsers |

The CA vhost opts out of `acme-companion` (no `LETSENCRYPT_HOST` on
the step-ca compose service); the CCAT-rooted cert is issued via
`step-ca/issue-vhost-cert.sh` and renewed by the
`step-ca-vhost-renew.timer` (every 12h, no-op until within 1/3 of
the cert lifetime). Other vhosts in the same proxy keep LE because
they serve browsers + GitHub OAuth callbacks which require a
publicly-trusted chain.

step-ca's native `:9000` is also bound on input-b. firewalld
(managed by the `hsm_host` role's `ca_allowed_source_cidrs`
variable, default Uni Köln `/16`) gates inbound access. In practice
`:9000` is reachable only from input-b's own `/24` because the Uni
Köln IT firewall drops `:9000` cross-subnet — that's why we use it
exclusively for the same-host issuance/renewal scripts that
nginx-proxy can't proxy (the JWK-gated `step ca certificate` flow
that `issue-vhost-cert.sh` and `renew-vhost-cert.sh` invoke).
**`:443` is the universal path** for everyone else.

#### Policy enforcement point

Access to `ca.ccat.uni-koeln.de:443` is gated by an in-repo IP
allowlist at the proxy: {file}`proxy/data/vhost.d/ca.ccat.uni-koeln.de`,
defaulting to Uni Köln `/16` and `deny all` otherwise. The file is
the source of truth — adding a partner CIDR is a PR plus an
`nginx -s reload`. Operator mechanics are in {doc}`../ca-day-to-day`
§ "Adding a partner subnet"; the decision rationale for this
posture is in ADR-0001 ({doc}`../adr/0001-ca-per-vhost-cert-split`).

The full retrospective of attempts that ruled out alternative
postures (LE on the CA vhost, client-side trust-bundle bridges,
direct cross-subnet `:9000` exposure) lives in
{doc}`../ceremony/lessons-learned-cutover-2026-05-04`. This page
describes only the current state.

### Let's Encrypt layering

The CA has two public-facing HTTPS endpoints: `ca.ccat.uni-koeln.de`
(step-ca itself) and `auth.ccat.uni-koeln.de` (Dex). These are
the URLs that developer laptops, `step ca bootstrap`, and GitHub's
OAuth callback hit over the public internet.

These endpoints are served by the **existing nginx-proxy +
acme-companion** stack on input-b, with certs from **Let's Encrypt**
— not from our own CCAT CA. The reasons:

1. Browsers and GitHub's OAuth callback will not follow redirects to
   a TLS endpoint signed by an untrusted CA. Using the CCAT root for
   `ca.ccat.uni-koeln.de` would mean every `step ca bootstrap` needs
   a manually-verified pre-shared fingerprint — a chicken-and-egg
   problem.
2. Let's Encrypt is free, automated, and trusted by every OS. We get
   working HTTPS on both domains with zero additional infrastructure
   and zero per-client trust configuration.

Two ACME endpoints exist on input-b once commissioning is done, and
they are not the same thing:

- **Let's Encrypt ACME** at `acme-v02.api.letsencrypt.org` — used by
  acme-companion to obtain public TLS certs for `ca.ccat.uni-koeln.de`
  and `auth.ccat.uni-koeln.de`. Renewed automatically every ~60 days.
- **step-ca ACME** at `https://ca.ccat.uni-koeln.de/acme/acme/directory`
  — used by internal services (future: cert-manager in K8s, or a
  simple `step` command on each host) to obtain CCAT-issued internal
  TLS certs. Provisioner is added to step-ca during commissioning.

Both speak the same ACME protocol but serve different trust domains.
Don't confuse them.

## Physical and network preconditions

Before commissioning, these must all be true:

1. **input-b is physical** (R640 in a locked HA hall). The internal USB
   port on the motherboard is accessible. Physical access to the room
   is gated.
2. **DNS** records exist for `ca.ccat.uni-koeln.de` and
   `auth.ccat.uni-koeln.de`, both pointing at input-b's public IP.
3. **Firewall**: ports 80 and 443 are reachable from the public
   internet (for Let's Encrypt HTTP-01 challenges and client traffic).
4. **nginx-proxy + acme-companion** is already running on input-b via
   `docker-compose.proxy.yml` (`ccat proxy status`).
5. **GitHub OAuth App** has been created in the `ccatobs` organization
   with callback URL `https://auth.ccat.uni-koeln.de/callback` and
   `read:org` scope (Dex needs it to check team membership).
6. **GitHub team `ccatobs/datacenter`** exists and contains the people
   who should get SSH access via `step ssh login`. Dex rejects
   everyone outside this team at the authentication step.

## Commissioning strategy — phases

We deliberately commission the CA in two stages, using the HSM arrival
as a built-in rehearsal of the most dangerous operation in the CA's
lifetime (root rotation). The phases are:

| Phase | What | When | Outcome |
|---|---|---|---|
| **Phase 1** | Dry-run commissioning with a throwaway auto-init root | Now, without HSMs | Working CA, used by a small test cohort, `ca_trust` role proven in production |
| **Phase 2** | Offline root ceremony + HSM cutover = rotation rehearsal | When both HSMs arrive | CA migrated to the intended HSM-backed steady state, test cohort re-bootstraps |
| **Phase 3** | Rollout to real services (Redis mTLS, Postgres TLS, SSH host certs, etc.) | After Phase 2 has been stable for ~1 week | CA is trusted by production services |

Why this ordering:

- **Phase 1 de-risks everything that isn't HSM-specific.** DNS, Let's
  Encrypt issuance, nginx-proxy wiring, Dex + GitHub team enforcement,
  OIDC redirect URIs, step-ca provisioner syntax, the `step ca bootstrap`
  → `step ssh login` flow, the `ca_trust` Ansible role end-to-end —
  all verified in a low-stakes setting before hardware arrives.
- **Phase 2 exercises the root rotation procedure.** Root rotation is
  the one operation the team otherwise never practices; it is also
  the catastrophic disaster-recovery path. Doing it once intentionally,
  with throwaway clients and zero production impact, is the best
  rehearsal possible. If it fails, you learn while stakes are zero.
- **Phase 3 is gated on Phase 2 success.** Nothing outside the small
  Phase 1 test cohort bootstraps against the CA until after the HSM
  cutover. This discipline is non-negotiable: if production clients
  trusted the Phase 1 throwaway root, Phase 2 would require
  re-bootstrapping them for real, defeating the rehearsal framing.

### The one rule that makes Phase 1 safe

**Nothing production-critical bootstraps against the Phase 1 CA.**
The test cohort is 2–3 people who know they're on a test CA and have
agreed to re-bootstrap at Phase 2 cutover. No Redis, no Postgres, no
SSH hosts, no CI systems, no automation.

The Phase 1 CA's blast radius is therefore near-zero: even if someone
compromised input-b during the dry-run window and stole the
auto-init root key from the docker volume, the certs they could sign
would be trusted only by the test cohort's laptops — which are going
to be re-bootstrapped in Phase 2 anyway. The Phase 1 root goes in
the bin regardless.

The executable Phase 2 ceremony procedure lives in
{doc}`../ceremony/playbook` and the on-server cutover in
{doc}`../ceremony/cutover-playbook`. After the ceremony, the
public artefacts (root cert, SSH CA pubkeys, fingerprint paper)
flow back to git and to operator workstations as described next.

### Post-ceremony distribution

- HSM #1 → sealed envelope with root PINs and fingerprint paper → the
  safe. Does not enter input-b. Ever.
- HSM #2 → carried to the server room → installed in the R640 internal
  USB port → chassis closed → server returned to rack.
- Export USB → mounted on a developer machine → public artifacts
  (`root_ca.crt`, `ssh_user_ca.pub`, `ssh_host_ca.pub`) copied into
  `ansible/roles/ca_trust/files/` → committed to git with a clear
  commit message ("ca: commit public trust material from root
  ceremony 2026-XX-XX, fingerprint ...").

The public artifacts are safe to commit — they contain no secret
material, and every client needs to be able to fetch them. The
fingerprint in the commit message is the cross-check: any future
developer inspecting history can verify the committed root cert
matches the ceremony fingerprint on paper.

## Authorization model — we trust GitHub, not email domains

A common pitfall when wiring step-ca's OIDC provisioner is to use
the `--domain` flag to restrict which users can get certs. That
flag checks the `email` claim of the OIDC token against an
allowlist of domains. For a tenant that uses a single corporate
email domain (Google Workspace, Microsoft 365), it's a reasonable
coarse gate.

**For CCAT, it is the wrong model.** Our trust chain is:

1. Dex federates **GitHub** as the identity provider.
2. Authorization is **membership in the `ccatobs/datacenter`
   GitHub team**, not email domain membership.
3. Team members have **wildly different email domains** — uni-koeln.de,
   ph1.uni-koeln.de, cornell.edu, fyst.org, personal addresses. None
   of these reflect CCAT membership in any structural way.

Filtering by email domain is simultaneously too strict (rejects
valid ccatobs members whose GitHub primary email isn't a uni
address) and too loose (accepts anyone with a uni-koeln.de email
regardless of whether they're in ccatobs — that's a huge public
domain). The script therefore **omits `--domain` by default**.

What actually provides the authorization gate:

**Dex enforces `ccatobs/datacenter` team membership directly in its
GitHub connector**, before step-ca ever sees a token. Config is in
`step-ca/dex/config.yaml`:

```yaml
connectors:
  - type: github
    id: github
    config:
      orgs:
        - name: ccatobs
          teams:
            - datacenter
```

How it works end-to-end:

1. User runs `step ssh login`. step-cli opens a browser to the
   `CCAT-GitHub` provisioner's configured OIDC issuer (Dex).
2. Dex redirects the browser to GitHub for OAuth.
3. GitHub authenticates the user and returns an OAuth token with
   `read:org` scope.
4. Dex calls GitHub's `/user/teams` endpoint with that token and
   checks whether the user is a member of `ccatobs/datacenter`.
5. If yes: Dex issues an OIDC ID token with a `groups` claim
   containing the team slug, redirects back to step-cli, step-ca
   validates the token, issues a 16h SSH cert. Done.
6. If no: Dex returns an "access denied" page, no token is issued,
   step-cli errors out with "OIDC flow failed." The user never
   reaches step-ca.

**Onboarding** a new operator: add them to the `ccatobs/datacenter`
team on github.com. Their next `step ssh login` succeeds. No CCAT-side
configuration change, no admin UI to click through, no secret to
rotate.

**Offboarding**: remove them from the team. Their current 16h cert
expires within the day, no new certs can be issued. Any existing
SSH sessions keep working until the cert underlying them expires,
and then they're locked out. No cert revocation needed in the
common case.

This is a **fully automatic** model: both authentication and
authorization are delegated to GitHub's team management. CCAT
writes zero identity code. A GitHub outage makes new `step ssh
login` flows unavailable until GitHub recovers (existing 16h certs
keep working), which is an acceptable trade for the operational
simplicity — and in practice GitHub has dramatically better
uptime than any identity layer CCAT would run itself.

**Why we moved off Keycloak.** The prior Phase 1 setup used Keycloak
as an IdP in front of GitHub. Keycloak's built-in GitHub broker does
not call the teams endpoint, only `/user`, so authorization had to
be enforced by a manual "assign the `ccatobs-member` realm role"
step in the Keycloak admin UI after each new user's first login.
That's one manual onboarding step too many, and it doesn't age
well — if a user leaves the GitHub team, their Keycloak role
stays assigned unless an admin remembers to clean up. Switching to
Dex collapses three moving parts (Keycloak, Keycloak-db, manual
role assignment) into one declarative YAML block and tracks GitHub
team membership automatically.

The `--domain` flag remains available in the script via the
`ALLOWED_DOMAINS` env var for cases where domain is genuinely the
right gate (e.g. you're bootstrapping a CA for a specific org that
does use a uniform email domain). For CCAT, leave it unset.

## SSH access tiers — narrative

The Dex team gate answers "who may authenticate." The separate
question of "which Linux user may they become, and what happens
if the IdP is down" is answered by a three-tier access model,
implemented via a mix of Ansible-managed local users,
`AuthorizedPrincipalsFile`, and the existing Nitrokey FIDO2 SSH
keys.

This section explains the *intended steady-state model*. The
implementation (an Ansible role deploying `auth_principals/%u`
files) is Phase 3 work; Phase 1 hosts are currently using the
legacy static-`authorized_keys` path. For the lookup table that
summarises the tiers in one place, see
{doc}`ca-provisioner-set` § "SSH access tiers".

**Tier 1 — Hard-core admins (2–3 people)**

Full root access to every CCAT-managed host, with a physical
second factor as the fallback for when the IdP layer is
unavailable.

- Personal Linux user on every host (e.g. `buchbend`), managed by
  Ansible `users.yml`, member of the `wheel`/`sudo` group.
- Static SSH authorized_keys entry for their **Nitrokey FIDO2
  resident key** (`sk-ecdsa-sha2-nistp256@openssh.com`). This key
  is physically bound to the dongle and cannot be cloned without
  the device. It is the break-glass path: if Dex is down, if
  GitHub is unreachable, if step-ca won't issue, the admin still
  SSHes in with their dongle.
- Also a full member of `ccatobs/datacenter` on GitHub, so the
  normal `step ssh login` flow works day-to-day. The Nitrokey
  path is the backup, not the primary.
- Sudo permissions are granted through group membership, not
  through anything the SSH cert carries. A Tier 1 admin who
  logs in with a step-ca cert lands in the same local account
  and gets the same sudo rights as one who logs in with the
  Nitrokey — the cert/key choice is just the door, not the
  privilege level.

**Tier 2 — Operational staff**

Regular contributors who need SSH access for legitimate
operational work but are not the people you wake up at 3am. The
Nitrokey dependency is explicitly *not* required — adding
hardware to every new contributor is friction that scales badly.

- Personal Linux user on managed hosts, created by Ansible from
  `users.yml`. No `wheel`/`sudo` membership unless there's a
  specific operational need.
- **No static SSH authorized_keys entry.** The only path to
  logging in is a valid step-ca-issued SSH user cert, which
  requires authenticating through Dex + the GitHub team check.
- `sshd_config` has `AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u`
  and `TrustedUserCAKeys /etc/ssh/trusted_user_ca_keys`. Each
  staff member gets a one-line file
  `/etc/ssh/auth_principals/<username>` containing their GitHub
  login as a principal. Both are rendered by the `ca_trust`
  Ansible role from the `github:` field on each user record in
  `group_vars/all/users.yml` — see {doc}`ca-provisioner-set` for
  the implementation reference.
- Off-boarding is GitHub-side: remove them from
  `ccatobs/datacenter`, their next `step ssh login` fails at
  Dex, their current cert expires within 16h, they're out. No
  Ansible rerun, no manual `authorized_keys` surgery.
- Rolling back a staff member to "no SSH at all" can be done
  either by removing them from the GitHub team (preferred, fast,
  no CCAT-side action) or by removing their `github:` field
  from `users.yml` and re-running `ca_trust` (the role scrubs
  the stale `auth_principals/<user>` file explicitly on the
  next run, which closes the cert path even if their GitHub
  team membership is still live).

**Tier 3 — Break-glass / emergency-only accounts**

For scenarios where even Tier 1's Nitrokey path is insufficient
— the local sshd is broken, the machine is in single-user mode,
the network is down — there must be a path that bypasses SSH
entirely.

- A named local user (e.g. `breakglass`) exists on each managed
  host, created by Ansible but with:
  - **No password** (`!` in `/etc/shadow`).
  - **No authorized_keys** and **no entry in any
    `auth_principals` file**. Cannot be reached via SSH by
    design.
  - Full `wheel`/`sudo` rights, so once you *are* them, you can
    recover anything.
- Access is via the **iLO/DRAC out-of-band management console**
  on the R640, reached from the Uni Köln management VLAN. The
  iLO gives you a virtual keyboard at a physical login prompt,
  which is the one interface that works when every network
  service is gone. An admin with iLO credentials types the
  break-glass account's name + a password supplied by iLO
  root-recovery or a physically-printed emergency password kept
  in the safe alongside the root HSM.
- The break-glass path is tested during commissioning and then
  left alone. Using it is an incident in itself; any use should
  generate a postmortem.

The key property: **the failure modes are orthogonal**. A Dex
outage takes out Tier 2 but leaves Tiers 1 and 3 intact. A
GitHub outage takes out the `step ssh login` path for everyone,
but Tier 1 falls back to Nitrokey and Tier 3 is untouched. A
full network outage on input-b takes out step-ca entirely, but
Tier 1's Nitrokey path still works on every *other* host (their
FIDO2 key is in each host's local `authorized_keys`) and Tier 3
recovers the unreachable machine via iLO. No single failure,
including a compromise of input-b, locks the operators out of
their fleet.

Tier 2 implementation landed as an extension of the existing
`ca_trust` role (Phase 1b). Remaining Phase 3 work items: build
out the `breakglass` local user + iLO recovery procedure (Tier 3),
and migrate each operator off the legacy static `authorized_keys`
entries deployed by `system_setup` once their cert path has been
used successfully for a while.

## Why these lifetimes

The provisioner lifetimes (16h / 24h / 7d / 30–90d) are deliberate and
worth understanding, because "cert lifetime" often gets conflated with
"security strength" when it's really about **compromise recovery time**
vs **operational resilience**. The numbers themselves live in
{doc}`ca-provisioner-set`; the *why* is here.

- **Human SSH (16h)** — long enough to cover a full workday across
  time zones, short enough that daily re-authentication is routine.
  Off-boarding someone from the ccatobs GitHub org effectively
  revokes their SSH access within 16 hours with zero extra work:
  their next `step ssh login` fails at the GitHub OAuth step,
  their previous cert expires, they're out. No `authorized_keys`
  surgery required.

- **Service SSH (24h, auto-renewed every 6h)** — the service-accounts
  provisioner is designed for the **Pattern A** renewal flow
  described below: services run a systemd timer that calls
  `step ssh renew` every 6h, so the cert is continuously refreshed
  without ever touching the provisioner password again after
  bootstrap. A stolen cert is valid for at most 24h (and the timer
  would be trying to replace it during that window anyway).
  Rotation = rotate the provisioner password centrally, all
  downstream certs expire naturally within a day. Compare to
  classic SSH keys where compromise means "find and rotate keys on
  every deployed host."

- **Service x509 (30–90d)** — TLS certs for Redis, Postgres, internal
  APIs etc. run 30d in staging and 90d in production. Production is
  longer for operational resilience (a week-long CA outage doesn't
  cascade into service outages); staging is shorter to exercise the
  renewal flow and surface any regressions before they bite prod.
  Services renew weekly via a short script or cert-manager-style
  controller.

- **SSH host certs (7d via SSHPOP)** — See the detailed SSHPOP
  explanation below. 7 days gives plenty of slack; no reason to go
  longer when renewal is free.

- **ACME (90d)** — matches LE convention. Any internal service that
  speaks ACME (cert-manager in k8s, certbot-like tools on hosts)
  gets the standard public-CA-equivalent lifetime.

The one non-obvious choice is **service-accounts at 24h instead of
30d**. A longer cert would mean fewer renewals and less operational
friction, but it would also mean a compromise window measured in
weeks instead of hours, and a stolen cert could quietly self-renew
via `step ssh renew` until someone notices. 24h is the sweet spot
where auto-renewal is cheap (every 6h, trivial load) and compromise
is self-healing within a day.

## What SSHPOP is and why it's clever

SSHPOP = **SSH Proof Of Possession**. It's a step-ca provisioner
type specifically designed for renewing SSH **host** certs with
zero credentials stored on the host after initial bootstrap.
Understanding it matters because it's the foundation of the
"SSH host certs rotate themselves forever" story in Phase 3.

**The mechanism**: when a host wants to renew its cert, it signs
the renewal request with **the private key of its currently-valid
cert** (which is the sshd host key — already on disk, already
required for sshd to work). step-ca verifies the signature against
the submitted cert, checks the cert hasn't expired, checks it was
originally issued by this CA, and issues a fresh one with the
same principal.

```
Host                               step-ca
  │                                    │
  │ (current cert is 5 days old,       │
  │  systemd timer fires)              │
  │                                    │
  │──── step ssh renew request ───────>│
  │     (signed with current cert's    │
  │      private key, includes current │
  │      cert in the request)          │
  │                                    │
  │                                    │ SSHPOP provisioner:
  │                                    │   - Extract pubkey from current cert
  │                                    │   - Verify signature
  │                                    │   - Check not expired
  │                                    │   - Check issued-by-us
  │                                    │
  │<───── new cert, 7 days valid ──────│
  │                                    │
  │ Write to disk, SIGHUP sshd         │
```

**Zero new secrets were used.** The host proved its identity by
*possessing* the private key that matches the current cert. Hence
"Proof of Possession". No password, no token, no provisioner
credential on the host — just the sshd key which has to be there
anyway.

**Why only host certs?** Host certs are associated with a single
long-lived key (the sshd host key), so "prove possession of the
current cert's key" has a natural answer. User certs are per-session
(fresh key each `step ssh login`), so there's no stable key to
prove possession of.

**Natural forcing function**: if a host falls out of rotation long
enough for its cert to fully expire, SSHPOP cannot rescue it. The
host has no valid cert to sign with, so renewal fails. You'd have
to re-bootstrap the host with a fresh cert via a different provisioner
(the JWK `service-accounts`). This is a **feature, not a bug** — it
surfaces hosts that have silently fallen offline. Classic SSH host
keys are forever and silently trust stale hosts; SSHPOP reflects
liveness.

**Phase 3 usage** (not yet in place):

1. Bootstrap host cert via the JWK `service-accounts` provisioner,
   one-time, during host provisioning (requires the password briefly,
   then delete it).
2. Configure sshd: `HostCertificate /etc/ssh/ssh_host_ed25519_key-cert.pub`.
3. systemd timer on each host, daily:
   ```
   step ssh renew --force /etc/ssh/ssh_host_ed25519_key-cert.pub
   ```
4. Cert rotates forever, no credentials on the host after bootstrap.
5. Clients that have `ca_trust` deployed (the `@cert-authority` line
   in `ssh_known_hosts`) automatically trust the renewed certs.

## Service-account SSH patterns

There are two deployment patterns for machine SSH identities on
CCAT, and knowing which is which keeps the threat model clear.

**Pattern A — long-lived cert with auto-renewal.** A service
bootstraps once, gets a 24h cert, and runs a systemd timer that
calls `step ssh renew` every 6 hours. After the one-time bootstrap,
the provisioner password is no longer stored on the host — the
cert *is* the authentication for future renewals (`step ssh renew`
uses the current cert's private key to authenticate to step-ca).
This is the right pattern for:

- **Jenkins** running on input-b — long-running daemon, lots of
  small SSH operations, trusted host
- **ccat_transfer (bbcp)** on every input node — same profile,
  high-volume transfers between internal machines
- **cron-based backup scripts** and similar daemons

**Pattern B — per-task short-lived cert.** A service has no
standing SSH identity. When it needs to SSH, it calls
`step ssh certificate` with a 5–60 minute lifetime, uses the cert
for the task, discards it. The provisioner password lives in a
tightly-scoped secret readable only by the job runner. Each cert
issuance is a logged event in step-ca. This is the right pattern
for:

- **CI runners on untrusted execution environments** (cloud
  runners, contractor machines, shared infrastructure)
- **Rarely-run one-off jobs** where maintaining a renewal timer
  adds more ceremony than it saves
- **Compliance-sensitive operations** that need an audit entry
  per execution

Both patterns use the **same** `service-accounts` provisioner —
the difference is how the service *uses* it. CCAT's current setup
(Jenkins + ccat_transfer, all on trusted hardware in a locked
hall) maps cleanly to Pattern A everywhere.

### Wiring Pattern A — the `ssh_service_cert` role (concept)

The Ansible role `ansible/roles/ssh_service_cert/` implements Pattern A.
It is wired into `playbook_setup_vms.yml` for both `input_staging` and
`input_ccat`, and it is a no-op on hosts where `ssh_service_certs` is
empty — so adoption is opt-in per group.

What the role does, per service-account in `ssh_service_certs`:

1. Install step-cli on the host (RHEL only; pinned via
   `ssh_service_cert_step_cli_version` in `defaults/main.yml`). Skipped
   if any version is already present (e.g. on input-b, where
   `hsm_host` already installed it).
2. As the target user (`become_user: <user>`), bootstrap step-cli
   against the CCAT CA. Idempotent — gated by `creates: ~/.step/config/defaults.json`.
3. Check whether `~user/.ssh/ccat_id_ed25519-cert.pub` exists and is
   valid (`step ssh inspect`). If not, do a one-shot password-gated
   issuance: write `vault_step_ca_prov_service_accounts_password` to
   a 0400 tmpfile, run `step ssh certificate ... --provisioner-password-file`,
   delete the tmpfile in an `always:` block. `no_log: true` everywhere
   the password could surface in Ansible output.
4. Install templated systemd units `step-renew@.service` and
   `step-renew@.timer`. Enable + start `step-renew@<user>.timer`.
   The timer fires every `ssh_service_cert_renew_interval` (default
   6h), runs `step ssh renew --force` as `<user>`. step-cli's renew
   only contacts the CA when the cert is past 2/3 of its lifetime,
   so over-firing is harmless.

Cert files live at `~user/.ssh/ccat_id_ed25519{,.pub,-cert.pub}` —
the `ccat_` prefix avoids collision with any pre-existing `id_ed25519`
the user already had. Services that consume the cert must point at
this filename explicitly (`-i ~/.ssh/ccat_id_ed25519` for ssh, or via
their config).

**Target-side principals.** For service-account SSH (where the cert
principal equals the login username), no extra config is needed
beyond what `ca_trust` already deploys. sshd's default rule — "if no
`AuthorizedPrincipalsFile` matches, the cert principal must equal the
login username" — applies. Cert issued with principal `ccat_transfer`
→ user logs in as `ccat_transfer`.

`ca_trust` does set `AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u`
in sshd_config and writes per-human files for users with a `github:`
binding. For service accounts (which are not in the `users:` list),
no per-user file is written, so sshd falls back to the default rule.

This is why the role enforces `principal == user` — it's the cheapest
trust model that needs no per-host file coordination.

For the operational rollout steps (staging then production), see
{doc}`../ca-provisioner-management` § "Wiring Pattern A — Rollout
staging/production".

**Jenkins — deferred.** Jenkins runs in a Docker container (UID 1000
inside the container ≠ host UID 10999). The Pattern A role assumes a
host-native user with a host-native `~/.ssh`. For Jenkins, the cert +
key need to live in the bind-mounted `/data/jenkins/`
(= `/var/jenkins_home/` inside the container) and renewals must run
with container-side ownership (1000:1000). Two reasonable shapes:

1. Renewal as a sidecar `docker exec` from a host timer: requires
   step-cli inside the Jenkins image, plus a chown step.
2. A dedicated `step-renewer` sidecar service in
   `docker-compose.jenkins.yml`: cleaner separation, more moving parts.

Either fits cleanly on top of the existing `ssh_service_cert` role
once we pick a shape. Tracked as a follow-up; not blocking the
ccat_transfer rollout.

## Appendix: Why not Let's Encrypt for everything?

A fair question: if Let's Encrypt already works for our public
endpoints, why run our own CA for internal stuff?

- **Let's Encrypt only works for publicly-resolvable DNS names and
  reachable HTTP(S) endpoints.** Our internal Redis, Postgres, SSH
  host certs, and service mTLS all run on hostnames like
  `redis.data.ccat.uni-koeln.de` that are reachable only from inside
  our network — Let's Encrypt cannot validate them.
- **Let's Encrypt does not issue SSH certs.** SSH certs are a
  completely different format from X.509 TLS certs. step-ca handles
  both; LE only does TLS.
- **Let's Encrypt rate limits** (50 certs per week per registered
  domain) would be hit fast if every internal service renewed against
  the public CA. Our own CA has no such limit.
- **Short-lived internal TLS certs** (30 days, renewed weekly) with
  LE would mean constantly hammering a third-party. With our own CA
  the operation is free and internal.

LE is the right tool for the outer boundary (the CA's own public face).
step-ca is the right tool for everything behind it.

## See also

- {doc}`tls-and-pki` — PKI fundamentals: cert structure, TLS
  handshake, root vs intermediate explained from first principles.
- {doc}`ca-provisioner-set` — reference tables: provisioner set,
  SSH access tiers, Ansible role tags, lifetime flags.
- {doc}`certificate-authority-threat-model` — what the cert auth
  model defends against and what it does not.
- {doc}`../ca-client-onboarding` — how-to: set up your laptop to
  SSH with step-ca-issued certs.
- {doc}`../ca-day-to-day` — how-to: routine CA operations
  (bring stack up/down, issue certs, monitor expiry, backup).
- {doc}`../ca-provisioner-management` — how-to: add, update,
  remove provisioners; rotate JWK passwords; roll out Pattern A.
- {doc}`../ca-rotation-and-recovery` — runbook: intermediate / root
  rotation; vhost cert renewal and emergency re-issue; disaster
  recovery scenarios.
- {doc}`../adr/0001-ca-per-vhost-cert-split` — the ADR for the
  per-vhost cert split decision (why `:443` for everyone, why
  `:9000` is same-host-only).
- {doc}`../ceremony/index` — the executable ceremony procedures
  (offline root, HSM cutover) that this architecture relies on.