CCAT Certificate Authority — Architecture and Design#

This page is explanation: it describes what the CCAT private CA is for, why it is shaped the way it is, and the reasoning behind the major design choices (two tiers, two HSMs, GitHub-team authorization, short-lived certs, Pattern A service-account renewal). It does not contain runbooks; for those see the how-to siblings listed under “See also” below.

For PKI fundamentals (what a cert is, how TLS handshakes work, the role of public vs private keys), see TLS, Certificates, and Public Key Infrastructure. This document assumes you already know those concepts and focuses on our specific setup and the reasons behind it.

What the CA is for#

Before the CA existed, every TLS or mTLS need across the CCAT stack was solved with a hand-rolled OpenSSL script. Redis mTLS has a ccat redis-certs generate workflow that runs eight openssl commands to produce a per-variant CA and a set of client/server certs. Postgres replication uses its own cert pair. Developer SSH access uses static ~/.ssh/authorized_keys files per machine. Every service reinvents the cert layer at a slightly different angle.

The CA replaces that pattern. Instead of “each service has its own trust root and its own cert-generation script,” every service gets its certs from one central authority that issues short-lived certs on demand. The benefits compound:

  • One trust root. Clients bootstrap against the CCAT root once and trust everything it signs — Redis, Postgres, SSH hosts, internal web UIs, bbcp endpoints.

  • Short-lived certs. 16-hour SSH user certs, 30–90 day TLS certs, 7-day SSH host certs — no more “rotate these 12-month certs once a year and hope nothing breaks.” Expiry becomes a health property, not a calendar event.

  • Identity-aware issuance. SSH user certs come from GitHub OAuth via Dex, with ccatobs GitHub team membership as the authorization gate. No more adding public keys to authorized_keys files across machines — if you’re in the ccatobs/datacenter team, you run step ssh login, get a fresh cert valid for 16 hours, and SSH in. Off-boarding is “remove from the GitHub team” — certs expire on their own, no key-removal ceremony.

  • Automation paths. ACME and SSHPOP provisioners let services auto-renew their own certs without human touch.
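In day-to-day terms, the developer flow the bullets describe is small. A hedged sketch (hostnames and the CCAT-GitHub provisioner name follow this page; the fingerprint placeholder stays a placeholder — it comes from the ceremony paper):

```shell
# One-time: trust the CCAT root (fingerprint verified out-of-band)
step ca bootstrap --ca-url https://ca.ccat.uni-koeln.de \
  --fingerprint <root-fingerprint-from-ceremony-paper>

# Daily: 16-hour SSH user cert via Dex + GitHub OAuth, then plain ssh
step ssh login your-email@example.org --provisioner CCAT-GitHub
ssh input-b.ccat.uni-koeln.de
```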

The CA is not a goal in itself — it is how we stop writing cert-management code.

Trust architecture: the two-tier, two-HSM model#

The CA is private — nothing publicly trusted on the wider internet connects to it. It is used only by CCAT hosts, containers, and developer laptops that have explicitly bootstrapped against our root. This shapes the threat model and the choices below.

Two tiers: root and intermediate#

Every X.509 PKI has a notion of a trust hierarchy. We use the standard two-tier layout:

  • Root CA: the ultimate trust anchor. All CCAT certs chain back to it. Used only to sign the intermediate. Rotated ~never (20-year lifetime).

  • Intermediate CA: the working signing key. Signs every cert step-ca issues day-to-day. Rotated every 5–10 years or immediately upon suspected compromise.

The reason this separation exists: if the intermediate is compromised, you can revoke and replace it by pulling the root out of the safe and doing a controlled ceremony — clients keep trusting the (unchanged) root. If the root is compromised, every client must re-bootstrap against a new root, which for CCAT means touching every server and every developer laptop. We optimize for recovery from intermediate compromise, not for root compromise never happening.

Two HSMs: offline root, online intermediate#

We use two Nitrokey HSM 2 dongles:

| Role | HSM | Physical location | Online? |
| --- | --- | --- | --- |
| Root signing key | HSM #1 | Locked safe, off-site ideally | Never |
| Intermediate signing key | HSM #2 | R640 internal USB on input-b | Always |

HSM #1 comes out of the safe only during signing ceremonies (once at commissioning, then once per intermediate rotation). It is plugged into an air-gapped laptop for those ceremonies, never into input-b.

HSM #2 lives inside the R640 chassis permanently, in the internal USB-A port that sits on the motherboard. Getting physical access to it requires pulling the server out of its rack and opening the lid — a bar high enough that the “someone unplugs the dongle” threat is effectively closed. This is only partial protection, though: see the next section.

What the HSMs actually protect#

An HSM protects against key extraction, not against key use. The private key is generated on the device and is physically impossible to export — even root on the host cannot read the key bytes. But anyone with:

  1. access to the machine the HSM is plugged into, and

  2. the HSM user PIN

can invoke signing operations via PKCS#11. The HSM will happily comply — that’s its job.

This distinction drives several design choices:

  • The intermediate PIN lives in the CCAT vault (encrypted with ansible-vault) and is rendered into /opt/data-center/system-integration/.env on input-b as STEP_CA_HSM_PIN. step-ca’s ca.json references it via pin-source=/run/secrets/hsm-pin mounted from a tmpfs. The PIN is therefore reachable by any process running as root on input-b. An attacker who owns input-b can ask HSM #2 to sign arbitrary certs for the duration of their access.

  • Accepting that risk is fine because when you kick the attacker out, the key is still on the dongle. Recovery is “rotate the intermediate” — pull HSM #1 from the safe, do a ceremony, install the new intermediate on HSM #2 (or a fresh dongle), restart step-ca. Clients do not notice because the root is unchanged. Total downtime: ~1 hour, mostly ceremony overhead. Compare to a file-on-disk intermediate, where the attacker walks off with the key and can continue issuing certs even after being expelled — that requires a root rotation, which is catastrophic.

  • The root HSM is never plugged into anything networked. Period. If you need the root to sign something, you do it on a freshly wiped laptop, offline, with pre-printed procedures. The root user PIN is never typed on input-b or saved in any vault. It is kept on paper in the safe, in a sealed envelope alongside the dongle.

  • HSM #1 failure = emergency root rotation. We do not maintain DKEK backup shares for Nitrokey key recovery. If HSM #1 physically fails (electronics die in the safe), we have no way to recover the root key, and every CCAT client must re-bootstrap. This is a deliberate simplification for our scale (~20 clients) — the recovery pain is bounded, and DKEK introduces its own operational complexity. Revisit if the observatory grows past ~50 clients.
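Concretely, the pin-source wiring means step-ca’s ca.json points its PKCS#11 KMS at the tmpfs-mounted PIN file. A hedged fragment — the module path and token label are illustrative; only the pin-source value comes from this page:

```json
{
  "kms": {
    "type": "pkcs11",
    "uri": "pkcs11:module-path=/usr/lib64/opensc-pkcs11.so;token=SmartCard-HSM?pin-source=/run/secrets/hsm-pin"
  }
}
```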

Trust architecture: per-vhost cert posture#

step-ca has to present a TLS cert that chains to the CCAT root, because step-cli explicitly sets tls.Config.RootCAs from the bootstrapped trust file, and Go ignores SystemCertPool / SSL_CERT_FILE whenever RootCAs is non-nil. The CA’s TLS chain therefore has to be CCAT-rooted at the wire; no client-side trust-bundle plumbing can bridge an LE-fronted endpoint.

We do this with a per-vhost cert split inside the same nginx-proxy:

| vhost | TLS signed by | Trusted by |
| --- | --- | --- |
| ca.ccat.uni-koeln.de (:443) | CCAT root (issued by step-ca itself, prod-services JWK) | step-cli clients via step ca bootstrap |
| auth.ccat.uni-koeln.de (:443) | Let’s Encrypt | browsers (Dex login flow), GitHub OAuth |
| grafana.ccat.uni-koeln.de (:443) | Let’s Encrypt | browsers |
| docs.ccat.uni-koeln.de (:443) | Let’s Encrypt | browsers |
The CA vhost opts out of acme-companion (no LETSENCRYPT_HOST on the step-ca compose service); the CCAT-rooted cert is issued via step-ca/issue-vhost-cert.sh and renewed by the step-ca-vhost-renew.timer (every 12h, no-op until within 1/3 of the cert lifetime). Other vhosts in the same proxy keep LE because they serve browsers and GitHub OAuth callbacks, which require a publicly-trusted chain.

step-ca’s native :9000 is also bound on input-b. firewalld (managed by the hsm_host role’s ca_allowed_source_cidrs variable, default Uni Köln /16) gates inbound access. In practice :9000 is reachable only from input-b’s own /24 because the Uni Köln IT firewall drops :9000 cross-subnet — that’s why we use it exclusively for the same-host issuance/renewal scripts that nginx-proxy can’t proxy (the JWK-gated step ca certificate flow that issue-vhost-cert.sh and renew-vhost-cert.sh invoke). :443 is the universal path for everyone else.
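What the firewalld gate amounts to can be sketched with firewall-cmd — the hsm_host role drives this from ca_allowed_source_cidrs, so the zone name, CIDR, and literal commands here are illustrative, not the role’s actual tasks:

```shell
firewall-cmd --permanent --new-zone=ccat-ca
firewall-cmd --permanent --zone=ccat-ca --add-source=134.95.0.0/16   # Uni Köln /16 (illustrative CIDR)
firewall-cmd --permanent --zone=ccat-ca --add-port=9000/tcp
firewall-cmd --reload
```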

Policy enforcement point#

Access to ca.ccat.uni-koeln.de:443 is gated by an in-repo IP allowlist at the proxy: proxy/data/vhost.d/ca.ccat.uni-koeln.de, defaulting to Uni Köln /16 and deny all otherwise. The file is the source of truth — adding a partner CIDR is a PR plus an nginx -s reload. Operator mechanics are in CA day-to-day operations § “Adding a partner subnet”; the decision rationale for this posture is in ADR-0001 (ADR-0001 — Per-vhost cert split for the CCAT step-ca endpoint).
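The allowlist file itself is a plain nginx fragment of allow/deny directives; a hedged sketch (CIDRs illustrative — the committed file is the source of truth):

```nginx
# proxy/data/vhost.d/ca.ccat.uni-koeln.de
allow 134.95.0.0/16;    # Uni Köln (illustrative CIDR)
# allow 192.0.2.0/24;   # partner CIDRs land here via PR
deny all;
```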

The full retrospective of attempts that ruled out alternative postures (LE on the CA vhost, client-side trust-bundle bridges, direct cross-subnet :9000 exposure) lives in Lessons learned — Phase 2 HSM cutover 2026-05-04. This page describes only the current state.

Let’s Encrypt layering#

The CA has two public-facing HTTPS endpoints: ca.ccat.uni-koeln.de (step-ca itself) and auth.ccat.uni-koeln.de (Dex). These are the URLs that developer laptops, step ca bootstrap, and GitHub’s OAuth callback hit over the public internet.

These endpoints are served by the existing nginx-proxy + acme-companion stack on input-b, with certs from Let’s Encrypt — not from our own CCAT CA. The reasons:

  1. Browsers and GitHub’s OAuth callback will not follow redirects to a TLS endpoint signed by an untrusted CA. Using the CCAT root for ca.ccat.uni-koeln.de would mean every step ca bootstrap needs a manually-verified pre-shared fingerprint — a chicken-and-egg problem.

  2. Let’s Encrypt is free, automated, and trusted by every OS. We get working HTTPS on both domains with zero additional infrastructure and zero per-client trust configuration.

Two ACME endpoints exist on input-b once commissioning is done, and they are not the same thing:

  • Let’s Encrypt ACME at acme-v02.api.letsencrypt.org — used by acme-companion to obtain public TLS certs for ca.ccat.uni-koeln.de and auth.ccat.uni-koeln.de. Renewed automatically every ~60 days.

  • step-ca ACME at https://ca.ccat.uni-koeln.de/acme/acme/directory — used by internal services (future: cert-manager in K8s, or a simple step command on each host) to obtain CCAT-issued internal TLS certs. Provisioner is added to step-ca during commissioning.

Both speak the same ACME protocol but serve different trust domains. Don’t confuse them.
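To make the distinction concrete: a client of the internal ACME endpoint simply points its ACME tool at the step-ca directory URL instead of Let’s Encrypt’s. A hedged sketch using certbot (the internal hostname is illustrative; any ACME client works the same way):

```shell
certbot certonly --standalone \
  --server https://ca.ccat.uni-koeln.de/acme/acme/directory \
  -d redis.internal.example   # illustrative internal name
```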

Physical and network preconditions#

Before commissioning, these must all be true:

  1. input-b is physical (R640 in a locked HA hall). The internal USB port on the motherboard is accessible. Physical access to the room is gated.

  2. DNS records exist for ca.ccat.uni-koeln.de and auth.ccat.uni-koeln.de, both pointing at input-b’s public IP.

  3. Firewall: ports 80 and 443 are reachable from the public internet (for Let’s Encrypt HTTP-01 challenges and client traffic).

  4. nginx-proxy + acme-companion is already running on input-b via docker-compose.proxy.yml (ccat proxy status).

  5. GitHub OAuth App has been created in the ccatobs organization with callback URL https://auth.ccat.uni-koeln.de/callback and read:org scope (Dex needs it to check team membership).

  6. GitHub team ccatobs/datacenter exists and contains the people who should get SSH access via step ssh login. Dex rejects everyone outside this team at the authentication step.

Commissioning strategy — phases#

We deliberately commission the CA in two stages, using the HSM arrival as a built-in rehearsal of the most dangerous operation in the CA’s lifetime (root rotation). The phases are:

| Phase | What | When | Outcome |
| --- | --- | --- | --- |
| Phase 1 | Dry-run commissioning with a throwaway auto-init root | Now, without HSMs | Working CA, used by a small test cohort, ca_trust role proven in production |
| Phase 2 | Offline root ceremony + HSM cutover = rotation rehearsal | When both HSMs arrive | CA migrated to the intended HSM-backed steady state, test cohort re-bootstraps |
| Phase 3 | Rollout to real services (Redis mTLS, Postgres TLS, SSH host certs, etc.) | After Phase 2 has been stable for ~1 week | CA is trusted by production services |

Why this ordering:

  • Phase 1 de-risks everything that isn’t HSM-specific. DNS, Let’s Encrypt issuance, nginx-proxy wiring, Dex + GitHub team enforcement, OIDC redirect URIs, step-ca provisioner syntax, the step ca bootstrap → step ssh login flow, the ca_trust Ansible role end-to-end — all verified in a low-stakes setting before hardware arrives.

  • Phase 2 exercises the root rotation procedure. Root rotation is the one operation the team otherwise never practices; it is also the catastrophic disaster-recovery path. Doing it once intentionally, with throwaway clients and zero production impact, is the best rehearsal possible. If it fails, you learn while stakes are zero.

  • Phase 3 is gated on Phase 2 success. Nothing outside the small Phase 1 test cohort bootstraps against the CA until after the HSM cutover. This discipline is non-negotiable: if production clients trusted the Phase 1 throwaway root, Phase 2 would require re-bootstrapping them for real, defeating the rehearsal framing.

The one rule that makes Phase 1 safe#

Nothing production-critical bootstraps against the Phase 1 CA. The test cohort is 2–3 people who know they’re on a test CA and have agreed to re-bootstrap at Phase 2 cutover. No Redis, no Postgres, no SSH hosts, no CI systems, no automation.

The Phase 1 CA’s blast radius is therefore near-zero: even if someone compromised input-b during the dry-run window and stole the auto-init root key from the docker volume, the certs they could sign would be trusted only by the test cohort’s laptops — which are going to be re-bootstrapped in Phase 2 anyway. The Phase 1 root goes in the bin regardless.

The executable Phase 2 ceremony procedure lives in CCAT CA — Offline Root Ceremony Playbook and the on-server cutover in CCAT CA — HSM Cutover Playbook (post-ceremony). After the ceremony, the public artefacts (root cert, SSH CA pubkeys, fingerprint paper) flow back to git and to operator workstations as described next.

Post-ceremony distribution#

  • HSM #1 → sealed envelope with root PINs and fingerprint paper → the safe. Does not enter input-b. Ever.

  • HSM #2 → carried to the server room → installed in the R640 internal USB port → chassis closed → server returned to rack.

  • Export USB → mounted on a developer machine → public artifacts (root_ca.crt, ssh_user_ca.pub, ssh_host_ca.pub) copied into ansible/roles/ca_trust/files/ → committed to git with a clear commit message (“ca: commit public trust material from root ceremony 2026-XX-XX, fingerprint …”).

The public artifacts are safe to commit — they contain no secret material, and every client needs to be able to fetch them. The fingerprint in the commit message is the cross-check: any future developer inspecting history can verify the committed root cert matches the ceremony fingerprint on paper.

Authorization model — we trust GitHub, not email domains#

A common pitfall when wiring step-ca’s OIDC provisioner is to use the --domain flag to restrict which users can get certs. That flag checks the email claim of the OIDC token against an allowlist of domains. For a tenant that uses a single corporate email domain (Google Workspace, Microsoft 365), it’s a reasonable coarse gate.

For CCAT, it is the wrong model. Our trust chain is:

  1. Dex federates GitHub as the identity provider.

  2. Authorization is membership in the ccatobs/datacenter GitHub team, not email domain membership.

  3. Team members have wildly different email domains — uni-koeln.de, ph1.uni-koeln.de, cornell.edu, fyst.org, personal addresses. None of these reflect CCAT membership in any structural way.

Filtering by email domain is simultaneously too strict (rejects valid ccatobs members whose GitHub primary email isn’t a uni address) and too loose (accepts anyone with a uni-koeln.de email regardless of whether they’re in ccatobs — that’s a huge public domain). The script therefore omits --domain by default.

What actually provides the authorization gate:

Dex enforces ccatobs/datacenter team membership directly in its GitHub connector, before step-ca ever sees a token. Config is in step-ca/dex/config.yaml:

connectors:
  - type: github
    id: github
    config:
      orgs:
        - name: ccatobs
          teams:
            - datacenter

How it works end-to-end:

  1. User runs step ssh login. step-cli opens a browser to the CCAT-GitHub provisioner’s configured OIDC issuer (Dex).

  2. Dex redirects the browser to GitHub for OAuth.

  3. GitHub authenticates the user and returns an OAuth token with read:org scope.

  4. Dex calls GitHub’s /user/teams endpoint with that token and checks whether the user is a member of ccatobs/datacenter.

  5. If yes: Dex issues an OIDC ID token with a groups claim containing the team slug, redirects back to step-cli, step-ca validates the token, issues a 16h SSH cert. Done.

  6. If no: Dex returns an “access denied” page, no token is issued, step-cli errors out with “OIDC flow failed.” The user never reaches step-ca.
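On the step-ca side, the counterpart to the Dex connector is an OIDC provisioner pointing at Dex’s discovery endpoint. A hedged sketch of registering it with step-cli — the client id, secret variable, and listen address are illustrative; CCAT-GitHub is the provisioner name used in the flow above:

```shell
step ca provisioner add CCAT-GitHub --type OIDC \
  --client-id step-ca \
  --client-secret "$DEX_CLIENT_SECRET" \
  --configuration-endpoint https://auth.ccat.uni-koeln.de/.well-known/openid-configuration \
  --listen-address :10000
```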

Onboarding a new operator: add them to the ccatobs/datacenter team on github.com. Their next step ssh login succeeds. No CCAT-side configuration change, no admin UI to click through, no secret to rotate.

Offboarding: remove them from the team. Their current 16h cert expires within the day, and no new certs can be issued. Already-established SSH sessions persist until they disconnect (sshd checks the cert only when a connection is opened), but once the cert expires no new logins are possible. No cert revocation needed in the common case.

This is a fully automatic model: both authentication and authorization are delegated to GitHub’s team management. CCAT writes zero identity code. A GitHub outage makes new step ssh login flows unavailable until GitHub recovers (existing 16h certs keep working), which is an acceptable trade for the operational simplicity — and in practice GitHub has dramatically better uptime than any identity layer CCAT would run itself.

Why we moved off Keycloak. The prior Phase 1 setup used Keycloak as an IdP in front of GitHub. Keycloak’s built-in GitHub broker does not call the teams endpoint, only /user, so authorization had to be enforced by a manual “assign the ccatobs-member realm role” step in the Keycloak admin UI after each new user’s first login. That’s one manual onboarding step too many, and it doesn’t age well — if a user leaves the GitHub team, their Keycloak role stays assigned unless an admin remembers to clean up. Switching to Dex collapses three moving parts (Keycloak, Keycloak-db, manual role assignment) into one declarative YAML block and tracks GitHub team membership automatically.

The --domain flag remains available in the script via the ALLOWED_DOMAINS env var for cases where domain is genuinely the right gate (e.g. you’re bootstrapping a CA for a specific org that does use a uniform email domain). For CCAT, leave it unset.

SSH access tiers — narrative#

The Dex team gate answers “who may authenticate.” The separate question of “which Linux user may they become, and what happens if the IdP is down” is answered by a three-tier access model, implemented via a mix of Ansible-managed local users, AuthorizedPrincipalsFile, and the existing Nitrokey FIDO2 SSH keys.

This section explains the intended steady-state model. Implementation is landing in stages: the Tier 2 principals path is deployed by the ca_trust role, while the Tier 3 break-glass account and the migration off legacy static authorized_keys remain Phase 3 work (see the status note at the end of this section). For the lookup table that summarises the tiers in one place, see CCAT CA — Provisioner set and reference tables § “SSH access tiers”.

Tier 1 — Hard-core admins (2–3 people)

Full root access to every CCAT-managed host, with a physical second factor as the fallback for when the IdP layer is unavailable.

  • Personal Linux user on every host (e.g. buchbend), managed by Ansible users.yml, member of the wheel/sudo group.

  • Static SSH authorized_keys entry for their Nitrokey FIDO2 resident key (sk-ecdsa-sha2-nistp256@openssh.com). This key is physically bound to the dongle and cannot be cloned without the device. It is the break-glass path: if Dex is down, if GitHub is unreachable, if step-ca won’t issue, the admin still SSHes in with their dongle.

  • Also a full member of ccatobs/datacenter on GitHub, so the normal step ssh login flow works day-to-day. The Nitrokey path is the backup, not the primary.

  • Sudo permissions are granted through group membership, not through anything the SSH cert carries. A Tier 1 admin who logs in with a step-ca cert lands in the same local account and gets the same sudo rights as one who logs in with the Nitrokey — the cert/key choice is just the door, not the privilege level.

Tier 2 — Operational staff

Regular contributors who need SSH access for legitimate operational work but are not the people you wake up at 3am. The Nitrokey dependency is explicitly not required — adding hardware to every new contributor is friction that scales badly.

  • Personal Linux user on managed hosts, created by Ansible from users.yml. No wheel/sudo membership unless there’s a specific operational need.

  • No static SSH authorized_keys entry. The only path to logging in is a valid step-ca-issued SSH user cert, which requires authenticating through Dex + the GitHub team check.

  • sshd_config has AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u and TrustedUserCAKeys /etc/ssh/trusted_user_ca_keys. Each staff member gets a one-line file /etc/ssh/auth_principals/<username> containing their GitHub login as a principal. Both are rendered by the ca_trust Ansible role from the github: field on each user record in group_vars/all/users.yml — see CCAT CA — Provisioner set and reference tables for the implementation reference.

  • Off-boarding is GitHub-side: remove them from ccatobs/datacenter, their next step ssh login fails at Dex, their current cert expires within 16h, they’re out. No Ansible rerun, no manual authorized_keys surgery.

  • Rolling back a staff member to “no SSH at all” can be done either by removing them from the GitHub team (preferred, fast, no CCAT-side action) or by removing their github: field from users.yml and re-running ca_trust (the role scrubs the stale auth_principals/<user> file explicitly on the next run, which closes the cert path even if their GitHub team membership is still live).
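Put together, the sshd side of Tier 2 is two directives plus a one-line principals file per user. A hedged sketch (the username jdoe and GitHub login jdoe-gh are illustrative; the paths come from this page):

```
# /etc/ssh/sshd_config — relevant directives (rendered by ca_trust)
TrustedUserCAKeys /etc/ssh/trusted_user_ca_keys
AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u

# /etc/ssh/auth_principals/jdoe — principals allowed to log in as jdoe
jdoe-gh
```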

Tier 3 — Break-glass / emergency-only accounts

For scenarios where even Tier 1’s Nitrokey path is insufficient — the local sshd is broken, the machine is in single-user mode, the network is down — there must be a path that bypasses SSH entirely.

  • A named local user (e.g. breakglass) exists on each managed host, created by Ansible but with:

    • No password (! in /etc/shadow).

    • No authorized_keys and no entry in any auth_principals file. Cannot be reached via SSH by design.

    • Full wheel/sudo rights, so once you are them, you can recover anything.

  • Access is via the iLO/DRAC out-of-band management console on the R640, reached from the Uni Köln management VLAN. The iLO gives you a virtual keyboard at a physical login prompt, which is the one interface that still works when every network service is gone. An admin with iLO credentials logs in as the break-glass account, using either a password supplied by iLO root-recovery or the physically printed emergency password kept in the safe alongside the root HSM.

  • The break-glass path is tested during commissioning and then left alone. Using it is an incident in itself; any use should generate a postmortem.

The key property: the failure modes are orthogonal. A Dex outage takes out Tier 2 but leaves Tiers 1 and 3 intact. A GitHub outage takes out the step ssh login path for everyone, but Tier 1 falls back to Nitrokey and Tier 3 is untouched. A full network outage on input-b takes out step-ca entirely, but Tier 1’s Nitrokey path still works on every other host (their FIDO2 key is in each host’s local authorized_keys) and Tier 3 recovers the unreachable machine via iLO. No single failure, including a compromise of input-b, locks the operators out of their fleet.

Tier 2 implementation landed as an extension of the existing ca_trust role (Phase 1b). Remaining Phase 3 work items: build out the breakglass local user + iLO recovery procedure (Tier 3), and migrate each operator off the legacy static authorized_keys entries deployed by system_setup once their cert path has been used successfully for a while.

Why these lifetimes#

The provisioner lifetimes (16h / 24h / 7d / 30–90d) are deliberate and worth understanding, because “cert lifetime” often gets conflated with “security strength” when it’s really about compromise recovery time vs operational resilience. The numbers themselves live in CCAT CA — Provisioner set and reference tables; the why is here.

  • Human SSH (16h) — long enough to cover a full workday across time zones, short enough that daily re-authentication is routine. Off-boarding someone from the ccatobs GitHub org effectively revokes their SSH access within 16 hours with zero extra work: their next step ssh login fails at the GitHub OAuth step, their previous cert expires, they’re out. No authorized_keys surgery required.

  • Service SSH (24h, auto-renewed every 6h) — the service-accounts provisioner is designed for the Pattern A renewal flow described below: services run a systemd timer that calls step ssh renew every 6h, so the cert is continuously refreshed without ever touching the provisioner password again after bootstrap. A stolen cert is valid for at most 24h (and the timer would be trying to replace it during that window anyway). Rotation = rotate the provisioner password centrally, all downstream certs expire naturally within a day. Compare to classic SSH keys where compromise means “find and rotate keys on every deployed host.”

  • Service x509 (30–90d) — TLS certs for Redis, Postgres, internal APIs etc. run 30d in staging and 90d in production. Production is longer for operational resilience (a week-long CA outage doesn’t cascade into service outages); staging is shorter to exercise the renewal flow and surface any regressions before they bite prod. Services renew weekly via a short script or cert-manager-style controller.

  • SSH host certs (7d via SSHPOP) — See the detailed SSHPOP explanation below. 7 days gives plenty of slack; no reason to go longer when renewal is free.

  • ACME (90d) — matches LE convention. Any internal service that speaks ACME (cert-manager in k8s, certbot-like tools on hosts) gets the standard public-CA-equivalent lifetime.

The one non-obvious choice is service-accounts at 24h instead of 30d. A longer cert would mean fewer renewals and less operational friction, but it would also mean a compromise window measured in weeks instead of hours, and a stolen cert could quietly self-renew via step ssh renew until someone notices. 24h is the sweet spot where auto-renewal is cheap (every 6h, trivial load) and compromise is self-healing within a day.

What SSHPOP is and why it’s clever#

SSHPOP = SSH Proof Of Possession. It’s a step-ca provisioner type specifically designed for renewing SSH host certs with zero credentials stored on the host after initial bootstrap. Understanding it matters because it’s the foundation of the “SSH host certs rotate themselves forever” story in Phase 3.

The mechanism: when a host wants to renew its cert, it signs the renewal request with the private key of its currently-valid cert (which is the sshd host key — already on disk, already required for sshd to work). step-ca verifies the signature against the submitted cert, checks the cert hasn’t expired, checks it was originally issued by this CA, and issues a fresh one with the same principal.

Host                               step-ca
  │                                    │
  │ (current cert is 5 days old,       │
  │  systemd timer fires)              │
  │                                    │
  │──── step ssh renew request ───────>│
  │     (signed with current cert's    │
  │      private key, includes current │
  │      cert in the request)          │
  │                                    │
  │                                    │ SSHPOP provisioner:
  │                                    │   - Extract pubkey from current cert
  │                                    │   - Verify signature
  │                                    │   - Check not expired
  │                                    │   - Check issued-by-us
  │                                    │
  │<───── new cert, 7 days valid ──────│
  │                                    │
  │ Write to disk, SIGHUP sshd         │

Zero new secrets were used. The host proved its identity by possessing the private key that matches the current cert. Hence “Proof of Possession”. No password, no token, no provisioner credential on the host — just the sshd key which has to be there anyway.

Why only host certs? Host certs are associated with a single long-lived key (the sshd host key), so “prove possession of the current cert’s key” has a natural answer. User certs are per-session (fresh key each step ssh login), so there’s no stable key to prove possession of.

Natural forcing function: if a host falls out of rotation long enough for its cert to fully expire, SSHPOP cannot rescue it. The host has no valid cert to sign with, so renewal fails. You’d have to re-bootstrap the host with a fresh cert via a different provisioner (the JWK service-accounts provisioner). This is a feature, not a bug — it surfaces hosts that have silently fallen offline. Classic SSH host keys are forever and silently trust stale hosts; SSHPOP reflects liveness.

Phase 3 usage (not yet in place):

  1. Bootstrap host cert via the JWK service-accounts provisioner, one-time, during host provisioning (requires the password briefly, then delete it).

  2. Configure sshd: HostCertificate /etc/ssh/ssh_host_ed25519_key-cert.pub.

  3. systemd timer on each host, daily:

    step ssh renew --force /etc/ssh/ssh_host_ed25519_key-cert.pub /etc/ssh/ssh_host_ed25519_key
    
  4. Cert rotates forever, no credentials on the host after bootstrap.

  5. Clients that have ca_trust deployed (the @cert-authority line in ssh_known_hosts) automatically trust the renewed certs.
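The daily timer in step 3 could be wired up roughly as follows — a hedged sketch; the unit names and schedule are illustrative, and the renew command follows step-cli’s cert-file/key-file argument order:

```ini
# /etc/systemd/system/step-ssh-host-renew.service (sketch)
[Unit]
Description=Renew CCAT SSH host certificate via SSHPOP

[Service]
Type=oneshot
ExecStart=/usr/bin/step ssh renew --force \
    /etc/ssh/ssh_host_ed25519_key-cert.pub /etc/ssh/ssh_host_ed25519_key
ExecStartPost=/usr/bin/systemctl reload sshd

# /etc/systemd/system/step-ssh-host-renew.timer (sketch)
[Unit]
Description=Daily SSH host cert renewal

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
```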

Service-account SSH patterns#

There are two deployment patterns for machine SSH identities on CCAT, and knowing which is which keeps the threat model clear.

Pattern A — long-lived cert with auto-renewal. A service bootstraps once, gets a 24h cert, and runs a systemd timer that calls step ssh renew every 6 hours. After the one-time bootstrap, the provisioner password is no longer stored on the host — the cert is the authentication for future renewals (step ssh renew uses the current cert’s private key to authenticate to step-ca). This is the right pattern for:

  • Jenkins running on input-b — long-running daemon, lots of small SSH operations, trusted host

  • ccat_transfer (bbcp) on every input node — same profile, high-volume transfers between internal machines

  • cron-based backup scripts and similar daemons

Pattern B — per-task short-lived cert. A service has no standing SSH identity. When it needs to SSH, it calls step ssh certificate with a 5–60 minute lifetime, uses the cert for the task, discards it. The provisioner password lives in a tightly-scoped secret readable only by the job runner. Each cert issuance is a logged event in step-ca. This is the right pattern for:

  • CI runners on untrusted execution environments (cloud runners, contractor machines, shared infrastructure)

  • Rarely-run one-off jobs where maintaining a renewal timer adds more ceremony than it saves

  • Compliance-sensitive operations that need an audit entry per execution

Both patterns use the same service-accounts provisioner — the difference is how the service uses it. CCAT’s current setup (Jenkins + ccat_transfer, all on trusted hardware in a locked hall) maps cleanly to Pattern A everywhere.

Wiring Pattern A — the ssh_service_cert role (concept)#

The Ansible role ansible/roles/ssh_service_cert/ implements Pattern A. It is wired into playbook_setup_vms.yml for both input_staging and input_ccat, and it is a no-op on hosts where ssh_service_certs is empty — so adoption is opt-in per group.

What the role does, per service-account in ssh_service_certs:

  1. Install step-cli on the host (RHEL only; pinned via ssh_service_cert_step_cli_version in defaults/main.yml). Skipped if any version is already present (e.g. on input-b, where hsm_host already installed it).

  2. As the target user (become_user: <user>), bootstrap step-cli against the CCAT CA. Idempotent — gated by creates: ~/.step/config/defaults.json.

  3. Check whether ~user/.ssh/ccat_id_ed25519-cert.pub exists and is valid (step ssh inspect). If not, do a one-shot password-gated issuance: write vault_step_ca_prov_service_accounts_password to a 0400 tmpfile, run step ssh certificate ... --provisioner-password-file, delete the tmpfile in an always: block. no_log: true everywhere the password could surface in Ansible output.

  4. Install templated systemd units step-renew@.service and step-renew@.timer. Enable + start step-renew@<user>.timer. The timer fires every ssh_service_cert_renew_interval (default 6h), runs step ssh renew --force as <user>. step-cli’s renew only contacts the CA when the cert is past 2/3 of its lifetime, so over-firing is harmless.
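The templated units in step 4 could look roughly like this (the unit names come from the role; the contents, and service-account homes living under /home/, are assumptions):

```ini
# step-renew@.service
[Unit]
Description=Renew CCAT SSH cert for %i

[Service]
Type=oneshot
User=%i
# assumes the service account's home is /home/%i
ExecStart=/usr/bin/step ssh renew --force /home/%i/.ssh/ccat_id_ed25519-cert.pub

# step-renew@.timer
[Unit]
Description=Periodic CCAT SSH cert renewal for %i

[Timer]
# matches ssh_service_cert_renew_interval (default 6h); over-firing is
# harmless because step-cli only renews past 2/3 of the cert lifetime
OnUnitActiveSec=6h
OnBootSec=15min

[Install]
WantedBy=timers.target
```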

Cert files live at ~user/.ssh/ccat_id_ed25519{,.pub,-cert.pub} — the ccat_ prefix avoids collision with any pre-existing id_ed25519 the user already had. Services that consume the cert must point at this filename explicitly (-i ~/.ssh/ccat_id_ed25519 for ssh, or via their config).
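Instead of passing -i on every invocation, a consumer could pin the filename once in its ssh_config (a sketch; the Host pattern is illustrative):

```
Host input-*
    IdentityFile ~/.ssh/ccat_id_ed25519
    CertificateFile ~/.ssh/ccat_id_ed25519-cert.pub
    IdentitiesOnly yes
```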

Target-side principals. For service-account SSH (where the cert principal equals the login username), no extra config is needed beyond what ca_trust already deploys. sshd’s default rule — “if no AuthorizedPrincipalsFile matches, the cert principal must equal the login username” — applies. Cert issued with principal ccat_transfer → user logs in as ccat_transfer.

ca_trust does set AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u in sshd_config and writes per-human files for users with a github: binding. For service accounts (which are not in the users: list), no per-user file is written, so sshd falls back to the default rule.

This is why the role enforces principal == user — it’s the cheapest trust model that needs no per-host file coordination.
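Concretely, for a cert issued to ccat_transfer, the relevant target-side pieces look like this (the CA key path is illustrative):

```
# sshd_config (deployed by ca_trust)
TrustedUserCAKeys /etc/ssh/ssh_user_ca.pub
AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u

# /etc/ssh/auth_principals/ccat_transfer does not exist, so sshd falls
# back to the default rule: the cert principal must equal the login name.
# ssh ccat_transfer@host with a cert carrying principal "ccat_transfer"
# succeeds; a cert carrying any other principal is rejected.
```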

For the operational rollout steps (staging then production), see CA provisioner management § “Wiring Pattern A — Rollout staging/production”.

Jenkins — deferred. Jenkins runs in a Docker container (UID 1000 inside the container ≠ host UID 10999). The Pattern A role assumes a host-native user with a host-native ~/.ssh. For Jenkins, the cert + key need to live in the bind-mounted /data/jenkins/ (= /var/jenkins_home/ inside the container) and renewals must run with container-side ownership (1000:1000). Two reasonable shapes:

  1. Renewal via docker exec from a host-side timer: requires step-cli inside the Jenkins image, plus a chown step.

  2. A dedicated step-renewer sidecar service in docker-compose.jenkins.yml: cleaner separation, more moving parts.

Either fits cleanly on top of the existing ssh_service_cert role once we pick a shape. Tracked as a follow-up; not blocking the ccat_transfer rollout.
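Shape 2 could be sketched as a compose fragment. This is entirely illustrative — image, paths, and the renewal loop are assumptions, and this is the deferred follow-up, not current config:

```yaml
services:
  step-renewer:
    image: smallstep/step-cli
    user: "1000:1000"                      # container-side Jenkins ownership
    volumes:
      - /data/jenkins:/var/jenkins_home    # same bind mount as Jenkins
    entrypoint:
      - sh
      - -c
      - |
        while true; do
          step ssh renew --force /var/jenkins_home/.ssh/ccat_id_ed25519-cert.pub
          sleep 6h
        done
```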

Appendix: Why not Let’s Encrypt for everything?#

A fair question: if Let’s Encrypt already works for our public endpoints, why run our own CA for internal stuff?

  • Let’s Encrypt can only validate names it can see: the HTTP-01 challenge needs a publicly reachable endpoint, and DNS-01 needs the name in public DNS. Our internal Redis, Postgres, SSH host certs, and service mTLS all run on hostnames like redis.data.ccat.uni-koeln.de that exist only inside our network — neither challenge can succeed.

  • Let’s Encrypt does not issue SSH certs. SSH certs are a completely different format from X.509 TLS certs. step-ca handles both; LE only does TLS.

  • Let’s Encrypt rate limits (50 certs per week per registered domain) would be hit fast if every internal service renewed against the public CA. Our own CA has no such limit.

  • Short-lived internal TLS certs (30 days, renewed weekly) with LE would mean constantly hammering a third-party. With our own CA the operation is free and internal.

LE is the right tool for the outer boundary (the CA’s own public face). step-ca is the right tool for everything behind it.

See also#