CA rotation and disaster recovery
This page is a runbook for rotation and recovery operations against the CCAT step-ca. Every procedure here follows the same shape — When you run this · Preconditions · Steps · Verification · Rollback / escalation — so an on-call operator can execute at 02:00 without reading prose. For the why behind any of this, see CCAT Certificate Authority — Architecture and Design. For the executable offline ceremony itself, see CCAT CA — Offline Root Ceremony Playbook and CCAT CA — HSM Cutover Playbook (post-ceremony).
Rotation procedures
JWK provisioner password rotation
When you run this. On a schedule (annual is a reasonable default),
or immediately on suspicion that a JWK provisioner password has
leaked. The three JWK provisioners — prod-services,
staging-services, service-accounts — encrypt their private keys
inside ca.json with one password each; rotation re-keys those
encryptions and invalidates the old passwords as issuance credentials.
Preconditions.
You can run Ansible against input-b (make play-hsm-host succeeds in dry-run).
.ansible_vault_key is on the operator workstation.
The step-ca container is running on input-b (the rotation block edits the live ca.json via provisioners-bootstrap.sh).
Steps.
Rotate the vault var(s) on the operator workstation:
ccat secrets rotate vault_step_ca_prov_prod_services_password --env production
ccat secrets rotate vault_step_ca_prov_staging_services_password --env production
ccat secrets rotate vault_step_ca_prov_service_accounts_password --env production
Run the explicit-tag-only Ansible task. Ansible reads the four vault vars (three JWK + Dex client secret) directly from the encrypted vault, materialises them into a 0400 root-owned tmpfile on input-b, and runs provisioners-bootstrap.sh --rotate-jwk. The tmpfile is removed in an always: block; no_log: true keeps secrets out of Ansible logs.
make play-hsm-host T=hsm_host_rotate_jwk
Restart step-ca so the rewritten ca.json is loaded:
ssh input-b 'ccat ca restart step-ca'
Re-issue any short-lived service-account SSH certs minted under the old password (the next renewal cycle handles this on its own; force one via systemctl start step-renew@<service>.service if you can’t wait).
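The vault variable names in step 1 follow a visible pattern: the provisioner name with hyphens turned into underscores, wrapped in vault_step_ca_prov_ and _password. A minimal sketch of that mapping, assuming the pattern holds for all three provisioners; the function names are hypothetical helpers, and the commands are only printed, never executed:

```shell
#!/bin/sh
# Sketch only: vault_var_for and print_rotate_commands are hypothetical
# helper names, not part of the ccat tooling.

# Map a provisioner name (e.g. prod-services) to its vault variable name,
# following the naming pattern seen in the rotate commands above.
vault_var_for() {
    printf 'vault_step_ca_prov_%s_password\n' "$(printf '%s' "$1" | tr '-' '_')"
}

# Print one "ccat secrets rotate" command per JWK provisioner.
# Nothing is run; review the output before pasting it.
print_rotate_commands() {
    for prov in prod-services staging-services service-accounts; do
        printf 'ccat secrets rotate %s --env production\n' "$(vault_var_for "$prov")"
    done
}
```

Calling print_rotate_commands lists the three rotate commands in order, which makes it easy to confirm none was skipped before running the Ansible task.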
Verification.
The three JWK provisioners are still listed (rotation does not drop them, only re-keys them):
ssh input-b 'docker exec ccat-ca-step-ca-1 step ca provisioner list | jq -r ".[].name"'
A test issuance with the new password succeeds:
ssh input-b 'docker exec ccat-ca-step-ca-1 sh -c "step ca certificate test.local /tmp/t.crt /tmp/t.key --provisioner prod-services --password-file /home/step/secrets/password --force && rm -f /tmp/t.crt /tmp/t.key"'
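The provisioner-list check can be made mechanical instead of eyeballed. The following is a sketch, not part of the existing tooling: check_jwk_provisioners is a hypothetical helper that reads one provisioner name per line on stdin (the shape the jq -r output has) and fails if any of the three JWK provisioners is absent.

```shell
#!/bin/sh
# Hypothetical helper: reads provisioner names (one per line) on stdin.
# Returns 0 only if all three JWK provisioners are present; reports each
# missing name on stderr.
check_jwk_provisioners() {
    names=$(cat)
    missing=0
    for want in prod-services staging-services service-accounts; do
        # -x matches the whole line, so "prod-services-old" does not count.
        if ! printf '%s\n' "$names" | grep -qx -- "$want"; then
            echo "missing provisioner: $want" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Pipe the output of the provisioner list command into check_jwk_provisioners; a non-zero exit status means at least one provisioner was dropped.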
Rollback / escalation. Vault history (encrypted git history) is
the only roll-back: revert the ccat secrets rotate commit, re-run
step 2, restart. If the rotation produced a ca.json that step-ca
refuses to load, it falls back to the prior file via the
provisioners-bootstrap.sh backup; check journalctl -u ccat-ca-step-ca for the actual error. Existing certs remain valid
until natural expiry — rotation never invalidates issued certs.
Reference commit: 7063427 (Ansible-driven rewrite that obsoletes
the older .env-round-trip runbook).
Vhost cert routine rotation
When you run this. You don’t, normally. The
step-ca-vhost-renew.timer (every 12 h, persistent) calls
step-ca/renew-vhost-cert.sh, which only contacts the CA when
step ca renew’s threshold (~1/3 of cert lifetime) has been reached.
The lifecycle and inspection commands live in CA day-to-day operations
§ “Vhost cert lifecycle”. Run a manual force-renew only if the timer
is wedged or you need a renewal before the threshold (e.g. you
just rotated the issuing JWK provisioner password and want the new
cert to use the new key material).
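To make the threshold concrete, here is a sketch of the arithmetic only. It assumes "~1/3 of cert lifetime" means that one third of the lifetime remains; the real decision is taken by step ca renew and the helper script, and needs_renewal is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical illustration of the renewal threshold arithmetic.
# ASSUMPTION: renew once the remaining validity is at or below one third
# of the total lifetime. All three arguments are Unix epoch seconds.
# Usage: needs_renewal <not_before> <not_after> <now>
needs_renewal() {
    not_before=$1; not_after=$2; now=$3
    lifetime=$((not_after - not_before))
    remaining=$((not_after - now))
    # remaining <= lifetime / 3, written without integer division
    [ $((remaining * 3)) -le "$lifetime" ]
}
```

Under that assumption a 90-day cert becomes due at day 60, when 30 days remain; before that point the timer run would be a no-op.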
Preconditions.
You’re on input-b.
The CCAT root cert is at /etc/pki/ca-trust/source/anchors/ccat-root-ca.crt (RHEL system trust anchor).
The step-ca container is up; :9000 is reachable from input-b’s own /24.
Steps.
Trigger the renewal helper directly:
sudo /opt/data-center/system-integration/step-ca/renew-vhost-cert.sh
The helper checks PRE/POST cert mtime and only reloads nginx-proxy if the cert actually changed.
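The PRE/POST mtime comparison can be sketched as follows. This is an illustration of the technique, not the contents of renew-vhost-cert.sh; cert_mtime and changed_by are hypothetical names, and the stat syntax assumes GNU coreutils (as on RHEL).

```shell
#!/bin/sh
# Hypothetical sketch of a PRE/POST mtime check.

# Print a file's modification time as epoch seconds (GNU stat syntax).
cert_mtime() {
    stat -c %Y "$1"
}

# Run a command and report whether it changed the file's mtime.
# Returns 0 if the mtime changed, 1 if it did not, 2 if the command failed.
# Usage: changed_by <file> <command> [args...]
changed_by() {
    file=$1; shift
    pre=$(cert_mtime "$file")
    "$@" || return 2
    post=$(cert_mtime "$file")
    [ "$pre" != "$post" ]
}
```

A caller would chain the reload behind it, so the proxy is only touched when changed_by returns 0 for the cert file.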
Verification.
step certificate inspect /opt/proxy/certs/ca.ccat.uni-koeln.de.crt --short shows a fresh validity window.
echo | openssl s_client -connect ca.ccat.uni-koeln.de:443 -servername ca.ccat.uni-koeln.de 2>/dev/null | openssl x509 -noout -dates -issuer shows the new dates and the CCAT issuer.
journalctl -u step-ca-vhost-renew.service -n 20 shows the most recent run with no error.
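The on-wire check can be scripted by parsing the openssl x509 -noout -dates -issuer output. A minimal sketch: check_served_cert is a hypothetical helper, and the expected issuer substring is something you supply, since the exact CCAT issuer name is not stated here.

```shell
#!/bin/sh
# Hypothetical helper: reads "openssl x509 -noout -dates -issuer" output on
# stdin. Prints the notAfter value and returns 0 only if the issuer line
# contains the substring passed as $1 (supplied by the operator).
check_served_cert() {
    expected=$1
    out=$(cat)
    printf '%s\n' "$out" | sed -n 's/^notAfter=//p'
    printf '%s\n' "$out" | grep '^issuer=' | grep -qF -- "$expected"
}
```

Append | check_served_cert '<issuer substring>' to the openssl pipeline above; a non-zero exit status is the signal that something other than the CCAT CA (for example an LE cert) is being served.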
Rollback / escalation. The previous cert is overwritten in place; there is no automatic rollback. If the new cert doesn’t load, re-issue from scratch (see Vhost cert emergency re-issue below).
Vhost cert emergency re-issue
When you run this. The renewal timer is broken, the cert was
overwritten with an LE cert by a misconfigured acme-companion,
the cert key was leaked, or the on-wire chain doesn’t match what
the file claims. This procedure regenerates the keypair from
nothing.
Preconditions.
You’re on input-b.
The vault password file with vault_step_ca_prov_prod_services_password is reachable to the issuance script (Ansible normally materialises it; for a fully manual run, set --password-file explicitly).
The step-ca container is running; firewalld permits :9000 from input-b’s own /24.
Steps.
Re-issue the cert. The script writes /opt/proxy/certs/ca.ccat.uni-koeln.de.{crt,key} and shells out to step ca certificate with --ca-url https://ca.ccat.uni-koeln.de:9000 and the prod-services JWK provisioner:
sudo /opt/data-center/system-integration/step-ca/issue-vhost-cert.sh
Or via Ansible (the cleanest path, since it handles the JWK password tmpfile automatically):
make play-hsm-host T=hsm_host_vhost_cert_issue
Reload the proxy so the new cert is served:
ccat proxy restart
Verification. Same as routine rotation, plus a remote verification from a partner client:
step ca health # from any allowlisted client; expect: ok
Rollback / escalation. The old cert is gone; there is no rollback. If the re-issue itself fails (JWK password wrong, step-ca rejecting), the proxy keeps serving the previous cert until you reload it. Investigate the issuance script’s stderr; the most common failure is a stale JWK password (see JWK provisioner password rotation above).
Intermediate rotation (planned, every ~5 years)
When you run this. Scheduled, low-activity window. The intermediate cert is rotated proactively before its lifetime expires.
Preconditions.
HSM #1 (root) is retrievable from the safe.
A spare HSM 2 dongle is available (avoids downtime during the ceremony) or HSM #2 is unplugged-able from input-b.
Air-gapped laptop with both dongles plugged in is ready.
Steps.
Retrieve HSM #1 from the safe.
Either unplug HSM #2 from input-b, or use a fresh dongle to avoid downtime.
On the air-gapped laptop with both dongles plugged in:
Generate a new intermediate key on the target dongle.
Sign a new intermediate cert with HSM #1.
Return HSM #1 to the safe immediately.
Install the new intermediate cert on input-b; update ca.json to reference the new dongle (if swapped); restart step-ca.
Verification.
step ca certificates (against the running step-ca) shows the new intermediate.
A previously-issued client cert continues to validate (chains to the unchanged root).
A freshly-issued cert chains through the new intermediate.
Rollback / escalation. Keep the prior intermediate cert and
ca.json snapshot; if the new intermediate misbehaves, restore
both and restart. Downtime budget: 10–30 minutes depending on HSM
swap logistics.
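The rollback depends on having taken the snapshot before touching anything. One possible way to do that, as a sketch: snapshot_files and restore_file are hypothetical helpers, and the file paths you pass in are site-specific (none are prescribed here).

```shell
#!/bin/sh
# Hypothetical snapshot/restore helpers for the pre-rotation state.

# Copy the given files into a new timestamped directory under <backup-root>
# and print that directory's path.
# Usage: snapshot_files <backup-root> <file>...
snapshot_files() {
    root=$1; shift
    dir="$root/snapshot-$(date +%Y%m%dT%H%M%S)"
    mkdir -p "$dir" || return 1
    for f in "$@"; do
        cp -p "$f" "$dir/" || return 1   # -p keeps mode and timestamps
    done
    printf '%s\n' "$dir"
}

# Copy one file from a snapshot directory back to its original location.
# Usage: restore_file <snapshot-dir> <original-path>
restore_file() {
    cp -p "$1/$(basename "$2")" "$2"
}
```

Files are stored by basename, so two inputs with the same filename would collide; snapshot the prior intermediate cert and ca.json from distinct names, note the printed directory, and restart step-ca after any restore.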
Intermediate rotation (emergency, after suspected compromise)
When you run this. You believe the intermediate key has been compromised (e.g. HSM #2 access leaked, signed cert-chain anomaly detected).
Preconditions. Same as planned intermediate rotation.
Steps.
Run the planned-rotation steps above to bring up a fresh intermediate.
Remove the compromised intermediate from step-ca’s config.
Force clients to refresh their trust chain (Ansible ca_trust role re-applied).
Verification. Same as planned, plus: previously-issued certs from the compromised intermediate keep chaining successfully (this is unavoidable; their validity windows are 30–90 days). Watch logs for unexpected use of those certs over the remaining lifetime.
Rollback / escalation. No rollback — the rotation is the remediation. If you find evidence the root was also compromised, escalate to Root rotation below.
Root rotation
When you run this. Catastrophic case: the root key is compromised, or the root cert is approaching the end of its lifetime (planned). This re-bootstraps every client against a new root. Treat it as a half-day, team-coordinated event.
Preconditions.
Spare HSM is available for the new root.
Communication channel to every operator and partner is in place (you’ll be telling them to run step ca bootstrap --force with a new fingerprint).
CCAT CA — Offline Root Ceremony Playbook is current.
Steps.
Generate a new root, ceremony-style, with the spare HSM. Use CCAT CA — Offline Root Ceremony Playbook as the executable procedure.
Distribute the new root_ca.crt to every managed host via the ca_trust role (commit the new cert, run the playbook).
Coordinate every operator/partner to run step ca bootstrap --force with the new fingerprint. Update the canonical fingerprint block in Client setup — SSH with step-ca certificates.
Rotate every service config that hard-codes the root path (Settings.REDIS_CA_CERT_PATH etc.).
Dispose of the old root HSM if compromised; retain it if the rotation was lifetime-driven.
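Because every operator and partner will paste the same bootstrap line, it helps to generate it once and sanity-check the fingerprint first. A sketch under stated assumptions: the helper names are hypothetical, the fingerprint is assumed to be the 64-hex-character SHA-256 value that step certificate fingerprint prints for root_ca.crt, and the command is only printed, not run.

```shell
#!/bin/sh
# Hypothetical helpers for preparing the operator announcement.

# Return 0 if $1 looks like a SHA-256 fingerprint: exactly 64 hex characters.
is_sha256_fingerprint() {
    printf '%s' "$1" | grep -Eq '^[0-9a-fA-F]{64}$'
}

# Print (do not run) the bootstrap command for a CA URL and fingerprint.
# Usage: bootstrap_command <ca-url> <fingerprint>
bootstrap_command() {
    url=$1; fp=$2
    is_sha256_fingerprint "$fp" || { echo "bad fingerprint: $fp" >&2; return 1; }
    printf 'step ca bootstrap --ca-url %s --fingerprint %s --force\n' "$url" "$fp"
}
```

For example, bootstrap_command https://ca.ccat.uni-koeln.de:9000 "$(step certificate fingerprint root_ca.crt)" yields the line to send out; a truncated or mistyped fingerprint is rejected before it reaches anyone.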
Verification.
Every managed host: step certificate verify /etc/pki/ca-trust/source/anchors/ccat-root-ca.crt succeeds.
Every operator: step ca health succeeds against the new root.
Pattern A service-account certs renew on next timer fire under the new chain.
Rollback / escalation. None — once distributed, the new root is the root. The entire offline-root architecture exists to make this a once-in-the-lifetime-of-the-CA event.
Disaster recovery
“step-ca container won’t start”
When you run this. ccat ca status shows step-ca in a restart loop or Exited.
Preconditions. SSH access to input-b.
Steps.
Read the error first:
ccat ca logs step-ca
Diagnose in this order:
docker exec -it ccat-ca-step-ca-1 pkcs11-tool --list-slots
docker exec -it ccat-ca-step-ca-1 cat /run/secrets/hsm-pin
docker exec -it ccat-ca-step-ca-1 jq . /home/step/config/ca.json
If the HSM isn’t visible inside the container but is visible on the host (pkcs11-tool --list-slots from the host SSH session), the devices: mount in docker-compose.ca.yml is wrong: udev may have renumbered the USB bus after a reboot. Update the device path and run ccat ca restart step-ca.
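If the devices: entry uses a /dev/bus/usb/<bus>/<device> path (an assumption; check docker-compose.ca.yml), the current path can be derived from lsusb output rather than guessed. A sketch: usb_device_path is a hypothetical helper, and the vendor:product ID of the dongle is something you look up yourself.

```shell
#!/bin/sh
# Hypothetical helper: reads lsusb output on stdin and prints
# /dev/bus/usb/<bus>/<device> for the first line whose ID field equals the
# vendor:product pair given as $1. Returns 1 if no line matches.
# lsusb lines look like: "Bus 001 Device 004: ID 1d6b:0002 <description>"
usb_device_path() {
    awk -v id="$1" '$5 == "ID" && $6 == id {
        dev = $4; sub(/:$/, "", dev)       # strip the trailing colon
        printf "/dev/bus/usb/%s/%s\n", $2, dev
        found = 1; exit
    } END { exit found ? 0 : 1 }'
}
```

Run lsusb | usb_device_path <vendor:product> on the host and compare the result with the path in the compose file; if they differ, the bus was renumbered.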
Verification. ccat ca status shows step-ca Up, and
ccat ca logs step-ca | grep -i ready returns a match.
Rollback / escalation. If the issue is a corrupted ca.json,
restore from the latest backup of the step-ca-data volume. If the
HSM itself is the issue, escalate to HSM #2 has failed.
“input-b is down, CA is unreachable”
When you run this. input-b is unreachable; existing clients keep working but cannot get new certs.
Preconditions. None — clients with valid certs continue to function until expiry (16 h for SSH user certs, 30–90 d for TLS certs).
Steps.
If input-b is just offline (network, power, OS panic), bring it back online. No further action needed.
If the server itself is lost:
Provision a replacement R640 or equivalent.
Restore the relevant Docker volumes from backup (see CA day-to-day operations § “Backup”).
Move HSM #2 from the old chassis to the new one’s internal USB.
Re-point DNS if the IP changed.
ccat ca up.
Verification. Clients do not notice — the CA URL and root
fingerprint are unchanged. Confirm via step ca health from a
known-good client.
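After a restore, the CA may take a moment to come up, so a single health probe can give a false negative. A small retry loop is one way to handle that; wait_until is a hypothetical helper, and the attempt count and delay are yours to choose.

```shell
#!/bin/sh
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Output of the probed command is discarded; only its exit status is used.
# Usage: wait_until <attempts> <delay-seconds> <command> [args...]
wait_until() {
    attempts=$1; delay=$2; shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        # Sleep only if another attempt follows.
        [ "$i" -le "$attempts" ] && sleep "$delay"
    done
    return 1
}
```

From a known-good client, wait_until 10 30 step ca health would probe for up to roughly five minutes and exit non-zero if the CA never answered.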
Rollback / escalation. No rollback — the recovery is the fix.
“HSM #2 has failed”
When you run this. HSM #2 (intermediate key) is dead or unresponsive on input-b.
Preconditions.
HSM #1 is retrievable from the safe.
A new HSM 2 dongle (same model) is on hand or procurable.
Steps.
Retrieve HSM #1 from the safe.
Buy a new HSM 2 dongle (same model) if you don’t already have a spare.
Run the Intermediate rotation (planned) procedure above to produce a new intermediate on the fresh dongle.
Install the new HSM in input-b, update
ca.json, restart step-ca.
Verification. Existing certs chain to the same root and remain valid; new certs chain through the new intermediate. Downtime: ~1 hour ceremony + recovery time.
Rollback / escalation. None. If HSM #1 also fails during this procedure, escalate to HSM #1 has failed.
“HSM #1 has failed”
When you run this. HSM #1 (root key) is dead. This is a once-in-the-lifetime event — the root is exercised only at root ceremonies.
Preconditions.
A new HSM 2 (root) is available or procurable.
CCAT CA — Offline Root Ceremony Playbook is current.
Coordination with every operator and partner is possible.
Steps.
Procure a new HSM 2.
Run a full commissioning ceremony to generate a new root. Executable steps: CCAT CA — Offline Root Ceremony Playbook.
Produce a new intermediate signed by the new root.
Distribute the new root_ca.crt to every managed host via ca_trust.
Every operator runs step ca bootstrap --force with the new fingerprint. Update the canonical fingerprint block in Client setup — SSH with step-ca certificates.
Rotate every internal service config that hard-codes the root path.
Verification. Same as Root rotation above.
Rollback / escalation. None — this is Root rotation by incident rather than schedule.
See also
CCAT Certificate Authority — Architecture and Design — why two HSMs, why offline root, and what each HSM actually protects.
CCAT CA — Offline Root Ceremony Playbook — the executable offline-root ceremony procedure (used for both root rotation and HSM #1 recovery).
CCAT CA — HSM Cutover Playbook (post-ceremony) — the executable on-server cutover procedure (used after the ceremony to deploy the new root).
CA day-to-day operations — backup details and the routine vhost cert lifecycle (the happy path — this page is the unhappy path).
CA provisioner management — for adding/updating/removing provisioners; this page is for rotating their material.