# CA rotation and disaster recovery

This page is a **runbook** for rotation and recovery operations against the CCAT step-ca. Every procedure here follows the same shape — *When you run this · Preconditions · Steps · Verification · Rollback / escalation* — so an on-call operator can execute at 02:00 without reading prose.

For the *why* behind any of this, see {doc}`background/ca-architecture`. For the executable offline ceremony itself, see {doc}`ceremony/playbook` and {doc}`ceremony/cutover-playbook`.

```{contents}
:local:
:depth: 2
```

## Rotation procedures

### JWK provisioner password rotation

**When you run this.** On a schedule (annual is a reasonable default), or immediately on suspicion that a JWK provisioner password has leaked. The three JWK provisioners — `prod-services`, `staging-services`, `service-accounts` — encrypt their private keys inside `ca.json` with one password each; rotation re-keys those encryptions and invalidates the old passwords as issuance credentials.

**Preconditions.**

- You can run Ansible against `input-b` (`make play-hsm-host` succeeds in dry-run).
- `.ansible_vault_key` is on the operator workstation.
- step-ca container is running on input-b (the rotation block edits the live `ca.json` via `provisioners-bootstrap.sh`).

**Steps.**

1. Rotate the vault var(s) on the operator workstation:

   ```bash
   ccat secrets rotate vault_step_ca_prov_prod_services_password --env production
   ccat secrets rotate vault_step_ca_prov_staging_services_password --env production
   ccat secrets rotate vault_step_ca_prov_service_accounts_password --env production
   ```

2. Run the explicit-tag-only Ansible task. Ansible reads the four vault vars (three JWK + Dex client secret) directly from the encrypted vault, materialises them into a 0400 root-owned tmpfile on input-b, and runs `provisioners-bootstrap.sh --rotate-jwk`. The tmpfile is removed in an `always:` block; `no_log: true` keeps secrets out of Ansible logs.
   ```bash
   make play-hsm-host T=hsm_host_rotate_jwk
   ```

3. Restart step-ca so the rewritten `ca.json` is loaded:

   ```bash
   ssh input-b 'ccat ca restart step-ca'
   ```

4. Re-issue any short-lived service-account SSH certs minted under the old password (next renewal cycle handles this on its own; force one via `systemctl start step-renew@.service` if you can't wait).

**Verification.**

- `ssh input-b 'docker exec ccat-ca-step-ca-1 step ca provisioner list | jq -r ".[].name"'` — the three JWK provisioners are still listed (rotation does not drop them, only re-keys them).
- A test issuance with the new password succeeds:

  ```bash
  ssh input-b 'docker exec ccat-ca-step-ca-1 step ca certificate test.local /tmp/t.crt /tmp/t.key --provisioner prod-services --password-file /home/step/secrets/password --force && rm /tmp/t.{crt,key}'
  ```

**Rollback / escalation.** Vault history (encrypted git history) is the only roll-back: revert the `ccat secrets rotate` commit, re-run step 2, restart. If the rotation produced a `ca.json` that step-ca refuses to load, it falls back to the prior file via the `provisioners-bootstrap.sh` backup; check `journalctl -u ccat-ca-step-ca` for the actual error. Existing certs remain valid until natural expiry — rotation never invalidates issued certs.

Reference commit: `7063427` (Ansible-driven rewrite that obsoletes the older `.env`-round-trip runbook).

### Vhost cert routine rotation

**When you run this.** You don't, normally. The `step-ca-vhost-renew.timer` (every 12 h, persistent) calls `step-ca/renew-vhost-cert.sh`, which only contacts the CA when `step ca renew`'s threshold (~1/3 of cert lifetime) has been reached. The lifecycle and inspection commands live in {doc}`ca-day-to-day` § "Vhost cert lifecycle".

Run a manual force-renew only if the timer is wedged or you need a renewal *before* the threshold (e.g. you just rotated the issuing JWK provisioner password and want the new cert to use the new key material).

**Preconditions.**

- You're on input-b.
- The CCAT root cert is at `/etc/pki/ca-trust/source/anchors/ccat-root-ca.crt` (RHEL system trust anchor).
- step-ca container is up; `:9000` is reachable from input-b's own `/24`.

**Steps.**

1. Trigger the renewal helper directly:

   ```bash
   sudo /opt/data-center/system-integration/step-ca/renew-vhost-cert.sh
   ```

   The helper checks PRE/POST cert mtime and only reloads nginx-proxy if the cert actually changed.

**Verification.**

- `step certificate inspect /opt/proxy/certs/ca.ccat.uni-koeln.de.crt --short` shows a fresh validity window.
- `echo | openssl s_client -connect ca.ccat.uni-koeln.de:443 -servername ca.ccat.uni-koeln.de 2>/dev/null | openssl x509 -noout -dates -issuer` shows the new dates and the CCAT issuer.
- `journalctl -u step-ca-vhost-renew.service -n 20` shows the most recent run with no error.

**Rollback / escalation.** The previous cert is overwritten in place; there is no automatic rollback. If the new cert doesn't load, re-issue from scratch (see *Vhost cert emergency re-issue* below).

### Vhost cert emergency re-issue

**When you run this.** The renewal timer is broken, the cert was overwritten with an LE cert by a misconfigured `acme-companion`, the cert key was leaked, or the on-wire chain doesn't match what the file claims. This procedure regenerates the keypair from nothing.

**Preconditions.**

- You're on input-b.
- The vault password file with `vault_step_ca_prov_prod_services_password` is reachable to the issuance script (Ansible normally materialises it; for a fully manual run, set `--password-file` explicitly).
- step-ca container is running; firewalld permits `:9000` from input-b's own /24.

**Steps.**

1. Re-issue the cert.
   The script writes `/opt/proxy/certs/ca.ccat.uni-koeln.de.{crt,key}` and shells out to `step ca certificate` with `--ca-url https://ca.ccat.uni-koeln.de:9000` and the `prod-services` JWK provisioner:

   ```bash
   sudo /opt/data-center/system-integration/step-ca/issue-vhost-cert.sh
   ```

   Or via Ansible (cleanest path — handles the JWK password tmpfile automatically):

   ```bash
   make play-hsm-host T=hsm_host_vhost_cert_issue
   ```

2. Reload the proxy so the new cert is served:

   ```bash
   ccat proxy restart
   ```

**Verification.** Same as routine rotation, plus a remote verification from a partner client:

```bash
step ca health   # from any allowlisted client; expect: ok
```

**Rollback / escalation.** The old cert is gone; there is no rollback. If the re-issue itself fails (JWK password wrong, step-ca rejecting), the proxy keeps serving the previous cert until you reload it. Investigate the issuance script's stderr; the most common failure is a stale JWK password (see *JWK provisioner password rotation* above).

### Intermediate rotation (planned, every ~5 years)

**When you run this.** Scheduled, low-activity window. The intermediate cert is rotated proactively before its lifetime expires.

**Preconditions.**

- HSM #1 (root) is retrievable from the safe.
- A spare HSM 2 dongle is available (avoids downtime during the ceremony) or HSM #2 can be unplugged from input-b.
- Air-gapped laptop with both dongles plugged in is ready.

**Steps.**

1. Retrieve HSM #1 from the safe.
2. Either unplug HSM #2 from input-b, or use a fresh dongle to avoid downtime.
3. On the air-gapped laptop with both dongles plugged in:
   - Generate a new intermediate key on the target dongle.
   - Sign a new intermediate cert with HSM #1.
4. Return HSM #1 to the safe immediately.
5. Install the new intermediate cert on input-b; update `ca.json` to reference the new dongle (if swapped); restart step-ca.

**Verification.**

- `step ca certificates` (against the running step-ca) shows the new intermediate.
- A previously-issued client cert continues to validate (chains to the unchanged root).
- A freshly-issued cert chains through the new intermediate.

**Rollback / escalation.** Keep the prior intermediate cert and `ca.json` snapshot; if the new intermediate misbehaves, restore both and restart. Downtime budget: 10–30 minutes depending on HSM swap logistics.

### Intermediate rotation (emergency, after suspected compromise)

**When you run this.** You believe the intermediate key has been compromised (e.g. HSM #2 access leaked, signed cert-chain anomaly detected).

**Preconditions.** Same as planned intermediate rotation.

**Steps.**

1. Run the planned-rotation steps above to bring up a fresh intermediate.
2. Remove the compromised intermediate from step-ca's config.
3. Force clients to refresh their trust chain (Ansible `ca_trust` role re-applied).

**Verification.** Same as planned, plus: previously-issued certs *from the compromised intermediate* keep chaining successfully (this is unavoidable; their validity windows are 30–90 days). Watch logs for unexpected use of those certs over the remaining lifetime.

**Rollback / escalation.** No rollback — the rotation is the remediation. If you find evidence the root was also compromised, escalate to *Root rotation* below.

### Root rotation

**When you run this.** Catastrophic case: the root key is compromised, or the root's lifetime is approaching (planned). Re-bootstraps every client against a new root. Treat as a half-day team-coordinated event.

**Preconditions.**

- Spare HSM is available for the new root.
- Communication channel to every operator and partner is in place (you'll be telling them to run `step ca bootstrap --force` with a new fingerprint).
- {doc}`ceremony/playbook` is current.

**Steps.**

1. Generate a new root, ceremony-style, with the spare HSM. Use {doc}`ceremony/playbook` as the executable procedure.
2.
   Distribute the new `root_ca.crt` to every managed host via the `ca_trust` role (commit the new cert, run the playbook).
3. Coordinate every operator/partner to run `step ca bootstrap --force` with the new fingerprint. Update the canonical fingerprint block in {doc}`ca-client-onboarding`.
4. Rotate every service config that hard-codes the root path (`Settings.REDIS_CA_CERT_PATH` etc.).
5. Dispose of the old root HSM if compromised; retain it if the rotation was lifetime-driven.

**Verification.**

- Every managed host: `step certificate verify /etc/pki/ca-trust/source/anchors/ccat-root-ca.crt` succeeds.
- Every operator: `step ca health` succeeds against the new root.
- Pattern A service-account certs renew on next timer fire under the new chain.

**Rollback / escalation.** None — once distributed, the new root is the root. The entire offline-root architecture exists to make this a once-in-the-lifetime-of-the-CA event.

## Disaster recovery

### "step-ca container won't start"

**When you run this.** `ccat ca status` shows step-ca in a restart loop or `Exited`.

**Preconditions.** SSH access to input-b.

**Steps.**

1. Read the error first:

   ```bash
   ccat ca logs step-ca
   ```

2. Diagnose in this order:

   ```bash
   # 1. Is the HSM visible inside the container?
   docker exec -it ccat-ca-step-ca-1 pkcs11-tool --list-slots
   # 2. Is the PIN secret mounted? (this prints the PIN to your terminal)
   docker exec -it ccat-ca-step-ca-1 cat /run/secrets/hsm-pin
   # 3. Does ca.json parse as valid JSON?
   docker exec -it ccat-ca-step-ca-1 jq . /home/step/config/ca.json
   ```

3. If the HSM isn't visible inside the container but *is* visible on the host (`pkcs11-tool --list-slots` from the host SSH session), the `devices:` mount in `docker-compose.ca.yml` is wrong — udev may have renumbered the USB bus after a reboot. Update the device path and `ccat ca restart step-ca`.

**Verification.** `ccat ca status` shows step-ca `Up`, and `ccat ca logs step-ca | grep -i ready` returns a match.

**Rollback / escalation.** If the issue is a corrupted `ca.json`, restore from the latest backup of the `step-ca-data` volume. If the HSM itself is the issue, escalate to *HSM #2 has failed*.
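
The HSM triage in step 3 of this procedure can be written down as a small decision helper. This is a minimal sketch, not part of the CCAT tooling: `hsm_next_action` is a hypothetical function name, and its two `yes`/`no` arguments are the operator's own reading of `pkcs11-tool --list-slots` on the host and inside the container. Treating "not visible on the host" as a hardware problem is an assumption drawn from the escalation note in this procedure.

```bash
#!/usr/bin/env bash
# Hypothetical helper: turn the two HSM visibility observations into the
# next runbook action. Arguments are "yes" or "no":
#   $1 = HSM slot listed by pkcs11-tool on the host
#   $2 = HSM slot listed by pkcs11-tool inside the step-ca container
hsm_next_action() {
  local on_host="$1" in_container="$2"
  if [ "$on_host" != "yes" ]; then
    # Assumption: invisible on the host means the dongle itself is the problem.
    echo "escalate: HSM #2 has failed"
  elif [ "$in_container" != "yes" ]; then
    # Visible on the host only: the container's device mapping is stale.
    echo "fix devices: mount in docker-compose.ca.yml, then restart step-ca"
  else
    # Visible in both places: move on to the PIN secret and ca.json checks.
    echo "HSM visible: check hsm-pin and ca.json next"
  fi
}
```

For example, `hsm_next_action yes no` prints the `devices:` fix line, which corresponds to the udev renumbering case in step 3.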
### "input-b is down, CA is unreachable"

**When you run this.** input-b is unreachable; existing clients keep working but cannot get new certs.

**Preconditions.** None — clients with valid certs continue to function until expiry (16 h for SSH user certs, 30–90 d for TLS certs).

**Steps.**

1. If input-b is just offline (network, power, OS panic), bring it back online. No further action needed.
2. If the server itself is lost:
   1. Provision a replacement R640 or equivalent.
   2. Restore the relevant Docker volumes from backup (see {doc}`ca-day-to-day` § "Backup").
   3. Move HSM #2 from the old chassis to the new one's internal USB.
   4. Re-point DNS if the IP changed.
   5. `ccat ca up`.

**Verification.** Clients do not notice — the CA URL and root fingerprint are unchanged. Confirm via `step ca health` from a known-good client.

**Rollback / escalation.** No rollback — the recovery is the fix.

### "HSM #2 has failed"

**When you run this.** HSM #2 (intermediate key) is dead or unresponsive on input-b.

**Preconditions.**

- HSM #1 is retrievable from the safe.
- A new HSM 2 dongle (same model) is on hand or procurable.

**Steps.**

1. Retrieve HSM #1 from the safe.
2. Buy a new HSM 2 dongle (same model) if you don't already have a spare.
3. Run the *Intermediate rotation (planned)* procedure above to produce a new intermediate on the fresh dongle.
4. Install the new HSM in input-b, update `ca.json`, restart step-ca.

**Verification.** Existing certs chain to the same root and remain valid; new certs chain through the new intermediate. Downtime: ~1 hour ceremony + recovery time.

**Rollback / escalation.** None. If HSM #1 also fails during this procedure, escalate to *HSM #1 has failed*.

### "HSM #1 has failed"

**When you run this.** HSM #1 (root key) is dead. This is a once-in-the-lifetime event — the root is exercised only at root ceremonies.

**Preconditions.**

- A new HSM 2 (root) is available or procurable.
- {doc}`ceremony/playbook` is current.
- Coordination with every operator and partner is possible.

**Steps.**

1. Procure a new HSM 2.
2. Run a **full commissioning ceremony** to generate a new root. Executable steps: {doc}`ceremony/playbook`.
3. Produce a new intermediate signed by the new root.
4. Distribute the new `root_ca.crt` to every managed host via `ca_trust`.
5. Every operator runs `step ca bootstrap --force` with the new fingerprint. Update the canonical fingerprint block in {doc}`ca-client-onboarding`.
6. Rotate every internal service config that hard-codes the root path.

**Verification.** Same as *Root rotation* above.

**Rollback / escalation.** None — this is *Root rotation* by incident rather than schedule.

## See also

- {doc}`background/ca-architecture` — why two HSMs, why offline root, and what each HSM actually protects.
- {doc}`ceremony/playbook` — the executable offline-root ceremony procedure (used for both root rotation and HSM #1 recovery).
- {doc}`ceremony/cutover-playbook` — the executable on-server cutover procedure (used after the ceremony to deploy the new root).
- {doc}`ca-day-to-day` — backup details and the routine vhost cert lifecycle (the happy path — this page is the unhappy path).
- {doc}`ca-provisioner-management` — for adding/updating/removing provisioners; this page is for *rotating* their material.