CA rotation and disaster recovery

This page is a runbook for rotation and recovery operations against the CCAT step-ca. Every procedure here follows the same shape — When you run this · Preconditions · Steps · Verification · Rollback / escalation — so an on-call operator can execute at 02:00 without reading prose. For the why behind any of this, see CCAT Certificate Authority — Architecture and Design. For the executable offline ceremony itself, see CCAT CA — Offline Root Ceremony Playbook and CCAT CA — HSM Cutover Playbook (post-ceremony).

Rotation procedures

JWK provisioner password rotation

When you run this. On a schedule (annual is a reasonable default), or immediately on suspicion that a JWK provisioner password has leaked. The three JWK provisioners — prod-services, staging-services, service-accounts — encrypt their private keys inside ca.json with one password each; rotation re-keys those encryptions and invalidates the old passwords as issuance credentials.

Preconditions.

  • You can run Ansible against input-b (make play-hsm-host succeeds in dry-run).

  • .ansible_vault_key is on the operator workstation.

  • step-ca container is running on input-b (the rotation block edits the live ca.json via provisioners-bootstrap.sh).

Steps.

  1. Rotate the vault var(s) on the operator workstation:

    ccat secrets rotate vault_step_ca_prov_prod_services_password    --env production
    ccat secrets rotate vault_step_ca_prov_staging_services_password --env production
    ccat secrets rotate vault_step_ca_prov_service_accounts_password --env production
    
  2. Run the explicit-tag-only Ansible task. Ansible reads the four vault vars (three JWK + Dex client secret) directly from the encrypted vault, materialises them into a 0400 root-owned tmpfile on input-b, and runs provisioners-bootstrap.sh --rotate-jwk. The tmpfile is removed in an always: block; no_log: true keeps secrets out of Ansible logs.

    make play-hsm-host T=hsm_host_rotate_jwk
    
  3. Restart step-ca so the rewritten ca.json is loaded:

    ssh input-b 'ccat ca restart step-ca'
    
  4. Re-issue any short-lived service-account SSH certs minted under the old password (next renewal cycle handles this on its own; force one via systemctl start step-renew@<service>.service if you can’t wait).
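The tmpfile handling described in step 2 can be sketched in plain shell. This is only an illustration of the pattern (private file, passed by path, removed whether or not the command succeeds); with_secret_file is a hypothetical helper and not part of the CCAT Ansible role or provisioners-bootstrap.sh.

```shell
# Sketch only: with_secret_file is a hypothetical helper, not CCAT tooling.
# Usage: with_secret_file SECRET COMMAND [ARGS...]
# The command receives the path of the secret file as its last argument.
with_secret_file() {
    secret=$1
    shift
    old_umask=$(umask)
    umask 077                          # file is private from the moment it exists
    tmpfile=$(mktemp) || { umask "$old_umask"; return 1; }
    umask "$old_umask"
    printf '%s' "$secret" > "$tmpfile"
    chmod 0400 "$tmpfile"              # read-only for the owner, as in the runbook
    status=0
    "$@" "$tmpfile" || status=$?       # remember the result instead of aborting
    rm -f "$tmpfile"                   # cleanup runs on success and on failure
    return "$status"
}
```

Capturing the exit status before the rm is what plays the role of Ansible's always: block here. A production version would also want a trap so the file is removed if the shell is interrupted.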

Verification.

  • ssh input-b 'docker exec ccat-ca-step-ca-1 step ca provisioner list | jq -r ".[].name"' — the three JWK provisioners are still listed (rotation does not drop them, only re-keys them).

  • A test issuance with the new password succeeds:

    ssh input-b 'docker exec ccat-ca-step-ca-1 sh -c "step ca certificate test.local /tmp/t.crt /tmp/t.key --provisioner prod-services --password-file /home/step/secrets/password --force && rm /tmp/t.crt /tmp/t.key"'
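The provisioner-list check above can be made mechanical so the operator does not have to eyeball the output. The following is a minimal sketch; check_jwk_provisioners is a hypothetical helper name, and it assumes the names arrive one per line, as the jq -r filter above produces them.

```shell
# Sketch only: check_jwk_provisioners is a hypothetical helper.
# Reads provisioner names on stdin, one per line.
# Exit status 0 means all three JWK provisioners are present.
check_jwk_provisioners() {
    names=$(cat)
    missing=0
    for want in prod-services staging-services service-accounts; do
        # -x matches the whole line, -F treats the name as a fixed string
        if ! printf '%s\n' "$names" | grep -qxF "$want"; then
            echo "missing provisioner: $want" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage, fed by the listing command from the Verification step:
#   ssh input-b 'docker exec ccat-ca-step-ca-1 step ca provisioner list | jq -r ".[].name"' \
#       | check_jwk_provisioners
```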
    

Rollback / escalation. The encrypted vault's git history is the only rollback path: revert the ccat secrets rotate commit, re-run step 2, and restart step-ca. If the rotation produced a ca.json that step-ca refuses to load, the prior file is restored from the provisioners-bootstrap.sh backup; check journalctl -u ccat-ca-step-ca for the actual error. Existing certs remain valid until natural expiry; rotation never invalidates issued certs. Reference commit: 7063427 (Ansible-driven rewrite that obsoletes the older .env-round-trip runbook).

Vhost cert routine rotation

When you run this. You don’t, normally. The step-ca-vhost-renew.timer (every 12 h, persistent) calls step-ca/renew-vhost-cert.sh, which only contacts the CA when step ca renew’s threshold (~1/3 of cert lifetime) has been reached. The lifecycle and inspection commands live in CA day-to-day operations § “Vhost cert lifecycle”. Run a manual force-renew only if the timer is wedged or you need a renewal before the threshold (e.g. you just rotated the issuing JWK provisioner password and want the new cert to use the new key material).
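To judge whether a manual run would actually renew anything, the threshold arithmetic can be sketched as below. This assumes the threshold means "no more than one third of the lifetime is left"; the exact rule inside renew-vhost-cert.sh is not shown in this runbook, and renewal_due is a hypothetical helper.

```shell
# Sketch only: renewal_due is hypothetical and assumes the threshold is
# "one third of the lifetime remaining".
# Usage: renewal_due NOT_BEFORE NOT_AFTER NOW   (all Unix epoch seconds)
# Exit status 0 means the cert is at or past the renewal threshold.
renewal_due() {
    lifetime=$(( $2 - $1 ))
    remaining=$(( $2 - $3 ))
    # remaining <= lifetime / 3, written without division to avoid rounding
    [ $(( remaining * 3 )) -le "$lifetime" ]
}
```

The epoch values can be derived from the notBefore and notAfter dates printed by openssl x509 -noout -dates. For a 90-day cert this rule reports "due" once 30 days or fewer remain.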

Preconditions.

  • You’re on input-b.

  • The CCAT root cert is at /etc/pki/ca-trust/source/anchors/ccat-root-ca.crt (RHEL system trust anchor).

  • step-ca container is up; :9000 is reachable from input-b’s own /24.

Steps.

  1. Trigger the renewal helper directly:

    sudo /opt/data-center/system-integration/step-ca/renew-vhost-cert.sh
    

    The helper checks PRE/POST cert mtime and only reloads nginx-proxy if the cert actually changed.
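The PRE/POST mtime comparison can be sketched as follows. This is an illustration of the idea rather than the helper's actual code; reload_if_changed is a hypothetical name, the renew and reload commands are passed in by the caller, and stat -c %Y assumes GNU coreutils (as on RHEL).

```shell
# Sketch only: reload_if_changed is hypothetical, not the real helper.
# Usage: reload_if_changed CERT_FILE RENEW_CMD RELOAD_CMD
reload_if_changed() {
    cert=$1
    renew=$2
    reload=$3
    pre=$(stat -c %Y "$cert")      # mtime before renewal (epoch seconds, GNU stat)
    $renew || return 1             # a failed renewal never triggers a reload
    post=$(stat -c %Y "$cert")     # mtime after renewal
    if [ "$pre" != "$post" ]; then
        $reload
    else
        echo "cert unchanged, skipping reload"
    fi
}
```

Comparing mtimes keeps the proxy from being reloaded on every timer fire when the CA was never contacted.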

Verification.

  • step certificate inspect /opt/proxy/certs/ca.ccat.uni-koeln.de.crt --short shows a fresh validity window.

  • echo | openssl s_client -connect ca.ccat.uni-koeln.de:443 -servername ca.ccat.uni-koeln.de 2>/dev/null | openssl x509 -noout -dates -issuer shows the new dates and the CCAT issuer.

  • journalctl -u step-ca-vhost-renew.service -n 20 shows the most recent run with no error.

Rollback / escalation. The previous cert is overwritten in place; there is no automatic rollback. If the new cert doesn’t load, re-issue from scratch (see Vhost cert emergency re-issue below).

Vhost cert emergency re-issue

When you run this. The renewal timer is broken, the cert was overwritten with an LE cert by a misconfigured acme-companion, the cert key was leaked, or the on-wire chain doesn’t match what the file claims. This procedure regenerates the keypair from nothing.

Preconditions.

  • You’re on input-b.

  • The vault password file with vault_step_ca_prov_prod_services_password is reachable to the issuance script (Ansible normally materialises it; for a fully manual run, set --password-file explicitly).

  • step-ca container is running; firewalld permits :9000 from input-b’s own /24.

Steps.

  1. Re-issue the cert. The script writes /opt/proxy/certs/ca.ccat.uni-koeln.de.{crt,key} and shells out to step ca certificate with --ca-url https://ca.ccat.uni-koeln.de:9000 and the prod-services JWK provisioner:

    sudo /opt/data-center/system-integration/step-ca/issue-vhost-cert.sh
    

    Or via Ansible (cleanest path — handles the JWK password tmpfile automatically):

    make play-hsm-host T=hsm_host_vhost_cert_issue
    
  2. Reload the proxy so the new cert is served:

    ccat proxy restart
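For a fully manual run, the issuance call described in step 1 might look roughly like the sketch below. The paths, CA URL and provisioner name are taken from this page; the subject name (inferred from the cert file name), the use of --root, and the wrapper name issue_vhost_cert are assumptions, and the real issue-vhost-cert.sh may differ.

```shell
# Sketch only: issue_vhost_cert is a hypothetical wrapper, and the subject
# name and --root are assumptions about what issue-vhost-cert.sh does.
# Usage: issue_vhost_cert PASSWORD_FILE
issue_vhost_cert() {
    password_file=$1
    step ca certificate ca.ccat.uni-koeln.de \
        /opt/proxy/certs/ca.ccat.uni-koeln.de.crt \
        /opt/proxy/certs/ca.ccat.uni-koeln.de.key \
        --ca-url https://ca.ccat.uni-koeln.de:9000 \
        --root /etc/pki/ca-trust/source/anchors/ccat-root-ca.crt \
        --provisioner prod-services \
        --provisioner-password-file "$password_file" \
        --force
}
```

The password file should hold the current prod-services JWK password and be removed as soon as the call returns.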
    

Verification. Same as routine rotation, plus a remote verification from a partner client:

    step ca health  # from any allowlisted client; expect: ok

Rollback / escalation. The old cert is gone; there is no rollback. If the re-issue itself fails (wrong JWK password, step-ca rejecting the request), the proxy keeps serving the previous cert until you reload it. Investigate the issuance script’s stderr; the most common failure is a stale JWK password (see JWK provisioner password rotation above).

Intermediate rotation (planned, every ~5 years)

When you run this. Scheduled, low-activity window. The intermediate cert is rotated proactively before its lifetime expires.

Preconditions.

  • HSM #1 (root) is retrievable from the safe.

  • A spare HSM 2 dongle is available (avoids downtime during the ceremony), or HSM #2 can be unplugged from input-b.

  • Air-gapped laptop with both dongles plugged in is ready.

Steps.

  1. Retrieve HSM #1 from the safe.

  2. Either unplug HSM #2 from input-b, or use a fresh dongle to avoid downtime.

  3. On the air-gapped laptop with both dongles plugged in:

    • Generate a new intermediate key on the target dongle.

    • Sign a new intermediate cert with HSM #1.

  4. Return HSM #1 to the safe immediately.

  5. Install the new intermediate cert on input-b; update ca.json to reference the new dongle (if swapped); restart step-ca.

Verification.

  • step ca certificates (against the running step-ca) shows the new intermediate.

  • A previously-issued client cert continues to validate (chains to the unchanged root).

  • A freshly-issued cert chains through the new intermediate.
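The two chain checks above can be run with openssl verify once the relevant certificates are available as PEM files. The sketch below is one way to do it; verify_chain and check_rotation are hypothetical helpers and every file name passed to them is a placeholder.

```shell
# Sketch only: helper names and all file arguments are placeholders.
# Usage: verify_chain ROOT INTERMEDIATE LEAF
verify_chain() {
    # The root is the trust anchor; the intermediate is supplied as an
    # untrusted helper cert that must itself chain to the root.
    openssl verify -CAfile "$1" -untrusted "$2" "$3"
}

# Usage: check_rotation ROOT OLD_INTERMEDIATE NEW_INTERMEDIATE OLD_LEAF NEW_LEAF
check_rotation() {
    root=$1 old_int=$2 new_int=$3 old_leaf=$4 new_leaf=$5
    verify_chain "$root" "$old_int" "$old_leaf" >/dev/null ||
        { echo "previously issued cert no longer validates" >&2; return 1; }
    verify_chain "$root" "$new_int" "$new_leaf" >/dev/null ||
        { echo "fresh cert does not chain through the new intermediate" >&2; return 1; }
    echo "both chains verify"
}
```

Note that openssl may also consult the system trust store, so run this on a host where only the CCAT root would satisfy the chain if you need a strict answer.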

Rollback / escalation. Keep the prior intermediate cert and ca.json snapshot; if the new intermediate misbehaves, restore both and restart. Downtime budget: 10–30 minutes depending on HSM swap logistics.
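Taking the snapshot before touching anything is what makes this rollback possible. A minimal sketch, assuming a timestamped copy next to the original is acceptable; snapshot_file is a hypothetical helper and the backup location is the operator's choice.

```shell
# Sketch only: snapshot_file is a hypothetical helper.
# Usage: snapshot_file PATH
# Prints the path of the snapshot on success.
snapshot_file() {
    src=$1
    dest="$src.$(date -u +%Y%m%dT%H%M%SZ).bak"   # UTC timestamp suffix
    cp -p "$src" "$dest" && printf '%s\n' "$dest" # -p keeps mode and mtime
}
```

Run it once for ca.json and once for the current intermediate cert before step 5, and note the printed paths in the change log.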

Intermediate rotation (emergency, after suspected compromise)

When you run this. You believe the intermediate key has been compromised (e.g. HSM #2 access leaked, signed cert-chain anomaly detected).

Preconditions. Same as planned intermediate rotation.

Steps.

  1. Run the planned-rotation steps above to bring up a fresh intermediate.

  2. Remove the compromised intermediate from step-ca’s config.

  3. Force clients to refresh their trust chain (Ansible ca_trust role re-applied).

Verification. Same as planned, plus: previously-issued certs from the compromised intermediate keep chaining successfully (this is unavoidable; their validity windows are 30–90 days). Watch logs for unexpected use of those certs over the remaining lifetime.

Rollback / escalation. No rollback — the rotation is the remediation. If you find evidence the root was also compromised, escalate to Root rotation below.

Root rotation

When you run this. Catastrophic case: the root key is compromised, or the root is approaching the end of its lifetime (planned). This re-bootstraps every client against a new root. Treat it as a half-day, team-coordinated event.

Preconditions.

  • Spare HSM is available for the new root.

  • Communication channel to every operator and partner is in place (you’ll be telling them to run step ca bootstrap --force with a new fingerprint).

  • CCAT CA — Offline Root Ceremony Playbook is current.

Steps.

  1. Generate a new root, ceremony-style, with the spare HSM. Use CCAT CA — Offline Root Ceremony Playbook as the executable procedure.

  2. Distribute the new root_ca.crt to every managed host via the ca_trust role (commit the new cert, run the playbook).

  3. Coordinate every operator/partner to run step ca bootstrap --force with the new fingerprint. Update the canonical fingerprint block in Client setup — SSH with step-ca certificates.

  4. Rotate every service config that hard-codes the root path (Settings.REDIS_CA_CERT_PATH etc.).

  5. Dispose of the old root HSM if compromised; retain it if the rotation was lifetime-driven.
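When coordinating step 3, it helps to send everyone the exact same command line. The sketch below builds it from the CA URL and the new fingerprint; bootstrap_command is a hypothetical helper, and the check assumes the fingerprint is the 64-character lowercase hex SHA-256 that step certificate fingerprint root_ca.crt prints by default.

```shell
# Sketch only: bootstrap_command is a hypothetical helper.
# Usage: bootstrap_command CA_URL FINGERPRINT
bootstrap_command() {
    ca_url=$1
    fingerprint=$2
    # Assumption: fingerprints are 64 lowercase hex characters (SHA-256).
    case $fingerprint in
        ''|*[!0-9a-f]*) echo "fingerprint must be lowercase hex" >&2; return 1 ;;
    esac
    [ "${#fingerprint}" -eq 64 ] ||
        { echo "fingerprint must be 64 hex characters" >&2; return 1; }
    printf 'step ca bootstrap --ca-url %s --fingerprint %s --force\n' \
        "$ca_url" "$fingerprint"
}
```

Rejecting malformed fingerprints before the message goes out avoids a second round of corrections during an already stressful event.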

Verification.

  • Every managed host: step certificate verify /etc/pki/ca-trust/source/anchors/ccat-root-ca.crt succeeds.

  • Every operator: step ca health succeeds against the new root.

  • Pattern A service-account certs renew on next timer fire under the new chain.

Rollback / escalation. None — once distributed, the new root is the root. The entire offline-root architecture exists to make this a once-in-the-lifetime-of-the-CA event.

Disaster recovery

“step-ca container won’t start”

When you run this. ccat ca status shows step-ca in a restart loop or Exited.

Preconditions. SSH access to input-b.

Steps.

  1. Read the error first:

    ccat ca logs step-ca
    
  2. Diagnose in this order:

    docker exec -it ccat-ca-step-ca-1 pkcs11-tool --list-slots
    docker exec -it ccat-ca-step-ca-1 cat /run/secrets/hsm-pin
    docker exec -it ccat-ca-step-ca-1 jq . /home/step/config/ca.json
    
  3. If the HSM isn’t visible inside the container but is visible on the host (pkcs11-tool --list-slots from the host SSH session), the devices: mount in docker-compose.ca.yml is wrong — udev may have renumbered the USB bus after a reboot. Update the device path and ccat ca restart step-ca.
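The host-versus-container comparison in step 3 reduces to a small decision table, sketched below. classify_hsm_visibility is a hypothetical helper; the operator supplies "yes" or "no" for each side after reading the pkcs11-tool --list-slots output, since this sketch makes no assumption about that tool's exit status.

```shell
# Sketch only: classify_hsm_visibility is a hypothetical helper.
# Usage: classify_hsm_visibility HOST_SEES CONTAINER_SEES   (each "yes" or "no")
classify_hsm_visibility() {
    host_sees=$1
    container_sees=$2
    if [ "$host_sees" != yes ]; then
        echo "host cannot see the HSM: suspect the dongle itself"
    elif [ "$container_sees" != yes ]; then
        echo "host sees the HSM, container does not: check the devices: mount in docker-compose.ca.yml"
    else
        echo "HSM visible on host and in container: look at the PIN secret and ca.json next"
    fi
}
```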

Verification. ccat ca status shows step-ca Up, and ccat ca logs step-ca | grep -i ready returns a match.

Rollback / escalation. If the issue is a corrupted ca.json, restore from the latest backup of the step-ca-data volume. If the HSM itself is the issue, escalate to “HSM #2 has failed” below.

“input-b is down, CA is unreachable”

When you run this. input-b is unreachable; existing clients keep working but cannot get new certs.

Preconditions. None — clients with valid certs continue to function until expiry (16 h for SSH user certs, 30–90 d for TLS certs).

Steps.

  1. If input-b is just offline (network, power, OS panic), bring it back online. No further action needed.

  2. If the server itself is lost:

    1. Provision a replacement R640 or equivalent.

    2. Restore the relevant Docker volumes from backup (see CA day-to-day operations § “Backup”).

    3. Move HSM #2 from the old chassis to the new one’s internal USB port.

    4. Re-point DNS if the IP changed.

    5. ccat ca up.

Verification. Clients do not notice — the CA URL and root fingerprint are unchanged. Confirm via step ca health from a known-good client.

Rollback / escalation. No rollback — the recovery is the fix.
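The expiry windows in the Preconditions set the real deadline for this recovery. A small arithmetic sketch for estimating how long a given cert still has; seconds_left is a hypothetical helper and the issue time has to come from the cert itself.

```shell
# Sketch only: seconds_left is a hypothetical helper.
# Usage: seconds_left ISSUED_AT LIFETIME NOW
#   ISSUED_AT and NOW are Unix epoch seconds, LIFETIME is in seconds.
seconds_left() {
    left=$(( $1 + $2 - $3 ))
    if [ "$left" -lt 0 ]; then left=0; fi   # already expired
    echo "$left"
}

# Example: a 16 h SSH user cert (57600 s) checked 10 h (36000 s) after issue:
#   seconds_left 0 57600 36000    # prints 21600, i.e. 6 h remain
```

SSH user certs are the ones that run out first, so operators lose access within 16 hours of their last login unless the CA is back.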

“HSM #2 has failed”

When you run this. HSM #2 (intermediate key) is dead or unresponsive on input-b.

Preconditions.

  • HSM #1 is retrievable from the safe.

  • A new HSM 2 dongle (same model) is on hand or procurable.

Steps.

  1. Retrieve HSM #1 from the safe.

  2. Buy a new HSM 2 dongle (same model) if you don’t already have a spare.

  3. Run the Intermediate rotation (planned) procedure above to produce a new intermediate on the fresh dongle.

  4. Install the new HSM in input-b, update ca.json, restart step-ca.

Verification. Existing certs chain to the same root and remain valid; new certs chain through the new intermediate. Downtime: ~1 hour ceremony + recovery time.

Rollback / escalation. None. If HSM #1 also fails during this procedure, escalate to “HSM #1 has failed” below.

“HSM #1 has failed”

When you run this. HSM #1 (root key) is dead. This is a once-in-a-lifetime event: the root is exercised only at root ceremonies.

Preconditions.

Steps.

  1. Procure a new HSM 2.

  2. Run a full commissioning ceremony to generate a new root. Executable steps: CCAT CA — Offline Root Ceremony Playbook.

  3. Produce a new intermediate signed by the new root.

  4. Distribute the new root_ca.crt to every managed host via ca_trust.

  5. Every operator runs step ca bootstrap --force with the new fingerprint. Update the canonical fingerprint block in Client setup — SSH with step-ca certificates.

  6. Rotate every internal service config that hard-codes the root path.

Verification. Same as Root rotation above.

Rollback / escalation. None — this is Root rotation by incident rather than schedule.

See also