# CA rotation and disaster recovery

This page is a **runbook** for rotation and recovery operations against
the CCAT step-ca. Every procedure here follows the same shape — *When
you run this · Preconditions · Steps · Verification · Rollback /
escalation* — so an on-call operator can execute at 02:00 without
reading prose. For the *why* behind any of this, see
{doc}`background/ca-architecture`. For the executable offline ceremony
itself, see {doc}`ceremony/playbook` and {doc}`ceremony/cutover-playbook`.

```{contents}
:local:
:depth: 2
```

## Rotation procedures

### JWK provisioner password rotation

**When you run this.** On a schedule (annual is a reasonable default),
or immediately on suspicion that a JWK provisioner password has
leaked. The three JWK provisioners — `prod-services`,
`staging-services`, `service-accounts` — encrypt their private keys
inside `ca.json` with one password each; rotation re-keys those
encryptions and invalidates the old passwords as issuance credentials.

**Preconditions.**
- You can run Ansible against `input-b` (`make play-hsm-host`
  succeeds in dry-run).
- `.ansible_vault_key` is on the operator workstation.
- step-ca container is running on input-b (the rotation block
  edits the live `ca.json` via `provisioners-bootstrap.sh`).

**Steps.**

1. Rotate the vault var(s) on the operator workstation:

   ```bash
   ccat secrets rotate vault_step_ca_prov_prod_services_password    --env production
   ccat secrets rotate vault_step_ca_prov_staging_services_password --env production
   ccat secrets rotate vault_step_ca_prov_service_accounts_password --env production
   ```

2. Run the explicit-tag-only Ansible task. Ansible reads the four
   vault vars (three JWK + Dex client secret) directly from the
   encrypted vault, materialises them into a 0400 root-owned tmpfile
   on input-b, and runs `provisioners-bootstrap.sh --rotate-jwk`.
   The tmpfile is removed in an `always:` block; `no_log: true`
   keeps secrets out of Ansible logs.

   ```bash
   make play-hsm-host T=hsm_host_rotate_jwk
   ```

3. Restart step-ca so the rewritten `ca.json` is loaded:

   ```bash
   ssh input-b 'ccat ca restart step-ca'
   ```

4. Re-issue any short-lived service-account SSH certs minted under
   the old password (next renewal cycle handles this on its own;
   force one via `systemctl start step-renew@<service>.service` if
   you can't wait).
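
Steps 1 to 3 can be chained so that a failure stops the sequence before step-ca is restarted. The following is a minimal sketch, not part of the repository: the function name `rotate_jwk_passwords` is hypothetical, the commands are copied from the steps above, and it assumes `ccat secrets rotate` can run without interactive prompts.

```bash
# Sketch: run the rotation steps in order, stopping at the first failure.
# Only the wrapper name (rotate_jwk_passwords) is hypothetical; the
# commands themselves are the ones listed in the steps above.
rotate_jwk_passwords() {
    for var in \
        vault_step_ca_prov_prod_services_password \
        vault_step_ca_prov_staging_services_password \
        vault_step_ca_prov_service_accounts_password
    do
        ccat secrets rotate "$var" --env production || return 1
    done
    # Push the new passwords into ca.json on input-b.
    make play-hsm-host T=hsm_host_rotate_jwk || return 1
    # Restart step-ca so it loads the rewritten ca.json.
    ssh input-b 'ccat ca restart step-ca' || return 1
}
```

Call `rotate_jwk_passwords` from the operator workstation; a non-zero exit status means a step failed and the later steps were not run.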

**Verification.**
- `ssh input-b 'docker exec ccat-ca-step-ca-1 step ca provisioner list | jq -r ".[].name"'`
  — the three JWK provisioners are still listed (rotation does not
  drop them, only re-keys them).
- A test issuance with the new password succeeds:

  ```bash
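  # Issue and clean up inside the container, where /tmp/t.crt and /tmp/t.key are written.
  ssh input-b 'docker exec ccat-ca-step-ca-1 sh -c "step ca certificate test.local /tmp/t.crt /tmp/t.key --provisioner prod-services --password-file /home/step/secrets/password --force && rm -f /tmp/t.crt /tmp/t.key"'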
  ```
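
The provisioner check above can be made pass/fail instead of read by eye. This is a small sketch under the assumption that the names arrive one per line, as the `jq -r ".[].name"` filter prints them; the function name `missing_jwk_provisioners` is hypothetical.

```bash
# Reads provisioner names (one per line) on stdin and prints each of the
# three JWK provisioners named on this page that is missing.
# Exit status is 0 only when all three are present.
missing_jwk_provisioners() {
    names=$(cat)
    missing=0
    for want in prod-services staging-services service-accounts; do
        if ! printf '%s\n' "$names" | grep -qxF -- "$want"; then
            echo "missing: $want"
            missing=1
        fi
    done
    return "$missing"
}
```

Pipe the output of the listing command into it, for example `ssh input-b '...' | missing_jwk_provisioners`, and treat any output as a failed verification.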

**Rollback / escalation.** Vault history (the encrypted git history)
is the only rollback path: revert the `ccat secrets rotate` commit,
re-run step 2, and restart step-ca. If the rotation produced a
`ca.json` that step-ca refuses to load, the prior file is restored
via the `provisioners-bootstrap.sh` backup; check `journalctl -u
ccat-ca-step-ca` for the actual error. Existing certs remain valid
until natural expiry; rotation never invalidates issued certs.
Reference commit: `7063427` (Ansible-driven rewrite that obsoletes
the older `.env`-round-trip runbook).

### Vhost cert routine rotation

**When you run this.** You don't, normally. The
`step-ca-vhost-renew.timer` (every 12 h, persistent) calls
`step-ca/renew-vhost-cert.sh`, which only contacts the CA when
`step ca renew`'s threshold (~1/3 of cert lifetime) has been reached.
The lifecycle and inspection commands live in {doc}`ca-day-to-day`
§ "Vhost cert lifecycle". Run a manual force-renew only if the timer
is wedged or you need a renewal *before* the threshold (e.g. you
just rotated the issuing JWK provisioner password and want the new
cert to use the new key material).
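
To judge whether a cert should already have renewed on its own, the threshold can be worked out by hand. The sketch below reads "~1/3 of cert lifetime" as "one third or less of the lifetime remains"; that reading is an assumption, and the real decision is made by `step ca renew`, not by this function. The name `in_renewal_window` is hypothetical.

```bash
# in_renewal_window NOT_BEFORE NOT_AFTER NOW   (all Unix epoch seconds)
# Succeeds when one third or less of the cert lifetime remains.
in_renewal_window() {
    lifetime=$(( $2 - $1 ))
    remaining=$(( $2 - $3 ))
    # remaining <= lifetime / 3, written without integer division
    [ $(( remaining * 3 )) -le "$lifetime" ]
}
```

For a 90-day cert this turns true at day 60, so a cert older than that which has not renewed points at a wedged timer.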

**Preconditions.**
- You're on input-b.
- The CCAT root cert is at
  `/etc/pki/ca-trust/source/anchors/ccat-root-ca.crt` (RHEL system
  trust anchor).
- step-ca container is up; `:9000` is reachable from input-b's own
  `/24`.

**Steps.**

1. Trigger the renewal helper directly:

   ```bash
   sudo /opt/data-center/system-integration/step-ca/renew-vhost-cert.sh
   ```

   The helper checks PRE/POST cert mtime and only reloads
   nginx-proxy if the cert actually changed.
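
The PRE/POST mtime check described above follows a common pattern, sketched here for readers who need to reason about when a reload will or will not happen. This is not the content of `renew-vhost-cert.sh`; the function name and its arguments are hypothetical, and `stat -c %Y` is the GNU coreutils form.

```bash
# reload_if_changed CERT_FILE RENEW_CMD RELOAD_CMD
# Runs RENEW_CMD, then runs RELOAD_CMD only if CERT_FILE's mtime moved.
# RENEW_CMD and RELOAD_CMD are names of commands or shell functions.
reload_if_changed() {
    pre=$(stat -c %Y "$1") || return 1    # mtime before renewal (epoch seconds)
    "$2" || return 1
    post=$(stat -c %Y "$1") || return 1   # mtime after renewal
    if [ "$pre" != "$post" ]; then
        "$3"
    else
        echo "cert unchanged; skipping reload"
    fi
}
```

A consequence of this pattern is that a renewal attempt which leaves the file untouched never restarts the proxy, so a missing reload in the journal is not by itself an error.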

**Verification.**
- `step certificate inspect /opt/proxy/certs/ca.ccat.uni-koeln.de.crt --short`
  shows a fresh validity window.
- `echo | openssl s_client -connect ca.ccat.uni-koeln.de:443 -servername ca.ccat.uni-koeln.de 2>/dev/null | openssl x509 -noout -dates -issuer`
  shows the new dates and the CCAT issuer.
- `journalctl -u step-ca-vhost-renew.service -n 20` shows the most
  recent run with no error.

**Rollback / escalation.** The previous cert is overwritten in
place; there is no automatic rollback. If the new cert doesn't load,
re-issue from scratch (see *Vhost cert emergency re-issue* below).

### Vhost cert emergency re-issue

**When you run this.** The renewal timer is broken, the cert was
overwritten with a Let's Encrypt (LE) cert by a misconfigured
`acme-companion`, the cert key was leaked, or the on-wire chain
doesn't match what the file claims. This procedure regenerates the
keypair from scratch.

**Preconditions.**
- You're on input-b.
- The vault password file with `vault_step_ca_prov_prod_services_password`
  is readable by the issuance script (Ansible normally materialises
  it; for a fully manual run, set `--password-file` explicitly).
- step-ca container is running; firewalld permits `:9000` from
  input-b's own /24.

**Steps.**

1. Re-issue the cert. The script writes
   `/opt/proxy/certs/ca.ccat.uni-koeln.de.{crt,key}` and shells out
   to `step ca certificate` with `--ca-url
   https://ca.ccat.uni-koeln.de:9000` and the `prod-services` JWK
   provisioner:

   ```bash
   sudo /opt/data-center/system-integration/step-ca/issue-vhost-cert.sh
   ```

   Or via Ansible (cleanest path — handles the JWK password
   tmpfile automatically):

   ```bash
   make play-hsm-host T=hsm_host_vhost_cert_issue
   ```

2. Restart the proxy so the new cert is served:

   ```bash
   ccat proxy restart
   ```

**Verification.** Same as routine rotation, plus a remote
verification from a partner client:

```bash
step ca health  # from any allowlisted client; expect: ok
```
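
One of the triggers for this procedure is an on-wire cert that differs from the file on disk, so it is worth confirming after the restart that the two now agree. The sketch below compares SHA-256 fingerprints using standard `openssl` subcommands; the function names are hypothetical, and it assumes the leaf cert is the first cert in the file.

```bash
# file_fingerprint CERT_FILE: SHA-256 fingerprint of the first cert in the file.
file_fingerprint() {
    openssl x509 -in "$1" -noout -fingerprint -sha256
}

# served_fingerprint HOST: SHA-256 fingerprint of the leaf cert HOST serves on :443.
served_fingerprint() {
    echo | openssl s_client -connect "$1:443" -servername "$1" 2>/dev/null \
        | openssl x509 -noout -fingerprint -sha256
}

# served_matches_file HOST CERT_FILE
# Succeeds only if a fingerprint was obtained from the wire and it equals
# the fingerprint of the cert on disk.
served_matches_file() {
    served=$(served_fingerprint "$1")
    [ -n "$served" ] && [ "$served" = "$(file_fingerprint "$2")" ]
}
```

For example, `served_matches_file ca.ccat.uni-koeln.de /opt/proxy/certs/ca.ccat.uni-koeln.de.crt` should succeed once the proxy has picked up the re-issued cert.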

**Rollback / escalation.** The old cert is gone; there is no
rollback. If the re-issue itself fails (JWK password wrong, step-ca
rejecting), the proxy keeps serving the previous cert until you
reload it. Investigate the issuance script's stderr; the most
common failure is a stale JWK password (see *JWK provisioner
password rotation* above).

### Intermediate rotation (planned, every ~5 years)

**When you run this.** Scheduled, low-activity window. The
intermediate cert is rotated proactively before its lifetime
expires.

**Preconditions.**
- HSM #1 (root) is retrievable from the safe.
- A spare HSM 2 dongle is available (avoids downtime during the
  ceremony) or HSM #2 can be unplugged from input-b.
- Air-gapped laptop with both dongles plugged in is ready.

**Steps.**

1. Retrieve HSM #1 from the safe.
2. Either unplug HSM #2 from input-b, or use a fresh dongle to
   avoid downtime.
3. On the air-gapped laptop with both dongles plugged in:
   - Generate a new intermediate key on the target dongle.
   - Sign a new intermediate cert with HSM #1.
4. Return HSM #1 to the safe immediately.
5. Install the new intermediate cert on input-b; update `ca.json`
   to reference the new dongle (if swapped); restart step-ca.

**Verification.**
- `step ca certificates` (against the running step-ca) shows the
  new intermediate.
- A previously-issued client cert continues to validate (chains to
  the unchanged root).
- A freshly-issued cert chains through the new intermediate.

**Rollback / escalation.** Keep the prior intermediate cert and
`ca.json` snapshot; if the new intermediate misbehaves, restore
both and restart. Downtime budget: 10–30 minutes depending on HSM
swap logistics.
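
The rollback depends on a snapshot taken before step 5. A minimal sketch of taking one follows; the function name `snapshot_files` and the snapshot directory layout are hypothetical, and the on-host paths of `ca.json` and the intermediate cert are not stated on this page, so they are passed in as arguments.

```bash
# snapshot_files DEST_DIR FILE...
# Copies each FILE into a fresh timestamped directory under DEST_DIR,
# preserving mode and mtime, and prints the directory path.
snapshot_files() {
    dest="$1/ca-snapshot-$(date -u +%Y%m%dT%H%M%SZ)"
    shift
    mkdir -p "$dest" || return 1
    for f in "$@"; do
        cp -p -- "$f" "$dest/" || return 1
    done
    echo "$dest"
}
```

Record the printed directory in the change log for the rotation so that whoever performs the rollback knows which snapshot to restore.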

### Intermediate rotation (emergency, after suspected compromise)

**When you run this.** You believe the intermediate key has been
compromised (e.g. HSM #2 access leaked, signed cert-chain anomaly
detected).

**Preconditions.** Same as planned intermediate rotation.

**Steps.**

1. Run the planned-rotation steps above to bring up a fresh
   intermediate.
2. Remove the compromised intermediate from step-ca's config.
3. Force clients to refresh their trust chain (Ansible
   `ca_trust` role re-applied).

**Verification.** Same as planned, plus: previously-issued certs
*from the compromised intermediate* keep chaining successfully
(this is unavoidable; their validity windows are 30–90 days).
Watch logs for unexpected use of those certs over the remaining
lifetime.
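
The length of the watch period follows from the validity windows above: a cert issued on the day of the rotation with the longest lifetime (90 days) is the last one that can still be presented. A small sketch of that arithmetic, assuming GNU `date`; the function name `watch_until` is hypothetical.

```bash
# watch_until LAST_ISSUE_DATE [MAX_LIFETIME_DAYS]
# Prints the UTC date (YYYY-MM-DD) on which the last cert issued on
# LAST_ISSUE_DATE expires. MAX_LIFETIME_DAYS defaults to 90, the upper
# end of the validity window quoted above. Requires GNU date (-d).
watch_until() {
    date -u -d "$1 +${2:-90} days" +%F
}
```

Pass the date on which the compromised intermediate last issued a cert; log monitoring for those certs can stop after the printed date.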

**Rollback / escalation.** No rollback — the rotation is the
remediation. If you find evidence the root was also compromised,
escalate to *Root rotation* below.

### Root rotation

**When you run this.** Catastrophic case: the root key is
compromised, or the root's lifetime is approaching its end (planned).
This procedure re-bootstraps every client against a new root. Treat
it as a half-day, team-coordinated event.

**Preconditions.**
- Spare HSM is available for the new root.
- Communication channel to every operator and partner is in place
  (you'll be telling them to run `step ca bootstrap --force` with a
  new fingerprint).
- {doc}`ceremony/playbook` is current.

**Steps.**

1. Generate a new root, ceremony-style, with the spare HSM. Use
   {doc}`ceremony/playbook` as the executable procedure.
2. Distribute the new `root_ca.crt` to every managed host via the
   `ca_trust` role (commit the new cert, run the playbook).
3. Coordinate every operator/partner to run
   `step ca bootstrap --force` with the new fingerprint. Update the
   canonical fingerprint block in {doc}`ca-client-onboarding`.
4. Rotate every service config that hard-codes the root path
   (`Settings.REDIS_CA_CERT_PATH` etc.).
5. Dispose of the old root HSM if compromised; retain it if the
   rotation was lifetime-driven.
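
For step 3 it helps to send everyone the same copy-pasteable command line. The sketch below only formats that line; the function name `bootstrap_command` is hypothetical, and the CA URL to pass is whichever one {doc}`ca-client-onboarding` publishes (the `:9000` URL used elsewhere on this page is an assumption for the example).

```bash
# bootstrap_command CA_URL FINGERPRINT
# Prints the command line to send to operators and partners.
bootstrap_command() {
    printf 'step ca bootstrap --ca-url %s --fingerprint %s --force\n' "$1" "$2"
}
```

The fingerprint itself can be read from the new root with `step certificate fingerprint root_ca.crt`, for example `bootstrap_command https://ca.ccat.uni-koeln.de:9000 "$(step certificate fingerprint root_ca.crt)"`.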

**Verification.**
- Every managed host: `step certificate verify
  /etc/pki/ca-trust/source/anchors/ccat-root-ca.crt` succeeds.
- Every operator: `step ca health` succeeds against the new root.
- Pattern A service-account certs renew on next timer fire under
  the new chain.
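
The per-host check can be looped over the managed hosts so that stragglers are listed explicitly. This is a sketch only: the function name `verify_root_on_hosts` is hypothetical, the host list is supplied by the caller, and it assumes key-based SSH access to each host.

```bash
# verify_root_on_hosts HOST...
# Runs the verification command from this section on each host over SSH
# and prints the hosts where it failed. Returns non-zero if any failed.
verify_root_on_hosts() {
    failed=0
    for host in "$@"; do
        # </dev/null keeps ssh from consuming the caller's stdin.
        if ! ssh "$host" 'step certificate verify /etc/pki/ca-trust/source/anchors/ccat-root-ca.crt' </dev/null; then
            echo "FAILED: $host"
            failed=1
        fi
    done
    return "$failed"
}
```

An empty output with exit status 0 means every listed host accepted the new root.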

**Rollback / escalation.** None — once distributed, the new root
is the root. The entire offline-root architecture exists to make
this a once-in-the-lifetime-of-the-CA event.

## Disaster recovery

### "step-ca container won't start"

**When you run this.** `ccat ca status` shows step-ca in a restart
loop or `Exited`.

**Preconditions.** SSH access to input-b.

**Steps.**

1. Read the error first:

   ```bash
   ccat ca logs step-ca
   ```

2. Diagnose in this order:

   ```bash
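   # 1. Is the HSM visible inside the container?
   docker exec -it ccat-ca-step-ca-1 pkcs11-tool --list-slots
   # 2. Is the PIN secret mounted and non-empty? (test -s avoids printing the PIN)
   docker exec -it ccat-ca-step-ca-1 test -s /run/secrets/hsm-pin && echo "hsm-pin present"
   # 3. Does ca.json parse as valid JSON?
   docker exec -it ccat-ca-step-ca-1 jq . /home/step/config/ca.json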
   ```

3. If the HSM isn't visible inside the container but *is* visible
   on the host (`pkcs11-tool --list-slots` from the host SSH
   session), the `devices:` mount in `docker-compose.ca.yml` is
   wrong — udev may have renumbered the USB bus after a reboot.
   Update the device path and `ccat ca restart step-ca`.

**Verification.** `ccat ca status` shows step-ca `Up`, and
`ccat ca logs step-ca | grep -i ready` prints a matching line.

**Rollback / escalation.** If the issue is a corrupted `ca.json`,
restore from the latest backup of the `step-ca-data` volume. If the
HSM itself is the issue, escalate to *HSM #2 has failed*.

### "input-b is down, CA is unreachable"

**When you run this.** input-b is unreachable; existing clients
keep working but cannot get new certs.

**Preconditions.** None — clients with valid certs continue to
function until expiry (16 h for SSH user certs, 30–90 d for TLS
certs).

**Steps.**

1. If input-b is just offline (network, power, OS panic), bring it
   back online. No further action needed.

2. If the server itself is lost:

   1. Provision a replacement R640 or equivalent.
   2. Restore the relevant Docker volumes from backup
      (see {doc}`ca-day-to-day` § "Backup").
   3. Move HSM #2 from the old chassis to the new one's internal
      USB.
   4. Re-point DNS if the IP changed.
   5. `ccat ca up`.

**Verification.** Clients do not notice — the CA URL and root
fingerprint are unchanged. Confirm via `step ca health` from a
known-good client.
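
After `ccat ca up` the CA may take a moment to answer, so the health check is easier to run as a bounded retry. A minimal sketch, run from an already bootstrapped client; the function name `wait_for_ca` and the retry parameters are hypothetical.

```bash
# wait_for_ca ATTEMPTS DELAY_SECONDS
# Polls `step ca health` until it succeeds or ATTEMPTS run out.
wait_for_ca() {
    attempt=1
    while [ "$attempt" -le "$1" ]; do
        if step ca health >/dev/null 2>&1; then
            echo "CA healthy after $attempt attempt(s)"
            return 0
        fi
        attempt=$(( attempt + 1 ))
        sleep "$2"
    done
    echo "CA still unhealthy after $1 attempt(s)" >&2
    return 1
}
```

For example, `wait_for_ca 30 10` gives the recovered host about five minutes before reporting failure.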

**Rollback / escalation.** No rollback — the recovery is the fix.

### "HSM #2 has failed"

**When you run this.** HSM #2 (intermediate key) is dead or
unresponsive on input-b.

**Preconditions.**
- HSM #1 is retrievable from the safe.
- A new HSM 2 dongle (same model) is on hand or procurable.

**Steps.**

1. Retrieve HSM #1 from the safe.
2. Buy a new HSM 2 dongle (same model) if you don't already have a
   spare.
3. Run the *Intermediate rotation (planned)* procedure above to
   produce a new intermediate on the fresh dongle.
4. Install the new HSM in input-b, update `ca.json`, restart
   step-ca.

**Verification.** Existing certs chain to the same root and remain
valid; new certs chain through the new intermediate. Downtime: ~1
hour ceremony + recovery time.

**Rollback / escalation.** None. If HSM #1 also fails during this
procedure, escalate to *HSM #1 has failed*.

### "HSM #1 has failed"

**When you run this.** HSM #1 (root key) is dead. This is a
once-in-a-lifetime event: the root is exercised only at root
ceremonies.

**Preconditions.**
- A new HSM 2 (root) is available or procurable.
- {doc}`ceremony/playbook` is current.
- Coordination with every operator and partner is possible.

**Steps.**

1. Procure a new HSM 2.
2. Run a **full commissioning ceremony** to generate a new root.
   Executable steps: {doc}`ceremony/playbook`.
3. Produce a new intermediate signed by the new root.
4. Distribute the new `root_ca.crt` to every managed host via
   `ca_trust`.
5. Every operator runs `step ca bootstrap --force` with the new
   fingerprint. Update the canonical fingerprint block in
   {doc}`ca-client-onboarding`.
6. Rotate every internal service config that hard-codes the root
   path.

**Verification.** Same as *Root rotation* above.

**Rollback / escalation.** None — this is *Root rotation* by
incident rather than schedule.

## See also

- {doc}`background/ca-architecture` — why two HSMs, why offline
  root, and what each HSM actually protects.
- {doc}`ceremony/playbook` — the executable offline-root ceremony
  procedure (used for both root rotation and HSM #1 recovery).
- {doc}`ceremony/cutover-playbook` — the executable on-server
  cutover procedure (used after the ceremony to deploy the new
  root).
- {doc}`ca-day-to-day` — backup details and the routine vhost cert
  lifecycle (the happy path — this page is the unhappy path).
- {doc}`ca-provisioner-management` — for adding/updating/removing
  provisioners; this page is for *rotating* their material.