# CCAT CA — HSM Cutover Playbook (post-ceremony)

> **Read this before doing anything.** This is the executable
> companion to the offline-root [ceremony playbook](playbook.md). Where
> that document covers what happened on the air-gapped laptop on
> 2026-04-29, this document covers what now has to happen on input-b
> to make the rest of CCAT *trust* the new HSM-backed root.
>
> Two operators recommended (one driver, one second-pair-of-eyes).
> Stages A–F are reversible and low-risk. Stage G is the irreversible
> volume wipe + bring-up, scheduled in a known downtime window. Stage H
> is the test cohort re-bootstrapping with the new fingerprint.
>
> Reference docs:
>
> - [`playbook.md`](playbook.md) — the offline ceremony itself
> - [`lessons-learned-2026-04-29.md`](lessons-learned-2026-04-29.md) — PKCS#11 URI shapes that actually work
> - [`background/ca-architecture.md`](../background/ca-architecture.md) — design context for the CA this playbook is cutting over
> - `step-ca/COMMISSIONING-TODO.md` — overall phasing checklist

---

## §0 — Preconditions

Before starting Stage A, all of the following must be true:

- [ ] The 2026-04-29 ceremony completed successfully and the export USB contains five files: `root_ca.crt`, `intermediate_ca.crt`, `ssh_user_ca.pub`, `ssh_host_ca.pub`, `FINGERPRINT.txt`.
- [ ] The paper PIN sheet from the ceremony is in the safe; the intermediate **user** PIN is also accessible to the cutover operator (memorised, written separately, or carried in a sealed envelope — *not* the root PINs).
- [ ] Both HSM serials are recorded on the paper sheet. The HSM #2 serial is needed inline in `ca.json.hsm` (Stage F).
- [ ] HSM #2 is physically installed in input-b's internal USB port. Chassis closed. Server up.
- [ ] You are reading this on a trusted workstation (not on input-b itself — keep operator role separate from the host).
- [ ] The Phase 1 test cohort has been told that a re-bootstrap is coming and roughly when.
- [ ] You have a clean working tree on `main` of system-integration.

If any of the above is missing, **stop**. Do not proceed.

---

## Stage A — Verify HSM #2 is functional on input-b

Goal: confirm the OS sees the dongle, OpenSC can talk to it, and the keys on the card match the public artefacts on the export USB. **No compose changes here.** If anything in this stage fails, stop and diagnose before going further.

### §A1 — Host-level visibility

SSH to input-b. Run:

```bash
lsusb | grep -i nitrokey
sudo systemctl status pcscd
sudo pkcs11-tool --list-slots
```

Expected:

- Exactly one Nitrokey HSM 2 line in `lsusb` (USB vendor:product ID `20a0:4230`).
- `pcscd` active and running.
- `pkcs11-tool --list-slots` shows **one** slot, token label `ccat-intermediate (UserPIN)`.

If `pkcs11-tool` shows no slot, reseat the dongle and reload udev:

```bash
sudo udevadm control --reload-rules
sudo udevadm trigger
```

Re-check. Still nothing? Stop here — the dongle, the USB port, or the udev rule needs investigating before anything else happens.

### §A2 — End-to-end run of the `hsm_host` Ansible role

The role installs `opensc` + `opensc-tools`, deploys `99-nitrokey-hsm.rules`, ensures the `plugdev` group exists, and verifies the slot is visible:

```bash
cd ansible
make play-hsm-host
```

This is `ansible-playbook -i inventory.ini -l input-b playbook_setup_vms.yml --tags hsm_host --vault-password-file .ansible_vault_key --ask-become-pass` — make handles the sudo prompt and the vault key for you. Don't invoke `ansible-playbook` directly; you'll trip "Missing sudo password" on the fact-gathering task.

Expected: green run; the final `Report detected HSM slots` debug task prints the slot info (one slot, label `ccat-intermediate (UserPIN)`).

After the first clean run, flip the role into hard-fail mode for the future. Edit `host_vars/input-b/hsm.yml` (create if missing) so the role refuses to skip verification on subsequent runs:

```yaml
_hsm_enforce_verify: true
```

Commit + push.
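
The §A1 expectation (one slot, one token label) can also be checked mechanically instead of by eye. The following is a minimal sketch, assuming `pkcs11-tool --list-slots` prints one `token label : ...` line per present token. `check_single_slot` is a hypothetical helper name, not something the repo ships.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo).
# Reads `pkcs11-tool --list-slots` output on stdin and succeeds only if
# exactly one token is listed and its label contains the expected text.
check_single_slot() {
  local expected_label="$1" labels count
  # Assumption: each present token yields one "token label : ..." line.
  labels=$(grep -E '^[[:space:]]*token label[[:space:]]*:' || true)
  count=$(printf '%s' "$labels" | grep -c . || true)
  if [ "$count" -ne 1 ]; then
    echo "FAIL: expected 1 token, found $count" >&2
    return 1
  fi
  case "$labels" in
    *"$expected_label"*) echo "OK: one token, label matches" ;;
    *) echo "FAIL: token label mismatch" >&2; return 1 ;;
  esac
}
```

On input-b this would be used as `sudo pkcs11-tool --list-slots | check_single_slot 'ccat-intermediate (UserPIN)'`. It is a convenience only. The `hsm_host` role's own verification remains the reference check.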
### §A3 — Sign-test with the intermediate key on `id=01` Confirms the HSM is not just visible but actually responsive. Before running the sign command, do two free pre-flight checks that don't consume PIN attempts: ```bash # Confirm id=01, id=02, id=03 actually exist (no-login operation) sudo pkcs11-tool --module /usr/lib64/pkcs11/opensc-pkcs11.so \ --token-label ccat-intermediate --list-objects --type pubkey # Read the user PIN retry counter before betting on it sudo sc-hsm-tool ``` Expect three pubkey objects (intermediate, SSH user CA, SSH host CA) and a user PIN counter at full value (typically 3 of 3 on a freshly-initialised SC-HSM). If the counter reads `1 of 3`, **do not proceed to the sign-test on a guess** — reset via the SO-PIN first (`sudo sc-hsm-tool --unlock-pin`, prompts for SO-PIN then new user PIN). Then the sign-test itself. Run on input-b (RHEL9 path): ```bash TMPIN=$(mktemp /tmp/sigtest-in.XXXXXX) echo "ccat-test-$(date)" > "$TMPIN" sudo pkcs11-tool --module /usr/lib64/pkcs11/opensc-pkcs11.so \ --token-label ccat-intermediate --login \ --sign --id 01 --mechanism ECDSA-SHA384 \ -i "$TMPIN" -o /tmp/sigtest.bin ls -la /tmp/sigtest.bin && rm -f "$TMPIN" /tmp/sigtest.bin ``` > Don't be tempted to use `<(echo ...)` process substitution for the > input — sudo's default config closes file descriptors above 2 when > elevating, so `/dev/fd/63` is invisible to the elevated pkcs11-tool > and the sign aborts *after* successful login. PIN attempt is not > wasted (successful logins don't decrement the SC-HSM counter), but > you've still typed your PIN for nothing. Use a temp file. PIN goes in via interactive prompt. **Never put `--pin ` on the command line — it lands in shell history and `ps`.** Expected: a non-empty `/tmp/sigtest.bin` (~96 bytes for ECDSA-SHA384 / P-384, that's `r||s` with 48 bytes each). If the PIN is rejected: **stop, do not retry on a guess.** Each wrong attempt decrements the counter. 
Re-read the paper carefully (O vs 0, keyboard layout) and resume only when very confident. Lockout at counter=0 is recoverable via `sudo sc-hsm-tool --unlock-pin` with the SO-PIN from the safe. If signing fails for non-PIN reasons (key on a different `id`, HSM misbehaving): diagnose before continuing — don't keep retrying. ### §A4 — Bind-check: HSM keys ↔ ceremony artefacts This proves the dongle in input-b is the **same** dongle the ceremony wrote to. Run on a workstation with the export USB mounted at `/mnt/export-usb/` (or with the five ceremony files copied somewhere local). At this stage the artefacts on the **export USB are the canonical reference** — the in-repo files under `ca_trust/files/` are still the Phase 1 throwaway and will be overwritten in §B2. Three checks, two different conversion paths because the artefacts have two different formats: - `id=01` ↔ `intermediate_ca.crt` — X.509 path (public key embedded in a certificate; compare PEM-to-PEM) - `id=02` ↔ `ssh_user_ca.pub` — OpenSSH wire-format path - `id=03` ↔ `ssh_host_ca.pub` — OpenSSH wire-format path Don't try `openssl x509` on the `.pub` files — they aren't X.509, they're OpenSSH wire format, and `openssl x509` will error out with "Could not find certificate". #### id=01 — intermediate, X.509 path Pull the public part of `id=01` off the HSM and compare to the public key embedded in `intermediate_ca.crt`: ```bash # On input-b: sudo pkcs11-tool --module /usr/lib64/pkcs11/opensc-pkcs11.so --token-label ccat-intermediate --read-object --id 01 --type pubkey -o /tmp/int.der openssl ec -pubin -inform DER -in /tmp/int.der -pubout -outform PEM > /tmp/int.pem ``` ```bash # On the workstation: scp input-b:/tmp/int.pem /tmp/int.pem openssl x509 -in /mnt/export-usb/intermediate_ca.crt -pubkey -noout > /tmp/int_cert.pem diff /tmp/int.pem /tmp/int_cert.pem # must be empty ``` #### id=02 — SSH user CA, OpenSSH path `ssh_user_ca.pub` is OpenSSH wire format (one line: algorithm, base64 blob, optional comment). 
Convert the HSM-extracted PEM *up* to OpenSSH format and fingerprint-compare; don't byte-compare the strings directly (the comment field and whitespace can vary innocuously).

```bash
# On input-b:
sudo pkcs11-tool --module /usr/lib64/pkcs11/opensc-pkcs11.so \
  --token-label ccat-intermediate --read-object --id 02 --type pubkey \
  -o /tmp/sshuser.der
openssl ec -pubin -inform DER -in /tmp/sshuser.der -pubout -outform PEM > /tmp/sshuser.pem
```

```bash
# On the workstation:
scp input-b:/tmp/sshuser.pem /tmp/sshuser.pem
ssh-keygen -i -m PKCS8 -f /tmp/sshuser.pem > /tmp/sshuser_from_hsm.pub
ssh-keygen -lf /tmp/sshuser_from_hsm.pub
ssh-keygen -lf /mnt/export-usb/ssh_user_ca.pub
```

The two `ssh-keygen -lf` lines print SHA256 fingerprints — visually compare them. Same fingerprint = same key. **Don't read aloud** — same audio-leak hygiene as everywhere else.

If you prefer a hard diff over eyeballing fingerprints, strip the algorithm prefix and comment field so only the base64 key blob remains:

```bash
diff <(awk '{print $2}' /tmp/sshuser_from_hsm.pub) <(awk '{print $2}' /mnt/export-usb/ssh_user_ca.pub)
# Must be empty. Process substitution is fine here — no sudo involved
# (cf. §A3 where it broke under sudo).
```

#### id=03 — SSH host CA, OpenSSH path

Same shape as id=02 with `--id 03`, filenames `/tmp/sshhost.der`, `/tmp/sshhost.pem` and `/tmp/sshhost_from_hsm.pub`, reference `/mnt/export-usb/ssh_host_ca.pub`.

#### Decision

If **any** comparison fails (non-empty diff or fingerprint mismatch): **stop**. The dongle in the server is not the one the ceremony wrote, or the ceremony artefacts have been tampered with. Investigate before going further.

### §A5 — Decision point

If §A1 through §A4 are all green, the stick is functional and bound to the ceremony artefacts. Proceed to Stage B.

If any failed, stop. Do not commit ceremony artefacts to the repo until the binding is proven.

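
The blob comparison from §A4 (field 2 of each OpenSSH public-key line) can be wrapped so it yields an exit status rather than a diff to read. This is a sketch only: `same_ssh_pubkey` is a hypothetical helper, and it assumes each file holds a single key line in the usual `algorithm base64-blob [comment]` layout.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo).
# Succeeds only if both OpenSSH public-key files carry the same base64
# key blob. The algorithm prefix is not compared here, and the comment
# field is ignored on purpose (it can vary innocuously).
same_ssh_pubkey() {
  local a b
  a=$(awk 'NF >= 2 {print $2; exit}' "$1")
  b=$(awk 'NF >= 2 {print $2; exit}' "$2")
  # An empty blob means the file was missing or malformed: treat as failure.
  [ -n "$a" ] && [ "$a" = "$b" ]
}
```

For the id=03 check, for example: `same_ssh_pubkey /tmp/sshhost_from_hsm.pub /mnt/export-usb/ssh_host_ca.pub && echo match || echo MISMATCH`. A mismatch carries the same meaning as a non-empty diff in the Decision step: stop.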

---

## Stage B — Commit ceremony artefacts to the repo

These files are public, but their authenticity is the load-bearing property of the whole CA. Treat the commit as a high-trust action.

### §B1 — Verify the fingerprint against the paper

```bash
step certificate fingerprint /mnt/export-usb/root_ca.crt
```

Compare visually, character by character, against the paper from the safe. **Do not read aloud** — same audio-leak hygiene as the ceremony.

If they match, continue. If not, stop and investigate (the export USB or the `root_ca.crt` file is wrong).

### §B2 — Overwrite the Phase 1 throwaway artefacts

```bash
cp /mnt/export-usb/root_ca.crt      ansible/roles/ca_trust/files/
cp /mnt/export-usb/ssh_user_ca.pub  ansible/roles/ca_trust/files/
cp /mnt/export-usb/ssh_host_ca.pub  ansible/roles/ca_trust/files/
```

`intermediate_ca.crt` does **not** go into `ca_trust/files/`. Clients only need the root; the intermediate lives only on input-b in the step-ca volume (Stage F).

### §B3 — Commit with a loud message

```bash
git add ansible/roles/ca_trust/files/
git commit -m "ca_trust: rotate to HSM-backed root (Phase 2 cutover)

Replaces 2026-04 Phase 1 throwaway root with the ceremony output from
2026-04-29. New fingerprint: .

Every CCAT client now needs to re-bootstrap:
  step ca bootstrap --force \\
    --ca-url https://ca.ccat.uni-koeln.de \\
    --fingerprint "
git push origin main
```

### §B4 — Distribute the new root via `ca_trust`

```bash
cd ansible
make play-ca-trust
```

This is `ansible-playbook -i inventory.ini -l all playbook_setup_vms.yml --tags ca_trust --vault-password-file .ansible_vault_key --ask-become-pass` — make handles the vault key and sudo prompt for you. As with §A2, don't invoke `ansible-playbook` directly or you'll trip "Missing sudo password" on fact-gathering.


For a staged rollout (recommended in production — verify staging first, then push to production), use `G=` or `H=`:

```bash
make play-ca-trust G=input_staging     # staging hosts only
make play-ca-trust G=input_ccat        # production input nodes only
make play-ca-trust H=input-a-staging   # single host
```

The role adds the new root to every managed host's system trust store and to `/etc/ssh/trusted_user_ca_keys`. Briefly the old (Phase 1 throwaway) and new roots coexist; that's fine — existing 16h SSH certs continue to validate against the throwaway root until they expire or step-ca is cut over (Stage G).

### §B5 — Spot-check on one host

```bash
ssh input-a
sudo trust list | grep -i 'CCAT Observatory Root'
```

(or `ls /etc/ssh/trusted_user_ca_keys` + `head -c 30` to confirm the new SSH user CA pubkey is in place.)

---

## Stage C — Vault the intermediate user PIN

The intermediate user PIN is the only secret on input-b that, with HSM #2 plugged in, can produce a signature. It lives in the vault and is rendered into `/opt/data-center/system-integration/.env` on input-b through the existing `application_env` schema-driven pipeline.

### §C1 — Add the schema entry + populate the vault

```bash
ccat secrets add vault_step_ca_hsm_pin --env production
# When prompted:
#   env_name:    STEP_CA_HSM_PIN
#   description: "User PIN for HSM #2 (intermediate). Source: ceremony 2026-04-29 paper sheet."
#   value:
```

### §C2 — Provision `.env` on input-b

```bash
ccat secrets provision --host input-b
```

### §C3 — Verify the file on input-b

```bash
ssh input-b "sudo grep STEP_CA_HSM_PIN /opt/data-center/system-integration/.env"
```

Expected: a line `STEP_CA_HSM_PIN=...` (visible because you sudo'd — the file is mode 0640, root:jenkins). The PIN value should match what you set in §C1.

> **Do not echo the PIN to a screen anyone but you can see.** Same
> audio/video-leak hygiene as the ceremony.

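
When the screen cannot be kept private, a presence-only check is a possible fallback for §C3. The sketch below confirms that `.env` carries a non-empty `STEP_CA_HSM_PIN` without printing the value. `pin_is_set` is a hypothetical helper, and it does not confirm that the value is the correct PIN: that still requires the §C3 comparison or a successful Stage G bring-up.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo).
# Reports whether the env file has a non-empty STEP_CA_HSM_PIN entry.
# Prints a status line only, never the PIN itself.
pin_is_set() {
  local env_file="$1"
  # -q suppresses the matching line, so the value never reaches the terminal.
  if grep -Eq '^STEP_CA_HSM_PIN=.+' "$env_file"; then
    echo "STEP_CA_HSM_PIN: present, non-empty"
  else
    echo "STEP_CA_HSM_PIN: missing or empty" >&2
    return 1
  fi
}
```

The same idea works as a one-liner over SSH, where `grep -c` prints only a match count: `ssh input-b "sudo grep -Ec '^STEP_CA_HSM_PIN=.+' /opt/data-center/system-integration/.env"` (expect `1`).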

---

## Stage D — Build a step-ca image with `opensc-pkcs11`

The stock `smallstep/step-ca` image does **not** ship a PKCS#11 module. Smallstep maintains a separate `-hsm` flavour of the same image (e.g. `0.30.2-hsm`) with the OpenSC PKCS#11 module pre-installed and `pcscd` available (though not auto-started — the libusb-direct vs pcscd-mediated runtime choice is settled in Stage E, not at image build time). We use the upstream `-hsm` tag as our base and add only what it's missing.

### §D1 — Create `step-ca/Dockerfile.hsm`

```dockerfile
ARG STEP_CA_VERSION=0.30.2
FROM smallstep/step-ca:${STEP_CA_VERSION}-hsm

USER root
RUN apt-get update \
 && apt-get install -y --no-install-recommends ca-certificates \
 && rm -rf /var/lib/apt/lists/*
USER step
```

Notes:

- `STEP_CA_VERSION` is pinned to match the ceremony's step-cli and step-kms-plugin pinning (see `step-ca/prepare-ceremony-usb.sh` `STEP_CLI_VERSION` and `ansible/roles/hsm_host/defaults/main.yml` `step_cli_version`). Bump all three together when the time comes, not floating to `:latest` — Phase 2 needs reproducibility across multi-year dormancy.
- The `-hsm` suffix is hardcoded in the FROM line, not part of the ARG, so a version bump cannot accidentally drop the PKCS#11 layer.
- `ca-certificates` is genuinely missing from the upstream `-hsm` image (verified empirically). step-ca needs it for outbound TLS trust during ACME flows. opensc-pkcs11 itself does not, but adding the package is cheap and the integration cost of *not* having it later is much higher.

### §D2 — Confirm the module path inside the image

```bash
docker build -t ccat-step-ca:hsm-test ./step-ca -f step-ca/Dockerfile.hsm
docker run --rm --entrypoint sh ccat-step-ca:hsm-test \
  -c 'ls -la /usr/lib/x86_64-linux-gnu/opensc-pkcs11.so /etc/ssl/certs/ca-certificates.crt'
```

Expected: both files listed, neither "No such file or directory."

The `--entrypoint sh` override is needed because the upstream image's default entrypoint is `step-ca`, which exits with "no ca.json config file" before the `ls` runs.

This module path is what goes into `ca.json.hsm` in Stage F. It is **different** from the host's `/usr/lib64/pkcs11/opensc-pkcs11.so` — container is Debian, host is RHEL.

---

## Stage E — Compose changes for HSM-backed mode

These are surgical edits to `docker-compose.ca.yml` on the `step-ca` service. Make them as a PR; review with second-pair-of-eyes; merge only when ready to schedule the Stage G window.

> **Architecture note.** The 2026-05-04 cutover proved that the
> initial libusb-direct design did not work — OpenSC on Linux/Debian
> only reaches the HSM via pcscd. The compose changes below describe
> the *working* path: pcscd-in-container as root, with
> `--disable-polkit`, then privilege drop to `step` (UID 1000) for
> step-ca. See `lessons-learned-cutover-2026-05-04.md` §1 for the
> full path-A-to-path-C narrative.

### §E1 — Replace `image:` with `build:`, container starts as root

```yaml
  step-ca:
    build:
      context: ./step-ca
      dockerfile: Dockerfile.hsm
      args:
        STEP_CA_VERSION: 0.30.2
    # Container starts as root so the entrypoint can run pcscd
    # (libusb USB ioctls require root). The entrypoint drops to
    # step (UID 1000) via runuser before exec'ing step-ca.
    user: "0:0"
    restart: always
    ...
```

The `STEP_CA_VERSION` arg can be omitted if you're happy with the Dockerfile's own default. Listed explicitly so version drift is visible in the compose file too — bumping in both places at once is the safer pattern.

### §E2 — Remove all `DOCKER_STEPCA_INIT_*` env vars

The auto-init facility no longer applies — the volume is hand-populated in Stage G. Note that this also disables the side-effect that flipped `enableSSHCA` on; we re-enable it explicitly in `ca.json.hsm` (Stage F).

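
A leftover `DOCKER_STEPCA_INIT_*` variable from §E2 is easy to miss in review. One way to catch it before merging is a plain text search over the compose file, sketched below. `no_init_vars` is a hypothetical helper name. The match is purely textual, so a comment that merely mentions the prefix is flagged too.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo).
# Fails if the given compose file still mentions any DOCKER_STEPCA_INIT_*
# variable, and lists the offending lines with their line numbers.
no_init_vars() {
  local compose_file="$1" hits
  hits=$(grep -n 'DOCKER_STEPCA_INIT_' "$compose_file" || true)
  if [ -n "$hits" ]; then
    echo "Leftover auto-init variables:" >&2
    echo "$hits" >&2
    return 1
  fi
  echo "OK: no DOCKER_STEPCA_INIT_* variables"
}
```

Run from the repo root as `no_init_vars docker-compose.ca.yml`. Adjust the path if the compose file lives elsewhere in your checkout.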
### §E3 — Add `STEP_CA_HSM_PIN` env passthrough ```yaml environment: STEP_CA_HSM_PIN: ${STEP_CA_HSM_PIN:?STEP_CA_HSM_PIN must be set in .env} VIRTUAL_HOST: ${CA_DOMAIN:?CA_DOMAIN must be set in .env} VIRTUAL_PORT: "9000" VIRTUAL_PROTO: "https" LETSENCRYPT_HOST: ${CA_DOMAIN:?CA_DOMAIN must be set in .env} LETSENCRYPT_EMAIL: buchbend@ph1.uni-koeln.de ``` ### §E4 — Pass through the HSM device Pass the whole USB bus through to the container: ```yaml devices: - "/dev/bus/usb:/dev/bus/usb" ``` The container's pcscd runs as root (because compose `user: "0:0"` plus the entrypoint doesn't drop priv until *after* pcscd is started), so it can claim USB ioctls regardless of `/dev/bus/usb` file perms. The host udev rule (root:plugdev 0660) and `group_add` are **not** required here — leave them off. > **Earlier drafts of this section required** `group_add: ["plugdev"]` > with host plugdev pinned to GID 46. That was true under the > libusb-direct hypothesis (rejected). The udev rule + plugdev > infrastructure on the host (commits `71323f1`, `c30459f`) is now > only useful for ad-hoc operator HSM diagnostics on the host — > see lessons-learned §1. > **Hot replug is NOT supported.** Compose `devices:` is a snapshot > at start time. If the dongle is unplugged and replugged the kernel > may reassign busnum/devnum and the container's view goes stale. > Recovery: `ccat ca restart step-ca`. (See §I for the day-2 ops > note on token contention.) ### §E5 — PIN delivery via tmpfs + entrypoint wrapper with privdrop Add a tmpfs mount and the entrypoint wrapper: ```yaml tmpfs: # uid=1000 because the wrapper chowns the PIN file to step # after writing it. mode=0700 so nothing else can list/enter. 
- /run/secrets:mode=0700,uid=1000,gid=1000,size=1M volumes: - step-ca-data:/home/step - ./step-ca/ssh-user-template.tpl:/home/step/config/ssh-user-template.tpl:ro - ./step-ca/step-ca-hsm-entrypoint.sh:/usr/local/bin/step-ca-hsm-entrypoint.sh:ro entrypoint: ["/usr/local/bin/step-ca-hsm-entrypoint.sh"] ``` The wrapper at `step-ca/step-ca-hsm-entrypoint.sh` does four things, in order, all as root, then drops privileges: ```bash #!/bin/sh set -eu umask 077 : "${STEP_CA_HSM_PIN:?STEP_CA_HSM_PIN must be set in .env}" # 1. Materialize PIN on tmpfs, hand to step user only. printf '%s' "$STEP_CA_HSM_PIN" > /run/secrets/hsm-pin chown 1000:1000 /run/secrets/hsm-pin chmod 0400 /run/secrets/hsm-pin unset STEP_CA_HSM_PIN # 2. Start pcscd. --disable-polkit bypasses the auth check # that would otherwise reject all clients (polkitd is in the # upstream :hsm image but cannot run without systemd/DBus). /usr/sbin/pcscd --disable-polkit # 3. Defensive wait for the pcscd socket — guards against a # race where step-ca's first PKCS#11 call beats pcscd's # socket-bind. i=0 while [ ! -S /run/pcscd/pcscd.comm ] && [ "$i" -lt 50 ]; do sleep 0.1 i=$((i + 1)) done # 4. Drop privileges and exec step-ca. exec runuser -u step -- /usr/local/bin/step-ca \ /home/step/config/ca.json \ --password-file /home/step/secrets/password ``` Goals: - The PIN file on tmpfs is readable only by UID 1000. - `STEP_CA_HSM_PIN` is unset before exec, so it doesn't appear in step-ca's `/proc//environ`. - step-ca uses `pin-source=/run/secrets/hsm-pin` (Stage F) to log into the token at signing time. - Only pcscd retains root; step-ca runs as UID 1000. `chmod +x step-ca/step-ca-hsm-entrypoint.sh` and commit. 

### §E5b — Healthcheck (optional but recommended)

```yaml
    healthcheck:
      test:
        - CMD-SHELL
        - "step ca health --ca-url https://localhost:9000 --root /home/step/certs/root_ca.crt && pkcs11-tool --module /usr/lib/x86_64-linux-gnu/opensc-pkcs11.so --list-slots > /dev/null"
      # 30s — `step ca health` is fast but the chained pkcs11-tool
      # round-trip via pcscd → libusb → CCID can take >10s right
      # after a container restart. Observed in production; do not
      # lower below 30s.
      interval: 30s
      timeout: 30s
      retries: 3
      start_period: 60s
```

Docker compose does NOT auto-restart on unhealthy by default — this healthcheck surfaces problems to monitoring; recovery is a manual `ccat ca restart`.

### §E6 — Drop the SoftHSM plumbing

Remove from the `step-ca` service:

- The `softhsm-tokens:/var/lib/softhsm/tokens` volume mount.
- The `./step-ca/softhsm2.conf:/etc/softhsm/softhsm2.conf:ro` mount.

Remove from the top-level `volumes:` block:

- The `softhsm-tokens:` declaration.

These were vestigial Phase 1 plumbing. HSM #2 is real hardware via PKCS#11 and never used SoftHSM. The `softhsm2.conf` file in `step-ca/` can be deleted from git in the same PR.

---

## Stage F — Write the HSM-aware `ca.json`

Create `step-ca/ca.json.hsm`. This is committed in git and is the seed for the new step-ca-data volume in Stage G.

Pull `` from the paper PIN sheet (also visible in `pkcs11-tool --list-token-slots` on the host).

Key fields:

```jsonc
{
  "address": ":9000",
  "dnsNames": ["ca.ccat.uni-koeln.de", "localhost"],
  "logger": { "format": "text" },
  "db": { "type": "badgerv2", "dataSource": "/home/step/db" },
  "root": "/home/step/certs/root_ca.crt",
  "crt": "/home/step/certs/intermediate_ca.crt",
  "key": "pkcs11:id=01",
  "kms": {
    "type": "pkcs11",
    "uri": "pkcs11:module-path=/usr/lib/x86_64-linux-gnu/opensc-pkcs11.so;serial=?pin-source=/run/secrets/hsm-pin"
  },
  "ssh": {
    "userKey": "pkcs11:id=02",
    "hostKey": "pkcs11:id=03"
  },
  "authority": {
    "enableAdmin": false,
    "claims": { "enableSSHCA": true },
    "provisioners": []
  },
  "tls": {
    "cipherSuites": [
      "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
      "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256"
    ],
    "minVersion": 1.2,
    "maxVersion": 1.3,
    "renegotiation": false
  }
}
```

Why these choices:

- **`serial=`, not `token=ccat-intermediate (UserPIN)`** — see lessons-learned §1. The token form requires URL-encoding for the space and parentheses; the serial form is stable and doesn't need encoding.
- **`pin-source=/run/secrets/hsm-pin`, not `pin-value=...`** — the ceremony used `pin-value` because `pin-source` was unreliable from a freshly-installed Ubuntu Live USB. In production the file form is what we want — the PIN is not in the URI string.
- **`enableAdmin: false`** — Phase 1 lesson. With remote management on, step-ca uses a BoltDB-backed runtime store and offline edits to `ca.json` are silently ignored. We want `ca.json` to be the runtime source of truth.
- **`claims.enableSSHCA: true`** — without this, OIDC SSH cert signing is rejected at the authority layer with `sshCA is disabled for oidc provisioner`. Phase 1 flipped it on via `DOCKER_STEPCA_INIT_SSH=true`; Phase 2 has to set it explicitly because we drop all `DOCKER_STEPCA_INIT_*` (§E2).
- **Empty `provisioners: []`** — they get re-added in §G6 by `step-ca/provisioners-bootstrap.sh`.
- **`ssh.userKey` / `ssh.hostKey`** point at the SSH user / host CA keys generated on HSM #2 (`id=02`, `id=03`).

Commit `step-ca/ca.json.hsm` together with the Stage E changes as the cutover PR. --- ## Stage G — The cutover (downtime window, ~5 minutes) Stages A through F are reversible. Stage G is the volume wipe and bring-up. **Do not start G until A–F are green and the test cohort has been told.** ### §G1 — Announce the window Notify the test cohort that the CA is going down for ~5 minutes and that they will need to re-bootstrap with the new fingerprint after. The new fingerprint is the value from the paper PIN sheet (Stage B1), which they will visually confirm against their bootstrap output. ### §G2 — Tear down step-ca only Dex stays up — its config is in git and its `dex-data` volume holds session state and signing keys that are safe to keep: ```bash ccat ca down ``` ### §G3 — Wipe and re-create the step-ca volume This is the irreversible step. Stop host pcscd first if it's running — the new container's pcscd will compete with it for the USB device: ```bash sudo systemctl stop pcscd.service pcscd.socket sudo systemctl mask pcscd.service pcscd.socket # prevent auto-restart on reboot docker volume rm ccat-ca_step-ca-data docker volume create ccat-ca_step-ca-data ``` > **Why mask host pcscd?** Both the host's pcscd (from the > `hsm_host` Ansible role's opensc package install) and the new > container's pcscd want to claim the same USB device interface > via libusb. The kernel allows only one libusb client per > interface. Whichever pcscd starts first wins; the other fails > silently and the loser's PKCS#11 stack reports "No slots." > Masking host pcscd makes the container the unambiguous owner. > See lessons-learned §5; day-2 ops for ad-hoc host pkcs11-tool > work is documented in this playbook's "Day-2 ops — token > contention" section. ### §G4 — Pre-populate the volume Run this from a checkout of system-integration on input-b (the repo is already deployed there for `ccat ca up`/etc.). 
All four inputs come from the repo — no export-USB plumbing needed at this stage: - `ansible/roles/ca_trust/files/root_ca.crt` — committed in §B2 - `step-ca/files/intermediate_ca.crt` — committed in 7cbfc2d (relocated out of `ca_trust/files/` since clients don't need it) - `step-ca/ca.json.hsm` — committed in Stage F The helper container must be **running** (so we can `docker exec` into it for `mkdir` and `chown`), not just `docker create`'d: ```bash # 1. Start a running helper container with the volume mounted TMP=$(docker run -d --rm -v ccat-ca_step-ca-data:/home/step alpine sleep 300) # 2. Pre-create the directory tree (fresh volume is empty) docker exec "$TMP" mkdir -p /home/step/certs /home/step/config /home/step/secrets # 3. Copy in the four files docker cp ansible/roles/ca_trust/files/root_ca.crt "$TMP":/home/step/certs/root_ca.crt docker cp step-ca/files/intermediate_ca.crt "$TMP":/home/step/certs/intermediate_ca.crt docker cp step-ca/ca.json.hsm "$TMP":/home/step/config/ca.json # step-ca refuses to start without a password file even when keys are on HSM — # we write a dummy non-empty value; it's never consulted because the key is # referenced via the PKCS#11 URI. printf 'unused-but-required\n' > /tmp/step-ca-password docker cp /tmp/step-ca-password "$TMP":/home/step/secrets/password rm -f /tmp/step-ca-password # 4. Hand the whole tree to UID 1000 — fresh volume root is root:root, # which would block step-ca (UID 1000) from creating /home/step/db at startup. docker exec "$TMP" chown -R 1000:1000 /home/step # 5. Sanity check docker exec "$TMP" ls -la /home/step/certs /home/step/config /home/step/secrets # 6. Tear down — --rm cleans up docker kill "$TMP" ``` If step 5 doesn't show all four files owned by `1000:1000`, **stop** and diagnose before continuing — step-ca will fail in confusing ways on a partially-populated or root-owned volume. > Two `docker cp` gotchas this section guards against: > > 1. 
**`docker create --rm` is not a *running* container.** `docker exec` doesn't work on it, so the `mkdir -p` step would silently fail. Earlier playbook drafts used `docker create`; switching to `docker run -d --rm` is what makes step 2 work.
> 2. **`docker cp -` reads stdin as a tar stream**, not raw bytes.
>    `echo "unused-but-required" | docker cp - CONTAINER:/path`
>    fails with "archive/tar: invalid tar header." Always use a temp
>    file plus `docker cp `.

### §G5 — Bring step-ca back up

```bash
ccat ca up
ccat ca logs step-ca
```

Expected log lines:

- `Loaded key from PKCS#11 URI ...`
- `Server listening on :9000`

If you see PKCS#11 errors instead (`module not found`, `permission denied`, `token not present`), the most likely causes are:

1. **Container can't see the device** — `docker exec pkcs11-tool --list-slots` reports nothing. Fix: confirm the §E4 `devices:` passthrough and `user: "0:0"` (§E1) are in the compose file, and that host pcscd is stopped and masked (§G3). `group_add` and the host udev rule are not part of the working path (see §E4), so do not chase them here.
2. **PIN file empty or unreadable** — `docker exec ls -la /run/secrets/hsm-pin` shows zero bytes or wrong perms. Fix: check the entrypoint script is executable and `STEP_CA_HSM_PIN` is in `.env`.
3. **Wrong serial in `ca.json.hsm`** — the URI `serial=...` must match `pkcs11-tool --list-token-slots` exactly. Fix and rebuild the volume (§G4 onwards).

If you have to roll back, the procedure is:

```bash
ccat ca down
docker volume rm ccat-ca_step-ca-data
# Re-populate from a backup of the Phase 1 volume — only possible if
# you snapshotted it before Stage G3. Otherwise: re-run Phase 1
# auto-init by reverting the compose changes and bringing up.
```

(In practice, "roll back" means "re-run Phase 1 with the throwaway root and try Stage G again next window." The Phase 1 root in the committed `ca_trust/files/` has been overwritten in Stage B, so a true rollback also requires reverting that commit.)
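
The rollback above depends on a snapshot of the Phase 1 volume taken before §G3, and the playbook does not prescribe how to take one. One common pattern is to tar the volume from a throwaway container, sketched below. `snapshot_volume`, the archive naming, and the choice of the `alpine` image are assumptions for illustration, not an established part of this procedure.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo).
# Archives a named Docker volume into DEST as <volume>-<timestamp>.tgz.
# Run it after `ccat ca down` (§G2) and before the wipe in §G3, so nothing
# is writing to the volume while it is copied.
snapshot_volume() {
  local volume="$1" dest="$2" stamp
  # Docker bind mounts need an absolute host path.
  case "$dest" in
    /*) ;;
    *) echo "backup dir must be an absolute path" >&2; return 1 ;;
  esac
  stamp=$(date +%Y%m%d-%H%M%S)
  # The volume is mounted read-only; only the backup dir is writable.
  docker run --rm \
    -v "${volume}:/src:ro" \
    -v "${dest}:/backup" \
    alpine tar czf "/backup/${volume}-${stamp}.tgz" -C /src .
}
```

Example: `snapshot_volume ccat-ca_step-ca-data /root/ca-backups`. Treat the archive as sensitive: a Phase 1 volume presumably holds file-backed key material for the throwaway CA, so keep it off shared storage and delete it once the soak in §I has passed.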
### §G6 — Re-add provisioners

```bash
DEX_STEPCA_CLIENT_SECRET="$(ccat secrets show vault_dex_stepca_client_secret --reveal 2>/dev/null | tail -1)" \
OIDC_ADMIN_EMAIL="buchbend@ph1.uni-koeln.de" \
  ./step-ca/provisioners-bootstrap.sh

ccat ca restart step-ca   # required: step-ca caches ca.json at startup
```

This re-creates the six-provisioner set via direct `jq` edits to
ca.json: `CCAT-GitHub` (OIDC), `prod-services`, `staging-services`,
`service-accounts` (JWK), `acme` (ACME), `sshpop` (SSHPOP). Dex's
`dex-data` volume is unchanged, so the static step-ca client secret
still works.

> **Why not `ccat ca provisioner sync`?** That command runs the older
> `provisioners-add.sh`, which uses `step ca provisioner add`.
> step-cli 0.30.2 has no flag combination that makes that work
> against a Phase-2 ca.json (admin-API requires `--ca-url` + admin
> auth; `--offline` doesn't exist; `--ca-config` doesn't trigger
> offline editing). See lessons-learned §2. The bootstrap script
> bypasses step-cli for the add step and uses jq directly, which is
> the only path that works in this version.

### §G7 — Verify external endpoints

```bash
curl -sI https://ca.ccat.uni-koeln.de/health
curl -s https://auth.ccat.uni-koeln.de/.well-known/openid-configuration | jq .issuer
```

`/health` should respond. In the Phase 1 layout it sits behind
nginx-proxy with LE on 443; see the CA architecture doc § "Why step-ca
is NOT behind nginx-proxy" for the Phase 3 cutover plan once the uni
firewall opens 9000. Dex `issuer` must be exactly
`https://auth.ccat.uni-koeln.de`.

### §G8 — Issue a test cert from inside the box, prove HSM signing

```bash
# Provisioner count via direct ca.json read (the API path needs --ca-url)
docker exec ccat-ca-step-ca-1 jq -r '.authority.provisioners[] | "\(.type)\t\(.name)"' /home/step/config/ca.json
# Should list 6 provisioners

# Issue a test cert via the prod-services JWK provisioner;
# the entire signing path goes through the HSM intermediate.
docker exec ccat-ca-step-ca-1 step ca certificate proof.test /tmp/t.crt /tmp/t.key \
  --provisioner prod-services \
  --ca-url https://localhost:9000 \
  --root /home/step/certs/root_ca.crt \
  --provisioner-password-file /home/step/secrets/password

# Verify chain: cert -> intermediate (HSM-backed) -> root
docker exec ccat-ca-step-ca-1 openssl verify \
  -CAfile /home/step/certs/root_ca.crt \
  -untrusted /home/step/certs/intermediate_ca.crt \
  /tmp/t.crt
# Want: "/tmp/t.crt: OK"

# Cleanup
docker exec ccat-ca-step-ca-1 rm -f /tmp/t.crt /tmp/t.key
```

To rule out *any* doubt that the HSM is actually being used (rather
than a file-backed key with the same public part), check the runtime
topology:

```bash
# step-ca process has the PKCS#11 module loaded, no SoftHSM:
docker exec ccat-ca-step-ca-1 sh -c 'cat /proc/$(pidof step-ca)/maps | grep -E "opensc-pkcs11|softhsm"'

# pcscd holds the actual USB device file open:
docker exec ccat-ca-step-ca-1 sh -c 'ls -l /proc/$(pidof pcscd)/fd/' | grep -E 'bus/usb|ccid'
```

The first command should show `opensc-pkcs11` mappings and nothing
matching `softhsm`. The conjunction of "ca.json points only at
PKCS#11", "opensc-pkcs11.so mmap'd into step-ca's address space",
"pcscd holding the USB FD", and "issued cert chains back to the
intermediate whose public key was bit-equal to HSM `id=01` in §A4" is
conclusive proof of HSM-backed signing without needing physical access
to the dongle.

---

## Stage H — Test cohort re-bootstraps

This is the rehearsed checkpoint. Every future root-rotation event
depends on this command working cleanly across the team.

### §H1 — Each member runs

```bash
step ca bootstrap --force \
  --ca-url https://ca.ccat.uni-koeln.de \
  --fingerprint
```

The value for `--fingerprint` is the one in the announcement they
received in §G1. They **must** visually compare what step-cli prints
against the fingerprint in the announcement before pressing `y`. **If
it doesn't match, stop and investigate. Do not click through.**

### §H2 — Each member tests `step ssh login`

```bash
step ssh login
ssh input-a.data.ccat.uni-koeln.de
```

End-to-end flow: browser → Dex → GitHub OAuth → `ccatobs/datacenter`
team check → cert lands in ssh-agent → ssh into a managed host
succeeds.

If their bootstrap completes but `step ssh login` fails with
`x509: certificate signed by unknown authority`, they're hitting the
trust-bundle issue from Phase 1 (LE cert on 443 vs CCAT root in
`~/.step`). The workaround is documented in
`docs/source/ca-provisioner-management.md` § "Troubleshooting: x509
certificate signed by unknown authority". The Phase 3 fix is opening
TCP 9000.

### §H3 — Retrospective

Capture any snags in `docs/source/ceremony/` as
`lessons-learned-cutover-YYYY-MM-DD.md`. The first real root rotation
in 5–10 years will follow the same procedure.

---

## §I — After the cutover settles

- [ ] Watch the CA for at least one week. No new production services
  are migrated yet; this is soak time only.
- [ ] Watch for HSM/USB/udev surprises across container restarts:
  `docker restart ccat-ca-step-ca-1` and confirm step-ca comes back
  cleanly without operator intervention.
- [ ] Watch LE auto-renewal tick over for `ca.ccat.uni-koeln.de` and
  `auth.ccat.uni-koeln.de`.
- [ ] Confirm provisioner counts hold across restarts. Use the direct
  ca.json read (admin-API path needs `--ca-url` and is blocked by
  `enableAdmin: false`):
  `docker exec ccat-ca-step-ca-1 jq '.authority.provisioners | length' /home/step/config/ca.json`
  should report six entries, every time.
- [ ] When the soak passes, mark Phase 2 done in
  `step-ca/COMMISSIONING-TODO.md` and start Phase 3 (production
  service rollout).

### Day-2 ops — token contention

Only one libusb client can hold a device interface at a time. With the
container's pcscd holding the dongle, host-side `sudo pkcs11-tool`
etc. will fail. The right workflow for ad-hoc HSM diagnostics on the
host is:

```bash
ccat ca down                                    # release the device
sudo systemctl unmask pcscd.service pcscd.socket
sudo systemctl start pcscd.socket               # socket-activates
sudo pkcs11-tool --module /usr/lib64/pkcs11/opensc-pkcs11.so --list-slots
# ... do whatever ...
sudo systemctl stop pcscd.service pcscd.socket
sudo systemctl mask pcscd.service pcscd.socket
ccat ca up                                      # container takes the device back
```

Treat host-side diagnostics as *brief, scheduled* operations: every
minute the container is down, certs aren't issuing.

---

## Roll-back / what-if

| Symptom | Likely cause | Action |
|---|---|---|
| §A `pkcs11-tool` shows no slot | Dongle not seated, or udev rule hasn't reloaded | Reseat, `udevadm control --reload-rules && udevadm trigger`, retry |
| §A4 diff non-empty | Wrong dongle in server, or tampered export USB | **Stop.** Treat as integrity failure; do not proceed. Plan a fresh ceremony if needed |
| §G5 `module not found` in container logs | `opensc-pkcs11` not in image | Confirm Stage D image build (should be `0.30.2-hsm` not `0.30.2`); check `docker exec ccat-ca-step-ca-1 ls /usr/lib/x86_64-linux-gnu/opensc-pkcs11.so` |
| §G5 `could not find PKCS#11 token` | Most often: host pcscd is running and holding the dongle, OR container is starting as UID 1000 instead of root and pcscd-in-container can't claim USB ioctls | Mask host pcscd (§G3); confirm compose has `user: "0:0"`; confirm entrypoint runs `/usr/sbin/pcscd --disable-polkit` |
| §G5 polkit-related rejection in pcscd logs | Entrypoint missing `--disable-polkit` flag | Update entrypoint script per §E5; rebuild container (entrypoint is bind-mounted, so a `ccat ca restart` picks up changes after `git pull`) |
| §G5 `token not present` | Wrong serial in URI, or HSM came back on a different bus after a host reboot | Check `pkcs11-tool --list-token-slots`, fix URI, rebuild volume |
| §G6 step-cli `requires the '--ca-url' flag` | You ran `ccat ca provisioner sync` (the old `provisioners-add.sh` path), which doesn't work with `enableAdmin: false` in step-cli 0.30.2 | Use `step-ca/provisioners-bootstrap.sh` per §G6 instead |
| §H1 `step ca bootstrap` fingerprint mismatch | **Critical.** Either the bootstrap is hitting the wrong host, or someone has substituted an attacker-issued cert | **Stop.** Do not click through. Contact ops |
| Test cohort can bootstrap, can't `step ssh login` | Trust-bundle hack needed (LE on 443, CCAT root in step trust) | See troubleshooting in `ca-provisioner-management.md` |

---

## Appendix — files this playbook touches

New files (committed as part of the Stage E PR):

- `step-ca/Dockerfile.hsm`
- `step-ca/step-ca-hsm-entrypoint.sh`
- `step-ca/ca.json.hsm`

Edited files:

- `docker-compose.ca.yml` — service `step-ca` (build, env, devices, tmpfs, entrypoint, volume cleanup)
- `ansible/host_vars/input-b/hsm.yml` — `_hsm_enforce_verify: true`
- `ansible/roles/ca_trust/files/root_ca.crt` — overwritten with ceremony output
- `ansible/roles/ca_trust/files/ssh_user_ca.pub` — overwritten
- `ansible/roles/ca_trust/files/ssh_host_ca.pub` — overwritten
- `ansible/vars_application_schema.yml` — adds `vault_step_ca_hsm_pin` (via `ccat secrets add`)

Deleted files:

- `step-ca/softhsm2.conf` — vestigial Phase 1, never used in earnest

Volumes:

- `ccat-ca_step-ca-data` — wiped in §G3, re-populated in §G4
- `ccat-ca_softhsm-tokens` — declared in compose but empty; can be `docker volume rm`'d after Stage E lands
- `ccat-ca_dex-data` — **not touched**; Dex state survives the cutover
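
The six-provisioner check used in §G8 and during the §I soak compares a count by eye. As a minimal sketch, the same check can compare the provisioner *names* against the set listed in §G6. The `check_provisioners` helper and the `EXPECTED_PROVISIONERS` variable are hypothetical and not part of the `ccat` tooling; the input is assumed to be the `TYPE<TAB>NAME` lines produced by the §G8 `jq` command.

```shell
#!/bin/sh
# Hypothetical helper (not part of ccat): reads "TYPE<TAB>NAME" lines on
# stdin and checks that the NAME column is exactly the §G6 provisioner set.

# Names taken from §G6; adjust here if the provisioner set ever changes.
EXPECTED_PROVISIONERS="CCAT-GitHub
prod-services
staging-services
service-accounts
acme
sshpop"

check_provisioners() {
    # Keep only the NAME column, drop blank lines, sort for a stable compare.
    actual="$(cut -f2 | sed '/^$/d' | LC_ALL=C sort)"
    expected="$(printf '%s\n' "$EXPECTED_PROVISIONERS" | LC_ALL=C sort)"

    if [ "$actual" = "$expected" ]; then
        echo "OK: provisioner set matches"
        return 0
    fi

    # Report both lists on stderr so the operator can see what differs.
    echo "MISMATCH: provisioner set differs from expected" >&2
    echo "expected:" >&2
    printf '%s\n' "$expected" | sed 's/^/  /' >&2
    echo "actual:" >&2
    printf '%s\n' "$actual" | sed 's/^/  /' >&2
    return 1
}

# Assumed usage on input-b (same jq filter as §G8):
#   docker exec ccat-ca-step-ca-1 jq -r \
#     '.authority.provisioners[] | "\(.type)\t\(.name)"' \
#     /home/step/config/ca.json | check_provisioners
```

Comparing names rather than a bare count also catches the case where a provisioner is missing and a stray one has taken its place. A non-zero exit status makes the helper usable from a cron job or a post-restart hook during the soak week.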