# Lessons learned: Phase 2 HSM cutover 2026-05-04

This document captures the surprises and corrections from CCAT's first HSM cutover (Phase 2: switching step-ca's signing keys from the file-backed Phase 1 throwaway to the real Nitrokey HSM 2 dongle on input-b). The [cutover playbook](cutover-playbook.md) has been updated to bake these in, so a future redeploy following the playbook should not re-encounter them. This page exists so future operators understand the *why* behind the playbook's odd-looking choices, and in particular why three commits (`c30459f`, `c4b4a71`, `9ef17d3`) encode different "right answers" depending on which architectural hypothesis was current at the time.

Run details:

- **Date:** 2026-05-04
- **Operator:** buchbend, single seat. The second-pair-of-eyes role was filled by an AI assistant that got the architecture wrong twice before getting it right.
- **HSM in production:** Nitrokey HSM 2, serial `DENK0403253`, internal USB on input-b
- **CA software:** step-ca pinned at `0.30.2-hsm` (upstream image), step-cli at 0.30.2 (matching the ceremony's pinned version)

## 1. The architectural multi-pivot

The biggest single lesson: **the "right" architecture for accessing a USB HSM from a containerised step-ca on Linux is not the architecture the playbook prescribed**. It took three attempts to find what works: the first two failed outright, and the third had initially been rejected over a concern that turned out to be overstated.

**Path A: libusb-direct (playbook §E5 as originally written).** Container at `user: 1000:1000`, pass `/dev/bus/usb` through, use group_add for plugdev. step-ca → opensc-pkcs11.so → libusb directly. *Failed because* OpenSC's `opensc-pkcs11.so` on Linux/Debian is built only against `libpcsclite`; there is no libusb-direct reader subsystem. Without a pcscd to talk to, `pkcs11-tool --list-slots` returns "No slots" inside the container even when `/dev/bus/usb` is visible and perms are correct.
*This was rejected after* spending hours fixing the udev rule + GID alignment + pcscd-on-host contention, only to discover the access path can't work without pcscd.

**Path B: host pcscd via Unix-socket bind-mount.** Mount `/run/pcscd:/run/pcscd` into the container; the container's libpcsclite talks to host pcscd. *Failed because* the RHEL host's default polkit policy denies `org.debian.pcsc-lite.access_pcsc` to any client that isn't the active console user or root. Adding a polkit rule granting access by `subject.isInGroup("plugdev")` *might* work (modern polkit checks supplementary groups via `getgrouplist()`), but the semantics are version-dependent and a heredoc-written polkit rule got mangled silently on terminal paste, so it was never validated end to end. The architectural objection (host pcscd as a SPOF coupled to the CA's lifecycle) survived.

**Path C: in-container pcscd as root, with `--disable-polkit`, then drop privileges to step (UID 1000) for step-ca.** This is what works. The upstream `smallstep/step-ca:0.30.2-hsm` image ships both opensc-pkcs11 and pcscd. We start the container as root (compose `user: "0:0"`), let our entrypoint write the HSM PIN to tmpfs, run `pcscd --disable-polkit` (so the container's missing polkit authority doesn't block clients), wait for the socket, then `runuser -u step` and exec step-ca. step-ca runs non-root; pcscd retains root for the libusb USB ioctls (which need it regardless of device-node file perms).

**Why C wasn't tried first:** the architect (and I) rejected it on "process supervision + privilege drop complexity", which was a fair concern. But the actual implementation is ~10 lines of shell with no daemon supervision (pcscd self-daemonises and just stays up), and `runuser` makes the privdrop a one-liner. The complexity worry was overstated.

**Playbook decision:** §E1, §E4, §E5 rewritten for Path C. `devices: /dev/bus/usb` retained from §E4 (pcscd-in-container needs it).
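
The Path C startup sequence can be sketched as a short entrypoint. This is a minimal illustration, not the entrypoint from the repository: the `HSM_PIN` variable, the `/run/hsm/pin` location, and the `RUN_ENTRYPOINT` guard are hypothetical names, the socket path assumes pcscd's default `/run/pcscd/pcscd.comm`, and it assumes `ca.json` is configured to read the PIN from that file.

```shell
#!/bin/sh
# Sketch only: HSM_PIN, PIN_FILE and RUN_ENTRYPOINT are hypothetical names.

PIN_FILE=${PIN_FILE:-/run/hsm/pin}                    # assumed tmpfs mount
PCSCD_SOCKET=${PCSCD_SOCKET:-/run/pcscd/pcscd.comm}   # pcscd's default socket path

# Poll once per second until the target path exists.
# Returns 1 if it has not appeared after the given number of seconds.
wait_for_path() {
    target=$1
    timeout=${2:-30}
    elapsed=0
    while [ ! -e "$target" ]; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

main() {
    set -eu
    # 1. Write the PIN to tmpfs with owner-only permissions, then hand it to step.
    ( umask 077; printf '%s' "$HSM_PIN" > "$PIN_FILE" )
    chown step "$PIN_FILE"
    # 2. Start pcscd as root; it daemonises itself, so no supervisor is needed.
    pcscd --disable-polkit
    # 3. Wait until pcscd has created its client socket.
    wait_for_path "$PCSCD_SOCKET" 30
    # 4. Drop privileges and replace this shell with step-ca.
    exec runuser -u step -- step-ca /home/step/config/ca.json
}

# Run only when explicitly enabled, so the file can be sourced for testing.
if [ "${RUN_ENTRYPOINT:-0}" = 1 ]; then
    main "$@"
fi
```

The `exec` in the last step means step-ca replaces the shell as the container's main process, while pcscd keeps running as root in the background.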
`group_add: ["plugdev"]` dropped, since pcscd-as-root doesn't need plugdev. The udev rule + plugdev infrastructure on the host (commits `71323f1`, `c30459f`) is now load-bearing only for **host operator tooling** (when an operator briefly stops the container's pcscd to run `sudo pkcs11-tool` etc. on the host). Its header comment was written under the libusb-direct hypothesis and **overstates** the rule's importance; re-read it as describing operator ergonomics, not container access.

## 2. step-cli 0.30.2 has no working offline `provisioner add`

The playbook §G6 prescribed `ccat ca provisioner sync` → `provisioners-add.sh` → `step ca provisioner add` for re-adding the six provisioners after the volume wipe. With Phase 2's `enableAdmin: false` (intentional, per §F), step-cli is forced into admin-API mode and rejects every call with `requires the '--ca-url' flag`. We tried `--offline` (does not exist in 0.30.2 despite older docs referencing one) and `--ca-config` (still requires admin auth). **There is no flag combination that makes step-cli edit ca.json offline in this version.**

Work-around: bypass step-cli for provisioner adds entirely. Use `jq` (present in the upstream image) to write provisioner JSON directly into ca.json's `authority.provisioners` array. step-cli is still used for what only it can do offline: generating JWK keypairs via `step crypto jwk create`. New script: `step-ca/provisioners-bootstrap.sh`.

**Playbook decision:** §G6 now points at `provisioners-bootstrap.sh`. `provisioners-add.sh` is left in place as a frozen reference of what the step-cli path attempted to be; a follow-up commit will fold it into the bootstrap or delete it.

## 3. `enableSSHCA` defaults off in Phase 2

`step ssh login` failed with `sshCA is disabled for oidc provisioner 'CCAT-GitHub'`. step-ca's `authority.claims.enableSSHCA` claim defaults to `false`. Phase 1 flipped it on via `DOCKER_STEPCA_INIT_SSH=true` at auto-init time.
Phase 2 drops all `DOCKER_STEPCA_INIT_*` (§E2), so the default applies.

**Playbook decision:** `step-ca/ca.json.hsm` now sets `authority.claims.enableSSHCA = true`. The top-level `ssh:` block (userKey/hostKey on HSM `id=02`/`id=03`) was already correct; this completes the pair.

## 4. Volume pre-population needs a *running* helper container

§G4 originally used `docker create --rm -v ... alpine sleep 300`, then `docker cp` to populate, then `docker kill`. Two problems:

- `docker exec` doesn't work on a created-but-not-started container, so we can't `mkdir -p` parent directories before cp. The first `docker cp leaf-file:nested/dir/file` returns mixed "Successfully copied / Could not find file" output and may leave the volume in a bad state.
- A fresh volume at `/home/step` is `root:root` mode 0755, so when step-ca starts as UID 1000 it can't create `/home/step/db/` for badger and dies.

**Playbook decision:** §G4 rewritten to use `docker run -d --rm` (running container), `docker exec mkdir -p` for the parent dirs, copy the four files, then `docker exec chown -R 1000:1000 /home/step` before tearing down the helper.

## 5. Host pcscd must be stopped post-cutover

Host pcscd and in-container pcscd both want to claim the same USB device via libusb. The kernel allows only one libusb client per device interface. Whichever pcscd starts first wins; the other fails silently, and its PKCS#11 stack reports "No slots."

**Playbook decision:** §G5 now explicitly stops + masks host pcscd (`systemctl mask pcscd.service pcscd.socket`). Operator workflows that need host-side `sudo pkcs11-tool` must first `ccat ca down`, unmask + start pcscd briefly, do the diagnostic, then re-mask and `ccat ca up`. This is captured in the "Day-2 ops" runbook in [`cutover-playbook.md`](cutover-playbook.md) § "Day-2 ops — token contention".

## 6. Plumbing bugs that ate real hours

These are not architectural, just shell-level mistakes, but they were *load-bearing on operator productivity* and the playbook now guards against each:

- **`docker exec` without `-i` silently drops stdin.** A function that piped JSON to `docker exec ... sh -c 'cat > /tmp/file'` ended up writing an empty file because docker doesn't forward stdin without `-i`. Six provisioners "added" with cheerful output; ca.json unchanged. Fixed in `9ef17d3`.
- **`docker cp -` reads stdin as a tar stream, not raw bytes.** An attempt to pipe `echo "unused-but-required" | docker cp - CONTAINER:/path` failed with "archive/tar: invalid tar header." Use a temp file + `docker cp ` instead. Fixed in §G4.
- **Heredoc paste mangling for polkit rule.** Pasting a multi-line heredoc into a remote SSH session via `sudo tee FILE <<'EOF' ... EOF` silently failed (file wasn't created), then a manual `emacs` re-create concatenated lines onto a single line in places. The rule still parsed as JavaScript (whitespace-insensitive), but the experience taught us: write content to `/tmp/foo` as the user, then `sudo install` it in place. Or use `printf '%s\n' '...' '...' | sudo tee` to avoid heredoc-paste pitfalls.
- **`udevd` caches name→GID resolutions at boot time.** When we pinned plugdev to GID 46 *after* udevd had already started and cached the auto-assigned old GID, the udev rule kept setting the device node to the stale numeric GID. `systemctl restart systemd-udevd` flushes the cache. We then re-fired the rule with `udevadm trigger --action=add --subsystem-match=usb` to re-evaluate the device.
- **Default healthcheck timeout (10s) is too short for the pcscd→libusb→CCID round-trip after restart.** Bumped to 30s in `c0ce12e`.

## 7. Verifying HSM-backed signing without physical access

The dongle is in the server room; we can't watch the LED blink. The non-destructive proof is the conjunction of these checks:

1. `jq '{key,kms,ssh}' /home/step/config/ca.json` shows `key=pkcs11:id=01`, `kms.type=pkcs11`, `ssh.userKey=pkcs11:id=02`, `ssh.hostKey=pkcs11:id=03`. The CA is *configured* to use PKCS#11 only.
2. `cat /proc/$(pidof step-ca)/maps | grep opensc-pkcs11` shows the PKCS#11 module mmap'd into step-ca's address space: the binary actually loaded the library, not just configured a path.
3. `ls -l /proc/$(pidof pcscd)/fd/` shows pcscd holding an open file descriptor on a `/dev/bus/usb//` device. The actual USB handle is open, in this process, right now.
4. A test cert issued via the prod-services JWK provisioner verifies against the intermediate, and the intermediate's public key already proved bit-equal to HSM `id=01` during §A4.

There is no way to fake all four without an HSM in the chain.

## 8. Open follow-ups (not blocking; tracked as GitHub issues)

- **JWK provisioner password security regression.** The bootstrap script used the Phase-1-style dummy password file (`unused-but-required`) as the JWK provisioner password. JWK provisioner passwords are the auth gate for cert issuance via those provisioners: anyone who knows the password can issue certs. Phase 1 used a vault-backed `STEP_CA_PASSWORD`. We need to generate per-provisioner secrets (or pull a vault-backed shared secret) and rotate.
- **`provisioners-add.sh` consolidation.** Two scripts now exist for the same task, one broken. Either delete the old one or have it call the new one. Update `ccat ca provisioner sync` to invoke the working path. Update §G6 of the playbook to match.
- **Ansible `hsm_host` role completeness for redeploy.** A fresh input-b should be one `make play-hsm-host` away from "ready for `ccat ca up`". The role currently leaves three things to manual operator steps: stop+mask host pcscd, deploy the polkit rule (now optional but harmless), install jq if not present. Bake those in.
- **Phase 3: open TCP 9000 at the firewall.** Drops the trust-bundle workaround and lets step-cli verify CCAT root natively.
Out of scope for system-integration (firewall is uni IT).
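
For the JWK provisioner password follow-up, one possible starting point is a helper that writes a fresh random password file per provisioner. This is a sketch rather than a decision: the function name, the `.pass` suffix, and the directory layout are hypothetical, and it does not cover re-encrypting the existing JWK keys or any vault integration.

```shell
# Sketch only: function name, file suffix and directory layout are hypothetical.

# Write a fresh 32-byte random secret (base64-encoded, no trailing newline)
# to DIR/NAME.pass with owner-only permissions. Fails if the file ends up empty.
gen_provisioner_secret() {
    dir=$1
    name=$2
    out="$dir/$name.pass"
    # Remove any previous secret so the new file is created under umask 077.
    rm -f "$out"
    ( umask 077; head -c 32 /dev/urandom | base64 | tr -d '\n' > "$out" )
    [ -s "$out" ]
}
```

It would be called once per provisioner name, for example `gen_provisioner_secret /some/secrets/dir prod-services`, and the resulting file could then be handed to `step crypto jwk create` through its `--password-file` flag.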