Lessons learned — Phase 2 HSM cutover 2026-05-04

This document captures the surprises and corrections from CCAT’s first HSM cutover (Phase 2 — switching step-ca’s signing keys from the file-backed Phase 1 throwaway to the real Nitrokey HSM 2 dongle on input-b). The cutover playbook has been updated to bake these in, so a future redeploy following the playbook should not re-encounter them. This page exists so future operators understand the why behind the playbook’s odd-looking choices — and in particular, why three commits (c30459f, c4b4a71, 9ef17d3) encode different “right answers” depending on which architectural hypothesis was current at the time.

Run details:

  • Date: 2026-05-04

  • Operator: buchbend, single seat — the second-pair-of-eyes role was filled by an AI assistant that got the architecture wrong twice before getting it right.

  • HSM in production: Nitrokey HSM 2, serial DENK0403253, internal USB on input-b

  • CA software: step-ca pinned at 0.30.2-hsm (upstream image), step-cli at 0.30.2 (matching ceremony’s pinned version)

1. The architectural multi-pivot

The biggest single lesson: the “right” architecture for accessing a USB HSM from a containerised step-ca on Linux is not the architecture the playbook prescribed. It took three pivots to find what works, each rejected for valid reasons that turned out to also apply to the eventual right answer.

Path A — libusb-direct (playbook §E5 as originally written): Container at user: 1000:1000, pass /dev/bus/usb through, use group_add for plugdev. step-ca → opensc-pkcs11.so → libusb directly. Failed because OpenSC’s opensc-pkcs11.so on Linux/Debian is built only against libpcsclite — there is no libusb-direct reader subsystem. Without a pcscd to talk to, pkcs11-tool --list-slots returns “No slots” inside the container even when /dev/bus/usb is visible and perms are correct. This was rejected after spending hours fixing the udev rule + GID alignment + pcscd-on-host contention, only to discover the access path can’t work without pcscd.

Path B — host pcscd via Unix-socket bind-mount: Mount /run/pcscd:/run/pcscd into the container; container’s libpcsclite talks to host pcscd. Failed because RHEL host’s default polkit policy denies org.debian.pcsc-lite.access_pcsc to any client that isn’t the active console user or root. Adding a polkit rule granting access by subject.isInGroup("plugdev") might work — modern polkit checks supplementary groups via getgrouplist() — but the semantics are version-dependent and a heredoc-written polkit rule got mangled silently on terminal paste, so it was never validated end to end. The architectural objection (host pcscd as a SPOF coupled to the CA’s lifecycle) survived.

Path C — in-container pcscd as root, with --disable-polkit, then drop privileges to step (UID 1000) for step-ca: This is what works. The upstream smallstep/step-ca:0.30.2-hsm image ships both opensc-pkcs11 and pcscd. We start the container as root (compose user: "0:0"), let our entrypoint write the HSM PIN to tmpfs, run pcscd --disable-polkit (so the container’s missing polkit authority doesn’t block clients), wait for the socket, then runuser -u step and exec step-ca. step-ca runs non-root; pcscd retains root for the libusb USB ioctls (which need it regardless of device-node file perms).
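A minimal sketch of what a Path C entrypoint could look like, following the sequence described above. It is not the real entrypoint: the HSM_PIN variable, the /run/hsm/pin location, and the /run/pcscd/pcscd.comm socket path (pcsc-lite's usual default) are assumptions for illustration.

```shell
#!/bin/sh
# Sketch of a Path C entrypoint. Paths and variable names are assumptions.

PCSCD_SOCKET="${PCSCD_SOCKET:-/run/pcscd/pcscd.comm}"   # assumed default socket path
PIN_FILE="${PIN_FILE:-/run/hsm/pin}"                    # hypothetical tmpfs location

# Poll until a Unix socket exists at $1, giving up after $2 seconds.
wait_for_socket() {
    sock="$1"
    remaining="$2"
    while [ ! -S "$sock" ]; do
        if [ "$remaining" -le 0 ]; then
            return 1
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# Write the PIN ($1) to PIN_FILE with owner-only permissions.
write_pin() {
    ( umask 077; printf '%s' "$1" > "$PIN_FILE" )
}

main() {
    # HSM_PIN is a hypothetical variable; the real PIN source may differ.
    write_pin "${HSM_PIN:?HSM_PIN must be set}" || exit 1
    chown 1000:1000 "$PIN_FILE" || exit 1      # readable by step (UID 1000) only
    pcscd --disable-polkit                     # self-daemonises and stays root
    if ! wait_for_socket "$PCSCD_SOCKET" 15; then
        echo "pcscd socket never appeared at $PCSCD_SOCKET" >&2
        exit 1
    fi
    # Drop to the unprivileged user and replace this shell with step-ca.
    exec runuser -u step -- step-ca /home/step/config/ca.json
}

# A real entrypoint would end with:  main "$@"
```

Because pcscd daemonises itself, nothing here supervises it; if it dies, the healthcheck is what notices.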

Why C wasn’t tried first: the architect (and I) rejected it on “process supervision + privilege drop complexity”, which was a fair concern. But the actual implementation is ~10 lines of shell with no daemon supervision (pcscd self-daemonises and just stays up), and runuser makes the privilege drop a one-liner. The complexity worry was overstated.

Playbook decision: §E1, §E4, §E5 rewritten for Path C. devices: /dev/bus/usb retained from §E4 (pcscd-in-container needs it). group_add: ["plugdev"] dropped — pcscd-as-root doesn’t need plugdev.

The udev rule + plugdev infrastructure on the host (commits 71323f1, c30459f) is now load-bearing only for host operator tooling (when an operator briefly stops the container’s pcscd to run sudo pkcs11-tool etc. on the host). Its header comment was written under the libusb-direct hypothesis and overstates the rule’s importance — re-read it as describing operator ergonomics, not container access.

2. step-cli 0.30.2 has no working offline provisioner add

The playbook §G6 prescribed the chain ccat ca provisioner sync → provisioners-add.sh → step ca provisioner add for re-adding the six provisioners after the volume wipe. With Phase 2’s enableAdmin: false (intentional, per §F), step-cli is forced into admin-API mode and rejects every call with the error “requires the '--ca-url' flag”. We tried --offline (does not exist in 0.30.2 despite older docs referencing one) and --ca-config (still requires admin auth). There is no flag combination that makes step-cli edit ca.json offline in this version.

Work-around: bypass step-cli for provisioner adds entirely. Use jq (present in the upstream image) to write provisioner JSON directly into ca.json’s authority.provisioners array. step-cli is still used for what only it can do offline — generating JWK keypairs via step crypto jwk create. New script: step-ca/provisioners-bootstrap.sh.
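The jq approach can be sketched as follows. This is not the contents of provisioners-bootstrap.sh; the function name and its two-file interface are hypothetical, and the provisioner object is assumed to have been assembled already (including the JWK material from step crypto jwk create).

```shell
# Append one provisioner object to ca.json, replacing any entry with the same name.
# $1: path to ca.json
# $2: path to a file holding a single provisioner JSON object
add_provisioner() {
    ca_json="$1"
    prov_file="$2"
    tmp="$(mktemp "${ca_json}.XXXXXX")" || return 1
    # --slurpfile binds $p to an array of the JSON values in the file, hence $p[0].
    if jq --slurpfile p "$prov_file" '
          .authority.provisioners =
            ((.authority.provisioners // [])
             | map(select(.name != $p[0].name))
             + [$p[0]])
        ' "$ca_json" > "$tmp"; then
        # mv keeps the swap atomic; note that mktemp's 0600 mode carries over.
        mv "$tmp" "$ca_json"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Replacing by name makes the function safe to re-run, which matters when a bootstrap is interrupted halfway through six provisioners.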

Playbook decision: §G6 now points at provisioners-bootstrap.sh. provisioners-add.sh is left in place as a frozen reference of what the step-cli path attempted to be; a follow-up commit will fold it into the bootstrap or delete it.

3. enableSSHCA defaults off in Phase 2

step ssh login failed with sshCA is disabled for oidc provisioner 'CCAT-GitHub'. step-ca’s authority.claims.enableSSHCA claim defaults to false. Phase 1 flipped it on via DOCKER_STEPCA_INIT_SSH=true at auto-init time. Phase 2 drops all DOCKER_STEPCA_INIT_* (§E2), so the default applies.

Playbook decision: step-ca/ca.json.hsm now sets authority.claims.enableSSHCA = true. The top-level ssh: block (userKey/hostKey on HSM id=02/id=03) was already correct; this completes the pair.
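A cheap guard against this regression is to assert the claim before starting the CA. The sketch below assumes jq is available and uses a hypothetical function name; it is not part of the playbook.

```shell
# Exit 0 only if the SSH CA claim is explicitly true in the given ca.json.
# A missing claim compares as null == true, so the default-off case fails too.
ssh_ca_enabled() {
    jq -e '.authority.claims.enableSSHCA == true' "$1" >/dev/null
}
```

For example, ssh_ca_enabled step-ca/ca.json.hsm || echo "enableSSHCA is not set" >&2 could run as a pre-deploy check.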

4. Volume pre-population needs a running helper container

§G4 originally used docker create --rm -v ... alpine sleep 300, then docker cp to populate, then docker kill. Two problems:

  • docker exec doesn’t work on a created-but-not-started container, so we can’t mkdir -p parent directories before cp. The first docker cp leaf-file:nested/dir/file returns mixed “Successfully copied / Could not find file” output and may leave the volume in a bad state.

  • A fresh volume at /home/step is root:root mode 0755 — when step-ca starts as UID 1000 it can’t create /home/step/db/ for badger and dies.

Playbook decision: §G4 rewritten to use docker run -d --rm (running container), docker exec mkdir -p for the parent dirs, copy the four files, then docker exec chown -R 1000:1000 /home/step before tearing down the helper.
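The rewritten §G4 flow can be sketched as a single function. The function name and its argument convention are hypothetical; the alpine image, the /home/step mount point, and the 1000:1000 ownership come from the text above.

```shell
# Populate a named volume through a short-lived *running* helper container.
# $1: volume name
# then pairs of: <host file> <absolute destination path under /home/step>
populate_volume() {
    volume="$1"
    shift
    helper="$(docker run -d --rm -v "$volume:/home/step" alpine sleep 300)" || return 1
    status=0
    while [ "$#" -ge 2 ]; do
        src="$1"
        dest="$2"
        shift 2
        # Parent directories must exist before docker cp, hence the running helper.
        docker exec "$helper" mkdir -p "$(dirname "$dest")" || { status=1; break; }
        docker cp "$src" "$helper:$dest" || { status=1; break; }
    done
    if [ "$status" -eq 0 ]; then
        # A fresh volume is root:root; step-ca (UID 1000) must be able to create db/.
        docker exec "$helper" chown -R 1000:1000 /home/step || status=1
    fi
    docker kill "$helper" >/dev/null    # --rm removes the container once it stops
    return "$status"
}
```

The helper is torn down even when a copy fails, so a half-finished run does not leave a container holding the volume.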

5. Host pcscd must be stopped post-cutover

Host pcscd and in-container pcscd both want to claim the same USB device via libusb. The kernel allows only one libusb client per device interface. Whichever pcscd starts first wins; the other fails silently and the loser’s PKCS#11 stack reports “No slots.”

Playbook decision: §G5 now explicitly stops + masks host pcscd (systemctl mask pcscd.service pcscd.socket). Operator workflows that need host-side sudo pkcs11-tool must first ccat ca down, unmask + start pcscd briefly, do the diagnostic, then re-mask and ccat ca up. This is captured in the “Day-2 ops” runbook in cutover-playbook.md § “Day-2 ops — token contention”.
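The stop/mask step and the day-2 hand-over can be sketched as two small functions. Both names are hypothetical, and starting pcscd.socket (rather than pcscd.service directly) assumes the host uses socket activation; this is an illustration, not the runbook itself.

```shell
# Post-cutover: keep host pcscd away from the token.
mask_host_pcscd() {
    sudo systemctl stop pcscd.service pcscd.socket &&
        sudo systemctl mask pcscd.service pcscd.socket
}

# Day-2: run one host-side diagnostic ("$@") while the CA is down,
# then always hand the token back to the container.
with_host_pcscd() {
    ccat ca down || return 1
    rc=0
    {
        sudo systemctl unmask pcscd.service pcscd.socket &&
            sudo systemctl start pcscd.socket &&
            "$@"
    } || rc=$?
    # Re-mask and restart the CA even if the diagnostic failed.
    sudo systemctl stop pcscd.service pcscd.socket
    sudo systemctl mask pcscd.service pcscd.socket
    ccat ca up || return 1
    return "$rc"
}
```

Usage would look like with_host_pcscd sudo pkcs11-tool --list-slots. Wrapping the diagnostic this way means the re-mask cannot be forgotten, which is the failure mode that leads back to “No slots” in the container.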

6. Plumbing bugs that ate real hours

These are not architectural — just shell-level mistakes — but they were load-bearing on operator productivity and the playbook now guards against each:

  • docker exec without -i silently drops stdin. A function that piped JSON to docker exec ... sh -c 'cat > /tmp/file' ended up writing an empty file because docker doesn’t forward stdin without -i. Six provisioners “added” with cheerful output; ca.json unchanged. Fixed in 9ef17d3.

  • docker cp - reads stdin as a tar stream, not raw bytes. An attempt to pipe echo "unused-but-required" | docker cp - CONTAINER:/path failed with “archive/tar: invalid tar header.” Use a temp file + docker cp <tmpfile> instead. Fixed in §G4.

  • Heredoc paste mangling for polkit rule. Pasting a multi-line heredoc into a remote SSH session via sudo tee FILE <<'EOF' ... EOF silently failed (file wasn’t created), then a manual emacs re-create concatenated lines onto a single line in places. The rule still parsed as JavaScript (whitespace-insensitive), but the experience taught us: write content to /tmp/foo as the user, then sudo install it in place. Or use printf '%s\n' '...' '...' | sudo tee to avoid heredoc-paste pitfalls.

  • udevd caches name→GID resolutions at boot time. When we pinned plugdev to GID 46 after udevd had already started and cached the auto-assigned old GID, the udev rule kept setting the device node to the stale numeric GID. systemctl restart systemd-udevd flushes the cache; follow it with udevadm trigger --action=add --subsystem-match=usb so the rule is re-evaluated for the device.

  • Default healthcheck timeout (10s) is too short for the pcscd→libusb→CCID round-trip after restart. Bumped to 30s in c0ce12e.
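Three of the guards above fit in a few lines of shell each. The function names are hypothetical and the 0644 root:root mode is an assumption; treat this as a sketch of the patterns rather than the code in 9ef17d3.

```shell
# Pipe stdin into a file inside a container.
# The -i flag is what keeps stdin attached; without it the file ends up empty.
# $1: container   $2: destination path inside the container
write_into_container() {
    docker exec -i "$1" sh -c 'cat > "$1"' sh "$2"
}

# Install root-owned content without pasting a heredoc into sudo tee:
# write stdin to a temp file as the current user, then sudo install it in place.
# $1: destination path
install_root_file() {
    tmp="$(mktemp)" || return 1
    cat > "$tmp"
    rc=0
    sudo install -m 0644 -o root -g root "$tmp" "$1" || rc=$?
    rm -f "$tmp"
    return "$rc"
}

# After changing a group's numeric GID: flush udevd's cache, then
# re-evaluate USB devices so the rule applies the new GID.
refresh_udev_gid() {
    sudo systemctl restart systemd-udevd &&
        sudo udevadm trigger --action=add --subsystem-match=usb
}
```

After write_into_container, it is worth checking the result with something like docker exec CONTAINER test -s /tmp/file, since the original failure produced cheerful output and an empty file.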

7. Verifying HSM-backed signing without physical access

The dongle is in the server room; we can’t watch the LED blink. The non-destructive proof is the conjunction of these checks:

  1. jq '{key,kms,ssh}' /home/step/config/ca.json shows key=pkcs11:id=01, kms.type=pkcs11, ssh.userKey=pkcs11:id=02, ssh.hostKey=pkcs11:id=03. The CA is configured to use PKCS#11 only.

  2. grep opensc-pkcs11 /proc/$(pidof step-ca)/maps shows the PKCS#11 module mmap’d into step-ca’s address space: the binary actually loaded the library, rather than merely having a path configured.

  3. ls -l /proc/$(pidof pcscd)/fd/ shows pcscd holding an open file descriptor on a /dev/bus/usb/<n>/<m> device. The actual USB handle is open, in this process, right now.

  4. A test cert issued via the prod-services JWK provisioner verifies against the intermediate, and the intermediate’s public key already proved bit-equal to HSM id=01 during §A4.

There is no way to pass all four checks without an HSM in the chain.
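The first three checks can be scripted, as in the sketch below. The function names are hypothetical, checks 2 and 3 are assumed to run inside the container with enough privilege to read the other process’s /proc entries, and check 4 is left out because it depends on the issuance workflow.

```shell
# Check 1: ca.json points every signing key at PKCS#11.
# $1: path to ca.json
check_ca_json() {
    jq -e '
        (.key         | startswith("pkcs11:")) and
        (.kms.type    == "pkcs11")             and
        (.ssh.userKey | startswith("pkcs11:")) and
        (.ssh.hostKey | startswith("pkcs11:"))
    ' "$1" >/dev/null
}

# Check 2: the PKCS#11 module is mapped into the running step-ca process.
check_module_loaded() {
    grep -q 'opensc-pkcs11' "/proc/$(pidof -s step-ca)/maps"
}

# Check 3: pcscd holds an open file descriptor on a USB device node.
check_usb_handle() {
    ls -l "/proc/$(pidof -s pcscd)/fd/" | grep -q '/dev/bus/usb/'
}
```

Each function returns non-zero on failure, so check_ca_json /home/step/config/ca.json && check_module_loaded && check_usb_handle gives a single pass/fail for the scriptable part of the proof.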

8. Open follow-ups (not blocking; tracked as GitHub issues)

  • JWK provisioner password security regression. Bootstrap script used the Phase-1-style dummy password file (unused-but-required) as the JWK provisioner password. JWK provisioner passwords are the auth gate for cert issuance via those provisioners — anyone who knows the password can issue certs. Phase 1 used a vault-backed STEP_CA_PASSWORD. Need to generate per-provisioner secrets (or pull a vault-backed shared secret) and rotate.

  • provisioners-add.sh consolidation. Two scripts now exist for the same task, one broken. Either delete the old or have it call the new. Update ccat ca provisioner sync to invoke the working path. Update §G6 of the playbook to match.

  • Ansible hsm_host role completeness for redeploy. A fresh input-b should be one make play-hsm-host away from “ready for ccat ca up”. The role currently leaves three things to manual operator steps: stop+mask host pcscd, deploy the polkit rule (now optional but harmless), install jq if not present. Bake those in.

  • Phase 3: open TCP 9000 at the firewall. Drops the trust-bundle workaround and lets step-cli verify CCAT root natively. Out of scope for system-integration (firewall is uni IT).