Lessons learned: Phase 2 HSM cutover 2026-05-04
This document captures the surprises and corrections from CCAT’s first
HSM cutover (Phase 2 — switching step-ca’s signing keys from the
file-backed Phase 1 throwaway to the real Nitrokey HSM 2 dongle on
input-b). The cutover playbook has been updated
to bake these in, so a future redeploy following the playbook should
not re-encounter them. This page exists so future operators
understand the why behind the playbook’s odd-looking choices — and
in particular, why three commits (c30459f, c4b4a71, 9ef17d3)
encode different “right answers” depending on which architectural
hypothesis was current at the time.
Run details:
- Date: 2026-05-04
- Operator: buchbend, single seat. The second-pair-of-eyes role was filled by an AI assistant that got the architecture wrong twice before getting it right.
- HSM in production: Nitrokey HSM 2, serial DENK0403253, internal USB on input-b
- CA software: step-ca pinned at 0.30.2-hsm (upstream image), step-cli at 0.30.2 (matching the ceremony’s pinned version)
1. The architectural multi-pivot
The biggest single lesson: the “right” architecture for accessing a USB HSM from a containerised step-ca on Linux is not the architecture the playbook prescribed. It took three attempts to find what works, and the approach that finally worked had itself been rejected earlier for reasons that looked valid at the time.
Path A — libusb-direct (playbook §E5 as originally written):
Container at user: 1000:1000, pass /dev/bus/usb through, use
group_add for plugdev. step-ca → opensc-pkcs11.so → libusb directly.
Failed because OpenSC’s opensc-pkcs11.so on Linux/Debian is built
only against libpcsclite — there is no libusb-direct reader
subsystem. Without a pcscd to talk to, pkcs11-tool --list-slots
returns “No slots” inside the container even when /dev/bus/usb is
visible and perms are correct. This was rejected after spending
hours fixing the udev rule + GID alignment + pcscd-on-host
contention, only to discover the access path can’t work without
pcscd.
Path B — host pcscd via Unix-socket bind-mount:
Mount /run/pcscd:/run/pcscd into the container; the container’s
libpcsclite talks to host pcscd. Failed because the RHEL host’s
default polkit policy denies org.debian.pcsc-lite.access_pcsc to
any client that isn’t the active console user or root. Adding a
polkit rule granting access by subject.isInGroup("plugdev") might
work — modern polkit checks supplementary groups via getgrouplist()
— but the semantics are version-dependent and a heredoc-written
polkit rule got mangled silently on terminal paste, so it was never
validated end to end. The architectural objection (host pcscd as a
SPOF coupled to the CA’s lifecycle) survived.
Path C — in-container pcscd as root, with --disable-polkit,
then drop privileges to step (UID 1000) for step-ca:
This is what works. The upstream smallstep/step-ca:0.30.2-hsm
image ships both opensc-pkcs11 and pcscd. We start the container as
root (compose user: "0:0"), let our entrypoint write the HSM PIN
to tmpfs, run pcscd --disable-polkit (so the container’s missing
polkit authority doesn’t block clients), wait for the socket, then
runuser -u step and exec step-ca. step-ca runs non-root; pcscd
retains root for the libusb USB ioctls (which need it regardless of
device-node file perms).
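The Path C start-up sequence can be sketched as a small entrypoint. This is a minimal sketch under stated assumptions, not the playbook’s actual script: the HSM_PIN variable, the /run/hsm/pin location, the 15-second timeout and the bare step-ca invocation are all hypothetical, and /run/pcscd/pcscd.comm is the usual pcscd socket path but may differ per build. The block only defines functions; a real entrypoint would end by calling start_ca.

```shell
#!/bin/sh
# Sketch of a Path C entrypoint. Values marked "assumed" are illustrative only.

PIN_FILE="${PIN_FILE:-/run/hsm/pin}"                    # assumed tmpfs location
PCSCD_SOCKET="${PCSCD_SOCKET:-/run/pcscd/pcscd.comm}"   # usual pcscd socket path
CA_CONFIG="${CA_CONFIG:-/home/step/config/ca.json}"

# Poll until a Unix socket exists at $1, giving up after $2 seconds.
wait_for_socket() {
    wfs_sock=$1
    wfs_left=$2
    while [ ! -S "$wfs_sock" ]; do
        [ "$wfs_left" -gt 0 ] || return 1
        sleep 1
        wfs_left=$((wfs_left - 1))
    done
    return 0
}

start_ca() {
    # 1. Write the PIN to tmpfs, readable only by the step user.
    #    HSM_PIN is an assumed variable name.
    ( umask 077; printf '%s' "${HSM_PIN:?HSM_PIN must be set}" > "$PIN_FILE" ) || return 1
    chown step "$PIN_FILE" || return 1

    # 2. Start pcscd as root; it daemonises itself. --disable-polkit stops the
    #    container's missing polkit authority from blocking clients.
    pcscd --disable-polkit || return 1

    # 3. Do not start step-ca until pcscd is accepting connections.
    if ! wait_for_socket "$PCSCD_SOCKET" 15; then
        echo "pcscd socket did not appear at $PCSCD_SOCKET" >&2
        return 1
    fi

    # 4. Drop privileges: step-ca replaces this shell and runs as step.
    exec runuser -u step -- step-ca "$CA_CONFIG"
}

# A real entrypoint would finish with:  start_ca
```

Because step 4 uses exec, step-ca becomes the container’s main process and receives signals from docker stop directly.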
Why C wasn’t tried first: the architect (and I) rejected it on
“process supervision + privilege drop complexity” — fair concern.
But the actual implementation is ~10 lines of shell with no daemon
supervision (pcscd self-daemonises and just stays up), and runuser
makes the privdrop a one-liner. The complexity worry was overrated.
Playbook decision: §E1, §E4, §E5 rewritten for Path C.
devices: /dev/bus/usb retained from §E4 (pcscd-in-container needs
it). group_add: ["plugdev"] dropped — pcscd-as-root doesn’t need
plugdev.
The udev rule + plugdev infrastructure on the host (commits
71323f1, c30459f) is now load-bearing only for host operator
tooling (when an operator briefly stops the container’s pcscd
to run sudo pkcs11-tool etc. on the host). Its header comment was
written under the libusb-direct hypothesis and overstates the
rule’s importance — re-read it as describing operator ergonomics,
not container access.
2. step-cli 0.30.2 has no working offline provisioner add
The playbook §G6 prescribed ccat ca provisioner sync →
provisioners-add.sh → step ca provisioner add for re-adding the
six provisioners after the volume wipe. With Phase 2’s
enableAdmin: false (intentional, per §F), step-cli is forced into
admin-API mode and rejects every call with the error “requires the '--ca-url' flag”. We tried --offline (does not exist in 0.30.2 despite older
docs referencing one) and --ca-config (still requires admin auth).
There is no flag combination that makes step-cli edit ca.json
offline in this version.
Work-around: bypass step-cli for provisioner adds entirely. Use
jq (present in the upstream image) to write provisioner JSON
directly into ca.json’s authority.provisioners array. step-cli is
still used for what only it can do offline — generating JWK keypairs
via step crypto jwk create. New script:
step-ca/provisioners-bootstrap.sh.
Playbook decision: §G6 now points at provisioners-bootstrap.sh.
provisioners-add.sh is left in place as a frozen reference of what
the step-cli path attempted to be; a follow-up commit will fold it
into the bootstrap or delete it.
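The jq-based work-around can be illustrated as follows. This is a minimal sketch, not the contents of provisioners-bootstrap.sh: the function name upsert_provisioner and the replace-by-name behaviour are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: NOT the real provisioners-bootstrap.sh.
#
# upsert_provisioner CA_JSON PROVISIONER_JSON
# Appends PROVISIONER_JSON to authority.provisioners, first dropping any
# existing entry with the same name so that re-runs stay idempotent.
upsert_provisioner() {
    up_ca_json=$1
    up_prov=$2

    # Write to a sibling temp file so a jq failure never truncates ca.json.
    up_tmp=$(mktemp "${up_ca_json}.XXXXXX") || return 1

    if jq --argjson p "$up_prov" '
            .authority.provisioners =
              ((.authority.provisioners // [])
               | map(select(.name != $p.name)) + [$p])
        ' "$up_ca_json" > "$up_tmp"; then
        # mktemp creates the file with mode 0600; fix mode and ownership
        # afterwards if step-ca reads ca.json as a different user.
        mv "$up_tmp" "$up_ca_json"
    else
        rm -f "$up_tmp"
        return 1
    fi
}

# Hypothetical usage ("example-jwk" is a made-up provisioner name):
#   upsert_provisioner /home/step/config/ca.json \
#       '{"type":"JWK","name":"example-jwk"}'
```

A real JWK entry would also carry the key material produced by step crypto jwk create; it is omitted here.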
3. enableSSHCA defaults off in Phase 2
step ssh login failed with “sshCA is disabled for oidc provisioner 'CCAT-GitHub'”. step-ca’s authority.claims.enableSSHCA claim
defaults to false. Phase 1 flipped it on via
DOCKER_STEPCA_INIT_SSH=true at auto-init time. Phase 2 drops all
DOCKER_STEPCA_INIT_* (§E2), so the default applies.
Playbook decision: step-ca/ca.json.hsm now sets
authority.claims.enableSSHCA = true. The top-level ssh: block
(userKey/hostKey on HSM id=02/id=03) was already correct; this
completes the pair.
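A pre-flight check for this claim could look like the sketch below. The function name ssh_ca_enabled is hypothetical; the JSON path is the one named above.

```shell
#!/bin/sh
# ssh_ca_enabled CA_JSON
# Succeeds only if authority.claims.enableSSHCA is literally true.
# A missing claim evaluates to null, which jq -e reports as failure,
# matching step-ca's default of false.
ssh_ca_enabled() {
    jq -e '.authority.claims.enableSSHCA == true' "$1" > /dev/null
}

# Example:
#   ssh_ca_enabled /home/step/config/ca.json || echo "SSH CA is disabled" >&2
```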
4. Volume pre-population needs a running helper container
§G4 originally used docker create --rm -v ... alpine sleep 300,
then docker cp to populate, then docker kill. Two problems:
- docker exec doesn’t work on a created-but-not-started container, so we can’t mkdir -p parent directories before cp. The first docker cp leaf-file:nested/dir/file returns mixed “Successfully copied / Could not find file” output and may leave the volume in a bad state.
- A fresh volume at /home/step is root:root mode 0755; when step-ca starts as UID 1000 it can’t create /home/step/db/ for badger and dies.
Playbook decision: §G4 rewritten to use docker run -d --rm
(running container), docker exec mkdir -p for the parent dirs,
copy the four files, then docker exec chown -R 1000:1000 /home/step
before tearing down the helper.
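The rewritten §G4 flow can be sketched as a single function. This is an illustration rather than the playbook text: the function name, the SRC:DEST argument convention and the volume name in the usage comment are assumptions, and the four real files are deliberately not named here.

```shell
#!/bin/sh
# populate_volume VOLUME SRC:DEST [SRC:DEST ...]
# DEST is relative to /home/step inside the volume.
populate_volume() {
    pv_volume=$1
    shift

    # The helper must be *running* (docker run -d, not docker create),
    # otherwise docker exec cannot be used against it.
    pv_helper=$(docker run -d --rm -v "${pv_volume}:/home/step" alpine sleep 300) || return 1

    pv_status=0
    for pv_pair in "$@"; do
        pv_src=${pv_pair%%:*}
        pv_dest=/home/step/${pv_pair#*:}
        # Create the parent directory first, then copy the file into it.
        if ! docker exec "$pv_helper" mkdir -p "$(dirname "$pv_dest")" ||
           ! docker cp "$pv_src" "${pv_helper}:${pv_dest}"; then
            pv_status=1
            break
        fi
    done

    # step-ca runs as UID 1000 and must be able to create db/ under /home/step.
    if [ "$pv_status" -eq 0 ]; then
        docker exec "$pv_helper" chown -R 1000:1000 /home/step || pv_status=1
    fi

    # Tear the helper down whether or not the copy succeeded.
    docker kill "$pv_helper" > /dev/null
    return "$pv_status"
}

# Hypothetical usage (volume and file names are placeholders):
#   populate_volume step-data ./ca.json:config/ca.json
```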
5. Host pcscd must be stopped post-cutover
Host pcscd and in-container pcscd both want to claim the same USB device via libusb. The kernel allows only one libusb client per device interface. Whichever pcscd starts first wins; the other fails silently and the loser’s PKCS#11 stack reports “No slots.”
Playbook decision: §G5 now explicitly stops + masks host pcscd
(systemctl mask pcscd.service pcscd.socket). Operator workflows
that need host-side sudo pkcs11-tool must first ccat ca down,
unmask + start pcscd briefly, do the diagnostic, then re-mask and
ccat ca up. This is captured in the “Day-2 ops” runbook in
cutover-playbook.md § “Day-2 ops — token
contention”.
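The host-side diagnostic sequence can be wrapped so that the re-mask and ccat ca up always run, even when the diagnostic itself fails. This is a sketch, not the runbook’s wording: the function name with_host_pcscd is hypothetical, and starting pcscd through pcscd.socket assumes the host uses socket activation.

```shell
#!/bin/sh
# with_host_pcscd COMMAND [ARGS...]
# Temporarily hands the token to host pcscd, runs COMMAND, then restores
# the post-cutover state (host pcscd masked, CA container up).
with_host_pcscd() {
    # Free the USB device: the in-container pcscd must go away first.
    ccat ca down || return 1

    sudo systemctl unmask pcscd.service pcscd.socket
    sudo systemctl start pcscd.socket   # assumes socket activation on the host

    # Run the diagnostic and remember its exit status for later.
    "$@"
    whp_status=$?

    # Always restore, regardless of how the diagnostic ended.
    sudo systemctl stop pcscd.service pcscd.socket
    sudo systemctl mask pcscd.service pcscd.socket
    ccat ca up || return 1

    return "$whp_status"
}

# Example:
#   with_host_pcscd sudo pkcs11-tool --list-slots
```

The sketch does not trap signals, so an interrupted run still needs the manual re-mask and ccat ca up.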
6. Plumbing bugs that ate real hours
These are not architectural — just shell-level mistakes — but they were load-bearing on operator productivity and the playbook now guards against each:
- docker exec without -i silently drops stdin. A function that piped JSON to docker exec ... sh -c 'cat > /tmp/file' ended up writing an empty file because docker doesn’t forward stdin without -i. Six provisioners “added” with cheerful output; ca.json unchanged. Fixed in 9ef17d3.
- docker cp - reads stdin as a tar stream, not raw bytes. An attempt to pipe echo "unused-but-required" | docker cp - CONTAINER:/path failed with “archive/tar: invalid tar header.” Use a temp file + docker cp <tmpfile> instead. Fixed in §G4.
- Heredoc paste mangling for polkit rule. Pasting a multi-line heredoc into a remote SSH session via sudo tee FILE <<'EOF' ... EOF silently failed (file wasn’t created), then a manual emacs re-create concatenated lines onto a single line in places. The rule still parsed as JavaScript (whitespace-insensitive), but the experience taught us: write content to /tmp/foo as the user, then sudo install it in place. Or use printf '%s\n' '...' '...' | sudo tee to avoid heredoc-paste pitfalls.
- udevd caches name→GID resolutions at boot time. When we pinned plugdev to GID 46 after udevd had already started and cached the auto-assigned old GID, the udev rule kept setting the device node to the stale numeric GID. systemctl restart systemd-udevd flushes the cache. Re-fired with udevadm trigger --action=add --subsystem-match=usb to re-evaluate the device.
- Default healthcheck timeout (10s) is too short for the pcscd→libusb→CCID round-trip after restart. Bumped to 30s in c0ce12e.
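Two of these guards can be sketched as small helpers. The function names write_into_container and install_as_root are hypothetical, and the code is an illustration of the pattern rather than the fix committed in 9ef17d3.

```shell
#!/bin/sh
# write_into_container CONTAINER PATH   (content arrives on stdin)
write_into_container() {
    wic_container=$1
    wic_path=$2
    # -i is what connects our stdin to the process inside the container;
    # without it cat sees EOF immediately and writes an empty file.
    docker exec -i "$wic_container" sh -c 'cat > "$1"' sh "$wic_path" || return 1
    # A zero exit status is not proof: confirm the file is non-empty.
    docker exec "$wic_container" test -s "$wic_path"
}

# install_as_root DEST   (content arrives on stdin)
# Stage the content as the unprivileged user, then install it in one step,
# instead of pasting a heredoc into sudo tee.
install_as_root() {
    iar_dest=$1
    iar_tmp=$(mktemp) || return 1
    cat > "$iar_tmp"
    sudo install -m 0644 "$iar_tmp" "$iar_dest"
    iar_status=$?
    rm -f "$iar_tmp"
    return "$iar_status"
}

# Examples (container name and paths are placeholders):
#   printf '%s\n' "$json" | write_into_container CONTAINER /tmp/file
#   printf '%s\n' 'line 1' 'line 2' | install_as_root /path/to/rule-file
```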
7. Verifying HSM-backed signing without physical access
The dongle is in the server room; we can’t watch the LED blink. The non-destructive proof is the conjunction of these checks:
- jq '{key,kms,ssh}' /home/step/config/ca.json shows key=pkcs11:id=01, kms.type=pkcs11, ssh.userKey=pkcs11:id=02, ssh.hostKey=pkcs11:id=03. The CA is configured to use PKCS#11 only.
- cat /proc/$(pidof step-ca)/maps | grep opensc-pkcs11 shows the PKCS#11 module mmap’d into step-ca’s address space: the binary actually loaded the library, not just configured a path.
- ls -l /proc/$(pidof pcscd)/fd/ shows pcscd holding an open file descriptor on a /dev/bus/usb/<n>/<m> device. The actual USB handle is open, in this process, right now.
- A test cert issued via the prod-services JWK provisioner verifies against the intermediate, and the intermediate’s public key already proved bit-equal to HSM id=01 during §A4.
No way to fake all four without an HSM in the chain.
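The first three checks lend themselves to scripting. The sketch below is one way to do it, with hypothetical function names; it assumes the key URIs in ca.json start with pkcs11: and contain the id=NN shown above, possibly alongside other attributes. The fourth check (test cert against the intermediate) is left manual.

```shell
#!/bin/sh
# check_ca_config CA_JSON
# Succeeds if the signing key, the KMS type and both SSH keys point at PKCS#11.
check_ca_config() {
    jq -e '
        def on_hsm($id):
            tostring | (startswith("pkcs11:") and contains("id=" + $id));
        (.key | on_hsm("01"))
        and .kms.type == "pkcs11"
        and (.ssh.userKey | on_hsm("02"))
        and (.ssh.hostKey | on_hsm("03"))
    ' "$1" > /dev/null
}

# module_mapped MAPS_FILE
# Succeeds if the OpenSC PKCS#11 module appears in a /proc/<pid>/maps file.
module_mapped() {
    grep -q 'opensc-pkcs11' "$1"
}

# holds_usb_handle FD_DIR
# Succeeds if any descriptor in a /proc/<pid>/fd directory points at a USB node.
holds_usb_handle() {
    ls -l "$1" | grep -q '/dev/bus/usb/'
}

# Examples, run inside the container:
#   check_ca_config  /home/step/config/ca.json
#   module_mapped    "/proc/$(pidof step-ca)/maps"
#   holds_usb_handle "/proc/$(pidof pcscd)/fd"
```

Taking the maps file and fd directory as arguments keeps the helpers independent of pidof, so they can be pointed at any process.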
8. Open follow-ups (not blocking; tracked as GitHub issues)
- JWK provisioner password security regression. Bootstrap script used the Phase-1-style dummy password file (unused-but-required) as the JWK provisioner password. JWK provisioner passwords are the auth gate for cert issuance via those provisioners; anyone who knows the password can issue certs. Phase 1 used a vault-backed STEP_CA_PASSWORD. Need to generate per-provisioner secrets (or pull a vault-backed shared secret) and rotate.
- provisioners-add.sh consolidation. Two scripts now exist for the same task, one broken. Either delete the old or have it call the new. Update ccat ca provisioner sync to invoke the working path. Update §G6 of the playbook to match.
- Ansible hsm_host role completeness for redeploy. A fresh input-b should be one make play-hsm-host away from “ready for ccat ca up”. The role currently leaves three things to manual operator steps: stop+mask host pcscd, deploy the polkit rule (now optional but harmless), install jq if not present. Bake those in.
- Phase 3: open TCP 9000 at the firewall. Drops the trust-bundle workaround and lets step-cli verify CCAT root natively. Out of scope for system-integration (firewall is uni IT).