# Lessons learned — Phase 2 HSM cutover 2026-05-04

This document captures the surprises and corrections from CCAT's first
HSM cutover (Phase 2 — switching step-ca's signing keys from the
file-backed Phase 1 throwaway to the real Nitrokey HSM 2 dongle on
input-b). The [cutover playbook](cutover-playbook.md) has been updated
to bake these in, so a future redeploy following the playbook should
not re-encounter them. This page exists so future operators
understand the *why* behind the playbook's odd-looking choices — and
in particular, why three commits (`c30459f`, `c4b4a71`, `9ef17d3`)
encode different "right answers" depending on which architectural
hypothesis was current at the time.

Run details:
- **Date:** 2026-05-04
- **Operator:** buchbend, single seat — the second-pair-of-eyes role
  was filled by an AI assistant that got the architecture wrong twice
  before getting it right.
- **HSM in production:** Nitrokey HSM 2, serial `DENK0403253`, internal
  USB on input-b
- **CA software:** step-ca pinned at `0.30.2-hsm` (upstream image),
  step-cli at 0.30.2 (matching the ceremony's pinned version)

## 1. The architectural multi-pivot

The biggest single lesson: **the "right" architecture for accessing
a USB HSM from a containerised step-ca on Linux is not the architecture
the playbook prescribed**. It took three attempts to find what works.
The first two failed for the reasons below; the third, which works,
had earlier been rejected on complexity grounds that turned out to be
overstated.

**Path A — libusb-direct (playbook §E5 as originally written):**
Container at `user: 1000:1000`, pass `/dev/bus/usb` through, use
group_add for plugdev. step-ca → opensc-pkcs11.so → libusb directly.
*Failed because* OpenSC's `opensc-pkcs11.so` on Linux/Debian is built
only against `libpcsclite` — there is no libusb-direct reader
subsystem. Without a pcscd to talk to, `pkcs11-tool --list-slots`
returns "No slots" inside the container even when `/dev/bus/usb` is
visible and perms are correct. *This was rejected only after* hours
spent fixing the udev rule, GID alignment, and contention with the
host pcscd, which is when it became clear that the access path cannot
work without a pcscd.

**Path B — host pcscd via Unix-socket bind-mount:**
Mount `/run/pcscd:/run/pcscd` into the container; container's
libpcsclite talks to host pcscd. *Failed because* the RHEL host's
default polkit policy denies `org.debian.pcsc-lite.access_pcsc` to
any client that isn't the active console user or root. Adding a
polkit rule granting access by `subject.isInGroup("plugdev")` *might*
work — modern polkit checks supplementary groups via `getgrouplist()`
— but the semantics are version-dependent and a heredoc-written
polkit rule got mangled silently on terminal paste, so it was never
validated end to end. The architectural objection (host pcscd as a
SPOF coupled to the CA's lifecycle) survived.

**Path C — in-container pcscd as root, with `--disable-polkit`,
then drop privileges to step (UID 1000) for step-ca:**
This is what works. The upstream `smallstep/step-ca:0.30.2-hsm`
image ships both opensc-pkcs11 and pcscd. We start the container as
root (compose `user: "0:0"`), let our entrypoint write the HSM PIN
to tmpfs, run `pcscd --disable-polkit` (so the container's missing
polkit authority doesn't block clients), wait for the socket, then
`runuser -u step` and exec step-ca. step-ca runs non-root; pcscd
retains root for the libusb USB ioctls (which need it regardless of
device-node file perms).
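
The sequence above can be sketched as a short entrypoint. This is a minimal illustration, not the deployed script: the `HSM_PIN` variable, the `PIN_FILE` location, the socket path `/run/pcscd/pcscd.comm`, and the 15-second timeout are all assumptions made for the example. Only `pcscd --disable-polkit`, `runuser -u step`, and the ca.json path come from this document.

```shell
#!/bin/sh
# Sketch of a Path C entrypoint. Paths, the HSM_PIN variable, and the
# timeout are illustrative assumptions, not the deployed values.

PCSCD_SOCKET="${PCSCD_SOCKET:-/run/pcscd/pcscd.comm}"   # assumed socket path
PIN_FILE="${PIN_FILE:-/run/hsm/pin}"                    # assumed tmpfs location

# Poll until a Unix socket exists at $1, giving up after $2 seconds.
wait_for_socket() (
    path=$1
    remaining=${2:-15}
    while [ ! -S "$path" ]; do
        [ "$remaining" -gt 0 ] || exit 1
        sleep 1
        remaining=$((remaining - 1))
    done
)

# Runs as root inside the container; ends by exec'ing step-ca as `step`.
start_ca() {
    umask 077
    printf '%s' "$HSM_PIN" > "$PIN_FILE" || return 1
    chown step "$PIN_FILE" || return 1
    unset HSM_PIN

    # pcscd self-daemonises and keeps root for the USB access.
    pcscd --disable-polkit || return 1
    wait_for_socket "$PCSCD_SOCKET" 15 || {
        echo "pcscd socket never appeared at $PCSCD_SOCKET" >&2
        return 1
    }

    # Drop privileges: step-ca itself runs as the non-root `step` user.
    exec runuser -u step -- step-ca /home/step/config/ca.json
}

# Only start when explicitly asked, so the file can also be sourced.
if [ "${1:-}" = "--run" ]; then
    start_ca
fi
```

Waiting on the socket rather than sleeping for a fixed interval keeps the startup deterministic: step-ca is only started once pcscd can actually accept clients, and a missing socket produces an explicit error instead of a "No slots" failure later.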

**Why C wasn't tried first:** the architect (and I) rejected it on
"process supervision + privilege drop complexity" — fair concern.
But the actual implementation is ~10 lines of shell with no daemon
supervision (pcscd self-daemonises and just stays up), and `runuser`
makes the privdrop a one-liner. The complexity worry was overstated.

**Playbook decision:** §E1, §E4, §E5 rewritten for Path C.
`devices: /dev/bus/usb` retained from §E4 (pcscd-in-container needs
it). `group_add: ["plugdev"]` dropped — pcscd-as-root doesn't need
plugdev.

The udev rule + plugdev infrastructure on the host (commits
`71323f1`, `c30459f`) is now load-bearing only for **host operator
tooling** (when an operator briefly stops the container's pcscd
to run `sudo pkcs11-tool` etc. on the host). Its header comment was
written under the libusb-direct hypothesis and **overstates** the
rule's importance — re-read it as describing operator ergonomics,
not container access.

## 2. step-cli 0.30.2 has no working offline `provisioner add`

The playbook §G6 prescribed `ccat ca provisioner sync` →
`provisioners-add.sh` → `step ca provisioner add` for re-adding the
six provisioners after the volume wipe. With Phase 2's
`enableAdmin: false` (intentional, per §F), step-cli is forced into
admin-API mode and rejects every call with `requires the '--ca-url'
flag`. We tried `--offline` (which does not exist in 0.30.2, despite
older docs referencing one) and `--ca-config` (which still requires
admin auth).
**There is no flag combination that makes step-cli edit ca.json
offline in this version.**

Work-around: bypass step-cli for provisioner adds entirely. Use
`jq` (present in the upstream image) to write provisioner JSON
directly into ca.json's `authority.provisioners` array. step-cli is
still used for what only it can do offline — generating JWK keypairs
via `step crypto jwk create`. New script:
`step-ca/provisioners-bootstrap.sh`.
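
The core of the `jq` approach can be sketched as follows. This is not the content of `provisioners-bootstrap.sh`; the function name, its arguments, and the replace-by-name behaviour are assumptions chosen for the example. It expects a file holding one complete provisioner object and merges it into `authority.provisioners`.

```shell
# Sketch: append (or replace by name) one provisioner object in ca.json.
# Function and file names are illustrative, not taken from the real script.
add_provisioner() (
    ca_json=$1      # path to ca.json
    prov_json=$2    # file holding a single provisioner object
    tmp=$(mktemp) || exit 1

    # --slurpfile binds $p to an array containing the object in prov_json.
    # Drop any existing provisioner with the same name, then append.
    jq --slurpfile p "$prov_json" '
        .authority.provisioners =
            ((.authority.provisioners // [])
             | map(select(.name != $p[0].name))) + $p
    ' "$ca_json" > "$tmp" || { rm -f "$tmp"; exit 1; }

    # Refuse to overwrite ca.json with an empty result.
    [ -s "$tmp" ] || { rm -f "$tmp"; exit 1; }

    # Write through the existing file so its owner and mode are preserved.
    cat "$tmp" > "$ca_json" && rm -f "$tmp"
)
```

Replacing by name makes the function safe to re-run, and writing to a temporary file first means a failed `jq` invocation never truncates ca.json.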

**Playbook decision:** §G6 now points at `provisioners-bootstrap.sh`.
`provisioners-add.sh` is left in place as a frozen reference of what
the step-cli path attempted to be; a follow-up commit will fold it
into the bootstrap or delete it.

## 3. `enableSSHCA` defaults off in Phase 2

`step ssh login` failed with `sshCA is disabled for oidc provisioner
'CCAT-GitHub'`. step-ca's `authority.claims.enableSSHCA` claim
defaults to `false`. Phase 1 flipped it on via
`DOCKER_STEPCA_INIT_SSH=true` at auto-init time. Phase 2 drops all
`DOCKER_STEPCA_INIT_*` (§E2), so the default applies.

**Playbook decision:** `step-ca/ca.json.hsm` now sets
`authority.claims.enableSSHCA = true`. The top-level `ssh:` block
(userKey/hostKey on HSM `id=02`/`id=03`) was already correct; this
completes the pair.
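
A pre-flight check along these lines would catch the missing claim before anyone runs `step ssh login`. The function name is hypothetical, and the sketch only inspects the global claim described above, not any per-provisioner override.

```shell
# Exit 0 only if the global SSH CA claim is explicitly true in the given ca.json.
ssh_ca_enabled() {
    jq -e '.authority.claims.enableSSHCA == true' "$1" > /dev/null
}

# Example:
#   ssh_ca_enabled /home/step/config/ca.json || echo "enableSSHCA is not set" >&2
```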

## 4. Volume pre-population needs a *running* helper container

§G4 originally used `docker create --rm -v ... alpine sleep 300`,
then `docker cp` to populate, then `docker kill`. Two problems:

- `docker exec` doesn't work on a created-but-not-started container,
  so we can't `mkdir -p` parent directories before cp. The first
  `docker cp leaf-file:nested/dir/file` returns mixed
  "Successfully copied / Could not find file" output and may leave
  the volume in a bad state.
- A fresh volume at `/home/step` is `root:root` mode 0755 — when
  step-ca starts as UID 1000 it can't create `/home/step/db/` for
  badger and dies.

**Playbook decision:** §G4 rewritten to use `docker run -d --rm`
(running container), `docker exec mkdir -p` for the parent dirs,
copy the four files, then `docker exec chown -R 1000:1000 /home/step`
before tearing down the helper.
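
The rewritten procedure can be sketched as a small function. The function name, the `SRC:DEST` argument convention, and the `DOCKER` override are assumptions made for the example; the playbook itself lists the actual four files.

```shell
# Sketch of the §G4 pattern: populate a named volume through a running helper.
# DOCKER can be overridden (for example DOCKER=echo) for a dry run.
DOCKER="${DOCKER:-docker}"

# usage: populate_volume VOLUME SRC:DEST [SRC:DEST ...]
#   SRC is a path on the host, DEST is relative to /home/step in the volume.
populate_volume() (
    volume=$1; shift

    # Start a *running* helper so that `docker exec` works.
    cid=$($DOCKER run -d --rm -v "$volume:/home/step" alpine sleep 300) || exit 1
    # Tear the helper down on exit, including on failure.
    trap '$DOCKER kill "$cid" > /dev/null' EXIT

    for pair in "$@"; do
        src=${pair%%:*}
        dest=/home/step/${pair#*:}
        # Create parent directories first, then copy the file itself.
        $DOCKER exec "$cid" mkdir -p "$(dirname "$dest")" || exit 1
        $DOCKER cp "$src" "$cid:$dest" || exit 1
    done

    # A fresh volume is root-owned; step-ca runs as UID 1000.
    $DOCKER exec "$cid" chown -R 1000:1000 /home/step
)
```

Running it once with `DOCKER=echo` prints the commands it would issue, which is a cheap way to review the sequence before touching a real volume.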

## 5. Host pcscd must be stopped post-cutover

Host pcscd and in-container pcscd both want to claim the same USB
device via libusb, and the kernel lets only one client claim a given
device interface at a time. Whichever pcscd starts first wins; the
other fails silently, and its PKCS#11 stack reports "No slots."

**Playbook decision:** §G5 now explicitly stops + masks host pcscd
(`systemctl mask pcscd.service pcscd.socket`). Operator workflows
that need host-side `sudo pkcs11-tool` must first `ccat ca down`,
unmask + start pcscd briefly, do the diagnostic, then re-mask and
`ccat ca up`. This is captured in the "Day-2 ops" runbook in
[`cutover-playbook.md`](cutover-playbook.md) § "Day-2 ops — token
contention".
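
The host-side diagnostic sequence described above can be sketched as a wrapper. The `run` helper, the `DRY_RUN` switch, and the default diagnostic command are assumptions for the example; the runbook in the playbook remains the reference.

```shell
# Sketch of the day-2 sequence for a host-side diagnostic. Set DRY_RUN=1 to
# print the commands instead of running them.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# usage: host_hsm_diag [diagnostic command...]
host_hsm_diag() {
    # Default diagnostic if none is given.
    [ "$#" -gt 0 ] || set -- sudo pkcs11-tool --list-slots

    run ccat ca down || return 1
    run sudo systemctl unmask pcscd.service pcscd.socket
    run sudo systemctl start pcscd.service
    run "$@"; status=$?

    # Restore the production state even if the diagnostic failed.
    run sudo systemctl stop pcscd.service pcscd.socket
    run sudo systemctl mask pcscd.service pcscd.socket
    run ccat ca up || return 1
    return "$status"
}
```

Capturing the diagnostic's exit status before re-masking means a failed diagnostic still leaves the host in the masked state the CA expects.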

## 6. Plumbing bugs that ate real hours

These are not architectural — just shell-level mistakes — but they
were *load-bearing on operator productivity* and the playbook now
guards against each:

- **`docker exec` without `-i` silently drops stdin.** A function
  that piped JSON to `docker exec ... sh -c 'cat > /tmp/file'` ended
  up writing an empty file because docker doesn't forward stdin
  without `-i`. Six provisioners "added" with cheerful output;
  ca.json unchanged. Fixed in `9ef17d3`.

- **`docker cp -` reads stdin as a tar stream, not raw bytes.**
  An attempt to pipe `echo "unused-but-required" | docker cp -
  CONTAINER:/path` failed with "archive/tar: invalid tar header."
  Use a temp file + `docker cp <tmpfile>` instead. Fixed in §G4.

- **Heredoc paste mangling for polkit rule.** Pasting a multi-line
  heredoc into a remote SSH session via `sudo tee FILE <<'EOF' ... EOF`
  silently failed (the file was never created), and re-creating it by
  hand in `emacs` joined some lines together. The rule still parsed,
  since JavaScript is largely whitespace-insensitive, but the lesson
  stands: write the content to `/tmp/foo` as the user, then
  `sudo install` it in place. Alternatively, use
  `printf '%s\n' '...' '...' | sudo tee` to avoid heredoc-paste pitfalls.

- **`udevd` caches name→GID resolutions at boot time.** When we
  pinned plugdev to GID 46 *after* udevd had already started and
  cached the auto-assigned old GID, the udev rule kept setting the
  device node to the stale numeric GID. Restarting udevd with
  `systemctl restart systemd-udevd` flushes the cache, after which
  `udevadm trigger --action=add --subsystem-match=usb` re-applies the
  rule to the device.

- **Default healthcheck timeout (10s) is too short for the
  pcscd→libusb→CCID round-trip after restart.** Bumped to 30s in
  `c0ce12e`.
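
Two of the bugs above (the empty file from `docker exec` without `-i`, and the heredoc that never landed) share a shape: content piped somewhere silently arrived as nothing. A small guard like the following makes that failure loud. The function name and the example paths are hypothetical.

```shell
# Copy stdin to a private temp file and print its path; fail if nothing arrived.
stage_from_stdin() {
    staged=$(mktemp) || return 1
    cat > "$staged"
    if [ ! -s "$staged" ]; then
        echo "stage_from_stdin: received no data" >&2
        rm -f "$staged"
        return 1
    fi
    printf '%s\n' "$staged"
}

# Examples (paths and names are illustrative):
#   f=$(printf '%s\n' 'line 1' 'line 2' | stage_from_stdin) &&
#       sudo install -m 0644 "$f" /etc/polkit-1/rules.d/50-example.rules
#   f=$(generate_json | stage_from_stdin) &&
#       docker exec -i "$cid" sh -c 'cat > /tmp/file' < "$f"
```

Staging to a file first also sidesteps the `docker cp -` tar-stream surprise, since the staged file can be passed to `docker cp` directly.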

## 7. Verifying HSM-backed signing without physical access

The dongle is in the server room; we can't watch the LED blink. The
non-destructive proof is the conjunction of these checks:

1. `jq '{key,kms,ssh}' /home/step/config/ca.json` shows
   `key=pkcs11:id=01`, `kms.type=pkcs11`, `ssh.userKey=pkcs11:id=02`,
   `ssh.hostKey=pkcs11:id=03`. The CA is *configured* to use
   PKCS#11 only.
2. `grep opensc-pkcs11 /proc/$(pidof step-ca)/maps` shows the
   PKCS#11 module mmap'd into step-ca's address space — the binary
   actually loaded the library, not just configured a path.
3. `ls -l /proc/$(pidof pcscd)/fd/` shows pcscd holding an open file
   descriptor on a `/dev/bus/usb/<n>/<m>` device. The actual USB
   handle is open, in this process, right now.
4. A test cert issued via the prod-services JWK provisioner verifies
   against the intermediate, and the intermediate's public key
   already proved bit-equal to HSM `id=01` during §A4.

There is no way to make all four pass without an HSM in the chain.
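
Checks 1 to 3 lend themselves to a scripted form; check 4 involves issuing a certificate and is left manual here. The function names are hypothetical, the sketch assumes it runs inside the container as root (reading another process's `/proc/<pid>/fd` requires it), and it assumes a single `step-ca` and a single `pcscd` process.

```shell
# Succeeds if the memory map of process $1 mentions pattern $2.
pid_maps_contains() {
    grep -q -- "$2" "/proc/$1/maps" 2> /dev/null
}

# Succeeds if process $1 holds an open descriptor on a USB device node.
pid_holds_usb() {
    ls -l "/proc/$1/fd/" 2> /dev/null | grep -q '/dev/bus/usb/'
}

# Runs checks 1-3 against the given ca.json (default: the in-container path).
check_hsm_chain() {
    conf=${1:-/home/step/config/ca.json}

    # Check 1: every signing key is referenced through PKCS#11.
    jq -e '.kms.type == "pkcs11"
           and (.key | startswith("pkcs11:"))
           and (.ssh.userKey | startswith("pkcs11:"))
           and (.ssh.hostKey | startswith("pkcs11:"))' "$conf" > /dev/null ||
        { echo "FAIL: ca.json is not PKCS#11-only" >&2; return 1; }

    # Check 2: step-ca has actually mapped the PKCS#11 module.
    pid_maps_contains "$(pidof step-ca)" opensc-pkcs11 ||
        { echo "FAIL: step-ca has not loaded opensc-pkcs11" >&2; return 1; }

    # Check 3: pcscd holds the USB device open.
    pid_holds_usb "$(pidof pcscd)" ||
        { echo "FAIL: pcscd holds no USB device" >&2; return 1; }

    echo "checks 1-3 passed"
}
```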

## 8. Open follow-ups (not blocking; tracked as GitHub issues)

- **JWK provisioner password security regression.** Bootstrap script
  used the Phase-1-style dummy password file
  (`unused-but-required`) as the JWK provisioner password. JWK
  provisioner passwords are the auth gate for cert issuance via
  those provisioners — anyone who knows the password can issue
  certs. Phase 1 used a vault-backed `STEP_CA_PASSWORD`. Need to
  generate per-provisioner secrets (or pull a vault-backed shared
  secret) and rotate.

- **`provisioners-add.sh` consolidation.** Two scripts now exist for
  the same task, one broken. Either delete the old or have it call
  the new. Update `ccat ca provisioner sync` to invoke the working
  path. Update §G6 of the playbook to match.

- **Ansible `hsm_host` role completeness for redeploy.** A fresh
  input-b should be one `make play-hsm-host` away from "ready for
  `ccat ca up`". The role currently leaves three things to manual
  operator steps: stop+mask host pcscd, deploy the polkit rule (now
  optional but harmless), install jq if not present. Bake those in.

- **Phase 3: open TCP 9000 at the firewall.** Drops the
  trust-bundle workaround and lets step-cli verify CCAT root
  natively. Out of scope for system-integration (firewall is uni IT).