CCAT CA — HSM Cutover Playbook (post-ceremony)#

Read this before doing anything. This is the executable companion to the offline-root ceremony playbook. Where that document covers what happened on the air-gapped laptop on 2026-04-29, this document covers what now has to happen on input-b to make the rest of CCAT trust the new HSM-backed root.

Two operators recommended (one driver, one second-pair-of-eyes). Stages A–F are reversible and low-risk. Stage G is the irreversible volume wipe + bring-up, scheduled in a known downtime window. Stage H is the test cohort re-bootstrapping with the new fingerprint.


§0 — Preconditions#

Before starting Stage A, all of the following must be true:

  • The 2026-04-29 ceremony completed successfully and the export USB contains five files: root_ca.crt, intermediate_ca.crt, ssh_user_ca.pub, ssh_host_ca.pub, FINGERPRINT.txt.

  • The paper PIN sheet from the ceremony is in the safe; the intermediate user PIN is also accessible to the cutover operator (memorised, written separately, or carried in a sealed envelope — not the root PINs).

  • Both HSM serials are recorded on the paper sheet. The HSM #2 serial is needed inline in ca.json.hsm (Stage F).

  • HSM #2 is physically installed in input-b’s internal USB port. Chassis closed. Server up.

  • You are reading this on a trusted workstation (not on input-b itself — keep operator role separate from the host).

  • The Phase 1 test cohort has been told that a re-bootstrap is coming and roughly when.

  • You have a clean working tree on main of system-integration.

If any of the above is missing, stop. Do not proceed.


Stage A — Verify HSM #2 is functional on input-b#

Goal: confirm the OS sees the dongle, OpenSC can talk to it, and the keys on the card match the public artefacts on the export USB. No compose changes here. If anything in this stage fails, stop and diagnose before going further.

§A1 — Host-level visibility#

SSH to input-b. Run:

lsusb | grep -i nitrokey
sudo systemctl status pcscd
sudo pkcs11-tool --list-slots

Expected:

  • Exactly one Nitrokey HSM 2 line in lsusb (USB ID 20a0:4230, i.e. vendor 20a0, product 4230).

  • pcscd active and running.

  • pkcs11-tool --list-slots shows one slot, token label ccat-intermediate (UserPIN).
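
The three expectations above can be folded into a small pre-flight helper. This is only a sketch, not something that exists in the repo: the function names are hypothetical, and the parsing assumes lsusb prints the USB ID as 20a0:4230 and that pkcs11-tool --list-slots prints one "token label" line per initialised token.

```shell
# Count Nitrokey HSM 2 devices in `lsusb` output read from stdin.
# grep -c exits non-zero on zero matches, so `|| true` keeps the
# function usable under `set -e`; the count is still printed.
count_nitrokeys() {
    grep -c '20a0:4230' || true
}

# Succeed only if `pkcs11-tool --list-slots` output (stdin) shows exactly
# one token and its label equals the expected label given as $1.
slot_label_ok() {
    awk -v want="$1" '
        /token label/ {
            n++
            sub(/^[^:]*:[[:space:]]*/, "")   # drop "token label   : "
            sub(/[[:space:]]+$/, "")         # drop trailing whitespace
            label = $0
        }
        END { exit !(n == 1 && label == want) }
    '
}

# Usage on input-b (hypothetical invocation):
#   lsusb | count_nitrokeys                  # expect 1
#   sudo pkcs11-tool --list-slots | slot_label_ok 'ccat-intermediate (UserPIN)' && echo ok
```

The pcscd check is left to systemctl status as in the commands above; only the two parsed outputs benefit from a scripted comparison.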

If pkcs11-tool shows no slot, reseat the dongle and reload udev:

sudo udevadm control --reload-rules
sudo udevadm trigger

Re-check. Still nothing? Stop here — the dongle, the USB port, or the udev rule needs investigating before anything else happens.

§A2 — End-to-end run of the hsm_host Ansible role#

The role installs opensc + opensc-tools, deploys 99-nitrokey-hsm.rules, ensures the plugdev group exists, and verifies the slot is visible:

cd ansible
make play-hsm-host

This is ansible-playbook -i inventory.ini -l input-b playbook_setup_vms.yml --tags hsm_host --vault-password-file .ansible_vault_key --ask-become-pass — make handles the sudo prompt and the vault key for you. Don’t invoke ansible-playbook directly; you’ll trip “Missing sudo password” on the fact-gathering task.

Expected: green run; the final Report detected HSM slots debug task prints the slot info (one slot, label ccat-intermediate (UserPIN)).

After the first clean run, flip the role into hard-fail mode for the future. Edit host_vars/input-b/hsm.yml (create if missing) so the role refuses to skip verification on subsequent runs:

_hsm_enforce_verify: true

Commit + push.

§A3 — Sign-test with the intermediate key on id=01#

Confirms the HSM is not just visible but actually responsive. Before running the sign command, do two free pre-flight checks that don’t consume PIN attempts:

# Confirm id=01, id=02, id=03 actually exist (no-login operation)
sudo pkcs11-tool --module /usr/lib64/pkcs11/opensc-pkcs11.so \
    --token-label ccat-intermediate --list-objects --type pubkey

# Read the user PIN retry counter before betting on it
sudo sc-hsm-tool

Expect three pubkey objects (intermediate, SSH user CA, SSH host CA) and a user PIN counter at full value (typically 3 of 3 on a freshly-initialised SC-HSM). If the counter reads 1 of 3, do not proceed to the sign-test on a guess — reset via the SO-PIN first (sudo sc-hsm-tool --unlock-pin, prompts for SO-PIN then new user PIN).
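
If you would rather gate the sign-test on the counter than read it by eye, something like the following could work. It is a sketch under one loud assumption: that the sc-hsm-tool status output contains a line of the form "User PIN tries left : N". Check that against the actual output on input-b before relying on it. The function names are hypothetical.

```shell
# Extract the user-PIN retry counter from `sc-hsm-tool` output on stdin.
# ASSUMPTION: the output has a line like "User PIN tries left  : 3".
user_pin_tries() {
    sed -n 's/^User PIN tries left[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*$/\1/p'
}

# Succeed only if the counter could be parsed and is at least $1 (default 3).
pin_counter_full() {
    tries=$(user_pin_tries)
    [ -n "$tries" ] && [ "$tries" -ge "${1:-3}" ]
}

# Usage on input-b (hypothetical invocation):
#   sudo sc-hsm-tool | pin_counter_full 3 || echo "STOP: PIN counter not at full value"
```

An unparseable output counts as a failure on purpose, so a format change stops the operator instead of waving them through.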

Then the sign-test itself. Run on input-b (RHEL9 path):

TMPIN=$(mktemp /tmp/sigtest-in.XXXXXX)
echo "ccat-test-$(date)" > "$TMPIN"
sudo pkcs11-tool --module /usr/lib64/pkcs11/opensc-pkcs11.so \
    --token-label ccat-intermediate --login \
    --sign --id 01 --mechanism ECDSA-SHA384 \
    -i "$TMPIN" -o /tmp/sigtest.bin
ls -la /tmp/sigtest.bin && rm -f "$TMPIN" /tmp/sigtest.bin

Don’t be tempted to use <(echo ...) process substitution for the input — sudo’s default config closes file descriptors above 2 when elevating, so /dev/fd/63 is invisible to the elevated pkcs11-tool and the sign aborts after successful login. PIN attempt is not wasted (successful logins don’t decrement the SC-HSM counter), but you’ve still typed your PIN for nothing. Use a temp file.

PIN goes in via interactive prompt. Never put --pin <value> on the command line — it lands in shell history and ps.

Expected: a non-empty /tmp/sigtest.bin (~96 bytes for ECDSA-SHA384 / P-384, that’s r||s with 48 bytes each).
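
The size expectation can be checked mechanically instead of reading ls output. A minimal sketch, assuming the raw r||s signature layout described above (2 x 48 bytes for P-384); sig_len_ok is a hypothetical helper name, and it would have to run before the rm in the sign-test block removes the file.

```shell
# Succeed if FILE ($1) is exactly WANT ($2, default 96) bytes long.
sig_len_ok() {
    file=$1
    want=${2:-96}
    [ -f "$file" ] || return 1
    # Arithmetic expansion strips any padding some wc implementations emit.
    size=$(( $(wc -c < "$file") ))
    [ "$size" -eq "$want" ]
}

# Usage (hypothetical invocation):
#   sig_len_ok /tmp/sigtest.bin && echo "signature length OK"
```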

If the PIN is rejected: stop, do not retry on a guess. Each wrong attempt decrements the counter. Re-read the paper carefully (O vs 0, keyboard layout) and resume only when very confident. Lockout at counter=0 is recoverable via sudo sc-hsm-tool --unlock-pin with the SO-PIN from the safe.

If signing fails for non-PIN reasons (key on a different id, HSM misbehaving): diagnose before continuing — don’t keep retrying.

§A4 — Bind-check: HSM keys ↔ ceremony artefacts#

This proves the dongle in input-b is the same dongle the ceremony wrote to. Run on a workstation with the export USB mounted at /mnt/export-usb/ (or with the five ceremony files copied somewhere local). At this stage the artefacts on the export USB are the canonical reference — the in-repo files under ca_trust/files/ are still the Phase 1 throwaway and will be overwritten in §B2.

Three checks, two different conversion paths because the artefacts have two different formats:

  • id=01 ↔ intermediate_ca.crt: X.509 path (public key embedded in a certificate; compare PEM-to-PEM)

  • id=02 ↔ ssh_user_ca.pub: OpenSSH wire-format path

  • id=03 ↔ ssh_host_ca.pub: OpenSSH wire-format path

Don’t try openssl x509 on the .pub files — they aren’t X.509, they’re OpenSSH wire format, and openssl x509 will error out with “Could not find certificate”.

id=01 — intermediate, X.509 path#

Pull the public part of id=01 off the HSM and compare to the public key embedded in intermediate_ca.crt:

# On input-b:
sudo pkcs11-tool --module /usr/lib64/pkcs11/opensc-pkcs11.so --token-label ccat-intermediate --read-object --id 01 --type pubkey -o /tmp/int.der
openssl ec -pubin -inform DER -in /tmp/int.der -pubout -outform PEM > /tmp/int.pem
# On the workstation:
scp input-b:/tmp/int.pem /tmp/int.pem
openssl x509 -in /mnt/export-usb/intermediate_ca.crt -pubkey -noout > /tmp/int_cert.pem
diff /tmp/int.pem /tmp/int_cert.pem    # must be empty

id=02 — SSH user CA, OpenSSH path#

ssh_user_ca.pub is OpenSSH wire format (one line: algorithm, base64 blob, optional comment). Convert the HSM-extracted PEM up to OpenSSH format and fingerprint-compare; don’t byte-compare the strings directly (the comment field and whitespace can vary innocuously).

# On input-b:
sudo pkcs11-tool --module /usr/lib64/pkcs11/opensc-pkcs11.so --token-label ccat-intermediate --read-object --id 02 --type pubkey -o /tmp/sshuser.der
openssl ec -pubin -inform DER -in /tmp/sshuser.der -pubout -outform PEM > /tmp/sshuser.pem
# On the workstation:
scp input-b:/tmp/sshuser.pem /tmp/sshuser.pem
ssh-keygen -i -m PKCS8 -f /tmp/sshuser.pem > /tmp/sshuser_from_hsm.pub
ssh-keygen -lf /tmp/sshuser_from_hsm.pub
ssh-keygen -lf /mnt/export-usb/ssh_user_ca.pub

The two ssh-keygen -lf lines print SHA256 fingerprints — visually compare them. Same fingerprint = same key. Don’t read aloud — same audio-leak hygiene as everywhere else.

If you prefer a hard diff over eyeballing fingerprints, strip the algorithm prefix and comment field so only the base64 key blob remains:

diff <(awk '{print $2}' /tmp/sshuser_from_hsm.pub) <(awk '{print $2}' /mnt/export-usb/ssh_user_ca.pub)
# Must be empty. Process substitution is fine here — no sudo involved
# (cf. §A3 where it broke under sudo).

id=03 — SSH host CA, OpenSSH path#

Same shape as id=02, with --id 03, filenames /tmp/sshhost.der, /tmp/sshhost.pem and /tmp/sshhost_from_hsm.pub, and reference /mnt/export-usb/ssh_host_ca.pub.
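
Because the id=02 and id=03 comparisons are identical in shape, a small helper avoids retyping the diff. This is a sketch; same_ssh_pubkey is a hypothetical name, and it compares the algorithm and base64 blob (fields 1 and 2) while ignoring the optional comment field.

```shell
# Compare two OpenSSH public-key files by algorithm + key blob,
# ignoring the optional trailing comment. Succeeds only on a match.
same_ssh_pubkey() {
    a=$(awk 'NF >= 2 { print $1, $2; exit }' "$1") || return 1
    b=$(awk 'NF >= 2 { print $1, $2; exit }' "$2") || return 1
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage on the workstation (hypothetical invocation):
#   same_ssh_pubkey /tmp/sshhost_from_hsm.pub /mnt/export-usb/ssh_host_ca.pub \
#       && echo MATCH || echo MISMATCH
```

An empty or malformed file on either side counts as a mismatch, which is the safe default for a bind-check.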

Decision#

If any comparison fails (non-empty diff or fingerprint mismatch): stop. The dongle in the server is not the one the ceremony wrote, or the ceremony artefacts have been tampered with. Investigate before going further.

§A5 — Decision point#

If §A1 through §A4 are all green, the stick is functional and bound to the ceremony artefacts. Proceed to Stage B.

If any failed, stop. Do not commit ceremony artefacts to the repo until the binding is proven.


Stage B — Commit ceremony artefacts to the repo#

These files are public, but their authenticity is the load-bearing property of the whole CA. Treat the commit as a high-trust action.

§B1 — Verify the fingerprint against the paper#

step certificate fingerprint /mnt/export-usb/root_ca.crt

Compare visually, character by character, against the paper from the safe. Do not read aloud — same audio-leak hygiene as the ceremony.

If they match, continue. If not, stop and investigate (the export USB or the root_ca.crt file is wrong).

§B2 — Overwrite the Phase 1 throwaway artefacts#

cp /mnt/export-usb/root_ca.crt          ansible/roles/ca_trust/files/
cp /mnt/export-usb/ssh_user_ca.pub      ansible/roles/ca_trust/files/
cp /mnt/export-usb/ssh_host_ca.pub      ansible/roles/ca_trust/files/

intermediate_ca.crt does not go into ca_trust/files/. Clients only need the root; the intermediate is only needed by step-ca itself, which reads it from its volume on input-b (populated in §G4).

§B3 — Commit with a loud message#

git add ansible/roles/ca_trust/files/
git commit -m "ca_trust: rotate to HSM-backed root (Phase 2 cutover)

Replaces 2026-04 Phase 1 throwaway root with the ceremony output
from 2026-04-29. New fingerprint: <PASTE-FROM-PAPER>.

Every CCAT client now needs to re-bootstrap:
  step ca bootstrap --force \\
    --ca-url https://ca.ccat.uni-koeln.de \\
    --fingerprint <PASTE-FROM-PAPER>"
git push origin main

§B4 — Distribute the new root via ca_trust#

cd ansible
make play-ca-trust

This is ansible-playbook -i inventory.ini -l all playbook_setup_vms.yml --tags ca_trust --vault-password-file .ansible_vault_key --ask-become-pass — make handles the vault key and sudo prompt for you. As with §A2, don’t invoke ansible-playbook directly or you’ll trip “Missing sudo password” on fact-gathering.

For a staged rollout (recommended in production — verify staging first, then push to production), use G=<group> or H=<host>:

make play-ca-trust G=input_staging   # staging hosts only
make play-ca-trust G=input_ccat      # production input nodes only
make play-ca-trust H=input-a-staging # single host

The role adds the new root to every managed host’s system trust store and the new SSH user CA public key to /etc/ssh/trusted_user_ca_keys. The old (Phase 1 throwaway) and new trust anchors briefly coexist, and that’s fine: existing 16h SSH certs continue to validate against the throwaway CA until they expire or step-ca is cut over (Stage G).

§B5 — Spot-check on one host#

ssh input-a sudo trust list | grep -i 'CCAT Observatory Root'

(or ls /etc/ssh/trusted_user_ca_keys + head -c 30 to confirm the new SSH user CA pubkey is in place.)


Stage C — Vault the intermediate user PIN#

The intermediate user PIN is the only secret on input-b that, with HSM #2 plugged in, can produce a signature. It lives in the vault and is rendered into /opt/data-center/system-integration/.env on input-b through the existing application_env schema-driven pipeline.

§C1 — Add the schema entry + populate the vault#

ccat secrets add vault_step_ca_hsm_pin --env production
# When prompted:
#   env_name:    STEP_CA_HSM_PIN
#   description: "User PIN for HSM #2 (intermediate). Source: ceremony 2026-04-29 paper sheet."
#   value:       <paste the intermediate user PIN, no echo>

§C2 — Provision .env on input-b#

ccat secrets provision --host input-b

§C3 — Verify the file on input-b#

ssh input-b "sudo grep STEP_CA_HSM_PIN /opt/data-center/system-integration/.env"

Expected: a line STEP_CA_HSM_PIN=... (visible because you sudo’d — the file is mode 0640, root:jenkins). The PIN value should match what you set in §C1.

Do not echo the PIN to a screen anyone but you can see. Same audio/video-leak hygiene as the ceremony.
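
When someone else can see the screen, a presence-only check avoids printing the PIN at all. A minimal sketch (env_has_value is a hypothetical helper name); note that it only proves the variable is set to something non-empty, not that the value matches what was entered in §C1.

```shell
# Succeed if FILE ($2) has a line KEY=<non-empty value> for KEY ($1).
# Prints nothing, so the secret never reaches the terminal.
env_has_value() {
    grep -q "^$1=..*" "$2"
}

# Equivalent one-liner against input-b (hypothetical invocation):
#   ssh input-b "sudo grep -q '^STEP_CA_HSM_PIN=..*' /opt/data-center/system-integration/.env" \
#       && echo "PIN present" || echo "PIN missing or empty"
```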


Stage D — Build a step-ca image with opensc-pkcs11#

The stock smallstep/step-ca image does not ship a PKCS#11 module. Smallstep maintains a separate -hsm flavour of the same image (e.g. 0.30.2-hsm) with the OpenSC PKCS#11 module pre-installed and pcscd available (though not auto-started — the libusb-direct vs pcscd-mediated runtime choice is settled in Stage E, not at image build time). We use the upstream -hsm tag as our base and add only what it’s missing.

§D1 — Create step-ca/Dockerfile.hsm#

ARG STEP_CA_VERSION=0.30.2
FROM smallstep/step-ca:${STEP_CA_VERSION}-hsm
USER root
RUN apt-get update \
 && apt-get install -y --no-install-recommends ca-certificates \
 && rm -rf /var/lib/apt/lists/*
USER step

Notes:

  • STEP_CA_VERSION is pinned to match the ceremony’s step-cli and step-kms-plugin pinning (see STEP_CLI_VERSION in step-ca/prepare-ceremony-usb.sh and step_cli_version in ansible/roles/hsm_host/defaults/main.yml). Bump all three together when the time comes, and never float to :latest: Phase 2 needs reproducibility across multi-year dormancy.

  • The -hsm suffix is hardcoded in the FROM line, not part of the ARG, so a version bump cannot accidentally drop the PKCS#11 layer.

  • ca-certificates is genuinely missing from the upstream -hsm image (verified empirically). step-ca needs it for outbound TLS trust during ACME flows. opensc-pkcs11 itself does not, but adding the package is cheap and the integration cost of not having it later is much higher.

§D2 — Confirm the module path inside the image#

docker build -t ccat-step-ca:hsm-test ./step-ca -f step-ca/Dockerfile.hsm
docker run --rm --entrypoint sh ccat-step-ca:hsm-test \
    -c 'ls -la /usr/lib/x86_64-linux-gnu/opensc-pkcs11.so /etc/ssl/certs/ca-certificates.crt'

Expected: both files listed, neither “No such file or directory.” The --entrypoint sh override is needed because the upstream image’s default entrypoint is step-ca, which exits with “no ca.json config file” before the ls runs.

This module path is what goes into ca.json.hsm in Stage F. It is different from the host’s /usr/lib64/pkcs11/opensc-pkcs11.so — container is Debian, host is RHEL.


Stage E — Compose changes for HSM-backed mode#

These are surgical edits to docker-compose.ca.yml on the step-ca service. Make them as a PR; review with second-pair-of-eyes; merge only when ready to schedule the Stage G window.

Architecture note. The 2026-05-04 cutover proved that the initial libusb-direct design did not work — OpenSC on Linux/Debian only reaches the HSM via pcscd. The compose changes below describe the working path: pcscd-in-container as root, with --disable-polkit, then privilege drop to step (UID 1000) for step-ca. See lessons-learned-cutover-2026-05-04.md §1 for the full path-A-to-path-C narrative.

§E1 — Replace image: with build:, container starts as root#

  step-ca:
    build:
      context: ./step-ca
      dockerfile: Dockerfile.hsm
      args:
        STEP_CA_VERSION: 0.30.2
    # Container starts as root so the entrypoint can run pcscd
    # (libusb USB ioctls require root). The entrypoint drops to
    # step (UID 1000) via runuser before exec'ing step-ca.
    user: "0:0"
    restart: always
    ...

The STEP_CA_VERSION arg can be omitted if you’re happy with the Dockerfile’s own default. Listed explicitly so version drift is visible in the compose file too — bumping in both places at once is the safer pattern.

§E2 — Remove all DOCKER_STEPCA_INIT_* env vars#

The auto-init facility no longer applies — the volume is hand-populated in Stage G. Note that this also disables the side-effect that flipped enableSSHCA on; we re-enable it explicitly in ca.json.hsm (Stage F).

§E3 — Add STEP_CA_HSM_PIN env passthrough#

    environment:
      STEP_CA_HSM_PIN: ${STEP_CA_HSM_PIN:?STEP_CA_HSM_PIN must be set in .env}
      VIRTUAL_HOST: ${CA_DOMAIN:?CA_DOMAIN must be set in .env}
      VIRTUAL_PORT: "9000"
      VIRTUAL_PROTO: "https"
      LETSENCRYPT_HOST: ${CA_DOMAIN:?CA_DOMAIN must be set in .env}
      LETSENCRYPT_EMAIL: buchbend@ph1.uni-koeln.de

§E4 — Pass through the HSM device#

Pass the whole USB bus through to the container:

    devices:
      - "/dev/bus/usb:/dev/bus/usb"

The container’s pcscd runs as root (because compose user: "0:0" plus the entrypoint doesn’t drop priv until after pcscd is started), so it can claim USB ioctls regardless of /dev/bus/usb file perms. The host udev rule (root:plugdev 0660) and group_add are not required here — leave them off.

Earlier drafts of this section required group_add: ["plugdev"] with host plugdev pinned to GID 46. That was true under the libusb-direct hypothesis (rejected). The udev rule + plugdev infrastructure on the host (commits 71323f1, c30459f) is now only useful for ad-hoc operator HSM diagnostics on the host — see lessons-learned §1.

Hot replug is NOT supported. Compose devices: is a snapshot at start time. If the dongle is unplugged and replugged the kernel may reassign busnum/devnum and the container’s view goes stale. Recovery: ccat ca restart step-ca. (See §I for the day-2 ops note on token contention.)

§E5 — PIN delivery via tmpfs + entrypoint wrapper with privdrop#

Add a tmpfs mount and the entrypoint wrapper:

    tmpfs:
      # uid=1000 because the wrapper chowns the PIN file to step
      # after writing it. mode=0700 so nothing else can list/enter.
      - /run/secrets:mode=0700,uid=1000,gid=1000,size=1M
    volumes:
      - step-ca-data:/home/step
      - ./step-ca/ssh-user-template.tpl:/home/step/config/ssh-user-template.tpl:ro
      - ./step-ca/step-ca-hsm-entrypoint.sh:/usr/local/bin/step-ca-hsm-entrypoint.sh:ro
    entrypoint: ["/usr/local/bin/step-ca-hsm-entrypoint.sh"]

The wrapper at step-ca/step-ca-hsm-entrypoint.sh does four things in order: the first three run as root, and the fourth drops privileges and execs step-ca:

#!/bin/sh
set -eu
umask 077

: "${STEP_CA_HSM_PIN:?STEP_CA_HSM_PIN must be set in .env}"

# 1. Materialize PIN on tmpfs, hand to step user only.
printf '%s' "$STEP_CA_HSM_PIN" > /run/secrets/hsm-pin
chown 1000:1000 /run/secrets/hsm-pin
chmod 0400 /run/secrets/hsm-pin
unset STEP_CA_HSM_PIN

# 2. Start pcscd. --disable-polkit bypasses the auth check
# that would otherwise reject all clients (polkitd is in the
# upstream :hsm image but cannot run without systemd/DBus).
/usr/sbin/pcscd --disable-polkit

# 3. Defensive wait for the pcscd socket — guards against a
# race where step-ca's first PKCS#11 call beats pcscd's
# socket-bind.
i=0
while [ ! -S /run/pcscd/pcscd.comm ] && [ "$i" -lt 50 ]; do
    sleep 0.1
    i=$((i + 1))
done

# 4. Drop privileges and exec step-ca.
exec runuser -u step -- /usr/local/bin/step-ca \
    /home/step/config/ca.json \
    --password-file /home/step/secrets/password

Goals:

  • The PIN file on tmpfs is readable only by UID 1000.

  • STEP_CA_HSM_PIN is unset before exec, so it doesn’t appear in step-ca’s /proc/<pid>/environ.

  • step-ca uses pin-source=/run/secrets/hsm-pin (Stage F) to log into the token at signing time.

  • Only pcscd retains root; step-ca runs as UID 1000.
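
The second goal can be checked after bring-up by looking for the variable in the running process's environment. A sketch, with environ_has_var as a hypothetical helper; it reads the NUL-separated format used by /proc/<pid>/environ and prints nothing, so a leaked value is not echoed.

```shell
# Succeed if the NUL-separated environ file ($2) defines variable NAME ($1).
environ_has_var() {
    tr '\0' '\n' < "$2" | grep -q "^$1="
}

# Usage inside the running container (hypothetical invocation; the
# container name is the one used elsewhere in this playbook):
#   docker exec ccat-ca-step-ca-1 sh -c \
#     'tr "\0" "\n" < /proc/$(pidof step-ca)/environ | grep -q "^STEP_CA_HSM_PIN=" \
#        && echo "LEAK: PIN in environ" || echo "clean"'
```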

chmod +x step-ca/step-ca-hsm-entrypoint.sh and commit.

§E6 — Drop the SoftHSM plumbing#

Remove from the step-ca service:

  • The softhsm-tokens:/var/lib/softhsm/tokens volume mount.

  • The ./step-ca/softhsm2.conf:/etc/softhsm/softhsm2.conf:ro mount.

Remove from the top-level volumes: block:

  • The softhsm-tokens: declaration.

These were vestigial Phase 1 plumbing. HSM #2 is real hardware via PKCS#11 and never used SoftHSM.

The softhsm2.conf file in step-ca/ can be deleted from git in the same PR.


Stage F — Write the HSM-aware ca.json#

Create step-ca/ca.json.hsm. This is committed in git and is the seed for the new step-ca-data volume in Stage G.

Pull <HSM2-SERIAL> from the paper PIN sheet (also visible in pkcs11-tool --list-token-slots on the host).

Key fields:

{
  "address": ":9000",
  "dnsNames": ["ca.ccat.uni-koeln.de", "localhost"],
  "logger": { "format": "text" },
  "db": {
    "type": "badgerv2",
    "dataSource": "/home/step/db"
  },
  "root":  "/home/step/certs/root_ca.crt",
  "crt":   "/home/step/certs/intermediate_ca.crt",
  "key":   "pkcs11:id=01",
  "kms": {
    "type": "pkcs11",
    "uri":  "pkcs11:module-path=/usr/lib/x86_64-linux-gnu/opensc-pkcs11.so;serial=<HSM2-SERIAL>?pin-source=/run/secrets/hsm-pin"
  },
  "ssh": {
    "userKey": "pkcs11:id=02",
    "hostKey": "pkcs11:id=03"
  },
  "authority": {
    "enableAdmin": false,
    "claims": {
      "enableSSHCA": true
    },
    "provisioners": []
  },
  "tls": {
    "cipherSuites": [
      "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
      "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256"
    ],
    "minVersion": 1.2,
    "maxVersion": 1.3,
    "renegotiation": false
  }
}

Why these choices:

  • serial=<HSM2-SERIAL>, not token=ccat-intermediate (UserPIN) — see lessons-learned §1. The token form requires URL-encoding for the space and parentheses; the serial form is stable and doesn’t need encoding.

  • pin-source=/run/secrets/hsm-pin, not pin-value=... — the ceremony used pin-value because pin-source was unreliable from a freshly-installed Ubuntu Live USB. In production the file form is what we want — the PIN is not in the URI string.

  • enableAdmin: false — Phase 1 lesson. With remote management on, step-ca uses a BoltDB-backed runtime store and offline edits to ca.json are silently ignored. We want ca.json to be the runtime source of truth.

  • claims.enableSSHCA: true — without this, OIDC SSH cert signing is rejected at the authority layer with sshCA is disabled for oidc provisioner. Phase 1 flipped it on via DOCKER_STEPCA_INIT_SSH=true; Phase 2 has to set it explicitly because we drop all DOCKER_STEPCA_INIT_* (§E2).

  • Empty provisioners: [] — they get re-added in §G6 by step-ca/provisioners-bootstrap.sh.

  • ssh.userKey / ssh.hostKey point at the SSH user / host CA keys generated on HSM #2 (id=02, id=03).
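
Two of these choices are easy to get wrong when editing by hand: leaving the <HSM2-SERIAL> placeholder in place, or slipping back to pin-value. A small pre-commit lint could catch both. This is a sketch that only greps for strings shown in the config above; ca_json_lint is a hypothetical name and it does not validate the JSON itself.

```shell
# Lint a ca.json.hsm candidate ($1): fail if the serial placeholder is
# still present, if an inline pin-value is used, or if the expected
# pin-source is missing. Reasons go to stderr.
ca_json_lint() {
    f=$1
    if grep -q '<HSM2-SERIAL>' "$f"; then
        echo "placeholder <HSM2-SERIAL> still present" >&2
        return 1
    fi
    if grep -q 'pin-value=' "$f"; then
        echo "inline pin-value found" >&2
        return 1
    fi
    grep -q 'pin-source=/run/secrets/hsm-pin' "$f" || {
        echo "pin-source=/run/secrets/hsm-pin missing" >&2
        return 1
    }
}

# Usage from the repo root (hypothetical invocation):
#   ca_json_lint step-ca/ca.json.hsm && echo "ca.json.hsm looks sane"
```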

Commit step-ca/ca.json.hsm together with the Stage E changes as the cutover PR.


Stage G — The cutover (downtime window, ~5 minutes)#

Stages A through F are reversible. Stage G is the volume wipe and bring-up. Do not start G until A–F are green and the test cohort has been told.

§G1 — Announce the window#

Notify the test cohort that the CA is going down for ~5 minutes and that they will need to re-bootstrap with the new fingerprint afterwards. The new fingerprint is the value verified against the paper in §B1; they will visually confirm it against their bootstrap output.

§G2 — Tear down step-ca only#

Dex stays up — its config is in git and its dex-data volume holds session state and signing keys that are safe to keep:

ccat ca down

§G3 — Wipe and re-create the step-ca volume#

This is the irreversible step. Stop host pcscd first if it’s running — the new container’s pcscd will compete with it for the USB device:

sudo systemctl stop pcscd.service pcscd.socket
sudo systemctl mask pcscd.service pcscd.socket   # prevent auto-restart on reboot

docker volume rm ccat-ca_step-ca-data
docker volume create ccat-ca_step-ca-data

Why mask host pcscd? Both the host’s pcscd (from the hsm_host Ansible role’s opensc package install) and the new container’s pcscd want to claim the same USB device interface via libusb. The kernel allows only one libusb client per interface. Whichever pcscd starts first wins; the other fails silently and the loser’s PKCS#11 stack reports “No slots.” Masking host pcscd makes the container the unambiguous owner. See lessons-learned §5; day-2 ops for ad-hoc host pkcs11-tool work is documented in this playbook’s “Day-2 ops — token contention” section.

§G4 — Pre-populate the volume#

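Run this from a checkout of system-integration on input-b (the repo is already deployed there for ccat ca up/etc.). Three of the four files come straight from the repo, so no export-USB plumbing is needed at this stage; the fourth, the dummy password file, is generated inline in the commands below: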

  • ansible/roles/ca_trust/files/root_ca.crt — committed in §B2

  • step-ca/files/intermediate_ca.crt — committed in 7cbfc2d (relocated out of ca_trust/files/ since clients don’t need it)

  • step-ca/ca.json.hsm — committed in Stage F

The helper container must be running (so we can docker exec into it for mkdir and chown), not just docker create’d:

# 1. Start a running helper container with the volume mounted
TMP=$(docker run -d --rm -v ccat-ca_step-ca-data:/home/step alpine sleep 300)

# 2. Pre-create the directory tree (fresh volume is empty)
docker exec "$TMP" mkdir -p /home/step/certs /home/step/config /home/step/secrets

# 3. Copy in the four files
docker cp ansible/roles/ca_trust/files/root_ca.crt "$TMP":/home/step/certs/root_ca.crt
docker cp step-ca/files/intermediate_ca.crt        "$TMP":/home/step/certs/intermediate_ca.crt
docker cp step-ca/ca.json.hsm                      "$TMP":/home/step/config/ca.json

# step-ca refuses to start without a password file even when keys are on HSM —
# we write a dummy non-empty value; it's never consulted because the key is
# referenced via the PKCS#11 URI.
printf 'unused-but-required\n' > /tmp/step-ca-password
docker cp /tmp/step-ca-password "$TMP":/home/step/secrets/password
rm -f /tmp/step-ca-password

# 4. Hand the whole tree to UID 1000 — fresh volume root is root:root,
# which would block step-ca (UID 1000) from creating /home/step/db at startup.
docker exec "$TMP" chown -R 1000:1000 /home/step

# 5. Sanity check
docker exec "$TMP" ls -la /home/step/certs /home/step/config /home/step/secrets

# 6. Tear down — --rm cleans up
docker kill "$TMP"

If step 5 doesn’t show all four files owned by 1000:1000, stop and diagnose before continuing — step-ca will fail in confusing ways on a partially-populated or root-owned volume.
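
The step 5 check can also be made pass/fail instead of eyeballed. A sketch under the assumption that stat -c '%u:%g' is available (GNU coreutils and BusyBox both provide it); check_seed_tree is a hypothetical name, and the four paths are the ones copied in step 3.

```shell
# Verify that the four seed files exist, are non-empty, and are owned by
# OWNER ($2, as uid:gid) under ROOT ($1). Reasons go to stderr.
check_seed_tree() {
    root=$1
    owner=$2
    for f in certs/root_ca.crt certs/intermediate_ca.crt config/ca.json secrets/password; do
        p="$root/$f"
        [ -s "$p" ] || { echo "missing or empty: $p" >&2; return 1; }
        got=$(stat -c '%u:%g' "$p") || return 1
        [ "$got" = "$owner" ] || { echo "wrong owner $got on $p" >&2; return 1; }
    done
}

# Usage (hypothetical): save the function as check.sh, then pipe it plus a
# call into the still-running helper container before step 6:
#   { cat check.sh; echo 'check_seed_tree /home/step 1000:1000'; } | docker exec -i "$TMP" sh
```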

Two docker cp gotchas this section guards against:

  1. docker create --rm is not a running container. docker exec refuses to run against it, so the mkdir -p step would fail and the later copies would land in an incomplete tree. Earlier playbook drafts used docker create; switching to docker run -d --rm is what makes step 2 work.

  2. docker cp - reads stdin as a tar stream, not raw bytes. echo "unused-but-required" | docker cp - CONTAINER:/path fails with “archive/tar: invalid tar header.” Always use a temp file plus docker cp <tmpfile>.

§G5 — Bring step-ca back up#

ccat ca up
ccat ca logs step-ca

Expected log lines:

  • Loaded key from PKCS#11 URI ...

  • Server listening on :9000

If you see PKCS#11 errors instead (module not found, permission denied, token not present), the most likely causes are:

  1. Container can’t see the device: docker exec <container> pkcs11-tool --list-slots reports nothing. Fix: check the §E4 devices: passthrough and confirm host pcscd is stopped and masked (§G3) so it isn’t holding the dongle. group_add and the host udev rule play no part in the pcscd-in-container path.

  2. PIN file empty or unreadable: docker exec <container> ls -la /run/secrets/hsm-pin shows zero bytes or wrong perms. Fix: check the entrypoint script is executable and STEP_CA_HSM_PIN is in .env.

  3. Wrong serial in ca.json.hsm: the URI serial=... must match pkcs11-tool --list-token-slots exactly. Fix and rebuild the volume (§G4 onwards).

If you have to roll back, the procedure is:

ccat ca down
docker volume rm ccat-ca_step-ca-data
# Re-populate from a backup of the Phase 1 volume — only possible if
# you snapshotted it before Stage G3. Otherwise: re-run Phase 1
# auto-init by reverting the compose changes and bringing up.

(In practice, “roll back” means “re-run Phase 1 with the throwaway root and try Stage G again next window.” The Phase 1 root in the committed ca_trust/files/ has been overwritten in Stage B, so a true rollback also requires reverting that commit.)

§G6 — Re-add provisioners#

DEX_STEPCA_CLIENT_SECRET="$(ccat secrets show vault_dex_stepca_client_secret --reveal 2>/dev/null | tail -1)" \
OIDC_ADMIN_EMAIL="buchbend@ph1.uni-koeln.de" \
./step-ca/provisioners-bootstrap.sh

ccat ca restart step-ca   # required: step-ca caches ca.json at startup

This re-creates the six-provisioner set via direct jq edits to ca.json: CCAT-GitHub (OIDC), prod-services, staging-services, service-accounts (JWK), acme (ACME), sshpop (SSHPOP). Dex’s dex-data volume is unchanged so the static step-ca client secret still works.

Why not ccat ca provisioner sync? That command runs the older provisioners-add.sh, which uses step ca provisioner add. step-cli 0.30.2 has no flag combination that makes that work against a Phase-2 ca.json (admin-API requires --ca-url + admin auth; --offline doesn’t exist; --ca-config doesn’t trigger offline editing). See lessons-learned §2. The bootstrap script bypasses step-cli for the add step and uses jq directly, which is the only path that works in this version.

§G7 — Verify external endpoints#

curl -sI https://ca.ccat.uni-koeln.de/health
curl -s https://auth.ccat.uni-koeln.de/.well-known/openid-configuration | jq .issuer

/health should respond. Note that in the Phase 1 layout the CA is reached behind nginx-proxy with a Let’s Encrypt cert on 443; see CA architecture doc § “Why step-ca is NOT behind nginx-proxy” for the Phase 3 cutover plan once the uni firewall opens TCP 9000.

Dex issuer must be exactly https://auth.ccat.uni-koeln.de.

§G8 — Issue a test cert from inside the box, prove HSM signing#

# Provisioner count via direct ca.json read (the API path needs --ca-url)
docker exec ccat-ca-step-ca-1 jq -r '.authority.provisioners[] | "\(.type)\t\(.name)"' /home/step/config/ca.json
# Should list 6 provisioners

# Issue a test cert via the prod-services JWK provisioner —
# the entire signing path goes through the HSM intermediate.
docker exec ccat-ca-step-ca-1 step ca certificate proof.test /tmp/t.crt /tmp/t.key \
    --provisioner prod-services \
    --ca-url https://localhost:9000 \
    --root /home/step/certs/root_ca.crt \
    --provisioner-password-file /home/step/secrets/password

# Verify chain: cert -> intermediate (HSM-backed) -> root
docker exec ccat-ca-step-ca-1 openssl verify \
    -CAfile /home/step/certs/root_ca.crt \
    -untrusted /home/step/certs/intermediate_ca.crt \
    /tmp/t.crt
# Want: "/tmp/t.crt: OK"

# Cleanup
docker exec ccat-ca-step-ca-1 rm -f /tmp/t.crt /tmp/t.key

To rule out any doubt that the HSM is actually being used (rather than a file-backed key with the same public part), check the runtime topology:

# step-ca process has the PKCS#11 module loaded, no SoftHSM:
docker exec ccat-ca-step-ca-1 sh -c 'cat /proc/$(pidof step-ca)/maps | grep -E "opensc-pkcs11|softhsm"'

# pcscd holds the actual USB device file open:
docker exec ccat-ca-step-ca-1 sh -c 'ls -l /proc/$(pidof pcscd)/fd/' | grep -E 'bus/usb|ccid'

The conjunction of “ca.json points only at PKCS#11”, “opensc-pkcs11.so mmap’d into step-ca’s address space”, “pcscd holding the USB FD”, and “issued cert chains back to the intermediate whose public key was bit-equal to HSM id=01 in §A4” is conclusive proof of HSM-backed signing without needing physical access to the dongle.


Stage H — Test cohort re-bootstraps#

This is the rehearsed checkpoint. Every future root-rotation event depends on this command working cleanly across the team.

§H1 — Each member runs#

step ca bootstrap --force \
  --ca-url https://ca.ccat.uni-koeln.de \
  --fingerprint <NEW-FINGERPRINT-FROM-PAPER>

The fingerprint is the one from the announcement they received in §G1. They must visually compare what step-cli prints against that fingerprint before pressing y. If it doesn’t match, stop and investigate. Do not click through.

§H2 — Each member tests step ssh login#

step ssh login
ssh input-a.data.ccat.uni-koeln.de

End-to-end flow: browser → Dex → GitHub OAuth → ccatobs/datacenter team check → cert lands in ssh-agent → ssh into a managed host succeeds.
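The "cert lands in ssh-agent" step of that flow can be checked directly before attempting the ssh. A minimal sketch, assuming OpenSSH lists certificate identities with a key type ending in "-CERT" (for example ECDSA-CERT) in the output of ssh-add -l; the helper name count_agent_certs is hypothetical.

```shell
# Hypothetical helper: reads `ssh-add -l` output on stdin and prints how many
# identities are certificates. Assumption: OpenSSH shows certificate key types
# as "(...-CERT)" at the end of the line.
count_agent_certs() {
    # grep -c prints 0 but exits non-zero when nothing matches; keep status 0.
    grep -c -e '-CERT)$' || true
}

# Usage after `step ssh login`:
# ssh-add -l | count_agent_certs     # expect at least 1
```

A count of zero after a seemingly successful login suggests the certificate never reached the agent, which narrows the problem to the login step rather than the target host.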

If their bootstrap completes but step ssh login fails with x509: certificate signed by unknown authority, they’re hitting the trust-bundle issue from Phase 1 (LE cert on 443 vs CCAT root in ~/.step). Workaround documented in docs/source/ca-provisioner-management.md § “Troubleshooting: x509 certificate signed by unknown authority”. Phase 3 fix is opening TCP 9000.

§H3 — Retrospective#

Capture any snags in docs/source/ceremony/ as lessons-learned-cutover-YYYY-MM-DD.md. The first real root rotation in 5–10 years will follow the same procedure.


§I — After the cutover settles#

  • Watch the CA for at least one week. No new production services are migrated yet — soak time only.

  • Watch for HSM/USB/udev surprises across container restarts: docker restart ccat-ca-step-ca-1 and confirm step-ca comes back cleanly without operator intervention.

  • Watch LE auto-renewal tick over for ca.ccat.uni-koeln.de and auth.ccat.uni-koeln.de.

  • Confirm provisioner counts hold across restarts. Use the direct ca.json read (admin-API path needs --ca-url and is blocked by enableAdmin: false): docker exec ccat-ca-step-ca-1 jq '.authority.provisioners | length' /home/step/config/ca.json — six entries, every time.

  • When the soak passes, mark Phase 2 done in step-ca/COMMISSIONING-TODO.md and start Phase 3 (production service rollout).
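The restart and provisioner-count checks in the list above lend themselves to a repeatable pass/fail check during the soak week. A minimal sketch; check_provisioner_count is a hypothetical helper, and the default of six comes from the count stated above.

```shell
# Hypothetical helper: compare an observed provisioner count with the
# expected one (defaults to six, per the soak checklist).
check_provisioner_count() {
    actual=$1
    expected=${2:-6}
    if [ "$actual" = "$expected" ]; then
        echo "OK: $actual provisioners"
    else
        echo "FAIL: expected $expected provisioners, got '$actual'"
        return 1
    fi
}

# Usage on input-b: restart, then read ca.json directly as described above.
# docker restart ccat-ca-step-ca-1
# check_provisioner_count "$(docker exec ccat-ca-step-ca-1 \
#     jq '.authority.provisioners | length' /home/step/config/ca.json)"
```

An empty result (for example because the container did not come back) is reported as a failure rather than silently passing.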

Day-2 ops — token contention#

Only one libusb client can claim a given device interface at a time. While the container’s pcscd holds the dongle, host-side commands such as sudo pkcs11-tool will fail. The right workflow for ad-hoc HSM diagnostics on the host is:

ccat ca down                                    # release the device
sudo systemctl unmask pcscd.service pcscd.socket
sudo systemctl start pcscd.socket               # socket-activates
sudo pkcs11-tool --module /usr/lib64/pkcs11/opensc-pkcs11.so --list-slots
# ... do whatever ...
sudo systemctl stop pcscd.service pcscd.socket
sudo systemctl mask pcscd.service pcscd.socket
ccat ca up                                      # container takes the device back

Treat host-side diagnostics as brief, scheduled operations — every minute the container is down, certs aren’t issuing.
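Because a forgotten ccat ca up leaves the CA down, the sequence above can be wrapped so that the hand-back always runs, even if the diagnostic command fails. A minimal sketch using a subshell EXIT trap; the function name with_host_hsm and the RUN dry-run hook are hypothetical, and the systemctl and ccat commands are exactly the ones listed above.

```shell
# Hypothetical wrapper: hand the HSM to the host, run one command, and always
# hand it back. Set RUN=echo to print the steps instead of executing them.
with_host_hsm() {
    (
        restore() {
            ${RUN:-} sudo systemctl stop pcscd.service pcscd.socket
            ${RUN:-} sudo systemctl mask pcscd.service pcscd.socket
            ${RUN:-} ccat ca up                      # container takes the device back
        }
        ${RUN:-} ccat ca down || exit 1              # nothing to restore if this fails
        trap restore EXIT                            # runs on success and on failure
        ${RUN:-} sudo systemctl unmask pcscd.service pcscd.socket
        ${RUN:-} sudo systemctl start pcscd.socket   # socket-activates
        "$@"                                         # the diagnostic itself
    )
}

# Usage:
# with_host_hsm sudo pkcs11-tool --module /usr/lib64/pkcs11/opensc-pkcs11.so --list-slots
```

Running it first with RUN=echo shows the exact order of operations without touching the CA.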


Roll-back / what-if#

| Symptom | Likely cause | Action |
|---|---|---|
| §A pkcs11-tool shows no slot | Dongle not seated, or udev rule hasn’t reloaded | Reseat, udevadm control --reload-rules && udevadm trigger, retry |
| §A4 diff non-empty | Wrong dongle in server, or tampered export USB | Stop. Treat as integrity failure; do not proceed. Plan a fresh ceremony if needed |
| §G5 module not found in container logs | opensc-pkcs11 not in image | Confirm Stage D image build (should be 0.30.2-hsm not 0.30.2); check docker exec ccat-ca-step-ca-1 ls /usr/lib/x86_64-linux-gnu/opensc-pkcs11.so |
| §G5 could not find PKCS#11 token | Most often: host pcscd is running and holding the dongle, OR container is starting as UID 1000 instead of root and pcscd-in-container can’t claim USB ioctls | Mask host pcscd (§G3); confirm compose has user: "0:0"; confirm entrypoint runs /usr/sbin/pcscd --disable-polkit |
| §G5 polkit-related rejection in pcscd logs | Entrypoint missing --disable-polkit flag | Update entrypoint script per §E5, then restart the container (the entrypoint is bind-mounted, so a ccat ca restart picks up changes after git pull) |
| §G5 token not present | Wrong serial in URI, or HSM came back on a different bus after a host reboot | Check pkcs11-tool --list-token-slots, fix URI, rebuild volume |
| §G6 step-cli requires the '--ca-url' flag | You ran ccat ca provisioner sync (the old provisioners-add.sh path), which doesn’t work with enableAdmin: false in step-cli 0.30.2 | Use step-ca/provisioners-bootstrap.sh per §G6 instead |
| §H1 step ca bootstrap fingerprint mismatch | Critical. Either the bootstrap is hitting the wrong host, or someone has substituted an attacker-issued cert | Stop. Do not click through. Contact ops |
| Test cohort can bootstrap, can’t step ssh login | Trust-bundle hack needed (LE on 443, CCAT root in step trust) | See troubleshooting in ca-provisioner-management.md |


Appendix — files this playbook touches#

New files (committed as part of the Stage E PR):

  • step-ca/Dockerfile.hsm

  • step-ca/step-ca-hsm-entrypoint.sh

  • step-ca/ca.json.hsm

Edited files:

  • docker-compose.ca.yml — service step-ca (build, env, devices, tmpfs, entrypoint, volume cleanup)

  • ansible/host_vars/input-b/hsm.yml — _hsm_enforce_verify: true

  • ansible/roles/ca_trust/files/root_ca.crt — overwritten with ceremony output

  • ansible/roles/ca_trust/files/ssh_user_ca.pub — overwritten

  • ansible/roles/ca_trust/files/ssh_host_ca.pub — overwritten

  • ansible/vars_application_schema.yml — adds vault_step_ca_hsm_pin (via ccat secrets add)

Deleted files:

  • step-ca/softhsm2.conf — vestigial Phase 1, never used in earnest

Volumes:

  • ccat-ca_step-ca-data — wiped in §G3, re-populated in §G4

  • ccat-ca_softhsm-tokens — declared in compose but empty; can be docker volume rm’d after Stage E lands

  • ccat-ca_dex-data — not touched; Dex state survives the cutover