CCAT CA — HSM Cutover Playbook (post-ceremony)
Read this before doing anything. This is the executable companion to the offline-root ceremony playbook. Where that document covers what happened on the air-gapped laptop on 2026-04-29, this document covers what now has to happen on input-b to make the rest of CCAT trust the new HSM-backed root.
Two operators recommended (one driver, one second-pair-of-eyes). Stages A–F are reversible and low-risk. Stage G is the irreversible volume wipe + bring-up, scheduled in a known downtime window. Stage H is the test cohort re-bootstrapping with the new fingerprint.
Reference docs:
playbook.md: the offline ceremony itself
lessons-learned-2026-04-29.md: PKCS#11 URI shapes that actually work
background/ca-architecture.md: design context for the CA this playbook is cutting over
step-ca/COMMISSIONING-TODO.md: overall phasing checklist
§0 — Preconditions
Before starting Stage A, all of the following must be true:
The 2026-04-29 ceremony completed successfully and the export USB contains five files: root_ca.crt, intermediate_ca.crt, ssh_user_ca.pub, ssh_host_ca.pub, FINGERPRINT.txt.
The paper PIN sheet from the ceremony is in the safe; the intermediate user PIN is also accessible to the cutover operator (memorised, written separately, or carried in a sealed envelope), but not the root PINs.
Both HSM serials are recorded on the paper sheet. The HSM #2 serial is needed inline in ca.json.hsm (Stage F).
HSM #2 is physically installed in input-b’s internal USB port. Chassis closed. Server up.
You are reading this on a trusted workstation (not on input-b itself; keep the operator role separate from the host).
The Phase 1 test cohort has been told that a re-bootstrap is coming and roughly when.
You have a clean working tree on main of system-integration.
If any of the above is missing, stop. Do not proceed.
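The first precondition can be checked mechanically before Stage A. This is a minimal sketch: check_export_usb is a hypothetical helper name, not something in the repo, and /mnt/export-usb is the mount point assumed from §A4.

```shell
# Hypothetical helper: confirm the export USB holds the five ceremony files.
check_export_usb() {
    dir=${1:-/mnt/export-usb}   # assumed mount point, as used in §A4
    missing=0
    for f in root_ca.crt intermediate_ca.crt ssh_user_ca.pub ssh_host_ca.pub FINGERPRINT.txt; do
        # -s is true only if the file exists and is non-empty
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Call it as check_export_usb /mnt/export-usb; a non-zero exit means stop. It checks presence only, not authenticity, which is what §A4 and §B1 are for.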
Stage A — Verify HSM #2 is functional on input-b
Goal: confirm the OS sees the dongle, OpenSC can talk to it, and the keys on the card match the public artefacts on the export USB. No compose changes here. If anything in this stage fails, stop and diagnose before going further.
§A1 — Host-level visibility
SSH to input-b. Run:
lsusb | grep -i nitrokey
sudo systemctl status pcscd
sudo pkcs11-tool --list-slots
Expected:
Exactly one Nitrokey HSM 2 line in lsusb (vendor:product ID 20a0:4230).
pcscd active and running.
pkcs11-tool --list-slots shows one slot, token label ccat-intermediate (UserPIN).
If pkcs11-tool shows no slot, reseat the dongle and reload udev:
sudo udevadm control --reload-rules
sudo udevadm trigger
Re-check. Still nothing? Stop here — the dongle, the USB port, or the udev rule needs investigating before anything else happens.
§A2 — End-to-end run of the hsm_host Ansible role
The role installs opensc + opensc-tools, deploys
99-nitrokey-hsm.rules, ensures the plugdev group exists, and
verifies the slot is visible:
cd ansible
make play-hsm-host
This is ansible-playbook -i inventory.ini -l input-b playbook_setup_vms.yml --tags hsm_host --vault-password-file .ansible_vault_key --ask-become-pass — make handles the sudo prompt and the vault key for you. Don’t invoke ansible-playbook directly; you’ll trip “Missing sudo password” on the fact-gathering task.
Expected: green run; the final Report detected HSM slots debug task
prints the slot info (one slot, label ccat-intermediate (UserPIN)).
After the first clean run, flip the role into hard-fail mode for the
future. Edit host_vars/input-b/hsm.yml (create if missing) so the
role refuses to skip verification on subsequent runs:
_hsm_enforce_verify: true
Commit + push.
§A3 — Sign-test with the intermediate key on id=01
Confirms the HSM is not just visible but actually responsive. Before running the sign command, do two free pre-flight checks that don’t consume PIN attempts:
# Confirm id=01, id=02, id=03 actually exist (no-login operation)
sudo pkcs11-tool --module /usr/lib64/pkcs11/opensc-pkcs11.so \
--token-label ccat-intermediate --list-objects --type pubkey
# Read the user PIN retry counter before betting on it
sudo sc-hsm-tool
Expect three pubkey objects (intermediate, SSH user CA, SSH host CA)
and a user PIN counter at full value (typically 3 of 3 on a
freshly-initialised SC-HSM). If the counter reads 1 of 3, do not
proceed to the sign-test on a guess — reset via the SO-PIN first
(sudo sc-hsm-tool --unlock-pin, prompts for SO-PIN then new user
PIN).
Then the sign-test itself. Run on input-b (RHEL9 path):
TMPIN=$(mktemp /tmp/sigtest-in.XXXXXX)
echo "ccat-test-$(date)" > "$TMPIN"
sudo pkcs11-tool --module /usr/lib64/pkcs11/opensc-pkcs11.so \
--token-label ccat-intermediate --login \
--sign --id 01 --mechanism ECDSA-SHA384 \
-i "$TMPIN" -o /tmp/sigtest.bin
ls -la /tmp/sigtest.bin && rm -f "$TMPIN" /tmp/sigtest.bin
Don’t be tempted to use <(echo ...) process substitution for the input: sudo’s default config closes file descriptors above 2 when elevating, so /dev/fd/63 is invisible to the elevated pkcs11-tool and the sign aborts after successful login. The PIN attempt is not wasted (successful logins don’t decrement the SC-HSM counter), but you’ve still typed your PIN for nothing. Use a temp file.
PIN goes in via interactive prompt. Never put --pin <value> on
the command line — it lands in shell history and ps.
Expected: a non-empty /tmp/sigtest.bin (~96 bytes for
ECDSA-SHA384 / P-384, that’s r||s with 48 bytes each).
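The size expectation can be checked mechanically, as long as the check runs before the cleanup in the sequence above deletes the file. A minimal sketch, assuming pkcs11-tool wrote the raw r||s form described here; check_sig_size is a hypothetical helper name.

```shell
# Hypothetical helper: fail unless the signature file has the expected size.
check_sig_size() {
    file=$1
    expected=${2:-96}   # 2 x 48 bytes (r||s) for ECDSA on P-384
    actual=$(wc -c < "$file" | tr -d ' ')
    if [ "$actual" -ne "$expected" ]; then
        echo "unexpected signature size: $actual bytes (want $expected)" >&2
        return 1
    fi
}
```

Run it as check_sig_size /tmp/sigtest.bin between the sign command and the rm. The file is created under sudo, so the helper may itself need to run as root.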
If the PIN is rejected: stop, do not retry on a guess. Each wrong
attempt decrements the counter. Re-read the paper carefully (O vs 0,
keyboard layout) and resume only when very confident. Lockout at
counter=0 is recoverable via sudo sc-hsm-tool --unlock-pin with the
SO-PIN from the safe.
If signing fails for non-PIN reasons (key on a different id, HSM
misbehaving): diagnose before continuing — don’t keep retrying.
§A4 — Bind-check: HSM keys ↔ ceremony artefacts
This proves the dongle in input-b is the same dongle the ceremony
wrote to. Run on a workstation with the export USB mounted at
/mnt/export-usb/ (or with the five ceremony files copied somewhere
local). At this stage the artefacts on the export USB are the
canonical reference — the in-repo files under ca_trust/files/
are still the Phase 1 throwaway and will be overwritten in §B2.
Three checks, two different conversion paths because the artefacts have two different formats:
id=01 ↔ intermediate_ca.crt: X.509 path (public key embedded in a certificate; compare PEM-to-PEM)
id=02 ↔ ssh_user_ca.pub: OpenSSH wire-format path
id=03 ↔ ssh_host_ca.pub: OpenSSH wire-format path
Don’t try openssl x509 on the .pub files — they aren’t X.509,
they’re OpenSSH wire format, and openssl x509 will error out with
“Could not find certificate”.
id=01 — intermediate, X.509 path
Pull the public part of id=01 off the HSM and compare to the
public key embedded in intermediate_ca.crt:
# On input-b:
sudo pkcs11-tool --module /usr/lib64/pkcs11/opensc-pkcs11.so --token-label ccat-intermediate --read-object --id 01 --type pubkey -o /tmp/int.der
openssl ec -pubin -inform DER -in /tmp/int.der -pubout -outform PEM > /tmp/int.pem
# On the workstation:
scp input-b:/tmp/int.pem /tmp/int.pem
openssl x509 -in /mnt/export-usb/intermediate_ca.crt -pubkey -noout > /tmp/int_cert.pem
diff /tmp/int.pem /tmp/int_cert.pem # must be empty
id=02 — SSH user CA, OpenSSH path
ssh_user_ca.pub is OpenSSH wire format (one line: algorithm,
base64 blob, optional comment). Convert the HSM-extracted PEM up
to OpenSSH format and fingerprint-compare; don’t byte-compare the
strings directly (the comment field and whitespace can vary
innocuously).
# On input-b:
sudo pkcs11-tool --module /usr/lib64/pkcs11/opensc-pkcs11.so --token-label ccat-intermediate --read-object --id 02 --type pubkey -o /tmp/sshuser.der
openssl ec -pubin -inform DER -in /tmp/sshuser.der -pubout -outform PEM > /tmp/sshuser.pem
# On the workstation:
scp input-b:/tmp/sshuser.pem /tmp/sshuser.pem
ssh-keygen -i -m PKCS8 -f /tmp/sshuser.pem > /tmp/sshuser_from_hsm.pub
ssh-keygen -lf /tmp/sshuser_from_hsm.pub
ssh-keygen -lf /mnt/export-usb/ssh_user_ca.pub
The two ssh-keygen -lf lines print SHA256 fingerprints — visually
compare them. Same fingerprint = same key. Don’t read aloud —
same audio-leak hygiene as everywhere else.
If you prefer a hard diff over eyeballing fingerprints, strip the algorithm prefix and comment field so only the base64 key blob remains:
diff <(awk '{print $2}' /tmp/sshuser_from_hsm.pub) <(awk '{print $2}' /mnt/export-usb/ssh_user_ca.pub)
# Must be empty. Process substitution is fine here — no sudo involved
# (cf. §A3 where it broke under sudo).
id=03 — SSH host CA, OpenSSH path
Same shape as id=02 with --id 03, filenames /tmp/sshhost.der,
/tmp/sshhost.pem and /tmp/sshhost_from_hsm.pub, reference
/mnt/export-usb/ssh_host_ca.pub.
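For id=03 the hard-diff variant can be wrapped in a small helper that ignores the comment field. A minimal sketch: same_ssh_key is a hypothetical name, and /tmp/sshhost_from_hsm.pub is assumed to have been produced the same way as /tmp/sshuser_from_hsm.pub in the id=02 steps. Unlike the diff shown for id=02, it compares the key type as well as the base64 blob.

```shell
# Hypothetical helper: succeed only if two OpenSSH public key files carry
# the same key type (field 1) and base64 blob (field 2). The comment field
# and surrounding whitespace are ignored.
same_ssh_key() {
    a=$(awk 'NR == 1 { print $1, $2 }' "$1")
    b=$(awk 'NR == 1 { print $1, $2 }' "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage on the workstation (paths assumed, see above):
#   same_ssh_key /tmp/sshhost_from_hsm.pub /mnt/export-usb/ssh_host_ca.pub \
#       && echo "id=03 matches" || echo "MISMATCH: stop"
```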
Decision
If any comparison fails (non-empty diff or fingerprint mismatch): stop. The dongle in the server is not the one the ceremony wrote, or the ceremony artefacts have been tampered with. Investigate before going further.
§A5 — Decision point
If §A1 through §A4 are all green, the stick is functional and bound to the ceremony artefacts. Proceed to Stage B.
If any failed, stop. Do not commit ceremony artefacts to the repo until the binding is proven.
Stage B — Commit ceremony artefacts to the repo
These files are public, but their authenticity is the load-bearing property of the whole CA. Treat the commit as a high-trust action.
§B1 — Verify the fingerprint against the paper
step certificate fingerprint /mnt/export-usb/root_ca.crt
Compare visually, character by character, against the paper from the safe. Do not read aloud — same audio-leak hygiene as the ceremony.
If they match, continue. If not, stop and investigate (the export USB
or the root_ca.crt file is wrong).
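As an independent cross-check on the tooling, not a substitute for the paper comparison, the same SHA-256 fingerprint can be computed with openssl and compared against the step output. openssl prints colon-separated hex with a prefix while step prints bare lowercase hex, so both need normalising first. A minimal sketch; norm_fp and fp_equal are hypothetical helper names.

```shell
# Hypothetical helpers: normalise a fingerprint string and compare two.
norm_fp() {
    # Drop any "... Fingerprint=" prefix, strip colons and whitespace,
    # then lowercase the hex digits.
    printf '%s' "$1" | sed 's/^.*=//' | tr -d ': \n' | tr 'A-F' 'a-f'
}

fp_equal() {
    [ "$(norm_fp "$1")" = "$(norm_fp "$2")" ]
}

# Usage (both commands read the same file from the export USB):
#   fp_equal "$(openssl x509 -in /mnt/export-usb/root_ca.crt -noout -fingerprint -sha256)" \
#            "$(step certificate fingerprint /mnt/export-usb/root_ca.crt)" \
#       && echo "tools agree" || echo "tools DISAGREE: stop"
```

Agreement between the two tools says nothing about whether the file is the right one; only the character-by-character comparison against the paper does that.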
§B2 — Overwrite the Phase 1 throwaway artefacts
cp /mnt/export-usb/root_ca.crt ansible/roles/ca_trust/files/
cp /mnt/export-usb/ssh_user_ca.pub ansible/roles/ca_trust/files/
cp /mnt/export-usb/ssh_host_ca.pub ansible/roles/ca_trust/files/
intermediate_ca.crt does not go into ca_trust/files/. Clients
only need the root; the intermediate is deployed only to input-b,
into the step-ca volume (§G4).
§B3 — Commit with a loud message
git add ansible/roles/ca_trust/files/
git commit -m "ca_trust: rotate to HSM-backed root (Phase 2 cutover)
Replaces 2026-04 Phase 1 throwaway root with the ceremony output
from 2026-04-29. New fingerprint: <PASTE-FROM-PAPER>.
Every CCAT client now needs to re-bootstrap:
step ca bootstrap --force \\
--ca-url https://ca.ccat.uni-koeln.de \\
--fingerprint <PASTE-FROM-PAPER>"
git push origin main
§B4 — Distribute the new root via ca_trust
cd ansible
make play-ca-trust
This is ansible-playbook -i inventory.ini -l all playbook_setup_vms.yml --tags ca_trust --vault-password-file .ansible_vault_key --ask-become-pass — make handles the vault key and sudo prompt for you. As with §A2, don’t invoke ansible-playbook directly or you’ll trip “Missing sudo password” on fact-gathering.
For a staged rollout (recommended in production — verify staging
first, then push to production), use G=<group> or H=<host>:
make play-ca-trust G=input_staging # staging hosts only
make play-ca-trust G=input_ccat # production input nodes only
make play-ca-trust H=input-a-staging # single host
The role adds the new root to every managed host’s system trust store
and to /etc/ssh/trusted_user_ca_keys. Briefly the old (Phase 1
throwaway) and new roots coexist; that’s fine — existing 16h SSH
certs continue to validate against the throwaway root until they
expire or step-ca is cut over (Stage G).
§B5 — Spot-check on one host
ssh input-a sudo trust list | grep -i 'CCAT Observatory Root'
(or ls /etc/ssh/trusted_user_ca_keys + head -c 30 to confirm the
new SSH user CA pubkey is in place.)
Stage C — Vault the intermediate user PIN
The intermediate user PIN is the only secret on input-b that, with
HSM #2 plugged in, can produce a signature. It lives in the vault and
is rendered into /opt/data-center/system-integration/.env on input-b
through the existing application_env schema-driven pipeline.
§C1 — Add the schema entry + populate the vault
ccat secrets add vault_step_ca_hsm_pin --env production
# When prompted:
# env_name: STEP_CA_HSM_PIN
# description: "User PIN for HSM #2 (intermediate). Source: ceremony 2026-04-29 paper sheet."
# value: <paste the intermediate user PIN, no echo>
§C2 — Provision .env on input-b
ccat secrets provision --host input-b
§C3 — Verify the file on input-b
ssh input-b "sudo grep STEP_CA_HSM_PIN /opt/data-center/system-integration/.env"
Expected: a line STEP_CA_HSM_PIN=... (visible because you sudo’d —
the file is mode 0640, root:jenkins). The PIN value should match what
you set in §C1.
Do not echo the PIN to a screen anyone but you can see. Same audio/video-leak hygiene as the ceremony.
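If anyone else can see the screen, presence of the PIN can be confirmed without printing it. A minimal sketch; env_has_value is a hypothetical helper name. It confirms only that the key exists with a non-empty value, not that the value is the right PIN.

```shell
# Hypothetical helper: succeed if KEY=<something> is present in an env file,
# without ever printing the value.
env_has_value() {
    key=$1
    file=$2
    # -q suppresses output; the trailing "." requires at least one
    # character after the "=".
    grep -q "^${key}=." "$file"
}

# The same check as a one-liner from the workstation:
#   ssh input-b "sudo grep -q '^STEP_CA_HSM_PIN=.' /opt/data-center/system-integration/.env" \
#       && echo "PIN present" || echo "PIN missing or empty"
```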
Stage D — Build a step-ca image with opensc-pkcs11
The stock smallstep/step-ca image does not ship a PKCS#11
module. Smallstep maintains a separate -hsm flavour of the same
image (e.g. 0.30.2-hsm) with the OpenSC PKCS#11 module pre-installed
and pcscd available (though not auto-started — the libusb-direct
vs pcscd-mediated runtime choice is settled in Stage E, not at image
build time). We use the upstream -hsm tag as our base and add
only what it’s missing.
§D1 — Create step-ca/Dockerfile.hsm
ARG STEP_CA_VERSION=0.30.2
FROM smallstep/step-ca:${STEP_CA_VERSION}-hsm
USER root
RUN apt-get update \
&& apt-get install -y --no-install-recommends ca-certificates \
&& rm -rf /var/lib/apt/lists/*
USER step
Notes:
STEP_CA_VERSION is pinned to match the ceremony’s step-cli and step-kms-plugin pinning (see STEP_CLI_VERSION in step-ca/prepare-ceremony-usb.sh and step_cli_version in ansible/roles/hsm_host/defaults/main.yml). Bump all three together when the time comes rather than floating to :latest; Phase 2 needs reproducibility across multi-year dormancy.
The -hsm suffix is hardcoded in the FROM line, not part of the ARG, so a version bump cannot accidentally drop the PKCS#11 layer.
ca-certificates is genuinely missing from the upstream -hsm image (verified empirically). step-ca needs it for outbound TLS trust during ACME flows. opensc-pkcs11 itself does not, but adding the package is cheap and the integration cost of not having it later is much higher.
§D2 — Confirm the module path inside the image
docker build -t ccat-step-ca:hsm-test ./step-ca -f step-ca/Dockerfile.hsm
docker run --rm --entrypoint sh ccat-step-ca:hsm-test \
-c 'ls -la /usr/lib/x86_64-linux-gnu/opensc-pkcs11.so /etc/ssl/certs/ca-certificates.crt'
Expected: both files listed, neither “No such file or directory.”
The --entrypoint sh override is needed because the upstream image’s
default entrypoint is step-ca, which exits with “no ca.json config
file” before the ls runs.
This module path is what goes into ca.json.hsm in Stage F. It is
different from the host’s /usr/lib64/pkcs11/opensc-pkcs11.so —
container is Debian, host is RHEL.
Stage E — Compose changes for HSM-backed mode
These are surgical edits to docker-compose.ca.yml on the step-ca
service. Make them as a PR; review with second-pair-of-eyes; merge
only when ready to schedule the Stage G window.
Architecture note. The 2026-05-04 cutover proved that the initial libusb-direct design did not work: OpenSC on Linux/Debian only reaches the HSM via pcscd. The compose changes below describe the working path: pcscd-in-container as root, with --disable-polkit, then privilege drop to step (UID 1000) for step-ca. See lessons-learned-cutover-2026-05-04.md §1 for the full path-A-to-path-C narrative.
§E1 — Replace image: with build:, container starts as root
step-ca:
build:
context: ./step-ca
dockerfile: Dockerfile.hsm
args:
STEP_CA_VERSION: 0.30.2
# Container starts as root so the entrypoint can run pcscd
# (libusb USB ioctls require root). The entrypoint drops to
# step (UID 1000) via runuser before exec'ing step-ca.
user: "0:0"
restart: always
...
The STEP_CA_VERSION arg can be omitted if you’re happy with the
Dockerfile’s own default. Listed explicitly so version drift is
visible in the compose file too — bumping in both places at once is
the safer pattern.
§E2 — Remove all DOCKER_STEPCA_INIT_* env vars
The auto-init facility no longer applies — the volume is
hand-populated in Stage G. Note that this also disables the
side-effect that flipped enableSSHCA on; we re-enable it
explicitly in ca.json.hsm (Stage F).
§E3 — Add STEP_CA_HSM_PIN env passthrough
environment:
STEP_CA_HSM_PIN: ${STEP_CA_HSM_PIN:?STEP_CA_HSM_PIN must be set in .env}
VIRTUAL_HOST: ${CA_DOMAIN:?CA_DOMAIN must be set in .env}
VIRTUAL_PORT: "9000"
VIRTUAL_PROTO: "https"
LETSENCRYPT_HOST: ${CA_DOMAIN:?CA_DOMAIN must be set in .env}
LETSENCRYPT_EMAIL: buchbend@ph1.uni-koeln.de
§E4 — Pass through the HSM device
Pass the whole USB bus through to the container:
devices:
- "/dev/bus/usb:/dev/bus/usb"
The container’s pcscd runs as root (because compose user: "0:0"
plus the entrypoint doesn’t drop priv until after pcscd is
started), so it can claim USB ioctls regardless of /dev/bus/usb
file perms. The host udev rule (root:plugdev 0660) and group_add
are not required here — leave them off.
Earlier drafts of this section required group_add: ["plugdev"] with host plugdev pinned to GID 46. That was true under the libusb-direct hypothesis (rejected). The udev rule + plugdev infrastructure on the host (commits 71323f1, c30459f) is now only useful for ad-hoc operator HSM diagnostics on the host; see lessons-learned §1.
Hot replug is NOT supported. Compose devices: is a snapshot at start time. If the dongle is unplugged and replugged the kernel may reassign busnum/devnum and the container’s view goes stale. Recovery: ccat ca restart step-ca. (See §I for the day-2 ops note on token contention.)
§E5 — PIN delivery via tmpfs + entrypoint wrapper with privdrop
Add a tmpfs mount and the entrypoint wrapper:
tmpfs:
# uid=1000 because the wrapper chowns the PIN file to step
# after writing it. mode=0700 so nothing else can list/enter.
- /run/secrets:mode=0700,uid=1000,gid=1000,size=1M
volumes:
- step-ca-data:/home/step
- ./step-ca/ssh-user-template.tpl:/home/step/config/ssh-user-template.tpl:ro
- ./step-ca/step-ca-hsm-entrypoint.sh:/usr/local/bin/step-ca-hsm-entrypoint.sh:ro
entrypoint: ["/usr/local/bin/step-ca-hsm-entrypoint.sh"]
The wrapper at step-ca/step-ca-hsm-entrypoint.sh does four things
in order: the first three as root, the fourth being the privilege drop into step-ca:
#!/bin/sh
set -eu
umask 077
: "${STEP_CA_HSM_PIN:?STEP_CA_HSM_PIN must be set in .env}"
# 1. Materialize PIN on tmpfs, hand to step user only.
printf '%s' "$STEP_CA_HSM_PIN" > /run/secrets/hsm-pin
chown 1000:1000 /run/secrets/hsm-pin
chmod 0400 /run/secrets/hsm-pin
unset STEP_CA_HSM_PIN
# 2. Start pcscd. --disable-polkit bypasses the auth check
# that would otherwise reject all clients (polkitd is in the
# upstream :hsm image but cannot run without systemd/DBus).
/usr/sbin/pcscd --disable-polkit
# 3. Defensive wait for the pcscd socket — guards against a
# race where step-ca's first PKCS#11 call beats pcscd's
# socket-bind.
i=0
while [ ! -S /run/pcscd/pcscd.comm ] && [ "$i" -lt 50 ]; do
sleep 0.1
i=$((i + 1))
done
# 4. Drop privileges and exec step-ca.
exec runuser -u step -- /usr/local/bin/step-ca \
/home/step/config/ca.json \
--password-file /home/step/secrets/password
Goals:
The PIN file on tmpfs is readable only by UID 1000.
STEP_CA_HSM_PIN is unset before exec, so it doesn’t appear in step-ca’s /proc/<pid>/environ.
step-ca uses pin-source=/run/secrets/hsm-pin (Stage F) to log into the token at signing time.
Only pcscd retains root; step-ca runs as UID 1000.
chmod +x step-ca/step-ca-hsm-entrypoint.sh and commit.
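One thing to be aware of in step 3 of the wrapper: if the socket never appears, the loop falls through after 50 tries and step-ca starts anyway, so the failure only shows up later as a PKCS#11 error. A stricter variant that fails early could look like the sketch below. wait_for_socket is a hypothetical helper name; it is not what the committed entrypoint does.

```shell
# Hypothetical helper: wait for a Unix socket to appear, fail on timeout.
wait_for_socket() {
    sock=$1
    tries=${2:-50}      # 50 x 0.1s, the same budget as the wrapper's loop
    i=0
    while [ ! -S "$sock" ]; do
        if [ "$i" -ge "$tries" ]; then
            echo "timed out waiting for $sock" >&2
            return 1
        fi
        sleep 0.1
        i=$((i + 1))
    done
}

# In the entrypoint this would replace the step-3 loop:
#   wait_for_socket /run/pcscd/pcscd.comm 50 || exit 1
```

With restart: always in the compose file, an early exit here means the container is restarted rather than left running without a working token path.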
§E5b — Healthcheck (optional but recommended)
healthcheck:
test:
- CMD-SHELL
- "step ca health --ca-url https://localhost:9000 --root /home/step/certs/root_ca.crt && pkcs11-tool --module /usr/lib/x86_64-linux-gnu/opensc-pkcs11.so --list-slots > /dev/null"
# 30s — `step ca health` is fast but the chained pkcs11-tool
# round-trip via pcscd → libusb → CCID can take >10s right
# after a container restart. Observed in production; do not
# lower below 30s.
interval: 30s
timeout: 30s
retries: 3
start_period: 60s
Docker compose does NOT auto-restart on unhealthy by default — this
healthcheck surfaces problems to monitoring; recovery is a manual
ccat ca restart.
§E6 — Drop the SoftHSM plumbing
Remove from the step-ca service:
The softhsm-tokens:/var/lib/softhsm/tokens volume mount.
The ./step-ca/softhsm2.conf:/etc/softhsm/softhsm2.conf:ro mount.
Remove from the top-level volumes: block:
The softhsm-tokens: declaration.
These were vestigial Phase 1 plumbing. HSM #2 is real hardware via PKCS#11 and never used SoftHSM.
The softhsm2.conf file in step-ca/ can be deleted from git in the
same PR.
Stage F — Write the HSM-aware ca.json
Create step-ca/ca.json.hsm. This is committed in git and is the
seed for the new step-ca-data volume in Stage G.
Pull <HSM2-SERIAL> from the paper PIN sheet (also visible in
pkcs11-tool --list-token-slots on the host).
Key fields:
{
"address": ":9000",
"dnsNames": ["ca.ccat.uni-koeln.de", "localhost"],
"logger": { "format": "text" },
"db": {
"type": "badgerv2",
"dataSource": "/home/step/db"
},
"root": "/home/step/certs/root_ca.crt",
"crt": "/home/step/certs/intermediate_ca.crt",
"key": "pkcs11:id=01",
"kms": {
"type": "pkcs11",
"uri": "pkcs11:module-path=/usr/lib/x86_64-linux-gnu/opensc-pkcs11.so;serial=<HSM2-SERIAL>?pin-source=/run/secrets/hsm-pin"
},
"ssh": {
"userKey": "pkcs11:id=02",
"hostKey": "pkcs11:id=03"
},
"authority": {
"enableAdmin": false,
"claims": {
"enableSSHCA": true
},
"provisioners": []
},
"tls": {
"cipherSuites": [
"TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
"TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256"
],
"minVersion": 1.2,
"maxVersion": 1.3,
"renegotiation": false
}
}
Why these choices:
serial=<HSM2-SERIAL> rather than token=ccat-intermediate (UserPIN) (see lessons-learned §1): the token form requires URL-encoding for the space and parentheses; the serial form is stable and doesn’t need encoding.
pin-source=/run/secrets/hsm-pin rather than pin-value=...: the ceremony used pin-value because pin-source was unreliable from a freshly-installed Ubuntu Live USB. In production the file form is what we want, because the PIN is not in the URI string.
enableAdmin: false is a Phase 1 lesson. With remote management on, step-ca uses a BoltDB-backed runtime store and offline edits to ca.json are silently ignored. We want ca.json to be the runtime source of truth.
claims.enableSSHCA: true is required; without it, OIDC SSH cert signing is rejected at the authority layer with “sshCA is disabled for oidc provisioner”. Phase 1 flipped it on via DOCKER_STEPCA_INIT_SSH=true; Phase 2 has to set it explicitly because we drop all DOCKER_STEPCA_INIT_* (§E2).
provisioners: [] is empty because the provisioners get re-added in §G6 by step-ca/provisioners-bootstrap.sh.
ssh.userKey / ssh.hostKey point at the SSH user / host CA keys generated on HSM #2 (id=02, id=03).
Commit step-ca/ca.json.hsm together with the Stage E changes as the
cutover PR.
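Before committing, a quick mechanical check catches the two easiest mistakes in the kms URI: the <HSM2-SERIAL> placeholder left in place, and a pin-value carried over from ceremony-era notes. A minimal sketch using grep only; check_ca_json_hsm is a hypothetical helper name, and it does not validate JSON syntax.

```shell
# Hypothetical helper: sanity-check the kms URI in ca.json.hsm.
check_ca_json_hsm() {
    file=$1
    if grep -q '<HSM2-SERIAL>' "$file"; then
        echo "placeholder <HSM2-SERIAL> still present in $file" >&2
        return 1
    fi
    if grep -q 'pin-value=' "$file"; then
        echo "pin-value found in $file; use pin-source instead" >&2
        return 1
    fi
    # Both of these are expected somewhere in the kms URI.
    grep -q 'serial=' "$file" && grep -q 'pin-source=' "$file"
}
```

Run it as check_ca_json_hsm step-ca/ca.json.hsm from the repo root. It only looks for substrings, so a wrong but well-formed serial still passes; the exact-match check against pkcs11-tool --list-token-slots remains manual.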
Stage G — The cutover (downtime window, ~5 minutes)
Stages A through F are reversible. Stage G is the volume wipe and bring-up. Do not start G until A–F are green and the test cohort has been told.
§G1 — Announce the window
Notify the test cohort that the CA is going down for ~5 minutes and that they will need to re-bootstrap with the new fingerprint afterwards. The new fingerprint is the value from the paper PIN sheet (§B1), which they will visually confirm against their bootstrap output.
§G2 — Tear down step-ca only
Dex stays up — its config is in git and its dex-data volume holds
session state and signing keys that are safe to keep:
ccat ca down
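The rollback notes in §G5 only work if the Phase 1 volume was snapshotted before the wipe in §G3, and with step-ca down this is the last point where that is possible. A minimal sketch, assuming the alpine image is available on input-b (it is also used in §G4); snapshot_volume and the backup directory are hypothetical, not part of the repo.

```shell
# Hypothetical helper: archive a named Docker volume into OUTDIR/<volume>.tgz.
# OUTDIR must be an absolute path because it is bind-mounted.
snapshot_volume() {
    volume=$1
    outdir=$2
    docker run --rm \
        -v "$volume":/src:ro \
        -v "$outdir":/backup \
        alpine tar czf "/backup/$volume.tgz" -C /src .
}

# Usage (backup directory is an assumption, pick your own):
#   mkdir -p /root/phase1-backup
#   snapshot_volume ccat-ca_step-ca-data /root/phase1-backup
```

The archive may contain Phase 1 throwaway key material, so store and delete it with that in mind.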
§G3 — Wipe and re-create the step-ca volume
This is the irreversible step. Stop host pcscd first if it’s running — the new container’s pcscd will compete with it for the USB device:
sudo systemctl stop pcscd.service pcscd.socket
sudo systemctl mask pcscd.service pcscd.socket # prevent auto-restart on reboot
docker volume rm ccat-ca_step-ca-data
docker volume create ccat-ca_step-ca-data
Why mask host pcscd? Both the host’s pcscd (from the hsm_host Ansible role’s opensc package install) and the new container’s pcscd want to claim the same USB device interface via libusb. The kernel allows only one libusb client per interface. Whichever pcscd starts first wins; the other fails silently and the loser’s PKCS#11 stack reports “No slots.” Masking host pcscd makes the container the unambiguous owner. See lessons-learned §5; day-2 ops for ad-hoc host pkcs11-tool work is documented in this playbook’s Day-2 ops section on token contention.
§G4 — Pre-populate the volume
Run this from a checkout of system-integration on input-b (the
repo is already deployed there for ccat ca up etc.). All three
inputs come from the repo, so no export-USB plumbing is needed at this
stage; the fourth file in the volume, the dummy password, is generated in place:
ansible/roles/ca_trust/files/root_ca.crt: committed in §B2
step-ca/files/intermediate_ca.crt: committed in 7cbfc2d (relocated out of ca_trust/files/ since clients don’t need it)
step-ca/ca.json.hsm: committed in Stage F
The helper container must be running (so we can docker exec
into it for mkdir and chown), not just docker create’d:
# 1. Start a running helper container with the volume mounted
TMP=$(docker run -d --rm -v ccat-ca_step-ca-data:/home/step alpine sleep 300)
# 2. Pre-create the directory tree (fresh volume is empty)
docker exec "$TMP" mkdir -p /home/step/certs /home/step/config /home/step/secrets
# 3. Copy in the four files
docker cp ansible/roles/ca_trust/files/root_ca.crt "$TMP":/home/step/certs/root_ca.crt
docker cp step-ca/files/intermediate_ca.crt "$TMP":/home/step/certs/intermediate_ca.crt
docker cp step-ca/ca.json.hsm "$TMP":/home/step/config/ca.json
# step-ca refuses to start without a password file even when keys are on HSM —
# we write a dummy non-empty value; it's never consulted because the key is
# referenced via the PKCS#11 URI.
printf 'unused-but-required\n' > /tmp/step-ca-password
docker cp /tmp/step-ca-password "$TMP":/home/step/secrets/password
rm -f /tmp/step-ca-password
# 4. Hand the whole tree to UID 1000 — fresh volume root is root:root,
# which would block step-ca (UID 1000) from creating /home/step/db at startup.
docker exec "$TMP" chown -R 1000:1000 /home/step
# 5. Sanity check
docker exec "$TMP" ls -la /home/step/certs /home/step/config /home/step/secrets
# 6. Tear down — --rm cleans up
docker kill "$TMP"
If step 5 doesn’t show all four files owned by 1000:1000, stop
and diagnose before continuing — step-ca will fail in confusing ways
on a partially-populated or root-owned volume.
Two docker cp gotchas this section guards against:
docker create --rm is not a running container. docker exec doesn’t work on it, so the mkdir -p step would silently fail. Earlier playbook drafts used docker create; switching to docker run -d --rm is what makes step 2 work.
docker cp - reads stdin as a tar stream, not raw bytes. echo "unused-but-required" | docker cp - CONTAINER:/path fails with “archive/tar: invalid tar header.” Always use a temp file plus docker cp <tmpfile>.
§G5 — Bring step-ca back up
ccat ca up
ccat ca logs step-ca
Expected log lines:
Loaded key from PKCS#11 URI ...
Server listening on :9000
If you see PKCS#11 errors instead (module not found, permission denied, token not present), the most likely causes are:
Container can’t see the device: docker exec <container> pkcs11-tool --list-slots reports nothing. Fix: check the §E4 devices: passthrough and confirm host pcscd is stopped and masked (§G3), since a competing host pcscd produces exactly this “No slots” symptom. §E4 no longer uses group_add or the host udev rule, so those are not the place to look.
PIN file empty or unreadable: docker exec <container> ls -la /run/secrets/hsm-pin shows zero bytes or wrong perms. Fix: check the entrypoint script is executable and STEP_CA_HSM_PIN is in .env.
Wrong serial in ca.json.hsm: the URI serial=... must match pkcs11-tool --list-token-slots exactly. Fix and rebuild the volume (§G4 onwards).
If you have to roll back, the procedure is:
ccat ca down
docker volume rm ccat-ca_step-ca-data
# Re-populate from a backup of the Phase 1 volume — only possible if
# you snapshotted it before Stage G3. Otherwise: re-run Phase 1
# auto-init by reverting the compose changes and bringing up.
(In practice, “roll back” means “re-run Phase 1 with the throwaway
root and try Stage G again next window.” The Phase 1 root in the
committed ca_trust/files/ has been overwritten in Stage B, so a
true rollback also requires reverting that commit.)
§G6 — Re-add provisioners
DEX_STEPCA_CLIENT_SECRET="$(ccat secrets show vault_dex_stepca_client_secret --reveal 2>/dev/null | tail -1)" \
OIDC_ADMIN_EMAIL="buchbend@ph1.uni-koeln.de" \
./step-ca/provisioners-bootstrap.sh
ccat ca restart step-ca # required: step-ca caches ca.json at startup
This re-creates the six-provisioner set via direct jq edits to
ca.json: CCAT-GitHub (OIDC), prod-services, staging-services,
service-accounts (JWK), acme (ACME), sshpop (SSHPOP). Dex’s
dex-data volume is unchanged so the static step-ca client secret
still works.
Why not ccat ca provisioner sync? That command runs the older provisioners-add.sh, which uses step ca provisioner add. step-cli 0.30.2 has no flag combination that makes that work against a Phase-2 ca.json (admin-API requires --ca-url + admin auth; --offline doesn’t exist; --ca-config doesn’t trigger offline editing). See lessons-learned §2. The bootstrap script bypasses step-cli for the add step and uses jq directly, which is the only path that works in this version.
§G7 — Verify external endpoints
curl -sI https://ca.ccat.uni-koeln.de/health
curl -s https://auth.ccat.uni-koeln.de/.well-known/openid-configuration | jq .issuer
/health should respond (note: behind nginx-proxy with LE on 443 for
the Phase 1 layout — see CA architecture doc § “Why step-ca is NOT
behind nginx-proxy” for the Phase 3 cutover plan once the uni
firewall opens 9000).
Dex issuer must be exactly https://auth.ccat.uni-koeln.de.
§G8 — Issue a test cert from inside the box, prove HSM signing
# Provisioner count via direct ca.json read (the API path needs --ca-url)
docker exec ccat-ca-step-ca-1 jq -r '.authority.provisioners[] | "\(.type)\t\(.name)"' /home/step/config/ca.json
# Should list 6 provisioners
# Issue a test cert via the prod-services JWK provisioner —
# the entire signing path goes through the HSM intermediate.
docker exec ccat-ca-step-ca-1 step ca certificate proof.test /tmp/t.crt /tmp/t.key \
--provisioner prod-services \
--ca-url https://localhost:9000 \
--root /home/step/certs/root_ca.crt \
--provisioner-password-file /home/step/secrets/password
# Verify chain: cert -> intermediate (HSM-backed) -> root
docker exec ccat-ca-step-ca-1 openssl verify \
-CAfile /home/step/certs/root_ca.crt \
-untrusted /home/step/certs/intermediate_ca.crt \
/tmp/t.crt
# Want: "/tmp/t.crt: OK"
# Cleanup
docker exec ccat-ca-step-ca-1 rm -f /tmp/t.crt /tmp/t.key
To rule out any doubt that the HSM is actually being used (rather than a file-backed key with the same public part), check the runtime topology:
# step-ca process has the PKCS#11 module loaded, no SoftHSM:
docker exec ccat-ca-step-ca-1 sh -c 'cat /proc/$(pidof step-ca)/maps | grep -E "opensc-pkcs11|softhsm"'
# pcscd holds the actual USB device file open:
docker exec ccat-ca-step-ca-1 sh -c 'ls -l /proc/$(pidof pcscd)/fd/' | grep -E 'bus/usb|ccid'
The conjunction of “ca.json points only at PKCS#11”, “opensc-pkcs11.so
mmap’d into step-ca’s address space”, “pcscd holding the USB FD”,
and “issued cert chains back to the intermediate whose public key
was bit-equal to HSM id=01 in §A4” is conclusive proof of
HSM-backed signing without needing physical access to the dongle.
Stage H — Test cohort re-bootstraps
This is the rehearsed checkpoint. Every future root-rotation event depends on this command working cleanly across the team.
§H1 — Each member runs
step ca bootstrap --force \
--ca-url https://ca.ccat.uni-koeln.de \
--fingerprint <NEW-FINGERPRINT-FROM-PAPER>
The fingerprint is the one from the announcement they got in §G1. They
must visually compare what step-cli prints against the
fingerprint in the announcement before pressing y. If it doesn’t
match, stop and investigate. Do not click through.
§H2 — Each member tests step ssh login
step ssh login
ssh input-a.data.ccat.uni-koeln.de
End-to-end flow: browser → Dex → GitHub OAuth →
ccatobs/datacenter team check → cert lands in ssh-agent → ssh
into a managed host succeeds.
If their bootstrap completes but step ssh login fails with
x509: certificate signed by unknown authority, they’re hitting the
trust-bundle issue from Phase 1 (LE cert on 443 vs CCAT root in
~/.step). The workaround is documented in
docs/source/ca-provisioner-management.md § “Troubleshooting:
x509 certificate signed by unknown authority”. The Phase 3 fix is to
open TCP 9000.
§H3 — Retrospective#
Capture any snags in docs/source/ceremony/ as
lessons-learned-cutover-YYYY-MM-DD.md. The first real root
rotation in 5–10 years will follow the same procedure.
§I — After the cutover settles#
Watch the CA for at least one week. No new production services are migrated yet — soak time only.
Watch for HSM/USB/udev surprises across container restarts: run docker restart ccat-ca-step-ca-1 and confirm step-ca comes back cleanly without operator intervention.
Watch LE auto-renewal tick over for ca.ccat.uni-koeln.de and auth.ccat.uni-koeln.de.
Confirm provisioner counts hold across restarts. Use the direct ca.json read (admin-API path needs --ca-url and is blocked by enableAdmin: false): docker exec ccat-ca-step-ca-1 jq '.authority.provisioners | length' /home/step/config/ca.json should report six entries, every time.
When the soak passes, mark Phase 2 done in step-ca/COMMISSIONING-TODO.md and start Phase 3 (production service rollout).
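The restart-and-count check from the soak list can be scripted so it is repeatable during the week. This is a minimal sketch: the function names, the EXPECTED_PROVISIONERS override, and the 15-second settle time (SETTLE_SECONDS) are assumptions for illustration; only the container name, the jq expression, and the expected count of six come from this playbook.

```shell
# Hypothetical helper: compare a reported provisioner count with the expected one.
check_provisioner_count() {
  expected=${EXPECTED_PROVISIONERS:-6}   # six entries, per the soak checklist
  if [ "$1" = "$expected" ]; then
    echo "ok: $1 provisioners"
  else
    echo "MISMATCH: got '$1', want $expected" >&2
    return 1
  fi
}

# Hypothetical helper: restart the CA container, wait, then re-read ca.json.
soak_restart_check() {
  docker restart ccat-ca-step-ca-1 >/dev/null || return 1
  sleep "${SETTLE_SECONDS:-15}"          # assumed settle time, tune as needed
  count=$(docker exec ccat-ca-step-ca-1 \
    jq '.authority.provisioners | length' /home/step/config/ca.json) || return 1
  check_provisioner_count "$count"
}
```

Running soak_restart_check once or twice a day during the soak week gives a simple pass/fail record; any non-zero exit is worth investigating before Phase 3 starts.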
Day-2 ops — token contention#
Only one libusb client can claim a device interface at a time. While
the container’s pcscd holds the dongle, host-side sudo pkcs11-tool etc.
will fail. The right workflow for ad-hoc HSM diagnostics on the host is:
ccat ca down # release the device
sudo systemctl unmask pcscd.service pcscd.socket
sudo systemctl start pcscd.socket # socket-activates
sudo pkcs11-tool --module /usr/lib64/pkcs11/opensc-pkcs11.so --list-slots
# ... do whatever ...
sudo systemctl stop pcscd.service pcscd.socket
sudo systemctl mask pcscd.service pcscd.socket
ccat ca up # container takes the device back
Treat host-side diagnostics as brief, scheduled operations — every minute the container is down, certs aren’t issuing.
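Because a forgotten teardown step leaves the CA down or host pcscd unmasked, the sequence above can be wrapped so that the hand-back always runs, even when the diagnostic command fails. This is a minimal sketch: the function name host_hsm_diag and the RUN dry-run hook are hypothetical; the commands themselves are the ones listed above.

```shell
# Hypothetical wrapper: host_hsm_diag <command> [args...]
# Set RUN=echo for a dry run that prints the setup/teardown commands
# instead of executing them (the diagnostic command itself still runs).
host_hsm_diag() {
  ${RUN:-} ccat ca down || return 1                        # release the device
  ${RUN:-} sudo systemctl unmask pcscd.service pcscd.socket
  ${RUN:-} sudo systemctl start pcscd.socket               # socket-activates
  "$@"                                                     # the diagnostic
  rc=$?
  # Teardown runs regardless of the diagnostic's exit status.
  ${RUN:-} sudo systemctl stop pcscd.service pcscd.socket
  ${RUN:-} sudo systemctl mask pcscd.service pcscd.socket
  ${RUN:-} ccat ca up || return 1                          # container takes the device back
  return "$rc"
}
```

Possible usage: host_hsm_diag sudo pkcs11-tool --module /usr/lib64/pkcs11/opensc-pkcs11.so --list-slots. The sketch does not handle signals, so an interrupted run (Ctrl-C) still needs the teardown steps done by hand.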
Roll-back / what-if#
| Symptom | Likely cause | Action |
|---|---|---|
| §A | Dongle not seated, or udev rule hasn’t reloaded | Reseat, |
| §A4 diff non-empty | Wrong dongle in server, or tampered export USB | Stop. Treat as integrity failure; do not proceed. Plan a fresh ceremony if needed |
| §G5 | | Confirm Stage D image build (should be |
| §G5 | Most often: host pcscd is running and holding the dongle, OR container is starting as UID 1000 instead of root and pcscd-in-container can’t claim USB ioctls | Mask host pcscd (§G3); confirm compose has |
| §G5 polkit-related rejection in pcscd logs | Entrypoint missing | Update entrypoint script per §E5; rebuild container (entrypoint is bind-mounted, so a |
| §G6 step-cli | You ran | Use |
| §H1 | Either bootstrapping the wrong host, or someone substituted an attacker-issued cert | Stop. Do not click through. Contact ops |
| §G5 | Wrong serial in URI, or HSM came back on a different bus after a host reboot | Check |
| §H1 fingerprint mismatch | Critical. Either the bootstrap is hitting the wrong host, or someone has substituted an attacker-issued cert | Stop. Do not click through. Contact ops |
| Tested cohort can bootstrap, can’t | Trust-bundle hack needed (LE on 443, CCAT root in step trust) | See troubleshooting in |
Appendix — files this playbook touches#
New files (committed as part of the Stage E PR):
step-ca/Dockerfile.hsm
step-ca/step-ca-hsm-entrypoint.sh
step-ca/ca.json.hsm
Edited files:
docker-compose.ca.yml: service step-ca (build, env, devices, tmpfs, entrypoint, volume cleanup)
ansible/host_vars/input-b/hsm.yml: sets _hsm_enforce_verify: true
ansible/roles/ca_trust/files/root_ca.crt: overwritten with ceremony output
ansible/roles/ca_trust/files/ssh_user_ca.pub: overwritten
ansible/roles/ca_trust/files/ssh_host_ca.pub: overwritten
ansible/vars_application_schema.yml: adds vault_step_ca_hsm_pin (via ccat secrets add)
Deleted files:
step-ca/softhsm2.conf: vestigial from Phase 1, never used in earnest
Volumes:
ccat-ca_step-ca-data: wiped in §G3, re-populated in §G4
ccat-ca_softhsm-tokens: declared in compose but empty; can be docker volume rm’d after Stage E lands
ccat-ca_dex-data: not touched; Dex state survives the cutover