CA provisioner management
This page is a how-to for adding, updating, removing, and rotating
the CCAT step-ca provisioners. It also covers the operational rollout
of Pattern A service-account certs via ssh_service_cert. For the
provisioner set itself (six entries with their types and lifetimes),
see the lookup table in CCAT CA — Provisioner set and reference tables. For the
why behind the lifetime choices and the GitHub-team gate, see
CCAT Certificate Authority — Architecture and Design.
When to run provisioners-add.sh
The script is idempotent by skip: each provisioner is checked
against step ca provisioner list before being added, and existing
entries are left alone. You run it:
- Once during Phase 1 commissioning, right after populating the Dex step-ca client secret in the vault (vault_dex_stepca_client_secret). See the Phase 1 checklist in step-ca/COMMISSIONING-TODO.md.
- Once during Phase 2 cutover, after the step-ca-data volume has been wiped and pre-populated with ceremony outputs. The new ca.json starts fresh with no provisioners, so you re-run the script to restore the set.
- Any time you want to add a new provisioner. Edit the script to append a new add_provisioner block, commit, run. Existing ones are skipped; only the new one gets added.
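The skip-if-present behaviour can be sketched as follows. This is a minimal illustration, not the script's actual code: add_if_missing and its arguments are hypothetical, the existing names are passed in as plain text (one per line) rather than parsed from step ca provisioner list, and the real add command is replaced by an echo.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of "idempotent by skip"; not the real provisioners-add.sh.

# add_if_missing NAME EXISTING
#   NAME      provisioner to ensure
#   EXISTING  newline-separated names already present (assumed input format)
add_if_missing() {
  local name="$1" existing="$2"
  # -F: fixed string, -x: whole-line match, -q: quiet.
  # Whole-line matching keeps "prod" from matching "prod-services".
  if printf '%s\n' "$existing" | grep -Fxq -- "$name"; then
    echo "skip $name"
  else
    # The real script would call `step ca provisioner add` here.
    echo "add $name"
  fi
}

existing="$(printf 'prod-services\nstaging-services')"
add_if_missing prod-services "$existing"     # prints: skip prod-services
add_if_missing service-accounts "$existing"  # prints: add service-accounts
```

Running the same calls a second time produces the same decisions, which is what makes re-running the script after appending a new block safe.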
How to run it
# On input-b (or via ssh from a laptop if you prefer)
cd /opt/data-center/system-integration
# Recommended: use the `ccat ca provisioner sync` wrapper, which
# prompts for the client secret (hidden), reads it from your
# terminal, and runs the script with the right env for you.
ccat ca provisioner sync
# Or run the script directly:
DEX_STEPCA_CLIENT_SECRET="$(ccat secrets show vault_dex_stepca_client_secret --reveal 2>/dev/null | tail -1)" \
OIDC_ADMIN_EMAIL="you@uni-koeln.de" \
./step-ca/provisioners-add.sh
# Then apply the changes
ccat ca restart step-ca
The script:
- Aborts cleanly if required env vars are missing (DEX_STEPCA_CLIENT_SECRET, OIDC_ADMIN_EMAIL).
- Pre-flights that the target container is running and that the password file is readable inside it.
- Adds each provisioner via docker exec ... step ca provisioner add, reusing /home/step/secrets/password (which contains STEP_CA_PASSWORD) for both JWK encryption and admin API auth.
- Prints a summary of the final provisioner list.
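The missing-variable abort could look something like the sketch below. The function name require_env is hypothetical, and the real script may check its inputs differently; the sketch only shows one way to fail early and name every missing variable at once.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight sketch; not taken from provisioners-add.sh.

# require_env VAR...
#   Returns 1 and names every listed variable that is unset or empty.
#   Arguments must be plain variable names, since they are passed to eval.
require_env() {
  local missing=0 var value
  for var in "$@"; do
    # Indirect lookup: read the value of the variable whose name is in $var.
    eval "value=\${$var-}"
    if [ -z "$value" ]; then
      echo "missing required env var: $var" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage: abort before touching the CA if anything is missing.
#   require_env DEX_STEPCA_CLIENT_SECRET OIDC_ADMIN_EMAIL || exit 1
```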
Updating lifetimes on existing provisioners
The script does not modify existing provisioners. If you want to
change a lifetime — say, loosen prod-services from 90d to 180d —
use step ca provisioner update directly inside the container:
docker exec -it ccat-ca-step-ca-1 step ca provisioner update prod-services \
--x509-default-dur 4320h \
--x509-max-dur 4320h
ccat ca restart step-ca
For the full table of valid lifetime flags, see CCAT CA — Provisioner set and reference tables § “Lifetime flags”.
All durations are passed as Go time.Duration strings — h for hours,
m for minutes. Don’t use d or w (not supported).
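Because lifetimes are usually discussed in days but must be passed in hours, a small conversion helper avoids arithmetic slips. days_to_hours is a hypothetical helper, not part of the CCAT tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert whole days to a Go duration string in hours.

# days_to_hours DAYS -> prints e.g. "4320h"
days_to_hours() {
  echo "$(( $1 * 24 ))h"
}

days_to_hours 90    # prints: 2160h
days_to_hours 180   # prints: 4320h
```

The second value matches the 4320h used in the prod-services example above.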
After any provisioner update, always restart step-ca so it
re-reads ca.json:
ccat ca restart step-ca
You can also update the script’s default values and re-commit, so
that a future DR re-install gets the new defaults. But the live
provisioners won’t change until you also run step ca provisioner update.
Removing a provisioner
docker exec -it ccat-ca-step-ca-1 step ca provisioner remove <name>
ccat ca restart step-ca
Be careful: removing a provisioner does NOT revoke the certs it previously issued. Those keep validating until their own expiry. If you need to actually invalidate issued certs, bump the intermediate or add them to the CRL. See CA rotation and disaster recovery for intermediate rotation.
Rotating JWK provisioner passwords
The three JWK provisioners — prod-services, staging-services,
service-accounts — encrypt their private keys inside ca.json
using a password each. The password is also what step-cli clients
must supply (--password-file) when issuing certificates through
that provisioner. Anyone with the password can issue certs against
the provisioner’s authorized cert types and lifetimes. Treat each
password like a service credential — vault-stored, rotated on a
schedule or on suspicion of disclosure.
The passwords live in the application_env vault as:
- vault_step_ca_prov_prod_services_password
- vault_step_ca_prov_staging_services_password
- vault_step_ca_prov_service_accounts_password
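The three names follow a visible pattern: the provisioner name with hyphens turned into underscores, wrapped in a fixed prefix and suffix. A sketch of that mapping is below. vault_var_for is a hypothetical helper, and the pattern is inferred from the three names listed here, so treat it as a convenience rather than a guarantee for future provisioners.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive the vault variable name for a JWK provisioner.

# vault_var_for PROVISIONER -> prints the matching vault variable name
vault_var_for() {
  local slug
  # Replace every hyphen in the provisioner name with an underscore.
  slug="$(printf '%s' "$1" | tr '-' '_')"
  echo "vault_step_ca_prov_${slug}_password"
}

vault_var_for prod-services     # prints: vault_step_ca_prov_prod_services_password
vault_var_for service-accounts  # prints: vault_step_ca_prov_service_accounts_password
```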
To rotate one or more passwords:
# 1. Rotate the vault var(s). On any operator workstation:
ccat secrets rotate vault_step_ca_prov_prod_services_password --env production
ccat secrets rotate vault_step_ca_prov_staging_services_password --env production
ccat secrets rotate vault_step_ca_prov_service_accounts_password --env production
# 2. Run the explicit-tag-only Ansible task. Ansible reads the four
# vault vars (three JWK + Dex client secret) directly from the
# encrypted vault, materialises them into a 0400 root-owned tmpfile
# on input-b, sources from a `bash -c` invocation, then runs
# `provisioners-bootstrap.sh --rotate-jwk`. The tmpfile is removed
# in an `always:` block (incl. on failure); no_log: true keeps
# secrets out of Ansible logs.
make play-hsm-host T=hsm_host_rotate_jwk
# 3. Restart step-ca so the new ca.json is loaded:
ssh input-b 'ccat ca restart step-ca'
# 4. Re-issue any short-lived service-account SSH certs that were
# minted under the old password. Existing TLS certs keep
# validating until expiry; re-issuance happens at next renewal.
The T=hsm_host_rotate_jwk tag is a [never, hsm_host_rotate_jwk]
gate — default plays (make play-hsm-host without T=) skip the
rotation block entirely. The operator has to ask for it explicitly.
Earlier .env-round-trip runbooks for this rotation are obsolete and
removed; the Ansible-driven path above is the only supported one.
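For readers who want to see the shape of what the Ansible task does with the secrets, the sketch below mirrors the same idea in plain shell: write the variables to a private temporary file, source it in a child bash -c, run one command, and remove the file whether or not the command succeeded. It is an illustration only. with_secret_env and SECRET_FILE are hypothetical names, and the supported rotation path remains the make target above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the "0400 tmpfile, always cleaned up" pattern.
# The real rotation is driven by Ansible; this only mirrors the idea.

# with_secret_env CONTENT COMMAND...
#   CONTENT  lines of VAR=value
#   COMMAND  program and arguments to run with those variables exported
with_secret_env() {
  local content="$1"; shift
  local tmp rc=0
  tmp="$(mktemp)" || return 1          # mktemp creates the file owner-only
  printf '%s\n' "$content" > "$tmp"
  chmod 0400 "$tmp"                    # read-only for the owner from here on
  # set -a exports every variable the file defines to the command.
  SECRET_FILE="$tmp" bash -c 'set -a; . "$SECRET_FILE"; set +a; "$@"' _ "$@" || rc=$?
  rm -f "$tmp"                         # runs on success and on failure
  return "$rc"
}

with_secret_env 'DEMO_SECRET=abc' sh -c 'echo "$DEMO_SECRET"'   # prints: abc
```

Capturing the exit status with `|| rc=$?` instead of letting the function return early is what keeps the cleanup unconditional, which is the role the `always:` block plays on the Ansible side.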
Notes:
- Existing certificates remain valid until their natural expiry. The rotation only changes who can issue new certs going forward.
- The script is idempotent without --rotate-jwk (skips existing provisioners). With --rotate-jwk, it removes the three JWK entries first and re-adds them; it never touches OIDC, ACME, or SSHPOP.
- If you're rotating because a password was leaked, rotate and also revoke any certs that were already issued under the old password (CRL or intermediate bump). Rotation alone does not invalidate already-issued certificates.
Wiring Pattern A — rollout
Pattern A is the long-lived-cert-with-auto-renewal model used by
Jenkins and ccat_transfer. The conceptual description (what Pattern
A is, why it exists, what the role does) lives in
CCAT Certificate Authority — Architecture and Design § “Service-account SSH patterns”.
The operational rollout — staging first, then production — is here.
Rollout — staging
# 1. (One-time) populate the staging vault with the same provisioner
# password the production vault has. The CA on input-b is shared,
# so the password is the same.
ccat secrets show vault_step_ca_prov_service_accounts_password --env production --reveal | tail -1
ccat secrets set vault_step_ca_prov_service_accounts_password --env staging
# (paste the value above)
# 2. Apply the role (full play or scoped via tag).
make play-staging T=ssh_service_cert
# 3. Verify on each staging input node (input-{a,b,c}-staging):
ssh input-b.staging
sudo -u ccat_transfer step ssh inspect ~ccat_transfer/.ssh/ccat_id_ed25519-cert.pub
# Expected: Type: user certificate, principals: ccat_transfer, valid 24h.
systemctl status step-renew@ccat_transfer.timer
# Expected: active (waiting), Trigger: <next 6h boundary>.
# 4. Smoke-test: ccat_transfer can SSH between staging nodes using the cert.
sudo -u ccat_transfer ssh -i ~ccat_transfer/.ssh/ccat_id_ed25519 input-a.staging hostname
Rollout — production
After staging has been clean for a few days:
# Add the same vars file under group_vars/input_ccat/.
cp ansible/group_vars/input_staging/vars_ssh_service_cert.yml \
ansible/group_vars/input_ccat/vars_ssh_service_cert.yml
git add ansible/group_vars/input_ccat/vars_ssh_service_cert.yml
git commit -m "ssh_service_cert: enable ccat_transfer on production input nodes"
git push
# Apply.
make play-input-ccat T=ssh_service_cert
Troubleshooting provisioner setup
“Live API shows fewer provisioners than ca.json” (the split-brain
we hit during Phase 1). Root cause: enableAdmin: true in ca.json
puts step-ca into “remote management” mode, where the runtime uses
an internal BoltDB for provisioners and reads ca.json only at
first-ever boot when the DB is empty. Since the init path
auto-creates admin + sshpop, the DB is never empty, so subsequent
offline edits to ca.json (the mode step ca provisioner add uses
when it has filesystem access) are invisible to the running CA.
We intentionally do not enable remote management on CCAT’s
step-ca. The docker-compose.ca.yml omits DOCKER_STEPCA_INIT_REMOTE_MANAGEMENT
so that ca.json stays the single source of truth for provisioners
and offline-mode step ca provisioner add calls take effect on
restart.
If you somehow end up with enableAdmin: true in an existing ca.json
(e.g., legacy volume from before we fixed the compose file), flip
it back:
ccat ca down
docker run --rm -v ccat-ca_step-ca-data:/home/step busybox sh -c '
sed -i "s/\"enableAdmin\": true/\"enableAdmin\": false/" /home/step/config/ca.json
grep enableAdmin /home/step/config/ca.json
'
ccat ca up
Then step ca provisioner list should show everything that was in
ca.json.
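To confirm which mode a given ca.json is in before or after the flip, a simple text check is enough. The sketch below is a grep test, not a JSON parser, and admin_enabled is a hypothetical helper; run it against a copy of ca.json or from a container that mounts the volume.

```shell
#!/usr/bin/env bash
# Hypothetical helper: detect remote management in a ca.json file.

# admin_enabled FILE
#   Exit 0 if FILE contains "enableAdmin": true, tolerating any
#   whitespace around the colon; non-zero otherwise.
admin_enabled() {
  grep -Eq '"enableAdmin"[[:space:]]*:[[:space:]]*true' "$1"
}

# Usage:
#   if admin_enabled ./ca.json; then
#     echo "remote management is ON: offline edits will be ignored"
#   fi
```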
“error getting admin:” or HTTP 401 from the admin API — the
remote management layer is enabled (enableAdmin: true in ca.json)
and your step ca provisioner add call is not authenticating as an
admin. See the split-brain entry above: disabling remote
management is the right fix. If for some reason you need to keep
remote management on, the script’s --password-file /home/step/secrets/password
pattern should work because the auto-init admin provisioner is
created with STEP_CA_PASSWORD. If that still fails:
docker exec ccat-ca-step-ca-1 step ca admin list
This lists the current admins and their provisioner. If the auto-init
admin is not present (unusual), you can fall back to editing ca.json
directly:
# 1. Stop step-ca
ccat ca down
# 2. Copy ca.json out of the volume
docker run --rm -v ccat-ca_step-ca-data:/src -v "$PWD":/dst alpine \
cp /src/config/ca.json /dst/ca.json.backup
# 3. Edit ca.json.backup by hand: set "authority": { "enableAdmin": false }
# 4. Write it back:
docker run --rm -v ccat-ca_step-ca-data:/dst -v "$PWD":/src alpine \
cp /src/ca.json.backup /dst/config/ca.json
# 5. Start step-ca, re-run the provisioner script (which now edits
# ca.json directly without admin auth), then re-enable admin:
ccat ca up
./step-ca/provisioners-add.sh
# re-edit ca.json to flip enableAdmin back to true
ccat ca restart step-ca
This fallback is ugly but deterministic. Report back if you hit it so we can improve the script.
“OIDC configuration endpoint not reachable” — step-ca tries to
fetch https://auth.ccat.uni-koeln.de/.well-known/openid-configuration
on add. If Dex is down, or if the TLS cert is not trusted by the
step-ca container’s OS trust store, this fails. Check:
# From inside the step-ca container
docker exec ccat-ca-step-ca-1 wget -qO- \
https://auth.ccat.uni-koeln.de/.well-known/openid-configuration
Should return JSON with an issuer field equal to
https://auth.ccat.uni-koeln.de. If it returns a TLS error, the
step-ca image’s trust store doesn’t have Let’s Encrypt — unusual
but possible. If it returns 404, Dex isn’t actually running behind
the nginx-proxy vhost: check ccat ca status and ccat ca logs dex.
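Comparing the issuer by eye is error-prone (a trailing slash is enough to cause a mismatch), so the check can be scripted. The sketch below pulls the issuer value out of a discovery document with sed. issuer_of is a hypothetical helper, and it assumes the "issuer" key and its value sit on one line, which holds for compact JSON output but is not a general JSON parser.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the issuer from an OIDC discovery document.

# issuer_of: reads JSON on stdin, prints the first "issuer" value found.
issuer_of() {
  sed -n 's/.*"issuer"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Usage, reusing the container check above:
#   got="$(docker exec ccat-ca-step-ca-1 wget -qO- \
#     https://auth.ccat.uni-koeln.de/.well-known/openid-configuration | issuer_of)"
#   [ "$got" = "https://auth.ccat.uni-koeln.de" ] || echo "issuer mismatch: $got"
```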
“provisioner already exists” — the script should handle this, but
if you’re running step ca provisioner add manually without the
existence check, you hit this. Use step ca provisioner update
instead, or remove then add.
See also
CCAT CA — Provisioner set and reference tables — provisioner set table, lifetime-flag table.
CCAT Certificate Authority — Architecture and Design — Pattern A narrative, why these lifetimes.
CA day-to-day operations — issuing certs against existing provisioners.
CA rotation and disaster recovery — for “I need to invalidate already-issued certs”.