CA provisioner management#

This page is a how-to for adding, updating, removing, and rotating the CCAT step-ca provisioners. It also covers the operational rollout of Pattern A service-account certs via ssh_service_cert. For the provisioner set itself (six entries with their types and lifetimes), see the lookup table in CCAT CA — Provisioner set and reference tables. For the why behind the lifetime choices and the GitHub-team gate, see CCAT Certificate Authority — Architecture and Design.

When to run provisioners-add.sh#

The script is idempotent: it checks each provisioner against step ca provisioner list before adding it, and leaves existing entries alone. Run it:

  1. Once during Phase 1 commissioning, right after populating the Dex step-ca client secret in the vault (vault_dex_stepca_client_secret). See the Phase 1 checklist in step-ca/COMMISSIONING-TODO.md.

  2. Once during Phase 2 cutover, after the step-ca-data volume has been wiped and pre-populated with ceremony outputs. The new ca.json starts fresh with no provisioners — you re-run the script to restore the set.

  3. Any time you want to add a new provisioner. Edit the script to append a new add_provisioner block, commit, run. Existing ones are skipped; only the new one gets added.
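The skip-if-present behaviour can be pictured roughly as follows. This is a minimal sketch, not the script's actual code: `provisioner_exists` and `plan_provisioner` are hypothetical helpers, and the sketch assumes that `step ca provisioner list` prints a JSON array whose entries carry a `"name"` field.

```shell
# Hypothetical helpers (not taken from provisioners-add.sh).

# Reads provisioner-list JSON on stdin; succeeds if an entry is named "$1".
provisioner_exists() {
  grep -Eq "\"name\"[[:space:]]*:[[:space:]]*\"$1\""
}

# Prints whether provisioner "$1" would be added or skipped,
# given the current list JSON in "$2".
plan_provisioner() {
  name="$1"
  list_json="$2"
  if printf '%s' "$list_json" | provisioner_exists "$name"; then
    echo "skip $name"
  else
    echo "add $name"
  fi
}
```

On input-b it could be fed the live list, for example `plan_provisioner prod-services "$(docker exec ccat-ca-step-ca-1 step ca provisioner list)"`.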

How to run it#

# On input-b (or via ssh from a laptop if you prefer)
cd /opt/data-center/system-integration

# Recommended: use the `ccat ca provisioner sync` wrapper, which
# prompts for the client secret (hidden), reads it from your
# terminal, and runs the script with the right env for you.
ccat ca provisioner sync

# Or run the script directly:
DEX_STEPCA_CLIENT_SECRET="$(ccat secrets show vault_dex_stepca_client_secret --reveal 2>/dev/null | tail -1)" \
OIDC_ADMIN_EMAIL="you@uni-koeln.de" \
./step-ca/provisioners-add.sh

# Then apply the changes
ccat ca restart step-ca

The script:

  • Aborts cleanly if required env vars are missing (DEX_STEPCA_CLIENT_SECRET, OIDC_ADMIN_EMAIL).

  • Checks up front that the target container is running and that the password file is readable inside it.

  • Adds each provisioner via docker exec ... step ca provisioner add, reusing /home/step/secrets/password (which contains STEP_CA_PASSWORD) for both JWK encryption and admin API auth.

  • Prints a summary of the final provisioner list.
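The env-var check in the first bullet might look something like this. It is a sketch under assumptions: `require_env` is a hypothetical name, and the real script may implement the check differently.

```shell
# Hypothetical pre-flight helper; the real script's check may differ.
# Succeeds only if every named variable is set and non-empty.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is in $name (POSIX sh).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "error: $name is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A caller would typically bail out early with `require_env DEX_STEPCA_CLIENT_SECRET OIDC_ADMIN_EMAIL || exit 1`. Reporting every missing variable before returning saves a second round trip when both are unset.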

Updating lifetimes on existing provisioners#

The script does not modify existing provisioners. If you want to change a lifetime — say, loosen prod-services from 90d to 180d — use step ca provisioner update directly inside the container:

docker exec -it ccat-ca-step-ca-1 step ca provisioner update prod-services \
  --x509-default-dur 4320h \
  --x509-max-dur 4320h

ccat ca restart step-ca

For the full table of valid lifetime flags, see CCAT CA — Provisioner set and reference tables § “Lifetime flags”.

All durations are passed as Go time.Duration strings — h for hours, m for minutes. Don’t use d or w (not supported).
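Since lifetimes are usually discussed in days but must be passed in hours, a small converter avoids arithmetic slips. `days_to_hours` is a hypothetical helper, not part of any CCAT tooling, and it accepts whole days only.

```shell
# Hypothetical helper: convert whole days to a Go duration string in hours.
days_to_hours() {
  case "$1" in
    ''|*[!0-9]*)
      echo "usage: days_to_hours <whole-days>" >&2
      return 1
      ;;
  esac
  echo "$(( $1 * 24 ))h"
}

days_to_hours 90    # prints 2160h
days_to_hours 180   # prints 4320h
```

The second value is the 4320h used in the prod-services example above.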

After any provisioner update, always restart step-ca so it re-reads ca.json:

ccat ca restart step-ca

You can also update the script’s default values and re-commit, so that a future DR re-install gets the new defaults. But the live provisioners won’t change until you also run step ca provisioner update.

Removing a provisioner#

docker exec -it ccat-ca-step-ca-1 step ca provisioner remove <name>
ccat ca restart step-ca

Be careful: removing a provisioner does NOT revoke the certs it previously issued. Those keep validating until their own expiry. If you need to actually invalidate issued certs, rotate the intermediate or add the certs to the CRL. See CA rotation and disaster recovery for intermediate rotation.

Rotating JWK provisioner passwords#

Each of the three JWK provisioners (prod-services, staging-services, service-accounts) encrypts its private key inside ca.json with its own password. step-cli clients must supply the same password (--password-file) when issuing certificates through that provisioner. Anyone with the password can issue certs within the provisioner’s authorized cert types and lifetimes, so treat each password like a service credential: store it in the vault, and rotate it on a schedule or on suspicion of disclosure.

The passwords live in the application_env vault as:

  • vault_step_ca_prov_prod_services_password

  • vault_step_ca_prov_staging_services_password

  • vault_step_ca_prov_service_accounts_password

To rotate one or more passwords:

# 1. Rotate the vault var(s). On any operator workstation:
ccat secrets rotate vault_step_ca_prov_prod_services_password --env production
ccat secrets rotate vault_step_ca_prov_staging_services_password --env production
ccat secrets rotate vault_step_ca_prov_service_accounts_password --env production

# 2. Run the explicit-tag-only Ansible task. Ansible reads the four
#    vault vars (three JWK + Dex client secret) directly from the
#    encrypted vault, materialises them into a 0400 root-owned tmpfile
#    on input-b, sources that file inside a `bash -c` invocation, then
#    runs `provisioners-bootstrap.sh --rotate-jwk`. The tmpfile is
#    removed in an `always:` block (incl. on failure); no_log: true
#    keeps secrets out of Ansible logs.
make play-hsm-host T=hsm_host_rotate_jwk

# 3. Restart step-ca so the new ca.json is loaded:
ssh input-b 'ccat ca restart step-ca'

# 4. Re-issue any short-lived service-account SSH certs that were
#    minted under the old password. Existing TLS certs keep
#    validating until expiry; re-issuance happens at next renewal.

The T=hsm_host_rotate_jwk tag is a [never, hsm_host_rotate_jwk] gate — default plays (make play-hsm-host without T=) skip the rotation block entirely. The operator has to ask for it explicitly.

Earlier .env-round-trip runbooks for this rotation are obsolete and removed; the Ansible-driven path above is the only supported one.
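For readers who want a feel for what the tmpfile handling in step 2 amounts to, here is a shell analogue. It is only an illustration: the real logic lives in the Ansible task, and `run_with_secrets` is a hypothetical name. Secret values are sourced as shell assignments, so values with shell-special characters would need quoting.

```shell
# Shell analogue of the tmpfile handling (hypothetical; the real logic
# is an Ansible task).
# $1 = secrets as KEY=value lines; remaining args = command to run.
run_with_secrets() {
  secrets="$1"
  shift
  tmp="$(mktemp)" || return 1        # mktemp creates the file with mode 0600
  printf '%s\n' "$secrets" > "$tmp"
  chmod 0400 "$tmp"                  # owner read-only, like the Ansible tmpfile
  rc=0
  # Subshell: the sourced secrets are exported to the command only,
  # never to the calling shell.
  ( set -a; . "$tmp"; set +a; "$@" ) || rc=$?
  rm -f "$tmp"                       # reached whether the command succeeded or failed
  return "$rc"
}
```

The cleanup after the subshell plays the role of the `always:` block. Unlike Ansible, this sketch does not clean up if the shell itself is killed by a signal.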

Notes:

  • Existing certificates remain valid until their natural expiry. The rotation only changes who can issue new certs going forward.

  • Without --rotate-jwk the script is idempotent (it skips existing provisioners). With --rotate-jwk it first removes the three JWK entries and then re-adds them; it never touches the OIDC, ACME, or SSHPOP provisioners.

  • If you’re rotating because a password was leaked, rotate and also revoke any certs that were already issued under the old password (CRL or intermediate bump). Rotation alone does not invalidate already-issued certificates.

Wiring Pattern A — rollout#

Pattern A is the auto-renewed service-account cert model used by Jenkins and ccat_transfer: as the verification steps in this section show, the SSH cert is valid for 24h and is renewed by the step-renew timer. The conceptual description (what Pattern A is, why it exists, what the role does) lives in CCAT Certificate Authority — Architecture and Design § “Service-account SSH patterns”. The operational rollout, staging first and then production, is here.

Rollout — staging#

# 1. (One-time) populate the staging vault with the same provisioner
#    password the production vault has. The CA on input-b is shared,
#    so the password is the same.
ccat secrets show vault_step_ca_prov_service_accounts_password --env production --reveal | tail -1
ccat secrets set vault_step_ca_prov_service_accounts_password --env staging
# (paste the value above)

# 2. Apply the role (full play or scoped via tag).
make play-staging T=ssh_service_cert

# 3. Verify on each staging input node (input-{a,b,c}.staging):
ssh input-b.staging
sudo -u ccat_transfer step ssh inspect ~ccat_transfer/.ssh/ccat_id_ed25519-cert.pub
# Expected: Type: user certificate, principals: ccat_transfer, valid 24h.

systemctl status step-renew@ccat_transfer.timer
# Expected: active (waiting), Trigger: <next 6h boundary>.

# 4. Smoke-test: ccat_transfer can SSH between staging nodes using the cert.
sudo -u ccat_transfer ssh -i ~ccat_transfer/.ssh/ccat_id_ed25519 input-a.staging hostname
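Because step 3 has to be repeated on every staging input node, a loop along these lines may save some typing. It is a sketch: `check_renew_timer` is a hypothetical helper, the host names are passed as arguments, and it assumes you can ssh to each node and run `systemctl is-active` there.

```shell
# Hypothetical helper: check the renewal timer on each host given as an argument.
# Returns non-zero if the timer is inactive (or unreachable) on any host.
check_renew_timer() {
  failed=0
  for node in "$@"; do
    if ssh "$node" 'systemctl is-active --quiet step-renew@ccat_transfer.timer'; then
      echo "$node: timer active"
    else
      echo "$node: timer NOT active"
      failed=1
    fi
  done
  return "$failed"
}
```

Call it with the node names used in your ssh config, for example `check_renew_timer input-a.staging input-b.staging`. It only covers the timer; the `step ssh inspect` check in step 3 still needs to be run per node.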

Rollout — production#

After staging has been clean for a few days:

# Add the same vars file under group_vars/input_ccat/.
cp ansible/group_vars/input_staging/vars_ssh_service_cert.yml \
   ansible/group_vars/input_ccat/vars_ssh_service_cert.yml
git add ansible/group_vars/input_ccat/vars_ssh_service_cert.yml
git commit -m "ssh_service_cert: enable ccat_transfer on production input nodes"
git push

# Apply.
make play-input-ccat T=ssh_service_cert

Troubleshooting: x509: certificate signed by unknown authority#

Symptom: after step ca bootstrap succeeds, any subsequent command that talks to the CA (step ssh login, step ca provisioner list, etc.) fails with:

client GET https://ca.ccat.uni-koeln.de/... failed:
tls: failed to verify certificate: x509: certificate signed
by unknown authority

This means the proxy at :443 is presenting something that doesn’t chain to the CCAT root in /home/<user>/.step/certs/root_ca.crt. There are two cases.

Case A — uni-Köln IP client. The vhost cert is wrong. Check what the proxy is actually serving:

echo | openssl s_client -connect ca.ccat.uni-koeln.de:443 \
    -servername ca.ccat.uni-koeln.de 2>/dev/null \
    | openssl x509 -noout -issuer
# Expect: issuer=O=CCAT Observatory, CN=CCAT Intermediate CA

If the issuer is Let's Encrypt (or any non-CCAT issuer), the CCAT-rooted vhost cert was overwritten or the timer fell behind. Force a renewal or re-issue per the runbook in CA rotation and disaster recovery § “Vhost cert routine rotation” and § “Vhost cert emergency re-issue”, then ccat proxy restart.
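If you want to script the Case A check (for monitoring, say), the issuer line can be classified as below. This is a sketch: `check_issuer` is a hypothetical helper, and it assumes the expected issuer CN is exactly the one shown in the openssl example above. The pattern tolerates both `CN=…` and `CN = …`, since openssl versions format the line differently.

```shell
# Hypothetical helper: classify the `openssl x509 -noout -issuer` line given as $1.
check_issuer() {
  if printf '%s\n' "$1" | grep -Eq 'CN ?= ?CCAT Intermediate CA'; then
    echo "ok: CCAT-issued vhost cert"
  else
    echo "mismatch: vhost cert is not CCAT-issued"
    return 1
  fi
}
```

Pass it the output of the openssl pipeline shown earlier, for example `check_issuer "$(echo | openssl s_client … | openssl x509 -noout -issuer)"` with the same connection arguments as above.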

Case B — off-uni client. The proxy IP allowlist (proxy/data/vhost.d/ca.ccat.uni-koeln.de) rejects your request. The 403 Forbidden response is served with the proxy’s default LE cert, which doesn’t chain to the CCAT root, so the client aborts with the TLS error before it ever sees the 403. Either onboard your CIDR (see Adding a partner subnet in CA day-to-day operations) or tunnel through hera (see Bootstrapping from off-network in Client setup — SSH with step-ca certificates). Client-side trust-bundle workarounds are not supported and won’t help: even if the client trusted the LE cert, the request would still end in an HTTP 403.

Troubleshooting provisioner setup#

“Live API shows fewer provisioners than ca.json” (the split-brain we hit during Phase 1). Root cause: enableAdmin: true in ca.json puts step-ca into “remote management” mode, where the runtime uses an internal BoltDB for provisioners and reads ca.json only at first-ever boot when the DB is empty. Since the init path auto-creates admin + sshpop, the DB is never empty, so subsequent offline edits to ca.json (the mode step ca provisioner add uses when it has filesystem access) are invisible to the running CA.

We intentionally do not enable remote management on CCAT’s step-ca. The docker-compose.ca.yml omits DOCKER_STEPCA_INIT_REMOTE_MANAGEMENT so that ca.json stays the single source of truth for provisioners and offline-mode step ca provisioner add calls take effect on restart.

If you somehow end up with enableAdmin: true in an existing ca.json (e.g., legacy volume from before we fixed the compose file), flip it back:

ccat ca down
docker run --rm -v ccat-ca_step-ca-data:/home/step busybox sh -c '
  sed -i "s/\"enableAdmin\": true/\"enableAdmin\": false/" /home/step/config/ca.json
  grep enableAdmin /home/step/config/ca.json
'
ccat ca up

Then step ca provisioner list should show everything that was in ca.json.
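To confirm which mode a given ca.json is in before and after the flip, a check like the following may help. `admin_mode` is a hypothetical helper; it treats a missing `enableAdmin` key as disabled, which is an assumption of this sketch.

```shell
# Hypothetical helper: report the remote-management state of the ca.json at path $1.
# A missing enableAdmin key is reported as "disabled" (assumption of this sketch).
admin_mode() {
  if grep -Eq '"enableAdmin"[[:space:]]*:[[:space:]]*true' "$1"; then
    echo "enabled"
  else
    echo "disabled"
  fi
}
```

Note that the sed one-liner above only matches when exactly one space follows the colon, whereas this check tolerates other spacing. If `admin_mode` still reports `enabled` after the sed run, the file's formatting probably differs and needs a manual edit.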

“error getting admin:” or HTTP 401 from the admin API — the remote management layer is enabled (enableAdmin: true in ca.json) and your step ca provisioner add call is not authenticating as an admin. See the split-brain troubleshoot above — disabling remote management is the right fix. If for some reason you need to keep remote management on, the script’s --password-file /home/step/secrets/password pattern should work because the auto-init admin provisioner is created with STEP_CA_PASSWORD. If that still fails:

docker exec ccat-ca-step-ca-1 step ca admin list

This lists the current admins and their provisioner. If the auto-init admin is not present (unusual), you can fall back to editing ca.json directly:

# 1. Stop step-ca
ccat ca down

# 2. Copy ca.json out of the volume twice: an untouched backup and a
#    working copy to edit
docker run --rm -v ccat-ca_step-ca-data:/src -v "$PWD":/dst alpine \
  sh -c 'cp /src/config/ca.json /dst/ca.json.backup &&
         cp /src/config/ca.json /dst/ca.json.edit'

# 3. Edit ca.json.edit by hand: set "authority": { "enableAdmin": false }
# 4. Write it back:
docker run --rm -v ccat-ca_step-ca-data:/dst -v "$PWD":/src alpine \
  cp /src/ca.json.edit /dst/config/ca.json

# 5. Start step-ca, re-run the provisioner script (which now edits
#    ca.json directly without admin auth), then re-enable admin:
ccat ca up
./step-ca/provisioners-add.sh
# re-edit ca.json to flip enableAdmin back to true
ccat ca restart step-ca

This fallback is ugly but deterministic. Report back if you hit it so we can improve the script.

“OIDC configuration endpoint not reachable” — step-ca tries to fetch https://auth.ccat.uni-koeln.de/.well-known/openid-configuration on add. If Dex is down, or if the TLS cert is not trusted by the step-ca container’s OS trust store, this fails. Check:

# From inside the step-ca container
docker exec ccat-ca-step-ca-1 wget -qO- \
  https://auth.ccat.uni-koeln.de/.well-known/openid-configuration

Should return JSON with an issuer field equal to https://auth.ccat.uni-koeln.de. If it returns a TLS error, the step-ca image’s trust store doesn’t have Let’s Encrypt — unusual but possible. If it returns 404, Dex isn’t actually running behind the nginx-proxy vhost: check ccat ca status and ccat ca logs dex.
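The issuer comparison can be automated rather than eyeballed. The following is a sketch: `issuer_matches` is a hypothetical helper that reads the discovery document on stdin, strips whitespace, and looks for an exact `"issuer"` value.

```shell
# Hypothetical helper: read an OIDC discovery document on stdin and
# succeed only if its "issuer" value is exactly "$1".
issuer_matches() {
  # Remove all whitespace so the match does not depend on JSON formatting,
  # then look for the literal key/value pair.
  tr -d '[:space:]' | grep -Fq "\"issuer\":\"$1\""
}
```

For example, pipe the wget output from the check above into `issuer_matches https://auth.ccat.uni-koeln.de`. A non-zero exit status means the document is missing, malformed, or advertises a different issuer (a trailing path or slash counts as different).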

“provisioner already exists” — the script should handle this, but if you’re running step ca provisioner add manually without the existence check, you hit this. Use step ca provisioner update instead, or remove then add.

See also#