# CA provisioner management

This page is a **how-to** for adding, updating, removing, and rotating the CCAT step-ca provisioners. It also covers the operational rollout of Pattern A service-account certs via `ssh_service_cert`.

For the provisioner set itself (six entries with their types and lifetimes), see the lookup table in {doc}`background/ca-provisioner-set`. For the *why* behind the lifetime choices and the GitHub-team gate, see {doc}`background/ca-architecture`.

```{contents}
:local:
:depth: 2
```

## When to run `provisioners-add.sh`

The script is **idempotent by skip**: each provisioner is checked against `step ca provisioner list` before being added, and existing entries are left alone. You run it:

1. **Once during Phase 1 commissioning**, right after populating the Dex step-ca client secret in the vault (`vault_dex_stepca_client_secret`). See the Phase 1 checklist in `step-ca/COMMISSIONING-TODO.md`.
2. **Once during Phase 2 cutover**, after the `step-ca-data` volume has been wiped and pre-populated with ceremony outputs. The new `ca.json` starts fresh with no provisioners — you re-run the script to restore the set.
3. **Any time you want to add a new provisioner.** Edit the script to append a new `add_provisioner` block, commit, run. Existing ones are skipped; only the new one gets added.

## How to run it

```bash
# On input-b (or via ssh from a laptop if you prefer)
cd /opt/data-center/system-integration

# Recommended: use the `ccat ca provisioner sync` wrapper, which
# prompts for the client secret (hidden), reads it from your
# terminal, and runs the script with the right env for you.
ccat ca provisioner sync

# Or run the script directly:
DEX_STEPCA_CLIENT_SECRET="$(ccat secrets show vault_dex_stepca_client_secret --reveal 2>/dev/null | tail -1)" \
OIDC_ADMIN_EMAIL="you@uni-koeln.de" \
  ./step-ca/provisioners-add.sh

# Then apply the changes
ccat ca restart step-ca
```

The script:

- Aborts cleanly if required env vars are missing (`DEX_STEPCA_CLIENT_SECRET`, `OIDC_ADMIN_EMAIL`).
- Pre-flight checks that the target container is running and that the password file is readable inside it.
- Adds each provisioner via `docker exec ... step ca provisioner add`, reusing `/home/step/secrets/password` (which contains `STEP_CA_PASSWORD`) for both JWK encryption and admin API auth.
- Prints a summary of the final provisioner list.

## Updating lifetimes on existing provisioners

The script does **not** modify existing provisioners. If you want to change a lifetime — say, loosen `prod-services` from 90d to 180d — use `step ca provisioner update` directly inside the container:

```bash
docker exec -it ccat-ca-step-ca-1 step ca provisioner update prod-services \
  --x509-default-dur 4320h \
  --x509-max-dur 4320h
ccat ca restart step-ca
```

For the full table of valid lifetime flags, see {doc}`background/ca-provisioner-set` § "Lifetime flags". All durations are passed as Go `time.Duration` strings — `h` for hours, `m` for minutes. Don't use `d` or `w` (not supported).

**After any provisioner update, always restart step-ca** so it re-reads `ca.json`:

```bash
ccat ca restart step-ca
```

You can also update the script's default values and re-commit, so that a future DR re-install gets the new defaults. But the live provisioners won't change until you also run `step ca provisioner update`.

## Removing a provisioner

```bash
docker exec -it ccat-ca-step-ca-1 step ca provisioner remove <name>
ccat ca restart step-ca
```

Be careful: removing a provisioner does NOT revoke the certs it previously issued. Those keep validating until their own expiry.
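To see how long an already-issued cert will keep validating, read its expiry directly. A minimal sketch, using a throwaway self-signed cert so it runs anywhere (in practice, point `openssl` at the real issued cert file):

```bash
# Sketch only: generate a throwaway cert standing in for a cert issued
# by a since-removed provisioner; /tmp paths are illustrative.
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/demo.key \
  -out /tmp/demo.crt -days 90 -subj "/CN=demo" 2>/dev/null
# Print the expiry; the cert validates until this timestamp regardless
# of whether its provisioner still exists.
openssl x509 -in /tmp/demo.crt -noout -enddate
```

The same `-noout -enddate` invocation works on any PEM cert, including the vhost cert the proxy serves.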
If you need to actually invalidate issued certs, bump the intermediate or add them to the CRL. See {doc}`ca-rotation-and-recovery` for intermediate rotation.

## Rotating JWK provisioner passwords

The three JWK provisioners — `prod-services`, `staging-services`, `service-accounts` — each encrypt their private key inside `ca.json` with its own password. The password is also what step-cli clients must supply (`--password-file`) when issuing certificates through that provisioner. Anyone with the password can issue certs against the provisioner's authorized cert types and lifetimes. **Treat each password like a service credential** — vault-stored, rotated on a schedule or on suspicion of disclosure.

The passwords live in the application_env vault as:

- `vault_step_ca_prov_prod_services_password`
- `vault_step_ca_prov_staging_services_password`
- `vault_step_ca_prov_service_accounts_password`

To rotate one or more passwords:

```bash
# 1. Rotate the vault var(s). On any operator workstation:
ccat secrets rotate vault_step_ca_prov_prod_services_password --env production
ccat secrets rotate vault_step_ca_prov_staging_services_password --env production
ccat secrets rotate vault_step_ca_prov_service_accounts_password --env production

# 2. Run the explicit-tag-only Ansible task. Ansible reads the four
#    vault vars (three JWK + Dex client secret) directly from the
#    encrypted vault, materialises them into a 0400 root-owned tmpfile
#    on input-b, sources it from a `bash -c` invocation, then runs
#    `provisioners-bootstrap.sh --rotate-jwk`. The tmpfile is removed
#    in an `always:` block (incl. on failure); `no_log: true` keeps
#    secrets out of Ansible logs.
make play-hsm-host T=hsm_host_rotate_jwk

# 3. Restart step-ca so the new ca.json is loaded:
ssh input-b 'ccat ca restart step-ca'

# 4. Re-issue any short-lived service-account SSH certs that were
#    minted under the old password. Existing TLS certs keep
#    validating until expiry; re-issuance happens at next renewal.
```

The `T=hsm_host_rotate_jwk` tag is a `[never, hsm_host_rotate_jwk]` gate — default plays (`make play-hsm-host` without `T=`) skip the rotation block entirely. The operator has to ask for it explicitly. Earlier `.env`-round-trip runbooks for this rotation are obsolete and removed; the Ansible-driven path above is the only supported one.

Notes:

- Existing certificates remain valid until their natural expiry. The rotation only changes who can issue *new* certs going forward.
- The script is idempotent without `--rotate-jwk` (skips existing provisioners). With `--rotate-jwk`, it removes the three JWK entries first and re-adds them — it never touches OIDC, ACME, or SSHPOP.
- If you're rotating because a password was leaked, rotate **and** also revoke any certs that were already issued under the old password (CRL or intermediate bump). Rotation alone does not invalidate already-issued certificates.

## Wiring Pattern A — rollout

Pattern A is the long-lived-cert-with-auto-renewal model used by Jenkins and `ccat_transfer`. The conceptual description (what Pattern A is, why it exists, what the role does) lives in {doc}`background/ca-architecture` § "Service-account SSH patterns". The operational rollout — staging first, then production — is here.

### Rollout — staging

```bash
# 1. (One-time) populate the staging vault with the same provisioner
#    password the production vault has. The CA on input-b is shared,
#    so the password is the same.
ccat secrets show vault_step_ca_prov_service_accounts_password --env production --reveal | tail -1
ccat secrets set vault_step_ca_prov_service_accounts_password --env staging
# (paste the value above)

# 2. Apply the role (full play or scoped via tag).
make play-staging T=ssh_service_cert

# 3. Verify on each staging input node (input-{a,b,c}-staging):
ssh input-b.staging
sudo -u ccat_transfer step ssh inspect ~ccat_transfer/.ssh/ccat_id_ed25519-cert.pub
# Expected: Type: user certificate, principals: ccat_transfer, valid 24h.
systemctl status step-renew@ccat_transfer.timer
# Expected: active (waiting), with a future Trigger time.

# 4. Smoke-test: ccat_transfer can SSH between staging nodes using the cert.
sudo -u ccat_transfer ssh -i ~ccat_transfer/.ssh/ccat_id_ed25519 input-a.staging hostname
```

### Rollout — production

After staging has been clean for a few days:

```bash
# Add the same vars file under group_vars/input_ccat/.
cp ansible/group_vars/input_staging/vars_ssh_service_cert.yml \
   ansible/group_vars/input_ccat/vars_ssh_service_cert.yml
git add ansible/group_vars/input_ccat/vars_ssh_service_cert.yml
git commit -m "ssh_service_cert: enable ccat_transfer on production input nodes"
git push

# Apply.
make play-input-ccat T=ssh_service_cert
```

## Troubleshooting: `x509: certificate signed by unknown authority`

Symptom: after `step ca bootstrap` succeeds, any subsequent command that talks to the CA (`step ssh login`, `step ca provisioner list`, etc.) fails with:

```
client GET https://ca.ccat.uni-koeln.de/... failed: tls: failed to verify certificate: x509: certificate signed by unknown authority
```

This means the proxy at `:443` is presenting something that doesn't chain to the CCAT root in `~/.step/certs/root_ca.crt`. There are two cases.

**Case A — uni-Köln IP client.** The vhost cert is wrong. Check what the proxy is actually serving:

```bash
echo | openssl s_client -connect ca.ccat.uni-koeln.de:443 \
  -servername ca.ccat.uni-koeln.de 2>/dev/null \
  | openssl x509 -noout -issuer
# Expect: issuer=O=CCAT Observatory, CN=CCAT Intermediate CA
```

If the issuer is `Let's Encrypt` (or any non-CCAT issuer), the CCAT-rooted vhost cert was overwritten or the timer fell behind. Force a renewal or re-issue per the runbook in {doc}`ca-rotation-and-recovery` § "Vhost cert routine rotation" and § "Vhost cert emergency re-issue", then `ccat proxy restart`.
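The issuer string alone only tells you who the leaf claims to be signed by; the stronger check is `openssl verify` against the local root bundle. A self-contained sketch with a throwaway root and leaf (in practice, substitute `~/.step/certs/root_ca.crt` and the cert fetched from the proxy):

```bash
# Throwaway root + leaf so the verify step runs anywhere; the real
# check is: openssl verify -CAfile ~/.step/certs/root_ca.crt <served-cert>
tmp=$(mktemp -d) && cd "$tmp"
openssl req -x509 -newkey rsa:2048 -nodes -keyout ca.key -out ca.crt \
  -days 1 -subj "/CN=Demo Root" 2>/dev/null
openssl req -newkey rsa:2048 -nodes -keyout leaf.key -out leaf.csr \
  -subj "/CN=leaf" 2>/dev/null
openssl x509 -req -in leaf.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
  -days 1 -out leaf.crt 2>/dev/null
openssl verify -CAfile ca.crt leaf.crt   # prints: leaf.crt: OK
```

A non-CCAT chain fails this verify with `unable to get local issuer certificate`, which is exactly what the step-cli client is reporting.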
**Case B — off-uni client.** You're stopped at the proxy IP allowlist ({file}`proxy/data/vhost.d/ca.ccat.uni-koeln.de`) before your request ever reaches step-ca — what you actually see is a `403 Forbidden` page served with the proxy's default LE cert, which doesn't chain to the CCAT root. Either onboard your CIDR (see *Adding a partner subnet* in {doc}`ca-day-to-day`) or tunnel through hera (see *Bootstrapping from off-network* in {doc}`ca-client-onboarding`). Client-side trust-bundle workarounds are not supported and won't help — the TLS error is a symptom of the HTTP 403, not a genuine chain-verification failure on the CCAT path.

## Troubleshooting provisioner setup

**"Live API shows fewer provisioners than ca.json"** (the split-brain we hit during Phase 1). Root cause: `enableAdmin: true` in `ca.json` puts step-ca into "remote management" mode, where the runtime uses an internal BoltDB for provisioners and reads `ca.json` only at first-ever boot when the DB is empty. Since the init path auto-creates `admin` + `sshpop`, the DB is never empty, so subsequent offline edits to `ca.json` (the mode `step ca provisioner add` uses when it has filesystem access) are invisible to the running CA.

We intentionally **do not enable remote management** on CCAT's step-ca. The `docker-compose.ca.yml` omits `DOCKER_STEPCA_INIT_REMOTE_MANAGEMENT` so that `ca.json` stays the single source of truth for provisioners and offline-mode `step ca provisioner add` calls take effect on restart.

If you somehow end up with `enableAdmin: true` in an existing `ca.json` (e.g., a legacy volume from before we fixed the compose file), flip it back:

```bash
ccat ca down
docker run --rm -v ccat-ca_step-ca-data:/home/step busybox sh -c '
  sed -i "s/\"enableAdmin\": true/\"enableAdmin\": false/" /home/step/config/ca.json
  grep enableAdmin /home/step/config/ca.json
'
ccat ca up
```

Then `step ca provisioner list` should show everything that was in `ca.json`.
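The `sed` one-liner depends on `ca.json`'s exact whitespace (`"enableAdmin": true` with a single space). If the file is formatted differently, a JSON-aware edit is safer. A hedged sketch using `python3` on a throwaway file (the path and minimal structure are illustrative; adapt to the real volume copy):

```bash
# Throwaway stand-in for ca.json; step-ca nests enableAdmin under
# "authority", which is all this sketch assumes about the schema.
cat > /tmp/ca-demo.json <<'EOF'
{ "authority": { "enableAdmin": true, "provisioners": [] } }
EOF
python3 - <<'EOF'
import json
# Flip the flag structurally instead of by string match, so any
# whitespace or key ordering in the file is irrelevant.
path = "/tmp/ca-demo.json"
with open(path) as f:
    cfg = json.load(f)
cfg["authority"]["enableAdmin"] = False
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
EOF
grep '"enableAdmin": false' /tmp/ca-demo.json
```

The trade-off: `busybox` in the fix above has no `python3`, so this variant needs the file copied out of the volume first (as in the fallback procedure below).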
**"error getting admin:" or HTTP 401 from the admin API** — the remote management layer is enabled (`enableAdmin: true` in `ca.json`) and your `step ca provisioner add` call is not authenticating as an admin. See the split-brain troubleshoot above — disabling remote management is the right fix. If for some reason you need to keep remote management on, the script's `--password-file /home/step/secrets/password` pattern should work because the auto-init admin provisioner is created with `STEP_CA_PASSWORD`. If that still fails:

```bash
docker exec ccat-ca-step-ca-1 step ca admin list
```

This lists the current admins and their provisioner. If the auto-init admin is not present (unusual), you can fall back to editing `ca.json` directly:

```bash
# 1. Stop step-ca
ccat ca down

# 2. Copy ca.json out of the volume
docker run --rm -v ccat-ca_step-ca-data:/src -v "$PWD":/dst alpine \
  cp /src/config/ca.json /dst/ca.json.backup

# 3. Edit ca.json.backup by hand: set "authority": { "enableAdmin": false }

# 4. Write it back:
docker run --rm -v ccat-ca_step-ca-data:/dst -v "$PWD":/src alpine \
  cp /src/ca.json.backup /dst/config/ca.json

# 5. Start step-ca, re-run the provisioner script (which now edits
#    ca.json directly without admin auth), then re-enable admin:
ccat ca up
./step-ca/provisioners-add.sh
# re-edit ca.json to flip enableAdmin back to true
ccat ca restart step-ca
```

This fallback is ugly but deterministic. Report back if you hit it so we can improve the script.

**"OIDC configuration endpoint not reachable"** — step-ca tries to fetch `https://auth.ccat.uni-koeln.de/.well-known/openid-configuration` on add. If Dex is down, or if the TLS cert is not trusted by the step-ca container's OS trust store, this fails. Check:

```bash
# From inside the step-ca container
docker exec ccat-ca-step-ca-1 wget -qO- \
  https://auth.ccat.uni-koeln.de/.well-known/openid-configuration
```

This should return JSON with an `issuer` field equal to `https://auth.ccat.uni-koeln.de`.
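If you want to script that check rather than eyeball the JSON, extract and compare the `issuer` field. A sketch on a canned discovery document (in practice, pipe in the `wget` output from the container instead):

```bash
EXPECTED="https://auth.ccat.uni-koeln.de"
# Canned sample standing in for the real discovery document.
cat > /tmp/oidc.json <<'EOF'
{"issuer": "https://auth.ccat.uni-koeln.de"}
EOF
# Pull out the issuer value with sed (no jq dependency assumed).
ISSUER=$(sed -n 's/.*"issuer": *"\([^"]*\)".*/\1/p' /tmp/oidc.json)
if [ "$ISSUER" = "$EXPECTED" ]; then
  echo "issuer OK"
else
  echo "issuer mismatch: $ISSUER"
fi
```

An issuer mismatch (even a trailing slash) makes step-ca reject the provisioner, so comparing for exact equality is the right test.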
If it returns a TLS error, the step-ca image's trust store doesn't have Let's Encrypt — unusual but possible. If it returns 404, Dex isn't actually running behind the nginx-proxy vhost: check `ccat ca status` and `ccat ca logs dex`.

**"provisioner already exists"** — the script should handle this, but if you run `step ca provisioner add` manually without the existence check, you'll hit this. Use `step ca provisioner update` instead, or `remove` then `add`.

## See also

- {doc}`background/ca-provisioner-set` — provisioner set table, lifetime-flag table.
- {doc}`background/ca-architecture` — Pattern A narrative, why these lifetimes.
- {doc}`ca-day-to-day` — issuing certs against existing provisioners.
- {doc}`ca-rotation-and-recovery` — for "I need to *invalidate* already-issued certs".