# CA day-to-day operations

This page is a **how-to** for the routine operational tasks against the CCAT step-ca: bringing the stack up/down/restarting, issuing new certs on demand, monitoring expiry, and backing up the relevant volumes. It is the working runbook for anyone on rotation.

For the design rationale behind any of this (why these volumes, why these lifetimes), see {doc}`background/ca-architecture`. For the lookup tables (provisioner set, lifetime flags), see {doc}`background/ca-provisioner-set`.

```{contents}
:local:
:depth: 2
```

## Bringing the stack up, down, or restarting it

Everything goes through the `ccat ca` CLI, which wraps `docker compose -f docker-compose.ca.yml`:

```bash
ccat ca status            # show container status
ccat ca logs              # tail all services
ccat ca logs step-ca      # tail a specific service
ccat ca restart step-ca   # restart without image pull
ccat ca update            # git pull → image pull → up -d
ccat ca down              # stop, preserve volumes (always)
```

`ccat ca down` **never** passes `-v`. This is deliberate. The `step-ca-data` volume is irreplaceable in Phase 2+ — losing it means re-doing the root ceremony and re-bootstrapping every client. The `dex-data` volume is safe to wipe in principle (Dex regenerates signing keys on startup), but a fresh JWKS briefly invalidates step-ca's cached discovery until it refetches, so there is no reason to do it during a normal restart. If you genuinely need to wipe the CA, do it by hand with `docker volume rm` — and think three times.

## Issuing new certs

Humans use the `step` CLI after `step ca bootstrap`:

```bash
# SSH user cert (opens browser for GitHub OAuth)
step ssh login

# x509 cert for a service (JWK provisioner)
step ca certificate service.local service.crt service.key \
  --provisioner prod-services

# ACME cert (automatic, for internal services)
step ca certificate service.local service.crt service.key \
  --acme
```

For the full provisioner set (which one to use when), see {doc}`background/ca-provisioner-set`.
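After issuing, it can be worth sanity-checking the validity window the provisioner actually granted. A minimal sketch using plain openssl — here a throwaway self-signed cert stands in for `service.crt` so the commands run anywhere; on a real host you would point them at the issued file (or use `step certificate inspect --short`):

```shell
# Throwaway self-signed cert as a stand-in for service.crt.
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/demo.key \
  -out /tmp/demo.crt -days 1 -subj "/CN=service.local" 2>/dev/null

# Subject and validity window — the fields you would eyeball after issuing.
openssl x509 -in /tmp/demo.crt -noout -subject -dates

# Machine-checkable: does the cert survive the next 12 hours (43200 s)?
openssl x509 -in /tmp/demo.crt -noout -checkend 43200 && echo "ok for 12h"
```

`-checkend` exits non-zero when the cert expires within the given window, which makes it easy to script.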
Operator client setup (so `step ssh login` works on a fresh laptop) is in {doc}`ca-client-onboarding`.

## Vhost cert lifecycle

The `ca.ccat.uni-koeln.de` vhost is served by nginx-proxy with a **CCAT-rooted** cert (not Let's Encrypt — `acme-companion` is opted out for this vhost). The cert renews itself on a systemd timer; you should not normally have to touch it. See {doc}`background/ca-architecture` for *why* the certs are split per vhost.

**What's installed where (input-b):**

- `/opt/proxy/certs/ca.ccat.uni-koeln.de.crt` — full chain (mode 0644)
- `/opt/proxy/certs/ca.ccat.uni-koeln.de.key` — private key (mode 0600)
- `/opt/data-center/system-integration/step-ca/renew-vhost-cert.sh` — renewal helper invoked by the timer
- `step-ca-vhost-renew.timer` / `.service` — systemd units

**Timer behavior:**

- Schedule: `OnBootSec=15min`, `OnUnitActiveSec=12h`, persistent. Fires at every boot (after 15 min) and every 12 h thereafter.
- Each fire calls `step ca renew`, which short-circuits unless the cert is within ~1/3 of its lifetime (default step-cli behavior). Most fires do nothing.
- nginx-proxy is reloaded **only when the cert mtime actually changed** — the helper compares the cert's mtime before and after the renew call and skips the reload on a no-op renewal.

**Inspection commands (on input-b):**

```bash
# Timer status, next/last fire times
systemctl list-timers step-ca-vhost-renew*

# Last service run output (look here if the cert is overdue)
journalctl -u step-ca-vhost-renew.service -n 50

# Cert validity window and SANs
step certificate inspect /opt/proxy/certs/ca.ccat.uni-koeln.de.crt \
  --short

# What the proxy is actually serving on the wire
echo | openssl s_client -connect ca.ccat.uni-koeln.de:443 \
  -servername ca.ccat.uni-koeln.de 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates
```

The unhappy paths — manual force-renew, emergency re-issue — live in {doc}`ca-rotation-and-recovery`.
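The mtime-gated reload described above can be sketched as plain shell. This is a hypothetical reconstruction of the helper's logic, not the production script: a temp file plays the cert, `touch` plays `step ca renew`, and setting `action` plays the nginx reload, so the gate itself is runnable anywhere:

```shell
CRT=$(mktemp)
touch -d '2024-01-01 00:00' "$CRT"

pre=$(stat -c %Y "$CRT")
touch -d '2024-01-01 00:01' "$CRT"   # a real renewal rewrites the file, so mtime moves

post=$(stat -c %Y "$CRT")
if [ "$post" -ne "$pre" ]; then
  action="reload"   # real helper would run: docker exec nginx-proxy nginx -s reload
else
  action="skip"     # no-op renewal → leave nginx alone
fi
echo "$action"
rm -f "$CRT"
```

The point of the gate is that most timer fires hit the "skip" branch: `step ca renew` returns without touching the file, and nginx never sees an unnecessary reload.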
## Adding a partner subnet

Access to `ca.ccat.uni-koeln.de` is gated by an in-repo IP allowlist at the proxy ({file}`proxy/data/vhost.d/ca.ccat.uni-koeln.de`), defaulting to the Uni Köln `/16` and `deny all` otherwise. Adding a new partner subnet is a four-step rollout:

1. **Edit the allowlist** on a feature branch:

   ```nginx
   # proxy/data/vhost.d/ca.ccat.uni-koeln.de
   allow 134.95.0.0/16;     # Uni Köln main
   allow 198.51.100.0/24;   # NEW partner — describe in commit
   deny all;
   ```

2. **Land the change** through the normal PR flow, then on input-b:

   ```bash
   ssh input-b
   cd /opt/data-center/system-integration && git pull
   docker exec nginx-proxy nginx -s reload
   ```

3. **Send the partner the bootstrap command.** They follow {doc}`ca-client-onboarding`. The canonical fingerprint is on that page, so the partner is always reading the current value.

4. **Confirm.** Ask the partner to run `step ssh login` and report back. If they still get a `403 Forbidden`, the reload didn't pick up the new CIDR — re-check step 2.

```{tip}
Keep one allowlist line per partner, with a short identifying comment on the same line. The file is the source of truth for "who is allowed to bootstrap CCAT trust", and a short comment per CIDR is the only audit trail it carries.
```

## Monitoring cert expiry

step-ca's internal database tracks issued certs. For CCAT operational visibility, expiry should be surfaced in Grafana via InfluxDB. The pattern (to be implemented in Phase 2):

- A systemd timer on each managed host periodically runs `step certificate inspect --format json` on each local cert and pushes a `cert_expiry_days` metric to InfluxDB.
- Grafana alerts on `cert_expiry_days < 7` for any service.

## Backup

Two Docker volumes on input-b (plus the `/opt/proxy/certs` host path) must be backed up:

- `ccat-ca_step-ca-data` — step-ca config, db, intermediate public cert (not the key — that's on HSM #2).
- `ccat-proxy_html` + `/opt/proxy/certs` — LE certs (cheap to re-issue, but saves a round-trip on DR).
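A minimal sketch of a one-shot dump of these locations, assuming the usual tar pass-through pattern. The directories here are stand-ins so the commands run anywhere; on input-b the source would be the volume's mountpoint (e.g. via `docker run --rm -v ccat-ca_step-ca-data:/data ...`), and the destination whatever the backup pipeline mounts:

```shell
# SRC stands in for the step-ca-data mountpoint, DST for the backup target.
SRC=$(mktemp -d)
DST=$(mktemp -d)
echo '{"demo": true}' > "$SRC/ca.json"   # stand-in for step-ca's config

# The actual dump: one dated tarball per run.
tar czf "$DST/step-ca-data.$(date +%F).tgz" -C "$SRC" .

# Verify the archive is readable and lists what was put in.
tar tzf "$DST"/step-ca-data.*.tgz
```

A dated filename per run keeps restores unambiguous and lets the pipeline's retention policy prune old dumps by name.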
The `ccat-ca_dex-data` sqlite3 volume does not need backup: Dex's entire config is in git (`step-ca/dex/config.yaml`), and the volume holds only ephemeral session state plus signing keys that are safely regenerated on first start.

The CCAT backup pipeline should cover these paths (see {doc}`backup-restore` for the backup architecture).

The HSM keys themselves are **not** in any backup — they cannot be. This is acceptable because:

- HSM #1 failure is a planned-for disaster with a documented recovery procedure (emergency root rotation, re-bootstrap of all clients).
- HSM #2 failure is a routine rotation (root ceremony, new intermediate, swap in a fresh HSM).

Recovery scenarios for both HSMs are in {doc}`ca-rotation-and-recovery`.

## See also

- {doc}`background/ca-architecture` — design context for the CA.
- {doc}`background/ca-provisioner-set` — provisioner reference table.
- {doc}`ca-client-onboarding` — set up `step` on your laptop.
- {doc}`ca-provisioner-management` — adding, updating, removing provisioners; rotating JWK passwords.
- {doc}`ca-rotation-and-recovery` — intermediate / root rotation, disaster recovery.