CA day-to-day operations#

This page is a how-to for the routine operational tasks against the CCAT step-ca: bringing the stack up/down/restarting, issuing new certs on demand, monitoring expiry, and backing up the relevant volumes. It is the working runbook for anyone on rotation. For the design rationale behind any of this (why these volumes, why these lifetimes), see CCAT Certificate Authority — Architecture and Design. For the lookup tables (provisioner set, lifetime flags), see CCAT CA — Provisioner set and reference tables.

Bringing the stack up, down, restart#

All via the ccat ca CLI, which wraps docker compose -f docker-compose.ca.yml:

ccat ca status             # show container status
ccat ca logs               # tail all services
ccat ca logs step-ca       # tail a specific service
ccat ca restart step-ca    # restart without image pull
ccat ca update             # git pull → image pull → up -d
ccat ca down               # stop, preserve volumes (always)

ccat ca down never passes -v. This is deliberate. The step-ca-data volume is irreplaceable in Phase 2+ — losing it means re-doing the root ceremony and re-bootstrapping every client. The dex-data volume is safe to wipe in principle (Dex regenerates signing keys on startup), but a fresh JWKS briefly invalidates step-ca’s cached discovery until it refetches, so there’s no reason to do it during a normal restart. If you need to truly wipe the CA, do it by hand with docker volume rm and think three times.
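If you do end up wiping by hand, a guard helps. The helper below is a hypothetical sketch (not part of the repo) that refuses to run unless explicitly confirmed; the volume names match the compose project names used in the Backup section:

```shell
# Deliberate full-wipe helper (sketch). Refuses to run without explicit
# confirmation, so pasting it by accident cannot delete the CA volumes.
wipe_ca() {
    if [ "${CCAT_CONFIRM_WIPE:-no}" != "yes" ]; then
        echo "refusing to wipe: set CCAT_CONFIRM_WIPE=yes to proceed"
        return 1
    fi
    ccat ca down                           # stop the stack, volumes intact
    docker volume rm ccat-ca_dex-data      # safe: Dex regenerates signing keys
    docker volume rm ccat-ca_step-ca-data  # irreplaceable: forces a new root ceremony
}
```

Run it as `CCAT_CONFIRM_WIPE=yes wipe_ca`; anything else is a no-op with a refusal message.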

Issuing new certs#

Humans use the step CLI after step ca bootstrap:

# SSH user cert (opens browser for GitHub OAuth)
step ssh login

# x509 cert for a service (JWK provisioner)
step ca certificate service.local service.crt service.key \
  --provisioner prod-services

# ACME cert (automatic, for internal services)
step ca certificate service.local service.crt service.key \
  --acme

For the full provisioner set (which one to use when), see CCAT CA — Provisioner set and reference tables. Operator client setup (so step ssh login works on a fresh laptop) is in Client setup — SSH with step-ca certificates.

Vhost cert lifecycle#

The ca.ccat.uni-koeln.de vhost is served by nginx-proxy with a CCAT-rooted cert (not Let’s Encrypt — acme-companion is opted out for this vhost). The cert renews itself on a systemd timer; you should not normally have to touch it. See CCAT Certificate Authority — Architecture and Design for why this vhost gets its own cert handling.

What’s installed where (input-b):

  • /opt/proxy/certs/ca.ccat.uni-koeln.de.crt — full chain (mode 0644)

  • /opt/proxy/certs/ca.ccat.uni-koeln.de.key — private key (mode 0600)

  • /opt/data-center/system-integration/step-ca/renew-vhost-cert.sh — renewal helper invoked by the timer

  • step-ca-vhost-renew.timer / .service — systemd units

Timer behavior:

  • Schedule: OnBootSec=15min, OnUnitActiveSec=12h, persistent. Fires at every boot (after 15 min) and every 12 h thereafter.

  • Each fire calls step ca renew, which short-circuits unless the cert is within ~1/3 of its lifetime (default step-cli behavior). Most fires do nothing.

  • nginx-proxy is reloaded only when the cert mtime actually changed — the helper checks PRE/POST mtime and skips the reload on a no-op renewal.
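The mtime check is simple enough to sketch. This is a generic version of that logic, not the actual helper (which lives in renew-vhost-cert.sh); in production the renew command would be `step ca renew ...` and the reload command `docker exec nginx-proxy nginx -s reload`. GNU stat assumed:

```shell
# Run a renew command; reload only if the cert file's mtime changed.
# A no-op renewal leaves the file untouched, so the reload is skipped.
reload_if_renewed() {    # usage: reload_if_renewed RENEW_CMD CERT_PATH RELOAD_CMD
    pre=$(stat -c %Y "$2")
    $1                                   # word-split on purpose: runs the renew command
    post=$(stat -c %Y "$2")
    if [ "$post" = "$pre" ]; then
        echo "no-op renewal, skipping reload"
        return 0
    fi
    $3                                   # cert changed: reload the proxy
}
```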

Inspection commands (on input-b):

# Timer status, next/last fire times
systemctl list-timers 'step-ca-vhost-renew*'

# Last service run output (look here if the cert is overdue)
journalctl -u step-ca-vhost-renew.service -n 50

# Cert validity window and SANs
step certificate inspect /opt/proxy/certs/ca.ccat.uni-koeln.de.crt \
    --short

# What the proxy is actually serving on the wire
echo | openssl s_client -connect ca.ccat.uni-koeln.de:443 \
    -servername ca.ccat.uni-koeln.de 2>/dev/null \
    | openssl x509 -noout -subject -issuer -dates

The unhappy paths — manual force-renew, emergency re-issue — live in CA rotation and disaster recovery.

Adding a partner subnet#

Access to ca.ccat.uni-koeln.de is gated by an in-repo IP allowlist at the proxy (proxy/data/vhost.d/ca.ccat.uni-koeln.de), which defaults to the Uni Köln /16 and denies everything else. Adding a new partner subnet is a four-step rollout:

  1. Edit the allowlist on a feature branch:

    # proxy/data/vhost.d/ca.ccat.uni-koeln.de
    allow 134.95.0.0/16;          # Uni Köln main
    allow 198.51.100.0/24;        # NEW partner — describe in commit
    deny all;
    
  2. Land the change through the normal PR flow, then on input-b:

    ssh input-b
    cd /opt/data-center/system-integration && git pull
    docker exec nginx-proxy nginx -s reload
    
  3. Send the partner the bootstrap command. They follow Client setup — SSH with step-ca certificates. The canonical fingerprint is on that page so the partner is always reading the current value.

  4. Confirm. Ask the partner to run step ssh login and report back. If they get a 403 Forbidden, the reload didn’t pick up the new CIDR — re-check step 2.

Tip

Keep one allowlist line per partner with a short identifying comment on the same line. The file is the source of truth for “who is allowed to bootstrap CCAT trust” and a short comment per CIDR is the only audit trail it carries.
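When triaging a 403, it helps to check whether the partner’s IP is already covered by an existing CIDR before adding a new line. A small IPv4 helper, sketched in POSIX shell (not part of the repo):

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip4_to_int() {
    old_ifs=$IFS; IFS=.; set -- $1; IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed when the address falls inside the CIDR (IPv4 only).
ip_in_cidr() {    # usage: ip_in_cidr 134.95.12.1 134.95.0.0/16
    net=${2%/*}; bits=${2#*/}
    mask=$(( bits == 0 ? 0 : (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    [ $(( $(ip4_to_int "$1") & mask )) -eq $(( $(ip4_to_int "$net") & mask )) ]
}
```

For example, `ip_in_cidr 198.51.100.8 198.51.100.0/24 && echo covered` confirms an address against the partner line from step 1.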

Monitoring cert expiry#

step-ca’s internal database tracks issued certs. For CCAT operational visibility, expiry should be surfaced in Grafana via InfluxDB. The pattern (to be implemented in Phase 2):

  • A systemd timer on each managed host runs step certificate inspect --format json <cert> periodically and pushes a cert_expiry_days metric to InfluxDB.

  • Grafana alerts on cert_expiry_days < 7 for any service.
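Until that lands, the metric’s core computation can be sketched without the step CLI at all; plain openssl is enough (GNU date assumed, and the commented InfluxDB write uses a placeholder URL, bucket, and token):

```shell
# Days until a PEM certificate expires.
cert_expiry_days() {
    end=$(openssl x509 -enddate -noout -in "$1" | cut -d= -f2)
    echo $(( ($(date -d "$end" +%s) - $(date +%s)) / 86400 ))
}

# Push as InfluxDB v2 line protocol (endpoint, bucket, org, token are all
# placeholders — adapt to the CCAT InfluxDB instance):
#   curl -s -XPOST "https://influx.example/api/v2/write?bucket=metrics&org=ccat" \
#       -H "Authorization: Token $INFLUX_TOKEN" \
#       --data-binary "cert_expiry_days,host=$(hostname) value=$(cert_expiry_days /opt/proxy/certs/ca.ccat.uni-koeln.de.crt)"
```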

Backup#

Two Docker volumes on input-b must be backed up:

  • ccat-ca_step-ca-data — step-ca config, db, intermediate public cert (not the key — that’s on HSM #2).

  • ccat-proxy_html + /opt/proxy/certs — LE certs and the CCAT-rooted vhost cert (all cheap to re-issue, but a backup saves a round-trip on DR).

The ccat-ca_dex-data sqlite3 volume does not need backup: Dex’s entire config is in git (step-ca/dex/config.yaml), and the volume holds only ephemeral session state + signing keys that are safely regenerated on first start.

The CCAT backup pipeline should cover these paths (see Backup and Restore for the backup architecture).
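For an ad-hoc snapshot outside the pipeline, the usual throwaway-container idiom applies. The tar helper below is a generic sketch; the commented docker invocation shows how it maps onto the volume (/opt/backups is an assumed destination, and the stack should be stopped first so the db is consistent):

```shell
# Tar a directory tree into a dated tarball.
snapshot_tree() {    # usage: snapshot_tree SRC_DIR DEST_DIR NAME
    tar czf "$2/$3-$(date +%F).tar.gz" -C "$1" .
}

# Against the live volume, bind-mount it read-only into a throwaway container:
#   docker run --rm -v ccat-ca_step-ca-data:/data:ro -v /opt/backups:/backup \
#       alpine tar czf "/backup/step-ca-data-$(date +%F).tar.gz" -C /data .
```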

The HSM keys themselves are not in any backup — they cannot be. This is acceptable because:

  • HSM #1 failure is a planned-for disaster with a documented recovery procedure (emergency root rotation, re-bootstrap all clients).

  • HSM #2 failure is a routine rotation (root ceremony, new intermediate, swap in fresh HSM).

Recovery scenarios for both HSMs are in CA rotation and disaster recovery.

See also#