# CA day-to-day operations

This page is a **how-to** for the routine operational tasks against the CCAT step-ca: bringing the stack up/down/restarting, issuing new certs on demand, monitoring expiry, and backing up the relevant volumes. It is the working runbook for anyone on rotation.

For the design rationale behind any of this (why these volumes, why these lifetimes), see {doc}`background/ca-architecture`. For the lookup tables (provisioner set, lifetime flags), see {doc}`background/ca-provisioner-set`.

```{contents}
:local:
:depth: 2
```

## Bringing the stack up, down, or restarting it

Everything goes through the `ccat ca` CLI, which wraps `docker compose -f docker-compose.ca.yml`:

```bash
ccat ca status            # show container status
ccat ca logs              # tail all services
ccat ca logs step-ca      # tail a specific service
ccat ca restart step-ca   # restart without image pull
ccat ca update            # git pull → image pull → up -d
ccat ca down              # stop, preserve volumes (always)
```

`ccat ca down` **never** passes `-v`. This is deliberate. The `step-ca-data` volume is irreplaceable in Phase 2+ — losing it means re-doing the root ceremony and re-bootstrapping every client. The `dex-data` volume is safe to wipe in principle (Dex regenerates signing keys on startup), but a fresh JWKS briefly invalidates step-ca's cached discovery until it refetches, so there is no reason to do it during a normal restart. If you genuinely need to wipe the CA, do it by hand with `docker volume rm` — and think three times.

## Issuing new certs

Humans use the `step` CLI after `step ca bootstrap`:

```bash
# SSH user cert (opens browser for GitHub OAuth)
step ssh login

# x509 cert for a service (JWK provisioner)
step ca certificate service.local service.crt service.key \
  --provisioner prod-services

# ACME cert (automatic, for internal services)
step ca certificate service.local service.crt service.key \
  --acme
```

For the full provisioner set (which one to use when), see {doc}`background/ca-provisioner-set`.
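After issuing, it can be worth sanity-checking the validity window the provisioner actually granted. A minimal sketch using plain openssl — here a throwaway self-signed cert stands in for `service.crt` so the commands run anywhere; on a real host you would point them at the issued file (or use `step certificate inspect --short`):

```shell
# Throwaway self-signed cert as a stand-in for service.crt.
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/demo.key \
  -out /tmp/demo.crt -days 1 -subj "/CN=service.local" 2>/dev/null

# Subject and validity window — the fields you would eyeball after issuing.
openssl x509 -in /tmp/demo.crt -noout -subject -dates

# Machine-checkable: does the cert survive the next 12 hours (43200 s)?
openssl x509 -in /tmp/demo.crt -noout -checkend 43200 && echo "ok for 12h"
```

`-checkend` exits non-zero when the cert expires within the given window, which makes it easy to script.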
Operator client setup (so `step ssh login` works on a fresh laptop) is in {doc}`ca-client-onboarding`.

## Vhost cert lifecycle

The `ca.ccat.uni-koeln.de` vhost is served by nginx-proxy with a **CCAT-rooted** cert (not Let's Encrypt — `acme-companion` is opted out for this vhost). The cert renews itself on a systemd timer; you should not normally have to touch it. See {doc}`background/ca-architecture` for *why* the certs are split per vhost.

**What's installed where (input-b):**

- `/opt/proxy/certs/ca.ccat.uni-koeln.de.crt` — full chain (mode 0644)
- `/opt/proxy/certs/ca.ccat.uni-koeln.de.key` — private key (mode 0600)
- `/opt/data-center/system-integration/step-ca/renew-vhost-cert.sh` — renewal helper invoked by the timer
- `step-ca-vhost-renew.timer` / `.service` — systemd units

**Timer behavior:**

- Schedule: `OnBootSec=15min`, `OnUnitActiveSec=12h`, persistent. Fires at every boot (after 15 min) and every 12 h thereafter.
- Each fire calls `step ca renew`, which short-circuits unless the cert is within ~1/3 of its lifetime (default step-cli behavior). Most fires do nothing.
- nginx-proxy is reloaded **only when the cert mtime actually changed** — the helper compares the cert's mtime before and after the renew call and skips the reload on a no-op renewal.

**Inspection commands (on input-b):**

```bash
# Timer status, next/last fire times
systemctl list-timers step-ca-vhost-renew*

# Last service run output (look here if the cert is overdue)
journalctl -u step-ca-vhost-renew.service -n 50

# Cert validity window and SANs
step certificate inspect /opt/proxy/certs/ca.ccat.uni-koeln.de.crt \
  --short

# What the proxy is actually serving on the wire
echo | openssl s_client -connect ca.ccat.uni-koeln.de:443 \
  -servername ca.ccat.uni-koeln.de 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates
```

The unhappy paths — manual force-renew, emergency re-issue — live in {doc}`ca-rotation-and-recovery`.
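The mtime-gated reload described above can be sketched as plain shell. This is a hypothetical reconstruction of the helper's logic, not the production script: a temp file plays the cert, `touch` plays `step ca renew`, and setting `action` plays the nginx reload, so the gate itself is runnable anywhere:

```shell
CRT=$(mktemp)
touch -d '2024-01-01 00:00' "$CRT"

pre=$(stat -c %Y "$CRT")
touch -d '2024-01-01 00:01' "$CRT"   # a real renewal rewrites the file, so mtime moves

post=$(stat -c %Y "$CRT")
if [ "$post" -ne "$pre" ]; then
  action="reload"   # real helper would run: docker exec nginx-proxy nginx -s reload
else
  action="skip"     # no-op renewal → leave nginx alone
fi
echo "$action"
rm -f "$CRT"
```

The point of the gate is that most timer fires hit the "skip" branch: `step ca renew` returns without touching the file, and nginx never sees an unnecessary reload.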
## Adding a partner subnet

Access to `ca.ccat.uni-koeln.de` is gated by an in-repo IP allowlist at the proxy ({file}`proxy/data/vhost.d/ca.ccat.uni-koeln.de`), defaulting to the Uni Köln `/16` and `deny all` otherwise. Adding a new partner subnet is a four-step rollout:

1. **Edit the allowlist** on a feature branch:

   ```nginx
   # proxy/data/vhost.d/ca.ccat.uni-koeln.de
   allow 134.95.0.0/16;     # Uni Köln main
   allow 198.51.100.0/24;   # NEW partner — describe in commit
   deny all;
   ```

2. **Land the change** through the normal PR flow, then on input-b:

   ```bash
   ssh input-b
   cd /opt/data-center/system-integration && git pull
   docker exec nginx-proxy nginx -s reload
   ```

3. **Send the partner the bootstrap command.** They follow {doc}`ca-client-onboarding`. The canonical fingerprint is on that page, so the partner is always reading the current value.

4. **Confirm.** Ask the partner to run `step ssh login` and report back. If they still get a `403 Forbidden`, the reload didn't pick up the new CIDR — re-check step 2.

```{tip}
Keep one allowlist line per partner, with a short identifying comment on the same line. The file is the source of truth for "who is allowed to bootstrap CCAT trust", and a short comment per CIDR is the only audit trail it carries.
```

## Monitoring cert expiry

step-ca's internal database tracks issued certs. For CCAT operational visibility, expiry should be surfaced in Grafana via InfluxDB. The pattern (to be implemented in Phase 2):

- A systemd timer on each managed host periodically runs `step certificate inspect --format json` on each local cert and pushes a `cert_expiry_days` metric to InfluxDB.
- Grafana alerts on `cert_expiry_days < 7` for any service.

## Backup

Two Docker volumes on input-b (plus the `/opt/proxy/certs` host path) must be backed up:

- `ccat-ca_step-ca-data` — step-ca config, db, intermediate public cert (not the key — that's on HSM #2).
- `ccat-proxy_html` + `/opt/proxy/certs` — LE certs (cheap to re-issue, but saves a round-trip on DR).
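A minimal sketch of a one-shot dump of these locations, assuming the usual tar pass-through pattern. The directories here are stand-ins so the commands run anywhere; on input-b the source would be the volume's mountpoint (e.g. via `docker run --rm -v ccat-ca_step-ca-data:/data ...`), and the destination whatever the backup pipeline mounts:

```shell
# SRC stands in for the step-ca-data mountpoint, DST for the backup target.
SRC=$(mktemp -d)
DST=$(mktemp -d)
echo '{"demo": true}' > "$SRC/ca.json"   # stand-in for step-ca's config

# The actual dump: one dated tarball per run.
tar czf "$DST/step-ca-data.$(date +%F).tgz" -C "$SRC" .

# Verify the archive is readable and lists what was put in.
tar tzf "$DST"/step-ca-data.*.tgz
```

A dated filename per run keeps restores unambiguous and lets the pipeline's retention policy prune old dumps by name.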
The `ccat-ca_dex-data` sqlite3 volume does not need backup: Dex's entire config is in git (`step-ca/dex/config.yaml`), and the volume holds only ephemeral session state plus signing keys that are safely regenerated on first start.

The CCAT backup pipeline should cover these paths (see {doc}`backup-restore` for the backup architecture).

The HSM keys themselves are **not** in any backup — they cannot be. This is acceptable because:

- HSM #1 failure is a planned-for disaster with a documented recovery procedure (emergency root rotation, re-bootstrap of all clients).
- HSM #2 failure is a routine rotation (root ceremony, new intermediate, swap in a fresh HSM).

Recovery scenarios for both HSMs are in {doc}`ca-rotation-and-recovery`.

## See also

- {doc}`background/ca-architecture` — design context for the CA.
- {doc}`background/ca-provisioner-set` — provisioner reference table.
- {doc}`ca-client-onboarding` — set up `step` on your laptop.
- {doc}`ca-provisioner-management` — adding, updating, removing provisioners; rotating JWK passwords.
- {doc}`ca-rotation-and-recovery` — intermediate / root rotation, disaster recovery.