# CCAT Certificate Authority — Threat Model & Attack Surface

This document is the security-review companion to
{doc}`ca-architecture`. Where that document explains **how** the
CCAT step-ca deployment is built and operated, this one explains **what
it is exposed to**, what defends it, and what to do when something goes
wrong. It is intended as reference material for security review,
incident response, and the Phase 3 go/no-go decision on worldwide
exposure of the CA API at `ca.ccat.uni-koeln.de`.

For PKI fundamentals (what a cert is, how chains work, why short
lifetimes matter), see {doc}`tls-and-pki`. For OS and container patch
cadence — which directly affects how fast we close CVEs in step-ca
itself — see {doc}`patch-management-and-supply-chain`.

```{contents}
:local:
:depth: 2
```

## Scope

**What this document covers.** The public and auth-gated attack surface
of the deployed step-ca instance, the Dex OIDC provider it relies on,
realistic attacker scenarios, the defense-in-depth layers that stop
them, and an incident-response sketch per failure mode.

**What it does not cover.** Cryptographic primitives and PKI theory
(see {doc}`tls-and-pki`), day-to-day operational runbooks such as
adding provisioners or rotating the intermediate (see
{doc}`../ca-provisioner-management` and {doc}`../ca-rotation-and-recovery`),
and the broader supply-chain story for OS and container updates (see
{doc}`patch-management-and-supply-chain`).

### Deployment phases

The threat model differs sharply between the rollout phases:

| Phase | Root key | Production trust | Threat-model posture |
|---|---|---|---|
| **Phase 1** — dry-run | File-based, throwaway | No production service trusts the root yet | Everything issued is a dress rehearsal. A full compromise = wipe and redo the ceremony. Low stakes. |
| **Phase 2** — HSM rehearsal | HSM-backed, real | Still no production trust; rotation drills only | Keys are extraction-resistant. Signing key never leaves the HSM. |
| **Phase 3** — steady state | HSM-backed, real | All CCAT hosts trust the root; SSH, mTLS, internal web UIs chain to it | Real blast radius. The operational hardening checklist below must be complete before entering this phase. |

### Scale assumption

CCAT is a small, university-hosted, internal-use PKI: roughly **20
trusted humans** (the ccatobs GitHub org) plus a handful of service
accounts and hosts. The threat model reflects this — we are not
defending a public CA with millions of subscribers. We do not need
WebPKI-grade transparency logs, CRLs, or staffed SOCs. We *do* need the
discipline that comes with being an internal trust anchor for real
infrastructure.

## Attack surface

Two DNS names are relevant:

- `ca.ccat.uni-koeln.de` — step-ca's API. Fronted by nginx-proxy on
  port 443 with a **CCAT-rooted** TLS cert (issued by step-ca itself,
  renewed via systemd timer), so step-cli's `RootCAs`-only verification
  works without any client-side trust-bundle plumbing. step-ca's
  native :9000 is also published on input-b for same-host workflows,
  but it is firewalled to the Uni Köln /16 (and dropped between
  subnets by Uni IT regardless), so :443 is the universal client path.
- `auth.ccat.uni-koeln.de` — Dex's OIDC endpoints and GitHub login
  redirect page, on port 443 with Let's Encrypt termination via
  acme-companion. Dex has no admin UI and no path that needs IP
  gating.

### Public-by-design endpoints

These endpoints are **meant** to be reachable by anyone. Serving them
to the entire internet is not a security regression — it is how the
protocols are specified. The equivalent endpoints on Let's Encrypt and
every other public CA are worldwide-reachable by design.

| Endpoint | Purpose | What an attacker learns |
|---|---|---|
| `GET /health` | Liveness probe | The CA is up |
| `GET /roots.pem` | Public root certificate | The public half of our trust anchor — exactly the material every client has to fetch anyway |
| `GET /provisioners` | Provisioner discovery | Provisioner names, types, and public config: OIDC issuer URL, OIDC client ID, allowed group claims, ACME directory URL |
| ACME directory (`/acme/acme/directory`) | RFC 8555 discovery | Standard ACME endpoints |
| Dex OIDC discovery (`/.well-known/openid-configuration`) | OIDC metadata | Issuer URL, JWKS URI, supported flows |

What is **not** in those responses:

- JWK provisioner passwords (these are held in the Ansible vault and
  never served)
- OIDC client secrets (held by Dex via its static clients config, never emitted)
- Signing keys or any secret material — the CA emits only public certs
- User identities, SSH principal lists, or issued-cert history

Publishing provisioner metadata is intentional. A client that wants to
request a cert needs to know which provisioners exist and how to
authenticate to them. This is the same discovery model Let's Encrypt
uses via its ACME directory.
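
The "not in those responses" list above is easy to spot-check against a saved `/provisioners` response. The following is a minimal sketch, assuming the response is JSON and that secret-bearing fields would carry step-ca's provisioner config names `clientSecret` and `encryptedKey` (verify both names against the deployed step-ca version). `check_provisioner_response` is a hypothetical helper, not part of `ccat` or `step`.

```bash
# Sketch: flag secret-bearing field names in a saved /provisioners response.
# Assumption: field names follow step-ca's provisioner config schema.
# Reads JSON on stdin, prints one line per flagged field,
# returns 0 if nothing was flagged and 1 otherwise.
check_provisioner_response() {
    input=$(cat)
    flagged=0
    for field in clientSecret encryptedKey; do
        if printf '%s' "$input" | grep -q "\"$field\""; then
            echo "FLAG: field \"$field\" present in response"
            flagged=1
        fi
    done
    return "$flagged"
}

# Usage (root_ca.crt is an assumed local copy of the CCAT root):
#   curl -fsS --cacert root_ca.crt https://ca.ccat.uni-koeln.de/provisioners \
#     | check_provisioner_response
```

A flagged field is a prompt to review that provisioner's configuration, not proof of a leak.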

### Auth-gated endpoints — where the real security lives

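Everything that actually **issues** a cert sits behind one of four
authentication mechanisms (OIDC, JWK, ACME, SSHPOP), each with its own
strength profile. JWK is listed twice below because its two uses issue
different cert types.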

**`POST /1.0/ssh/sign` — OIDC (CCAT-GitHub provisioner)**

Requires a Dex-issued OIDC token whose `groups` claim contains
the `ccatobs/datacenter` GitHub team slug. Strong by design: an
attacker has to clear three independent gates — a valid GitHub
identity, successful GitHub OAuth with `read:org` scope, and
actual membership in the `ccatobs/datacenter` team at the moment
Dex calls GitHub's team-membership endpoint. Membership is checked
live against GitHub on every authentication, so a user removed
from the team cannot authenticate even if their browser session
is still warm. Output is a 16-hour SSH user cert with the user's
principal only.
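
The "user's principal only" property can be verified on any issued cert. The sketch below assumes the text layout printed by OpenSSH's `ssh-keygen -L` (a `Principals:` header followed by one indented name per line, ending at `Critical Options:`). The helper name `cert_principals` and the cert path in the usage comment are hypothetical.

```bash
# Sketch: list the principals embedded in an SSH certificate.
# Reads the output of `ssh-keygen -L` on stdin, prints one principal per line.
cert_principals() {
    awk '
        /^[[:space:]]*Principals:/       { in_list = 1; next }
        /^[[:space:]]*Critical Options:/ { in_list = 0 }
        in_list && NF                    { print $1 }'
}

# Usage (the cert path is an assumption):
#   ssh-keygen -L -f ~/.ssh/id_ecdsa-cert.pub | cert_principals
```

A cert issued through the OIDC provisioner should yield exactly one line here. The `Valid:` line of the same `ssh-keygen -L` output shows the validity window.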

**`POST /1.0/sign` — JWK (prod-services, staging-services)**

Requires knowledge of the provisioner password (`STEP_CA_PASSWORD`,
held in the Ansible vault as `vault_step_ca_password`). Medium
strength: it is a single long-lived secret, but it is high-entropy,
scoped to a single provisioner, and never leaves the vault except
during `ccat secrets provision`. An attacker with vault access already
has a much larger problem. Output is an x509 cert valid for 30 or 90
days, with principal restrictions enforced by the provisioner template.

**`POST /1.0/ssh/sign` — JWK (service-accounts)**

Same mechanism as above, but issues 24-hour SSH service certs that are
auto-renewed every 6 hours by the target host. Blast radius on
compromise is small because the certs expire quickly on their own.
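
The 24-hour lifetime and 6-hour renewal interval also determine how many renewal attempts may fail before a host is left without a valid cert. A rough sketch of that arithmetic, assuming renewals fire on a fixed schedule counted from issuance and that both values are whole hours (the scheduling model and the helper name `renewal_headroom` are assumptions, not taken from the deployment):

```bash
# Sketch: renewal attempts that fit inside one cert lifetime.
# $1 = cert lifetime in hours, $2 = renewal interval in hours.
renewal_headroom() {
    lifetime_h=$1
    interval_h=$2
    # Attempts that land strictly before expiry (at interval, 2*interval, ...).
    attempts=$(( (lifetime_h - 1) / interval_h ))
    # All but the last of those attempts may fail without a gap in coverage.
    tolerated=$(( attempts > 0 ? attempts - 1 : 0 ))
    echo "attempts_before_expiry=$attempts tolerated_failures=$tolerated"
}

renewal_headroom 24 6
```

Under these assumptions a service-account cert gets three renewal attempts before it expires, so two consecutive failures are survivable.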

**`POST /acme/...` — ACME challenge response**

Strong by protocol design: the attacker must prove control of the
hostname they are requesting a cert for, via HTTP-01, DNS-01, or
TLS-ALPN-01. An attacker who does not control `example.ccat.uni-koeln.de`
cannot satisfy a challenge for it, full stop. Output is 90-day x509.

**`POST /1.0/ssh/renew` — SSHPOP**

Requires possession of a currently-valid SSH host cert. There is **no
bootstrap path** through this endpoint — it only renews an existing
cert, it never issues the first one. An attacker who already has a
valid host cert is an attacker who already has the host.

## Realistic attack scenarios

Ordered roughly from "happens every day" to "we hope this is
hypothetical."

### 1. Random reconnaissance scans

Botnets sweep the internet continuously. Exposing the CA API worldwide
means we will see constant, low-grade scan traffic.

- **What they find.** `/health`, `/roots.pem`, `/provisioners`.
  Everything public-by-design.
- **What's actionable for them.** Nothing. The information is
  equivalent to what Let's Encrypt publishes about itself.
- **What stops them.** Nothing needs to — there's nothing to steal.
- **What monitoring catches it.** Loki + Grafana will show steady
  low-rate 200s on the public endpoints. Useful as baseline.

### 2. Vulnerability scanning against known step-ca CVEs

Scanners try CVEs indiscriminately. step-ca is open source and actively
maintained by Smallstep.

- **What they gain.** If we're patched, nothing. If we're not, it
  depends on the CVE.
- **What stops them.** Prompt patching. Subscribe to Smallstep's
  security advisories. See {doc}`patch-management-and-supply-chain`
  for the general upgrade cadence story.
- **What monitoring catches it.** 4xx/5xx spikes, odd user-agent
  strings in Loki, Grafana alerts on error-rate anomalies.

### 3. Brute-force against JWK provisioner passwords

The `prod-services` / `staging-services` / `service-accounts`
provisioners authenticate with a password. An attacker who knows the
provisioner name could attempt to guess the password by repeatedly
POSTing sign requests.

- **What they gain.** Effectively nothing: the password is
  high-entropy (generated via `ccat secrets rotate`), and step-ca
  enforces request rate limits. A 128-bit password behind a
  rate-limited endpoint cannot be brute-forced on any human timescale.
- **What stops them.** Password entropy + step-ca rate limits +
  nginx-proxy rate limiting if needed + fail2ban on repeated 401s.
- **What monitoring catches it.** Repeated auth failures from the same
  source IP in the step-ca log.
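
The "not brute-forceable" claim can be made concrete with a back-of-the-envelope calculation. This is a sketch only: the 128-bit figure comes from the text above, while the sustained rate of 1000 guesses per second is an assumed, deliberately generous number, and `years_to_bruteforce` is a hypothetical helper.

```bash
# Sketch: expected years to guess a uniformly random secret online.
# $1 = secret entropy in bits, $2 = guesses per second (assumed rate).
years_to_bruteforce() {
    awk -v bits="$1" -v rate="$2" 'BEGIN {
        expected_guesses = 2 ^ (bits - 1)   # half the keyspace on average
        seconds_per_year = 31557600         # 365.25 days
        printf "%.2e\n", expected_guesses / rate / seconds_per_year
    }'
}

years_to_bruteforce 128 1000
```

At that rate the expected search time is on the order of 10^27 years, so the result does not hinge on the exact rate limit in force.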

### 4. Denial of service / cert spamming

Flood the sign endpoint to exhaust resources or fill the DB with
issued certs.

- **What they gain.** Degraded availability for legitimate issuance.
- **What stops them.** step-ca has built-in per-provisioner rate
  limits. nginx-proxy can add an outer rate limit. Docker port binding
  to a specific interface limits blast surface. host iptables provides
  a final layer.
- **What monitoring catches it.** Request-rate dashboards in Grafana;
  alerts on sustained high throughput.

### 5. Targeted phishing of the OIDC flow

A social-engineering attack against a ccatobs/datacenter team
member that tricks them into completing a Dex OIDC flow the
attacker initiated.

- **Key observation.** This attack works identically against a
  localhost-only CA. Exposing the CA API worldwide neither helps nor
  hurts the attacker here, because the flow runs in the victim's
  browser, not against the CA's network endpoint.
- **What they gain.** A 16-hour SSH user cert for the victim's
  principal.
- **What stops them.** GitHub 2FA on the upstream identity
  (mandatory on ccatobs org); the team-membership check happens on
  every login, so any user not currently in `ccatobs/datacenter`
  is rejected at Dex; user awareness training; the 16-hour lifetime
  bounds the blast window; removing the victim from the GitHub team
  immediately blocks any new authentication attempts.
- **What monitoring catches it.** Anomalous issuance patterns for a
  user (odd hours, unexpected source IP), cross-checked against the
  user's usual behavior.

## Defense layers

The exposure above is safe because no single layer is load-bearing —
each attack scenario is stopped by multiple independent defenses.

1. **Network layer.** Optional and cumulative:
   - Uni Köln firewall (outermost — currently closed on TCP 9000,
     opening is the Phase 3 request)
   - Host iptables on input-b
   - Docker port binding (can bind 9000 to a specific interface only)
   - nginx-proxy IP allowlists (available via `proxy/data/vhost.d/`
     drop-in files if ever needed; not currently used on
     auth.ccat.uni-koeln.de because Dex has no admin UI to gate)
2. **Application layer.** step-ca's own auth model: every endpoint
   that issues a cert requires one of the auth mechanisms above. There
   is no unauthenticated path to issuance.
3. **Provisioner layer.** Each provisioner has its own independent
   auth gate. Compromising one provisioner does not compromise the
   others.
4. **Authorization gate.** Role-based or challenge-based checks on top
   of authentication: `ccatobs/datacenter` GitHub team membership
   enforced by Dex for OIDC, password secrecy for JWK, challenge
   response for ACME, cert possession for SSHPOP.
5. **Issued-cert constraints.** Short lifetimes are the single
   largest blast-radius reducer:
   - 16h human SSH user certs
   - 24h service-account SSH certs (renewed every 6h)
   - 7d SSH host certs
   - 30–90d TLS certs

   Separately, the signing key never leaves the HSM (Phase 2+), so a
   host compromise cannot exfiltrate it.
6. **Target-host opt-in.** A cert is only useful against hosts that
   have been told to trust the CCAT CA, via
   `/etc/ssh/trusted_user_ca_keys` deployed by the `ca_trust` Ansible
   role. A leaked cert against a host that doesn't trust us is a
   leaked cert against a host that doesn't care.
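
Whether a given host has opted in (layer 6) can be read from its sshd configuration. A minimal sketch, assuming the `ca_trust` role sets the standard `TrustedUserCAKeys` directive directly in the main config file; `trusted_ca_path` is a hypothetical helper, and `sshd -T` remains the authoritative view when `Include` files are in play.

```bash
# Sketch: print the TrustedUserCAKeys path from an sshd_config file.
# $1 = config path (defaults to /etc/ssh/sshd_config).
# Exits non-zero if the directive is absent or commented out.
trusted_ca_path() {
    config=${1:-/etc/ssh/sshd_config}
    # sshd_config keywords are case-insensitive; commented lines do not match.
    awk 'tolower($1) == "trustedusercakeys" { print $2; found = 1; exit }
         END { exit !found }' "$config"
}

# Usage:
#   trusted_ca_path || echo "this host does not trust any user CA"
```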

| Scenario | Stopped by layer(s) |
|---|---|
| Recon scans | 1, 2 |
| Known-CVE scanning | Patch cadence + 1, 2 |
| JWK brute force | 1, 2, 3, 4 |
| DoS / cert spam | 1, 2, 5 |
| OIDC phishing | 4, 5, 6 |

## The Phase 3 decision: per-vhost cert split on :443

The team explored two designs for exposing the CA API to the internal
client population (and eventually the wider observatory team):

- **Plan B**: open TCP 9000 worldwide and have step-cli talk directly
  to step-ca's native TLS endpoint, bypassing nginx-proxy.
- **Plan B-revised** (chosen): keep nginx-proxy but give the
  `ca.ccat.uni-koeln.de` vhost a **CCAT-rooted** cert (issued by
  step-ca itself via the prod-services JWK provisioner, renewed by a
  systemd timer); other vhosts in the same proxy stack keep Let's
  Encrypt because they serve browsers and OAuth callbacks.

Plan B failed in practice because Uni Köln IT drops :9000 between
subnets — the only hosts that could reach :9000 directly were on
input-b's own /24, which excludes essentially all clients. Plan
B-revised sidesteps the firewall constraint entirely (since :443 is
already permitted everywhere) without sacrificing the CCAT trust
chain that step-cli's `RootCAs`-only verification requires.

**Why it is defensible from a threat-model standpoint.**

- step-ca is designed for public internet exposure. Smallstep's own
  commercial offering runs this way, as do many third-party hosted
  step-ca deployments. The protocols it speaks (ACME, OIDC) *require*
  public reachability for large parts of the client base.
- The auth gates do the work. Nothing in the threat model above gets
  easier for an attacker when the endpoint moves from "reachable from
  one /24" to "reachable from the whole campus" or "the world." The
  network-layer restriction is not the security boundary.
- The per-vhost cert split also avoids the operational fragility of
  the Phase 1 trust-bundle workaround (`SSL_CERT_FILE` + appended PEMs
  in `~/.step/certs/root_ca.crt` broke `step ssh certificate` with
  multi-PEM errors). The CA presents a CCAT-rooted chain on the wire,
  so clients have no trust-bundle plumbing to maintain.

**What it buys operationally.**

- Developers can SSH to CCAT hosts from anywhere on the Uni Köln
  network with the same cert flow. Worldwide reach is an incremental
  step from here: today the Uni IT firewall narrows both :443 and the
  same-host :9000 rule, and it can be opened later without
  architectural change.
- SSHPOP renewal works for hosts at remote sites that can reach :443.

**What it does not change.**

- The attack surface in "public-by-design endpoints" above is
  identical whether the network range is "input-b /24", "Uni Köln",
  or "the world." Issuance is stopped by the auth gates, not by
  source-IP filtering.
- Incident-response procedures are unchanged.

**Mitigations available now and later.**

- nginx-proxy can apply per-vhost ACLs (already used to lock the CA
  vhost to the Uni Köln /16; see `proxy/data/vhost.d/`).
- step-ca's native :9000 is gated by firewalld via the `hsm_host`
  role's `ca_allowed_source_cidrs` variable.
- fail2ban watching the step-ca log can block sources with repeated
  auth failures.
- Uni Köln firewall opening is reversible; the per-vhost cert posture
  works regardless of whether the subnet is widened, narrowed, or
  worldwide.

## Phase 3 operational hardening checklist

These items are **not blockers** for Phase 1 dry-run or Phase 2 HSM
rehearsal. They **are blockers** for Phase 3, when real services start
trusting the CA.

- [ ] step-ca logs shipped to Loki via promtail and visible in
      Grafana.
- [ ] Grafana dashboard: cert issuance per provisioner per hour, with
      baseline annotations.
- [ ] Alert: 4xx error rate spikes above baseline (sign of abuse,
      misconfig, or scanning).
- [ ] Alert: repeated auth failures from a single source IP above a
      threshold in a rolling window.
- [ ] fail2ban (or equivalent) watching the step-ca log and
      temporarily blocking sources that trip the repeated-failure
      threshold.
- [ ] Subscribed to Smallstep security advisories; upgrade procedure
      for step-ca documented in {doc}`../ca-day-to-day` and
      tested on staging.
- [ ] Weekly "who got certs, who tried and failed" review as part of
      the security hygiene rhythm.
- [ ] JWK provisioner password rotation procedure documented and
      rehearsed end-to-end.
- [ ] Backup verification: `step-ca-data` volume backed up, and a
      **restore** rehearsed into a scratch environment. Dex state
      is regenerated from `step-ca/dex/config.yaml` in git; no
      separate backup needed.
- [ ] HSM #1 access procedure (the offline root ceremony for
      intermediate rotation) documented and walked through by at least
      two operators.
- [ ] Dex GitHub OAuth App audit: client ID + secret in vault,
      app restricted to the `ccatobs` org, scope is `read:org`,
      no other apps share the secret.
- [ ] GitHub team `ccatobs/datacenter` membership reviewed —
      everyone in the team should have a current operational need
      for SSH access to CCAT Data Center hosts. Leavers pruned.
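
The repeated-auth-failure alert and the fail2ban rule in the checklist above share the same rolling-window logic. A sketch of that logic, assuming the step-ca log has already been reduced to time-sorted `<epoch_seconds> <source_ip> <http_status>` lines (that extraction step depends on the deployed log format and is not shown). The helper name `noisy_sources` and the choice of 401/403 as failure statuses are assumptions.

```bash
# Sketch: print each source IP that reaches THRESHOLD auth failures
# within any WINDOW-second span.
# $1 = window in seconds, $2 = failure threshold.
# stdin: "<epoch_seconds> <source_ip> <http_status>", sorted by time.
noisy_sources() {
    awk -v window="$1" -v threshold="$2" '
        $3 == 401 || $3 == 403 {
            ip = $2
            n[ip]++
            ts[ip, n[ip]] = $1 + 0
            if (!(ip in start)) start[ip] = 1
            # Slide the window start past failures that are too old.
            while (ts[ip, start[ip]] < $1 - window) start[ip]++
            if (n[ip] - start[ip] + 1 >= threshold && !(ip in reported)) {
                print ip
                reported[ip] = 1
            }
        }'
}

# Usage: three failures within 60 seconds trips the rule.
#   noisy_sources 60 3 < failures.txt
```

The window and threshold are tuning parameters; suitable values depend on the baseline observed in Grafana.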

## Incident response sketch

Response procedures by suspected-failure mode. In every case the
short cert lifetimes mean most remediation is "revoke the mechanism
that issues, and wait" rather than "chase down every issued cert."

### JWK provisioner password leaked

Rotate the password. In-flight short-lived certs expire on their own;
no client re-bootstrap is needed because clients authenticate to the
CA with the password, not to each other.

```bash
# Rotate the provisioner password in the vault, then push it to the CA host.
ccat secrets rotate vault_step_ca_password --env production \
  && ccat secrets provision --host input-b
```

After provisioning, restart step-ca on input-b and confirm new
issuance works. Audit the step-ca log for any issuance during the
exposure window and revoke suspicious certs.
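
Narrowing the log to the exposure window can be scripted. The sketch below assumes step-ca logs one JSON object per line with a `"time"` field holding a UTC timestamp in a single consistent RFC 3339 format, so plain string comparison orders the entries correctly. The helper name `issued_between` and the timestamps in the usage comment are illustrative.

```bash
# Sketch: keep only log lines whose "time" field falls inside a window.
# $1 = window start, $2 = window end (same UTC timestamp format as the log).
issued_between() {
    awk -v from="$1" -v to="$2" '
        match($0, /"time":"[^"]*"/) {
            # Drop the 8-character "time":" prefix and the closing quote.
            t = substr($0, RSTART + 8, RLENGTH - 9)
            if (t >= from && t <= to) print
        }'
}

# Usage:
#   docker compose logs step-ca \
#     | issued_between 2024-05-01T00:00:00Z 2024-05-02T00:00:00Z
```

Lines without a `"time"` field are dropped, so the output contains only entries that can be placed in the window.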

### An SSH user cert has been used maliciously

Identify the principal from the step-ca issuance log. Remove the
user from the `ccatobs/datacenter` GitHub team — Dex checks team
membership on every authentication, so the next `step ssh login`
from that user fails immediately. The cert itself expires within
16 hours; no host-side action is required unless the principal
shows ongoing activity.

```bash
# Find issuance events for a given principal
docker compose logs step-ca | grep '"principal":"alice"'
```

### Dex static client secret leaked

The secret that step-ca uses to authenticate to Dex
(`vault_dex_stepca_client_secret`) could, if leaked, let an attacker
exchange OIDC codes for tokens on step-ca's behalf — useful only in
combination with an already-valid user authentication flow, so the
risk is bounded. Rotate:

```bash
ccat secrets rotate vault_dex_stepca_client_secret --env production
ccat secrets provision --host input-b
ccat ca down && ccat ca up        # reload Dex with new secret
ccat ca provisioner remove CCAT-GitHub
ccat ca provisioner sync          # re-add with new secret
```

### A ccatobs/datacenter team member has gone rogue

Remove them from the `ccatobs/datacenter` GitHub team. Dex checks
team membership live on every authentication, so any new
`step ssh login` will fail at the Dex layer. Their existing SSH
user cert expires within 16 hours. For faster eviction from active
sessions, force-terminate their SSH connections on the target
hosts.

### input-b has been compromised

The blast radius depends on the phase.

**Phase 1 (file-based intermediate key).** The intermediate signing
key must be assumed compromised because it sits on disk. Response:
full intermediate rotation via an offline root ceremony, then redeploy
trust bundles. Because Phase 1 keys are throwaway, the simpler answer
is to wipe and restart the ceremony from scratch.

**Phase 2+ (HSM-backed intermediate key).** The key itself cannot be
extracted from the HSM, but the attacker could have *used* it while
they had access to input-b. Response: rotate the intermediate
(ceremony with HSM #1, no root rotation needed), audit the step-ca
issuance log for anything signed during the exposure window, and
revoke or actively expire any suspicious certs. The root stays
intact; clients do not need to re-bootstrap.

In both phases, Dex's state on input-b is also within blast radius,
but Dex has no user database to leak — its entire config is in git
and the only secrets it holds are the GitHub OAuth client secret and
the static step-ca client secret, both in the Ansible vault. Rotate
both as part of the same response.

## Further reading

- {doc}`tls-and-pki` — PKI fundamentals, cert chains, key material
- {doc}`ca-architecture` — design context for the CCAT CA
- {doc}`ca-provisioner-set` — provisioner reference tables
- {doc}`../ca-client-onboarding` — laptop setup for `step ssh login`
- {doc}`../ca-day-to-day` — routine ops (stack lifecycle, issuance,
  expiry monitoring, backup)
- {doc}`../ca-provisioner-management` — provisioner add/update/remove,
  JWK password rotation
- {doc}`../ca-rotation-and-recovery` — intermediate / root rotation,
  disaster recovery
- {doc}`patch-management-and-supply-chain` — the upgrade cadence and
  supply-chain story that keeps step-ca itself patched
- [Smallstep step-ca documentation](https://smallstep.com/docs/step-ca/) —
  upstream reference
- [RFC 8555 — Automatic Certificate Management Environment (ACME)](https://datatracker.ietf.org/doc/html/rfc8555)