# CCAT Certificate Authority — Threat Model & Attack Surface

This document is the security-review companion to {doc}`ca-architecture`. Where that document explains **how** the CCAT step-ca deployment is built and operated, this one explains **what it is exposed to**, what defends it, and what to do when something goes wrong. It is intended as reference material for security review, incident response, and the Phase 3 go/no-go decision on worldwide exposure of `ca.ccat.uni-koeln.de:9000`.

For PKI fundamentals (what a cert is, how chains work, why short lifetimes matter), see {doc}`tls-and-pki`. For OS and container patch cadence — which directly affects how fast we close CVEs in step-ca itself — see {doc}`patch-management-and-supply-chain`.

```{contents}
:local:
:depth: 2
```

## Scope

**What this document covers.** The public and auth-gated attack surface of the deployed step-ca instance, the Dex OIDC provider it relies on, realistic attacker scenarios, the defense-in-depth layers that stop them, and an incident-response sketch per failure mode.

**What it does not cover.** Cryptographic primitives and PKI theory (see {doc}`tls-and-pki`), day-to-day operational runbooks such as adding provisioners or rotating the intermediate (see {doc}`../ca-provisioner-management` and {doc}`../ca-rotation-and-recovery`), and the broader supply-chain story for OS and container updates (see {doc}`patch-management-and-supply-chain`).

### Deployment phases

The threat model differs sharply between the rollout phases:

| Phase | Root key | Production trust | Threat-model posture |
|---|---|---|---|
| **Phase 1** — dry-run | File-based, throwaway | No production service trusts the root yet | Everything issued is a dress rehearsal. A full compromise = wipe and redo the ceremony. Low stakes. |
| **Phase 2** — HSM rehearsal | HSM-backed, real | Still no production trust; rotation drills only | Keys are extraction-resistant. Signing key never leaves the HSM. |
| **Phase 3** — steady state | HSM-backed, real | All CCAT hosts trust the root; SSH, mTLS, internal web UIs chain to it | Real blast radius. The operational hardening checklist below must be complete before entering this phase. |

### Scale assumption

CCAT is a small, university-hosted, internal-use PKI: roughly **20 trusted humans** (the ccatobs GitHub org) plus a handful of service accounts and hosts. The threat model reflects this — we are not defending a public CA with millions of subscribers. We do not need WebPKI-grade transparency logs, CRLs, or staffed SOCs. We *do* need the discipline that comes with being an internal trust anchor for real infrastructure.

## Attack surface

Two DNS names are relevant:

- `ca.ccat.uni-koeln.de` — step-ca's API. Fronted by nginx-proxy on port 443 with a **CCAT-rooted** TLS cert (issued by step-ca itself, renewed via systemd timer), so step-cli's `RootCAs`-only verification works without any client-side trust-bundle plumbing. step-ca's native :9000 is also published on input-b's host for same-host workflows but is firewalled to the Uni Köln /16 (and dropped between subnets by Uni IT regardless), so :443 is the universal client path.
- `auth.ccat.uni-koeln.de` — Dex's OIDC endpoints and GitHub login redirect page, on port 443 with Let's Encrypt termination via acme-companion. Dex has no admin UI and no path that needs IP gating.

### Public-by-design endpoints

These endpoints are **meant** to be reachable by anyone. Serving them to the entire internet is not a security regression — it is how the protocols are specified. The equivalent endpoints on Let's Encrypt and every other public CA are worldwide-reachable by design.

| Endpoint | Purpose | What an attacker learns |
|---|---|---|
| `GET /health` | Liveness probe | The CA is up |
| `GET /roots.pem` | Public root certificate | The public half of our trust anchor — exactly the material every client has to fetch anyway |
| `GET /provisioners` | Provisioner discovery | Provisioner names, types, and public config: OIDC issuer URL, OIDC client ID, allowed group claims, ACME directory URL |
| ACME directory (`/acme/acme/directory`) | RFC 8555 discovery | Standard ACME endpoints |
| Dex OIDC discovery (`/.well-known/openid-configuration`) | OIDC metadata | Issuer URL, JWKS URI, supported flows |

What is **not** in those responses:

- JWK provisioner passwords (these are held in the Ansible vault and never served)
- OIDC client secrets (held by Dex via its static clients config, never emitted)
- Signing keys or any secret material — the CA emits only public certs
- User identities, SSH principal lists, or issued-cert history

Publishing provisioner metadata is intentional. A client that wants to request a cert needs to know which provisioners exist and how to authenticate to them. This is the same discovery model Let's Encrypt uses via its ACME directory.

### Auth-gated endpoints — where the real security lives

Everything that actually **issues** a cert sits behind one of four authentication gates, each with its own strength profile.

**`POST /1.0/ssh/sign` — OIDC (CCAT-GitHub provisioner)**

Requires a Dex-issued OIDC token whose `groups` claim contains the `ccatobs/datacenter` GitHub team slug. Strong by design: an attacker has to clear three independent gates — a valid GitHub identity, successful GitHub OAuth with `read:org` scope, and actual membership in the `ccatobs/datacenter` team at the moment Dex calls GitHub's team-membership endpoint. Membership is checked live against GitHub on every authentication, so a user removed from the team cannot authenticate even if their browser session is still warm. Output is a 16-hour SSH user cert with the user's principal only.

**`POST /1.0/sign` — JWK (prod-services, staging-services)**

Requires knowledge of the provisioner password (`STEP_CA_PASSWORD`, held in the Ansible vault as `vault_step_ca_password`). Medium strength: it is a single long-lived secret, but it is high-entropy, scoped to a single provisioner, and never leaves the vault except during `ccat secrets provision`. An attacker with vault access already has a much larger problem. Output is 30- or 90-day x509 certs with principal restrictions enforced by the provisioner template.

**`POST /1.0/ssh/sign` — JWK (service-accounts)**

Same mechanism as above, but issues 24-hour SSH service certs that are auto-renewed every 6 hours by the target host. Blast radius on compromise is small because the certs expire quickly on their own.

**`POST /acme/...` — ACME challenge response**

Strong by protocol design: the attacker must prove control of the hostname they are requesting a cert for, via HTTP-01, DNS-01, or TLS-ALPN-01. An attacker who does not control `example.ccat.uni-koeln.de` cannot satisfy a challenge for it, full stop. Output is 90-day x509.

**`POST /1.0/ssh/renew` — SSHPOP**

Requires possession of a currently-valid SSH host cert. There is **no bootstrap path** through this endpoint — it only renews an existing cert, it never issues the first one. An attacker who already has a valid host cert is an attacker who already has the host.

## Realistic attack scenarios

Ordered roughly from "happens every day" to "we hope this is hypothetical."

### 1. Random reconnaissance scans

Botnets sweep the internet continuously. Exposing port 9000 worldwide means we will see constant, low-grade scan traffic.

- **What they find.** `/health`, `/roots.pem`, `/provisioners`. Everything public-by-design.
- **What's actionable for them.** Nothing. The information is equivalent to what Let's Encrypt publishes about itself.
- **What stops them.** Nothing needs to — there's nothing to steal.
- **What monitoring catches it.** Loki + Grafana will show steady low-rate 200s on the public endpoints. Useful as baseline.

### 2. Vulnerability scanning against known step-ca CVEs

Scanners try CVEs indiscriminately. step-ca is open source and actively maintained by Smallstep.

- **What they gain.** If we're patched, nothing. If we're not, it depends on the CVE.
- **What stops them.** Prompt patching. Subscribe to Smallstep's security advisories. See {doc}`patch-management-and-supply-chain` for the general upgrade cadence story.
- **What monitoring catches it.** 4xx/5xx spikes, odd user-agent strings in Loki, Grafana alerts on error-rate anomalies.

### 3. Brute-force against JWK provisioner passwords

The `prod-services` / `staging-services` / `service-accounts` provisioners authenticate with a password. An attacker who knows the provisioner name could attempt to guess the password by repeatedly POSTing sign requests.

- **What they gain.** Arithmetically bounded to nothing: the password is high-entropy (generated via `ccat secrets rotate`), and step-ca enforces request rate limits. A 128-bit password against a rate-limited endpoint is not brute-forceable in any human timescale.
- **What stops them.** Password entropy + step-ca rate limits + nginx-proxy rate limiting if needed + fail2ban on repeated 401s.
- **What monitoring catches it.** Repeated auth failures from the same source IP in the step-ca log.

### 4. Denial of service / cert spamming

Flood the sign endpoint to exhaust resources or fill the DB with issued certs.

- **What they gain.** Degraded availability for legitimate issuance.
- **What stops them.** step-ca has built-in per-provisioner rate limits. nginx-proxy can add an outer rate limit. Docker port binding to a specific interface limits blast surface. Host iptables provides a final layer.
- **What monitoring catches it.** Request-rate dashboards in Grafana; alerts on sustained high throughput.

### 5. Targeted phishing of the OIDC flow

A social-engineering attack against a ccatobs/datacenter team member that tricks them into completing a Dex OIDC flow the attacker initiated.

- **Key observation.** This attack works identically against a localhost-only CA. Exposing port 9000 worldwide neither helps nor hurts the attacker here — the flow is in the browser, not on the network.
- **What they gain.** A 16-hour SSH user cert for the victim's principal.
- **What stops them.** GitHub 2FA on the upstream identity (mandatory on the ccatobs org); the team-membership check happens on every login, so any user not currently in `ccatobs/datacenter` is rejected at Dex; user awareness training; the 16-hour lifetime bounds the blast window; removing the victim from the GitHub team immediately blocks any new authentication attempts.
- **What monitoring catches it.** Anomalous issuance patterns for a user (odd hours, unexpected source IP), cross-checked against the user's usual behavior.

## Defense layers

The exposure above is safe because no single layer is load-bearing — each attack scenario is stopped by multiple independent defenses.

1. **Network layer.** Optional and cumulative:
   - Uni Köln firewall (outermost — currently closed on TCP 9000, opening is the Phase 3 request)
   - Host iptables on input-b
   - Docker port binding (can bind 9000 to a specific interface only)
   - nginx-proxy IP allowlists (available via `proxy/data/vhost.d/` drop-in files if ever needed; not currently used on auth.ccat.uni-koeln.de because Dex has no admin UI to gate)
2. **Application layer.** step-ca's own auth model: every endpoint that issues a cert requires one of the auth mechanisms above. There is no unauthenticated path to issuance.
3. **Provisioner layer.** Each provisioner has its own independent auth gate. Compromising one provisioner does not compromise the others.
4. **Authorization gate.** Role-based or challenge-based checks on top of authentication: `ccatobs/datacenter` GitHub team membership enforced by Dex for OIDC, password secrecy for JWK, challenge response for ACME, cert possession for SSHPOP.
5. **Issued-cert constraints.** Short lifetimes are the single largest blast-radius reducer:
   - 16h human SSH user certs
   - 24h service-account SSH certs (renewed every 6h)
   - 7d SSH host certs
   - 30–90d TLS certs
   - Signing key never leaves the HSM (Phase 2+)
6. **Target-host opt-in.** A cert is only useful against hosts that have been told to trust the CCAT CA, via `/etc/ssh/trusted_user_ca_keys` deployed by the `ca_trust` Ansible role. A leaked cert against a host that doesn't trust us is a leaked cert against a host that doesn't care.

| Scenario | Stopped by layer(s) |
|---|---|
| Recon scans | 1, 2 |
| Known-CVE scanning | Patch cadence + 1, 2 |
| JWK brute force | 1, 2, 3, 4 |
| DoS / cert spam | 1, 2, 5 |
| OIDC phishing | 4, 5, 6 |

## The Phase 3 decision: per-vhost cert split on :443

The team explored two designs for exposing the CA API to the internal client population (and eventually the wider observatory team):

- **Plan B**: open TCP 9000 worldwide and have step-cli talk directly to step-ca's native TLS endpoint, bypassing nginx-proxy.
- **Plan B-revised** (chosen): keep nginx-proxy but give the `ca.ccat.uni-koeln.de` vhost a **CCAT-rooted** cert (issued by step-ca itself via the prod-services JWK provisioner, renewed by a systemd timer); other vhosts in the same proxy stack keep Let's Encrypt because they serve browsers and OAuth callbacks.

Plan B failed in practice because Uni Köln IT drops :9000 between subnets — the only hosts that could reach :9000 directly were on input-b's own /24, which excludes essentially all clients. Plan B-revised sidesteps the firewall constraint entirely (since :443 is already permitted everywhere) without sacrificing the CCAT trust chain that step-cli's `RootCAs`-only verification requires.
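
To make "`RootCAs`-only verification" concrete for a client other than step-cli, here is a minimal shell sketch. It is an illustration and not part of the CCAT tooling. It wraps `curl` so that the server chain is checked against the CCAT root alone, which is the condition the CCAT-rooted vhost cert is meant to satisfy. The function name `ca_curl` and the variable `CCAT_ROOT_CA` are hypothetical; the default path reuses the `~/.step/certs/root_ca.crt` location mentioned in this document, and the exact interaction between `--cacert` and the system trust store can depend on curl's TLS backend.

```bash
# Hypothetical helper: query the CA API while trusting only the CCAT root.
# CCAT_ROOT_CA is an assumed variable name; the default is the step-cli
# root location referenced elsewhere in this document.
: "${CCAT_ROOT_CA:=$HOME/.step/certs/root_ca.crt}"

# --cacert  verify the peer against this PEM file rather than the default bundle
# --fail    exit non-zero on HTTP 4xx/5xx responses
ca_curl() {
  curl --silent --show-error --fail --cacert "$CCAT_ROOT_CA" "$@"
}
```

For example, `ca_curl https://ca.ccat.uni-koeln.de/health` would be expected to succeed once the vhost presents a CCAT-rooted chain, whereas a plain `curl` against the same URL should fail certificate verification on a machine that does not trust the CCAT root system-wide.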

**Why it is defensible from a threat-model standpoint.**

- step-ca is designed for public internet exposure. Smallstep's own commercial offering runs this way, as do many third-party hosted step-ca deployments. The protocols it speaks (ACME, OIDC) *require* public reachability for large parts of the client base.
- The auth gates do the work. Nothing in the threat model above gets easier for an attacker when the endpoint moves from "reachable from one /24" to "reachable from the whole campus" or "the world." The network-layer restriction is not the security boundary.
- Per-vhost cert split also dodges the operational fragility of the Phase 1 trust-bundle workaround (`SSL_CERT_FILE` + appended PEMs in `~/.step/certs/root_ca.crt` broke `step ssh certificate` with multi-PEM errors). The CA presents a CCAT-rooted chain at the wire and clients have no client-side bundle plumbing to maintain.

**What it buys operationally.**

- Developers can SSH to CCAT hosts from anywhere on the Uni Köln network with the same cert flow. Worldwide reach is incremental from here: today the Uni IT firewall restricts both :443 and the :9000 same-host rule, and either can be opened later without architectural change.
- SSHPOP renewal works for hosts at remote sites that can reach :443.

**What it does not change.**

- The attack surface in "public-by-design endpoints" above is identical whether the network range is "input-b /24", "Uni Köln", or "the world." Issuance is stopped by the auth gates, not by source-IP filtering.
- Incident-response procedures are unchanged.

**Mitigations available now and later.**

- nginx-proxy can apply per-vhost ACLs (already used to lock the CA vhost to the Uni Köln /16; see `proxy/data/vhost.d/`).
- step-ca's native :9000 is gated by firewalld via the `hsm_host` role's `ca_allowed_source_cidrs` variable.
- fail2ban watching the step-ca log can block sources with repeated auth failures.
- Uni Köln firewall opening is reversible; the per-vhost cert posture works regardless of whether the subnet is widened, narrowed, or worldwide.

## Phase 3 operational hardening checklist

These items are **not blockers** for Phase 1 dry-run or Phase 2 HSM rehearsal. They **are blockers** for Phase 3, when real services start trusting the CA.

- [ ] step-ca logs shipped to Loki via promtail and visible in Grafana.
- [ ] Grafana dashboard: cert issuance per provisioner per hour, with baseline annotations.
- [ ] Alert: 4xx error rate spikes above baseline (sign of abuse, misconfig, or scanning).
- [ ] Alert: repeated auth failures from a single source IP above a threshold in a rolling window.
- [ ] fail2ban (or equivalent) watching the step-ca log and temporarily blocking sources that trip the repeated-failure threshold.
- [ ] Subscribed to Smallstep security advisories; upgrade procedure for step-ca documented in {doc}`../ca-day-to-day` and tested on staging.
- [ ] Weekly "who got certs, who tried and failed" review as part of the security hygiene rhythm.
- [ ] JWK provisioner password rotation procedure documented and rehearsed end-to-end.
- [ ] Backup verification: `step-ca-data` volume backed up, and a **restore** rehearsed into a scratch environment. Dex state is regenerated from `step-ca/dex/config.yaml` in git; no separate backup needed.
- [ ] HSM #1 access procedure (the offline root ceremony for intermediate rotation) documented and walked through by at least two operators.
- [ ] Dex GitHub OAuth App audit: client ID + secret in vault, app restricted to the `ccatobs` org, scope is `read:org`, no other apps share the secret.
- [ ] GitHub team `ccatobs/datacenter` membership reviewed — everyone in the team should have a current operational need for SSH access to CCAT Data Center hosts. Leavers pruned.

## Incident response sketch

Response procedures by suspected-failure mode. In every case the short cert lifetimes mean most remediation is "revoke the mechanism that issues, and wait" rather than "chase down every issued cert."

### JWK provisioner password leaked

Rotate the password. In-flight short-lived certs expire on their own; no client re-bootstrap is needed because clients authenticate to the CA with the password, not to each other.

```bash
ccat secrets rotate vault_step_ca_password --env production && ccat secrets provision --host input-b
```

After provisioning, restart step-ca on input-b and confirm new issuance works. Audit the step-ca log for any issuance during the exposure window and revoke suspicious certs.

### An SSH user cert has been used maliciously

Identify the principal from the step-ca issuance log. Remove the user from the `ccatobs/datacenter` GitHub team — Dex checks team membership on every authentication, so the next `step ssh login` from that user fails immediately. The cert itself expires within 16 hours; no host-side action is required unless the principal shows ongoing activity.

```bash
# Find issuance events for a given principal
docker compose logs step-ca | grep '"principal":"alice"'
```

### Dex static client secret leaked

The secret that step-ca uses to authenticate to Dex (`vault_dex_stepca_client_secret`) could, if leaked, let an attacker exchange OIDC codes for tokens on step-ca's behalf — useful only in combination with an already-valid user authentication flow, so the risk is bounded. Rotate:

```bash
ccat secrets rotate vault_dex_stepca_client_secret --env production
ccat secrets provision --host input-b
ccat ca down && ccat ca up                 # reload Dex with new secret
ccat ca provisioner remove CCAT-GitHub
ccat ca provisioner sync                   # re-add with new secret
```

### A ccatobs/datacenter team member has gone rogue

Remove them from the `ccatobs/datacenter` GitHub team. Dex checks team membership live on every authentication, so any new `step ssh login` will fail at the Dex layer. Their existing SSH user cert expires within 16 hours. For faster eviction from active sessions, force-terminate their SSH connections on the target hosts.

### input-b has been compromised

The blast radius depends on the phase.

**Phase 1 (file-based intermediate key).** The intermediate signing key must be assumed compromised — it sits on disk. Response: full intermediate rotation via an offline root ceremony, redeploy trust bundles (but since Phase 1 = throwaway, the simpler answer is to wipe and restart the ceremony from scratch).

**Phase 2+ (HSM-backed intermediate key).** The key itself cannot be extracted from the HSM, but the attacker could have *used* it while they had access to input-b. Response: rotate the intermediate (ceremony with HSM #1, no root rotation needed), audit the step-ca issuance log for anything signed during the exposure window, and revoke or actively expire any suspicious certs. The root stays intact; clients do not need to re-bootstrap.

In both phases, Dex's state on input-b is also within blast radius, but Dex has no user database to leak — its entire config is in git and the only secrets it holds are the GitHub OAuth client secret and the static step-ca client secret, both in the Ansible vault. Rotate both as part of the same response.
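
Several of the responses above call for auditing the step-ca log for issuance during an exposure window. A small shell filter can sketch that step. It is an illustration under assumptions: it presumes one JSON object per log line, with a `"principal"` field (as the `grep` example above implies) and a `"time"` field holding an RFC 3339 UTC timestamp. The `time` field name and the function name `issued_in_window` are hypothetical, so the patterns should be adjusted to the actual log shape.

```bash
# issued_in_window PRINCIPAL START END   (reads log lines on stdin)
# Prints the lines for PRINCIPAL whose timestamp lies in [START, END].
# Assumed (hypothetical) fields: "time":"<RFC 3339 UTC>" and "principal":"<name>".
# UTC timestamps of equal precision sort lexicographically, so a plain
# string comparison is enough to test the window.
issued_in_window() {
  local principal=$1 start=$2 end=$3
  grep -F "\"principal\":\"${principal}\"" |
    awk -v s="$start" -v e="$end" '
      match($0, /"time":"[^"]*"/) {
        # Skip the 8 characters of "time":" and drop the closing quote
        t = substr($0, RSTART + 8, RLENGTH - 9)
        if (t >= s && t <= e) print
      }'
}
```

A possible invocation is `docker compose logs step-ca | issued_in_window alice 2025-03-01T10:00:00Z 2025-03-01T23:59:59Z`. Lines without a recognisable `time` field are skipped, so an empty result should be read as "nothing matched the assumed format" and not as proof that nothing was issued.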

## Further reading

- {doc}`tls-and-pki` — PKI fundamentals, cert chains, key material
- {doc}`ca-architecture` — design context for the CCAT CA
- {doc}`ca-provisioner-set` — provisioner reference tables
- {doc}`../ca-client-onboarding` — laptop setup for `step ssh login`
- {doc}`../ca-day-to-day` — routine ops (stack lifecycle, issuance, expiry monitoring, backup)
- {doc}`../ca-provisioner-management` — provisioner add/update/remove, JWK password rotation
- {doc}`../ca-rotation-and-recovery` — intermediate / root rotation, disaster recovery
- {doc}`patch-management-and-supply-chain` — the upgrade cadence and supply-chain story that keeps step-ca itself patched
- [Smallstep step-ca documentation](https://smallstep.com/docs/step-ca/) — upstream reference
- [RFC 8555 — Automatic Certificate Management Environment (ACME)](https://datatracker.ietf.org/doc/html/rfc8555)