docs: EKS cluster governance at scale — status, roadmap, and AI agents #2

127 changes: 100 additions & 27 deletions design-docs/EKS_GOVERNANCE_STATUS_AND_ROADMAP.md
@@ -190,7 +190,7 @@ module "eks_deployment" {

With `repo_name` and `cluster_name` explicit, the fleet map is self-documenting:
a CSVD engineer can `grep -r repo_name clusters/` and instantly see every account repo
CSVD has ever written cluster HCL into. This is the "map of injections" Manuel referenced
CSVD has ever written cluster HCL into. This is the "cluster injection map" David raised on the call:
no external database, no spreadsheet. The source files are the inventory.
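
The `grep` view above can also be sketched as a small script. This is a minimal illustration, not part of the existing tooling: it assumes `repo_name` and `cluster_name` appear as plain string assignments in each `main.tf`, and the function names `parse_entry` and `fleet_inventory` are hypothetical.

```python
from __future__ import annotations

import re
from pathlib import Path

# Matches simple string assignments such as: repo_name = "some-repo"
# (assumption: values are plain string literals, not expressions)
ASSIGNMENT = re.compile(
    r'^\s*(repo_name|cluster_name)\s*=\s*"([^"]*)"', re.MULTILINE
)


def parse_entry(hcl_text: str) -> dict:
    """Return the repo_name/cluster_name literals found in one main.tf."""
    return {key: value for key, value in ASSIGNMENT.findall(hcl_text)}


def fleet_inventory(root: Path) -> list[dict]:
    """Collect one record per main.tf found anywhere under root."""
    records = []
    for main_tf in sorted(root.rglob("main.tf")):
        record = parse_entry(main_tf.read_text())
        record["path"] = str(main_tf.relative_to(root))
        records.append(record)
    return records
```

Calling `fleet_inventory(Path("clusters"))` from the repo root would return one record per cluster entry, which can be printed as a table or dumped to JSON.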

The cluster config is **fully declarative**. The diff between "what CSVD intended" and
@@ -209,6 +209,9 @@ The cluster config is **fully declarative**. The diff between "what CSVD intende
| No branch protection set at provisioning time | Same — governance is not enforced at the repo level |
| No operator-level fleet view | No single place for a CSVD engineer to see all cluster configs side-by-side without navigating multiple repos |
| `clusters/` has no README | New CSVD engineers don't know how to run a single-cluster or fleet update |
| `clusters/` is a flat directory | All clusters in one level — no way to reason by lifecycle (dev/prod) or owning team; will become unnavigable at 50+ clusters |
| Lambda/CodeBuild cannot place a new cluster entry at the right path | When provisioning creates the `clusters/<name>/main.tf` entry in `terraform-eks-deployment`, it has no parameter for lifecycle or team — every new cluster lands at the top level |
| `clusters/` lives inside the module repo it instantiates | `terraform-eks-deployment` is a **Terraform module** (library code). Embedding operational workspaces inside it conflates module versioning with fleet operations — you can't tag a module release without also tagging cluster state; you can't give fleet operators write access without giving them module write access |

That's the entire gap. **The architecture is sound. The tooling just needs to be completed.**

@@ -221,7 +224,7 @@ The target state, described simply:
1. **New cluster provisioned via SC** → repo created with correct folder structure + governance baked in
2. **Version bump or config change** → one command → PRs open across all affected clusters
3. **Customer wants to change something** → they open a PR in their account repo → CSVD reviews it
4. **CSVD wants to see fleet state** → one place (`terraform-eks-deployment/clusters/`) shows every cluster's config
4. **CSVD wants to see fleet state** → one place (`terraform-eks-fleet/clusters/`) shows every cluster's config, calling the versioned `terraform-eks-deployment` module

No new architecture is needed. These are operational completions of what's already built.

@@ -239,31 +242,99 @@ This is the prerequisite for account repo injection. Currently the module uses `
the GitHub repo name and the cluster directory name. They need to be separate:

```hcl
# clusters/adsd-tools-dev/main.tf — after the change
# clusters/dev/csvd/csvd-dev-mcm/main.tf — after all Step 0/0a/0b changes
module "eks_deployment" {
source = "../../"
repo_name = "533109815932-adsd-tools-nonprod-gov_apps-adsd-eks" # ← existing _apps-adsd-eks repo
cluster_name = "adsd-tools-dev" # ← folder written inside it
source = "git::https://github.e.it.census.gov/SCT-Engineering/terraform-eks-deployment.git?ref=v1.2.0"
# ↑ pinned module version — decoupled from fleet repo history

repo_name = "229685449397-csvd-dev-gov_apps-adsd-eks" # ← existing _apps-adsd-eks repo
cluster_name = "csvd-dev-mcm" # ← subfolder written inside applications/
repository_mode = "update"
...
}
```

This is a small variable rename in `variables.tf` + `main.tf`. Maybe 30 minutes.
Every cluster entry in `clusters/` gets updated to specify both fields.
`repo_name` / `cluster_name` is a small variable rename in `variables.tf` + `main.tf`. Maybe 30 minutes.
Every cluster entry gets updated to specify both fields.
After this change, cluster HCL lands in `applications/<cluster-name>/` inside the existing
account infrastructure repo — exactly where the `repo-layout.md` convention says it should go.

### Step 0b — Extract `clusters/` to a dedicated `terraform-eks-fleet` repo

`terraform-eks-deployment` is a **module** — it should be versioned, tagged, and treated like
a library. The `clusters/*/main.tf` workspaces are **instantiations** of that module — they
change constantly and should have their own commit history, CI, CODEOWNERS, and access controls.
Embedding them inside the module repo conflates two completely different maintenance concerns.

**What to do:** Create a new repo `SCT-Engineering/terraform-eks-fleet`. Move all `clusters/`
content there. Each workspace references `terraform-eks-deployment` as an external versioned
module (see Step 0 example above) rather than via `../../`.

Benefits:
- Module can be tagged and released (`v1.2.0`) without touching fleet state
- Fleet operators get write access to `terraform-eks-fleet` without touching the module
- `update_all_clusters.py`, the GitHub Action, and the `eks-fleet.code-workspace` all live here
- CodeBuild clones `terraform-eks-fleet` (not `terraform-eks-deployment`) when writing new cluster entries
- CODEOWNERS in `terraform-eks-fleet` governs who can land changes to production clusters

This is a repo creation + file move + change to the module source path. No logic changes.
Estimated: **half a day** including wiring CodeBuild to clone the new repo.
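
The source-path change can be scripted rather than edited by hand. The following is a rough sketch under stated assumptions: every workspace currently uses a relative `source = "../../"` style path, and the versioned source string is supplied by the caller. The function name `pin_module_source` is hypothetical.

```python
import re

# Matches a module source made only of "../" segments, e.g. "../../" or "../.."
RELATIVE_SOURCE = re.compile(r'^(\s*source\s*=\s*)"(?:\.\./?)+"', re.MULTILINE)


def pin_module_source(hcl_text: str, pinned: str) -> str:
    """Replace a relative module source with a pinned, versioned one.

    Text that already uses a non-relative source is returned unchanged,
    so running the rewrite twice is harmless.
    """
    return RELATIVE_SOURCE.sub(
        lambda match: f'{match.group(1)}"{pinned}"', hcl_text
    )
```

Applied to each moved `main.tf`, this leaves every other line untouched, which keeps the migration diff limited to the `source` line.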

### Step 0a — Restructure into a lifecycle/team hierarchy (done inside `terraform-eks-fleet`)

With `clusters/` in its own repo, restructure it into a navigable hierarchy:

```
clusters/
├── dev/
│ ├── adsd/
│ │ └── adsd-tools-dev/main.tf
│ └── csvd/
│ ├── csvd-dev-mcm/main.tf
│ ├── csvd-lab-dja/main.tf
│ └── csvd-lab-mcm/main.tf
├── prod/
│ ├── ois/
│ │ └── eks-ois-cribl-prod/main.tf
│ └── csvd/
│ └── csvd-mcm-common/main.tf
└── ...
```

Two dimensions:
- **Lifecycle** (`dev` / `prod` / `sandbox`) — controls which clusters `update_all_clusters.py`
targets by default (safe to run `--lifecycle dev` freely; `--lifecycle prod` requires `--force`)
- **Team** (`adsd` / `ois` / `csvd` / etc.) — matches the owning division; makes CODEOWNERS
scoping trivial (`clusters/prod/ois/` → `@SCT-Engineering/ois-eks-admins`)
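
The lifecycle gate described above could look roughly like this. It is a sketch, not the script's actual code: the flag names come from this document, but `build_parser`, `select_targets`, and the tuple layout are hypothetical, and the choice to skip `prod` on unfiltered runs unless `--force` is given is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """CLI flags as described in this document."""
    parser = argparse.ArgumentParser(prog="update_all_clusters.py")
    parser.add_argument("--lifecycle", choices=["dev", "prod", "sandbox"])
    parser.add_argument("--team")
    parser.add_argument("--force", action="store_true")
    parser.add_argument("--dry-run", action="store_true")
    return parser


def select_targets(clusters, lifecycle=None, team=None, force=False):
    """Filter (lifecycle, team, name) tuples down to the names to update.

    Asking for prod explicitly without --force is an error; on broader
    runs, prod clusters are skipped unless force is set (assumption).
    """
    if lifecycle == "prod" and not force:
        raise SystemExit("refusing to target prod without --force")
    selected = []
    for c_lifecycle, c_team, name in clusters:
        if lifecycle and c_lifecycle != lifecycle:
            continue
        if team and c_team != team:
            continue
        if c_lifecycle == "prod" and not force:
            continue
        selected.append(name)
    return selected
```

Keeping the gate in one pure function makes it easy to unit-test the "prod needs `--force`" rule without touching Terraform.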

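**The `update_all_clusters.py` script walks this tree recursively**, so the restructure requires
no logic changes in the script beyond the discovery pattern: it must use
`glob('clusters/**/main.tf', recursive=True)` instead of `glob('clusters/*/main.tf')`.
(In Python's `glob`, `**` only recurses when `recursive=True` is passed.)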
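
A minimal sketch of the discovery step, assuming Python's standard `glob` module with `recursive=True` and the `clusters/<lifecycle>/<team>/<cluster>/main.tf` layout shown above. The function name `discover_workspaces` and the returned dictionary keys are hypothetical.

```python
import glob
import os


def discover_workspaces(root="clusters"):
    """Find every main.tf under root and derive lifecycle/team from its path."""
    pattern = os.path.join(root, "**", "main.tf")
    found = []
    for path in sorted(glob.glob(pattern, recursive=True)):
        parts = os.path.relpath(path, root).split(os.sep)
        # Expect exactly: <lifecycle>/<team>/<cluster>/main.tf
        if len(parts) != 4:
            continue  # skip entries still in the old flat layout
        lifecycle, team, cluster, _ = parts
        found.append(
            {"lifecycle": lifecycle, "team": team, "cluster": cluster, "path": path}
        )
    return found
```

Deriving `lifecycle` and `team` from the path means the directory tree stays the single source of truth; no extra metadata file is needed.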

**SC parameters to add:** The SC product form needs two new optional fields, `team` and
`lifecycle`, which the Lambda threads through to CodeBuild as `TF_VAR_team` and
`TF_VAR_lifecycle`. CodeBuild uses them to write the new `clusters/<lifecycle>/<team>/<cluster>/main.tf`
entry into `terraform-eks-fleet` (see Step 0b) as part of provisioning. If omitted, `lifecycle`
defaults to `dev` and `team` is inferred from the `cluster_name` prefix (e.g. a name starting
with `csvd-` → team `csvd`).
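
The defaulting rule can be illustrated as follows. This is a sketch under assumptions: the environment variable names are the ones given above, while `resolve_cluster_path`, `path_from_env`, and the "text before the first hyphen" reading of the prefix heuristic are hypothetical.

```python
def resolve_cluster_path(cluster_name, lifecycle=None, team=None):
    """Build the fleet-repo path for a new cluster entry.

    lifecycle defaults to "dev"; team defaults to the text before the
    first hyphen in cluster_name (assumed form of the prefix heuristic).
    """
    lifecycle = lifecycle or "dev"
    team = team or cluster_name.split("-", 1)[0]
    return f"clusters/{lifecycle}/{team}/{cluster_name}/main.tf"


def path_from_env(env, cluster_name):
    """Resolve the path from a mapping such as os.environ."""
    return resolve_cluster_path(
        cluster_name,
        lifecycle=env.get("TF_VAR_lifecycle"),
        team=env.get("TF_VAR_team"),
    )
```

Note that a pure prefix heuristic would file a name like `eks-ois-cribl-prod` under team `eks`, so names that do not start with their team need the explicit `team` field.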

This is a one-time migration of the existing five entries, then all new provisioned clusters
land in the right place automatically. Estimated: **2 hours** (migration) + **30 min** (SC/Lambda param).

---

### Step 1 — `update_all_clusters.py` (the highest-value deliverable)

A simple Python script, ~100 lines, that walks the `clusters/` tree and runs `tf apply` in each workspace:

```python
# scripts/update_all_clusters.py
# Usage:
# python scripts/update_all_clusters.py # all clusters
# python scripts/update_all_clusters.py --filter csvd # clusters matching 'csvd'
# python scripts/update_all_clusters.py --dry-run # tf plan only
# python scripts/update_all_clusters.py # all clusters
# python scripts/update_all_clusters.py --lifecycle dev # dev clusters only (safe default)
# python scripts/update_all_clusters.py --lifecycle prod --force # prod clusters (requires --force)
# python scripts/update_all_clusters.py --team adsd # single team across all lifecycles
# python scripts/update_all_clusters.py --filter csvd-lab # clusters matching a name substring
# python scripts/update_all_clusters.py --dry-run # tf plan only, no apply
```

This is the fleet management mechanism. When a version needs to be bumped across 20 clusters,
@@ -612,25 +683,27 @@ These are the concrete deliverables, ordered by impact-to-effort ratio.

| Priority | Task | Effort | Owner | Unlocks |
|----------|------|--------|-------|---------|
| 1 | Decouple `repo_name` from `cluster_name` in `variables.tf` + update all `clusters/*/main.tf` | 30 min | David | Cluster HCL lands in correct account repos |
| 2 | `scripts/update_all_clusters.py` with `--dry-run`, maintenance window gating, workspace gen | 1 day | David | Fleet-wide updates in one command + operator view + calendar respect |
| 3 | Add CODEOWNERS + branch protection to `managed_extra_files` in `main.tf` | 2 hours | David | Governance on all future provisioned repos |
| 4 | `clusters/README.md` | 30 min | David | CSVD onboarding for cluster ops |
| 5 | Decide: inject into existing `_apps-adsd-eks` repo vs always create new | 0 (discussion) | Manuel | SC provisioning target |
| 6 | Lambda: check for existing `_apps-adsd-eks` repo → inject if exists, create if not | 1 day | David | Multi-cluster accounts, clean fleet |
| 7 | Retrofit branch protection on existing cluster repos | 2 hours | David | Governance retroactively applied |
| 8 | Add `maintenance_window` local block to each `clusters/*/main.tf` | 1 hour | David | Calendar-gated fleet updates |
| 9 | GitHub Action in `terraform-eks-deployment` to regenerate `eks-fleet.code-workspace` on `clusters/**` push | 2 hours | David | Always-current operator workspace, no manual sync |
| 10 | Write `~/.copilot/skills/eks-fleet-query` skill | 2 hours | David | AI fleet queries from VS Code |
| 11 | Write `~/.copilot/skills/eks-maintenance-check` skill | 1 hour | David | AI maintenance window queries |
| 12 | `eks-upgrade` + `eks-pr-reviewer` agent instructions committed to relevant repos | 2 hours | David | AI upgrade planning + automated PR review classification |
| 13 | `update_all_clusters.py` wired to CodeBuild (headless fleet ops) | 1 day | David | Fully automated fleet updates |

**Items 1–4 can be done this sprint. Item 5 is a conversation. Items 6–13 follow.**
| 1 | Decouple `repo_name` from `cluster_name` in `variables.tf` + update all cluster entries | 30 min | David | Cluster HCL lands in correct account repos |
| 2 | Extract `clusters/` to new `SCT-Engineering/terraform-eks-fleet` repo; update module source to versioned ref | 0.5 day | David | Module and fleet decoupled; clean access controls; prerequisite for all fleet tooling |
| 3 | Restructure into `clusters/<lifecycle>/<team>/<name>/` hierarchy within `terraform-eks-fleet` | 2 hours | David | Navigable fleet; lifecycle-gated updates; team-scoped CODEOWNERS |
| 4 | `scripts/update_all_clusters.py` with `--lifecycle`, `--team`, `--dry-run`, `--force`, maintenance window gating | 1 day | David | Fleet-wide updates in one command + calendar respect |
| 5 | Add CODEOWNERS + branch protection to `managed_extra_files` in `main.tf` | 2 hours | David | Governance on all future provisioned repos |
| 6 | `clusters/README.md` in `terraform-eks-fleet` | 30 min | David | CSVD onboarding for cluster ops |
| 7 | Decide: inject into existing `_apps-adsd-eks` repo vs always create new | 0 (discussion) | Manuel | SC provisioning target |
| 8 | Lambda: check for existing `_apps-adsd-eks` repo → inject if exists, create if not; write new cluster entry to `terraform-eks-fleet` | 1 day | David | Multi-cluster accounts, clean fleet |
| 9 | Retrofit branch protection on existing cluster repos | 2 hours | David | Governance retroactively applied |
| 10 | Add `maintenance_window` local block to each cluster entry | 1 hour | David | Calendar-gated fleet updates |
| 11 | GitHub Action in `terraform-eks-fleet` to regenerate `eks-fleet.code-workspace` on `clusters/**` push | 2 hours | David | Always-current operator workspace, no manual sync |
| 12 | Write `~/.copilot/skills/eks-fleet-query` skill | 2 hours | David | AI fleet queries from VS Code |
| 13 | Write `~/.copilot/skills/eks-maintenance-check` skill | 1 hour | David | AI maintenance window queries |
| 14 | `eks-upgrade` + `eks-pr-reviewer` agent instructions committed to relevant repos | 2 hours | David | AI upgrade planning + automated PR review classification |
| 15 | `update_all_clusters.py` wired to CodeBuild (headless fleet ops) | 1 day | David | Fully automated fleet updates |

**Items 1–6 can be done this sprint. Item 7 is a conversation. Items 8–15 follow.**

---

## What This Looks Like After Items 1–4
## What This Looks Like After Items 1–6

A CSVD engineer doing a fleet-wide EKS version bump:
