diff --git a/design-docs/EKS_GOVERNANCE_STATUS_AND_ROADMAP.md b/design-docs/EKS_GOVERNANCE_STATUS_AND_ROADMAP.md
index 596f5ef..74ff70d 100644
--- a/design-docs/EKS_GOVERNANCE_STATUS_AND_ROADMAP.md
+++ b/design-docs/EKS_GOVERNANCE_STATUS_AND_ROADMAP.md
@@ -190,7 +190,7 @@ module "eks_deployment" {
 
 With `repo_name` and `cluster_name` explicit, the fleet map is self-documenting: a CSVD
 engineer can `grep -r repo_name clusters/` and instantly see every account repo
-CSVD has ever written cluster HCL into. This is the "map of injections" Manuel referenced —
+CSVD has ever written cluster HCL into. This is the "cluster injection map" David raised on the call —
 no external database, no spreadsheet. The source files are the inventory.
 
 The cluster config is **fully declarative**. The diff between "what CSVD intended" and
@@ -209,6 +209,9 @@ The cluster config is **fully declarative**. The diff between "what CSVD intende
 | No branch protection set at provisioning time | Same — governance is not enforced at the repo level |
 | No operator-level fleet view | No single place for a CSVD engineer to see all cluster configs side-by-side without navigating multiple repos |
 | `clusters/` has no README | New CSVD engineers don't know how to run a single-cluster or fleet update |
+| `clusters/` is a flat directory | All clusters in one level — no way to reason by lifecycle (dev/prod) or owning team; will become unnavigable at 50+ clusters |
+| Lambda/CodeBuild cannot place a new cluster entry at the right path | When provisioning creates the `clusters//main.tf` entry in `terraform-eks-deployment`, it has no parameter for lifecycle or team — every new cluster lands at the top level |
+| `clusters/` lives inside the module repo it instantiates | `terraform-eks-deployment` is a **Terraform module** (library code). Embedding operational workspaces inside it conflates module versioning with fleet operations — you can't tag a module release without also tagging cluster state; you can't give fleet operators write access without giving them module write access |
 
 That's the entire gap. **The architecture is sound. The tooling just needs to be completed.**
 
@@ -221,7 +224,7 @@ The target state, described simply:
 1. **New cluster provisioned via SC** → repo created with correct folder structure + governance baked in
 2. **Version bump or config change** → one command → PRs open across all affected clusters
 3. **Customer wants to change something** → they open a PR in their account repo → CSVD reviews it
-4. **CSVD wants to see fleet state** → one place (`terraform-eks-deployment/clusters/`) shows every cluster's config
+4. **CSVD wants to see fleet state** → one place (`terraform-eks-fleet/clusters/`) shows every cluster's config, calling the versioned `terraform-eks-deployment` module
 
 No new architecture is needed. These are operational completions of what's already built.
 
@@ -239,21 +242,86 @@ This is the prerequisite for account repo injection. Currently the module uses `
 the GitHub repo name and the cluster directory name. They need to be separate:
 
 ```hcl
-# clusters/adsd-tools-dev/main.tf — after the change
+# clusters/dev/csvd/csvd-dev-mcm/main.tf — after all Step 0/0a/0b changes
 module "eks_deployment" {
-  source = "../../"
-  repo_name = "533109815932-adsd-tools-nonprod-gov_apps-adsd-eks" # ← existing _apps-adsd-eks repo
-  cluster_name = "adsd-tools-dev" # ← folder written inside it
+  source = "github.e.it.census.gov/SCT-Engineering/terraform-eks-deployment///?ref=v1.2.0"
+  # ↑ pinned module version — decoupled from fleet repo history
+
+  repo_name = "229685449397-csvd-dev-gov_apps-adsd-eks" # ← existing _apps-adsd-eks repo
+  cluster_name = "csvd-dev-mcm" # ← subfolder written inside applications/
   repository_mode = "update"
   ...
 }
 ```
 
-This is a small variable rename in `variables.tf` + `main.tf`. Maybe 30 minutes.
-Every cluster entry in `clusters/` gets updated to specify both fields.
+`repo_name` / `cluster_name` is a small variable rename in `variables.tf` + `main.tf`. Maybe 30 minutes.
+Every cluster entry gets updated to specify both fields.
 
 After this change, cluster HCL lands in `applications//` inside the existing
 account infrastructure repo — exactly where the `repo-layout.md` convention says it should go.
 
+### Step 0b — Extract `clusters/` to a dedicated `terraform-eks-fleet` repo
+
+`terraform-eks-deployment` is a **module** — it should be versioned, tagged, and treated like
+a library. The `clusters/*/main.tf` workspaces are **instantiations** of that module — they
+change constantly and should have their own commit history, CI, CODEOWNERS, and access controls.
+Embedding them inside the module repo conflates two completely different maintenance concerns.
+
+**What to do:** Create a new repo `SCT-Engineering/terraform-eks-fleet`. Move all `clusters/`
+content there. Each workspace references `terraform-eks-deployment` as an external versioned
+module (see Step 0 example above) rather than via `../../`.
+
+Benefits:
+- Module can be tagged and released (`v1.2.0`) without touching fleet state
+- Fleet operators get write access to `terraform-eks-fleet` without touching the module
+- `update_all_clusters.py`, the GitHub Action, and the `eks-fleet.code-workspace` all live here
+- CodeBuild clones `terraform-eks-fleet` (not `terraform-eks-deployment`) when writing new cluster entries
+- CODEOWNERS in `terraform-eks-fleet` governs who can land changes to production clusters
+
+This is a repo creation + file move + change to the module source path. No logic changes.
+Estimated: **half a day** including wiring CodeBuild to clone the new repo.
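The module source path change in Step 0b could be scripted rather than edited by hand in each workspace. The following is a minimal sketch, assuming every workspace still uses the relative `source = "../../"` form. The helper name `pin_module_source` and the placeholder source string are hypothetical, not part of the existing tooling.

```python
import re

# Matches a relative module source such as:  source = "../../"
# (hypothetical pattern; assumes workspaces use exactly this relative form)
RELATIVE_SOURCE = re.compile(r'^([ \t]*source[ \t]*=[ \t]*)"\.\./\.\./?"', re.MULTILINE)


def pin_module_source(hcl_text: str, pinned_source: str) -> str:
    """Replace a relative `source = "../../"` with a pinned module source string."""
    # A function replacement avoids backslash-escape surprises in pinned_source.
    return RELATIVE_SOURCE.sub(lambda m: f'{m.group(1)}"{pinned_source}"', hcl_text)


example = '''module "eks_deployment" {
  source = "../../"
  repository_mode = "update"
}
'''

# Placeholder source string for illustration only; use the real pinned ref in practice.
pinned = pin_module_source(example, "example.invalid/terraform-eks-deployment?ref=v1.2.0")
print(pinned)
```

Running the helper a second time leaves the text unchanged, because the relative pattern no longer matches once the source is pinned.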
+
+### Step 0a — Restructure into a lifecycle/team hierarchy (done inside `terraform-eks-fleet`)
+
+With `clusters/` in its own repo, restructure it into a navigable hierarchy:
+
+```
+clusters/
+├── dev/
+│   ├── adsd/
+│   │   └── adsd-tools-dev/main.tf
+│   └── csvd/
+│       ├── csvd-dev-mcm/main.tf
+│       ├── csvd-lab-dja/main.tf
+│       └── csvd-lab-mcm/main.tf
+├── prod/
+│   ├── ois/
+│   │   └── eks-ois-cribl-prod/main.tf
+│   └── csvd/
+│       └── csvd-mcm-common/main.tf
+└── ...
+```
+
+Two dimensions:
+- **Lifecycle** (`dev` / `prod` / `sandbox`) — controls which clusters `update_all_clusters.py`
+  targets by default (safe to run `--lifecycle dev` freely; `--lifecycle prod` requires `--force`)
+- **Team** (`adsd` / `ois` / `csvd` / etc.) — matches the owning division; makes CODEOWNERS
+  scoping trivial (`clusters/prod/ois/` → `@SCT-Engineering/ois-eks-admins`)
+
+**The `update_all_clusters.py` script walks this tree recursively** — the restructure requires
+no logic changes in the script, only that it use `glob('clusters/**/main.tf', recursive=True)` instead of
+`glob('clusters/*/main.tf')`.
+
+**SC parameters to add:** The SC product form needs two new optional fields — `team` and
+`lifecycle` — which the Lambda threads through to CodeBuild as `TF_VAR_team` and
+`TF_VAR_lifecycle`. CodeBuild uses them to write the new `clusters////main.tf`
+entry back into `terraform-eks-fleet` as part of provisioning. If omitted, `lifecycle` defaults
+to `dev` and `team` is inferred from the `cluster_name` prefix (e.g. a name starting with `csvd-` → team `csvd`).
+
+This is a one-time migration of the existing five entries, then all newly provisioned clusters
+land in the right place automatically. Estimated: **2 hours** (migration) + **30 min** (SC/Lambda param).
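The lifecycle/team rules in Step 0a can be sketched in a few functions. This is an illustration under stated assumptions, not the actual `update_all_clusters.py`. It assumes a three-level layout (lifecycle, team, cluster name) as drawn in the tree above. The names `Cluster`, `discover_clusters`, `select_clusters`, and `infer_team`, and the `known_teams` list, are hypothetical. Only an explicit prod selection is gated on `force`; how an unfiltered run should treat prod is left open here.

```python
import tempfile
from pathlib import Path
from typing import List, NamedTuple, Optional


class Cluster(NamedTuple):
    lifecycle: str
    team: str
    name: str
    main_tf: Path


def discover_clusters(root: Path) -> List[Cluster]:
    """Collect every <lifecycle>/<team>/<name>/main.tf below `root`."""
    clusters = []
    for main_tf in sorted(root.glob("*/*/*/main.tf")):
        lifecycle, team, name = main_tf.relative_to(root).parts[:3]
        clusters.append(Cluster(lifecycle, team, name, main_tf))
    return clusters


def select_clusters(clusters, lifecycle=None, team=None, force=False):
    """Filter by lifecycle and team; refuse an explicit prod run without force."""
    if lifecycle == "prod" and not force:
        raise ValueError("--lifecycle prod requires --force")
    return [
        c for c in clusters
        if lifecycle in (None, c.lifecycle) and team in (None, c.team)
    ]


def infer_team(cluster_name: str, known_teams=("adsd", "csvd", "ois")) -> Optional[str]:
    """Prefix heuristic: return the leading name segment if it is a known team."""
    prefix = cluster_name.split("-", 1)[0]
    return prefix if prefix in known_teams else None


# Demo against a throwaway tree that mirrors part of the layout above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "clusters"
    for rel in ("dev/adsd/adsd-tools-dev", "dev/csvd/csvd-dev-mcm", "prod/ois/eks-ois-cribl-prod"):
        (root / rel).mkdir(parents=True)
        (root / rel / "main.tf").touch()
    fleet = discover_clusters(root)

dev_names = [c.name for c in select_clusters(fleet, lifecycle="dev")]
print(dev_names)
print(infer_team("csvd-dev-mcm"), infer_team("eks-ois-cribl-prod"))
```

In this sketch a name such as `eks-ois-cribl-prod` does not start with a team prefix, so the heuristic returns nothing for it. Clusters like that would need the `team` field supplied explicitly.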
+
+---
+
 ### Step 1 — `update_all_clusters.py` (the highest-value deliverable)
 
 A simple Python script, ~100 lines, that loops through `clusters/*/` and runs `tf apply`:
@@ -261,9 +329,12 @@ A simple Python script, ~100 lines, that loops through `clusters/*/` and runs `t
 ```python
 # scripts/update_all_clusters.py
 # Usage:
-#   python scripts/update_all_clusters.py                 # all clusters
-#   python scripts/update_all_clusters.py --filter csvd   # clusters matching 'csvd'
-#   python scripts/update_all_clusters.py --dry-run       # tf plan only
+#   python scripts/update_all_clusters.py                            # all clusters
+#   python scripts/update_all_clusters.py --lifecycle dev            # dev clusters only (safe default)
+#   python scripts/update_all_clusters.py --lifecycle prod --force   # prod clusters (requires --force)
+#   python scripts/update_all_clusters.py --team adsd                # single team across all lifecycles
+#   python scripts/update_all_clusters.py --filter csvd-lab          # clusters matching a name substring
+#   python scripts/update_all_clusters.py --dry-run                  # tf plan only, no apply
 ```
 
 This is the fleet management mechanism. When a version needs to be bumped across 20 clusters,
@@ -612,25 +683,27 @@ These are the concrete deliverables, ordered by impact-to-effort ratio.
 
 | Priority | Task | Effort | Owner | Unlocks |
 |----------|------|--------|-------|---------|
-| 1 | Decouple `repo_name` from `cluster_name` in `variables.tf` + update all `clusters/*/main.tf` | 30 min | David | Cluster HCL lands in correct account repos |
-| 2 | `scripts/update_all_clusters.py` with `--dry-run`, maintenance window gating, workspace gen | 1 day | David | Fleet-wide updates in one command + operator view + calendar respect |
-| 3 | Add CODEOWNERS + branch protection to `managed_extra_files` in `main.tf` | 2 hours | David | Governance on all future provisioned repos |
-| 4 | `clusters/README.md` | 30 min | David | CSVD onboarding for cluster ops |
-| 5 | Decide: inject into existing `_apps-adsd-eks` repo vs always create new | 0 (discussion) | Manuel | SC provisioning target |
-| 6 | Lambda: check for existing `_apps-adsd-eks` repo → inject if exists, create if not | 1 day | David | Multi-cluster accounts, clean fleet |
-| 7 | Retrofit branch protection on existing cluster repos | 2 hours | David | Governance retroactively applied |
-| 8 | Add `maintenance_window` local block to each `clusters/*/main.tf` | 1 hour | David | Calendar-gated fleet updates |
-| 9 | GitHub Action in `terraform-eks-deployment` to regenerate `eks-fleet.code-workspace` on `clusters/**` push | 2 hours | David | Always-current operator workspace, no manual sync |
-| 10 | Write `~/.copilot/skills/eks-fleet-query` skill | 2 hours | David | AI fleet queries from VS Code |
-| 11 | Write `~/.copilot/skills/eks-maintenance-check` skill | 1 hour | David | AI maintenance window queries |
-| 12 | `eks-upgrade` + `eks-pr-reviewer` agent instructions committed to relevant repos | 2 hours | David | AI upgrade planning + automated PR review classification |
-| 13 | `update_all_clusters.py` wired to CodeBuild (headless fleet ops) | 1 day | David | Fully automated fleet updates |
-
-**Items 1–4 can be done this sprint. Item 5 is a conversation. Items 6–13 follow.**
+| 1 | Decouple `repo_name` from `cluster_name` in `variables.tf` + update all cluster entries | 30 min | David | Cluster HCL lands in correct account repos |
+| 2 | Extract `clusters/` to new `SCT-Engineering/terraform-eks-fleet` repo; update module source to versioned ref | 0.5 day | David | Module and fleet decoupled; clean access controls; prerequisite for all fleet tooling |
+| 3 | Restructure into `clusters////` hierarchy within `terraform-eks-fleet` | 2 hours | David | Navigable fleet; lifecycle-gated updates; team-scoped CODEOWNERS |
+| 4 | `scripts/update_all_clusters.py` with `--lifecycle`, `--team`, `--dry-run`, `--force`, maintenance window gating | 1 day | David | Fleet-wide updates in one command + calendar respect |
+| 5 | Add CODEOWNERS + branch protection to `managed_extra_files` in `main.tf` | 2 hours | David | Governance on all future provisioned repos |
+| 6 | `clusters/README.md` in `terraform-eks-fleet` | 30 min | David | CSVD onboarding for cluster ops |
+| 7 | Decide: inject into existing `_apps-adsd-eks` repo vs always create new | 0 (discussion) | Manuel | SC provisioning target |
+| 8 | Lambda: check for existing `_apps-adsd-eks` repo → inject if exists, create if not; write new cluster entry to `terraform-eks-fleet` | 1 day | David | Multi-cluster accounts, clean fleet |
+| 9 | Retrofit branch protection on existing cluster repos | 2 hours | David | Governance retroactively applied |
+| 10 | Add `maintenance_window` local block to each cluster entry | 1 hour | David | Calendar-gated fleet updates |
+| 11 | GitHub Action in `terraform-eks-fleet` to regenerate `eks-fleet.code-workspace` on `clusters/**` push | 2 hours | David | Always-current operator workspace, no manual sync |
+| 12 | Write `~/.copilot/skills/eks-fleet-query` skill | 2 hours | David | AI fleet queries from VS Code |
+| 13 | Write `~/.copilot/skills/eks-maintenance-check` skill | 1 hour | David | AI maintenance window queries |
+| 14 | `eks-upgrade` + `eks-pr-reviewer` agent instructions committed to relevant repos | 2 hours | David | AI upgrade planning + automated PR review classification |
+| 15 | `update_all_clusters.py` wired to CodeBuild (headless fleet ops) | 1 day | David | Fully automated fleet updates |
+
+**Items 1–6 can be done this sprint. Item 7 is a conversation. Items 8–15 follow.**
 
 ---
 
-## What This Looks Like After Items 1–4
+## What This Looks Like After Items 1–6
 
 A CSVD engineer doing a fleet-wide EKS version bump: