CSVD · arnol377 · May 15, 2026
diff --git a/design-docs/EKS_GOVERNANCE_STATUS_AND_ROADMAP.md b/design-docs/EKS_GOVERNANCE_STATUS_AND_ROADMAP.md
@@ -190,7 +190,7 @@ module "eks_deployment" {
 
 With `repo_name` and `cluster_name` explicit, the fleet map is self-documenting:
 a CSVD engineer can `grep -r repo_name clusters/` and instantly see every account repo
-CSVD has ever written cluster HCL into. This is the "map of injections" Manuel referenced —
+CSVD has ever written cluster HCL into. This is the "cluster injection map" David raised on the call —
 no external database, no spreadsheet. The source files are the inventory.
 
 The cluster config is **fully declarative**. The diff between "what CSVD intended" and
@@ -209,6 +209,9 @@ The cluster config is **fully declarative**. The diff between "what CSVD intende
 | No branch protection set at provisioning time | Same — governance is not enforced at the repo level |
 | No operator-level fleet view | No single place for a CSVD engineer to see all cluster configs side-by-side without navigating multiple repos |
 | `clusters/` has no README | New CSVD engineers don't know how to run a single-cluster or fleet update |
+| `clusters/` is a flat directory | All clusters in one level — no way to reason by lifecycle (dev/prod) or owning team; will become unnavigable at 50+ clusters |
+| Lambda/CodeBuild cannot place a new cluster entry at the right path | When provisioning creates the `clusters/<name>/main.tf` entry in `terraform-eks-deployment`, it has no parameter for lifecycle or team — every new cluster lands at the top level |
+| `clusters/` lives inside the module repo it instantiates | `terraform-eks-deployment` is a **Terraform module** (library code). Embedding operational workspaces inside it conflates module versioning with fleet operations — you can't tag a module release without also tagging cluster state; you can't give fleet operators write access without giving them module write access |
 
 That's the entire gap. **The architecture is sound. The tooling just needs to be completed.**
 
@@ -221,7 +224,7 @@ The target state, described simply:
 1. **New cluster provisioned via SC** → repo created with correct folder structure + governance baked in
 2. **Version bump or config change** → one command → PRs open across all affected clusters
 3. **Customer wants to change something** → they open a PR in their account repo → CSVD reviews it
-4. **CSVD wants to see fleet state** → one place (`terraform-eks-deployment/clusters/`) shows every cluster's config
+4. **CSVD wants to see fleet state** → one place (`terraform-eks-fleet/clusters/`) shows every cluster's config, calling the versioned `terraform-eks-deployment` module
 
 No new architecture is needed. These are operational completions of what's already built.
 
@@ -239,31 +242,99 @@ This is the prerequisite for account repo injection. Currently the module uses `
 the GitHub repo name and the cluster directory name. They need to be separate:
 
 ```hcl
-# clusters/adsd-tools-dev/main.tf — after the change
+# clusters/dev/csvd/csvd-dev-mcm/main.tf — after all Step 0/0a/0b changes
 module "eks_deployment" {
-  source          = "../../"
-  repo_name       = "533109815932-adsd-tools-nonprod-gov_apps-adsd-eks"  # ← existing _apps-adsd-eks repo
-  cluster_name    = "adsd-tools-dev"                                      # ← folder written inside it
+  source = "git::https://github.e.it.census.gov/SCT-Engineering/terraform-eks-deployment.git?ref=v1.2.0"
+  # ↑ pinned module version — decoupled from fleet repo history
+
+  repo_name       = "229685449397-csvd-dev-gov_apps-adsd-eks"  # ← existing _apps-adsd-eks repo
+  cluster_name    = "csvd-dev-mcm"                              # ← subfolder written inside applications/
   repository_mode = "update"
   ...
 }
 ```
 
-This is a small variable rename in `variables.tf` + `main.tf`. Maybe 30 minutes.
-Every cluster entry in `clusters/` gets updated to specify both fields.
+Splitting `repo_name` from `cluster_name` is a small variable change in `variables.tf` + `main.tf`, roughly 30 minutes of work.
+Every cluster entry gets updated to specify both fields.
 After this change, cluster HCL lands in `applications/<cluster-name>/` inside the existing
 account infrastructure repo — exactly where the `repo-layout.md` convention says it should go.
 
+### Step 0b — Extract `clusters/` to a dedicated `terraform-eks-fleet` repo
+
+`terraform-eks-deployment` is a **module** — it should be versioned, tagged, and treated like
+a library. The `clusters/*/main.tf` workspaces are **instantiations** of that module — they
+change constantly and should have their own commit history, CI, CODEOWNERS, and access controls.
+Embedding them inside the module repo conflates two completely different maintenance concerns.
+
+**What to do:** Create a new repo `SCT-Engineering/terraform-eks-fleet`. Move all `clusters/`
+content there. Each workspace references `terraform-eks-deployment` as an external versioned
+module (see Step 0 example above) rather than via `../../`.
+
+Benefits:
+- Module can be tagged and released (`v1.2.0`) without touching fleet state
+- Fleet operators get write access to `terraform-eks-fleet` without touching the module
+- `update_all_clusters.py`, the GitHub Action, and the `eks-fleet.code-workspace` all live here
+- CodeBuild clones `terraform-eks-fleet` (not `terraform-eks-deployment`) when writing new cluster entries
+- CODEOWNERS in `terraform-eks-fleet` governs who can land changes to production clusters
+
+This is a repo creation + file move + change to the module source path. No logic changes.
+Estimated: **half a day** including wiring CodeBuild to clone the new repo.
+
+### Step 0a — Restructure into a lifecycle/team hierarchy (done inside `terraform-eks-fleet`)
+
+With `clusters/` in its own repo, restructure it into a navigable hierarchy:
+
+```
+clusters/
+├── dev/
+│   ├── adsd/
+│   │   └── adsd-tools-dev/main.tf
+│   └── csvd/
+│       ├── csvd-dev-mcm/main.tf
+│       ├── csvd-lab-dja/main.tf
+│       └── csvd-lab-mcm/main.tf
+├── prod/
+│   ├── ois/
+│   │   └── eks-ois-cribl-prod/main.tf
+│   └── csvd/
+│       └── csvd-mcm-common/main.tf
+└── ...
+```
+
+Two dimensions:
+- **Lifecycle** (`dev` / `prod` / `sandbox`) — controls which clusters `update_all_clusters.py`
+  targets by default (safe to run `--lifecycle dev` freely; `--lifecycle prod` requires `--force`)
+- **Team** (`adsd` / `ois` / `csvd` / etc.) — matches the owning division; makes CODEOWNERS
+  scoping trivial (`clusters/prod/ois/` → `@SCT-Engineering/ois-eks-admins`)
+
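+**The `update_all_clusters.py` script discovers workspaces by globbing this tree**, so the restructure
+requires no other logic changes in the script. Only the pattern changes, from `glob('clusters/*/main.tf')`
+to `glob('clusters/*/*/*/main.tf')` (or `glob('clusters/**/main.tf', recursive=True)`, since Python's `**` only recurses when `recursive=True` is set).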
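
A minimal sketch of that discovery step, assuming the `clusters/<lifecycle>/<team>/<cluster>/main.tf` layout shown above. `ClusterEntry` and `discover_clusters` are hypothetical names used for illustration; they are not taken from the existing script.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ClusterEntry:
    lifecycle: str  # e.g. "dev", "prod", "sandbox"
    team: str       # e.g. "csvd", "adsd", "ois"
    name: str       # cluster directory name
    path: Path      # directory containing main.tf


def discover_clusters(root: Path) -> list[ClusterEntry]:
    """Return one entry per <lifecycle>/<team>/<cluster>/main.tf under root."""
    entries = []
    # Exactly three directory levels, so stray files at other depths are ignored.
    for main_tf in sorted(root.glob("*/*/*/main.tf")):
        lifecycle, team, name = main_tf.relative_to(root).parts[:3]
        entries.append(ClusterEntry(lifecycle, team, name, main_tf.parent))
    return entries
```

Matching a fixed depth rather than recursing freely means a leftover flat entry such as `clusters/<name>/main.tf` is skipped instead of being misread as a lifecycle directory.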
+
+**SC parameters to add:** The SC product form needs two new optional fields — `team` and
+`lifecycle` — which the Lambda threads through to CodeBuild as `TF_VAR_team` and
+`TF_VAR_lifecycle`. CodeBuild uses them to write the new `clusters/<lifecycle>/<team>/<cluster>/main.tf`
+entry back into `terraform-eks-fleet` as part of provisioning. If omitted, `lifecycle` defaults
+to `dev` and `team` is inferred from the `cluster_name` prefix (e.g. a name starting with `csvd-` → team `csvd`).
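
The defaulting rule could look roughly like the sketch below. `resolve_entry_path` and `KNOWN_TEAMS` are hypothetical, and failing on an unrecognised prefix is an assumption of this sketch, since the text does not say what should happen in that case.

```python
from pathlib import PurePosixPath

# Hypothetical list of team prefixes; the real source of truth is not specified here.
KNOWN_TEAMS = {"adsd", "csvd", "ois"}


def resolve_entry_path(cluster_name, team=None, lifecycle=None):
    """Build the fleet-repo path for a new cluster entry, applying the defaults."""
    lifecycle = lifecycle or "dev"  # lifecycle defaults to dev when omitted
    if team is None:
        # Prefix heuristic: "csvd-dev-mcm" -> "csvd"
        prefix = cluster_name.split("-", 1)[0]
        if prefix not in KNOWN_TEAMS:
            # Assumption: refuse to guess rather than file the cluster under a wrong team.
            raise ValueError(f"cannot infer team from {cluster_name!r}; pass team explicitly")
        team = prefix
    return PurePosixPath("clusters", lifecycle, team, cluster_name, "main.tf")
```

A name like `eks-ois-cribl-prod` shows why the heuristic needs a fallback: its prefix is `eks`, not the owning team, so `team` has to be supplied in that case.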
+
+This is a one-time migration of the existing five entries, then all new provisioned clusters
+land in the right place automatically. Estimated: **2 hours** (migration) + **30 min** (SC/Lambda param).
+
+---
+
 ### Step 1 — `update_all_clusters.py` (the highest-value deliverable)
 
 A simple Python script, ~100 lines, that loops through `clusters/*/` and runs `tf apply`:
 
 ```python
 # scripts/update_all_clusters.py
 # Usage:
-#   python scripts/update_all_clusters.py              # all clusters
-#   python scripts/update_all_clusters.py --filter csvd  # clusters matching 'csvd'
-#   python scripts/update_all_clusters.py --dry-run      # tf plan only
+#   python scripts/update_all_clusters.py                          # all clusters
+#   python scripts/update_all_clusters.py --lifecycle dev          # dev clusters only (safe default)
+#   python scripts/update_all_clusters.py --lifecycle prod --force # prod clusters (requires --force)
+#   python scripts/update_all_clusters.py --team adsd              # single team across all lifecycles
+#   python scripts/update_all_clusters.py --filter csvd-lab        # clusters matching a name substring
+#   python scripts/update_all_clusters.py --dry-run                # tf plan only, no apply
 ```
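
A hedged sketch of how those flags might be parsed and applied with `argparse`. The function names and the `(lifecycle, team, name)` tuple shape are assumptions for illustration. The sketch also reads the `--force` rule strictly: any selection that includes a prod cluster is refused without `--force`, which is one possible interpretation of the usage above.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="update_all_clusters.py")
    parser.add_argument("--lifecycle", choices=["dev", "prod", "sandbox"])
    parser.add_argument("--team", help="owning team, e.g. adsd")
    parser.add_argument("--filter", dest="name_filter", help="cluster name substring")
    parser.add_argument("--dry-run", action="store_true", help="plan only, no apply")
    parser.add_argument("--force", action="store_true", help="allow prod clusters")
    return parser


def select_clusters(clusters, args):
    """Filter (lifecycle, team, name) tuples by the parsed flags."""
    selected = [
        (lifecycle, team, name)
        for lifecycle, team, name in clusters
        if (args.lifecycle is None or lifecycle == args.lifecycle)
        and (args.team is None or team == args.team)
        and (args.name_filter is None or args.name_filter in name)
    ]
    # Assumption: touching any prod cluster requires an explicit --force.
    if not args.force and any(lifecycle == "prod" for lifecycle, _, _ in selected):
        raise SystemExit("prod clusters selected; re-run with --force")
    return selected
```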
 
 This is the fleet management mechanism. When a version needs to be bumped across 20 clusters,
@@ -612,25 +683,27 @@ These are the concrete deliverables, ordered by impact-to-effort ratio.
 
 | Priority | Task | Effort | Owner | Unlocks |
 |----------|------|--------|-------|---------|
-| 1 | Decouple `repo_name` from `cluster_name` in `variables.tf` + update all `clusters/*/main.tf` | 30 min | David | Cluster HCL lands in correct account repos |
-| 2 | `scripts/update_all_clusters.py` with `--dry-run`, maintenance window gating, workspace gen | 1 day | David | Fleet-wide updates in one command + operator view + calendar respect |
-| 3 | Add CODEOWNERS + branch protection to `managed_extra_files` in `main.tf` | 2 hours | David | Governance on all future provisioned repos |
-| 4 | `clusters/README.md` | 30 min | David | CSVD onboarding for cluster ops |
-| 5 | Decide: inject into existing `_apps-adsd-eks` repo vs always create new | 0 (discussion) | Manuel | SC provisioning target |
-| 6 | Lambda: check for existing `_apps-adsd-eks` repo → inject if exists, create if not | 1 day | David | Multi-cluster accounts, clean fleet |
-| 7 | Retrofit branch protection on existing cluster repos | 2 hours | David | Governance retroactively applied |
-| 8 | Add `maintenance_window` local block to each `clusters/*/main.tf` | 1 hour | David | Calendar-gated fleet updates |
-| 9 | GitHub Action in `terraform-eks-deployment` to regenerate `eks-fleet.code-workspace` on `clusters/**` push | 2 hours | David | Always-current operator workspace, no manual sync |
-| 10 | Write `~/.copilot/skills/eks-fleet-query` skill | 2 hours | David | AI fleet queries from VS Code |
-| 11 | Write `~/.copilot/skills/eks-maintenance-check` skill | 1 hour | David | AI maintenance window queries |
-| 12 | `eks-upgrade` + `eks-pr-reviewer` agent instructions committed to relevant repos | 2 hours | David | AI upgrade planning + automated PR review classification |
-| 13 | `update_all_clusters.py` wired to CodeBuild (headless fleet ops) | 1 day | David | Fully automated fleet updates |
-
-**Items 1–4 can be done this sprint. Item 5 is a conversation. Items 6–13 follow.**
+| 1 | Decouple `repo_name` from `cluster_name` in `variables.tf` + update all cluster entries | 30 min | David | Cluster HCL lands in correct account repos |
+| 2 | Extract `clusters/` to new `SCT-Engineering/terraform-eks-fleet` repo; update module source to versioned ref | 0.5 day | David | Module and fleet decoupled; clean access controls; prerequisite for all fleet tooling |
+| 3 | Restructure into `clusters/<lifecycle>/<team>/<name>/` hierarchy within `terraform-eks-fleet` | 2 hours | David | Navigable fleet; lifecycle-gated updates; team-scoped CODEOWNERS |
+| 4 | `scripts/update_all_clusters.py` with `--lifecycle`, `--team`, `--dry-run`, `--force`, maintenance window gating | 1 day | David | Fleet-wide updates in one command + calendar respect |
+| 5 | Add CODEOWNERS + branch protection to `managed_extra_files` in `main.tf` | 2 hours | David | Governance on all future provisioned repos |
+| 6 | `clusters/README.md` in `terraform-eks-fleet` | 30 min | David | CSVD onboarding for cluster ops |
+| 7 | Decide: inject into existing `_apps-adsd-eks` repo vs always create new | 0 (discussion) | Manuel | SC provisioning target |
+| 8 | Lambda: check for existing `_apps-adsd-eks` repo → inject if exists, create if not; write new cluster entry to `terraform-eks-fleet` | 1 day | David | Multi-cluster accounts, clean fleet |
+| 9 | Retrofit branch protection on existing cluster repos | 2 hours | David | Governance retroactively applied |
+| 10 | Add `maintenance_window` local block to each cluster entry | 1 hour | David | Calendar-gated fleet updates |
+| 11 | GitHub Action in `terraform-eks-fleet` to regenerate `eks-fleet.code-workspace` on `clusters/**` push | 2 hours | David | Always-current operator workspace, no manual sync |
+| 12 | Write `~/.copilot/skills/eks-fleet-query` skill | 2 hours | David | AI fleet queries from VS Code |
+| 13 | Write `~/.copilot/skills/eks-maintenance-check` skill | 1 hour | David | AI maintenance window queries |
+| 14 | `eks-upgrade` + `eks-pr-reviewer` agent instructions committed to relevant repos | 2 hours | David | AI upgrade planning + automated PR review classification |
+| 15 | `update_all_clusters.py` wired to CodeBuild (headless fleet ops) | 1 day | David | Fully automated fleet updates |
+
+**Items 1–6 can be done this sprint. Item 7 is a conversation. Items 8–15 follow.**
 
 ---
 
-## What This Looks Like After Items 1–4
+## What This Looks Like After Items 1–6
 
 A CSVD engineer doing a fleet-wide EKS version bump: