diff --git a/docs/HOW-IT-WORKS.md b/docs/HOW-IT-WORKS.md index fe6d117..4be7d99 100644 --- a/docs/HOW-IT-WORKS.md +++ b/docs/HOW-IT-WORKS.md @@ -6,29 +6,34 @@ to a Terraform plan or apply running inside an AWS account repository. --- -## Design Overview: Two-Product Model +## Design Overview: Proposer Product + Webhook Auto-Apply -The system is split into **two distinct Service Catalog products** with a human -review gate between them: +The system uses a **single user-facing Service Catalog product** with a human +review gate before Terraform runs any infrastructure changes: -| Product | CodeBuild Project | What It Does | -|---------|------------------|--------------| -| `tf-run-proposer` | `tf-run-proposer` | Clone repo → render templates → commit → open PR | -| `tf-run-executor` | `tf-run-executor` | Clone `main` → assume role → run `tf-run apply` | +| Component | CodeBuild Project | What It Does | +|-----------|------------------|--------------| +| SC Product: `tf-run-proposer` | `tf-run-proposer` | Clone repo → render templates → commit → open PR | +| Webhook (automatic) | `tf-run-executor` | Clone `main` → assume role → run `tf-run apply` | -**Why two products?** +**Why not two SC products?** -An earlier single-product design ran `tf-run apply` first and then opened a PR -as a trailing artifact. This made the PR meaningless as a review gate — Terraform -had already changed real infrastructure before anyone saw the diff. +An earlier design exposed the executor as a second Service Catalog product, +requiring a human to return to the SC console after merging the PR, re-enter the +same parameters, and click Launch. This is pure operational overhead — the review +already happened at PR merge time, and the parameters needed to run the apply are +already recorded in `.sc-automation.yml` in the repo. -The two-product model restores the PR as a genuine gate: +The current design restores the PR as a genuine gate with no extra manual steps: -1. 
A team provisions the **Proposer** → changes are committed to a branch and a PR - is opened. No infrastructure is touched. CFN stack completes quickly (< 60s). +1. A team provisions the **Proposer** product → changes are committed to a branch + and a PR is opened. No infrastructure is touched. 2. A human reviews the diff, approves, and merges the PR. -3. The team provisions the **Executor** → CodeBuild checks out `main` (post-merge), - assumes the target account role, and runs `tf-run apply`. +3. The GHE push-to-main webhook fires automatically → Lambda reads + `.sc-automation.yml` → starts `tf-run-executor` CodeBuild. No SC product, + no CFN stack, no user action required. + +See [ADR-001](decisions/001-webhook-auto-apply.md) for the full decision record. --- @@ -58,14 +63,13 @@ The two-product model restores the PR as a genuine gate: ↕ Human reviews PR, approves, merges ↕ ┌─────────────────────────────────────────────────────────────────────┐ -│ APPLY FLOW │ +│ AUTO-APPLY (webhook — no user action required) │ │ │ -│ User fills SC form → CFN Custom Resource │ -│ └─> Lambda (tf-run-executor-trigger) │ -│ • Validates inputs (action=apply) │ -│ • Starts tf-run-executor CodeBuild build │ -│ • Polls CodeBuild until completion │ -│ • Returns apply status + repo URL to CFN │ +│ GHE push to main → Lambda Function URL (HMAC verified) │ +│ └─> Lambda (tf-run-webhook-handler) │ +│ • Reads .sc-automation.yml from default branch │ +│ • Starts tf-run-executor CodeBuild (fire-and-forget) │ +│ • Posts pending commit status to GHE │ │ └─> CodeBuild: tf-run-executor │ │ • Installs: Terraform binary (from S3), tf-run │ │ toolchain, Census CA cert, gh CLI, Python deps │ @@ -73,7 +77,7 @@ The two-product model restores the PR as a genuine gate: │ • Optionally assumes cross-account IAM role │ │ • cd {LAYER}/{REGION_DIR} │ │ • tf-run apply (respects TF_RUN_START_TAG) │ -│ • POST_BUILD emits BUILD_RESULT= │ +│ • POST_BUILD writes commit status ✅/❌ to GHE │ 
└─────────────────────────────────────────────────────────────────────┘ ``` @@ -88,9 +92,9 @@ The two-product model restores the PR as a genuine gate: | CodeBuild (executor) | `tf-run-executor` | csvd-dev | | SC Portfolio | `{prefix}-tf-run` | csvd-dev | | SC Product (propose) | `{prefix}-tf-run-proposer` | csvd-dev | -| SC Product (apply) | `{prefix}-tf-run-executor` | csvd-dev | | CFN Template (propose) | `service-catalog/proposer-template.yaml` | S3 artifacts bucket | -| CFN Template (apply) | `service-catalog/executor-template.yaml` | S3 artifacts bucket | +| Lambda Function URL | `tf-run-webhook-handler` HTTPS endpoint | csvd-dev | +| GHE Webhook | Org-level push webhook → Lambda Function URL | GHE (manual one-time setup) | | Launch Role | `{prefix}-sc-launch-role` | csvd-dev | | GHE PAT | `ghe-runner/github-token` in Secrets Manager | csvd-dev | | Cross-account role | `sc-automation-codebuild-role` | **Target** account | @@ -166,38 +170,37 @@ The CFN stack completes and the output panel shows the PR URL. --- -## Step-by-Step: Apply Flow +## Auto-Apply on Merge (Webhook) ### 1. Prerequisites - The Proposer has run and its PR has been **reviewed and merged** to `main` +- `.sc-automation.yml` was committed by the Proposer alongside the rendered files - The target account has the `sc-automation-codebuild-role` IAM role with a trust policy allowing assume-role from the CodeBuild execution role in csvd-dev +- The GHE org webhook is configured once: push events → Lambda Function URL -### 2. User fills the SC form - -The user opens the **tf-run-executor** product and provides: - -- **AccountRepo** — same repo name as the Proposer -- **Layer** and **RegionDir** — same as the Proposer -- **TargetAccountId** _(optional)_ — if set, CodeBuild assumes the cross-account role -- **TfRunStartTag** _(optional)_ — start tf-run from a specific `TAG` step -- **DryRun** — `true` for plan-only, `false` to apply - -### 3. CloudFormation invokes the Lambda +### 2. 
GHE fires the push webhook -CFN creates a `Custom::TerraformApply` resource with `action: apply`. +On merge to `main`, GHE sends a `push` event to the Lambda Function URL with +an HMAC-SHA256 signature (`X-Hub-Signature-256` header). The Lambda verifies +the signature against the `ghe-runner/webhook-secret` Secrets Manager secret. -### 4. Lambda validates and starts CodeBuild +### 3. Lambda reads `.sc-automation.yml` and starts CodeBuild -Lambda starts `tf-run-executor` with: +The Lambda (webhook handler mode): +1. Fetches `.sc-automation.yml` from the default branch of the pushed repo +2. Extracts `account_repo`, `layer`, `region_dir`, `target_account_id`, + `dry_run`, and optional `tf_run_start_tag` +3. Calls `codebuild:StartBuild` on `tf-run-executor` with override env vars: + ``` + ACCOUNT_REPO, LAYER, REGION_DIR, + TARGET_ACCOUNT_ID, TF_RUN_START_TAG, DRY_RUN, GITHUB_TOKEN + ``` +4. Posts a `pending` commit status to the merge commit on GHE +5. Returns HTTP 200 immediately — the webhook call is fire-and-forget -``` -ACCOUNT_REPO, LAYER, REGION_DIR, -TARGET_ACCOUNT_ID, TF_RUN_START_TAG, DRY_RUN, GITHUB_TOKEN -``` - -### 5. CodeBuild - INSTALL phase +### 4. CodeBuild - INSTALL phase - Clones `github.e.it.census.gov/terraform/support` for version governance - Downloads Terraform binary from S3 (version governed by `VERSION_TF`) @@ -206,22 +209,39 @@ TARGET_ACCOUNT_ID, TF_RUN_START_TAG, DRY_RUN, GITHUB_TOKEN - Downloads and installs `gh` CLI - `pip3 install python-dateutil pyyaml` -### 6. CodeBuild - BUILD phase +### 5. CodeBuild - BUILD phase 1. Rewrite git remotes; `git clone` account repo; `git checkout main` 2. If `TARGET_ACCOUNT_ID` is set: `aws sts assume-role` → - `arn:aws:iam::{TARGET_ACCOUNT_ID}:role/sc-automation-codebuild-role` + `arn:${AWS::Partition}:iam::{TARGET_ACCOUNT_ID}:role/sc-automation-codebuild-role` and export the temporary credentials 3. `cd ${LAYER}/${REGION_DIR}` -4. 
If `DRY_RUN=true`: `tf-run plan`; else: `tf-run apply` (with optional `--start-tag ${TF_RUN_START_TAG}`) +4. If `DRY_RUN=true`: `tf-run plan`; else: `tf-run apply` + (with optional `--start-tag ${TF_RUN_START_TAG}`) -### 7. Lambda polls and returns +### 6. CodeBuild - POST_BUILD phase -On `SUCCEEDED`: -- Sends CFN `SUCCESS` with: - - `ApplyStatus: SUCCEEDED` - - `RepositoryUrl` / `repository_url` - - `CodeBuildBuildId` +Writes a `success` or `failure` commit status to GHE on the merge commit, +linking to the CodeBuild log. Platform engineers see ✅/❌ on the commit +without checking CloudWatch directly. + +### Manual One-Off Runs + +For re-apply, dry-run, or partial runs (start from a TAG), trigger the executor +build directly: + +```bash +export AWS_DEFAULT_REGION=us-gov-west-1 +aws codebuild start-build \ + --project-name tf-run-executor \ + --environment-variables-override \ + name=ACCOUNT_REPO,value=229685449397-csvd-dev-platform-dev-gov,type=PLAINTEXT \ + name=LAYER,value=infrastructure,type=PLAINTEXT \ + name=REGION_DIR,value=west,type=PLAINTEXT \ + name=DRY_RUN,value=true,type=PLAINTEXT +``` + +No Service Catalog product is needed. --- @@ -310,6 +330,6 @@ mishandles acronyms (`AWSAccountId` → `a_w_s_account_id`). 
| `deploy/codebuild.tf` | Terraform: `aws_codebuild_project.tf_run_proposer` + `tf_run_executor` | | `deploy/lambda.tf` | Terraform: Lambda function with `PROPOSER_PROJECT_NAME` + `EXECUTOR_PROJECT_NAME` | | `deploy/iam.tf` | Terraform: IAM roles for Lambda, CodeBuild (with `sts:AssumeRole`), SC launch | -| `deploy/service_catalog.tf` | Terraform: Portfolio, two products, two launch constraints | +| `deploy/service_catalog.tf` | Terraform: Portfolio, single Proposer product, launch constraint | +| `deploy/webhook.tf` | Terraform: Lambda Function URL, HMAC secret, GHE webhook IAM | | `service-catalog/proposer-template.yaml` | CFN template for the Propose product | -| `service-catalog/executor-template.yaml` | CFN template for the Apply product | diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000..9052243 --- /dev/null +++ b/docs/README.md @@ -0,0 +1,145 @@ +# sc-lambda-ghactions Documentation + +This directory contains the design, operating model, and rollout guidance for +`sc-lambda-ghactions` — the centralized Lambda + CodeBuild system that provisions +and manages Terraform-backed account repo changes through AWS Service Catalog. + +## What This System Does + +At a high level, the platform supports this workflow: + +1. A user launches a Service Catalog product +2. CloudFormation invokes a centralized Lambda in `csvd-dev` +3. The Lambda validates inputs and starts a CodeBuild build +4. CodeBuild clones a template repo, renders Terraform/HCL/YAML content, and opens a PR +5. After merge, the executor path can run Terraform against the target workload +6. CSVD can also operate the full managed fleet centrally + +## How to Read This Documentation + +This doc set currently contains both: + +- **Current or near-term implementation guidance** for the CodeBuild-based rollout +- **Proposed design evolution** for auto-apply, generalized product types, and fleet-scale operations + +Because of that, the best entry point depends on what you need. 
+ +## Recommended Reading Paths + +### 1. "I need the quickest overview" + +Start with: + +- [HOW-IT-WORKS.md](HOW-IT-WORKS.md) — end-to-end explanation of the proposer/executor model, the main infrastructure components, and the current CodeBuild execution flow +- [workflow-flowcharts.md](workflow-flowcharts.md) — visual walkthrough of provisioning, apply-on-merge, and fleet update flows + +### 2. "I need to understand the target generalized architecture" + +Start with: + +- [generalized-terraform-product-architecture.md](generalized-terraform-product-architecture.md) — explains how the system expands from EKS-only into a reusable pattern for any Terraform workload +- [template-management.md](template-management.md) — explains how template repos, Jinja2 rendering, `.sc-automation.yml`, and repo injection work +- [repo-vars-and-secrets.md](repo-vars-and-secrets.md) — explains how SSM and Secrets Manager values are injected into CodeBuild builds + +### 3. "I need to onboard a new Service Catalog product" + +Read in this order: + +- [generalized-terraform-product-architecture.md](generalized-terraform-product-architecture.md) — required moving parts for a new `product_type` +- [template-management.md](template-management.md) — template repo structure and rendering expectations +- [service-catalog-census-integration.md](service-catalog-census-integration.md) — how to register the product in `terraform-service-catalog-census` +- [repo-vars-and-secrets.md](repo-vars-and-secrets.md) — how product-scoped configuration and secrets reach the build + +### 4. 
"I need to understand operations and governance at scale" + +Start with: + +- [fleet-governance-at-scale.md](fleet-governance-at-scale.md) — the `terraform-sc-fleet` operating model, workload inventory structure, maintenance windows, and governance controls +- [cross-account-visibility.md](cross-account-visibility.md) — hub-and-spoke IAM model and options for centralized visibility across accounts +- [workflow-flowcharts.md](workflow-flowcharts.md) — visual summary of fleet-wide operations + +### 5. "I need to understand the webhook auto-apply proposal" + +Read: + +- [decisions/001-webhook-auto-apply.md](decisions/001-webhook-auto-apply.md) — ADR for triggering executor builds automatically from GitHub Enterprise webhook events +- [workflow-flowcharts.md](workflow-flowcharts.md) — flow-level view of the apply-on-merge path +- [template-management.md](template-management.md) — `.sc-automation.yml` schema and executor behavior + +## Document Guide + +### Core system overview + +- [HOW-IT-WORKS.md](HOW-IT-WORKS.md) + - Best for understanding the end-to-end proposer/executor model + - Covers the centralized Lambda, CodeBuild projects, SC products, and step-by-step runtime behavior + - Use this as the main operational baseline + +- [workflow-flowcharts.md](workflow-flowcharts.md) + - Best for stakeholder demos and quick architectural orientation + - Includes flows for provisioning, apply-on-merge, and fleet-wide updates + +### Generalization and product onboarding + +- [generalized-terraform-product-architecture.md](generalized-terraform-product-architecture.md) + - Explains how the platform generalizes to any Terraform workload + - Defines the core onboarding units: template repo, Jinja2 templates, Pydantic model, CFN product template, census registration + +- [template-management.md](template-management.md) + - Canonical guide for template repo usage + - Covers full-repo vs subdirectory templates, Jinja2 rendering, `.sc-automation.yml`, proposer behavior, and executor 
re-rendering into existing account repos + +- [repo-vars-and-secrets.md](repo-vars-and-secrets.md) + - Canonical guide for runtime config injection + - Covers AWS Parameter Store layout, Secrets Manager layout, Lambda IAM, and CodeBuild `environmentVariablesOverride` + +- [service-catalog-census-integration.md](service-catalog-census-integration.md) + - Canonical guide for enterprise product registration + - Covers central vs StackSet vs census-managed resources, launch roles, portfolio/product YAML, and rollout into `terraform-service-catalog-census` + +### Operations, governance, and visibility + +- [fleet-governance-at-scale.md](fleet-governance-at-scale.md) + - Defines the `terraform-sc-fleet` model for operating many workloads across many repos + - Covers workload entry files, account repo layout, update scripts, maintenance windows, CODEOWNERS, and branch protection + +- [cross-account-visibility.md](cross-account-visibility.md) + - Covers read-only access patterns for viewing managed resources across accounts + - Describes the hub-and-spoke IAM role chain and Resource Explorer-first UI approach + +### Architecture decisions + +- [decisions/001-webhook-auto-apply.md](decisions/001-webhook-auto-apply.md) + - ADR for the proposed webhook-triggered executor path + - Useful for understanding why the manual post-merge step should disappear and how `.sc-automation.yml` participates in the design + +## Suggested Canonical Interpretation + +Where multiple docs overlap, use this interpretation: + +- [HOW-IT-WORKS.md](HOW-IT-WORKS.md) is the best **runtime/system overview** +- [template-management.md](template-management.md) is the best **template repo and account repo injection** reference +- [repo-vars-and-secrets.md](repo-vars-and-secrets.md) is the best **config/secrets injection** reference +- [service-catalog-census-integration.md](service-catalog-census-integration.md) is the best **enterprise rollout** reference +- 
[fleet-governance-at-scale.md](fleet-governance-at-scale.md) is the best **day-2 fleet operations** reference +- [decisions/001-webhook-auto-apply.md](decisions/001-webhook-auto-apply.md) is the best **design rationale** for auto-apply on merge + +## Current Gaps and Notes + +This doc set is now broad enough to explain: + +- how template repos are leveraged +- how rendered content is injected into new and existing account repos +- how CodeBuild receives configuration and secrets +- how new products are registered in Census +- how CSVD governs and operates the resulting fleet + +A few documents are still explicitly marked **Proposed** or **Draft**, so treat them as design intent unless and until the code and deployment match them. + +## If You Only Read Three Docs + +Read these first: + +1. [HOW-IT-WORKS.md](HOW-IT-WORKS.md) +2. [template-management.md](template-management.md) +3. [service-catalog-census-integration.md](service-catalog-census-integration.md) diff --git a/docs/cross-account-visibility.md b/docs/cross-account-visibility.md new file mode 100644 index 0000000..91cf421 --- /dev/null +++ b/docs/cross-account-visibility.md @@ -0,0 +1,353 @@ +# Cross-Account Fleet Visibility — Credentials and Console UI + +**Date:** 2026-05-19 +**Status:** Proposed +**Scope:** Read-only visibility across all accounts managed by sc-lambda-ghactions + +--- + +## Problem + +The `terraform-sc-fleet` manifest and `update_fleet.py` give CSVD a single operational +view of all managed workloads at the Terraform / GHE layer. But engineers also need to +locate and inspect those resources in the **AWS console** — CloudFormation stacks, +Service Catalog provisioned products, Lambda functions, S3 buckets, EKS clusters — +across all accounts simultaneously, without switching console sessions or holding +long-lived credentials for each account. 
+ +--- + +## Credential Model — Hub-and-Spoke IAM Role Chain + +The UI server and any tooling that reads across accounts **never holds long-lived +credentials**. It uses `sts:AssumeRole` to obtain temporary credentials scoped to +each target account on demand. + +``` +csvd-dev (229685449397) — hub + └─> sc-fleet-ui-server role (instance profile / ECS task role) + └─> sts:AssumeRole ─────────────────────────────────────────────┐ + ▼ + Any spoke account + └─> sc-fleet-readonly role + └─> ReadOnlyAccess (AWS managed policy) +``` + +Temporary credentials are cached for up to 1 hour (the STS session duration). +Rotation is automatic. No keys are stored in environment variables, SSM, or Secrets Manager. + +--- + +## Infrastructure + +### 1. Spoke role — deployed to every target account via StackSet + +One role per account, deployed automatically via the existing +`CensusServiceCatalog-RoleAndAction` StackSet alongside the SC launch roles. + +**CFN role template** (`templates/role-templates/sc-fleet-readonly-role.yaml`): + +```yaml +Type: AWS::IAM::Role +Properties: + RoleName: sc-fleet-readonly + AssumeRolePolicyDocument: + Version: "2012-10-17" + Statement: + - Effect: Allow + Principal: + AWS: !Sub "arn:${AWS::Partition}:iam::${HubAccountId}:role/sc-fleet-ui-server" + Action: sts:AssumeRole + Condition: + StringEquals: + "sts:ExternalId": !Ref ExternalId # optional but recommended + ManagedPolicyArns: + - !Sub "arn:${AWS::Partition}:iam::aws:policy/ReadOnlyAccess" + Tags: + - Key: managed-by + Value: sc-lambda-ghactions +``` + +**`roles.yaml.tftpl` entry** (census repo): + +```yaml +- template: sc-fleet-readonly-role.yaml + parameters: + - parameter: HubAccountId + value: "229685449397" + - parameter: ExternalId + value: "sc-fleet-ui" +``` + +This propagates to all OU-shared accounts automatically. New accounts joining the OU +receive the role via `auto_deployment { enabled = true }`. + +### 2. 
Hub role — deployed in csvd-dev + +Lives in `sc-lambda-ghactions/deploy/iam.tf`. This is the role assumed by the UI server +(ECS task, Lambda, or EC2 instance profile). + +```hcl +resource "aws_iam_role" "sc_fleet_ui_server" { + name = "sc-fleet-ui-server" + + assume_role_policy = jsonencode({ + Version = "2012-10-17" + Statement = [{ + Effect = "Allow" + Principal = { Service = "ecs-tasks.amazonaws.com" } + Action = "sts:AssumeRole" + }] + }) + + tags = { + managed-by = "sc-lambda-ghactions" + } +} + +resource "aws_iam_role_policy" "assume_spoke_roles" { + name = "assume-sc-fleet-readonly" + role = aws_iam_role.sc_fleet_ui_server.id + + policy = jsonencode({ + Version = "2012-10-17" + Statement = [{ + Effect = "Allow" + Action = "sts:AssumeRole" + Resource = "arn:${data.aws_partition.current.partition}:iam::*:role/sc-fleet-readonly" + # Restrict to org accounts only + Condition = { + StringEquals = { + "aws:ResourceOrgID" = var.org_id + } + } + }] + }) +} +``` + +### 3. Python helper — per-account session factory + +Used by the fleet dashboard, `update_fleet.py`, and any other tooling that needs +cross-account AWS API access: + +```python +# scripts/aws_session.py +import boto3 +from functools import lru_cache + +READONLY_ROLE = "sc-fleet-readonly" +PARTITION = "aws-us-gov" +REGION = "us-gov-west-1" + +@lru_cache(maxsize=64) +def session_for(account_id: str) -> boto3.Session: + """Return a boto3 Session scoped to account_id via sts:AssumeRole. + Credentials are cached for the lifetime of the process. + For long-running processes, evict the cache before the 1-hour STS expiry. 
+ """ + sts = boto3.client("sts", region_name=REGION) + assumed = sts.assume_role( + RoleArn=f"arn:{PARTITION}:iam::{account_id}:role/{READONLY_ROLE}", + RoleSessionName="sc-fleet-ui", + ExternalId="sc-fleet-ui", + DurationSeconds=3600, + ) + creds = assumed["Credentials"] + return boto3.Session( + aws_access_key_id=creds["AccessKeyId"], + aws_secret_access_key=creds["SecretAccessKey"], + aws_session_token=creds["SessionToken"], + region_name=REGION, + ) + +def sc_client(account_id: str): + return session_for(account_id).client("servicecatalog") + +def cfn_client(account_id: str): + return session_for(account_id).client("cloudformation") +``` + +--- + +## Centralized UI Options + +Three options in order of implementation cost: + +### Option A — AWS Resource Explorer (recommended first step) + +Resource Explorer with a **multi-account aggregator index** provides a single search +across all accounts with built-in console deep-links. No custom UI to build or maintain. + +#### Setup + +Enable Resource Explorer org-wide with an aggregator in the management (or delegated +admin) account: + +```hcl +# In the management/delegated-admin account +resource "aws_resourceexplorer2_index" "aggregator" { + type = "AGGREGATOR" +} + +resource "aws_resourceexplorer2_view" "sc_fleet" { + name = "sc-fleet" + default_view = true + + filters { + # Resource Explorer tag filters use tag:<key>=<value> + filter_string = "tag:managed-by=sc-lambda-ghactions" + } +} +``` + +Each member account needs a local index (can be enabled via AWS Organizations policy +or Terraform deployed via StackSet): + +```hcl +resource "aws_resourceexplorer2_index" "local" { + type = "LOCAL" +} +``` + +#### Tagging convention + +Every resource provisioned through sc-lambda-ghactions must carry these tags so +Resource Explorer can surface them: + +| Tag key | Example value | Purpose | +|---------|--------------|---------| +| `managed-by` | `sc-lambda-ghactions` | Scope the aggregator view | +| `product-type` | `eks_cluster` | Filter by workload type | +| `workload-name` 
| `csvd-dev-mcm` | Find a specific workload | +| `team` | `csvd` | Filter by owning team | +| `lifecycle` | `dev` | Filter by environment tier | +| `account-repo` | `229685449397-csvd-dev-gov_apps-adsd-eks` | Trace back to GHE repo | + +The Proposer CodeBuild buildspec applies these tags when rendering HCL files that +create tagged resources. For resources that don't support tags (e.g. some IAM resources), the CFN +stack itself is tagged and the stack's console link is sufficient. + +#### Example Resource Explorer queries + +``` +# All sc-lambda-ghactions resources +tag:managed-by=sc-lambda-ghactions + +# All EKS provisioned products +tag:managed-by=sc-lambda-ghactions tag:product-type=eks_cluster + +# Specific workload across all resource types +tag:workload-name=csvd-dev-mcm + +# CloudFormation stacks managed by the system (the query cannot filter on +# stack status; check for failures in the CloudFormation console) +resourcetype:cloudformation:stack tag:managed-by=sc-lambda-ghactions +``` + +Results include a direct "Open in console" link to each resource in its native account. + +--- + +### Option B — Custom Fleet Dashboard + +A lightweight read-only web app for cases where Resource Explorer is insufficient, e.g. when you need +to show fleet diff state (pending PRs, last apply status, maintenance windows) alongside +AWS resource state. 
+ +#### Architecture + +``` +csvd-dev + └─> ECS Fargate task (or Lambda + Function URL) + ├─> Assumes sc-fleet-ui-server hub role + ├─> Reads terraform-sc-fleet workloads/** (GHE API) + ├─> Calls sts:AssumeRole per account → reads SC/CFN/resource state + └─> Renders HTML dashboard with console deep-links +``` + +#### Console deep-link construction + +Direct links into the GovCloud console for each resource type: + +```python +BASE = "https://console.amazonaws-us-gov.com" + +def cfn_stack_link(region: str, stack_name: str) -> str: + return f"{BASE}/cloudformation/home?region={region}#/stacks?filteringText={stack_name}" + +def sc_product_link(region: str, product_id: str) -> str: + return f"{BASE}/servicecatalog/home?region={region}#/provisioned-products/{product_id}" + +def lambda_link(region: str, function_name: str) -> str: + return f"{BASE}/lambda/home?region={region}#/functions/{function_name}" + +def eks_link(region: str, cluster_name: str) -> str: + return f"{BASE}/eks/home?region={region}#/clusters/{cluster_name}" +``` + +#### Fleet status aggregation + +```python +from scripts.aws_session import REGION, sc_client + +# sc_product_link is the deep-link helper defined in the previous block. +def fleet_status(accounts: list[str]) -> list[dict]: + """Return provisioned product status across all accounts.""" + results = [] + for account_id in accounts: + sc = sc_client(account_id) + products = sc.search_provisioned_products( + Filters={"SearchQuery": ["tag:managed-by:sc-lambda-ghactions"]} + )["ProvisionedProducts"] + for p in products: + # Tags come back as a list of {"Key": ..., "Value": ...} dicts. + tags = {t["Key"]: t["Value"] for t in p.get("Tags", [])} + results.append({ + "account_id": account_id, + "product_name": p["Name"], + "product_type": tags.get("product-type"), + "status": p["Status"], + "status_message": p.get("StatusMessage"), + # Deep-link into the region the session factory is pinned to. + "console_link": sc_product_link(REGION, p["Id"]), + }) + return results +``` + +--- + +### Option C — AWS Systems Manager Explorer + +SSM Fleet Manager and Explorer aggregate resource data, OpsItems, and compliance across +accounts out of the box — zero custom code, built-in 
console UI. Less flexible than +Options A/B but worth evaluating before building anything custom. + +Enable via AWS Organizations in the SSM console of the management account. No Terraform +changes needed beyond ensuring SSM is activated in all member accounts (already required +for StackSet operations). + +--- + +## Recommended Rollout + +| Phase | Work | Outcome | +|-------|------|---------| +| **1** | Add tags to all sc-lambda-ghactions provisioned resources (Proposer GHA templates) | Every resource carries `managed-by`, `product-type`, `workload-name`, `team`, `lifecycle` | +| **2** | Deploy `sc-fleet-readonly` spoke role via StackSet entry in census repo | CSVD hub can assume into any org account with one `sts:AssumeRole` call | +| **3** | Enable Resource Explorer aggregator index via management account | Single console search across all accounts with deep-links; zero custom UI | +| **4** | Add `aws_session.py` session factory to `terraform-sc-fleet/scripts/` | `update_fleet.py` and any future tooling can query any account with one helper call | +| **5** | *(optional)* Build fleet dashboard if Resource Explorer + GHE PR state is insufficient | Custom ECS task with per-account SC/CFN reads + console deep-link generation | + +Phases 1–3 are the minimum viable set. Phase 4 is a development convenience. Phase 5 +is only needed if the built-in console tools don't cover the operational queries CSVD +actually needs to make. 
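The Phase 4 session factory caches sessions for the lifetime of the process, while the STS credentials behind them expire after at most 1 hour. One way to keep a long-running dashboard from holding expired sessions is a small time-bounded cache. The following is a minimal sketch, not part of the existing `aws_session.py`: the `ttl_cache` name, its `clock` parameter, and the 55-minute figure in the usage note are all assumptions made for illustration.

```python
import time
from functools import wraps
from typing import Callable, Dict, Tuple, TypeVar

T = TypeVar("T")


def ttl_cache(ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
    """Cache results of a one-argument function, recomputing after ttl_seconds.

    Hypothetical helper: `clock` is injectable so the expiry logic can be
    exercised without waiting in real time.
    """

    def decorator(fn: Callable[[str], T]) -> Callable[[str], T]:
        # Maps the argument (e.g. an account ID) to (time stored, cached value).
        store: Dict[str, Tuple[float, T]] = {}

        @wraps(fn)
        def wrapper(key: str) -> T:
            now = clock()
            hit = store.get(key)
            # Serve the cached value only while it is younger than the TTL.
            if hit is not None and now - hit[0] < ttl_seconds:
                return hit[1]
            value = fn(key)
            store[key] = (now, value)
            return value

        return wrapper

    return decorator
```

Under these assumptions, `session_for` could be decorated with `@ttl_cache(ttl_seconds=55 * 60)` in place of `@lru_cache(maxsize=64)`, so each account's session is re-assumed shortly before the 1-hour expiry. The cache is unbounded, which is acceptable only when the set of account IDs is small and known.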
+ +--- + +## Security Notes + +- The `sc-fleet-readonly` spoke role grants `ReadOnlyAccess` — it cannot create, modify, + or delete any resource in any spoke account +- The `ExternalId` condition on `sts:AssumeRole` prevents confused-deputy attacks — only + callers that know the external ID can assume the role +- The hub role `sc-fleet-ui-server` is scoped to `sts:AssumeRole` on `*/sc-fleet-readonly` + only — it cannot assume any other role in spoke accounts +- The org condition (`aws:ResourceOrgID`) on the hub policy prevents the server from + assuming the role name in accounts outside the Census org +- No long-lived credentials are stored anywhere; STS temporary credentials expire + automatically after at most 1 hour diff --git a/docs/decisions/001-webhook-auto-apply.md b/docs/decisions/001-webhook-auto-apply.md index 369a01e..3f86962 100644 --- a/docs/decisions/001-webhook-auto-apply.md +++ b/docs/decisions/001-webhook-auto-apply.md @@ -9,26 +9,30 @@ We want to change this so that when a PR is merged to the main branch, our syste This paper describes how this automatic process will work, what files and settings are needed, and what changes we have to make to our system. The goal is to make things smoother, faster, and less error-prone for everyone who uses our platform. -**Status:** Proposed +**Status:** Accepted **Date:** 2026-05-11 -**Branch:** feature/template-repo-rendering +**Supersedes:** the two-product model (proposer SC product + executor SC product) --- ## Context -The current two-product model requires a human to manually provision the -`tf-run-executor` Service Catalog product after a Proposer PR is reviewed and -merged. This adds unnecessary friction to the apply step: +An earlier design split the workflow into two Service Catalog products — a +**Proposer** product to render templates and open a PR, and a separate +**Executor** product to run `tf-run apply` after the PR was merged. 
While the +Proposer SC product is a natural fit for self-service provisioning (users fill +a form, get a PR URL back), the Executor SC product is not: it requires a +platform engineer to return to Service Catalog, find the product, re-enter the +same parameters already specified at propose time, and click Launch. -1. Platform engineer reviews and merges the PR opened by the Proposer -2. Platform engineer opens Service Catalog, finds the executor product, fills in - the same parameters they already specified during the Propose step, and - clicks Launch +This step is pure operational overhead with no review value — the review already +happened when the PR was merged to `main`. The information needed to start the +executor build (account repo, layer, region dir, target account) is already +recorded in `.sc-automation.yml` in the repo itself. -Step 2 is pure operational overhead. The information needed to start the executor -build (account repo, layer, region dir, target account) is already known at merge -time and could be stored in the repo itself. +**The Executor SC product is removed.** Apply is triggered automatically by a +GHE webhook on merge to `main`. The only user-facing Service Catalog product +remains the Proposer. --- @@ -137,8 +141,10 @@ in the PR history — no CloudWatch required. - **No polling.** The webhook handler starts builds and returns immediately. Build results are visible in CodeBuild logs and CloudWatch. There is no CFN stack to signal. -- **No CFN resource.** The executor product is still available for manual use, - but webhook-triggered runs bypass Service Catalog entirely. +- **No CFN resource.** Webhook-triggered executor runs bypass Service Catalog + entirely. For manual one-off runs (re-apply from a TAG, dry-run), the executor + build can be started directly via the CodeBuild console or AWS CLI — no SC + product is needed or maintained. - **Idempotent.** If GHE retries the webhook (network blip), a duplicate build is started. 
This is acceptable — `tf-run apply` on an already-applied state is a no-op. @@ -180,8 +186,8 @@ in the PR history — no CloudWatch required. - GHE commit status writeback gives teams ✅/❌ feedback directly on the merge commit - No new infrastructure services (no EventBridge, no SQS, no API Gateway) - No repo→callback URL map to maintain — repo identity comes from the webhook payload -- The executor SC product remains available for manual one-off runs and - day-2 operations (re-run from a specific tag, dry-run, etc.) +- Manual one-off executor runs (re-apply from a TAG, dry-run) are done directly + via `aws codebuild start-build` — no separate SC product is needed or maintained ### Trade-offs diff --git a/docs/decisions/002-vault-aws-secrets-engine.md b/docs/decisions/002-vault-aws-secrets-engine.md new file mode 100644 index 0000000..1d1bd4c --- /dev/null +++ b/docs/decisions/002-vault-aws-secrets-engine.md @@ -0,0 +1,345 @@ +# ADR-002: HashiCorp Vault AWS Secrets Engine for Dynamic Cross-Account Credentials + +## In Plain Language + +Right now, when our automation runs Terraform in an account repo it needs AWS +credentials to assume a role in the target account. Those credentials come from +a long-lived IAM role attached to the CodeBuild service role — a role that +exists permanently and can be used at any time. + +This document proposes replacing those static, always-on IAM credentials with +**short-lived, on-demand credentials** issued by a HashiCorp Vault cluster +running the [AWS Secrets Engine](https://developer.hashicorp.com/vault/docs/secrets/aws). +When a build starts, it authenticates to Vault (using its own AWS identity), +asks for credentials scoped to the target account and the specific Vault role +defined in the product workspace, gets back temporary AWS keys that expire in +minutes, and then runs Terraform. There are no long-lived keys to rotate or +accidentally expose. 
+ +Because the Vault role is a Terraform resource declared inside the product +workspace, the exact IAM permissions granted to any automation run are visible +as a reviewable diff in the same PR that makes the infrastructure change. Review +the code, review the access policy — one approval covers both. + +**Status:** Proposed +**Date:** 2026-05-19 + +--- + +## Context + +The current cross-account credential model works as follows: + +1. The CodeBuild service role (`sc-automation-codebuild-role` in csvd-dev) has + `sts:AssumeRole` permission for `*:role/sc-automation-codebuild-role`. +2. A matching role with the same name is pre-created in each target account and + trusts the csvd-dev CodeBuild role. +3. The executor buildspec calls `aws sts assume-role` and exports `AWS_*` env + vars before running Terraform. + +This works but has the following drawbacks: + +- **Static trust relationship.** The csvd-dev CodeBuild role can assume the + target-account role at any time, not just during a sanctioned automation run. + If the CodeBuild service role or its credentials were ever misused, an attacker + could assume any target-account role without any build being underway. +- **No per-run scope.** Every executor build gets the same level of access, + regardless of what the product workspace actually needs. There is no way to + restrict a build to, say, only VPC-layer permissions. +- **Permission review is disconnected.** The IAM role in the target account is + managed separately from the product workspace. A reviewer approving a product + PR has no visibility into what IAM permissions the automation will use. +- **Static role pre-creation.** Every new account requires a platform engineer + to pre-create the `sc-automation-codebuild-role` role before the first + automation run can succeed. + +The Vault AWS Secrets Engine addresses all four of these gaps. 
+ +--- + +## Decision + +Deploy a HashiCorp Vault cluster (or use an existing Census-managed Vault) with +the **AWS Secrets Engine** enabled. Each SC product workspace declares a +`vault_aws_secret_backend_role` Terraform resource specifying the exact IAM +permissions the automation run requires. The executor buildspec authenticates to +Vault using the **AWS auth method** (the CodeBuild task's own IAM identity) and +requests short-lived STS credentials scoped to that role before running Terraform. + +--- + +## Proposed Design + +### Vault AWS Secrets Engine — how it works + +``` +Vault cluster (csvd-dev or shared platform) + └── secrets/aws/ (AWS Secrets Engine mount) + └── roles/ + └── {vault_aws_role} + credential_type = assumed_role + role_arns = ["arn:aws-us-gov:iam::{account_id}:role/{role}"] + default_ttl = 900s + max_ttl = 1800s +``` + +When the executor build calls `vault read aws/creds/{vault_aws_role}`, Vault +calls `sts:AssumeRole` on its own behalf and returns temporary +`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN` that +expire when the TTL elapses. The credentials are scoped to exactly the role ARNs +listed in the Vault role — nothing wider. + +### `.sc-automation.yml` — new field + +```yaml +apply_on_merge: + - layer: infrastructure + region_dir: west + target_account_id: "229685449397" + vault_aws_role: "sc-infra-west-229685449397" # ← new +``` + +The `vault_aws_role` value is the name of the Vault role to read credentials +from. It is written by the Proposer (derived from the product workspace) and +committed to the account repo alongside the rendered HCL files. + +### Product workspace — Vault role as a Terraform resource + +Each SC product workspace (e.g. 
a VPC product, an EKS product) declares the +Vault role it needs alongside its other infrastructure: + +```hcl +# vault_role.tf — committed inside the product workspace, reviewed in the PR + +resource "vault_aws_secret_backend_role" "automation" { + backend = "aws" + name = "sc-infra-west-${var.target_account_id}" + credential_type = "assumed_role" + + role_arns = [ + "arn:${var.aws_partition}:iam::${var.target_account_id}:role/sc-automation-infra-west" + ] + + default_ttl = 900 + max_ttl = 1800 +} +``` + +**Why this matters for review:** The Proposer PR diff includes `vault_role.tf`. +A reviewer can see exactly which IAM role the automation will assume and in which +account. Access policy and infrastructure change are approved in the same PR +— there is no separate IAM role PR to chase down or forget. + +### CodeBuild authentication to Vault — AWS auth method + +The executor CodeBuild task authenticates to Vault using the +[AWS auth method](https://developer.hashicorp.com/vault/docs/auth/aws). The +CodeBuild service role's IAM identity is used as the authentication credential +— no long-lived Vault token is stored anywhere. + +```bash +# executor buildspec — PRE_BUILD phase +vault login -method=aws \ + -path=auth/aws \ + role=sc-automation-executor \ + header_value=vault.example.census.gov + +# Read dynamic credentials for this specific run +CREDS=$(vault read -format=json aws/creds/${VAULT_AWS_ROLE}) +export AWS_ACCESS_KEY_ID=$(echo $CREDS | jq -r .data.access_key) +export AWS_SECRET_ACCESS_KEY=$(echo $CREDS | jq -r .data.secret_key) +export AWS_SESSION_TOKEN=$(echo $CREDS | jq -r .data.security_token) +``` + +The CodeBuild task's IAM role is added to a Vault auth policy that permits only +`read` on `aws/creds/*` — it cannot create new Vault roles, modify policies, or +read credentials for roles it is not permitted to access. 
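The credential export in the PRE_BUILD snippet above can also be sketched in Python. One reason a build might prefer this: `jq -r` prints the literal string `null` for a missing field, so the shell version would export `AWS_SESSION_TOKEN=null` instead of failing. This is a minimal sketch, not the project's implementation. The function name `creds_to_env` is hypothetical, and it assumes the `vault read -format=json` output carries the same `data.access_key`, `data.secret_key`, and `data.security_token` fields that the buildspec reads.

```python
import json
from typing import Dict

# Field names mirror the jq paths used in the buildspec snippet above.
_FIELD_TO_ENV = {
    "access_key": "AWS_ACCESS_KEY_ID",
    "secret_key": "AWS_SECRET_ACCESS_KEY",
    "security_token": "AWS_SESSION_TOKEN",
}


def creds_to_env(vault_json: str) -> Dict[str, str]:
    """Convert `vault read -format=json aws/creds/<role>` output to AWS env vars.

    Hypothetical helper: raises ValueError when a field is absent or empty,
    rather than exporting a placeholder value.
    """
    data = json.loads(vault_json).get("data") or {}
    missing = sorted(field for field in _FIELD_TO_ENV if not data.get(field))
    if missing:
        raise ValueError("Vault response is missing: " + ", ".join(missing))
    return {env: data[field] for field, env in _FIELD_TO_ENV.items()}
```

In a build step, the returned mapping could be merged into the environment passed to the Terraform subprocess, so the credentials are never written to disk.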
+ +### Vault Terraform resources — managed in deploy/ + +```hcl +# deploy/vault.tf + +resource "vault_aws_secret_backend" "aws" { + path = "aws" + # Vault's own IAM user/role that calls sts:AssumeRole on behalf of requestors + # must have sts:AssumeRole on the target roles. +} + +resource "vault_auth_backend" "aws" { + type = "aws" + path = "auth/aws" +} + +resource "vault_aws_auth_backend_role" "codebuild_executor" { + backend = vault_auth_backend.aws.path + role = "sc-automation-executor" + auth_type = "iam" + bound_iam_principal_arns = [aws_iam_role.codebuild_service_role.arn] + token_policies = ["sc-automation-executor"] + token_ttl = 900 +} + +resource "vault_policy" "codebuild_executor" { + name = "sc-automation-executor" + + policy = <<-EOT + path "aws/creds/*" { + capabilities = ["read"] + } + EOT +} +``` + +### Infrastructure summary + +| Component | Location | Purpose | +|---|---|---| +| Vault cluster | Census-managed or csvd-dev | Issues dynamic AWS credentials | +| AWS Secrets Engine | `aws/` mount on Vault | Calls `sts:AssumeRole` and returns short-lived keys | +| AWS auth method | `auth/aws/` mount on Vault | Lets CodeBuild authenticate using its own IAM identity | +| `vault_aws_secret_backend_role` | Product workspace Terraform | Per-product IAM scope, reviewed in the Proposer PR | +| Vault endpoint env var | `deploy/codebuild.tf` | `VAULT_ADDR` set on the executor CodeBuild project | +| Vault IAM user | `deploy/vault.tf` | Has `sts:AssumeRole` on all target-account roles | +| Target-account IAM roles | Per-account Terraform | Trust Vault IAM user; scoped to minimum permissions | + +--- + +## Integration with the Proposer Flow + +The key insight is that the Vault role declaration is **part of the product +workspace**, not managed out-of-band. + +When the Proposer CodeBuild build runs Terraform (`tf apply`) to render and +commit files to the account repo, it also applies `vault_role.tf`. The result: + +1. User fills SC product form → Proposer starts. 
+2. Proposer runs `terraform apply` in the product workspace → creates + `vault_aws_secret_backend_role` in Vault. +3. Proposer renders HCL templates → opens PR on the account repo. +4. PR includes `.sc-automation.yml` with `vault_aws_role: sc-infra-west-{account_id}`. +5. Reviewer merges PR. +6. Webhook fires executor build with `VAULT_AWS_ROLE=sc-infra-west-{account_id}`. +7. Executor authenticates to Vault, reads credentials for that role, runs Terraform. + +The Vault role and the target-account IAM role both exist by the time the +executor runs because the Proposer created them before the PR was even opened. + +### Account baseline prerequisite + +For the Proposer to create the target-account IAM role, it needs an initial +foothold in that account. A single **proposer-access role** must exist in each +target account before the first product is provisioned into it: + +```hcl +# Created once per account as part of account baseline / landing-zone +resource "aws_iam_role" "sc_automation_proposer" { + name = "sc-automation-proposer" + + assume_role_policy = jsonencode({ + Version = "2012-10-17" + Statement = [{ + Effect = "Allow" + Principal = { AWS = "arn:${var.aws_partition}:iam::229685449397:role/tf-run-proposer-codebuild" } + Action = "sts:AssumeRole" + }] + }) +} + +# Inline policy: limits this role's IAM actions to roles in the +# sc-automation-* namespace (an identity policy, not a permissions boundary) +resource "aws_iam_role_policy" "sc_automation_proposer" { + role = aws_iam_role.sc_automation_proposer.name + policy = jsonencode({ + Version = "2012-10-17" + Statement = [{ + Effect = "Allow" + Action = ["iam:CreateRole", "iam:PutRolePolicy", "iam:AttachRolePolicy", + "iam:DeleteRole", "iam:DeleteRolePolicy", "iam:GetRole"] + Resource = "arn:${var.aws_partition}:iam::*:role/sc-automation-*" + }] + }) +} +``` + +This role is **not** a Vault-specific concept — it is the account-level trust +grant that allows the automation platform (csvd-dev) to manage its own
IAM +footprint in a target account. It belongs in the account vending / landing-zone +baseline alongside other platform roles (e.g. Break-Glass, Config recorder, +SSO permission sets). Once created at account birth it never needs to change. + +--- + +## Consequences + +### Benefits + +- **Short-lived credentials.** Dynamic STS credentials expire within the TTL + (default 15 min). A leaked credential is useless after expiry. +- **Per-run scope.** Each executor build reads credentials for the specific + Vault role defined in `.sc-automation.yml`. A build cannot access credentials + for a role it was not explicitly given. +- **Review parity.** IAM permissions (`vault_role.tf`) are changed in the same + PR as infrastructure. No separate IAM PR; no forgotten permission cleanup. +- **No static cross-account trust.** The existing "CodeBuild role can assume + any `sc-automation-codebuild-role` at any time" is replaced with "CodeBuild + can only read credentials for Vault roles it is permitted to access, and only + during an active build." +- **Automatic Vault role and IAM role provisioning.** The Proposer's + `terraform apply` creates both the Vault role and the target-account IAM + role the Vault secrets engine will assume — in the same apply, before the + PR is opened. No manual per-product setup in the target account. +- **Audit log.** Vault logs every credential issuance with the requesting + entity, timestamp, and lease ID. Each executor build's credential request is + independently auditable in Vault audit logs, separate from CloudTrail. + +### Trade-offs + +- **Vault dependency.** The automation chain now requires a healthy Vault + cluster. If Vault is unavailable, executor builds cannot obtain credentials + and will fail. Mitigation: Vault HA, periodic health checks, runbook for + Vault outage. +- **Vault provider version pinning.** The product workspace requires the + `hashicorp/vault` Terraform provider. 
This must be available via the Census + proxy (or mirrored in the internal provider cache) and pinned to a tested + version. +- **One landing-zone role required per account.** The Proposer needs a + `sc-automation-proposer` role in each target account (see _Account baseline + prerequisite_ above) to create the per-product executor IAM role. This is a + one-time setup per account, lives in the account vending baseline, and is + narrower than today's equivalent (`iam:CreateRole` on `sc-automation-*` only). +- **Executor buildspec changes required.** `vault login` and `vault read` + calls must be added to the PRE_BUILD phase and the prior + `aws sts assume-role` pattern removed. + +### Out of scope for this ADR + +- Vault cluster sizing, HA topology, and DR strategy — tracked separately +- Census Vault namespace design (shared cluster vs. dedicated) — tracked separately +- Migration path for existing accounts already using the static-role model — tracked separately +- Slack / SNS notification on Vault credential issuance failures — tracked separately + +--- + +## Alternatives Considered + +**AWS IAM Roles Anywhere:** Lets workloads outside AWS obtain short-lived +credentials by presenting a certificate signed by a private CA. Requires +managing a private CA and distributing certificates to CodeBuild tasks. +More complex than Vault AWS auth (which reuses the existing IAM identity +already on the CodeBuild task) with no meaningful benefit in this context. +Rejected. + +**Keep static cross-account role assumption + add SCPs to restrict usage to +CodeBuild source IPs:** SCPs cannot restrict by source service (CodeBuild vs +an operator workstation with the same credentials), only by IP range. IP +ranges for CodeBuild are not stable or exclusive. Rejected. + +**AWS Secrets Manager dynamic secrets plugin:** AWS Secrets Manager does not +natively generate STS-assumed-role credentials on demand. The only supported +dynamic rotation pattern is for database passwords. Rejected. 
+ +**OIDC federation (GitHub Actions model):** GHE on-prem does not expose an +OIDC discovery endpoint compatible with the AWS IAM OIDC provider without +additional infrastructure. Vault AWS auth with the existing CodeBuild IAM +identity is simpler and requires no GHE configuration changes. Rejected. diff --git a/docs/fleet-governance-at-scale.md b/docs/fleet-governance-at-scale.md new file mode 100644 index 0000000..92941cc --- /dev/null +++ b/docs/fleet-governance-at-scale.md @@ -0,0 +1,403 @@ +# Infrastructure Fleet Governance at Scale + +**Ported and generalized from:** `lambda-template-repo-generator/design-docs/EKS_CLUSTER_GOVERNANCE_AT_SCALE.md` +**Generalized from:** EKS-only → any Terraform workload managed through sc-lambda-ghactions +**Date:** 2026-05-19 +**Status:** DRAFT + +--- + +## Summary + +This document defines the governance model and work plan for operating the +sc-lambda-ghactions system at scale — across many provisioned workloads, many account +repos, and many product types (EKS clusters, S3 buckets, RDS instances, VPCs, etc.). + +The three requirements that drive the design: + +1. **Individual workloads** can be modified and updated granularly, without touching others. +2. **All workloads** can be managed centrally by CSVD — CSVD retains governance even as + provisioning is self-service for customers. +3. **Workload state lives in the customer's account repo**, in a dedicated folder per workload, + using a consistent Terragrunt structure. + +The overarching constraint: customers cannot realistically manage complex Terraform +infrastructure themselves. If CSVD gives up governance, they will be called in to +remediate. The solution must scale to many workloads while keeping CSVD in control of +configuration correctness and lifecycle. 
+ +--- + +## The Fleet Repository: `terraform-sc-fleet` + +### Why a dedicated fleet repo + +The sc-lambda-ghactions Lambda and CodeBuild builds are the _provisioning plane_ — they +create repos and open initial PRs. GitHub Actions workflows are planned for a later +rollout phase and will replace the CodeBuild executor builds at that point. The _operations plane_ — applying ongoing changes, +fleet-wide version bumps, governance policy enforcement — belongs in a separate repo +that CSVD controls directly. + +**`SCT-Engineering/terraform-sc-fleet`** is this operations plane. It contains one +folder per managed workload instance, each of which is a Terraform module call pointing +at the relevant product workspace. + +### Fleet repo structure + +``` +terraform-sc-fleet/ +├── workloads/ +│ ├── eks_cluster/ +│ │ ├── dev/ +│ │ │ ├── csvd/ +│ │ │ │ ├── csvd-dev-mcm/main.tf +│ │ │ │ └── csvd-lab-dja/main.tf +│ │ │ └── adsd/ +│ │ │ └── adsd-tools-dev/main.tf +│ │ └── prod/ +│ │ └── ois/ +│ │ └── ois-cribl-prod/main.tf +│ ├── s3_bucket/ +│ │ ├── dev/ +│ │ │ └── csvd/ +│ │ │ └── csvd-artifacts/main.tf +│ │ └── prod/ +│ └── {product_type}/ +│ └── {lifecycle}/ +│ └── {team}/ +│ └── {workload-name}/main.tf +├── scripts/ +│ ├── update_fleet.py # Fleet-wide apply runner +│ ├── maintenance_check.py # Window-aware update eligibility +│ └── fleet_query.py # Structured inventory queries +├── .github/ +│ └── workflows/ +│ └── regenerate-workspace.yml # Auto-updates fleet.code-workspace on push +├── fleet.code-workspace # Auto-generated VS Code workspace (all managed repos) +└── README.md +``` + +The directory tree encodes two dimensions: +- **Product type** (`eks_cluster`, `s3_bucket`, etc.) — matches `product_type` in the SC form +- **Lifecycle / team** (`dev/csvd`, `prod/ois`, etc.) 
— controls blast radius of fleet operations + +--- + +## Per-Workload Entry Files + +Each `workloads/{product_type}/{lifecycle}/{team}/{name}/main.tf` calls the corresponding +Terraform product workspace as a versioned external module: + +```hcl +# workloads/eks_cluster/dev/csvd/csvd-dev-mcm/main.tf +module "workload" { + source = "github.e.it.census.gov/SCT-Engineering/terraform-eks-deployment///?ref=v1.2.0" + + repo_name = "229685449397-csvd-dev-gov_apps-adsd-eks" # account repo + cluster_name = "csvd-dev-mcm" # folder inside that repo + account_name = "csvd-dev-gov" + aws_account_id = "229685449397" + aws_region = "us-gov-west-1" + vpc_name = "csvd-dev-ew-vpc-01" + # ... cluster-specific overrides +} + +locals { + maintenance_window = { + allowed_days = ["Tuesday", "Wednesday"] + allowed_hours = { start = 2, end = 6 } # UTC + blackout_dates = [] + } +} +``` + +Each entry file serves two roles simultaneously: +1. **Workload metadata** — authoritative record of the configuration CSVD intends for + this workload instance (versions, account, region, VPC, overrides) +2. **Injection location map** — specifies which account repo this workload's rendered HCL + was written into, and under which subfolder + +The `workloads/` tree as a whole is the **fleet map**: every workload CSVD manages has +an entry here. No external database, no spreadsheet. The source files are the inventory. + +--- + +## Account Repo Layout + +Each provisioned workload writes its rendered HCL into a folder inside a per-account +GHE repo. The folder path follows the account repo layer conventions: + +``` +{account-id}-{account-name}_apps-{team}/ +└── {product_type}/ + └── {workload-name}/ + ├── .sc-automation.yml # Written by Proposer; drives webhook executor + ├── config.json # Workload metadata (product_type, version pinned) + └── {region}/ + ├── remote_state.yml + └── {rendered HCL files} +``` + +**One account repo per account per team prefix** (e.g. `_apps-adsd-eks`, `_apps-csvd-platform`). 
+Multiple workload types and multiple instances of the same type can coexist in the same +account repo in separate subdirectories. + +--- + +## Separation of Concerns + +| Layer | Owner | Purpose | +|-------|-------|---------| +| Account repo (`{account}_apps-{team}/`) | Tenant team (read), CSVD (write via PR) | Source of truth for workload HCL config | +| `terraform-sc-fleet/workloads/` | CSVD | Central manifest; drives `tf apply` per workload | +| Product workspace (`terraform-eks-deployment`, etc.) | CSVD | Shared rendering logic and version defaults per product type | +| sc-lambda-ghactions Lambda + CodeBuild | CSVD | Provisioning UI; creates repo + initial config; webhook executor (initial rollout) | + +--- + +## Fleet Operations + +### Single-workload update + +```bash +cd terraform-sc-fleet/workloads/eks_cluster/dev/csvd/csvd-dev-mcm +source ~/aws-creds && tf apply +``` + +Opens a PR in that workload's account repo with the updated rendered HCL. Zero blast +radius to other workloads. + +### Fleet-wide update (`update_fleet.py`) + +```bash +# All workloads (dry run first) +python scripts/update_fleet.py --dry-run + +# All EKS clusters, dev lifecycle only +python scripts/update_fleet.py --product-type eks_cluster --lifecycle dev + +# Production workloads (requires --force) +python scripts/update_fleet.py --lifecycle prod --force + +# Filter by team +python scripts/update_fleet.py --team adsd + +# Filter by name substring +python scripts/update_fleet.py --filter csvd-lab +``` + +The script: +1. Walks `workloads/**/**/main.tf` recursively +2. Applies `--product-type`, `--lifecycle`, `--team`, `--filter` selectors +3. Checks `maintenance_window` locals — skips workloads outside their window unless `--force` +4. Runs `tf apply` (or `tf plan` for `--dry-run`) per workload +5. Reports per-workload success/failure with PR URLs + +**A version bump across 20 clusters is a one-liner.** Every additional workload costs CSVD +zero marginal effort for fleet-wide operations. 
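The selection and window steps listed above for `update_fleet.py` could look roughly like the following. This is a sketch under stated assumptions, not the script itself. The names `Workload`, `parse_entry`, `selected`, and `in_window` are hypothetical. It assumes the `maintenance_window` local has already been parsed into a plain dict, that `allowed_hours` is a half-open UTC range (`start` inclusive, `end` exclusive), and that `blackout_dates` are ISO dates.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import PurePosixPath
from typing import Optional

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")


@dataclass(frozen=True)
class Workload:
    product_type: str
    lifecycle: str
    team: str
    name: str


def parse_entry(path: str) -> Optional[Workload]:
    """Map workloads/{product_type}/{lifecycle}/{team}/{name}/main.tf to a Workload."""
    parts = PurePosixPath(path).parts
    if len(parts) != 6 or parts[0] != "workloads" or parts[-1] != "main.tf":
        return None  # not a fleet entry file
    return Workload(*parts[1:5])


def selected(w: Workload, product_type: Optional[str] = None,
             lifecycle: Optional[str] = None, team: Optional[str] = None,
             name_filter: Optional[str] = None) -> bool:
    """Apply the --product-type, --lifecycle, --team and --filter selectors."""
    return (
        (product_type is None or w.product_type == product_type)
        and (lifecycle is None or w.lifecycle == lifecycle)
        and (team is None or w.team == team)
        and (name_filter is None or name_filter in w.name)
    )


def in_window(window: dict, now: datetime) -> bool:
    """Return True if `now` (timezone-aware) falls inside the maintenance window."""
    now = now.astimezone(timezone.utc)
    if now.date().isoformat() in window.get("blackout_dates", []):
        return False
    if DAY_NAMES[now.weekday()] not in window["allowed_days"]:
        return False
    hours = window["allowed_hours"]
    return hours["start"] <= now.hour < hours["end"]
```

Keeping path parsing, selection, and window checks as separate pure functions would let a dry run report why each workload was skipped without invoking Terraform.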
+ +### Maintenance windows + +Each entry file declares an optional `maintenance_window` local: + +```hcl +locals { + maintenance_window = { + allowed_days = ["Tuesday", "Wednesday"] + allowed_hours = { start = 2, end = 6 } # UTC + blackout_dates = ["2026-06-15", "2026-06-16"] + } +} +``` + +`update_fleet.py` reads this before each apply and skips out-of-window workloads. +Customers request a blackout window by opening a PR to their account repo modifying +`.sc-automation.yml` or by contacting CSVD to update the entry file. No out-of-band +emails or calendar coordination required. + +--- + +## Governance Controls + +### CODEOWNERS in provisioned account repos + +The Proposer build commits a `CODEOWNERS` file into every account repo it creates, +via `managed_extra_files` in the Terraform product workspace: + +``` +# CSVD owns all managed workload configuration +{product_type}/ @SCT-Engineering/csvd-platform-admins +``` + +Platform engineers in other teams may open PRs but cannot merge without CSVD approval. + +### Branch protection + +Branch protection (require PR, require CSVD review, no direct push to `main`) is set +at provisioning time via the `CSVD/terraform-github-repo` module call in each product +workspace. Every repo provisioned through sc-lambda-ghactions automatically gets these +rules at creation. + +### CODEOWNERS in `terraform-sc-fleet` + +The fleet repo itself uses a hierarchy-aware CODEOWNERS: + +``` +# Production workloads require senior review +workloads/*/prod/ @SCT-Engineering/csvd-senior-platform-admins + +# Dev/sandbox workloads can be approved by any CSVD engineer +workloads/*/dev/ @SCT-Engineering/csvd-platform-admins +``` + +--- + +## Fleet Workspace (`fleet.code-workspace`) + +A VS Code workspace file that includes all managed account repos and the fleet manifest +gives a CSVD engineer a full fleet view in a single editor window: + +```json +{ + "folders": [ + { "name": "fleet-manifest", "path": "." 
}, + { "name": "eks: csvd-dev-mcm", "path": "~/git/account-repos/229685449397-csvd-dev-gov_apps-adsd-eks" }, + { "name": "eks: adsd-tools-dev", "path": "~/git/account-repos/066884702657-ent-gov-shared-sa_apps-adsd-eks" }, + { "name": "s3: csvd-artifacts", "path": "~/git/account-repos/229685449397-csvd-dev-gov_apps-csvd-platform" } + // ... one entry per managed workload + ] +} +``` + +**This file is auto-generated** by a script in `terraform-sc-fleet` that is triggered +on every push to `main`. The script walks `workloads/**/**/main.tf`, extracts `repo_name` +and `workload_name`, and writes `fleet.code-workspace`. No operator ever edits it manually. + +> In the initial rollout this is a CodeBuild project triggered by a webhook. GHA +> workflows will replace it when the GHA executor rollout phase is complete. + +With this workspace open, a CSVD engineer can: +- See all workload configs side-by-side in the Explorer without navigating repos +- Ask Copilot fleet questions across all files at once: + _"Which EKS clusters are not on version 1.31?"_ + _"Show me all prod workloads and their maintenance windows"_ +- Grep across all workload configs simultaneously +- Open PRs to specific workload folders directly from the editor + +--- + +## AI Agents for Fleet Operations + +Because all workload config is declarative files in structured repos, AI agents can answer +operational questions without any custom database or API — **the workspace is the inventory**. + +### `sc-fleet` — Fleet Operator Agent + +Scoped to `fleet.code-workspace`. Answers operational questions across all managed workloads. + +Representative prompts: +- _"Which EKS clusters are not on version 1.31?"_ +- _"Show me all workloads in us-gov-east-1 and their account names"_ +- _"What's the maintenance window for adsd-tools-dev?"_ +- _"Which workloads have a pending update PR open right now?"_ + +### `sc-upgrade` — Version Bump Planning Agent + +Scoped to the relevant product workspace (e.g. 
`terraform-eks-deployment`). Plans and +validates fleet-wide or targeted version changes before applying. + +Representative prompts: +- _"Plan an upgrade of EKS to 1.31 for all dev clusters"_ +- _"Which workloads can receive an update today based on their maintenance windows?"_ +- _"Show me the tf plan diff for bumping the S3 module version fleet-wide"_ + +### `sc-pr-reviewer` — Customer PR Review Agent + +Injected into each account repo via `managed_extra_files` as a `.github/copilot-instructions.md`. +Automatically summarizes incoming customer PRs and flags governance violations before +a CSVD engineer reviews. + +Representative uses (triggered by a CodeBuild build on PR open, or invoked manually): +- Classifies all changed fields and flags any that are CSVD-owned +- Determines if the change requires a maintenance window +- Produces a one-sentence plain-English summary for the CSVD reviewer + +### `sc-provisioner` — Provisioning Debug Agent + +Scoped to `sc-lambda-ghactions`. Helps debug provisioning failures and validate SC inputs. + +Representative prompts: +- _"The SC product failed — here's the CFN error. 
What went wrong?"_ +- _"Validate these SC input parameters before I submit"_ +- _"What HCL files would be generated for this cluster config?"_ + +--- + +## Proposed Skills (for `~/.copilot/skills/`) + +| Skill | Trigger phrases | What it does | +|-------|----------------|-------------| +| `sc-fleet-query` | "fleet status", "which workloads", "show me all" | Parses `workloads/**/**/main.tf`, returns structured inventory; accepts `--product-type`, `--filter`, `--field` | +| `sc-maintenance-check` | "maintenance window", "can I update", "what's due today" | Reads `maintenance_window` locals, returns workloads eligible for update on a given date | +| `sc-upgrade-planner` | "plan upgrade", "bump version" | Calls `update_fleet.py --dry-run`, returns per-workload plan summary; flags closed maintenance windows | +| `sc-pr-summary` | "review PR", "summarize this diff" | Fetches PR diff via GHE API, classifies changed fields, returns one-sentence summary + governance flag list | + +--- + +## Work Plan + +### Phase 1 — Create `terraform-sc-fleet` repo + +- [ ] Create `SCT-Engineering/terraform-sc-fleet` +- [ ] Move existing `terraform-eks-deployment/clusters/` entries into + `workloads/eks_cluster/{lifecycle}/{team}/{name}/main.tf` +- [ ] Update module source paths from `../../` to versioned external module reference +- [ ] Add `README.md`, `scripts/update_fleet.py` skeleton +- [ ] Add CodeBuild project to regenerate `fleet.code-workspace` on push to `main` *(GHA workflow planned for later rollout)* + +### Phase 2 — Wire sc-lambda-ghactions Proposer to write fleet entries + +- [ ] After Proposer creates the account repo and opens the PR, also commit a new + `workloads/{product_type}/{lifecycle}/{team}/{name}/main.tf` entry to `terraform-sc-fleet` +- [ ] SC form adds optional `team` and `lifecycle` parameters (default: `dev` + name-prefix heuristic) +- [ ] Lambda threads `team` and `lifecycle` to the Proposer CodeBuild build as `environmentVariablesOverride` + +### Phase 3 — 
Governance controls at provisioning time + +- [ ] Add `CODEOWNERS` and branch protection to every provisioned account repo + via `managed_extra_files` in each product workspace +- [ ] Add CODEOWNERS to `terraform-sc-fleet` scoped by lifecycle + +### Phase 4 — Fleet-wide update automation + +- [ ] Complete `scripts/update_fleet.py` with `--product-type`, `--lifecycle`, `--team`, + `--filter`, `--dry-run`, `--force` flags +- [ ] Add maintenance window parsing (`maintenance_window` locals) +- [ ] Add `scripts/maintenance_check.py` for window-aware eligibility reporting +- [ ] Wire a CodeBuild project as headless fleet runner (optional) + +### Phase 5 — AI agents and skills + +- [ ] Add `fleet.code-workspace` auto-generation CodeBuild project *(GHA workflow planned for later rollout)* +- [ ] Add copilot instructions to `terraform-sc-fleet` scoped for fleet operator queries +- [ ] Define `sc-fleet-query` and `sc-maintenance-check` skills under `~/.copilot/skills/` +- [ ] Add `.github/copilot-instructions.md` to provisioned account repos via `managed_extra_files` + +--- + +## Open Questions + +| # | Question | Owner | +|---|----------|-------| +| 1 | One account repo per workload type, or one per account? | Manuel / Don | +| 2 | Auto-merge for fleet version bumps in dev lifecycle, or always require review? | Matthew / Manuel | +| 3 | Who is CODEOWNER on `main` for each product type — a team or named individuals? | Manuel | +| 4 | Fleet-wide updates: CodeBuild headless runner, or CSVD engineer runs `update_fleet.py` manually? 
| David / Matthew | + +--- + +## Non-Goals + +- Customers self-managing Terraform — CSVD owns all Terraform execution +- Per-customer forks of product workspaces — single central workspace per product type +- Moving workload config to a database or external registry — `workloads/**` is the registry diff --git a/docs/generalized-terraform-product-architecture.md b/docs/generalized-terraform-product-architecture.md new file mode 100644 index 0000000..71b9772 --- /dev/null +++ b/docs/generalized-terraform-product-architecture.md @@ -0,0 +1,246 @@ +# Generalized Terraform Product Architecture + +**Date:** 2026-05-19 +**Status:** Proposed +**Audience:** Platform Engineering stakeholders +**Context:** Expanding the Service Catalog automation system beyond EKS to support any arbitrary Terraform template repo + +--- + +## Summary + +The Service Catalog (SC) automation system was originally built to create EKS cluster +GitHub repositories. This document describes a path to generalize that system so that +**any Terraform workload** — S3 buckets, RDS databases, VPCs, IAM roles, etc. — can be +onboarded as a new SC product with minimal engineering effort. + +The core Lambda infrastructure, webhook handler, and CodeBuild executor are already +workload-agnostic. The changes required to support a new product type are scoped to: + +1. A **template repo** on GitHub Enterprise +2. A set of **Jinja2 HCL/YAML templates** for the rendered files +3. A **Pydantic config model** describing the product's inputs +4. A **CloudFormation product template** for the Service Catalog form +5. A **census config YAML** to register the product in the portfolio + +No changes to the Lambda runtime, CodeBuild projects, or webhook infrastructure +are needed after the initial generalization work is complete. 
+ +--- + +## Current State (EKS-only) + +``` +SC Console (user fills EKS form) + └─> CFN Stack (Custom::GitHubRepository) + └─> Lambda (eks-terragrunt-repo-gen-template-automation) + ├─> Validates EKS-specific inputs (Pydantic model) + ├─> Fetches GHE token from Secrets Manager + ├─> Triggers executor CodeBuild build + └─> Polls build → returns repo URL + PR URL to CFN +``` + +The Lambda and CodeBuild executor are tightly coupled to EKS field names +(`cluster_name`, `vpc_name`, `vpc_domain_name`, etc.) and the +`template-eks-cluster` template repo. + +--- + +## Target State (Any Terraform Workload) + +``` +SC Console (user fills product form — any workload type) + └─> CFN Stack (Custom::TerraformRepo) + └─> Lambda (sc-template-automation) [shared, central] + ├─> Reads product_type from CFN properties + ├─> Routes to the correct Pydantic model + template set + ├─> Triggers executor CodeBuild build + └─> Returns repo URL + PR URL to CFN + +GitHub Enterprise (any account repo) + └─> push to main + └─> Lambda webhook handler (existing, already generic) + └─> Reads .sc-automation.yml → starts executor build +``` + +The Lambda becomes a **dispatcher**: `product_type` is a single new field in the +CFN `Properties` block that routes the request to the correct handler. 
+ +--- + +## What Is Already Generic + +The following components require **no changes** to support new product types: + +| Component | Why it is already generic | +|---|---| +| Webhook handler | Reads `.sc-automation.yml` from any repo; no workload awareness | +| `.sc-automation.yml` schema | `layer`, `region_dir`, `target_account_id` are workload-agnostic | +| Executor CodeBuild project | Runs `tf apply` in any Terraform workspace; env vars are injected at build time | +| HMAC signature verification | Workload-agnostic GHE push event handling | +| GHE commit status writeback | Writes ✅/❌ to any repo's merge commit | +| Lambda Function URL | Single entry point; no per-product URLs needed | + +--- + +## What Changes for Each New Product + +### 1. Template repo on GHE + +Create a new repo under `SCT-Engineering/` (e.g. `template-s3-bucket`) that follows +the standard account repo directory layout. This repo is cloned by the executor +CodeBuild build and serves as the starting point for rendered files. + +The template repo must contain: +- Standard `.tf-control`, `.tf-control.tfrc`, `region.tf`, `credentials.d/`, `variables.d/` +- Layer directories (`common/`, `infrastructure/`, `vpc/`) as applicable +- `remote_state.yml` stubs that the Proposer build will populate + +### 2. Jinja2 templates + +Add a new subdirectory under `lambda/templates/{product_type}/` containing the +`.tf.j2` and `.hcl.j2` files that are rendered by the Proposer build before being +committed to the new repo branch. + +``` +lambda/templates/ +├── eks_cluster/ # existing +│ ├── infrastructure/west/cluster.tf.j2 +│ └── ... +├── s3_bucket/ # new +│ ├── infrastructure/west/s3.tf.j2 +│ └── ... +└── {future_product}/ # pattern +``` + +### 3. 
Pydantic config model + +Add a new model in `lambda/models/{product_type}.py`: + +```python +from typing import Literal + +from pydantic import BaseModel + + +class S3BucketConfig(BaseModel): + """Input model for S3 bucket SC product.""" + bucket_name: str + account_name: str + aws_account_id: str + environment: Literal["dev", "test", "prod"] + aws_region: str = "us-gov-west-1" + versioning_enabled: bool = True + lifecycle_days: int = 90 + team: str + workload: str + tier: str + partition: str = "gov" +``` + +The model enforces required fields and default values before any CodeBuild build is started. + +### 4. Lambda dispatcher + +A single routing table maps `product_type` to the correct handler: + +```python +PRODUCT_HANDLERS = { + "eks_cluster": handle_eks, + "s3_bucket": handle_s3, + # future: "rds_postgres": handle_rds +} + +def handle_create(props: dict): + product_type = props.get("product_type", "eks_cluster") # default: backward-compat + handler = PRODUCT_HANDLERS.get(product_type) + if not handler: + raise ValueError(f"Unknown product_type: {product_type}") + return handler(props) +``` + +This is a **one-time change** to `lambda/app.py`. After it is in place, adding a new +product type requires only a new entry in the table and a new handler function, with no +other Lambda changes. + +### 5. CloudFormation product template + +Create a new `service-catalog/{product_type}-product-template.yaml`. 
The template +follows the same pattern as the EKS product template: + +- Parameters for user-facing form fields +- A single `Custom::TerraformRepo` resource +- Properties passed in `snake_case` to avoid the PascalCase normalizer issue +- `product_type` included as a static string property +- `aws_account_id` and `aws_region` resolved via `!Sub` — not user-facing parameters + +```yaml +Properties: + ServiceToken: !Sub "arn:${AWS::Partition}:lambda:${AWS::Region}:${AWS::AccountId}:function:sc-template-automation" + product_type: s3_bucket + bucket_name: !Ref BucketName + account_name: !Ref AccountName + aws_account_id: !Sub "${AWS::AccountId}" + environment: !Ref Environment + team: !Ref Team + workload: !Ref Workload + tier: !Ref Tier +``` + +### 6. Census config YAML (portfolio registration) + +Add a new YAML file under `terraform-service-catalog-census/templates/products/{product_type}/` +to register the product in the SC portfolio. This follows the same structure as the +existing EKS product config. 
+ +--- + +## Onboarding Checklist for a New Product Type + +The following checklist can be handed to a product team or platform engineer to +onboard any new Terraform workload without Lambda or CodeBuild changes: + +- [ ] Create `SCT-Engineering/template-{product_type}` repo from the standard account repo skeleton +- [ ] Add `lambda/templates/{product_type}/` with Jinja2 templates for each rendered file +- [ ] Add `lambda/models/{product_type}.py` with a Pydantic model defining required inputs +- [ ] Register the handler in `lambda/app.py` `PRODUCT_HANDLERS` table +- [ ] Create `service-catalog/{product_type}-product-template.yaml` CFN template +- [ ] Add census config YAML and SC portfolio registration in `terraform-service-catalog-census` +- [ ] Test end-to-end via `scripts/test_service_catalog.py` with the new product type +- [ ] Confirm `.sc-automation.yml` is written correctly by the Proposer build + +--- + +## Example: S3 Bucket Product + +An S3 bucket product would work as follows end-to-end: + +1. Platform engineer opens Service Catalog, selects **S3 Bucket Repository Creator** +2. Fills in: `bucket_name`, `team`, `workload`, `environment`, `tier` +3. CloudFormation creates a `Custom::TerraformRepo` stack with `product_type: s3_bucket` +4. Lambda validates inputs against `S3BucketConfig`, renders S3 Jinja2 templates +5. Proposer CodeBuild clones `template-s3-bucket`, commits rendered HCL, opens PR +6. CFN stack outputs: `repository_url`, `pull_request_url` +7. Platform engineer reviews and merges PR +8. Webhook fires → Lambda reads `.sc-automation.yml` → starts executor build +9. Executor applies S3 Terragrunt config; posts ✅ commit status on merge commit + +The platform engineer never leaves GitHub or Service Catalog — there is no manual executor step. + +--- + +## Migration Path for Existing EKS Product + +The EKS product continues to work without modification. 
The `product_type` field defaults +to `eks_cluster` when absent, preserving backward compatibility with any existing +CloudFormation stacks or SC provisioned products. + +--- + +## Infrastructure Cost of Generalization + +| Resource | Current | After generalization | +|---|---|---| +| Lambda functions | 1 (EKS-only) | 1 (shared dispatcher) | +| CodeBuild projects | 2 (builder + creator) | 2 (no change) | +| Secrets Manager secrets | 2 (GHE tokens) + 1 (webhook) | No change | +| Lambda Function URL | 1 | No change | +| ECR repositories | 1 | No change | + +There is **no additional AWS infrastructure cost** to add new product types. Each new +product type is purely a code and configuration change. diff --git a/docs/repo-vars-and-secrets.md b/docs/repo-vars-and-secrets.md new file mode 100644 index 0000000..96686fc --- /dev/null +++ b/docs/repo-vars-and-secrets.md @@ -0,0 +1,250 @@ +# Repository Variables and Secrets Management + +**Ported from:** `lambda-template-repo-generator/design-docs/REPO_VARS_AND_SECRETS.md` +**Updated for:** sc-lambda-ghactions (CodeBuild-based initial rollout; GHA planned for later) + +This document describes how environment variables and secrets are made available +to CodeBuild builds started by the sc-lambda-ghactions Lambda. + +In the initial CodeBuild-based rollout, secrets and configuration values are +injected directly as CodeBuild environment variable overrides at build-start time +(via `environmentVariablesOverride` in the `StartBuild` API call). AWS Parameter +Store and Secrets Manager values are fetched by the Lambda and passed through, or +read directly by the CodeBuild buildspec at runtime. + +> **Later rollout (GHA):** When GitHub Actions workflows replace CodeBuild as the +> executor, the mechanism shifts to GitHub Actions secrets and variables set via +> the GHE API. The SSM/Secrets Manager parameter structure described below is +> designed to support both models. 
+ +--- + +## Overview + +The Proposer CodeBuild build has access to: + +1. **Secrets** — read from AWS Secrets Manager; injected as CodeBuild env var overrides at build-start time or fetched in the buildspec via `aws secretsmanager get-secret-value` +2. **Configuration values** — read from AWS Parameter Store; fetched in the buildspec via `aws ssm get-parameter` + +Both are scoped by: +- **Global** — applied to every account repo regardless of product type +- **By product type** — applied only to repos of a specific `product_type` + +--- + +## AWS Parameter Store Structure + +``` +/sc-template-automation/ + ├── variables/ + │ ├── global/ # Variables set on every new repo + │ │ ├── AWS_REGION # e.g. us-gov-west-1 + │ │ └── TERRAFORM_VERSION # e.g. 1.9.1 + │ └── by-type/ # Variables by product_type + │ ├── eks_cluster/ + │ │ ├── CLUSTER_VERSION + │ │ └── NODE_TYPE + │ └── s3_bucket/ + │ └── ... +``` + +## AWS Secrets Manager Structure + +``` +sc-template-automation/ + ├── secrets/global/ # Secrets set on every new repo + │ └── AWS_ACCESS_KEY_ID # (if needed by CodeBuild buildspec) + └── secrets/by-type/ # Secrets by product_type + ├── eks_cluster/ + │ └── KUBECONFIG + └── s3_bucket/ + └── ... 
+``` + +--- + +## Lambda Infrastructure + +### IAM Permissions + +The Lambda execution role requires: + +```hcl +data "aws_iam_policy_document" "secrets_access" { + statement { + effect = "Allow" + actions = ["secretsmanager:GetSecretValue"] + resources = [ + "arn:${data.aws_partition.current.partition}:secretsmanager:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:secret:sc-template-automation/*" + ] + } + + # ListSecrets does not support resource-level permissions, so it must be granted on "*" + statement { + effect = "Allow" + actions = ["secretsmanager:ListSecrets"] + resources = ["*"] + } +} + +data "aws_iam_policy_document" "ssm_access" { + statement { + effect = "Allow" + actions = ["ssm:GetParameter", "ssm:GetParameters", "ssm:GetParametersByPath"] + resources = [ + "arn:${data.aws_partition.current.partition}:ssm:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:parameter/sc-template-automation/*" + ] + } +} +``` + +### Lambda Environment Variables + +```hcl +environment { + variables = { + PARAM_STORE_PREFIX = "/sc-template-automation" + SECRETS_PREFIX = "sc-template-automation" + } +} +``` + +--- + +## Implementation — Building CodeBuild `environmentVariablesOverride` + +> **Note:** In the CodeBuild-based rollout, there is **no GHE repo secrets/variables API involved**. +> Secrets and configuration values are fetched by the Lambda at invocation time and passed +> directly to CodeBuild as `environmentVariablesOverride`. The GHE repo secrets approach +> is only relevant to the planned later GHA-based rollout. 
+ +The helper `build_env_overrides()` in `lambda/env_builder.py` assembles the override list: + +```python +import boto3 + +ssm = boto3.client("ssm", region_name="us-gov-west-1") +secretsmanager = boto3.client("secretsmanager", region_name="us-gov-west-1") + +PARAM_PREFIX = "/sc-template-automation" +SECRET_PREFIX = "sc-template-automation" + + +def _get_ssm_path(path: str) -> dict[str, str]: + """Return {name: value} for all SSM parameters under the given path.""" + paginator = ssm.get_paginator("get_parameters_by_path") + result = {} + for page in paginator.paginate(Path=f"{PARAM_PREFIX}/{path}", WithDecryption=True): + for p in page["Parameters"]: + name = p["Name"].split("/")[-1] + result[name] = p["Value"] + return result + + +def _get_secrets_arns(path: str) -> dict[str, str]: + """Return {name: ARN} for all Secrets Manager secrets under the given prefix. + + Only list_secrets is called, so the Lambda never reads the secret values. + """ + paginator = secretsmanager.get_paginator("list_secrets") + result = {} + for page in paginator.paginate(Filters=[{"Key": "name", "Values": [f"{SECRET_PREFIX}/{path}"]}]): + for s in page["SecretList"]: + name = s["Name"].split("/")[-1] + result[name] = s["ARN"] + return result + + +def build_env_overrides(product_type: str) -> list[dict]: + """ + Return a list of CodeBuild environmentVariablesOverride dicts containing: + - All global SSM variables + - All product-type SSM variables + - All global Secrets Manager secrets (type=SECRETS_MANAGER passed by ref) + - All product-type Secrets Manager secrets + """ + overrides = [] + + # Plain-text variables from SSM (fetched by Lambda, passed as PLAINTEXT) + for name, value in { + **_get_ssm_path("variables/global"), + **_get_ssm_path(f"variables/by-type/{product_type}"), + }.items(): + overrides.append({"name": name, "value": value, "type": "PLAINTEXT"}) + + # Secrets: passed as a Secrets Manager ARN reference so CodeBuild fetches at build time + # This avoids the Lambda ever holding 
plaintext secret values in memory beyond SSM calls. + for name, arn in { + **_get_secrets_arns("secrets/global"), + **_get_secrets_arns(f"secrets/by-type/{product_type}"), + }.items(): + overrides.append({"name": name, "value": arn, "type": "SECRETS_MANAGER"}) + + return overrides +``` + +> **`SECRETS_MANAGER` type:** When CodeBuild receives an env var with `type=SECRETS_MANAGER`, +> it resolves the value (an ARN) at build-start time using the CodeBuild service role — +> the Lambda never sees the plaintext secret value. + +### Integration in the Lambda Handler + +```python +def handle_create(props: dict): + product_type = props["product_type"] + # ... validate inputs (Pydantic), identify template repo ... + + # Build env var overrides from SSM + Secrets Manager + env_overrides = build_env_overrides(product_type) + + # Add per-invocation values from CFN properties + env_overrides += [ + {"name": "PRODUCT_TYPE", "value": product_type, "type": "PLAINTEXT"}, + {"name": "REPO_NAME", "value": props["project_name"], "type": "PLAINTEXT"}, + {"name": "ENVIRONMENT", "value": props["environment"], "type": "PLAINTEXT"}, + {"name": "AWS_ACCOUNT_ID","value": props["aws_account_id"],"type": "PLAINTEXT"}, + {"name": "AWS_REGION", "value": props["aws_region"], "type": "PLAINTEXT"}, + ] + + codebuild.start_build( + projectName=PROPOSER_PROJECT, + environmentVariablesOverride=env_overrides, + ) +``` + +--- + +## Populating Secrets and Variables + +### Add a global variable (all repos) + +```bash +export AWS_DEFAULT_REGION=us-gov-west-1 +aws ssm put-parameter \ + --name "/sc-template-automation/variables/global/TERRAFORM_VERSION" \ + --value "1.9.1" \ + --type "String" +``` + +### Add a product-type-specific secret + +```bash +export AWS_DEFAULT_REGION=us-gov-west-1 +aws secretsmanager create-secret \ + --name "sc-template-automation/secrets/by-type/eks_cluster/KUBECONFIG" \ + --secret-string "..." 
+``` + +--- + +## Security Considerations + +- **Encryption at rest:** All secrets are encrypted at rest in Secrets Manager with AWS-managed keys +- **Least privilege:** Lambda role scoped to `sc-template-automation/*` prefix only +- **Audit trail:** CloudTrail records all `GetSecretValue` and `GetParameter` calls +- **Repository isolation:** In the CodeBuild-based rollout, secrets reach a build only as + `environmentVariablesOverride` entries; they are not stored in the Lambda or committed + to the repo. Per-repo secrets set via the GHE API apply only to the later GHA rollout +- **No plaintext in Lambda env:** Secrets are fetched at runtime, not baked into + the container image or Lambda environment variables + +--- + +## Future Enhancements + +- **Secret rotation:** Implement automatic rotation for long-lived credentials +- **Environment-scoped secrets:** Dev/test/prod variants of secrets per repo +- **Organization-level variables:** For the later GHA rollout, push shared variables once + to org level instead of per-repo, reducing GHE API call volume +- **Validation rules:** Reject variable names that conflict with CodeBuild reserved + names (e.g. `CODEBUILD_*`, `AWS_*` built-ins) diff --git a/docs/service-catalog-census-integration.md b/docs/service-catalog-census-integration.md new file mode 100644 index 0000000..3502c7a --- /dev/null +++ b/docs/service-catalog-census-integration.md @@ -0,0 +1,312 @@ +# Service Catalog Census Integration + +**Ported and generalized from:** `lambda-template-repo-generator/design-docs/SERVICE_CATALOG_CENSUS_INTEGRATION.md` +**Updated for:** sc-lambda-ghactions (CodeBuild-based initial rollout; GHA planned for later) +**Date:** 2026-05-19 +**Status:** DRAFT + +--- + +## Executive Summary + +This document covers how sc-lambda-ghactions products are registered in the +`terraform-service-catalog-census` repo, which manages all enterprise Service Catalog +portfolios and products via Terragrunt. Each new product type (EKS cluster, S3 bucket, +RDS instance, etc.) requires entries in the census repo to become available in the SC +console org-wide. 
+ +The census integration is designed for **enterprise-wide deployment from the outset**. +Every resource is classified by deployment scope — central (Lambda, ECR), StackSet (launch +roles), or census-managed (portfolios, products, constraints) — and handled accordingly. + +--- + +## System Layout + +### sc-lambda-ghactions system (4 repos) + +``` +sc-lambda-ghactions/ ← Lambda + CodeBuild buildspecs + SC product templates +├── lambda/app.py ← Lambda handler (dispatcher by product_type) +├── lambda/models/{product_type}.py ← Pydantic input models per product type +├── lambda/templates/{product_type}/ ← Jinja2 HCL templates per product type +├── service-catalog/{product_type}-product-template.yaml ← CFN product template +└── deploy/ ← Terraform: Lambda, ECR, IAM, Function URL + +terraform-sc-fleet/ ← Fleet operations manifest (all managed workloads) +packer-pipeline/ ← Container build CLI +template-{product_type}/ ← Template repos (one per product type) +``` + +### `terraform-service-catalog-census` (census repo) + +``` +terraform-service-catalog-census/ +├── main-modules/service-catalog/ ← Main Terraform module +├── modules/ +│ ├── sc-portfolio/ ← Portfolio + principal association +│ ├── sc-product/ ← Product + S3 upload + versioning +│ └── cfn-roles-actions/ ← Launch roles via CFN StackSets +├── templates/ +│ ├── products/ +│ │ ├── eks-terragrunt-repo/ ← CFN product template (versioned YAMLs) +│ │ ├── s3-bucket-repo/ ← (planned) +│ │ └── {product-type}-repo/ ← pattern +│ └── role-templates/ ← IAM launch role CFN snippets +├── non-prod/csvd-dev/west/ +│ ├── configurations/ +│ │ ├── portfolios/*.yaml.tftpl ← Portfolio definitions +│ │ └── products/**/*.yaml.tftpl ← Product definitions +│ └── service-catalog/ +└── prod/operations-gov/ ← Prod (shares to org) +``` + +--- + +## Resource Classification + +Every resource falls into one of three deployment tiers: + +| Tier | What | Deployment mechanism | Scope | +|------|------|---------------------|-------| +| **Central** 
| Lambda, ECR, Secrets Manager, GHE token, Function URL | `sc-lambda-ghactions/deploy/` (`tf apply`) | csvd-dev only | +| **StackSet** | IAM launch role per product type | `cfn-roles-actions` StackSet via census repo | All OU-shared accounts | +| **Census-managed** | SC portfolio, product, provisioning artifact, constraints | YAML config in census repo → `terragrunt apply` | SC admin account + shared OUs | + +--- + +## Step 1 — Central Infrastructure (`sc-lambda-ghactions/deploy/`) + +The Lambda is centralized in csvd-dev. CloudFormation in any org account invokes +it cross-account via the `ServiceToken` ARN. + +**Lambda resource policy** (allows any account in the org): +```hcl +resource "aws_lambda_permission" "cloudformation_org" { + statement_id = "AllowCloudFormationOrgInvoke" + action = "lambda:InvokeFunction" + function_name = aws_lambda_function.sc_automation.function_name + principal = "cloudformation.amazonaws.com" + # aws_lambda_permission has no condition block; principal_org_id sets the + # aws:PrincipalOrgID condition on the generated policy statement + principal_org_id = var.org_id +} +``` + +No per-account Lambda deployment is needed. Provisioners never need the Lambda locally; +their CloudFormation stack calls it cross-account via the `ServiceToken`. + +--- + +## Step 2 — IAM Launch Roles (StackSet) + +One IAM launch role is required **per product type** in every account that will +provision the product via SC. These are deployed via the `cfn-roles-actions` StackSet, +which auto-deploys to all accounts in shared OUs. 
+ +### Launch role template (per product type) + +Add a file to `templates/role-templates/`: + +```yaml +# templates/role-templates/sc-{product_type}-launch-role.yaml +Type: AWS::IAM::Role +Properties: + RoleName: !Sub "r-ent-servicecatalog-${ProductType}-sc-launch-role" + AssumeRolePolicyDocument: + Statement: + - Effect: Allow + Principal: { Service: servicecatalog.amazonaws.com } + Action: sts:AssumeRole + Policies: + - PolicyName: InvokeCentralLambda + PolicyDocument: + Statement: + - Effect: Allow + Action: lambda:InvokeFunction + Resource: !Sub "arn:${AWS::Partition}:lambda:${LambdaRegion}:${CentralAccountId}:function:sc-template-automation" + - Effect: Allow + Action: [cloudformation:*, s3:GetObject] + Resource: "*" +``` + +### Registering in `roles.yaml.tftpl` + +```yaml +# non-prod/csvd-dev/west/configurations/roles.yaml.tftpl +- template: sc-eks-cluster-launch-role.yaml # existing + parameters: + - parameter: CentralAccountId + value: "229685449397" + - parameter: LambdaRegion + value: us-gov-west-1 + - parameter: ProductType + value: eks_cluster + +- template: sc-s3-bucket-launch-role.yaml # new product type + parameters: + - parameter: CentralAccountId + value: "229685449397" + - parameter: LambdaRegion + value: us-gov-west-1 + - parameter: ProductType + value: s3_bucket +``` + +**Only one `terragrunt apply`** is needed after adding a new role entry. The StackSet +propagates to all shared accounts automatically via `auto_deployment { enabled = true }`. + +--- + +## Step 3 — Census Portfolio and Product Config + +### Portfolio YAML + +Portfolios are defined in `configurations/portfolios/`. The sc-lambda-ghactions products +belong in a single shared portfolio (or alongside existing census portfolios): + +```yaml +# configurations/portfolios/sc-automation.yaml.tftpl +sc_automation: + name: "Service Catalog Automation Portfolio" + description: >- + Self-service infrastructure provisioning via sc-lambda-ghactions. + Supports any Terraform workload type. 
+ provider_name: CSVD + products: + - eks_cluster_repo + - s3_bucket_repo + user_roles: + - /census/*/sc-end-user-role + share_ous: + - name: census-workload-accounts +``` + +### Product YAML (per product type) + +```yaml +# configurations/products/eks-cluster-repo/EKS_CLUSTER_REPO.yaml.tftpl +eks_cluster_repo: + name: "EKS Cluster Repository Creator" + description: >- + Creates a GitHub Enterprise repository with Terragrunt EKS cluster + configuration and opens a review PR. + type: CLOUD_FORMATION_TEMPLATE + distributor: CSVD + support_email: csvd.aws.service.catalog.team.list@census.gov + launch_role: r-ent-servicecatalog-eks-cluster-sc-launch-role + template_constraints: + Parameters: + # Lock the Lambda ARN — users cannot redirect to a different Lambda + ServiceToken: "arn:${Partition}:lambda:us-gov-west-1:229685449397:function:sc-template-automation" + versions: + - name: "1.0.0" + description: "Initial CodeBuild-based version" + file_path: products/eks-cluster-repo/1-0-0.yaml +``` + +### Product template location + +The CFN product template lives at: +``` +templates/products/{product_type}-repo/{version}.yaml +``` + +This is a copy of (or symlink to) `sc-lambda-ghactions/service-catalog/{product_type}-product-template.yaml`. +When a new version of the product template is released, add a new versioned file here +and bump the `versions` list in the product YAML. + +--- + +## Step 4 — Moving the Lambda to a Different Account + +If the central Lambda needs to move to a different AWS account, the following must be +updated. 
**All other components are account-agnostic.** + +| Resource | Location | What changes | +|----------|----------|-------------| +| Lambda + all central infra | `sc-lambda-ghactions/deploy/` | Re-deploy in new account | +| Launch role `lambda:InvokeFunction` ARN | `roles.yaml.tftpl` → `CentralAccountId` parameter | Update to new account ID; one change propagates to all shared accounts via StackSet | +| Template constraint `ServiceToken` | Product YAML `template_constraints` | Update ARN value | +| GitHub token secrets | Secrets Manager in new account | Recreate manually | + +**Migration order:** Update StackSet launch roles (Step 2) → wait for propagation → update +template constraint (Step 3). Reversing the order causes a `lambda:InvokeFunction` permission +denial window. + +### Why parameterizing `CentralAccountId` matters + +For launch roles, the account ID appears only in `roles.yaml.tftpl` under the `CentralAccountId` +parameter. The role template YAML itself is static and account-agnostic. A single value change +propagates to all shared accounts via the StackSet; no role template file needs updating. 
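
The ordering constraint above comes down to two ARNs that must name the same account: the one the launch role is allowed to invoke and the one locked into the `ServiceToken` template constraint. The following is a minimal sketch of a pre-migration check, assuming hypothetical helper names (`central_lambda_arn`, `arn_account_id`, `constraint_matches_launch_role`) that are not part of either repo; the function name and account ID are the ones used in this document.

```python
def central_lambda_arn(partition: str, region: str, account_id: str,
                       function_name: str = "sc-template-automation") -> str:
    """Build the Lambda ARN that launch roles and template constraints refer to."""
    return f"arn:{partition}:lambda:{region}:{account_id}:function:{function_name}"


def arn_account_id(arn: str) -> str:
    """Return the account ID field (fifth colon-separated field) of an ARN."""
    return arn.split(":")[4]


def constraint_matches_launch_role(service_token_arn: str, central_account_id: str) -> bool:
    """True when the ServiceToken constraint targets the account the launch role may invoke."""
    return arn_account_id(service_token_arn) == central_account_id


# Hypothetical usage: the constraint still points at the old account while
# roles.yaml.tftpl has already been moved to a new (made-up) account ID.
old_token = central_lambda_arn("aws-us-gov", "us-gov-west-1", "229685449397")
print(constraint_matches_launch_role(old_token, "229685449397"))  # same account
print(constraint_matches_launch_role(old_token, "111111111111"))  # mismatch during migration
```

A mismatch is expected between the two migration steps; the check is only meant to confirm that both values agree again once the template constraint has been updated.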
+ +--- + +## Adding a New Product Type to the Census Portfolio + +Checklist for each new product type: + +- [ ] Add CFN product template at `templates/products/{product_type}-repo/1-0-0.yaml` +- [ ] Add product YAML at `configurations/products/{product_type}-repo/{PRODUCT}.yaml.tftpl` +- [ ] Add launch role template at `templates/role-templates/sc-{product_type}-launch-role.yaml` +- [ ] Add launch role entry in `roles.yaml.tftpl` +- [ ] Add product key to portfolio YAML `products:` list +- [ ] Run `terragrunt apply` in `non-prod/csvd-dev/west/service-catalog/` +- [ ] Validate: product appears in SC console; end-to-end test from a workload account + +--- + +## Validation Checklist + +### After central Lambda deploy: +- [ ] Lambda resource policy allows org-wide CloudFormation invocation +- [ ] Cross-account test: invoke Lambda from a different account via CFN Custom Resource + +### After StackSet launch role deploy: +- [ ] StackSet instances show `CURRENT` in CloudFormation console for target OUs +- [ ] Launch role exists in at least 2-3 workload accounts (spot check) +- [ ] Role trust policy allows `servicecatalog.amazonaws.com` + +### After census product deploy: +- [ ] Portfolio visible in SC console in the admin account +- [ ] Portfolio shared to target OUs (verify in a workload account) +- [ ] Product associated with portfolio; launch constraint attached +- [ ] Template constraint locks `ServiceToken` to correct Lambda ARN +- [ ] End-to-end test: provision from a **workload account** (not csvd-dev) + +--- + +## Appendix: Census Config Format Reference + +### Portfolio YAML schema + +```yaml +: + name: string + description: string + provider_name: string + products: [, ...] 
+ user_roles: [/path/pattern/*] + tags: {} + share_ous: [] # OU names; empty = inherit from terraform.tfvars +``` + +### Product YAML schema + +```yaml +: + name: string + description: string + type: CLOUD_FORMATION_TEMPLATE + launch_role: string # IAM role NAME (not ARN) — must exist in every target account + distributor: string + template_constraints: + Parameters: + ParamName: locked-value + versions: + - name: "1.0.0" + file_path: products/{product-dir}/{version}.yaml + actions: [] +``` diff --git a/docs/template-management.md b/docs/template-management.md new file mode 100644 index 0000000..5a20ec4 --- /dev/null +++ b/docs/template-management.md @@ -0,0 +1,271 @@ +# Template Management + +**Ported from:** `lambda-template-repo-generator/design-docs/CUSTOM_TEMPLATES.MD` +**Updated for:** sc-lambda-ghactions (CodeBuild-based initial rollout; GHA planned for later) + +This document describes how template repositories are structured and consumed by +the sc-lambda-ghactions system to create new account repos for any Terraform workload. + +--- + +## Template Sources + +### Full Repository Templates + +The standard approach: a GHE repository is used as the template. When the Lambda +Proposer build runs, it clones the template repo verbatim and renders Jinja2 +configuration files on top of it before committing to the new account repo branch. + +**Convention:** template repos are named `template-{product_type}` under `SCT-Engineering/`. + +| Product type | Template repo | +|---|---| +| `eks_cluster` | `SCT-Engineering/template-eks-cluster` | +| `s3_bucket` | `SCT-Engineering/template-s3-bucket` *(planned)* | +| `{any_type}` | `SCT-Engineering/template-{any_type}` | + +### Subdirectory Templates + +For product families that share significant infrastructure (e.g. multiple tiers +of the same workload), a single template repo can contain multiple subdirectory +templates. 
The Proposer build accepts a `source_path` parameter to clone only +the relevant subdirectory into the new account repo. + +Example: a `template-terraform-workloads` repo with: + +``` +template-terraform-workloads/ +├── eks-cluster/ # Standard EKS cluster template +├── eks-cluster-minimal/ # Reduced-footprint cluster variant +├── s3-standard/ # Standard S3 bucket configuration +└── s3-encrypted/ # S3 with custom KMS key configuration +``` + +A product that specifies `source_path: eks-cluster-minimal` will clone only that +subdirectory, stripped of the parent path prefix. + +--- + +## CFN Product Template Usage + +### Full repository (no source_path) + +```yaml +Resources: + MyAccountRepo: + Type: Custom::TerraformRepo + Properties: + ServiceToken: !Sub "arn:${AWS::Partition}:lambda:${AWS::Region}:${AWS::AccountId}:function:sc-template-automation" + product_type: eks_cluster + project_name: !Ref ProjectName + environment: !Ref Environment + aws_account_id: !Sub "${AWS::AccountId}" + aws_region: !Sub "${AWS::Region}" +``` + +### Subdirectory template + +```yaml +Properties: + ServiceToken: !Sub "arn:${AWS::Partition}:lambda:${AWS::Region}:${AWS::AccountId}:function:sc-template-automation" + product_type: s3_bucket + source_path: s3-encrypted # ← subdirectory within the template repo + project_name: !Ref ProjectName + environment: !Ref Environment + aws_account_id: !Sub "${AWS::AccountId}" + aws_region: !Sub "${AWS::Region}" +``` + +--- + +## Template Repository Structure + +Every template repo must follow the standard account repo layout so the rendered +output is compatible with the `tf-run` toolchain and `tf-directory-setup.py`: + +``` +template-{product_type}/ +├── .tf-control # tf-run toolchain version pin +├── .tf-control.tfrc # Terraform provider cache config +├── region.tf # locals { region = var.region } +├── credentials.d/ +│ ├── us-gov-east-1.credentials.tf +│ └── us-gov-west-1.credentials.tf +├── variables.d/ +│ ├── variables.common.tf +│ └── 
variables.tfstate.tf +├── infrastructure/ +│ ├── east/ +│ │ ├── remote_state.yml.j2 # ← Jinja2: rendered by Proposer +│ │ └── {workload}.tf.j2 # ← Jinja2: rendered by Proposer +│ └── west/ +│ ├── remote_state.yml.j2 +│ └── {workload}.tf.j2 +└── README.md +``` + +Files ending in `.j2` are Jinja2 templates. The Proposer CodeBuild build renders +them using the product inputs and commits the rendered result (without the `.j2` +extension) to the new account repo branch. + +--- + +## Jinja2 Template Organization in the Lambda + +Rendered templates are stored in the Lambda image under `lambda/templates/{product_type}/`: + +``` +lambda/templates/ +├── eks_cluster/ +│ ├── infrastructure/west/cluster.tf.j2 +│ ├── infrastructure/east/cluster.tf.j2 +│ └── ... +├── s3_bucket/ # ← new product type: add a directory here +│ ├── infrastructure/west/s3.tf.j2 +│ └── ... +└── {product_type}/ # ← pattern for future types +``` + +The Lambda dispatcher maps `product_type` → template directory automatically. +Adding a new product type requires only adding a new subdirectory here, a +Pydantic model, and a CFN product template — no Lambda plumbing changes. + +--- + +## Proposer Build — Template Copying Logic + +The Proposer CodeBuild build (started by the Lambda via `codebuild:StartBuild`) performs these steps: + +1. Clone the template repo (full repo or `source_path` subdirectory) +2. For each `.j2` file found: + - Render it using `jinja2.Environment` with the product input variables + - Write the rendered output alongside the source file (without `.j2` extension) + - Remove the `.j2` source file from the working tree +3. Add rendered `remote_state.yml` files using actual account/bucket values +4. Write `.sc-automation.yml` to the repo root if it does not already exist on `main` +5. Commit all rendered files to a new branch (`proposal/{timestamp}`) and open a PR + +The PR is reviewed by a platform engineer before merging. 
On merge, the webhook +handler reads `.sc-automation.yml` and automatically starts the executor CodeBuild build. + +--- + +## `.sc-automation.yml` — Automation Config File + +Every account repo that participates in sc-lambda-ghactions automation must have a +`.sc-automation.yml` file at the repo root. The Proposer writes this file when it +creates the initial PR if it does not already exist on `main`. + +### Schema + +```yaml +# .sc-automation.yml +product_type: eks_cluster # Must match a registered PRODUCT_HANDLERS key +executor_project: sc-executor # CodeBuild project name for the Executor build +dry_run: true # If true, Executor runs tf plan only (no apply) +template_repo: SCT-Engineering/template-eks-cluster # Source template repo +template_source_path: "" # Subdirectory within template repo; empty = root +fleet_entry: workloads/eks_cluster/prod/my-cluster/main.tf # Path in terraform-sc-fleet +variables: # Extra key/value pairs injected as CodeBuild env vars + CLUSTER_VERSION: "1.29" + NODE_TYPE: m5.xlarge +``` + +| Field | Required | Description | +|---|---|---| +| `product_type` | ✅ | Routes to the correct Pydantic model and template directory | +| `executor_project` | ✅ | CodeBuild project started by the webhook on PR merge | +| `dry_run` | ✅ | `true` → `tf plan` only; `false` → `tf apply` | +| `template_repo` | ✅ | GHE repo used as the Jinja2 template source | +| `template_source_path` | ❌ | Subdirectory within `template_repo`; omit for whole-repo templates | +| `fleet_entry` | ❌ | Relative path of this workload's entry in `terraform-sc-fleet` | +| `variables` | ❌ | Product-type-specific overrides; merged with SSM global defaults | + +> **Versioning:** The Executor reads `.sc-automation.yml` from `main` at build time, not from the +> PR branch, so changes to it take effect on the next automation run without requiring a re-render. 
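
The schema table above can be checked mechanically before a build is started. The following is a minimal sketch, assuming the file has already been parsed into a plain `dict` (for example with a YAML loader such as PyYAML's `yaml.safe_load`). `AutomationConfig` and `parse_automation_config` are hypothetical names used for illustration only; the project's real validation lives in its Pydantic models, which are not shown in this document.

```python
from dataclasses import dataclass, field, fields

# Fields marked as required in the schema table above.
REQUIRED_FIELDS = ("product_type", "executor_project", "dry_run", "template_repo")


@dataclass
class AutomationConfig:
    """Hypothetical in-memory view of .sc-automation.yml (illustration only)."""

    product_type: str
    executor_project: str
    dry_run: bool
    template_repo: str
    template_source_path: str = ""  # empty string means "repo root"
    fleet_entry: str = ""
    variables: dict = field(default_factory=dict)


def parse_automation_config(raw: dict) -> AutomationConfig:
    """Validate a parsed .sc-automation.yml mapping and return a typed config."""
    missing = [name for name in REQUIRED_FIELDS if name not in raw]
    if missing:
        raise ValueError("missing required field(s): " + ", ".join(missing))

    known = {f.name for f in fields(AutomationConfig)}
    unknown = sorted(set(raw) - known)
    if unknown:
        raise ValueError("unknown field(s): " + ", ".join(unknown))

    # YAML "true"/"false" load as booleans; reject quoted strings such as "true".
    if not isinstance(raw["dry_run"], bool):
        raise ValueError("dry_run must be a boolean")

    return AutomationConfig(**raw)
```

Rejecting unknown keys is a design choice in this sketch, not a documented behavior: it surfaces typos such as `dryrun` early instead of silently falling back to defaults.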
+ +--- + +## Executor Build — Injecting into an Existing Account Repo + +After a platform engineer merges the Proposer PR into `main`, the sc-lambda-ghactions +webhook fires and starts the **Executor** CodeBuild build. The Executor handles +both the initial `tf plan`/`tf apply` run and any subsequent re-render of existing repos. + +### What the Executor Does + +``` +webhook (PR merged to main) + └─> Lambda reads .sc-automation.yml from main + └─> Lambda starts Executor CodeBuild build via StartBuild + environmentVariablesOverride: + REPO_NAME, PRODUCT_TYPE, DRY_RUN, TEMPLATE_REPO, ... + +Executor buildspec: + INSTALL: + - Install Terraform from S3 assets bucket + - Install Census CA cert, set HTTPS_PROXY + - git clone {account_repo} (GHE token from Secrets Manager) + PRE_BUILD: + - Read .sc-automation.yml from cloned repo + - git clone {template_repo} into /tmp/template + BUILD: + - For each .j2 file in /tmp/template: + Render with Jinja2 using env vars as context + Write to account_repo at same relative path (no .j2 extension) + - git checkout -b update/{timestamp} + - git add -A && git commit + - git push + - gh pr create --title "Automated update: {product_type} {timestamp}" + - If dry_run == false: + tf init && tf apply -auto-approve + POST_BUILD: + - POST commit status to GHE (success/failure with CodeBuild log URL) +``` + +### Fleet Update (re-rendering an existing repo) + +When a **template repo itself changes** — for example, an upstream HCL pattern is +updated — the fleet update flow (Flow 3) re-renders all account repos of that +`product_type`: + +1. `terraform-sc-fleet` lists all `workloads/{product_type}/*/main.tf` entries +2. Lambda starts one Executor build **per account repo** (fan-out) +3. Each Executor clones its account repo, re-renders all `.j2` files from the + updated template, commits to a new branch, and opens a PR +4. 
Platform engineers review and merge the PRs individually + +The Executor **never force-pushes to `main`** — every change goes through a PR, +preserving review gates regardless of whether `dry_run` is set. + +### Idempotency + +The Executor is safe to re-run. If the rendered output is identical to `main` +(`git diff --quiet`), it exits with no PR opened and reports a `SKIPPED` status +back to the Lambda. + +--- + +## Security Considerations + +- **Source path validation:** The Proposer validates that `source_path` (if provided) + exists in the template repo before proceeding. Path traversal (`../`) is rejected. +- **File type restrictions:** Only `.tf`, `.hcl`, `.yml`, `.yaml`, `.md`, `.j2`, + and standard dotfiles are copied. Binary files and executables are rejected. +- **Template repo access:** The GHE token injected into the CodeBuild environment + has read-only access to `SCT-Engineering/template-*` repos and read-write access + only to the target account repo. + +--- + +## Adding a New Template Repository + +Checklist when onboarding a new product type: + +- [ ] Create `SCT-Engineering/template-{product_type}` with standard account repo layout +- [ ] Add `.j2` files for each rendered configuration file +- [ ] Add `lambda/templates/{product_type}/` with corresponding Jinja2 templates +- [ ] Add a Pydantic model in `lambda/models/{product_type}.py` +- [ ] Register the handler in `lambda/app.py` `PRODUCT_HANDLERS` table +- [ ] Create a CFN product template in `service-catalog/{product_type}-product-template.yaml` +- [ ] Add the product to `terraform-service-catalog-census` (see [service-catalog-census-integration.md](service-catalog-census-integration.md)) diff --git a/docs/workflow-flowcharts.md b/docs/workflow-flowcharts.md new file mode 100644 index 0000000..ef860d4 --- /dev/null +++ b/docs/workflow-flowcharts.md @@ -0,0 +1,135 @@ +# Service Catalog Automation — Workflow Flowcharts + +**Ported and updated from:** 
`lambda-template-repo-generator/docs/DEMO_FLOWCHART.md` +**Updated for:** sc-lambda-ghactions (CodeBuild-based initial rollout; GHA planned for later) + +Generic overview of all end-to-end flows for any Service Catalog product built +on the sc-lambda-ghactions pattern. Intended for stakeholder demos and onboarding +conversations. + +--- + +## Flow 1 — Provisioning (SC Form → New Account Repo + PR) + +```mermaid +flowchart TD + A([👤 Engineer]) -->|Fills out form & clicks Launch| B[AWS Service Catalog] + + B -->|Creates CloudFormation Stack| C[CloudFormation\nCustom Resource] + + C -->|Cross-account invocation\nvia ServiceToken| D[Lambda Function\ncsvd-dev] + + D -->|Fetches GHE token| E[(Secrets Manager\ncsvd-dev)] + + D -->|Starts CodeBuild build\nproduct_type + inputs as env vars| F[CodeBuild\nProposer — csvd-dev] + + F -->|Clones template repo| G["SCT-Engineering/template-{product_type}"] + F -->|Renders Jinja2 templates\nCommits rendered HCL| H[New Branch\nproposal/timestamp] + F -->|Opens| I[Pull Request\nproposal → main] + F -->|Commits entry to| K["terraform-sc-fleet\nworkloads/{type}/{name}/main.tf"] + + D -->|Polls CodeBuild build\nevery 20s until complete| F + D -->|Returns repo URL + PR URL| C + + C -->|Stack outputs| B + B -->|Status: AVAILABLE\n+ repo & PR links| A + + style A fill:#4a90d9,color:#fff + style B fill:#f5a623,color:#fff + style C fill:#f5a623,color:#fff + style D fill:#7ed321,color:#fff + style E fill:#9b59b6,color:#fff + style F fill:#27ae60,color:#fff + style G fill:#2c3e50,color:#fff + style H fill:#2c3e50,color:#fff + style I fill:#e74c3c,color:#fff + style K fill:#8e44ad,color:#fff + +``` + +--- + +## Flow 2 — Apply on Merge (Webhook → Auto-Executor) + +After a platform engineer reviews and merges the Proposer PR, the webhook handler +automatically starts the executor build — no manual SC provisioning step required.
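
Because the webhook endpoint is a public Lambda Function URL, the handler has to authenticate each push event before starting a build. A minimal sketch of that check follows, assuming GHE sends the standard GitHub `X-Hub-Signature-256` header (`sha256=` followed by the hex HMAC-SHA256 of the raw request body) and that the default branch is `main`. The function names are hypothetical and are not taken from the handler's source.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Return True when the header matches the HMAC-SHA256 of the raw body."""
    if not signature_header.startswith("sha256="):
        return False
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time to avoid leaking a partial match.
    return hmac.compare_digest(expected.encode(), signature_header.encode())


def is_push_to_main(payload: dict) -> bool:
    """Return True for a push event whose ref is the main branch."""
    return payload.get("ref") == "refs/heads/main"
```

The signature must be computed over the raw request bytes, not over a re-serialized JSON payload, since re-serialization can change whitespace and key order.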
+ +```mermaid +flowchart TD + A([👤 Platform Engineer]) -->|Reviews & merges PR| B[GitHub Enterprise\nmain branch] + + B -->|Push event| C[Lambda Function URL\nPOST /webhook] + + C -->|Verifies HMAC signature| C + C -->|Reads .sc-automation.yml\nfrom merged commit| D{layer / region_dir\nconfigured?} + + D -->|Yes| E[Starts CodeBuild build\nexecutor — csvd-dev] + D -->|No| Z([Skip — no automation config]) + + E -->|Reads .sc-automation.yml\nvia buildspec env var| G{dry_run: true?} + G -->|Yes| H[terraform plan only] + G -->|No| I[terraform apply] + + E -->|POST commit status via GitHub API| B + B -->|✅ or ❌ on merge commit| A + + style A fill:#4a90d9,color:#fff + style B fill:#2c3e50,color:#fff + style C fill:#7ed321,color:#fff + style D fill:#f5a623,color:#fff + style E fill:#27ae60,color:#fff + style G fill:#f5a623,color:#fff + style H fill:#9b59b6,color:#fff + style I fill:#e74c3c,color:#fff + style Z fill:#95a5a6,color:#fff +``` + +--- + +## Flow 3 — Fleet-Wide Update (CSVD Operations) + +CSVD-initiated update applied across all managed workloads — e.g. a version bump. +No Service Catalog involvement; runs directly from `terraform-sc-fleet`. 
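
The workload walk at the start of this flow can be sketched as follows. This is an illustration only: it assumes entries follow the `workloads/{product_type}/{lifecycle}/{name}/main.tf` layout suggested by the `fleet_entry` example, and `find_workloads` is a hypothetical helper. The real `update_fleet.py` is not shown in this document and may differ, for example in how it evaluates maintenance windows.

```python
from pathlib import Path


def find_workloads(fleet_root: Path, product_type: str, lifecycle: str) -> list[Path]:
    """Return workload folders that contain a main.tf, sorted for stable output."""
    base = fleet_root / "workloads" / product_type / lifecycle
    if not base.is_dir():
        return []
    # Each main.tf marks one workload; its parent directory is the workload folder.
    return sorted(entry.parent for entry in base.rglob("main.tf"))
```

Sorting keeps the fan-out order deterministic, which makes the final "N applied, M skipped" summary easier to compare between runs.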
+ +```mermaid +flowchart TD + A([👤 CSVD Engineer]) -->|python update_fleet.py\n--product-type eks_cluster --lifecycle dev| B[terraform-sc-fleet\nscripts/update_fleet.py] + + B -->|Walks workloads/eks_cluster/dev/**| C{maintenance\nwindow open?} + + C -->|Yes| D[tf apply\nper workload folder] + C -->|No| E([Skip workload\nlog window info]) + + D -->|Starts CodeBuild build\nexecutor — csvd-dev| F[CodeBuild\nExecutor — csvd-dev] + + F -->|Renders + commits\nupdated HCL| G[Account Repo\nNew branch] + G -->|Opens| H[Pull Request\nfor CSVD or customer review] + + F -->|POST commit status via GitHub API| H + + B -->|Summary: N applied\nM skipped| A + + style A fill:#4a90d9,color:#fff + style B fill:#8e44ad,color:#fff + style C fill:#f5a623,color:#fff + style D fill:#7ed321,color:#fff + style E fill:#95a5a6,color:#fff + style F fill:#27ae60,color:#fff + style G fill:#2c3e50,color:#fff + style H fill:#e74c3c,color:#fff +``` + +--- + +## Key Design Points + +| # | Point | +|---|-------| +| 1 | **Self-service provisioning** — engineer fills a form; no CSVD involvement for the create path | +| 2 | **Centralized compute** — Lambda, CodeBuild projects, and GHE tokens all live in csvd-dev; the provisioner's account only sees a CFN stack with output URLs | +| 3 | **Lambda as thin orchestrator** — validates inputs, starts CodeBuild build, polls for completion, returns URLs to CFN | +| 4 | **CodeBuild runs the Terraform** — actual repo creation and HCL rendering logic lives in CodeBuild buildspecs, not bespoke Lambda Python. GHA workflows are planned for a later rollout phase. 
| +| 5 | **Auto-apply on merge** — webhook handler eliminates the manual executor step; merge = apply | +| 6 | **Fleet operations separate from provisioning** — `terraform-sc-fleet` + `update_fleet.py` give CSVD a single command for fleet-wide changes | +| 7 | **Works for any product type** — swap `product_type` in the SC form and the entire chain routes to a different template repo, Pydantic model, and Jinja2 templates, with no Lambda plumbing changes | +| 8 | **Governance via GHE** — branch protection and CODEOWNERS are baked into every provisioned repo at creation time; customers can propose changes but cannot merge without CSVD approval |