diff --git a/docs/template-management.md b/docs/template-management.md
index d1092c4..4a08803 100644
--- a/docs/template-management.md
+++ b/docs/template-management.md
@@ -1,189 +1,215 @@
 # Template Management

-**Ported from:** `lambda-template-repo-generator/design-docs/CUSTOM_TEMPLATES.MD`
-**Updated for:** sc-lambda-ghactions (CodeBuild-based initial rollout; GHA planned for later)
-
 This document describes how template repositories are structured and consumed by
-the sc-lambda-ghactions system to create new account repos for any Terraform workload.
+the sc-lambda-ghactions system to add new workloads to existing account repos.

 ---

-## Template Sources
+## Core Principle: Templates are Delta Overlays
+
+Template repos do **not** contain a full account repo scaffold. Account repos
+already carry all of the standard boilerplate from their initial setup:
+
+```
+{account-id}-{alias}/
+├── .tf-control                 # already there — toolchain version pin
+├── .tf-control.tfrc            # already there — plugin cache / provider mirror
+├── .gitignore                  # already there
+├── region.tf                   # already there
+├── credentials.d/              # already there — per-region AWS credential files
+├── variables.d/                # already there — profile + region auto.tfvars
+├── common/                     # existing layer with remote_state.yml, variables, etc.
+├── infrastructure/             # existing layer ...
+│   ├── remote_state.yml        # already there — account-specific bucket/profile/account_id
+│   ├── variables.common.tf     # already there
+│   └── west/                   # existing workspace ...
+└── vpc/                        # existing layer ...
+```

-### Full Repository Templates
+A template repo provides **only the new files** the Proposer writes into
+that existing structure. If the template were to include `.tf-control`,
+`region.tf`, `credentials.d/`, or `variables.d/`, it would:

-The standard approach: a GHE repository is used as the template. When the Lambda
-Proposer build runs, it clones the template repo verbatim and renders Jinja2
-configuration files on top of it before committing to the new account repo branch.
+- **Overwrite working account-specific values** with placeholders or wrong defaults
+- Be **non-reusable** across accounts (different profiles, regions, account IDs)
+- Duplicate governance already managed by the `terraform/support` repo

-**Convention:** template repos are named `template-{product_type}` under `SCT-Engineering/`.
+---

-| Product type | Template repo |
-|---|---|
-| `eks_cluster` | `SCT-Engineering/template-eks-cluster` |
-| `s3_bucket` | `SCT-Engineering/template-s3-bucket` *(planned)* |
-| `{any_type}` | `SCT-Engineering/template-{any_type}` |
+## What Belongs in a Template Repo

-### Subdirectory Templates
+A template repo contains only the workload-specific delta:

-For product families that share significant infrastructure (e.g. multiple tiers
-of the same workload), a single template repo can contain multiple subdirectory
-templates. The Proposer build accepts a `source_path` parameter to clone only
-the relevant subdirectory into the new account repo.
+```
+template-{product_type}/
+├── {layer}/
+│   └── {workspace}/
+│       ├── {workload}.tf.j2    # workload resources — rendered by Proposer
+│       └── tf-run.data         # apply step sequence for this workspace
+└── .sc-automation.yml.j2       # optional: Proposer writes this if absent
+```

-Example: a `template-terraform-workloads` repo with:
+### Minimal real example — `template-s3-bucket`

 ```
-template-terraform-workloads/
-├── eks-cluster/           # Standard EKS cluster template
-├── eks-cluster-minimal/   # Reduced-footprint cluster variant
-├── s3-standard/           # Standard S3 bucket configuration
-└── s3-encrypted/          # S3 with custom KMS key configuration
+template-s3-bucket/
+├── infrastructure/
+│   └── west/
+│       ├── INF.s3-standard.tf.j2   # S3 bucket + policy resources
+│       └── tf-run.data             # REMOTE-STATE + tf-directory-setup + ALL
+└── .sc-automation.yml.j2
 ```

-A product that specifies `source_path: eks-cluster-minimal` will clone only that
-subdirectory, stripped of the parent path prefix.
+That is the entire template. Nothing else. The account repo already provides
+the execution context: Terraform binary version, plugin cache, proxy settings,
+provider config, region, credentials, and the layer-level `remote_state.yml`
+from which the workspace `remote_state.yml` is derived.
+
+### When the target layer does not yet exist
+
+If the workload requires adding a **brand-new layer** to the account repo
+(e.g. adding `infrastructure/` to an account that only has `common/`), the
+template still does not provide the layer-level `remote_state.yml`. Instead,
+the Lambda's Pydantic model builds it from SC form inputs and passes it via
+`EXTRA_FILES`:
+
+```python
+# Inside the Lambda handler for this product type:
+extra_files = {
+    f"{layer}/remote_state.yml": render_remote_state_yml(
+        directory=layer,
+        account_id=req.aws_account_id,
+        account_alias=req.account_alias,
+        bucket=f"inf-tfstate-{req.aws_account_id}",
+        bucket_region="us-gov-east-1",
+        profile=f"{req.aws_account_id}-{req.account_alias}",
+        region=req.aws_region,
+        aws_environment="gov",
+    )
+}
+```
+
+`EXTRA_FILES` are written by the Proposer **after** template rendering, so they
+can never be accidentally provided by the template repo. The account-specific
+values come from the validated Pydantic model, not from a `.j2` file.

 ---

-## CFN Product Template Usage
+## Template Repository Conventions

-### Full repository (no source_path)
+### `tf-run.data` — required in every workspace the template touches
+
+Every workspace directory added by the template must include a `tf-run.data`
+with at minimum:

-```yaml
-Resources:
-  MyAccountRepo:
-    Type: Custom::TerraformRepo
-    Properties:
-      ServiceToken: !Sub "arn:${AWS::Partition}:lambda:${AWS::Region}:${AWS::AccountId}:function:sc-template-automation"
-      product_type: eks_cluster
-      project_name: !Ref ProjectName
-      environment: !Ref Environment
-      aws_account_id: !Sub "${AWS::AccountId}"
-      aws_region: !Sub "${AWS::Region}"
+```
+VERSION 1.0
+REMOTE-STATE
+COMMAND tf-directory-setup.py --link none
+TAG apply-start
+ALL
 ```

-### Subdirectory template
+- `REMOTE-STATE` instructs the Proposer to derive the workspace `remote_state.yml`
+  from the layer-level one (appending `/{workspace_name}` to `directory`).
+- `COMMAND tf-directory-setup.py --link none` causes the Proposer to generate
+  `remote_state.backend.tf` + the three variant files. `--link none` is the
+  bootstrap state; the Executor re-links to `--link s3` after first apply.
+- `TAG apply-start` lets an operator re-run from this point without re-running
+  the setup directives.

-```yaml
-Properties:
-  ServiceToken: !Sub "arn:${AWS::Partition}:lambda:${AWS::Region}:${AWS::AccountId}:function:sc-template-automation"
-  product_type: s3_bucket
-  source_path: s3-encrypted    # ← subdirectory within the template repo
-  project_name: !Ref ProjectName
-  environment: !Ref Environment
-  aws_account_id: !Sub "${AWS::AccountId}"
-  aws_region: !Sub "${AWS::Region}"
-```
+### `.sc-automation.yml.j2` — optional

----
+If the template includes `.sc-automation.yml.j2`, the Proposer renders and
+commits it. If absent, the Proposer writes a default `.sc-automation.yml`
+using the product type and executor project from the Lambda's `TfRunRequest`
+model. Either way, the file ends up at the repo root on `main` after merge.

-## Template Repository Structure
+### `.terraform.lock.hcl` — include if possible

-Every template repo must follow the standard account repo layout so the rendered
-output is compatible with the `tf-run` toolchain and `tf-directory-setup.py`:
+If the template is authored for a known provider set (e.g. `hashicorp/aws`),
+include a pre-generated `.terraform.lock.hcl` in each workspace directory.
+This avoids a from-scratch provider resolution on first `tf-init` and gives
+reviewers visibility into the locked provider versions.

-```
-template-{product_type}/
-├── .gitignore                  # must exclude logs/ .terraform/ terraform.tfstate*
-├── .tf-control                 # tf-run toolchain version pin
-├── .tf-control.tfrc            # Terraform provider cache config
-├── region.tf                   # locals { region = var.region }
-├── credentials.d/
-│   ├── us-gov-east-1.credentials.tf
-│   └── us-gov-west-1.credentials.tf
-├── variables.d/
-│   ├── variables.common.tf
-│   └── variables.tfstate.tf
-│   └── {region}.variables.common.auto.tfvars.j2   # ← must emit profile + region keys
-├── infrastructure/
-│   ├── remote_state.yml.j2     # ← layer-level; Proposer renders to remote_state.yml
-│   ├── east/
-│   │   ├── tf-run.data         # ← must contain REMOTE-STATE directive
-│   │   ├── .terraform.lock.hcl # ← committed; Executor updates and pushes back to main
-│   │   └── {workload}.tf.j2    # ← Jinja2: rendered by Proposer
-│   └── west/
-│       ├── tf-run.data         # ← must contain REMOTE-STATE directive
-│       ├── .terraform.lock.hcl # ← committed; Executor updates and pushes back to main
-│       └── {workload}.tf.j2    # ← Jinja2: rendered by Proposer
-└── README.md
+If omitted, the Executor generates it on first `tf-init` and commits it back
+to `main` (tagged `[skip ci]`).
+
+---
+
+## CFN Product Template Usage
+
+```yaml
+Resources:
+  WorkloadRepo:
+    Type: Custom::TfRunPropose
+    Properties:
+      ServiceToken: !Sub "arn:${AWS::Partition}:lambda:${AWS::Region}:${AWS::AccountId}:function:sc-template-automation"
+      product_type: s3_bucket
+      account_repo: !Ref AccountRepo   # e.g. 229685449397-csvd-dev-platform-dev-gov
+      layer: infrastructure
+      region_dir: west
+      aws_account_id: !Sub "${AWS::AccountId}"
+      aws_region: !Sub "${AWS::Region}"
+      # product-type-specific inputs (vary by Pydantic model):
+      bucket_name: !Ref BucketName
+      versioning_enabled: "true"
 ```

-**Key layout rules:**
-
-- `remote_state.yml.j2` lives at the **layer level** (`infrastructure/`, `common/`, `vpc/`), **not** inside workspace subdirectories. The Proposer's REMOTE-STATE processor derives each workspace's `remote_state.yml` from the layer-level file by appending `/{workspace_name}` to the `directory` field — identical to what `tf-run.sh` does at apply time.
-- Each workspace directory (`east/`, `west/`, `global/`) **must** include a `tf-run.data` file with a `REMOTE-STATE` directive so the Proposer knows to generate its `remote_state.yml`.
-- The `.auto.tfvars.j2` file must render `profile = "..."` and `region = "..."` entries at the top level — `tf-run.sh` auto-discovers profile and region by grepping `*.tfvars`, so these values must be present for placeholder substitution (`%%REGION%%`, `%%PROFILE%%`, etc.) to work correctly.
-- `.gitignore` **must** contain at minimum:
-  ```
-  logs/
-  .terraform/
-  terraform.tfstate
-  terraform.tfstate.backup
-  ```
-  `logs/` is where `tf-control.sh` writes every plan/apply log. These are ephemeral and must never be committed. `.terraform/` caches the provider plugins locally during a run and must not be committed (only `.terraform.lock.hcl` is committed).
-- `.terraform.lock.hcl` is the [dependency lock file](https://developer.hashicorp.com/terraform/language/files/dependency-lock) and **must be committed**. The template should include an initial lock file generated from the workspace's required providers. The Executor runs `tf-init` which updates it if providers change, then commits the update directly back to `main` (bypassing the PR flow, tagged `[skip ci]`).
-- `.tf-control` sets `TFCOMMAND=terraform_latest` (the Census workstation alias). The Executor buildspec creates a `terraform_latest` symlink pointing to the installed `terraform` binary so `tf-control.sh` resolves it correctly.
-- `.tf-control.tfrc` sets `plugin_cache_dir = "/data/terraform/terraform.d/plugin-cache"` and a `filesystem_mirror` at `/data/terraform/terraform.d/providers`. The Executor buildspec creates both directories. The `filesystem_mirror` path starts empty so Terraform falls through to the `direct {}` block — providers are fetched via the Census proxy and then cached in `plugin_cache_dir` for the remainder of the build. The plugin cache directory is also configured as a CodeBuild S3 cache path so provider archives persist across builds.
-- Files ending in `.j2` are Jinja2 templates. The Proposer renders them using the product input variables and commits the result (without the `.j2` extension) to the work branch. The `.j2` source files are **not** committed.
+The Lambda's Pydantic model for `s3_bucket` validates the product-specific
+inputs and builds `TEMPLATE_VARS` + any `EXTRA_FILES` (e.g. a new
+`remote_state.yml` if the layer doesn't exist). The template repo supplies
+only the generic `.tf.j2` and `tf-run.data`; the Lambda supplies all
+environment-specific values.

 ---

-## Jinja2 Template Organization in the Lambda
+## Subdirectory Templates

-Rendered templates are stored in the Lambda image under `lambda/templates/{product_type}/`:
+A single template repo can contain multiple product variants as subdirectories.
+The Lambda passes `source_path` to the Proposer to clone only the relevant subtree:

 ```
-lambda/templates/
-├── eks_cluster/
-│   ├── infrastructure/west/cluster.tf.j2
-│   ├── infrastructure/east/cluster.tf.j2
-│   └── ...
-├── s3_bucket/              # ← new product type: add a directory here
-│   ├── infrastructure/west/s3.tf.j2
-│   └── ...
-└── {product_type}/         # ← pattern for future types
+template-s3/
+├── standard/
+│   └── infrastructure/west/INF.s3-standard.tf.j2
+└── encrypted/
+    └── infrastructure/west/INF.s3-encrypted.tf.j2
 ```

-The Lambda dispatcher maps `product_type` → template directory automatically.
-Adding a new product type requires only adding a new subdirectory here, a
-Pydantic model, and a CFN product template — no Lambda plumbing changes.
+A product that specifies `source_path: encrypted` copies only
+`infrastructure/west/INF.s3-encrypted.tf.j2` into the account repo.

 ---

-## Proposer Build — Template Copying Logic
-
-The Proposer CodeBuild build (started by the Lambda via `codebuild:StartBuild`) performs these steps:
-
-1. Clone the template repo (full repo or `source_path` subdirectory)
-2. For each `.j2` file found:
-   - Render it using `jinja2.Environment` with the product input variables
-   - Write the rendered output to the same relative path (without `.j2` extension)
-3. Write any `EXTRA_FILES` entries (direct path → content map; overrides template output)
-4. **REMOTE-STATE processing** — for every `tf-run.data` with a `REMOTE-STATE` directive:
-   - Read the layer-level `remote_state.yml` (e.g. `infrastructure/remote_state.yml`)
-   - Append `/{workspace_basename}` to the `directory` field via regex substitution
-   - Write the result as `remote_state.yml` in the workspace directory (e.g. `infrastructure/west/remote_state.yml`)
-   - This is the same transformation `tf-run.sh` performs at apply time for the `REMOTE-STATE` directive
-5. **`tf-directory-setup.py` bootstrap** — for every workspace directory that now has a `remote_state.yml`:
-   - Run `tf-directory-setup.py --link none` to generate:
-     - `remote_state.backend.tf` — the S3 backend configuration block
-     - `remote_state.{dir}.tf.s3` — S3-backed remote state variant
-     - `remote_state.{dir}.tf.local` — local state file variant
-     - `remote_state.{dir}.tf.none` — empty no-op stub (active on first propose)
-     - Symlink `remote_state.{dir}.tf → remote_state.{dir}.tf.none`
-   - `--link none` is the correct bootstrap value: Terraform state does not exist yet for a new workspace
-   - After a successful `tf apply` in the Executor, the `tf-run.data` `COMMAND tf-directory-setup.py --link s3` step re-links to `.s3`
-6. Write `.sc-automation.yml` to the repo root if it does not already exist on `main`
-7. Commit all files (rendered templates + generated state bootstrap files) to a work branch and open a PR
-
-> **Principle: the PR diff is the complete truth.** Every file the Executor will find at
-> apply time must already be committed in the Proposer PR. Neither `REMOTE-STATE` nor
-> `tf-directory-setup.py` should create new files during `tf-run apply` — those steps become
-> idempotent re-generations of files already in the repo.
-
-The PR is reviewed by a platform engineer before merging. On merge, the webhook
-handler reads `.sc-automation.yml` and automatically starts the executor CodeBuild build.
+## Proposer Build — What It Does
+
+The Proposer CodeBuild build clones the **existing account repo** and writes the
+template delta on top of it. Steps in order:
+
+1. `git clone` the account repo; `git checkout -B ${GIT_BRANCH}`
+2. If `TEMPLATE_REPO` is set: clone it (optionally at `source_path`), render all `.j2`
+   files with Jinja2 `StrictUndefined`, copy non-`.j2` files as-is, all at the same
+   relative paths. Account repo files that the template does not touch are left unchanged.
+3. Write any `EXTRA_FILES` entries (path → content map from the Lambda model; overrides
+   template output). Typical use: new layer-level `remote_state.yml` when the target
+   layer does not yet exist in the account repo.
+4. **REMOTE-STATE bootstrap** — for every `tf-run.data` found that contains a `REMOTE-STATE`
+   directive: read the layer-level `remote_state.yml` already present in the account repo
+   (or just written via `EXTRA_FILES`), append `/{workspace_name}` to the `directory` field,
+   write the result as `remote_state.yml` in the workspace directory. This mirrors exactly
+   what `tf-run.sh` does at apply time.
+5. **`tf-directory-setup.py --link none`** — for every workspace directory that now has a
+   `remote_state.yml`, run `tf-directory-setup.py --link none` to generate:
+   - `remote_state.backend.tf` — S3 backend block
+   - `remote_state.{dir}.tf.s3` / `.local` / `.none` variant files
+   - Symlink `remote_state.{dir}.tf → remote_state.{dir}.tf.none` (bootstrap state)
+6. Write `.sc-automation.yml` at the repo root if absent on `main`.
+7. `git add -A && git commit && git push && gh pr create`
+
+> **Principle: the PR diff is the complete truth.** Every file the Executor will see
+> at apply time is committed in the Proposer PR. The Executor never silently creates
+> files; its `REMOTE-STATE` and `tf-directory-setup.py` steps are idempotent overwrites.

 ---

@@ -223,63 +249,25 @@ variables: # Extra key/value pairs injected as CodeBuild

 ---

-## Executor Build — Injecting into an Existing Account Repo
-
-After a platform engineer merges the Proposer PR into `main`, the sc-lambda-ghactions
-webhook fires and starts the **Executor** CodeBuild build. The Executor handles
-both the initial `tf plan`/`tf apply` run and any subsequent re-render of existing repos.
-
-### What the Executor Does
-
-```
-webhook (PR merged to main)
-  └─> Lambda reads .sc-automation.yml from main
-      └─> Lambda starts Executor CodeBuild build via StartBuild
-            environmentVariablesOverride:
-              REPO_NAME, PRODUCT_TYPE, DRY_RUN, TEMPLATE_REPO, ...
-
-Executor buildspec:
-  INSTALL:
-    - Install Terraform from S3 assets bucket
-    - Install Census CA cert, set HTTPS_PROXY
-    - git clone {account_repo} (GHE token from Secrets Manager)
-  PRE_BUILD:
-    - Read .sc-automation.yml from cloned repo
-    - git clone {template_repo} into /tmp/template
-  BUILD:
-    - For each .j2 file in /tmp/template:
-        Render with Jinja2 using env vars as context
-        Write to account_repo at same relative path (no .j2 extension)
-    - git checkout -b update/{timestamp}
-    - git add -A && git commit
-    - git push
-    - gh pr create --title "Automated update: {product_type} {timestamp}"
-    - If dry_run == false:
-        tf init && tf apply -auto-approve
-  POST_BUILD:
-    - POST commit status to GHE (success/failure with CodeBuild log URL)
-```
-
-### Fleet Update (re-rendering an existing repo)
+## Executor Build — What It Does

-When a **template repo itself changes** — for example, an upstream HCL pattern is
-updated — the fleet update flow (Flow 3) re-renders all account repos of that
-`product_type`:
+The Executor does **not** render templates or open PRs. It only runs Terraform.

-1. `terraform-sc-fleet` lists all `workloads/{product_type}/*/main.tf` entries
-2. Lambda starts one Executor build **per account repo** (fan-out)
-3. Each Executor clones its account repo, re-renders all `.j2` files from the
-   updated template, commits to a new branch, and opens a PR
-4. Platform engineers review and merge the PRs individually
+After a platform engineer merges the Proposer PR to `main`:
+1. GHE push webhook → Lambda reads `.sc-automation.yml` → starts `tf-run-executor`
+2. Executor clones the account repo at `main` (all files already committed by the Proposer PR)
+3. Optionally assumes cross-account IAM role (`TARGET_ACCOUNT_ID`)
+4. `cd ${LAYER}/${REGION_DIR}`; runs `tf-run plan` or `tf-run apply`
+5. After successful apply: commits `remote_state.{dir}.tf` symlink re-link +
+   `.terraform.lock.hcl` updates directly to `main` with `[skip ci]`

-The Executor **never force-pushes to `main`** — every change goes through a PR,
-preserving review gates regardless of whether `dry_run` is set.
+The Executor does not touch any file that wasn't already committed in the PR.
+It carries no template-repo knowledge and no Jinja2 dependencies.

 ### Idempotency

-The Executor is safe to re-run. If the rendered output is identical to `main`
-(`git diff --quiet`), it exits with no PR opened and reports a `SKIPPED` status
-back to the Lambda.
+The Executor is safe to re-run. If `tf-run apply` produces no infrastructure
+changes and the post-apply file diff is empty, the commit step is skipped.

 ---

@@ -295,14 +283,15 @@ back to the Lambda.

 ---

-## Adding a New Template Repository
+## Adding a New Product Type

 Checklist when onboarding a new product type:

-- [ ] Create `SCT-Engineering/template-{product_type}` with standard account repo layout
-- [ ] Add `.j2` files for each rendered configuration file
-- [ ] Add `lambda/templates/{product_type}/` with corresponding Jinja2 templates
-- [ ] Add a Pydantic model in `lambda/models/{product_type}.py`
+- [ ] Create `SCT-Engineering/template-{product_type}` containing **only** the workload
+      delta: `{layer}/{workspace}/{workload}.tf.j2` + `tf-run.data`
+- [ ] Add a Pydantic model in `lambda/models/{product_type}.py` that validates
+      product-specific inputs and builds `TEMPLATE_VARS` + any `EXTRA_FILES`
+      (e.g. layer-level `remote_state.yml` if the target layer may not exist yet)
 - [ ] Register the handler in `lambda/app.py` `PRODUCT_HANDLERS` table
 - [ ] Create a CFN product template in `service-catalog/{product_type}-product-template.yaml`
 - [ ] Add the product to `terraform-service-catalog-census` (see [service-catalog-census-integration.md](service-catalog-census-integration.md))
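
Note on the REMOTE-STATE step in the Proposer section of this diff: the doc says the workspace `remote_state.yml` is derived from the layer-level file by appending `/{workspace_name}` to the `directory` field. A minimal sketch of that derivation follows. It is not the Proposer's actual code: the function name `derive_workspace_remote_state` is hypothetical, and it assumes `directory` is written as a single unquoted `directory: value` line, which the diff does not confirm.

```python
import re

# Assumed layout: one unquoted "directory: <value>" line in the layer-level file.
_DIRECTORY_LINE = re.compile(r"^([ \t]*directory:[ \t]*)(\S+?)/*[ \t]*$", re.MULTILINE)


def derive_workspace_remote_state(layer_yaml: str, workspace_name: str) -> str:
    """Return the layer-level remote_state.yml text with /{workspace_name}
    appended to its `directory` field. All other lines are left untouched."""
    new_text, count = _DIRECTORY_LINE.subn(
        # Keep the "directory: " prefix, drop any trailing slash, append the workspace.
        lambda m: f"{m.group(1)}{m.group(2)}/{workspace_name}",
        layer_yaml,
        count=1,
    )
    if count == 0:
        raise ValueError("no `directory:` field found in layer-level remote_state.yml")
    return new_text


if __name__ == "__main__":
    # Illustrative input only; the real file carries more account-specific keys.
    layer_text = "directory: infrastructure\nbucket: inf-tfstate-123456789012\n"
    print(derive_workspace_remote_state(layer_text, "west"))
```

The sketch rewrites only the first matching line and raises if none is found, so a malformed layer file fails loudly instead of producing a workspace file with the wrong state path. It is not idempotent on its own output: it should always be fed the layer-level text, never a previously derived workspace file.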