
feat(CSVDIES-9980): pass ExternalId at assume-role; default to sc-automation-codebuild-role #2

Merged
arnol377 merged 31 commits into main on Jun 11, 2026

@arnol377 (Collaborator) commented Jun 8, 2026

Summary

Companion change to terraform-service-catalog-census PR #13, which deploys sc-automation-codebuild-role org-wide via CloudFormation StackSet.

Jira: CSVDIES-9980
Design: ADR-004

Changes to buildspec-executor.yml

  1. Default CROSS_ACCOUNT_ROLE changed from r-inf-terraform to sc-automation-codebuild-role — the new purpose-built role deployed org-wide by the StackSet PR above.

  2. --external-id "${TARGET_ACCOUNT_ID}" added to the aws sts assume-role call — required by the sts:ExternalId = AWS::AccountId condition on the new role per ADR-004 (confused-deputy protection).
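The two changes above can be sketched as follows. This is a minimal illustration in Python rather than the buildspec's shell: the helper name `build_assume_role_args`, the session name, and the `aws` partition in the ARN are assumptions, while the role name, the `--external-id` flag, and the use of the target account ID as its value come from this PR.

```python
from typing import List

DEFAULT_ROLE = "sc-automation-codebuild-role"  # new default per this PR


def build_assume_role_args(
    target_account_id: str,
    role_name: str = DEFAULT_ROLE,
    session_name: str = "sc-automation-executor",  # hypothetical session name
) -> List[str]:
    """Build the argv for `aws sts assume-role`, passing the target account ID as ExternalId."""
    # Assumes the standard "aws" partition; other partitions use a different ARN prefix.
    role_arn = f"arn:aws:iam::{target_account_id}:role/{role_name}"
    return [
        "aws", "sts", "assume-role",
        "--role-arn", role_arn,
        "--role-session-name", session_name,
        "--external-id", target_account_id,
    ]
```

Passing `role_name="r-inf-terraform"` mirrors the per-build `CROSS_ACCOUNT_ROLE` override. Whether that older role accepts an ExternalId is not stated here.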

Backward Compatibility

r-inf-terraform remains in the CodeBuild IAM policy and can still be used by passing CROSS_ACCOUNT_ROLE=r-inf-terraform as a per-build env var override.

Merge Order

  1. Merge terraform-service-catalog-census PR #13 first and run tf apply so the role exists in target accounts
  2. Then merge this PR

Dave Arnold added 30 commits May 11, 2026 16:23
…KS docs

- lambda/app.py: add template_repo + template_vars fields to TfRunRequest;
  merge field_validator to cover both extra_files and template_vars; pass
  both new fields in CodeBuild environmentVariablesOverride
- buildspec.yml: add TEMPLATE_REPO/TEMPLATE_VARS env defaults; new build step
  clones template repo, renders .j2 files via Jinja2 StrictUndefined, copies
  non-.j2 files verbatim; EXTRA_FILES step runs after and overrides
- service-catalog/product-template.yaml: add TemplateRepo + TemplateVars
  parameters and parameter group; wire to Lambda Custom Resource
- docs/HOW-IT-WORKS.md: full end-to-end documentation of the system
- .gitignore: exclude *.tfstate, *.tfvars, .terraform/, terraform_data_dirs/
…tions

- buildspec.yml:
  - Clone terraform/support at build time for version governance (no more
    hardcoded 1.9.1/2.49.0 version strings; reads VERSION files from support repo)
  - S3 env vars are now prefixes (TF_BINARY_S3_PREFIX, GH_CLI_S3_PREFIX);
    filenames constructed dynamically from support repo VERSION files
  - Add SSH->HTTPS git URL rewrite so Terraform module SSH sources work via PAT
  - Add conditional cross-account assume-role step (TARGET_ACCOUNT_ID)
  - Add 169.254.170.2 to NO_PROXY (AL2023 ECS credential provider)
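The SSH->HTTPS rewrite mentioned above amounts to a URL mapping like the one below. This is only a sketch of the idea in Python: the buildspec's actual mechanism is not shown in this PR, the function name `ssh_to_https` is hypothetical, and PAT authentication is not modelled.

```python
import re

# scp-style remotes, e.g. git@host:org/repo.git
_SCP_STYLE = re.compile(r"^git@(?P<host>[^:/]+):(?P<path>.+)$")
# ssh:// remotes, e.g. ssh://git@host/org/repo.git (optional port)
_SSH_STYLE = re.compile(r"^ssh://git@(?P<host>[^:/]+)(?::\d+)?/(?P<path>.+)$")


def ssh_to_https(url: str) -> str:
    """Return the HTTPS equivalent of an SSH git URL; other URLs pass through unchanged."""
    for pattern in (_SSH_STYLE, _SCP_STYLE):
        match = pattern.match(url)
        if match:
            return f"https://{match.group('host')}/{match.group('path')}"
    return url
```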

- deploy/codebuild.tf: upgrade image to amazonlinux2023-x86_64-standard:4.0

- deploy/variables.tf: rename tf_binary_s3/gh_cli_s3 to *_prefix with updated
  descriptions and defaults

- lambda/app.py: add optional target_account_id field; pass TARGET_ACCOUNT_ID
  to CodeBuild environmentVariablesOverride

- service-catalog/product-template.yaml: add optional TargetAccountId parameter
  with AllowedPattern validation
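As an illustration of what validating an optional account ID might look like, the sketch below accepts either an empty value or exactly 12 digits (the length of an AWS account ID). The regular expression and the function name are assumptions; the real AllowedPattern in product-template.yaml is not quoted in this PR.

```python
import re

# Assumed pattern: empty (parameter omitted) or exactly 12 ASCII digits.
ACCOUNT_ID_PATTERN = re.compile(r"(?:[0-9]{12})?")


def is_valid_target_account_id(value: str) -> bool:
    """Return True when value is empty or a 12-digit AWS account ID."""
    return ACCOUNT_ID_PATTERN.fullmatch(value) is not None
```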

- docs/HOW-IT-WORKS.md:
  - Document version governance via terraform/support
  - Note AL2023
  - Replace 'No SSH Git' constraint with SSH->HTTPS rewrite explanation
  - Add cross-account section explaining TARGET_ACCOUNT_ID and required role
  - Add 'Moving This System to a Different Account' section (Teams Q)
- Component overview BUILD phase: add missing step 6 (assume cross-account role)
  and renumber steps 7-8 accordingly
- Component overview POST_BUILD: correct description — Lambda calls GHE API
  directly; PR_URL= line in logs is informational only (not parsed by Lambda)
- Step 7 (Commit and Push): add git -c user.email/user.name config that is
  actually present in buildspec.yml
- Step 10 (Lambda Returns Results): note the informational-only nature of
  PR_URL= more clearly
- CFN outputs table: add snake_case aliases (pull_request_url, repository_url,
  branch_name) that Lambda actually emits alongside PascalCase variants
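A sketch of how snake_case aliases could be emitted alongside PascalCase outputs is shown below. The helper names are hypothetical, and the PascalCase spellings in the usage note are inferred from the alias names listed above rather than taken from lambda/app.py.

```python
import re
from typing import Dict


def to_snake_case(name: str) -> str:
    """Convert a PascalCase key such as 'BranchName' to 'branch_name'."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def with_snake_case_aliases(outputs: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of outputs with a snake_case alias added for every key."""
    aliased = dict(outputs)
    for key, value in outputs.items():
        # setdefault keeps any snake_case key the caller already provided.
        aliased.setdefault(to_snake_case(key), value)
    return aliased
```

For example, `with_snake_case_aliases({"BranchName": "feature/x"})` yields both `BranchName` and `branch_name` with the same value.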
Code changes:
- lambda/app.py: add 'global' to valid region_dir values (line 546 comment)
- service-catalog/product-template.yaml: add 'global' to RegionDir AllowedValues

Doc changes (HOW-IT-WORKS.md):
- region_dir: document 'global' as valid value for non-regional resources (SSO, IAM)
- Step 8: add tf-run plan vs tf-plan distinction; tf-plan skips symlink setup
- Bootstrapping: rewrite section — remote_state.backend.tf is created by the
  REMOTE_STATE directive in tf-run.data; what must pre-exist is remote_state.yml.
  Running 'tf-run init' is the correct first-time setup path.
- git-secret: add note that CodeBuild cannot run 'git secret reveal' (no GPG key)
- Build Timeout: add EventBridge as future improvement over Lambda polling
- SSH section: add note about service-user SSH key as alternative approach
- Why csvd-dev: add note about future move to operations accounts
- Delete: add recommended decommission path (PR-based removal + manual apply)
- Rebuild Lambda: replace personal home path with generic, use tf apply instead
  of manual aws lambda update-function-code CLI call
- Deploy: remove personal ~/aws-creds and /home/a/arnol377 paths
The canonical versions of tf-run, tf-control.sh, and tf-directory-setup.py
live in github.e.it.census.gov/terraform/support (local-app/ subtree).
We already clone that repo during the INSTALL phase for VERSION governance,
so sourcing the scripts from there costs nothing extra.

- buildspec.yml: cp from /tmp/tf-support/local-app/ instead of CODEBUILD_SRC_DIR/scripts/
- scripts/: remove tf-run, tf-control.sh, tf-directory-setup.py (bundled copies)
  tf-run.py and tf-run.data remain (project-specific)
- docs/HOW-IT-WORKS.md: update both references (component overview + Step 3)
- Updated `TfRunRequest` model to differentiate between propose and apply actions, adding relevant fields for each.
- Refactored `start_codebuild_build` function to handle environment variable overrides based on the action type.
- Implemented logic in `lambda_handler` to manage responses for both propose and apply actions.
- Added new CloudFormation templates for the proposer and executor products, enabling structured Terraform change proposals and applications.
- The proposer template handles rendering templates and opening pull requests, while the executor template applies changes after PR approval.
- docs/README.md: high-level index with reading paths by use case
- docs/HOW-IT-WORKS.md: reframe from two-product to single Proposer +
  webhook auto-apply; remove executor SC product framing
- docs/decisions/001-webhook-auto-apply.md: status Proposed → Accepted;
  update context and consequences to reflect removal of executor SC product
- docs/decisions/002-vault-aws-secrets-engine.md: new ADR for Vault AWS
  Secrets Engine; dynamic cross-account credentials; per-product IAM scope
  via Proposer terraform apply; account baseline prerequisite pattern
- docs/generalized-terraform-product-architecture.md: new
- docs/template-management.md: Executor flow, .sc-automation.yml schema
- docs/repo-vars-and-secrets.md: CodeBuild environmentVariablesOverride pattern
- docs/workflow-flowcharts.md: Mermaid diagrams for propose/apply flows
- docs/fleet-governance-at-scale.md: new
- docs/service-catalog-census-integration.md: new
- docs/cross-account-visibility.md: new
- repo-vars-and-secrets.md: remove 'Later rollout (GHA)' callout block;
  the executor is CodeBuild triggered by webhook, not GitHub Actions
- fleet-governance-at-scale.md: remove 'GHA executor rollout phase' note;
  replace with accurate CodeBuild+webhook description
Adds a dedicated section clearly describing what each CodeBuild project does,
how it is triggered, what env vars it receives, and what it does/does not do —
so stakeholders have a single place to understand the two-build model.
…tory-setup.py)

- buildspec-proposer.yml: install tf-directory-setup.py from terraform/support in
  INSTALL phase; add python-dateutil + pyyaml pip deps.
  BUILD phase: after template rendering, run Python bootstrap step that:
    1. Processes REMOTE-STATE directives in tf-run.data files — derives workspace
       remote_state.yml from layer-level file (identical to tf-run.sh behavior)
    2. Runs tf-directory-setup.py --link none in each workspace with remote_state.yml —
       generates remote_state.backend.tf + .tf.s3/.local/.none variant files + symlink

- buildspec-executor.yml: add note that REMOTE-STATE and tf-directory-setup.py
  steps are idempotent — files already exist from Proposer PR, no new files created

- docs/HOW-IT-WORKS.md: expand BUILD phase step 5 to document the full file
  generation sequence including REMOTE-STATE and tf-directory-setup.py; add
  rationale explaining why all generation must happen in the Proposer

- docs/template-management.md: fix template repo structure diagram — workspace
  remote_state.yml.j2 files removed (wrong); layer-level remote_state.yml.j2 shown;
  workspace tf-run.data with REMOTE-STATE directive shown; add layout rules for
  auto.tfvars profile/region requirement and .j2 source file handling.
  Expand Proposer Build steps to cover REMOTE-STATE + tf-directory-setup.py.
  Add principle callout: PR diff is the complete truth.
…; plan gitignore

buildspec-executor.yml:
- INSTALL: create terraform_latest symlink -> terraform (account repos use TFCOMMAND=terraform_latest)
- INSTALL: mkdir /data/terraform/terraform.d/plugin-cache + providers (required by .tf-control.tfrc)
- BUILD: after tf-run apply, git add symlink re-link + .terraform.lock.hcl and push directly
  to main with [skip ci] to prevent webhook re-trigger
- Add CodeBuild cache block for /data/terraform/terraform.d/plugin-cache (persists
  provider archives across builds via S3)
- Add log note: logs/ is ephemeral, must be in .gitignore

docs/HOW-IT-WORKS.md:
- INSTALL phase: document terraform_latest alias and /data/terraform dir creation
- BUILD phase step 5: document symlink re-link + lock file commit-back with rationale

docs/template-management.md:
- Template structure: add .gitignore and .terraform.lock.hcl to workspace dirs
- Layout rules: add .gitignore required entries (logs/, .terraform/, tfstate*)
- Layout rules: explain .terraform.lock.hcl lifecycle (committed, Executor updates + pushes back)
- Layout rules: explain terraform_latest alias and plugin cache/.tf-control.tfrc behavior
Core principle: account repos already carry .tf-control, .tf-control.tfrc,
region.tf, credentials.d/, variables.d/ from initial setup. Template repos
provide only the workload-specific delta (new .tf.j2 files + tf-run.data).

Changes:
- Rewrite template-management.md opening to explain delta-overlay model
  and why duplicating standard files would break reusability
- Minimal real example: template-s3-bucket is 3 files total
- New-layer case: layer-level remote_state.yml provided via EXTRA_FILES
  (Lambda Pydantic model builds it from SC form inputs), not from template
- Remove .tf-control, .tf-control.tfrc, region.tf, credentials.d/,
  variables.d/ from template structure diagram (wrong/environment-specific)
- Remove outdated Lambda template organization section (old EKS-only model)
- Replace stale Executor section (was: renders templates + opens PRs)
  with correct model (runs tf-run apply only, commits lock+symlink back)
- Fix Adding checklist: delta files only + EXTRA_FILES note for new layers
Template repos no longer encode layer/workspace as directory nesting.
LAYER and REGION_DIR are already known env vars - the Proposer uses them
to determine the destination path in the account repo.

buildspec-proposer.yml:
- Add TEMPLATE_SOURCE_PATH env var (selects subdirectory variant within repo)
- Rewrite template rendering: dst_root = LAYER/REGION_DIR/ instead of '.'
- Dotfiles at template root (e.g. .sc-automation.yml) go to account repo root
- Document flat layout convention in comments
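The destination rule described in this commit can be sketched as a small path mapping. The function name is hypothetical, stripping the `.j2` suffix on render is an assumption, and `TEMPLATE_SOURCE_PATH` is not modelled here.

```python
from pathlib import PurePosixPath


def destination_path(template_rel_path: str, layer: str, region_dir: str) -> PurePosixPath:
    """Map a file in a flat template repo to its path in the account repo."""
    rel = PurePosixPath(template_rel_path)
    # Assumption: rendered templates drop the trailing ".j2".
    name = rel.name[:-3] if rel.name.endswith(".j2") else rel.name
    rel = rel.with_name(name)
    # Dotfiles at the template root land at the account repo root.
    if len(rel.parts) == 1 and name.startswith("."):
        return rel
    # Everything else lands under LAYER/REGION_DIR/.
    return PurePosixPath(layer, region_dir) / rel
```

With `layer="infrastructure"` and `region_dir="west"`, a root-level `main.tf.j2` maps to `infrastructure/west/main.tf`, while `.sc-automation.yml.j2` maps to `.sc-automation.yml` at the repo root.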

docs/template-management.md:
- Rewrite What Belongs section: flat structure, show where files land
- template-s3-bucket example is now 3 flat files (not nested infrastructure/west/)
- TEMPLATE_SOURCE_PATH explained inline with multi-variant example
- Remove old Subdirectory Templates section (replaced with inline example)
- tf-run.data, .sc-automation.yml.j2, .terraform.lock.hcl notes updated

docs/HOW-IT-WORKS.md:
- BUILD phase step 3: document flat layout + dotfile root exception
…emplate model

- Template repos are flat: just .tf.j2 + tf-run.data, no nested layer/region dirs
- LAYER and REGION_DIR are Proposer env vars; files are written to the correct
  path at copy time, not encoded in template directory structure
- Remove lambda/templates/{product_type}/ tree (templates live in the template repo)
- Layer-level remote_state.yml built by Lambda Pydantic model extra_files()
  from validated SC form inputs, not stored in template repo
- Pydantic model example updated with account_alias field + extra_files() method
- Onboarding checklist updated: no skeleton clone, no lambda/templates/ step
… it could either support or be changed to work better with this project
…s index; create ADR-003 for Vault cluster topology; create ADR-004 for account baseline IAM role; create ADR-005 for Service Catalog portfolio sharing strategy
… flow

- buildspec-executor.yml / buildspec.yml: default CROSS_ACCOUNT_ROLE=r-inf-terraform;
  replace hardcoded role name with ${CROSS_ACCOUNT_ROLE} in sts:AssumeRole block
  (interim scaffolding — will be replaced by vault read in CSC-1345)
- deploy/codebuild.tf: add CROSS_ACCOUNT_ROLE env var to executor project
- deploy/iam.tf: StsAssumeRoleCrossAccount allows r-inf-terraform, r-inf-terraform-eks,
  sc-automation-codebuild-role (backwards compat)
- lambda/app.py: add TfRunRequest.cross_account_role field (default: r-inf-terraform);
  pass CROSS_ACCOUNT_ROLE in CodeBuild env overrides for apply action
- docs/decisions/001-webhook-auto-apply.md: add cross_account_role to schema table
- design-docs/CHECKPOINT.md: update with Vault pivot and CSC-1344 blocked status

Jira: CSC-1344 (Blocked on CSC-1345)
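To illustrate how `CROSS_ACCOUNT_ROLE` might travel from the Lambda to CodeBuild for the apply action, here is a hedged sketch. The function name and the exact split of variables between propose and apply are assumptions; the list-of-dicts shape matches the `environmentVariablesOverride` parameter of the CodeBuild StartBuild API.

```python
from typing import Dict, List, Optional


def build_env_overrides(
    action: str,
    *,
    target_account_id: Optional[str] = None,
    cross_account_role: str = "r-inf-terraform",  # default named in this commit
    template_repo: Optional[str] = None,
    template_vars: Optional[str] = None,
) -> List[Dict[str, str]]:
    """Build a CodeBuild environmentVariablesOverride list for the given action."""
    variables: Dict[str, str] = {}
    if action == "propose":
        # Assumed: template inputs are only relevant when proposing a change.
        if template_repo:
            variables["TEMPLATE_REPO"] = template_repo
        if template_vars:
            variables["TEMPLATE_VARS"] = template_vars
    elif action == "apply":
        variables["CROSS_ACCOUNT_ROLE"] = cross_account_role
        if target_account_id:
            variables["TARGET_ACCOUNT_ID"] = target_account_id
    else:
        raise ValueError(f"unknown action: {action!r}")
    return [
        {"name": name, "value": value, "type": "PLAINTEXT"}
        for name, value in variables.items()
    ]
```

The returned list would be passed as `environmentVariablesOverride` when starting the build.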
Internal deck covering problem statement, architecture, security
benefits (NIST 800-53), government/compliance considerations (BSL 1.1,
OpenBao, FIPS 140-2), phased roadmap, and call to action.

Jira: CSC-1345 CSC-1346
ADR-002 (HashiCorp Vault AWS Secrets Engine) rejected after review with
Matt Morgan. Key reasons:
- CodeBuild already has an IAM role; direct sts:AssumeRole into a
  StackSet-provisioned target-account role is the correct pattern
- StackSets auto-propagate trust to new accounts at vending time and
  remove it at decommission — no extra per-account onboarding step
- Role assumption (no credential issuance) is strictly better security
- Vault adds cluster infrastructure cost with no proportionate benefit
- Note: OpenBao preferred over HashiCorp Vault if Vault is ever needed

ADR-003 (vault cluster topology) withdrawn — depends on ADR-002.
ADR-004 (sc-automation-codebuild-role via StackSet) confirmed as the
final design; Vault dependency caveat removed.

Jira: CSC-1345 → Done, CSC-1346 → Done, CSC-1344 → In Progress (unblocked)
…omation-codebuild-role

Two related changes to wire the executor to the new cross-account role
deployed by the terraform-service-catalog-census StackSet (PR #13):

1. CROSS_ACCOUNT_ROLE default changed from r-inf-terraform to
   sc-automation-codebuild-role — the new purpose-built role for this
   automation system, deployed org-wide via CFN StackSet.

2. --external-id "${TARGET_ACCOUNT_ID}" added to the aws sts assume-role
   call — required by the ExternalId condition on sc-automation-codebuild-role
   (sts:ExternalId = AWS::AccountId) per ADR-004 confused-deputy protection.

The r-inf-terraform role can still be used by passing CROSS_ACCOUNT_ROLE=r-inf-terraform
as an env var override; it is not removed from the CodeBuild IAM policy.

See ADR-004: docs/decisions/004-account-baseline-iam-role.md
Jira: CSVDIES-9980
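For context, the role-side condition that makes `--external-id` mandatory could look like the trust policy sketched below. The real policy lives in the StackSet template in the other repository and is not shown here, so the function name and the trusted principal are assumptions; only the `sts:ExternalId` equals account ID condition comes from this commit message.

```python
from typing import Any, Dict


def trust_policy(account_id: str, trusted_principal_arn: str) -> Dict[str, Any]:
    """Build an IAM trust policy that requires ExternalId to equal the role's own account ID."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Hypothetical principal: the CodeBuild service role in the automation account.
                "Principal": {"AWS": trusted_principal_arn},
                "Action": "sts:AssumeRole",
                # In CloudFormation, AWS::AccountId resolves to the account the stack runs in.
                "Condition": {"StringEquals": {"sts:ExternalId": account_id}},
            }
        ],
    }
```

Under a policy of this shape, an assume-role call that omits the ExternalId, or passes a different account's ID, is denied.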
@arnol377 merged commit e6dae08 into main on Jun 11, 2026