diff --git a/README.md b/README.md index d2048da..be079e5 100644 --- a/README.md +++ b/README.md @@ -1,25 +1,189 @@ # sc-lambda-ghactions -Service Catalog → Lambda → GitHub Actions automation. +Service Catalog → Lambda → CodeBuild → Account Repo (`tf-run` + PR) automation. + +This repo contains every piece of the next-generation SC automation platform: +the Lambda handler, the CodeBuild buildspec, the tf-run toolchain scripts, the +Service Catalog CFN product template, and all Terraform that deploys it. + +--- ## Architecture ``` -SC Console (user fills product form) - └─> CFN Stack (Custom::* resource) - └─> Lambda (cross-account, centralized in csvd-dev) - └─> GitHub Actions (repository_dispatch) - └─> Clone target account repo - └─> Operate on repo files (HCL, YAML, etc.) - └─> Open PR → account repo +User (SC Console) + └─> fills product form → submits + → CFN creates Custom::TerraformRun resource + → ServiceToken → Lambda tf-run-executor-trigger + │ (centralized in csvd-dev, us-gov-west-1) + ├─> Validates inputs (Pydantic v2) + ├─> Fetches GHE PAT from Secrets Manager (ghe-runner/github-token) + ├─> Starts CodeBuild tf-run-executor with env-var overrides + └─> Polls CodeBuild every 20 s → returns PR URL to CFN + +CodeBuild tf-run-executor (Amazon Linux 2, 60 min timeout) + ├─> INSTALL + │ ├─> Terraform binary from S3 (registry.terraform.io is blocked) + │ ├─> Census CA cert → update-ca-trust (for GHE TLS) + │ ├─> tf-run toolchain from scripts/ (tf-run, tf-control.sh, tf-directory-setup.py) + │ └─> gh CLI from S3 + └─> BUILD + ├─> git clone https://@github.e.it.census.gov/SCT-Engineering/ + ├─> git checkout -B (default: repo-init) + ├─> Write EXTRA_FILES (JSON map path→content) into the working tree + ├─> git add + commit + push + ├─> cd // + ├─> TFARGS=-auto-approve tf-run apply [tag:] + │ (or tf-plan if DRY_RUN=true) + └─> gh pr create --base main --head + POST_BUILD: emit PR_URL= line for Lambda to parse +``` + +--- + +## Repo Layout + +``` 
+sc-lambda-ghactions/ +├── buildspec.yml # CodeBuild build definition (source: this repo) +├── scripts/ +│ ├── tf-run # Bash tf-run orchestrator (v1.13.13) +│ ├── tf-control.sh # tf-{action} wrapper script (v1.11.0) +│ ├── tf-run.py # Python port of tf-run (v2.0.0) +│ └── tf-directory-setup.py # Generates remote_state.backend.tf from remote_state.yml +├── data/ +│ └── tf-run.data # Sample tf-run.data for reference +├── lambda/ +│ ├── app.py # Lambda handler (Python 3.12) +│ ├── requirements.txt # boto3, pydantic +│ └── Dockerfile # Python 3.12 Lambda container image +├── service-catalog/ +│ └── product-template.yaml # CFN template for the SC product +└── deploy/ + ├── provider.tf # AWS provider, Terraform version constraint + ├── variables.tf # All tunable inputs (with sensible defaults) + ├── codebuild.tf # aws_codebuild_project.tf_run_executor + GHE credential + ├── iam.tf # Lambda execution role + CodeBuild service role + ├── lambda.tf # ECR repo + aws_lambda_function + cross-account permission + └── service_catalog.tf # SC portfolio, product, launch constraint, S3 template upload +``` + +--- + +## Service Catalog Product Parameters + +| SC Form Field | Variable | Notes | +|---------------|----------|-------| +| Account Repo Name | `account_repo` | e.g. `229685449397-csvd-dev-platform-dev-gov` | +| Terraform Layer | `layer` | `common`, `infrastructure`, or `vpc` | +| Region Directory | `region_dir` | `east` or `west` | +| Git Branch | `git_branch` | branch to commit/PR from; default `repo-init` | +| Start Tag (optional) | `tf_run_start_tag` | tf-run.data TAG label; empty = run all steps | +| Dry Run | `dry_run` | `true` = tf plan only, no apply | +| Extra Config Files (JSON) | `extra_files` | `{"relative/path": "content"}` written before tf-run | + +`aws_account_id` and `aws_region` are **not** user-facing — resolved via `!Sub` in the CFN template. 
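The `extra_files` field is described above as a JSON map of relative path to content that is written into the working tree before `tf-run`. A minimal sketch of that step follows, assuming Python 3.9+ and a hypothetical helper name `write_extra_files`. The real buildspec may implement this differently, and the path-escape check is an added assumption rather than documented behaviour.

```python
import json
from pathlib import Path


def write_extra_files(extra_files_json: str, workdir: str) -> list:
    """Write a JSON map of relative-path -> content under workdir.

    Hypothetical helper for illustration; not taken from buildspec.yml.
    """
    root = Path(workdir).resolve()
    files = json.loads(extra_files_json or "{}")
    if not isinstance(files, dict):
        raise ValueError("EXTRA_FILES must be a JSON object of path -> content")

    written = []
    for rel_path, content in files.items():
        target = (root / rel_path).resolve()
        # Assumption: reject absolute paths and "../" escapes from the clone.
        if not target.is_relative_to(root) or target == root:
            raise ValueError(f"path escapes the working tree: {rel_path}")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(str(content))
        written.append(str(target.relative_to(root)))
    return written
```

An empty string is treated as "no extra files", which matches the optional nature of the form field.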
+ +--- + +## Deploying + +### Prerequisites + +- Terraform ≥ 1.3 (via `tf` alias) +- AWS credentials for `csvd-dev` (`229685449397`, `us-gov-west-1`) +- GHE PAT already in Secrets Manager as `ghe-runner/github-token` +- An S3 bucket to hold the SC product template artifact + +### Required Terraform variables + +```hcl +# deploy/terraform.tfvars +source_repo_url = "https://github.e.it.census.gov/SCT-Engineering/sc-lambda-ghactions" +artifacts_bucket_name = "csvd-sc-product-templates" # your SC artifacts bucket +org_id = "o-abc123def4" # your AWS Org ID +``` + +### Deploy + +```bash +export AWS_DEFAULT_REGION=us-gov-west-1 +source ~/aws-creds + +cd sc-lambda-ghactions/deploy +tf init +tf apply +``` + +### Build and push the Lambda image + +After `tf apply` creates the ECR repo: + +```bash +aws ecr get-login-password --region us-gov-west-1 \ + | docker login --username AWS \ + --password-stdin 229685449397.dkr.ecr.us-gov-west-1.amazonaws.com + +docker build -t tf-run-executor/lambda:latest lambda/ +docker tag tf-run-executor/lambda:latest \ + 229685449397.dkr.ecr.us-gov-west-1.amazonaws.com/tf-run-executor/lambda:latest +docker push 229685449397.dkr.ecr.us-gov-west-1.amazonaws.com/tf-run-executor/lambda:latest +``` + +Then update the function to pick up the new image: + +```bash +aws lambda update-function-code \ + --function-name tf-run-executor-trigger \ + --image-uri 229685449397.dkr.ecr.us-gov-west-1.amazonaws.com/tf-run-executor/lambda:latest \ + --region us-gov-west-1 +``` + +### Manual CodeBuild test (before SC wiring) + +```bash +export AWS_DEFAULT_REGION=us-gov-west-1 +source ~/aws-creds + +aws codebuild start-build \ + --project-name tf-run-executor \ + --environment-variables-override \ + "name=ACCOUNT_REPO,value=229685449397-csvd-dev-platform-dev-gov,type=PLAINTEXT" \ + "name=LAYER,value=infrastructure,type=PLAINTEXT" \ + "name=REGION_DIR,value=west,type=PLAINTEXT" \ + "name=DRY_RUN,value=true,type=PLAINTEXT" \ + "name=GITHUB_TOKEN,value=$(aws 
secretsmanager get-secret-value \ + --secret-id ghe-runner/github-token --query SecretString --output text),type=PLAINTEXT" ``` -## Status +--- + +## Key AWS Resources + +| Resource | Name | Purpose | +|----------|------|---------| +| Lambda | `tf-run-executor-trigger` | CFN Custom Resource handler | +| CodeBuild | `tf-run-executor` | Runs tf-run in the account repo | +| ECR | `tf-run-executor/lambda` | Lambda container image | +| Secrets Manager | `ghe-runner/github-token` | GHE PAT used by both Lambda (to start CodeBuild) and CodeBuild (to clone + gh CLI) | +| SC Portfolio | `sc-automation-tf-run` | Groups the tf-run product | +| SC Product | `sc-automation-tf-run-executor` | CFN template + launch constraint | + +--- + +## Census Network Notes + +- **Terraform registry** (`registry.terraform.io`) is blocked — binary pulled from `s3://csvd-packer-pipeline-assets/terraform/` +- **GHE TLS**: Census CA cert not in standard bundles → installed from `s3://csvd-packer-pipeline-assets/certs/census-ca.pem` via `update-ca-trust` +- **Proxy**: `HTTPS_PROXY=http://proxy.tco.census.gov:3128` for provider downloads; `NO_PROXY` includes `github.e.it.census.gov,169.254.169.254` +- **SSH blocked**: all git operations use HTTPS with token in URL -Early design / scaffolding phase. 
+--- ## Related Repos -- [`lambda-template-repo-generator`](https://github.e.it.census.gov/CSVD/lambda-template-repo-generator) — current CodeBuild-based Lambda -- [`terraform-service-catalog-census`](https://github.e.it.census.gov/SCT-Engineering/terraform-service-catalog-census) — SC product templates -- [`eks-automation-lambda`](https://github.e.it.census.gov/arnol377/eks-automation-lambda) — design docs +| Repo | Purpose | +|------|---------| +| [`lambda-template-repo-generator`](https://github.e.it.census.gov/CSVD/lambda-template-repo-generator) | Current CodeBuild-based EKS Lambda (predecessor) | +| [`terraform-service-catalog-census`](https://github.e.it.census.gov/SCT-Engineering/terraform-service-catalog-census) | Census-managed SC product templates (production deployment path) | +| [`terraform-eks-deployment`](https://github.e.it.census.gov/SCT-Engineering/terraform-eks-deployment) | EKS repo creation Terraform workspace (predecessor CodeBuild payload) | diff --git a/design-docs/README.md b/design-docs/README.md index aa54790..a27ee60 100644 --- a/design-docs/README.md +++ b/design-docs/README.md @@ -1,8 +1,243 @@ # Design Documents -Architecture decisions, flow diagrams, and planning notes for the +Architecture decisions, flow diagrams, and reference notes for the SC → Lambda → CodeBuild → Account Repo → tf-run automation. +> **Status**: Implementation complete (Phases 1–3). Phase 4 (polish) not started. + +--- + +## Architecture Overview + +**Runner: CodeBuild** (not GitHub Actions — GHA is blocked on OIDC setup at Census). +The buildspec steps are structured to port directly to a GHA workflow once that blocker clears. +Lambda changes for that migration would be limited to the calls that start and poll the run. 
+ +CodeBuild handles everything after the SC form submission: +- Clone the account repo over HTTPS with a GHE PAT +- Write config files (`EXTRA_FILES`) into the correct layer/region directory +- Commit + push to a branch (default: `repo-init`) +- Run `tf-run` in the correct `//` +- Open a PR via `gh` CLI + +--- + +## Full Flow + +``` +User (SC Console) + └─> fills product form → submits + → CFN creates Custom::TerraformRun resource + → ServiceToken → Lambda tf-run-executor-trigger + │ (centralized in csvd-dev, 229685449397, us-gov-west-1) + │ + ├─> Validates inputs (Pydantic v2 TfRunRequest model) + ├─> Fetches GHE PAT from Secrets Manager (ghe-runner/github-token) + ├─> Starts CodeBuild tf-run-executor with env-var overrides + └─> Polls CodeBuild every 20s until complete or Lambda deadline (900s) + │ + └─> Returns PR URL + result to CFN (SUCCESS / FAILED) + +CodeBuild tf-run-executor (Amazon Linux 2, 60 min timeout) + ├─> INSTALL phase + │ ├─> Terraform binary from S3 (registry.terraform.io is blocked) + │ ├─> Census CA cert → update-ca-trust (for GHE TLS) + │ ├─> tf-run toolchain from scripts/ in this repo: + │ │ tf-run, tf-control.sh, tf-directory-setup.py + │ │ + tf-{action} symlinks (tf-plan, tf-apply, tf-init, ...) 
+ │ ├─> Python deps: jinja2, python-dateutil, pyyaml (for tf-directory-setup.py) + │ └─> gh CLI from S3 + │ + └─> BUILD phase + ├─> git clone https://@github.e.it.census.gov/SCT-Engineering/ + ├─> git checkout -B (default: repo-init) + ├─> Write EXTRA_FILES (JSON map of relative-path → content) into working tree + ├─> git add -A && git commit --allow-empty && git push origin + ├─> cd // + ├─> TFARGS=-auto-approve tf-run apply [tag:] + │ — or — tf-plan (if DRY_RUN=true) + ├─> gh pr create --base main --head (idempotent) + └─> POST_BUILD: emit "PR_URL=" line for Lambda to parse +``` + +--- + +## Service Catalog Product Form Parameters + +| CFN Parameter | SC Label | Lambda field | Notes | +|---------------|----------|--------------|-------| +| `AccountRepo` | Account Repo Name | `account_repo` | e.g. `229685449397-csvd-dev-platform-dev-gov` | +| `Layer` | Terraform Layer | `layer` | `common`, `infrastructure`, or `vpc` | +| `RegionDir` | Region Directory | `region_dir` | `east` or `west` | +| `GitBranch` | Git Branch | `git_branch` | default `repo-init` | +| `TfRunStartTag` | Start Tag (optional) | `tf_run_start_tag` | tf-run.data TAG label; empty = run all steps | +| `DryRun` | Dry Run | `dry_run` | `"true"` = tf plan only, no apply | +| `ExtraFiles` | Extra Config Files (JSON) | `extra_files` | JSON string `{"relative/path": "content"}` — parsed by validator | + +`aws_account_id` and `aws_region` are **not** user-facing — resolved via `!Sub` in the CFN template. 
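The table above maps PascalCase CFN parameters to snake_case Lambda fields. One way such a mapping could be normalized is sketched below, assuming hypothetical helper names `to_snake_case` and `normalize_properties`. The normalizer in `lambda/app.py` is not shown in this document and may behave differently, particularly for acronyms.

```python
import re


def to_snake_case(key: str) -> str:
    """Convert a PascalCase key to snake_case; leave snake_case keys alone."""
    if "_" in key or key.islower():
        return key
    # "AccountRepo" -> "Account_Repo"
    s = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", key)
    # Split an acronym from the following word: "PRUrl" -> "PR_Url"
    s = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", "_", s)
    return s.lower()


def normalize_properties(props: dict) -> dict:
    """Normalize CFN ResourceProperties keys, dropping ServiceToken.

    CloudFormation always includes ServiceToken in ResourceProperties,
    and it is not part of the request model.
    """
    return {to_snake_case(k): v for k, v in props.items() if k != "ServiceToken"}
```

For example, `normalize_properties({"TfRunStartTag": "vpc", "dry_run": "true"})` yields the keys `tf_run_start_tag` and `dry_run`.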
+ +--- + +## Lambda Design + +**Function name**: `tf-run-executor-trigger` +**Account**: `229685449397` (csvd-dev, `us-gov-west-1`) +**Timeout**: 900s — must exceed CodeBuild poll window +**Runtime**: Python 3.12 container image (`lambda/Dockerfile` → ECR `tf-run-executor/lambda`) + +### Input model (Pydantic v2) — `lambda/app.py` + +```python +class TfRunRequest(BaseModel): + account_repo: str + layer: Literal["common", "infrastructure", "vpc"] + region_dir: Literal["east", "west"] + tf_run_start_tag: str = "" + extra_files: dict = {} # field_validator parses JSON string from CFN + git_branch: str = "repo-init" + dry_run: bool = False +``` + +### Key responsibilities + +1. Normalize CFN `ResourceProperties` (PascalCase → snake_case; snake_case keys kept as-is) +2. Validate via `TfRunRequest` — rejects bad layer/region_dir values early +3. Fetch GHE PAT from Secrets Manager (`ghe-runner/github-token`) +4. Build `environmentVariablesOverride` with `ACCOUNT_REPO`, `LAYER`, `REGION_DIR`, + `GIT_BRANCH`, `TF_RUN_START_TAG`, `EXTRA_FILES` (JSON), `DRY_RUN`, `GITHUB_TOKEN` +5. Call `codebuild:StartBuild` +6. Poll every 20s via `codebuild:BatchGetBuilds` until `buildStatus != IN_PROGRESS` or Lambda deadline +7. On `SUCCEEDED`: parse `PR_URL=` from build log, call `GET /repos/{org}/{repo}/pulls?head=...` +8. Signal CFN `SUCCESS` (with `pull_request_url`, `repository_url`, `branch_name`) or `FAILED` + +### Physical resource ID + +`{account_repo}-{layer}-{region_dir}` — ensures idempotent Updates don't re-run if nothing changed. + +### Delete handling + +`Delete` events are no-ops (signal SUCCESS immediately) — tf-run changes are not automatically reversible. 
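The physical resource ID, the `PR_URL=` log parsing, and the no-op `Delete` path described above can be sketched together as follows. This is an illustrative outline only: `handle`, `cfn_response`, and the injected `run_build` callable are hypothetical names, and the real handler in `lambda/app.py` also validates input, calls CodeBuild, and sends the response to the CFN response URL.

```python
import re


def physical_resource_id(account_repo: str, layer: str, region_dir: str) -> str:
    # Same format as documented: {account_repo}-{layer}-{region_dir}
    return f"{account_repo}-{layer}-{region_dir}"


def parse_pr_url(build_log: str):
    """Return the value of the first 'PR_URL=' line in the log, or None."""
    match = re.search(r"^PR_URL=(\S+)", build_log, flags=re.MULTILINE)
    return match.group(1) if match else None


def cfn_response(event: dict, props: dict, status: str, data=None, reason: str = "") -> dict:
    """Build the custom-resource response body CloudFormation expects."""
    if event["RequestType"] == "Delete" and event.get("PhysicalResourceId"):
        physical_id = event["PhysicalResourceId"]  # echo the ID CFN already knows
    else:
        physical_id = physical_resource_id(
            props["account_repo"], props["layer"], props["region_dir"]
        )
    return {
        "Status": status,
        "Reason": reason,
        "PhysicalResourceId": physical_id,
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": data or {},
    }


def handle(event: dict, run_build) -> dict:
    """run_build is a hypothetical callable: props -> (build_status, build_log)."""
    props = event["ResourceProperties"]
    if event["RequestType"] == "Delete":
        # Delete is a no-op: tf-run changes are not automatically reversible.
        return cfn_response(event, props, "SUCCESS")
    build_status, build_log = run_build(props)
    if build_status != "SUCCEEDED":
        return cfn_response(
            event, props, "FAILED", reason=f"CodeBuild finished with {build_status}"
        )
    data = {
        "pull_request_url": parse_pr_url(build_log) or "",
        "branch_name": props.get("git_branch", "repo-init"),
    }
    return cfn_response(event, props, "SUCCESS", data=data)
```

Injecting `run_build` keeps the sketch free of AWS calls, so the Create, Update, and Delete paths can be exercised without CodeBuild.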
+ +### Environment variables (set on the Lambda function) + +| Variable | Default | Purpose | +|----------|---------|---------| +| `CODEBUILD_PROJECT_NAME` | `tf-run-executor` | CodeBuild project to start | +| `GITHUB_TOKEN_SECRET_NAME` | `ghe-runner/github-token` | SM secret for GHE PAT | +| `GITHUB_API` | `https://github.e.it.census.gov/api/v3` | GHE REST API base URL | +| `GITHUB_ORG_NAME` | `SCT-Engineering` | GHE org that owns account repos | + +--- + +## CodeBuild Environment Variables + +### Static (set on the project, configurable via Terraform variables) + +| Variable | Default | Purpose | +|----------|---------|---------| +| `GITHUB_ORG` | `SCT-Engineering` | GHE org | +| `TF_BINARY_S3` | `s3://csvd-packer-pipeline-assets/terraform/terraform_1.9.1_linux_amd64.zip` | Terraform binary | +| `CENSUS_CA_S3` | `s3://csvd-packer-pipeline-assets/certs/census-ca.pem` | Census CA cert | +| `GH_CLI_S3` | `s3://csvd-packer-pipeline-assets/tools/gh_2.49.0_linux_amd64.tar.gz` | gh CLI | +| `HTTPS_PROXY` | `http://proxy.tco.census.gov:3128` | Outbound proxy | +| `NO_PROXY` | `github.e.it.census.gov,169.254.169.254` | Direct-connect targets | +| `GITHUB_TOKEN` | (from SM `ghe-runner/github-token`, type SECRETS_MANAGER) | GHE PAT | + +### Per-build overrides (injected by Lambda) + +`ACCOUNT_REPO`, `LAYER`, `REGION_DIR`, `GIT_BRANCH`, `TF_RUN_START_TAG`, `EXTRA_FILES`, `DRY_RUN`, `GITHUB_TOKEN` + +--- + +## tf-run Non-Interactive Behavior in CodeBuild + +`tf-run` prompts `continue [y/n]` between steps. In CodeBuild (non-TTY stdin): +- **Bash `tf-run`**: `read -n 1 -t $DURATION` returns non-zero immediately on non-TTY → falls through to `CONTINUE=$DEFAULT="y"` → auto-proceeds +- **Python `tf-run.py`**: `select.select([sys.stdin], ...)` with timeout → same auto-proceed + +`TFARGS=-auto-approve` is set so `terraform apply` doesn't prompt for confirmation. 
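The timed-prompt behaviour described above for `tf-run.py` can be illustrated with a small sketch. It assumes a POSIX platform, where `select.select` accepts file objects such as pipes and stdin. The function name `prompt_continue` and its defaults are hypothetical, and this is not the actual `tf-run.py` code.

```python
import select
import sys


def prompt_continue(stream=None, timeout: float = 5.0, default: str = "y") -> str:
    """Return 'y' or 'n', falling back to default on timeout, EOF, or bad input."""
    stream = sys.stdin if stream is None else stream
    try:
        ready, _, _ = select.select([stream], [], [], timeout)
    except (OSError, ValueError):
        # stdin closed or not selectable: behave as if nobody answered
        return default
    if not ready:
        return default  # timed out waiting for input
    answer = stream.readline().strip().lower()
    # An empty read (EOF) or anything other than y/n falls back to the default
    return answer[:1] if answer[:1] in ("y", "n") else default
```

With no input, or with stdin at EOF, the call returns the default, which is the auto-proceed path the section describes.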
+ +--- + +## Account Repo Pre-conditions + +For `tf-run` to succeed, the target `//` must already have: +- `remote_state.backend.tf` (generated by `tf-directory-setup.py`) +- `remote_state..tf` symlink pointing to the `.s3` variant +- `tf-run.data` with the correct step definitions +- `.tf-control` at the repo root (Terraform version pin) + +These exist in any bootstrapped account repo. A separate "init" mode (future work) +can run `tf-directory-setup.py -l none -f` for first-time setup. + +--- + +## Infrastructure (deploy/) + +| Resource | File | Notes | +|----------|------|-------| +| `aws_ecr_repository.lambda` | `lambda.tf` | Image: `tf-run-executor/lambda` | +| `aws_lambda_function.tf_run_trigger` | `lambda.tf` | 900s, 256 MB, Image package | +| `aws_lambda_permission.cfn_invoke` | `lambda.tf` | Cross-account; restricted to `var.org_id` | +| `aws_codebuild_project.tf_run_executor` | `codebuild.tf` | Source = this repo on GHE; 60 min | +| `aws_codebuild_source_credential.ghe` | `codebuild.tf` | GHE PAT from SM; one per account | +| `aws_iam_role.lambda_exec` | `iam.tf` | SM read + CodeBuild start/poll + CWL | +| `aws_iam_role.codebuild_exec` | `iam.tf` | S3 read + SM read + CWL | +| `aws_s3_object.product_template` | `service_catalog.tf` | Uploads `service-catalog/product-template.yaml` | +| `aws_servicecatalog_portfolio.this` | `service_catalog.tf` | | +| `aws_servicecatalog_product.tf_run` | `service_catalog.tf` | CLOUD_FORMATION_TEMPLATE type | +| `aws_servicecatalog_constraint.launch` | `service_catalog.tf` | LAUNCH type; uses `aws_iam_role.sc_launch` | +| `aws_iam_role.sc_launch` | `service_catalog.tf` | InvokeLambda + CloudFormation operations | + +--- + +## Implementation Status + +### Phase 1 — CodeBuild + buildspec ✅ +- `buildspec.yml` — full install + build + post_build +- `deploy/codebuild.tf` — project, GHE source credential +- `deploy/iam.tf` — CodeBuild service role +- `deploy/variables.tf`, `deploy/provider.tf` + +### Phase 2 — Lambda ✅ +- 
`lambda/app.py` — Pydantic model, poll loop, cfn-response, `extra_files` JSON validator +- `lambda/Dockerfile` — Python 3.12 Lambda container image +- `lambda/requirements.txt` — boto3, pydantic +- `deploy/lambda.tf` — ECR repo + Lambda function + cross-account permission +- `deploy/iam.tf` — Lambda execution role (SM, CodeBuild, CWL) + +### Phase 3 — Service Catalog ✅ +- `service-catalog/product-template.yaml` — 7 parameters, `Custom::TerraformRun`, 4 outputs +- `deploy/service_catalog.tf` — portfolio, product, S3 upload, launch constraint + +### Phase 4 — Polish (not started) +- CloudWatch dashboard (build history, PR links) +- SNS alert on FAILED builds +- GHA migration: replace CodeBuild with `repository_dispatch` once OIDC is unblocked + +--- + +## GHA Migration Path (deferred — blocked on OIDC) + +The CodeBuild buildspec steps map 1:1 to GHA `steps:` entries. +When OIDC is available, CodeBuild can be replaced by a `repository_dispatch` +trigger on the account repo with a `.github/workflows/tf-run.yml`. +**Only the Lambda's trigger and polling calls would change**: it would call GitHub's +`POST /repos/{org}/{repo}/dispatches` instead of `codebuild:StartBuild`, +then poll the workflow run via the GHA REST API. + +--- + +## What NOT to Do + +- ❌ SSH clone — Census proxy blocks SSH; always HTTPS + token in URL +- ❌ Write temp files to `/tmp` — use the CodeBuild build directory +- ❌ Use `terraform` directly — always use the `tf` alias (symlink to `tf-control.sh`) +- ❌ Hardcode `aws-us-gov` in ARNs — use `${AWS::Partition}` +- ❌ Add `aws_account_id` or `aws_region` as SC form parameters — use `!Sub` +- ❌ Run tf-run from the repo root — always `cd //` first +- ❌ Pass PascalCase properties to the Lambda Custom Resource — use snake_case to avoid normalizer edge cases with acronyms + --- ## Architecture Overview