fix: EKS-only Lambda cleanup + SC template AwsRegion/AWSAccountId removal #1

Merged Apr 21, 2026 (20 commits)
Changes from 13 commits
Commits
803168a
fix: use Lambda-only approach for EKS repo creation; add Copilot inst…
Apr 2, 2026
0a74dd7
fix: public visibility by default; add collaborator support for repo …
Apr 2, 2026
528f4b3
fix: VERIFY_SSL=false; public repo visibility; add ec2:DescribeVpcs t…
Apr 2, 2026
a79cee4
feat: path_mapper for dynamic EKS repo structure (safe revert baseline)
Apr 6, 2026
ec54b54
feat: Lambda delegates EKS repos to CodeBuild + terraform-eks-deployment
Apr 6, 2026
52ebef0
chore: tf apply — add eks-terragrunt-repo-creator CodeBuild project +…
Apr 6, 2026
aee6987
fix: add CodeBuild VPC endpoint + IAM policy for Lambda→CodeBuild con…
Apr 6, 2026
8310ee1
fix: increase Lambda timeout to 900s to cover CodeBuild poll window
Apr 6, 2026
eb18463
fix: remove spurious '- ' prefix from additional_post_build_commands
Apr 7, 2026
5d3ff19
fix: use PAT (ghe-runner/github-token) for Terraform GitHub provider …
Apr 7, 2026
26c6fe9
fix: add pull_request_url and branch_name to CodeBuild success response
Apr 7, 2026
12a742a
docs: rewrite copilot-instructions to reflect CodeBuild+Terraform arc…
Apr 7, 2026
065d2f2
chore: update deploy Terraform state after tf apply
Apr 7, 2026
560a5ec
fix: address PR1 review comments — EKS-only Lambda + Terraform cleanup
Apr 14, 2026
dff9bfa
docs: clarify cross-account architecture + fix stale refs
Apr 14, 2026
e6547ed
docs: add ECA demo script with talking points and Q&A prep
Apr 14, 2026
ff2a6b5
fix(lambda): make EKS fields required; remove is_eks_deployment dead …
Apr 21, 2026
f37b6c6
fix(sc-template): remove AwsRegion/AWSAccountId as user-facing parame…
Apr 21, 2026
237ab9b
fix(deploy): add eks-repo-creator buildspec; fix partition refs in IA…
Apr 21, 2026
8b268ff
chore: update docs, scripts, and state to reflect current architecture
Apr 21, 2026
209 changes: 209 additions & 0 deletions .github/copilot-instructions.md
@@ -0,0 +1,209 @@
# GitHub Copilot Instructions — lambda-template-repo-generator

## Project Purpose

This repository contains the Lambda function that powers the EKS Cluster Automation (ECA) system.
When a team provisions the "EKS Terragrunt Repo" product via AWS Service Catalog, this Lambda:

1. Receives a CloudFormation Custom Resource event
2. Fetches a GitHub PAT from Secrets Manager (`ghe-runner/github-token`)
3. Triggers the `eks-terragrunt-repo-creator` CodeBuild project with EKS parameters as env vars
4. Polls CodeBuild every 20 seconds until the build completes or the Lambda deadline approaches
5. Fetches the open PR URL from the GitHub API after a successful build
6. Signals CloudFormation `SUCCESS`/`FAILED`

All actual repo creation runs inside **CodeBuild** via the `terraform-eks-deployment` workspace:
- Clones `template-eks-cluster` via `CSVD/terraform-github-repo` Terraform module
- Writes 8 rendered Terragrunt HCL files via `managed_extra_files`
- Opens a pull request (`repo-init` → `main`)
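
The polling step above (every 20 seconds, stopping when the build finishes or the Lambda deadline approaches) can be sketched roughly as follows. This is a minimal illustration, not the actual `poll_codebuild_build()` from `app.py`: the function name `poll_build`, the `"DEADLINE"` sentinel, and the 30-second safety margin are assumptions made for the example. The status source and clock are injected so the loop can be exercised without AWS access.

```python
import time
from typing import Callable

# Terminal values of CodeBuild's buildStatus field.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "FAULT", "TIMED_OUT", "STOPPED"}


def poll_build(
    get_status: Callable[[], str],
    remaining_ms: Callable[[], int],
    interval_s: float = 20.0,
    safety_margin_ms: int = 30_000,  # assumed margin, not taken from app.py
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the build is terminal or the deadline is too close.

    Returns the terminal status, or "DEADLINE" (a hypothetical sentinel)
    when another sleep would leave less than the safety margin.
    """
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        # Stop early so there is still time to signal CloudFormation FAILED.
        if remaining_ms() - interval_s * 1000 < safety_margin_ms:
            return "DEADLINE"
        sleep(interval_s)


# In a real handler the callables might be wired up like this (assumption):
#   get_status   = lambda: codebuild.batch_get_builds(ids=[build_id])["builds"][0]["buildStatus"]
#   remaining_ms = context.get_remaining_time_in_millis
```

Returning a sentinel instead of raising keeps the caller's job simple: anything other than `SUCCEEDED` maps to a CloudFormation `FAILED` response.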

---

## Architecture: Lambda as Thin Orchestrator over CodeBuild + Terraform

```
SC Console (user fills form)
→ CFN Stack creates Custom::GitHubRepository resource
→ CFN calls Lambda (eks-terragrunt-repo-gen-template-automation) via ServiceToken
→ Lambda fetches PAT from Secrets Manager (ghe-runner/github-token)
→ Lambda starts CodeBuild project (eks-terragrunt-repo-creator) with TF_VAR_* env overrides
→ CodeBuild clones terraform-eks-deployment repo from GHE
→ CodeBuild runs: terraform init + terraform apply -auto-approve
→ Terraform (CSVD/terraform-github-repo module) creates GHE repo + writes HCL files + opens PR
→ Lambda polls CodeBuild, then fetches PR URL from GitHub API
→ Lambda sends cfn-response SUCCESS with repository_url + pull_request_url
→ CFN stack transitions to CREATE_COMPLETE
→ SC provisioned product shows as AVAILABLE
```
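
As a rough illustration of the "`TF_VAR_*` env overrides" step in the flow above, the sketch below turns a dictionary of EKS parameters into the list shape that CodeBuild's `environmentVariablesOverride` expects. The helper name `build_env_overrides` is hypothetical, and the use of the `PLAINTEXT` type for every entry is an assumption; the real `start_codebuild_build()` may differ.

```python
from typing import Dict, List


def build_env_overrides(params: Dict[str, str], github_token: str) -> List[dict]:
    """Map EKS parameters to CodeBuild environment variable overrides.

    Each parameter becomes TF_VAR_<name> so Terraform picks it up as an
    input variable; the PAT is passed as GITHUB_TOKEN for the GitHub provider.
    """
    overrides = [
        # "PLAINTEXT" is an assumption; CodeBuild also supports other types.
        {"name": f"TF_VAR_{key}", "value": str(value), "type": "PLAINTEXT"}
        for key, value in sorted(params.items())
    ]
    overrides.append(
        {"name": "GITHUB_TOKEN", "value": github_token, "type": "PLAINTEXT"}
    )
    return overrides


# Possible usage with boto3 (not executed here):
#   codebuild.start_build(
#       projectName="eks-terragrunt-repo-creator",
#       environmentVariablesOverride=build_env_overrides(params, pat),
#   )
```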

### CodeBuild Projects

There are **two** CodeBuild projects — do not confuse them:

| Project | Purpose |
|---------|--------|
| `eks-terragrunt-repo-generator-builder` | Builds the Lambda container image (packer + Docker → ECR) |
| `eks-terragrunt-repo-creator` | Creates EKS cluster repos (tf init + tf apply inside terraform-eks-deployment) |

The Lambda triggers **`eks-terragrunt-repo-creator`** at runtime. The **`eks-terragrunt-repo-generator-builder`** is triggered manually via `packer-pipeline` when the Lambda code changes.

---

## Key Files

| File | Purpose |
|------|--------|
| `template_automation/app.py` | Lambda entry point; CFN Custom Resource handler; `start_codebuild_build()` + `poll_codebuild_build()` |
| `template_automation/eks_config.py` | Pydantic models + `is_eks_deployment` check |
| `service-catalog/product-template.yaml` | CFN template for the SC product (canonical source) |
| `deploy/main.tf` | Terraform: Lambda, CodeBuild project, SC portfolio/product, IAM |
| `deploy/variables.tf` | Input variables including `codebuild_project_name`, `codebuild_role_arn` |
| `csvd_config_packer.hcl` | packer-pipeline config for building the Lambda container image |

The HCL rendering, repo creation, and PR opening logic lives in **`terraform-eks-deployment`**, not here.

---

## Service Catalog Integration

The Service Catalog product is defined by `service-catalog/product-template.yaml`.

## SC Product Deployment Methods

There are **two ways** to deploy the Service Catalog product. Both use the same
`service-catalog/product-template.yaml` CFN template — they must stay in sync.

### Method 1: Direct Terraform via `deploy/` (canonical, use for testing/debugging)

```bash
cd lambda-template-repo-generator/deploy
tf init
tf apply
```

Deploys the Lambda + CodeBuild project + SC portfolio/product + constraints directly.
Use this as the **reference deployment** when debugging issues with the census pipeline.
IDs after last apply: portfolio `port-h5qd63hw5yagq`, product `prod-lmua4oknugafg`.

### Method 2: `terraform-service-catalog-census` Terragrunt (production path)

```bash
cd terraform-service-catalog-census/non-prod/csvd-dev/west/service-catalog
tf apply # (via terragrunt)
```

Census-managed production deployment path. The live CFN template lives at:
`terraform-service-catalog-census/templates/products/eks-terragrunt-repo/2-0-0.yaml`

Both `service-catalog/product-template.yaml` here and `2-0-0.yaml` in census must stay in sync
(same parameters, same Lambda property names).

---

## Lambda Runtime Details

- **Function name**: `eks-terragrunt-repo-gen-template-automation`
- **Account**: `229685449397` (csvd-dev-gov, `us-gov-west-1`)
- **Timeout**: 900s (15 min) — must exceed CodeBuild poll window
- **ServiceToken**: `arn:aws-us-gov:lambda:${AWS::Region}:${AWS::AccountId}:function:eks-terragrunt-repo-gen-template-automation`
- **GitHub Enterprise**: `https://github.e.it.census.gov`, org `SCT-Engineering`

### Key environment variables

| Variable | Value | Purpose |
|----------|-------|---------|
| `VERIFY_SSL` | `false` | Census CA cert not in the container's `certifi` bundle |
| `GITHUB_TOKEN_SECRET_NAME` | `/eks-cluster-deployment/github_token` | App installation token (`ghs_`) — used by Lambda for Python GitHub API calls |
| `TF_GITHUB_TOKEN_SECRET_NAME` | `ghe-runner/github-token` | PAT (`ghp_`) — passed to CodeBuild as `GITHUB_TOKEN` for the Terraform GitHub provider |
| `CODEBUILD_PROJECT_NAME` | `eks-terragrunt-repo-creator` | CodeBuild project to trigger |
| `GITHUB_API` | `https://github.e.it.census.gov` | GHE API base URL |
| `GITHUB_ORG_NAME` | `SCT-Engineering` | Target GitHub org |

### Why two GitHub tokens?

- `GITHUB_TOKEN_SECRET_NAME` holds a **GitHub App installation token** (`ghs_` prefix). It can perform
org-level API calls but **cannot** access `/api/v3/user`, which the CSVD Terraform module requires.
- `TF_GITHUB_TOKEN_SECRET_NAME` holds a **personal access token** (`ghp_` prefix, user `arnol377`).
This is passed to CodeBuild and used by the Terraform GitHub provider.
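
A minimal sketch of how the two secrets might be told apart in code. The fallback order (prefer `TF_GITHUB_TOKEN_SECRET_NAME`, otherwise `GITHUB_TOKEN_SECRET_NAME`) follows the commit message for `5d3ff19`; the function names and the returned labels are hypothetical and exist only for illustration.

```python
from typing import Mapping


def codebuild_token_secret_name(env: Mapping[str, str]) -> str:
    """Pick the secret whose token is handed to CodeBuild.

    Prefers the PAT secret; falls back to the App token secret if unset.
    """
    return env.get("TF_GITHUB_TOKEN_SECRET_NAME") or env["GITHUB_TOKEN_SECRET_NAME"]


def token_kind(token: str) -> str:
    """Classify a GitHub token by its prefix (labels are illustrative)."""
    if token.startswith("ghs_"):
        return "app-installation"  # cannot call /api/v3/user
    if token.startswith("ghp_"):
        return "personal-access"   # usable by the Terraform GitHub provider
    return "unknown"
```

A prefix check like `token_kind` could be used to fail fast with a clear message before starting a build that would otherwise fail later inside Terraform.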

### EKS mode is triggered when all these fields are present in the event:
- `cluster_name`
- `account_name`
- `aws_account_id`
- `vpc_name`
- `vpc_domain_name`

If any of these is missing, the Lambda falls back to **generic mode** (writes only `config.json`).
**Do not pass `vpc_id`** — the Lambda model field is `vpc_name` (a string).
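
The field check described above could look roughly like this. It is a plain-Python sketch, not the Pydantic-based `is_eks_deployment` check in `eks_config.py`; the names `EKS_REQUIRED_FIELDS`, `missing_eks_fields`, and `is_eks_event` are hypothetical, and treating empty strings as missing is an assumption.

```python
from typing import Any, List, Mapping

# Fields that must all be present for EKS mode, as listed above.
EKS_REQUIRED_FIELDS = (
    "cluster_name",
    "account_name",
    "aws_account_id",
    "vpc_name",
    "vpc_domain_name",
)


def missing_eks_fields(properties: Mapping[str, Any]) -> List[str]:
    """Return required EKS fields that are absent or empty (assumption)."""
    return [name for name in EKS_REQUIRED_FIELDS if not properties.get(name)]


def is_eks_event(properties: Mapping[str, Any]) -> bool:
    """True only when every required EKS field is present."""
    return not missing_eks_fields(properties)
```

Reporting the missing names, rather than only a boolean, makes it easier to spot a `vpc_id` passed in place of `vpc_name`.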

---

## Parameter Naming Convention

The CFN product template passes parameters in `snake_case` directly to the Lambda.
The Lambda has a PascalCase→snake_case normalizer but it mishandles acronyms
(`AWSAccountId` → `a_w_s_account_id` instead of `aws_account_id`). Always pass
snake_case directly in the CFN `Properties` block:

```yaml
Properties:
  ServiceToken: !Sub "arn:aws-us-gov:lambda:..."
  project_name: !Ref ProjectName # ← snake_case, not ProjectName
  aws_account_id: !Ref AWSAccountId # ← snake_case, not AWSAccountId
  vpc_name: !Ref VpcName # ← vpc_name, NOT vpc_id
```
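
To make the acronym problem concrete, the sketch below contrasts one common naive conversion, which reproduces the `a_w_s_account_id` output described above, with an acronym-aware variant. Neither function is the Lambda's actual normalizer; both names are hypothetical and the regexes are only one way to get these results.

```python
import re


def naive_snake(name: str) -> str:
    """Insert '_' before every capital letter except the first."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def acronym_aware_snake(name: str) -> str:
    """Keep runs of capitals together before splitting words."""
    # Split an acronym from the capitalized word after it: AWSAccount -> AWS_Account
    step1 = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", name)
    # Split a lowercase letter or digit from the capital after it: AccountId -> Account_Id
    step2 = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", step1)
    return step2.lower()


print(naive_snake("AWSAccountId"))          # a_w_s_account_id
print(acronym_aware_snake("AWSAccountId"))  # aws_account_id
```

Passing snake_case keys directly in the CFN `Properties` block, as recommended above, avoids depending on either conversion.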

---

## Rebuilding the Lambda Image

When `template_automation/app.py` or other Lambda source files change:

```bash
# 1. Zip source and upload to S3
cd lambda-template-repo-generator
zip -r ~/tmp/lambda-source.zip . -x "*.git*" -x "design-docs/*" -x "__pycache__/*" -x "*.pyc" -x "deploy/.terraform/*" -x "deploy/terraform.tfstate*"
UUID=$(python3 -c "import uuid; print(uuid.uuid4())")
source ~/aws-creds
aws s3 cp ~/tmp/lambda-source.zip \
  "s3://csvd-packer-pipeline-builds/packer-builds/eks-terragrunt-repo-generator/source/${UUID}/repo.zip" \
  --region us-gov-west-1

# 2. Start the packer CodeBuild build
aws codebuild start-build \
  --project-name eks-terragrunt-repo-generator-builder \
  --region us-gov-west-1 \
  --source-type-override S3 \
  --source-location-override "csvd-packer-pipeline-builds/packer-builds/eks-terragrunt-repo-generator/source/${UUID}/repo.zip"

# 3. After build SUCCEEDED, force Lambda to pull the new image
aws lambda update-function-code \
  --function-name eks-terragrunt-repo-gen-template-automation \
  --image-uri "229685449397.dkr.ecr.us-gov-west-1.amazonaws.com/eks-terragrunt-repo-generator/lambda:latest" \
  --region us-gov-west-1
```

## Testing

```bash
# End-to-end Service Catalog test (provisions + verifies + terminates)
source ~/aws-creds
cd lambda-template-repo-generator
python scripts/test_service_catalog.py sc-e2e-test-$(date +%Y%m%d-%H%M)

# Clean up leftover test repos
python scripts/cleanup_test_repos.py
```

---

## What NOT to Do

- ❌ Do not rewrite repo creation logic in Lambda Python — all repo creation runs in CodeBuild via `terraform-eks-deployment`
- ❌ Do not use `HappyPathway/terraform-github-repo` **public** module — it pins `github ~> 6.0` (conflicts with internal `>= 6.6.0`)
- ✅ DO use `CSVD/terraform-github-repo` (https://github.e.it.census.gov/CSVD/terraform-github-repo) — internal module, supports `template_repo` + `managed_extra_files`
- ❌ Do not pass `vpc_id` to the Lambda — use `vpc_name`
- ❌ Do not re-add `LambdaFunctionArn` as a CFN parameter — use `!Sub "arn:..."` directly
- ❌ Do not use SSH-based module sources (`git::ssh://`) — Census proxy blocks SSH host key exchange; use HTTPS
- ❌ Do not write temp files or command output to `/tmp` — use `~/tmp` (i.e. `/home/a/arnol377/tmp`) instead
- ❌ Do not use the `terraform` command directly — always use the `tf` alias (e.g. `tf plan`, `tf apply`, `tf init`)
7 changes: 4 additions & 3 deletions csvd_config_packer.hcl
@@ -17,8 +17,8 @@ packer_pipeline {
tools = [
{
name = "packer"
- version = "1.13.0"
- zip_path = "packer_1.13.0_linux_amd64.zip"
+ version = "1.10.3"
+ zip_path = "packer_1.10.3_linux_amd64.zip"
binary_name = "packer"
install_path = "/usr/local/bin"
}
@@ -29,7 +29,8 @@ packer_pipeline {
partition = "aws-us-gov" // AWS partition (aws or aws-us-gov)

// Role management
- create_role = true // Enable automatic role creation
+ create_role = false // Role already exists; provide ARN directly

Review comment: Wouldn't we always want to create the role?

+ codebuild_role_arn = "arn:aws-us-gov:iam::229685449397:role/CodeBuildPackerRole-eks-terragrunt-repo-generator-builder"

Review comment: This should be looked up by name, partition, and account ID.


// Region and partition configuration
aws_region = "us-gov-west-1" // AWS region
78 changes: 78 additions & 0 deletions deploy/.terraform_commits
@@ -82,5 +82,83 @@
"commit_message": "pushing latest code",
"author": "Your Name",
"timestamp": "2026-02-11T17:09:42.508401"
},
{
"commit_hash": "528f4b3c9d142dc7b5b4cd3e9f7ce00aa98352ca",
"commit_message": "fix: VERIFY_SSL=false; public repo visibility; add ec2:DescribeVpcs to SC launch role\n\n- VERIFY_SSL was incorrectly set to 'true' (Census CA cert not in certifi)\n- repo_visibility changed from 'internal' to 'public' per ECA requirements\n- Added EC2DescribeVpcs permission to SC launch role IAM policy",
"author": "Your Name",
"timestamp": "2026-04-06T12:12:58.619384"
},
{
"commit_hash": "528f4b3c9d142dc7b5b4cd3e9f7ce00aa98352ca",
"commit_message": "fix: VERIFY_SSL=false; public repo visibility; add ec2:DescribeVpcs to SC launch role\n\n- VERIFY_SSL was incorrectly set to 'true' (Census CA cert not in certifi)\n- repo_visibility changed from 'internal' to 'public' per ECA requirements\n- Added EC2DescribeVpcs permission to SC launch role IAM policy",
"author": "Your Name",
"timestamp": "2026-04-06T12:18:21.814330"
},
{
"commit_hash": "ec54b54a1c66f0ed6fa814ceda538f18e8453284",
"commit_message": "feat: Lambda delegates EKS repos to CodeBuild + terraform-eks-deployment\n\n- app.py: add start_codebuild_build() and poll_codebuild_build() helpers\n- app.py: EKS deployment path (is_eks_deployment=True) now starts CodeBuild\n project 'eks-terragrunt-repo-creator', polls until SUCCEEDED/FAILED,\n and sends cfn-response accordingly; non-EKS path unchanged\n- deploy/main.tf: add aws_codebuild_project.eks_repo_creator resource\n (NO_SOURCE, uses buildspec.yml from terraform-eks-deployment)\n CODEBUILD_PROJECT_NAME injected into Lambda environment\n- deploy/variables.tf: codebuild_project_name, codebuild_role_arn, codebuild_vpc_id\n- deploy/terraform.tfvars: set CodeBuild project name, role ARN, VPC ID",
"author": "Your Name",
"timestamp": "2026-04-06T13:55:14.843964"
},
{
"commit_hash": "52ebef0541aa8bac0dc9fab41e4e4be4a0ebbbbe",
"commit_message": "chore: tf apply \u2014 add eks-terragrunt-repo-creator CodeBuild project + Lambda CODEBUILD_PROJECT_NAME env var",
"author": "Your Name",
"timestamp": "2026-04-06T14:07:45.300705"
},
{
"commit_hash": "52ebef0541aa8bac0dc9fab41e4e4be4a0ebbbbe",
"commit_message": "chore: tf apply \u2014 add eks-terragrunt-repo-creator CodeBuild project + Lambda CODEBUILD_PROJECT_NAME env var",
"author": "Your Name",
"timestamp": "2026-04-06T14:08:05.836742"
},
{
"commit_hash": "8310ee1b5d65d5b112d891a7eb987ac0856ba9f3",
"commit_message": "fix: increase Lambda timeout to 900s to cover CodeBuild poll window\n\nLambda was set to 300s but poll_codebuild_build loops for up to 12 min (720s).\nLambda would be killed by AWS before it could report back to CloudFormation.\n900s gives a ~180s buffer beyond the poll window.",
"author": "Your Name",
"timestamp": "2026-04-06T14:32:04.632013"
},
{
"commit_hash": "8310ee1b5d65d5b112d891a7eb987ac0856ba9f3",
"commit_message": "fix: increase Lambda timeout to 900s to cover CodeBuild poll window\n\nLambda was set to 300s but poll_codebuild_build loops for up to 12 min (720s).\nLambda would be killed by AWS before it could report back to CloudFormation.\n900s gives a ~180s buffer beyond the poll window.",
"author": "Your Name",
"timestamp": "2026-04-07T12:07:10.663787"
},
{
"commit_hash": "eb184634fcc11c9d9146d06e401b7fcd04cde322",
"commit_message": "fix: remove spurious '- ' prefix from additional_post_build_commands\n\nThe packer-pipeline internal buildspec template already wraps the value\nin '- {{ additional_post_build_commands }}', so prefixing the value with\n'- ' caused YAML_FILE_ERROR (nested list) in CodeBuild build #8.",
"author": "Your Name",
"timestamp": "2026-04-07T12:36:02.814421"
},
{
"commit_hash": "eb184634fcc11c9d9146d06e401b7fcd04cde322",
"commit_message": "fix: remove spurious '- ' prefix from additional_post_build_commands\n\nThe packer-pipeline internal buildspec template already wraps the value\nin '- {{ additional_post_build_commands }}', so prefixing the value with\n'- ' caused YAML_FILE_ERROR (nested list) in CodeBuild build #8.",
"author": "Your Name",
"timestamp": "2026-04-07T12:39:29.803299"
},
{
"commit_hash": "eb184634fcc11c9d9146d06e401b7fcd04cde322",
"commit_message": "fix: remove spurious '- ' prefix from additional_post_build_commands\n\nThe packer-pipeline internal buildspec template already wraps the value\nin '- {{ additional_post_build_commands }}', so prefixing the value with\n'- ' caused YAML_FILE_ERROR (nested list) in CodeBuild build #8.",
"author": "Your Name",
"timestamp": "2026-04-07T12:39:47.151568"
},
{
"commit_hash": "eb184634fcc11c9d9146d06e401b7fcd04cde322",
"commit_message": "fix: remove spurious '- ' prefix from additional_post_build_commands\n\nThe packer-pipeline internal buildspec template already wraps the value\nin '- {{ additional_post_build_commands }}', so prefixing the value with\n'- ' caused YAML_FILE_ERROR (nested list) in CodeBuild build #8.",
"author": "Your Name",
"timestamp": "2026-04-07T12:56:16.684733"
},
{
"commit_hash": "5d3ff19015b916206a52dc8d591cea529b9d62ce",
"commit_message": "fix: use PAT (ghe-runner/github-token) for Terraform GitHub provider in CodeBuild\n\nThe standard github_token (/eks-cluster-deployment/github_token) is a GitHub\nApp installation token (ghs_ prefix) which cannot access /api/v3/user. This\nendpoint is always called by the CSVD terraform-github-repo module's\ndata.github_user.current resource.\n\nChanges:\n- app.py: check TF_GITHUB_TOKEN_SECRET_NAME env var first for CodeBuild token;\n falls back to GITHUB_TOKEN_SECRET_NAME if not set\n- deploy/main.tf: add TF_GITHUB_TOKEN_SECRET_NAME=ghe-runner/github-token env var\n- deploy/main.tf: add IAM policy granting Lambda access to ghe-runner/github-token",
"author": "Your Name",
"timestamp": "2026-04-07T13:10:02.295504"
},
{
"commit_hash": "5d3ff19015b916206a52dc8d591cea529b9d62ce",
"commit_message": "fix: use PAT (ghe-runner/github-token) for Terraform GitHub provider in CodeBuild\n\nThe standard github_token (/eks-cluster-deployment/github_token) is a GitHub\nApp installation token (ghs_ prefix) which cannot access /api/v3/user. This\nendpoint is always called by the CSVD terraform-github-repo module's\ndata.github_user.current resource.\n\nChanges:\n- app.py: check TF_GITHUB_TOKEN_SECRET_NAME env var first for CodeBuild token;\n falls back to GITHUB_TOKEN_SECRET_NAME if not set\n- deploy/main.tf: add TF_GITHUB_TOKEN_SECRET_NAME=ghe-runner/github-token env var\n- deploy/main.tf: add IAM policy granting Lambda access to ghe-runner/github-token",
"author": "Your Name",
"timestamp": "2026-04-07T13:10:20.067727"
}
]