fix: EKS-only Lambda cleanup + SC template AwsRegion/AWSAccountId removal #1

Merged
merged 20 commits into from
Apr 21, 2026
803168a
fix: use Lambda-only approach for EKS repo creation; add Copilot inst…
Apr 2, 2026
0a74dd7
fix: public visibility by default; add collaborator support for repo …
Apr 2, 2026
528f4b3
fix: VERIFY_SSL=false; public repo visibility; add ec2:DescribeVpcs t…
Apr 2, 2026
a79cee4
feat: path_mapper for dynamic EKS repo structure (safe revert baseline)
Apr 6, 2026
ec54b54
feat: Lambda delegates EKS repos to CodeBuild + terraform-eks-deployment
Apr 6, 2026
52ebef0
chore: tf apply — add eks-terragrunt-repo-creator CodeBuild project +…
Apr 6, 2026
aee6987
fix: add CodeBuild VPC endpoint + IAM policy for Lambda→CodeBuild con…
Apr 6, 2026
8310ee1
fix: increase Lambda timeout to 900s to cover CodeBuild poll window
Apr 6, 2026
eb18463
fix: remove spurious '- ' prefix from additional_post_build_commands
Apr 7, 2026
5d3ff19
fix: use PAT (ghe-runner/github-token) for Terraform GitHub provider …
Apr 7, 2026
26c6fe9
fix: add pull_request_url and branch_name to CodeBuild success response
Apr 7, 2026
12a742a
docs: rewrite copilot-instructions to reflect CodeBuild+Terraform arc…
Apr 7, 2026
065d2f2
chore: update deploy Terraform state after tf apply
Apr 7, 2026
560a5ec
fix: address PR1 review comments — EKS-only Lambda + Terraform cleanup
Apr 14, 2026
dff9bfa
docs: clarify cross-account architecture + fix stale refs
Apr 14, 2026
e6547ed
docs: add ECA demo script with talking points and Q&A prep
Apr 14, 2026
ff2a6b5
fix(lambda): make EKS fields required; remove is_eks_deployment dead …
Apr 21, 2026
f37b6c6
fix(sc-template): remove AwsRegion/AWSAccountId as user-facing parame…
Apr 21, 2026
237ab9b
fix(deploy): add eks-repo-creator buildspec; fix partition refs in IA…
Apr 21, 2026
8b268ff
chore: update docs, scripts, and state to reflect current architecture
Apr 21, 2026
256 changes: 256 additions & 0 deletions .github/copilot-instructions.md
@@ -0,0 +1,256 @@
# GitHub Copilot Instructions — lambda-template-repo-generator

## Project Purpose

This repository contains the Lambda function that powers the EKS Cluster Automation (ECA) system.
When a team provisions the "EKS Terragrunt Repo" product via AWS Service Catalog, this Lambda:

1. Receives a CloudFormation Custom Resource event
2. Fetches a GitHub PAT from Secrets Manager (`ghe-runner/github-token`)
3. Triggers the `eks-terragrunt-repo-creator` CodeBuild project with EKS parameters as env vars
4. Polls CodeBuild every 20 seconds until the build completes or the Lambda deadline approaches
5. Fetches the open PR URL from the GitHub API after a successful build
6. Signals CloudFormation `SUCCESS`/`FAILED`
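
The polling in step 4 can be sketched roughly as follows. This is a minimal illustration, not the code in `template_automation/app.py`: the name `wait_for_build`, the 30-second safety margin, the `"DEADLINE"` label, and the injectable `sleep` hook are all assumptions. `batch_get_builds` is the standard boto3 CodeBuild call, and `remaining_ms` is meant to be something like `context.get_remaining_time_in_millis`.

```python
import time

# Terminal values of CodeBuild's buildStatus field.
TERMINAL = {"SUCCEEDED", "FAILED", "FAULT", "STOPPED", "TIMED_OUT"}


def wait_for_build(codebuild, build_id, remaining_ms, interval=20,
                   margin_s=30, sleep=time.sleep):
    """Poll until the build ends or the Lambda deadline is near.

    codebuild    -- boto3 CodeBuild client (anything with batch_get_builds)
    remaining_ms -- callable returning milliseconds left for this invocation
    Returns the final buildStatus, or "DEADLINE" (assumed label) on timeout.
    """
    while True:
        build = codebuild.batch_get_builds(ids=[build_id])["builds"][0]
        status = build["buildStatus"]
        if status in TERMINAL:
            return status
        # Stop early so there is still time to signal CloudFormation.
        if remaining_ms() / 1000 < interval + margin_s:
            return "DEADLINE"
        sleep(interval)
```

Returning a distinct value on deadline lets the caller send `FAILED` to CloudFormation instead of letting the invocation time out silently.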

All actual repo creation runs inside **CodeBuild** via the `terraform-eks-deployment` workspace:
- Clones `template-eks-cluster` via `CSVD/terraform-github-repo` Terraform module
- Writes 8 rendered Terragrunt HCL files via `managed_extra_files`
- Opens a pull request (`repo-init` → `main`)
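
Looking up that pull request afterwards (step 5 of the Lambda flow) might look like the sketch below. It assumes the standard GitHub REST "list pull requests" endpoint under the GHE `/api/v3` prefix. The function name `open_pr_url` and the injectable `fetch` hook are hypothetical, and TLS verification settings such as `VERIFY_SSL` are deliberately left out.

```python
import json
import urllib.parse
import urllib.request


def open_pr_url(api_base, org, repo, token, head="repo-init", base="main",
                fetch=None):
    """Return the html_url of the first open PR from head into base, or None."""
    query = urllib.parse.urlencode(
        {"state": "open", "head": f"{org}:{head}", "base": base})
    url = f"{api_base}/api/v3/repos/{org}/{repo}/pulls?{query}"

    if fetch is None:  # default: a real HTTP GET against the API
        def fetch(target):
            req = urllib.request.Request(target, headers={
                "Authorization": f"Bearer {token}",
                "Accept": "application/vnd.github+json",
            })
            with urllib.request.urlopen(req) as resp:
                return json.load(resp)

    pulls = fetch(url)
    return pulls[0]["html_url"] if pulls else None
```

Passing `fetch` explicitly keeps the URL construction testable without network access.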

---

## Architecture: Centralized Lambda in csvd-dev, Cross-Account Invocation

The SC product is shared to **multiple accounts** via the portfolio, but all compute runs
**centrally in csvd-dev** (`229685449397`). When a user in any account provisions the product,
CloudFormation invokes the Lambda cross-account using the hardcoded `ServiceToken` ARN.
This works because the Lambda has an `aws_lambda_permission` with `principal_org_id`,
allowing CloudFormation in any org account to invoke it.

The provisioning account only sees a CloudFormation stack with outputs (repo URL, PR URL).
It never needs the Lambda, CodeBuild, ECR image, or Secrets Manager secrets locally.
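
Those stack outputs come from the `Data` field of the custom resource response. A minimal sketch of that response body, following the documented CloudFormation custom resource response format, is shown here. The helper name `build_cfn_response` and the default reason text are assumptions, not the project's actual `cfn_response()`.

```python
import json


def build_cfn_response(event, status, data=None, reason=""):
    """Build the JSON body a custom resource PUTs to event["ResponseURL"]."""
    if status not in ("SUCCESS", "FAILED"):
        raise ValueError(f"invalid status: {status}")
    return json.dumps({
        "Status": status,
        "Reason": reason or "See CloudWatch Logs for details",
        # Reuse the existing physical ID on Update/Delete events.
        "PhysicalResourceId": event.get("PhysicalResourceId")
        or event["LogicalResourceId"],
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        # Keys here become Fn::GetAtt attributes in the template.
        "Data": data or {},
    })
```

For example, `data={"repository_url": ..., "pull_request_url": ...}` would expose both values to the provisioning account's stack.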

```
Any Account (SC portfolio shared via OU):
  SC Console (user fills form)
    → CFN Stack creates Custom::GitHubRepository resource
csvd-dev (229685449397) — all compute here:
    → CFN calls Lambda (eks-terragrunt-repo-gen-template-automation) cross-account via ServiceToken
    → Lambda fetches PAT from Secrets Manager (ghe-runner/github-token)
    → Lambda starts CodeBuild project (eks-terragrunt-repo-creator) with TF_VAR_* env overrides
    → CodeBuild clones terraform-eks-deployment repo from GHE
    → CodeBuild runs: terraform init + terraform apply -auto-approve
    → Terraform (CSVD/terraform-github-repo module) creates GHE repo + writes HCL files + opens PR
    → Lambda polls CodeBuild, then fetches PR URL from GitHub API
    → Lambda sends cfn-response SUCCESS with repository_url + pull_request_url
Any Account:
    → CFN stack transitions to CREATE_COMPLETE
    → SC provisioned product shows as AVAILABLE
```
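
The "starts CodeBuild project with `TF_VAR_*` env overrides" step in the flow above could be sketched like this. `start_build` and `environmentVariablesOverride` are real boto3 CodeBuild parameters, while the helper name `start_repo_build` and the choice to pass every field as `PLAINTEXT` are assumptions for illustration. How the real Lambda passes the GitHub token is not shown here.

```python
def start_repo_build(codebuild, project_name, tf_vars):
    """Start a CodeBuild build with each field exposed as TF_VAR_<name>.

    codebuild -- boto3 CodeBuild client (anything with start_build)
    tf_vars   -- mapping such as {"cluster_name": "...", "vpc_name": "..."}
    Returns the build ID to hand to the polling loop.
    """
    overrides = [
        {"name": f"TF_VAR_{key}", "value": str(value), "type": "PLAINTEXT"}
        for key, value in sorted(tf_vars.items())
    ]
    response = codebuild.start_build(
        projectName=project_name,
        environmentVariablesOverride=overrides,
    )
    return response["build"]["id"]
```

Terraform reads any `TF_VAR_<name>` environment variable as the value of variable `<name>`, so no `-var` flags are needed in the buildspec.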

**Why centralized?** The Lambda only interacts with GitHub Enterprise and CodeBuild — it
makes no AWS API calls in the provisioner's account. Deploying per-account would add
complexity (ECR replication, per-account CodeBuild, per-account Secrets Manager) with no benefit.

### CodeBuild Projects

There are **two** CodeBuild projects — do not confuse them:

| Project | Purpose |
|---------|--------|
| `eks-terragrunt-repo-generator-builder` | Builds the Lambda container image (packer + Docker → ECR) |
| `eks-terragrunt-repo-creator` | Creates EKS cluster repos (tf init + tf apply inside terraform-eks-deployment) |

The Lambda triggers **`eks-terragrunt-repo-creator`** at runtime. The **`eks-terragrunt-repo-generator-builder`** is triggered manually via `packer-pipeline` when the Lambda code changes.

---

## Key Files

| File | Purpose |
|------|--------|
| `template_automation/app.py` | Lambda entry point; CFN Custom Resource handler; `start_codebuild_build()` + `poll_codebuild_build()` |
| `service-catalog/product-template.yaml` | CFN template for the SC product (canonical source) |
| `deploy/main.tf` | Terraform: Lambda, CodeBuild project, SC portfolio/product, IAM |
| `deploy/variables.tf` | Input variables including `codebuild_project_name`, `codebuild_role_arn` |
| `csvd_config_packer.hcl` | packer-pipeline config for building the Lambda container image |

The HCL rendering, repo creation, and PR opening logic lives in **`terraform-eks-deployment`**, not here.

---

## Service Catalog Integration

The Service Catalog product is defined by `service-catalog/product-template.yaml`.

## SC Product Deployment Methods

There are **two ways** to deploy the Service Catalog product. Both use the same
`service-catalog/product-template.yaml` CFN template — they must stay in sync.

### Method 1: Direct Terraform via `deploy/` (canonical, use for testing/debugging)

```bash
cd lambda-template-repo-generator/deploy
tf init
tf apply
```

Deploys the Lambda + CodeBuild project + SC portfolio/product + constraints directly.
Use this as the **reference deployment** when debugging issues with the census pipeline.
IDs after last apply: portfolio `port-h5qd63hw5yagq`, product `prod-lmua4oknugafg`.

### Method 2: `terraform-service-catalog-census` Terragrunt (production path)

```bash
cd terraform-service-catalog-census/non-prod/csvd-dev/west/service-catalog
tf apply # (via terragrunt)
```

Census-managed production deployment path. The live CFN template lives at:
`terraform-service-catalog-census/templates/products/eks-terragrunt-repo/2-0-0.yaml`

Both `service-catalog/product-template.yaml` here and `2-0-0.yaml` in census must stay in sync
(same parameters, same Lambda property names).

---

## Lambda Runtime Details

- **Function name**: `eks-terragrunt-repo-gen-template-automation`
- **Account**: `229685449397` (csvd-dev-gov, `us-gov-west-1`)
- **Timeout**: 900s (15 min) — must exceed CodeBuild poll window
- **ServiceToken**: `arn:aws-us-gov:lambda:${AWS::Region}:${AWS::AccountId}:function:eks-terragrunt-repo-gen-template-automation`
- **GitHub Enterprise**: `https://github.e.it.census.gov`, org `SCT-Engineering`

### Key environment variables

| Variable | Value | Purpose |
|----------|-------|---------|
| `VERIFY_SSL` | `false` | Census CA cert not in the container's `certifi` bundle |
| `GITHUB_TOKEN_SECRET_NAME` | `/eks-cluster-deployment/github_token` | App installation token (`ghs_`) — used by Lambda for Python GitHub API calls |
| `TF_GITHUB_TOKEN_SECRET_NAME` | `ghe-runner/github-token` | PAT (`ghp_`) — passed to CodeBuild as `GITHUB_TOKEN` for the Terraform GitHub provider |
| `CODEBUILD_PROJECT_NAME` | `eks-terragrunt-repo-creator` | CodeBuild project to trigger |
| `GITHUB_API` | `https://github.e.it.census.gov` | GHE API base URL |
| `GITHUB_ORG_NAME` | `SCT-Engineering` | Target GitHub org |

### Why two GitHub tokens?

- `GITHUB_TOKEN_SECRET_NAME` holds a **GitHub App installation token** (`ghs_` prefix). It can perform
org-level API calls but **cannot** access `/api/v3/user`, which the CSVD Terraform module requires.
- `TF_GITHUB_TOKEN_SECRET_NAME` holds a **personal access token** (`ghp_` prefix, user `arnol377`).
This is passed to CodeBuild and used by the Terraform GitHub provider.
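
Because the two token types are distinguishable by prefix, a small guard can catch a swapped secret before Terraform runs. The sketch below is hypothetical (neither `token_kind` nor `require_pat` is known to exist in the Lambda) and relies only on the `ghs_` and `ghp_` prefixes described above.

```python
def token_kind(token):
    """Classify a GitHub token by its prefix."""
    if token.startswith("ghs_"):
        return "app-installation"
    if token.startswith("ghp_"):
        return "personal-access"
    return "unknown"


def require_pat(token):
    """Raise if the token is not a PAT; the Terraform provider needs one."""
    kind = token_kind(token)
    if kind != "personal-access":
        raise ValueError(f"expected a ghp_ personal access token, got {kind}")
    return token
```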

### Required EKS fields in the CFN event:
- `cluster_name`
- `account_name`
- `aws_account_id`
- `vpc_name`
- `vpc_domain_name`

The Lambda is EKS-only — there is no generic fallback mode.
**Do not pass `vpc_id`** — the field is `vpc_name` (a string).
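
The field rules above can be expressed as a small validation step. This is a plain-Python sketch under assumptions: the real Lambda may well use a `pydantic` model instead, and the name `validate_eks_properties` is made up for illustration.

```python
REQUIRED_EKS_FIELDS = (
    "cluster_name",
    "account_name",
    "aws_account_id",
    "vpc_name",
    "vpc_domain_name",
)


def validate_eks_properties(props):
    """Raise ValueError listing every problem in the CFN ResourceProperties."""
    errors = [
        f"missing required field: {name}"
        for name in REQUIRED_EKS_FIELDS
        if not props.get(name)
    ]
    if "vpc_id" in props:
        errors.append("vpc_id is not accepted; pass vpc_name instead")
    if errors:
        raise ValueError("; ".join(errors))
```

Collecting all errors before raising gives the user one complete failure reason in the CloudFormation event instead of one field per attempt.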

---

## Parameter Naming Convention

The CFN product template passes parameters in `snake_case` directly to the Lambda.
The Lambda has a PascalCase→snake_case normalizer but it mishandles acronyms
(`AWSAccountId` → `a_w_s_account_id` instead of `aws_account_id`). Always pass
snake_case directly in the CFN `Properties` block:

```yaml
Properties:
  ServiceToken: !Sub "arn:aws-us-gov:lambda:..."
  project_name: !Ref ProjectName      # ← snake_case, not ProjectName
  aws_account_id: !Ref AWSAccountId   # ← snake_case, not AWSAccountId
  vpc_name: !Ref VpcName              # ← vpc_name, NOT vpc_id
```
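
The acronym problem is easy to reproduce. The sketch below is not the Lambda's actual normalizer, which is not shown here. `naive_snake` is an assumed implementation that exhibits the same behaviour, and `acronym_aware_snake` is one possible fix, included only to show why the naive approach splits `AWS` into single letters.

```python
import re


def naive_snake(name):
    """Insert "_" before every capital letter except the first."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def acronym_aware_snake(name):
    """Split only at lower-to-upper boundaries and at the end of an acronym."""
    pattern = r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])"
    return re.sub(pattern, "_", name).lower()


print(naive_snake("AWSAccountId"))          # a_w_s_account_id
print(acronym_aware_snake("AWSAccountId"))  # aws_account_id
print(naive_snake("VpcName"))               # vpc_name
```

Passing snake_case from the template sidesteps the normalizer entirely, which is why it remains the convention.
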

---

## Rebuilding the Lambda Image

When `template_automation/app.py` or other Lambda source files change, use `packer-pipeline`:

```bash
cd lambda-template-repo-generator
source ~/aws-creds
packer-pipeline --config csvd_config_packer.hcl
```

This handles zipping the source, uploading to S3, and triggering the
`eks-terragrunt-repo-generator-builder` CodeBuild project automatically.

After the build completes (SUCCEEDED), force the Lambda to pull the new image:

```bash
aws lambda update-function-code \
--function-name eks-terragrunt-repo-gen-template-automation \
--image-uri "229685449397.dkr.ecr.us-gov-west-1.amazonaws.com/eks-terragrunt-repo-generator/lambda:latest" \
--region us-gov-west-1
```

## Testing

```bash
# End-to-end Service Catalog test (provisions + verifies + terminates)
source ~/aws-creds
cd lambda-template-repo-generator
python scripts/test_service_catalog.py sc-e2e-test-$(date +%Y%m%d-%H%M)
# Clean up leftover test repos
python scripts/cleanup_test_repos.py
```

---

## Python & CLI Automation Standards

All automation scripts in this project are written in **Python 3**. Use the following libraries
as the standard stack — do not introduce alternatives without good reason:

| Purpose | Library |
|---------|---------|
| Data validation / config models | `pydantic` (v2) |
| Rich terminal output / progress | `rich` |
| CLI argument parsing | `typer` (preferred) or `argparse` |
| AWS API calls | `boto3` |
| YAML config files | `pyyaml` |
| HTTP calls | `httpx` or `requests` |

### `AWS_DEFAULT_REGION` — always required

The account is in `us-gov-west-1`. If `AWS_DEFAULT_REGION` is not set, boto3 and the AWS CLI
either fail for lack of a region or fall back to the region in the local profile, which may be the wrong one.

**Always export it before any AWS CLI or boto3 script:**

```bash
export AWS_DEFAULT_REGION=us-gov-west-1
source ~/aws-creds
```
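
Scripts can also enforce this themselves instead of relying on the shell. A minimal sketch, assuming a hypothetical `require_region` helper called at the top of each script before any boto3 client is created:

```python
import os

EXPECTED_REGION = "us-gov-west-1"


def require_region(env=None):
    """Return the configured region, or raise if it is missing or wrong."""
    env = os.environ if env is None else env
    region = env.get("AWS_DEFAULT_REGION")
    if region != EXPECTED_REGION:
        raise RuntimeError(
            f"AWS_DEFAULT_REGION is {region!r}; "
            f"export AWS_DEFAULT_REGION={EXPECTED_REGION} first"
        )
    return region
```

Failing fast with an explicit message is easier to diagnose than an API call that lands in an unintended region.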

### SC Template Parameters

`aws_account_id` and `aws_region` are **not** on the SC product form — the CFN template
resolves them automatically via `!Sub "${AWS::AccountId}"` and `!Sub "${AWS::Region}"`
before the Lambda is called. Do not add them back as user-facing parameters.

---

## What NOT to Do

- ❌ Do not rewrite repo creation logic in Lambda Python — all repo creation runs in CodeBuild via `terraform-eks-deployment`
- ❌ Do not use `HappyPathway/terraform-github-repo` **public** module — it pins `github ~> 6.0` (conflicts with internal `>= 6.6.0`)
- ✅ DO use `CSVD/terraform-github-repo` (https://github.e.it.census.gov/CSVD/terraform-github-repo) — internal module, supports `template_repo` + `managed_extra_files`
- ❌ Do not pass `vpc_id` to the Lambda — use `vpc_name`
- ❌ Do not re-add `LambdaFunctionArn` as a CFN parameter — use `!Sub "arn:..."` directly
- ❌ Do not re-add `AWSAccountId` or `AwsRegion` as SC product form parameters — use `!Sub` auto-resolution
- ❌ Do not use SSH-based module sources (`git::ssh://`) — Census proxy blocks SSH host key exchange; use HTTPS
- ❌ Do not write temp files or command output to `/tmp` — use `~/tmp` (i.e. `/home/a/arnol377/tmp`) instead
- ❌ Do not use the `terraform` command directly — always use the `tf` alias (e.g. `tf plan`, `tf apply`, `tf init`)
- ❌ Do not run AWS CLI or boto3 without first exporting `AWS_DEFAULT_REGION=us-gov-west-1`
3 changes: 2 additions & 1 deletion CLOUDFORMATION_CUSTOM_RESOURCE_MIGRATION.md
@@ -229,7 +229,8 @@ Sees outputs:
1. **Deploy Lambda** with updated code:
```bash
cd /home/a/arnol377/git/lambda-template-repo-generator
-packer-pipeline build --config config_packer.hcl
+source ~/aws-creds
+packer-pipeline --config csvd_config_packer.hcl
```

2. **Update Infrastructure**:
4 changes: 2 additions & 2 deletions DEPLOYMENT.md
@@ -64,7 +64,7 @@ terraform apply -var-file=varfiles/default.tfvars
cd /path/to/lambda-template-repo-generator

# Build container image via CodeBuild (waits for completion, ~4 minutes)
-packer-pipeline --config config_packer.hcl --wait
+packer-pipeline --config csvd_config_packer.hcl
```

This will:
@@ -259,7 +259,7 @@ When you change Lambda code in `template_automation/`:

```bash
# 1. Build new container
-packer-pipeline --config config_packer.hcl --wait
+packer-pipeline --config csvd_config_packer.hcl

# 2. Update Lambda to new image
aws lambda update-function-code \
70 changes: 70 additions & 0 deletions PR1-REVIEW-PLAN.md
@@ -0,0 +1,70 @@
# PR #1 Review — Implementation Plan

Addresses all comments from morga471 on
https://github.e.it.census.gov/CSVD/lambda-template-repo-generator/pull/1

---

## A. Remove the generic code path

The Lambda was built on top of an older generic repo-creation-from-Python framework
(GitHub/GitLab providers, config.json rendering, template manager). Now that all repo
creation runs through CodeBuild + terraform-eks-deployment, none of this code is
reachable in production. Remove it entirely.

| # | File | Change |
|---|------|--------|
| A1 | `template_automation/app.py` | Gut to ~150 lines: keep only CFN event parsing (`handler`), `start_codebuild_build()`, `poll_codebuild_build()`, post-build PR URL fetch, and `cfn_response()`. Remove all `GitHubProvider`/`GitLabProvider` instantiation, `RepositorySettings`, `MergeRequestSettings`, `FileContent`, `config.json` write, the generic `if not is_eks_deployment` branch, and `to_template_settings()` |
| A2 | `template_automation/app.py` | Remove `CloudFormationResourceInput.to_template_settings()` method; simplify model to only CFN parsing + `is_eks_deployment` check + `to_eks_deployment_config()` |
| A3 | Delete | `template_automation/repository_provider.py` |
| A4 | Delete | `template_automation/github_provider.py` |
| A5 | Delete | `template_automation/gitlab_provider.py` |
| A6 | Delete | `template_automation/github_client.py` |
| A7 | Delete | `template_automation/gitlab_client.py` |
| A8 | Delete | `template_automation/template_manager.py` |
| A9 | Delete | `template_automation/models.py` |
| A10 | `template_automation/requirements.txt` | Remove deps no longer needed (e.g. `pygithub`, `python-gitlab`) |
| A11 | `template_automation/templates/` | Remove `config.json` template if it only served the generic path; remove directory if empty |

---

## B. Terraform infra fixes (from Matt's inline comments)

| # | File | Matt's comment | Change |
|---|------|----------------|--------|
| B1 | `deploy/main.tf` | "Wouldn't we always want to create the role?" | Add `resource "aws_iam_role"` + `aws_iam_role_policy_attachment` to create the CodeBuild execution role in this module; remove dependency on pre-existing role |
| B2 | `deploy/main.tf` | "pass in token secret name" / "Should also pass in the secret name" | Replace hardcoded `"ghe-runner/github-token"` string in Lambda env vars and IAM policy ARN with `var.tf_github_token_secret_name` |
| B3 | `deploy/main.tf` | "look up the partition value with data.aws_caller_identity.current.partition" | Replace `"arn:aws-us-gov:secretsmanager:..."` with `"arn:${data.aws_partition.current.partition}:secretsmanager:..."`. The `aws_caller_identity` data source does not export a `partition` attribute, so declare `data "aws_partition" "current" {}` alongside the existing caller_identity |
| B4 | `deploy/main.tf` | "We shouldn't create VPC endpoints, they should already be in the account we use." | Remove `aws_vpc_endpoint.codebuild` resource entirely |
| B5 | `deploy/main.tf` | (none) | Add `data "aws_subnet"` + `data "aws_security_group"` lookups by name/tag to replace hardcoded IDs passed as variables |
| B6 | `deploy/variables.tf` | "This should be looked up so it can work across accounts." | Remove `codebuild_role_arn` variable (role now created in module per B1); add `tf_github_token_secret_name` variable (default `"ghe-runner/github-token"`) |
| B7 | `deploy/variables.tf` | (none) | Remove `codebuild_vpc_id` variable; add subnet/SG name filter variables to drive data sources (B5) |
| B8 | `deploy/terraform.tfvars` | "These should be looked up or created" | Replace hardcoded `subnet_ids`, `security_group_ids`, `codebuild_vpc_id` IDs with name-based values that feed data source lookups |
| B9 | `csvd_config_packer.hcl` | "This should be looked up by Name, partition, account id" / "These should be looked up or created" | Replace hardcoded `account_number`, `partition`, `codebuild_role_arn`, `vpc_id`, subnet/SG IDs — drive from env vars resolved at build time via `aws sts get-caller-identity` / `aws iam get-role` wrapper |
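
For the build-time wrapper mentioned in B9, the partition can be read from the ARN that `aws sts get-caller-identity` returns, since the second colon-separated field of any ARN is the partition. The sketch below is an illustration only. The function names are hypothetical, and the `-??????` suffix follows the usual wildcard pattern for the six random characters Secrets Manager appends to secret ARNs.

```python
def partition_of(arn):
    """Return the partition ("aws", "aws-us-gov", ...) of any ARN."""
    parts = arn.split(":")
    if len(parts) < 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn}")
    return parts[1]


def secret_arn_pattern(caller_arn, region, account_id, secret_name):
    """Build an IAM resource pattern for a secret without hardcoding the partition."""
    partition = partition_of(caller_arn)
    return (f"arn:{partition}:secretsmanager:{region}:{account_id}"
            f":secret:{secret_name}-??????")
```

In a script, `caller_arn` would come from `sts.get_caller_identity()["Arn"]`, so the same code works in commercial and GovCloud accounts.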

---

## C. CFN template fix

| # | File | Matt's comment | Change |
|---|------|----------------|--------|
| C1 | `service-catalog/product-template.yaml` **and** `terraform-service-catalog-census/templates/products/eks-terragrunt-repo/2-0-0.yaml` | "look up partition" | Change `arn:aws-us-gov:lambda:` → `arn:${AWS::Partition}:lambda:` in the `ServiceToken` |

---

## D. PR response comment

| # | Action |
|---|--------|
| D1 | Reply to Matt's comment on `repository_provider.py`: "Good call — we're removing the entire generic code path (A1–A11 above). The file won't be needed." |

---

## Notes

- A and B are independent; either can be done first.
- C1 must be applied to **both** copies of the template and kept in sync.
- After B1 (create CodeBuild role in Terraform), run `tf apply` in `deploy/` and update
`deploy/terraform.tfstate` before rebuilding the Lambda image.
- After all changes, rebuild the Lambda image (packer CodeBuild build) and force a Lambda
update (`aws lambda update-function-code --image-uri ...`) before running the e2e test.