Merge pull request #1 from CSVD/fix/eca-lambda-approach-and-copilot-docs
fix: EKS-only Lambda cleanup + SC template AwsRegion/AWSAccountId removal
Showing 36 changed files with 3,016 additions and 5,348 deletions.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,256 @@ | ||
| # GitHub Copilot Instructions — lambda-template-repo-generator | ||
|
|
||
| ## Project Purpose | ||
|
|
||
| This repository contains the Lambda function that powers the EKS Cluster Automation (ECA) system. | ||
| When a team provisions the "EKS Terragrunt Repo" product via AWS Service Catalog, this Lambda: | ||
|
|
||
| 1. Receives a CloudFormation Custom Resource event | ||
| 2. Fetches a GitHub PAT from Secrets Manager (`ghe-runner/github-token`) | ||
| 3. Triggers the `eks-terragrunt-repo-creator` CodeBuild project with EKS parameters as env vars | ||
| 4. Polls CodeBuild every 20 seconds until the build completes or the Lambda deadline approaches | ||
| 5. Fetches the open PR URL from the GitHub API after a successful build | ||
| 6. Signals CloudFormation `SUCCESS`/`FAILED` | ||
|
|
||
| All actual repo creation runs inside **CodeBuild** via the `terraform-eks-deployment` workspace: | ||
| - Clones `template-eks-cluster` via `CSVD/terraform-github-repo` Terraform module | ||
| - Writes 8 rendered Terragrunt HCL files via `managed_extra_files` | ||
| - Opens a pull request (`repo-init` → `main`) | ||
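Step 4 of the Lambda flow above (poll every 20 seconds until the build finishes or the Lambda deadline approaches) can be sketched as follows. This is a minimal illustration, not the actual `poll_codebuild_build()` in `template_automation/app.py`: the name `poll_until_done`, the 30-second safety margin, and the `"DEADLINE"` sentinel are assumptions. In a real handler, `get_status` would presumably wrap boto3's `batch_get_builds` and `remaining_ms` would be the Lambda context's `get_remaining_time_in_millis`.

```python
import time
from typing import Callable

# Terminal values of CodeBuild's buildStatus field.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "FAULT", "STOPPED", "TIMED_OUT"}


def poll_until_done(
    get_status: Callable[[], str],
    remaining_ms: Callable[[], int],
    interval_s: float = 20.0,
    safety_margin_ms: int = 30_000,  # assumed margin, not taken from the source
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return the terminal build status, or "DEADLINE" if time is running out."""
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        # Stop before the next sleep would eat into the time needed
        # to signal CloudFormation.
        if remaining_ms() - interval_s * 1000 < safety_margin_ms:
            return "DEADLINE"
        sleep(interval_s)
```

Passing the status getter and clock in as callables keeps the loop testable without AWS credentials; the caller decides how a `"DEADLINE"` result maps to a CloudFormation `FAILED` signal.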
|
|
||
| --- | ||
|
|
||
| ## Architecture: Centralized Lambda in csvd-dev, Cross-Account Invocation | ||
|
|
||
| The SC product is shared to **multiple accounts** via the portfolio, but all compute runs | ||
| **centrally in csvd-dev** (`229685449397`). When a user in any account provisions the product, | ||
| CloudFormation invokes the Lambda cross-account using the hardcoded `ServiceToken` ARN. | ||
| This works because the Lambda has an `aws_lambda_permission` with `principal_org_id`, | ||
| allowing CloudFormation in any org account to invoke it. | ||
|
|
||
| The provisioning account only sees a CloudFormation stack with outputs (repo URL, PR URL). | ||
| It never needs the Lambda, CodeBuild, ECR image, or Secrets Manager secrets locally. | ||
|
|
||
| ``` | ||
| Any Account (SC portfolio shared via OU): | ||
| SC Console (user fills form) | ||
| → CFN Stack creates Custom::GitHubRepository resource | ||
| csvd-dev (229685449397) — all compute here: | ||
| → CFN calls Lambda (eks-terragrunt-repo-gen-template-automation) cross-account via ServiceToken | ||
| → Lambda fetches PAT from Secrets Manager (ghe-runner/github-token) | ||
| → Lambda starts CodeBuild project (eks-terragrunt-repo-creator) with TF_VAR_* env overrides | ||
| → CodeBuild clones terraform-eks-deployment repo from GHE | ||
| → CodeBuild runs: terraform init + terraform apply -auto-approve | ||
| → Terraform (CSVD/terraform-github-repo module) creates GHE repo + writes HCL files + opens PR | ||
| → Lambda polls CodeBuild, then fetches PR URL from GitHub API | ||
| → Lambda sends cfn-response SUCCESS with repository_url + pull_request_url | ||
| Any Account: | ||
| → CFN stack transitions to CREATE_COMPLETE | ||
| → SC provisioned product shows as AVAILABLE | ||
| ``` | ||
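The final Lambda step in the flow above, sending the cfn-response, amounts to an HTTP PUT of a small JSON document to the pre-signed `ResponseURL` in the event. The sketch below builds only the body, using the standard Custom Resource response fields; the function name `build_cfn_response` and the fallback for `PhysicalResourceId` are assumptions, and the real `cfn_response()` in `app.py` may differ.

```python
import json


def build_cfn_response(event: dict, status: str, data: dict, reason: str = "") -> bytes:
    """Serialize the JSON body that would be PUT to event["ResponseURL"]."""
    body = {
        "Status": status,  # "SUCCESS" or "FAILED"
        "Reason": reason or "See CloudWatch Logs for details",
        # Reuse the existing physical ID on Update/Delete; the fallback to
        # the logical ID on Create is an assumption made for this sketch.
        "PhysicalResourceId": event.get("PhysicalResourceId") or event["LogicalResourceId"],
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        # Keys under Data become available to the stack via Fn::GetAtt.
        "Data": data,
    }
    return json.dumps(body).encode("utf-8")
```

With `data` set to `{"repository_url": ..., "pull_request_url": ...}`, the provisioning account's stack can surface both values as outputs.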
|
|
||
| **Why centralized?** The Lambda only interacts with GitHub Enterprise and CodeBuild — it | ||
| makes no AWS API calls in the provisioner's account. Deploying per-account would add | ||
| complexity (ECR replication, per-account CodeBuild, per-account Secrets Manager) with no benefit. | ||
|
|
||
| ### CodeBuild Projects | ||
|
|
||
| There are **two** CodeBuild projects — do not confuse them: | ||
|
|
||
| | Project | Purpose | | ||
| |---------|--------| | ||
| | `eks-terragrunt-repo-generator-builder` | Builds the Lambda container image (packer + Docker → ECR) | | ||
| | `eks-terragrunt-repo-creator` | Creates EKS cluster repos (tf init + tf apply inside terraform-eks-deployment) | | ||
|
|
||
| The Lambda triggers **`eks-terragrunt-repo-creator`** at runtime. The **`eks-terragrunt-repo-generator-builder`** project is triggered manually via `packer-pipeline` when the Lambda code changes. | |
|
|
||
| --- | ||
|
|
||
| ## Key Files | ||
|
|
||
| | File | Purpose | | ||
| |------|--------| | ||
| | `template_automation/app.py` | Lambda entry point; CFN Custom Resource handler; `start_codebuild_build()` + `poll_codebuild_build()` | | ||
| | `service-catalog/product-template.yaml` | CFN template for the SC product (canonical source) | | ||
| | `deploy/main.tf` | Terraform: Lambda, CodeBuild project, SC portfolio/product, IAM | | ||
| | `deploy/variables.tf` | Input variables including `codebuild_project_name`, `codebuild_role_arn` | | ||
| | `csvd_config_packer.hcl` | packer-pipeline config for building the Lambda container image | | ||
|
|
||
| The HCL rendering, repo creation, and PR opening logic lives in **`terraform-eks-deployment`**, not here. | ||
|
|
||
| --- | ||
|
|
||
| ## Service Catalog Integration | ||
|
|
||
| The Service Catalog product is defined by `service-catalog/product-template.yaml`. | ||
|
|
||
| ## SC Product Deployment Methods | ||
|
|
||
| There are **two ways** to deploy the Service Catalog product. Both use the same | ||
| `service-catalog/product-template.yaml` CFN template — they must stay in sync. | ||
|
|
||
| ### Method 1: Direct Terraform via `deploy/` (canonical, use for testing/debugging) | ||
|
|
||
| ```bash | ||
| cd lambda-template-repo-generator/deploy | ||
| tf init | ||
| tf apply | ||
| ``` | ||
|
|
||
| Deploys the Lambda + CodeBuild project + SC portfolio/product + constraints directly. | ||
| Use this as the **reference deployment** when debugging issues with the census pipeline. | ||
| IDs after last apply: portfolio `port-h5qd63hw5yagq`, product `prod-lmua4oknugafg`. | ||
|
|
||
| ### Method 2: `terraform-service-catalog-census` Terragrunt (production path) | ||
|
|
||
| ```bash | ||
| cd terraform-service-catalog-census/non-prod/csvd-dev/west/service-catalog | ||
| tf apply # (via terragrunt) | ||
| ``` | ||
|
|
||
| Census-managed production deployment path. The live CFN template lives at: | ||
| `terraform-service-catalog-census/templates/products/eks-terragrunt-repo/2-0-0.yaml` | ||
|
|
||
| Both `service-catalog/product-template.yaml` here and `2-0-0.yaml` in census must stay in sync | ||
| (same parameters, same Lambda property names). | ||
|
|
||
| --- | ||
|
|
||
| ## Lambda Runtime Details | ||
|
|
||
| - **Function name**: `eks-terragrunt-repo-gen-template-automation` | ||
| - **Account**: `229685449397` (csvd-dev-gov, `us-gov-west-1`) | ||
| - **Timeout**: 900s (15 min) — must exceed CodeBuild poll window | ||
| - **ServiceToken**: `arn:aws-us-gov:lambda:${AWS::Region}:${AWS::AccountId}:function:eks-terragrunt-repo-gen-template-automation` | ||
| - **GitHub Enterprise**: `https://github.e.it.census.gov`, org `SCT-Engineering` | ||
|
|
||
| ### Key environment variables | ||
|
|
||
| | Variable | Value | Purpose | | ||
| |----------|-------|---------| | ||
| | `VERIFY_SSL` | `false` | Census CA cert not in the container's `certifi` bundle | | ||
| | `GITHUB_TOKEN_SECRET_NAME` | `/eks-cluster-deployment/github_token` | App installation token (`ghs_`) — used by Lambda for Python GitHub API calls | | ||
| | `TF_GITHUB_TOKEN_SECRET_NAME` | `ghe-runner/github-token` | PAT (`ghp_`) — passed to CodeBuild as `GITHUB_TOKEN` for the Terraform GitHub provider | | ||
| | `CODEBUILD_PROJECT_NAME` | `eks-terragrunt-repo-creator` | CodeBuild project to trigger | | ||
| | `GITHUB_API` | `https://github.e.it.census.gov` | GHE API base URL | | ||
| | `GITHUB_ORG_NAME` | `SCT-Engineering` | Target GitHub org | | ||
|
|
||
| ### Why two GitHub tokens? | ||
|
|
||
| - `GITHUB_TOKEN_SECRET_NAME` holds a **GitHub App installation token** (`ghs_` prefix). It can perform | ||
| org-level API calls but **cannot** access `/api/v3/user`, which the CSVD Terraform module requires. | ||
| - `TF_GITHUB_TOKEN_SECRET_NAME` holds a **personal access token** (`ghp_` prefix, user `arnol377`). | ||
| This is passed to CodeBuild and used by the Terraform GitHub provider. | ||
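Because the two secrets are easy to swap by accident, a prefix check before use can catch a misconfiguration early. This is a hypothetical guard, not something the source says the Lambda does; the function names and the purpose labels (`"lambda-api"`, `"terraform"`) are made up for illustration.

```python
def token_kind(token: str) -> str:
    """Classify a GitHub token by its prefix."""
    if token.startswith("ghs_"):
        return "app-installation"
    if token.startswith("ghp_"):
        return "personal-access"
    return "unknown"


# Hypothetical mapping from where a token is used to the kind it must be.
EXPECTED_KIND = {
    "lambda-api": "app-installation",  # GITHUB_TOKEN_SECRET_NAME
    "terraform": "personal-access",    # TF_GITHUB_TOKEN_SECRET_NAME
}


def check_token(purpose: str, token: str) -> None:
    """Raise if the token's prefix does not match what the purpose needs."""
    kind = token_kind(token)
    if kind != EXPECTED_KIND[purpose]:
        raise ValueError(f"{purpose} expects a {EXPECTED_KIND[purpose]} token, got {kind}")
```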
|
|
||
| ### Required EKS fields in the CFN event | |
| - `cluster_name` | ||
| - `account_name` | ||
| - `aws_account_id` | ||
| - `vpc_name` | ||
| - `vpc_domain_name` | ||
|
|
||
| The Lambda is EKS-only — there is no generic fallback mode. | ||
| **Do not pass `vpc_id`** — the field is `vpc_name` (a string). | ||
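The field rules above can be expressed as a small validation step. This is a sketch under assumptions: the function name `validate_eks_properties` is hypothetical, and the source mentions `pydantic` as the preferred validation library, so the real handler may enforce the same rules through a model instead.

```python
REQUIRED_FIELDS = (
    "cluster_name",
    "account_name",
    "aws_account_id",
    "vpc_name",
    "vpc_domain_name",
)


def validate_eks_properties(props: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if not props.get(name)
    ]
    # vpc_id is explicitly rejected; the Lambda expects the VPC's name.
    if "vpc_id" in props:
        errors.append("vpc_id is not accepted; pass vpc_name instead")
    return errors
```

Returning all problems at once, rather than raising on the first, lets the Lambda report a complete reason string in a single `FAILED` response.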
|
|
||
| --- | ||
|
|
||
| ## Parameter Naming Convention | ||
|
|
||
| The CFN product template passes parameters in `snake_case` directly to the Lambda. | ||
| The Lambda has a PascalCase→snake_case normalizer but it mishandles acronyms | ||
| (`AWSAccountId` → `a_w_s_account_id` instead of `aws_account_id`). Always pass | ||
| snake_case directly in the CFN `Properties` block: | ||
|
|
||
| ```yaml | ||
| Properties: | ||
| ServiceToken: !Sub "arn:aws-us-gov:lambda:..." | ||
| project_name: !Ref ProjectName # ← snake_case, not ProjectName | ||
| aws_account_id: !Sub "${AWS::AccountId}" # ← snake_case key, not AWSAccountId; value auto-resolved, not a form parameter | |
| vpc_name: !Ref VpcName # ← vpc_name, NOT vpc_id | ||
| ``` | ||
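The acronym problem described above is easy to reproduce. The sketch below is an assumed reconstruction of the failure mode, not the Lambda's actual normalizer: `naive_snake` splits before every capital letter, while `acronym_aware_snake` (a common two-pass regex approach) keeps runs of capitals together. Both function names are hypothetical.

```python
import re


def naive_snake(name: str) -> str:
    """Insert an underscore before every capital letter except the first."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def acronym_aware_snake(name: str) -> str:
    """Treat a run of capitals as one word (AWSAccountId -> aws_account_id)."""
    # Pass 1: split an acronym from the capitalized word that follows it.
    step1 = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", name)
    # Pass 2: split a lowercase letter or digit from a following capital.
    step2 = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", step1)
    return step2.lower()


print(naive_snake("AWSAccountId"))          # a_w_s_account_id
print(acronym_aware_snake("AWSAccountId"))  # aws_account_id
```

Even with an acronym-aware normalizer available, passing snake_case keys directly in the template avoids depending on it at all.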
| --- | ||
| ## Rebuilding the Lambda Image | ||
| When `template_automation/app.py` or other Lambda source files change, use `packer-pipeline`: | ||
|
|
||
| ```bash | ||
| cd lambda-template-repo-generator | ||
| source ~/aws-creds | ||
| packer-pipeline --config csvd_config_packer.hcl | ||
| ``` | ||
|
|
||
| This handles zipping the source, uploading to S3, and triggering the | ||
| `eks-terragrunt-repo-generator-builder` CodeBuild project automatically. | ||
|
|
||
| After the build completes (SUCCEEDED), force the Lambda to pull the new image: | ||
|
|
||
| ```bash | ||
| aws lambda update-function-code \ | ||
| --function-name eks-terragrunt-repo-gen-template-automation \ | ||
| --image-uri "229685449397.dkr.ecr.us-gov-west-1.amazonaws.com/eks-terragrunt-repo-generator/lambda:latest" \ | ||
| --region us-gov-west-1 | ||
| ``` | ||
|
|
||
| ## Testing | ||
|
|
||
| ```bash | ||
| # End-to-end Service Catalog test (provisions + verifies + terminates) | ||
| source ~/aws-creds | ||
| cd lambda-template-repo-generator | ||
| python scripts/test_service_catalog.py sc-e2e-test-$(date +%Y%m%d-%H%M) | ||
| # Clean up leftover test repos | ||
| python scripts/cleanup_test_repos.py | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Python & CLI Automation Standards | ||
|
|
||
| All automation scripts in this project are written in **Python 3**. Use the following libraries | ||
| as the standard stack — do not introduce alternatives without good reason: | ||
|
|
||
| | Purpose | Library | | ||
| |---------|---------| | ||
| | Data validation / config models | `pydantic` (v2) | | ||
| | Rich terminal output / progress | `rich` | | ||
| | CLI argument parsing | `typer` (preferred) or `argparse` | | ||
| | AWS API calls | `boto3` | | ||
| | YAML config files | `pyyaml` | | ||
| | HTTP calls | `httpx` or `requests` | | ||
|
|
||
| ### `AWS_DEFAULT_REGION` — always required | ||
|
|
||
| The account is in `us-gov-west-1`. Many boto3 calls and the AWS CLI silently fail or | ||
| target the wrong region if `AWS_DEFAULT_REGION` is not set. | ||
|
|
||
| **Always export it before any AWS CLI or boto3 script:** | ||
|
|
||
| ```bash | ||
| export AWS_DEFAULT_REGION=us-gov-west-1 | ||
| source ~/aws-creds | ||
| ``` | ||
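Python scripts can enforce the same rule at startup instead of relying on the shell. A minimal sketch, assuming a hypothetical helper named `require_region` that scripts call before creating any boto3 client:

```python
import os
from typing import Mapping

EXPECTED_REGION = "us-gov-west-1"


def require_region(env: Mapping[str, str] = os.environ) -> str:
    """Fail fast if AWS_DEFAULT_REGION is unset or points elsewhere."""
    region = env.get("AWS_DEFAULT_REGION")
    if region != EXPECTED_REGION:
        raise RuntimeError(
            f"AWS_DEFAULT_REGION must be {EXPECTED_REGION!r}, got {region!r}; "
            "export it before running this script"
        )
    return region
```

Accepting the environment as a parameter keeps the check testable without mutating `os.environ`.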
|
|
||
| ### SC Template Parameters | ||
|
|
||
| `aws_account_id` and `aws_region` are **not** on the SC product form — the CFN template | ||
| resolves them automatically via `!Sub "${AWS::AccountId}"` and `!Sub "${AWS::Region}"` | ||
| before the Lambda is called. Do not add them back as user-facing parameters. | ||
|
|
||
| --- | ||
|
|
||
| ## What NOT to Do | ||
|
|
||
| - ❌ Do not rewrite repo creation logic in Lambda Python — all repo creation runs in CodeBuild via `terraform-eks-deployment` | ||
| - ❌ Do not use `HappyPathway/terraform-github-repo` **public** module — it pins `github ~> 6.0` (conflicts with internal `>= 6.6.0`) | ||
| - ✅ DO use `CSVD/terraform-github-repo` (https://github.e.it.census.gov/CSVD/terraform-github-repo) — internal module, supports `template_repo` + `managed_extra_files` | ||
| - ❌ Do not pass `vpc_id` to the Lambda — use `vpc_name` | ||
| - ❌ Do not re-add `LambdaFunctionArn` as a CFN parameter — use `!Sub "arn:..."` directly | ||
| - ❌ Do not re-add `AWSAccountId` or `AwsRegion` as SC product form parameters — use `!Sub` auto-resolution | ||
| - ❌ Do not use SSH-based module sources (`git::ssh://`) — Census proxy blocks SSH host key exchange; use HTTPS | ||
| - ❌ Do not write temp files or command output to `/tmp` — use `~/tmp` (i.e. `/home/a/arnol377/tmp`) instead | ||
| - ❌ Do not use the `terraform` command directly — always use the `tf` alias (e.g. `tf plan`, `tf apply`, `tf init`) | ||
| - ❌ Do not run AWS CLI or boto3 without first exporting `AWS_DEFAULT_REGION=us-gov-west-1` |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,70 @@ | ||
| # PR #1 Review — Implementation Plan | ||
|
|
||
| Addresses all comments from morga471 on | ||
| https://github.e.it.census.gov/CSVD/lambda-template-repo-generator/pull/1 | ||
|
|
||
| --- | ||
|
|
||
| ## A. Remove the generic code path | ||
|
|
||
| The Lambda was built on top of an older generic repo-creation-from-Python framework | ||
| (GitHub/GitLab providers, config.json rendering, template manager). Now that all repo | ||
| creation runs through CodeBuild + terraform-eks-deployment, none of this code is | ||
| reachable in production. Remove it entirely. | ||
|
|
||
| | # | File | Change | | ||
| |---|------|--------| | ||
| | A1 | `template_automation/app.py` | Gut to ~150 lines: keep only CFN event parsing (`handler`), `start_codebuild_build()`, `poll_codebuild_build()`, post-build PR URL fetch, and `cfn_response()`. Remove all `GitHubProvider`/`GitLabProvider` instantiation, `RepositorySettings`, `MergeRequestSettings`, `FileContent`, `config.json` write, the generic `if not is_eks_deployment` branch, and `to_template_settings()` | | ||
| | A2 | `template_automation/app.py` | Remove `CloudFormationResourceInput.to_template_settings()` method; simplify model to only CFN parsing + `is_eks_deployment` check + `to_eks_deployment_config()` | | ||
| | A3 | Delete | `template_automation/repository_provider.py` | | ||
| | A4 | Delete | `template_automation/github_provider.py` | | ||
| | A5 | Delete | `template_automation/gitlab_provider.py` | | ||
| | A6 | Delete | `template_automation/github_client.py` | | ||
| | A7 | Delete | `template_automation/gitlab_client.py` | | ||
| | A8 | Delete | `template_automation/template_manager.py` | | ||
| | A9 | Delete | `template_automation/models.py` | | ||
| | A10 | `template_automation/requirements.txt` | Remove deps no longer needed (e.g. `pygithub`, `python-gitlab`) | | ||
| | A11 | `template_automation/templates/` | Remove `config.json` template if it only served the generic path; remove directory if empty | | ||
|
|
||
| --- | ||
|
|
||
| ## B. Terraform infra fixes (from Matt's inline comments) | ||
|
|
||
| | # | File | Matt's comment | Change | | ||
| |---|------|----------------|--------| | ||
| | B1 | `deploy/main.tf` | "Wouldn't we always want to create the role?" | Add `resource "aws_iam_role"` + `aws_iam_role_policy_attachment` to create the CodeBuild execution role in this module; remove dependency on pre-existing role | | ||
| | B2 | `deploy/main.tf` | "pass in token secret name" / "Should also pass in the secret name" | Replace hardcoded `"ghe-runner/github-token"` string in Lambda env vars and IAM policy ARN with `var.tf_github_token_secret_name` | | ||
| B3 | `deploy/main.tf` | "look up the partition value with data.aws_caller_identity.current.partition" | Replace `"arn:aws-us-gov:secretsmanager:..."` with `"arn:${data.aws_partition.current.partition}:secretsmanager:..."`. Note that `aws_caller_identity` exposes no `partition` attribute, so declare `data "aws_partition" "current" {}` alongside the existing caller_identity data source | |
| | B4 | `deploy/main.tf` | "We shouldn't create VPC endpoints, they should already be in the account we use." | Remove `aws_vpc_endpoint.codebuild` resource entirely | | ||
| | B5 | `deploy/main.tf` | — | Add `data "aws_subnet"` + `data "aws_security_group"` lookups by name/tag to replace hardcoded IDs passed as variables | | ||
| | B6 | `deploy/variables.tf` | "This should be looked up so it can work across accounts." | Remove `codebuild_role_arn` variable (role now created in module per B1); add `tf_github_token_secret_name` variable (default `"ghe-runner/github-token"`) | | ||
| | B7 | `deploy/variables.tf` | — | Remove `codebuild_vpc_id` variable; add subnet/SG name filter variables to drive data sources (B5) | | ||
| | B8 | `deploy/terraform.tfvars` | "These should be looked up or created" | Replace hardcoded `subnet_ids`, `security_group_ids`, `codebuild_vpc_id` IDs with name-based values that feed data source lookups | | ||
| | B9 | `csvd_config_packer.hcl` | "This should be looked up by Name, partition, account id" / "These should be looked up or created" | Replace hardcoded `account_number`, `partition`, `codebuild_role_arn`, `vpc_id`, subnet/SG IDs — drive from env vars resolved at build time via `aws sts get-caller-identity` / `aws iam get-role` wrapper | | ||
|
|
||
| --- | ||
|
|
||
| ## C. CFN template fix | ||
|
|
||
| | # | File | Matt's comment | Change | | ||
| |---|------|----------------|--------| | ||
| | C1 | `service-catalog/product-template.yaml` **and** `terraform-service-catalog-census/templates/products/eks-terragrunt-repo/2-0-0.yaml` | "look up partition" | Change `arn:aws-us-gov:lambda:` → `arn:${AWS::Partition}:lambda:` in the `ServiceToken` | | ||
|
|
||
| --- | ||
|
|
||
| ## D. PR response comment | ||
|
|
||
| | # | Action | | ||
| |---|--------| | ||
| | D1 | Reply to Matt's comment on `repository_provider.py`: "Good call — we're removing the entire generic code path (A1–A11 above). The file won't be needed." | | ||
|
|
||
| --- | ||
|
|
||
| ## Notes | ||
|
|
||
| - A and B are independent; either can be done first. | ||
| - C1 must be applied to **both** copies of the template and kept in sync. | ||
| - After B1 (create CodeBuild role in Terraform), run `tf apply` in `deploy/` so that | |
| `deploy/terraform.tfstate` reflects the new role before rebuilding the Lambda image. | |
| - After all changes, rebuild the Lambda image (packer CodeBuild build) and force a Lambda | ||
| update (`aws lambda update-function-code --image-uri ...`) before running the e2e test. |