diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
index e2d39e7..46bc2db 100644
--- a/.github/copilot-instructions.md
+++ b/.github/copilot-instructions.md
@@ -6,76 +6,67 @@ This repository contains the Lambda function that powers the EKS Cluster Automat
 When a team provisions the "EKS Terragrunt Repo" product via AWS Service Catalog, this Lambda:

 1. Receives a CloudFormation Custom Resource event
-2. Creates a GitHub repository in the `SCT-Engineering` org on GitHub Enterprise
-3. Clones `template-eks-cluster` as the starting structure
-4. Renders 8 Terragrunt HCL files from Jinja2 templates (EKS-specific path)
-5. Commits all files atomically via the Git tree API
-6. Opens a pull request (`repo-init` → `main`)
-7. Signals CloudFormation `SUCCESS`/`FAILED`
+2. Fetches a GitHub PAT from Secrets Manager (`ghe-runner/github-token`)
+3. Triggers the `eks-terragrunt-repo-creator` CodeBuild project with EKS parameters as env vars
+4. Polls CodeBuild every 20 seconds until the build completes or the Lambda deadline approaches
+5. Fetches the open PR URL from the GitHub API after a successful build
+6. Signals CloudFormation `SUCCESS`/`FAILED`

----
-
-## Critical Architecture Decision: Lambda, NOT CodeBuild
+All actual repo creation runs inside **CodeBuild** via the `terraform-eks-deployment` workspace:
+- Clones `template-eks-cluster` via `CSVD/terraform-github-repo` Terraform module
+- Writes 8 rendered Terragrunt HCL files via `managed_extra_files`
+- Opens a pull request (`repo-init` → `main`)

-**Do not suggest CodeBuild as the mechanism for creating EKS cluster repositories.**
-
-An earlier approach attempted to provision EKS repos by triggering a CodeBuild project that ran
-`terraform apply` with a GitHub Terraform provider. That approach was **abandoned** due to:
+---

-- SSH host key failures downloading remote Terraform modules
-- AWS credential proxy incompatibility inside CodeBuild build environments
-- S3 backend region mismatches
-- Irreconcilable Terraform provider version conflicts (the `HappyPathway/terraform-github-repo`
-  public module pins `github ~> 6.0` while our internal modules require `>= 6.6.0`)
+## Architecture: Lambda as Thin Orchestrator over CodeBuild + Terraform

-**The correct approach:** CloudFormation Custom Resource → Lambda invocation (direct Python GitHub API).
-No Terraform. No CodeBuild buildspec. No SSH keys. No provider version pinning.
+```
+SC Console (user fills form)
+  → CFN Stack creates Custom::GitHubRepository resource
+  → CFN calls Lambda (eks-terragrunt-repo-gen-template-automation) via ServiceToken
+  → Lambda fetches PAT from Secrets Manager (ghe-runner/github-token)
+  → Lambda starts CodeBuild project (eks-terragrunt-repo-creator) with TF_VAR_* env overrides
+  → CodeBuild clones terraform-eks-deployment repo from GHE
+  → CodeBuild runs: terraform init + terraform apply -auto-approve
+  → Terraform (CSVD/terraform-github-repo module) creates GHE repo + writes HCL files + opens PR
+  → Lambda polls CodeBuild, then fetches PR URL from GitHub API
+  → Lambda sends cfn-response SUCCESS with repository_url + pull_request_url
+  → CFN stack transitions to CREATE_COMPLETE
+  → SC provisioned product shows as AVAILABLE
+```

-### What CodeBuild IS still used for (valid)
+### CodeBuild Projects

-CodeBuild **is** still the correct tool for building the Lambda container image:
+There are **two** CodeBuild projects — do not confuse them:

-```
-packer-pipeline CLI → CodeBuild project (eks-terragrunt-repo-generator-builder)
-  → Packer + Docker build
-  → Push to ECR (229685449397.dkr.ecr.us-gov-west-1.amazonaws.com/eks-terragrunt-repo-generator/lambda)
-  → Lambda function updated via Terraform
-```
+| Project | Purpose |
+|---------|--------|
+| `eks-terragrunt-repo-generator-builder` | Builds the Lambda container image (packer + Docker → ECR) |
+| `eks-terragrunt-repo-creator` | Creates EKS cluster repos (tf init + tf apply inside terraform-eks-deployment) |

-This is the `packer-pipeline` CLI workflow. CodeBuild here is for **CI/CD of the Lambda image itself**,
-not for creating customer repos.
+The Lambda triggers **`eks-terragrunt-repo-creator`** at runtime. The **`eks-terragrunt-repo-generator-builder`** is triggered manually via `packer-pipeline` when the Lambda code changes.

 ---

 ## Key Files

 | File | Purpose |
-|------|---------|
-| `template_automation/app.py` | Lambda entry point; CFN Custom Resource handler |
-| `template_automation/eks_config.py` | Pydantic models + Jinja2 renderer for EKS HCL |
-| `template_automation/github_provider.py` | GitHub API client (Git tree API, PRs, permissions) |
-| `template_automation/templates/eks/` | 8 Jinja2 templates (root.hcl, cluster.hcl, vpc.hcl, etc.) |
+|------|--------|
+| `template_automation/app.py` | Lambda entry point; CFN Custom Resource handler; `start_codebuild_build()` + `poll_codebuild_build()` |
+| `template_automation/eks_config.py` | Pydantic models + `is_eks_deployment` check |
 | `service-catalog/product-template.yaml` | CFN template for the SC product (canonical source) |
-| `deploy/` | Terraform deploying the Lambda infrastructure |
-| `design-docs/README.md` | Architecture overview and implementation status |
+| `deploy/main.tf` | Terraform: Lambda, CodeBuild project, SC portfolio/product, IAM |
+| `deploy/variables.tf` | Input variables including `codebuild_project_name`, `codebuild_role_arn` |
+| `csvd_config_packer.hcl` | packer-pipeline config for building the Lambda container image |
+
+The HCL rendering, repo creation, and PR opening logic lives in **`terraform-eks-deployment`**, not here.

 ---

 ## Service Catalog Integration

-The Service Catalog product is defined by a CloudFormation template
-(`service-catalog/product-template.yaml`). When a user submits the form:
-
-```
-SC Console (user fills form)
-  → CFN Stack creates Custom::GitHubRepository resource
-  → CFN calls Lambda via ServiceToken
-  → Lambda processes CloudFormationResourceInput (Pydantic model)
-  → Lambda creates repo, renders HCL, opens PR
-  → Lambda calls cfn-response SUCCESS
-  → CFN stack transitions to CREATE_COMPLETE
-  → SC provisioned product shows as AVAILABLE
-```
+The Service Catalog product is defined by `service-catalog/product-template.yaml`.

 ## SC Product Deployment Methods

@@ -90,7 +81,7 @@ tf init
 tf apply
 ```

-This deploys the Lambda + SC portfolio + SC product + constraints directly.
+Deploys the Lambda + CodeBuild project + SC portfolio/product + constraints directly.
 Use this as the **reference deployment** when debugging issues with the census pipeline.
 IDs after last apply: portfolio `port-h5qd63hw5yagq`, product `prod-lmua4oknugafg`.

@@ -101,7 +92,7 @@ cd terraform-service-catalog-census/non-prod/csvd-dev/west/service-catalog
 tf apply # (via terragrunt)
 ```

-This is the census-managed production deployment path. The live CFN template lives at:
+Census-managed production deployment path. The live CFN template lives at:
 `terraform-service-catalog-census/templates/products/eks-terragrunt-repo/2-0-0.yaml`

 Both `service-catalog/product-template.yaml` here and `2-0-0.yaml` in census must stay in sync
@@ -109,14 +100,32 @@ Both `service-catalog/product-template.yaml` here and `2-0-0.yaml` in census mus

 ---

-## Lambda Invocation Details
+## Lambda Runtime Details

 - **Function name**: `eks-terragrunt-repo-gen-template-automation`
 - **Account**: `229685449397` (csvd-dev-gov, `us-gov-west-1`)
+- **Timeout**: 900s (15 min) — must exceed CodeBuild poll window
 - **ServiceToken**: `arn:aws-us-gov:lambda:${AWS::Region}:${AWS::AccountId}:function:eks-terragrunt-repo-gen-template-automation`
-- **Runtime env var**: `VERIFY_SSL=false` (Census CA cert is not in the container's `certifi` bundle)
 - **GitHub Enterprise**: `https://github.e.it.census.gov`, org `SCT-Engineering`

+### Key environment variables
+
+| Variable | Value | Purpose |
+|----------|-------|---------|
+| `VERIFY_SSL` | `false` | Census CA cert not in the container's `certifi` bundle |
+| `GITHUB_TOKEN_SECRET_NAME` | `/eks-cluster-deployment/github_token` | App installation token (`ghs_`) — used by Lambda for Python GitHub API calls |
+| `TF_GITHUB_TOKEN_SECRET_NAME` | `ghe-runner/github-token` | PAT (`ghp_`) — passed to CodeBuild as `GITHUB_TOKEN` for the Terraform GitHub provider |
+| `CODEBUILD_PROJECT_NAME` | `eks-terragrunt-repo-creator` | CodeBuild project to trigger |
+| `GITHUB_API` | `https://github.e.it.census.gov` | GHE API base URL |
+| `GITHUB_ORG_NAME` | `SCT-Engineering` | Target GitHub org |
+
+### Why two GitHub tokens?
+
+- `GITHUB_TOKEN_SECRET_NAME` holds a **GitHub App installation token** (`ghs_` prefix). It can perform
+  org-level API calls but **cannot** access `/api/v3/user`, which the CSVD Terraform module requires.
+- `TF_GITHUB_TOKEN_SECRET_NAME` holds a **personal access token** (`ghp_` prefix, user `arnol377`).
+  This is passed to CodeBuild and used by the Terraform GitHub provider.
+
 ### EKS mode is triggered when all these fields are present in the event:
 - `cluster_name`
 - `account_name`
@@ -146,27 +155,55 @@ Properties:

 ---

+## Rebuilding the Lambda Image
+
+When `template_automation/app.py` or other Lambda source files change:
+
+```bash
+# 1. Zip source and upload to S3
+cd lambda-template-repo-generator
+zip -r ~/tmp/lambda-source.zip . -x "*.git*" -x "design-docs/*" -x "__pycache__/*" -x "*.pyc" -x "deploy/.terraform/*" -x "deploy/terraform.tfstate*"
+UUID=$(python3 -c "import uuid; print(uuid.uuid4())")
+source ~/aws-creds
+aws s3 cp ~/tmp/lambda-source.zip \
+  "s3://csvd-packer-pipeline-builds/packer-builds/eks-terragrunt-repo-generator/source/${UUID}/repo.zip" \
+  --region us-gov-west-1
+
+# 2. Start the packer CodeBuild build
+aws codebuild start-build \
+  --project-name eks-terragrunt-repo-generator-builder \
+  --region us-gov-west-1 \
+  --source-type-override S3 \
+  --source-location-override "csvd-packer-pipeline-builds/packer-builds/eks-terragrunt-repo-generator/source/${UUID}/repo.zip"
+
+# 3. After build SUCCEEDED, force Lambda to pull the new image
+aws lambda update-function-code \
+  --function-name eks-terragrunt-repo-gen-template-automation \
+  --image-uri "229685449397.dkr.ecr.us-gov-west-1.amazonaws.com/eks-terragrunt-repo-generator/lambda:latest" \
+  --region us-gov-west-1
+```
+
 ## Testing

 ```bash
-# End-to-end EKS mode test (dry-run)
-python scripts/test_workflow.py --eks --dry-run
+# End-to-end Service Catalog test (provisions + verifies + terminates)
+source ~/aws-creds
+cd lambda-template-repo-generator
+python scripts/test_service_catalog.py sc-e2e-test-$(date +%Y%m%d-%H%M)

-# Clean up test repos
+# Clean up leftover test repos
 python scripts/cleanup_test_repos.py
-
-# Validate GitHub PAT permissions
-python scripts/check_github_permissions.py
 ```

 ---

 ## What NOT to Do

-- ❌ Do not create a `buildspec.yml` for repo creation using the **old** CodeBuild+Terraform approach
+- ❌ Do not rewrite repo creation logic in Lambda Python — all repo creation runs in CodeBuild via `terraform-eks-deployment`
 - ❌ Do not use `HappyPathway/terraform-github-repo` **public** module — it pins `github ~> 6.0` (conflicts with internal `>= 6.6.0`)
-- ✅ DO use `CSVD/terraform-github-repo` (https://github.e.it.census.gov/CSVD/terraform-github-repo) — internal module, uses `github 6.6.0`, supports `template_repo` + `managed_extra_files`
+- ✅ DO use `CSVD/terraform-github-repo` (https://github.e.it.census.gov/CSVD/terraform-github-repo) — internal module, supports `template_repo` + `managed_extra_files`
 - ❌ Do not pass `vpc_id` to the Lambda — use `vpc_name`
 - ❌ Do not re-add `LambdaFunctionArn` as a CFN parameter — use `!Sub "arn:..."` directly
+- ❌ Do not use SSH-based module sources (`git::ssh://`) — Census proxy blocks SSH host key exchange; use HTTPS
 - ❌ Do not write temp files or command output to `/tmp` — use `~/tmp` (i.e. `/home/a/arnol377/tmp`) instead
 - ❌ Do not use the `terraform` command directly — always use the `tf` alias (e.g. `tf plan`, `tf apply`, `tf init`)
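
Note on the polling step: the diff describes the Lambda checking CodeBuild every 20 seconds until the build completes or the Lambda deadline approaches, within a 900s timeout. The sketch below illustrates one way that loop could be structured. It is not the code in `template_automation/app.py`: `poll_until_done`, `TERMINAL_STATUSES`, the `LAMBDA_DEADLINE` sentinel, and the 30-second safety margin are hypothetical names and values chosen for illustration. The status source and the clock are passed in as callables so the loop can be exercised without AWS access.

```python
import time
from typing import Callable

# Build statuses that CodeBuild reports once a build is no longer running.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "FAULT", "STOPPED", "TIMED_OUT"}


def poll_until_done(
    get_status: Callable[[], str],
    remaining_ms: Callable[[], int],
    interval_s: int = 20,             # poll interval named in the diff
    safety_margin_ms: int = 30_000,   # assumed headroom for the cfn-response call
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return the final build status, or "LAMBDA_DEADLINE" if time runs short."""
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        # Stop early when another sleep would eat into the time needed to
        # report back to CloudFormation before the Lambda is killed.
        if remaining_ms() <= safety_margin_ms + interval_s * 1000:
            return "LAMBDA_DEADLINE"
        sleep(interval_s)
```

In a handler, `get_status` could wrap boto3's CodeBuild `batch_get_builds(ids=[build_id])` and read `builds[0]["buildStatus"]`, and `remaining_ms` could be the Lambda context's `get_remaining_time_in_millis`. Returning a sentinel on deadline, rather than raising, lets the caller still send `FAILED` to CloudFormation instead of leaving the stack waiting on a timed-out Lambda.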