Skip to content

fix: EKS-only Lambda cleanup + SC template AwsRegion/AWSAccountId removal #1

Merged
merged 20 commits into from
Apr 21, 2026

Conversation

arnol377
Copy link
Collaborator

@arnol377 arnol377 commented Apr 2, 2026

Summary

Cleans up the Lambda function and Service Catalog product template following the
EKS-only architecture decision, verified with a passing E2E test today.


Changes

template_automation/app.py — EKS fields now required

  • Removed the is_eks_deployment property and its guard in lambda_handler — the Lambda is EKS-only; the false-branch only raised a ValueError with no alternative path
  • The eight EKS fields (cluster_name, environment, aws_region, account_name, aws_account_id, environment_abbr, vpc_name, vpc_domain_name) are now str = Field(...) — Pydantic surfaces missing fields with precise per-field errors at invocation time
  • Removed redundant or "" fallbacks in start_codebuild_build() — not needed for required fields

service-catalog/product-template.yaml — remove user-facing AwsRegion/AWSAccountId

  • AwsRegion and AWSAccountId dropped as Parameters
  • Both now resolved automatically: aws_region: !Sub "${AWS::Region}" and aws_account_id: !Sub "${AWS::AccountId}"
  • Ensures the provisioner's correct values are used (not csvd-dev hardcoded values)

deploy/buildspec-eks-repo-creator.yml — new file

  • CodeBuild buildspec for eks-terragrunt-repo-creator, versioned here alongside the Terraform that manages it
  • Currently REPO_BRANCH: test_cluster while PR #16 is in review — must be updated to main after PR #16 merges

deploy/main.tf — IAM partition fix + buildspec path update

  • Fixed all IAM ARN constructions to use data.aws_partition.current.partition (caller_identity has no .partition attribute)
  • Updated file() reference to point to the local buildspec

Docs / scripts / config

  • copilot-instructions.md: simplified rebuild steps; added Python/CLI standards; AWS_DEFAULT_REGION reminder; new What NOT to Do entries
  • DEPLOYMENT.md, CLOUDFORMATION_CUSTOM_RESOURCE_MIGRATION.md: corrected packer-pipeline invocation to csvd_config_packer.hcl
  • csvd_config_packer.hcl: removed duplicate stale comment block at EOF
  • docs/DEMO_SCRIPT.md: updated param table and step 3 walkthrough to match corrected SC form
  • scripts/test_service_catalog.py: bump artifact to v2.1; remove AwsRegion/AWSAccountId from DEFAULT_PARAMS
  • docs/SC-TEMPLATE-FIX-PLAN.md: completed fix plan added for reference

Testing

Full end-to-end Service Catalog test passed:

  • Provisioned sc-e2e-test-20260421-1604 → repo created, 1 open PR, correct structure
  • Duration: 237 seconds

Follow-up required (after PR #16 merges in terraform-eks-deployment)

  1. Update deploy/buildspec-eks-repo-creator.yml REPO_BRANCH from test_clustermain
  2. Run tf apply in deploy/ to push the buildspec change to CodeBuild
  3. Sync terraform-service-catalog-census/templates/products/eks-terragrunt-repo/2-0-0.yaml with the SC template changes here

Your Name added 13 commits April 2, 2026 15:24
…ructions

- service-catalog/product-template.yaml: drop LambdaFunctionArn parameter,
- .github/copilot-instructions.md: document Lambda-first approach, explicitly
  record why CodeBuild+Terraform was abandoned (SSH host keys, proxy, provider
  version conflict), clarify CodeBuild is still used for container image CI/CD
…creator

- repository_provider.py: change REPO_VISIBILITY default from 'internal' to
  'public' so created repos are visible to all org members and external
  visitors on GHE
- github_provider.py: change fallback visibility from 'internal' to 'public';
  add add_collaborator() method (PUT /repos/{org}/{repo}/collaborators/{user})
- app.py: add optional creator_username field to CloudFormationResourceInput;
  after team permission grant, also add individual creator as admin collaborator
  when creator_username is provided (non-fatal on failure)
- service-catalog/product-template.yaml: add optional CreatorUsername parameter
  (defaults to empty string, backward-compatible); wire to creator_username
  Lambda property
…o SC launch role

- VERIFY_SSL was incorrectly set to 'true' (Census CA cert not in certifi)
- repo_visibility changed from 'internal' to 'public' per ECA requirements
- Added EC2DescribeVpcs permission to SC launch role IAM policy
- build_eks_path_mapper remaps environment/region/vpc/cluster/ placeholders
  to concrete values from CFN params
- github_provider: clone_repository_contents accepts optional path_mapper
- csvd_config_packer.hcl: create_role=false, packer 1.10.3, role ARN wired
- deploy/service_catalog.tf: SC product and portfolio synced
- Updated copilot instructions with Lambda-only architecture decision
- app.py: add start_codebuild_build() and poll_codebuild_build() helpers
- app.py: EKS deployment path (is_eks_deployment=True) now starts CodeBuild
  project 'eks-terragrunt-repo-creator', polls until SUCCEEDED/FAILED,
  and sends cfn-response accordingly; non-EKS path unchanged
- deploy/main.tf: add aws_codebuild_project.eks_repo_creator resource
  (NO_SOURCE, uses buildspec.yml from terraform-eks-deployment)
  CODEBUILD_PROJECT_NAME injected into Lambda environment
- deploy/variables.tf: codebuild_project_name, codebuild_role_arn, codebuild_vpc_id
- deploy/terraform.tfvars: set CodeBuild project name, role ARN, VPC ID
…nectivity

Root cause of 'Connect timeout on codebuild.us-gov-west-1.amazonaws.com':
- Lambda is in a VPC with no NAT gateway path to CodeBuild's public endpoint
- Lambda role was missing codebuild:StartBuild / codebuild:BatchGetBuilds perms

Fix:
- aws_vpc_endpoint.codebuild[0]: interface endpoint for CodeBuild in the Lambda
  VPC with private DNS enabled (Lambda API calls resolve to private IPs)
- aws_iam_role_policy.codebuild_access: StartBuild + BatchGetBuilds on the
  eks-terragrunt-repo-creator project ARN only
Lambda was set to 300s but poll_codebuild_build loops for up to 12 min (720s).
Lambda would be killed by AWS before it could report back to CloudFormation.
900s gives a ~180s buffer beyond the poll window.
The packer-pipeline internal buildspec template already wraps the value
in '- {{ additional_post_build_commands }}', so prefixing the value with
'- ' caused YAML_FILE_ERROR (nested list) in CodeBuild build #8.
…in CodeBuild

The standard github_token (/eks-cluster-deployment/github_token) is a GitHub
App installation token (ghs_ prefix) which cannot access /api/v3/user. This
endpoint is always called by the CSVD terraform-github-repo module's
data.github_user.current resource.

Changes:
- app.py: check TF_GITHUB_TOKEN_SECRET_NAME env var first for CodeBuild token;
  falls back to GITHUB_TOKEN_SECRET_NAME if not set
- deploy/main.tf: add TF_GITHUB_TOKEN_SECRET_NAME=ghe-runner/github-token env var
- deploy/main.tf: add IAM policy granting Lambda access to ghe-runner/github-token
returning repository_url/repository_name, causing CFN to fail with:
  'Vendor response doesn't contain pull_request_url attribute'

After CodeBuild SUCCEEDED, query GitHub API /pulls?state=open on the
created repo to get the real PR URL and branch name.
…hitecture

- Replace 'Lambda, NOT CodeBuild' section with the actual working architecture
- Document Lambda as thin orchestrator triggering eks-terragrunt-repo-creator
- Add two-token split explanation (ghs_ App token vs ghp_ PAT for Terraform)
- Add TF_GITHUB_TOKEN_SECRET_NAME and CODEBUILD_PROJECT_NAME env vars
- Add correct rebuild/test commands
- Remove outdated CodeBuild-was-abandoned rationale
@arnol377 arnol377 changed the title fix: Lambda-only EKS repo creation; add Copilot instructions feat: EKS repo creation via CodeBuild + terraform-eks-deployment Apr 7, 2026
@@ -29,7 +29,8 @@ packer_pipeline {
partition = "aws-us-gov" // AWS partition (aws or aws-us-gov)

// Role management
create_role = true // Enable automatic role creation
create_role = false // Role already exists; provide ARN directly
codebuild_role_arn = "arn:aws-us-gov:iam::229685449397:role/CodeBuildPackerRole-eks-terragrunt-repo-generator-builder"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be looked up by Name, partition, account id

@@ -29,7 +29,8 @@ packer_pipeline {
partition = "aws-us-gov" // AWS partition (aws or aws-us-gov)

// Role management
create_role = true // Enable automatic role creation
create_role = false // Role already exists; provide ARN directly

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't we always want to create the role?

deploy/main.tf Outdated
# PAT used by CodeBuild/Terraform for the GitHub provider (must be a ghp_ PAT —
# the standard App installation token ghs_ cannot access /api/v3/user which is
# required by the CSVD terraform-github-repo module).
TF_GITHUB_TOKEN_SECRET_NAME = "ghe-runner/github-token"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pass in token secret name

deploy/main.tf Outdated
Sid = "ReadTFGitHubToken"
Effect = "Allow"
Action = ["secretsmanager:GetSecretValue"]
Resource = "arn:aws-us-gov:secretsmanager:${var.aws_region}:${data.aws_caller_identity.current.account_id}:secret:ghe-runner/github-token-*"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

look up the partition value with data.aws_caller_identity.current.partition

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should also pass in the secret name

deploy/main.tf Outdated
Comment on lines 162 to 177
# ── VPC endpoint: CodeBuild (interface) ──────────────────────────────────────
# The Lambda runs inside a VPC; without this endpoint the CodeBuild API call
# times out because there is no NAT/internet path for codebuild.amazonaws.com.
resource "aws_vpc_endpoint" "codebuild" {
count = var.enable_vpc ? 1 : 0

vpc_id = var.codebuild_vpc_id
service_name = "com.amazonaws.${var.aws_region}.codebuild"
vpc_endpoint_type = "Interface"
subnet_ids = var.subnet_ids
security_group_ids = var.security_group_ids
private_dns_enabled = true

tags = merge(var.tags, { Name = "eks-terragrunt-codebuild-endpoint" })
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't create VPC endpoints, they should already be in the account we use.

Comment on lines 33 to 35
subnet_ids = ["subnet-0b1992a84536c581b"]
security_group_ids = ["sg-0641c697588b9aa6b"]
codebuild_vpc_id = "vpc-00576a396ec570b94"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be looked up or created

# CodeBuild project triggered by the Lambda to run terraform-eks-deployment.
# Reuses the existing CodeBuild packer role (has S3 + VPC + CloudWatch perms).
codebuild_project_name = "eks-terragrunt-repo-creator"
codebuild_role_arn = "arn:aws-us-gov:iam::229685449397:role/CodeBuildPackerRole-eks-terragrunt-repo-generator-builder"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

look up by name or create?

variable "codebuild_role_arn" {
description = "IAM role ARN for the CodeBuild repo-creator project"
type = string
default = "arn:aws-us-gov:iam::229685449397:role/CodeBuildPackerRole-eks-terragrunt-repo-generator-builder"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be looked up so it can work across accounts.

Copy link

@morga471 morga471 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments

@@ -189,7 +191,7 @@ Resources:
RepositoryCreator:
Type: Custom::GitHubRepository
Properties:
ServiceToken: !Ref LambdaFunctionArn
ServiceToken: !Sub "arn:aws-us-gov:lambda:${AWS::Region}:${AWS::AccountId}:function:eks-terragrunt-repo-gen-template-automation"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

look up partition

@@ -43,6 +43,7 @@ class CloudFormationResourceInput(BaseModel):
"""Input validation model for CloudFormation Custom Resource parameters."""
project_name: str = Field(..., description="Name for the new repository")
owning_team: Optional[str] = Field(default="tf-module-admins", description="Team that should own the repository")
creator_username: Optional[str] = Field(default=None, description="GitHub username of the person provisioning this repo; will be granted admin access")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we had abandoned the path for creating the git repo with python? Do we still need this file?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct — the Python repo-creation path is gone. app.py has been rewritten to ~380 lines: it only parses the CFN event, validates the required EKS fields, starts the eks-terragrunt-repo-creator CodeBuild project, polls until completion, then fetches the open PR URL from the GitHub API and returns it in the CFN response. All actual repo creation runs in CodeBuild via terraform-eks-deployment.

@@ -10,14 +10,15 @@


def _default_visibility() -> str:
"""Read REPO_VISIBILITY env var, defaulting to 'internal'.
"""Read REPO_VISIBILITY env var, defaulting to 'public'.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still used since we aren't using the python path?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call — removing the entire generic code path (GitHub/GitLab providers, template manager, models, config.json rendering). This file and github_provider.py, gitlab_provider.py, github_client.py, gitlab_client.py, template_manager.py, and models.py have all been deleted. app.py is now ~380 lines, EKS + CodeBuild only.

Your Name added 7 commits April 14, 2026 13:12
- Rewrite app.py to ~380 lines (EKS + CodeBuild only); remove all generic
  GitHub/GitLab/template-manager code paths
- Delete 7 dead modules: repository_provider, github_provider, gitlab_provider,
  github_client, gitlab_client, template_manager, models
- Delete eks_config.py (Jinja2 rendering now handled entirely by terraform-eks-
  deployment in CodeBuild)
- Rewrite tests/test_app.py for EKS-only handler; drop stale test files
- Simplify requirements.txt (remove requests/requests-mock; now using urllib)

deploy/ Terraform:
- Add conditional CodeBuild IAM role (create_codebuild_role var); default=true
  creates a minimal role with scoped logs/S3/VPC policies; false looks up a
  pre-existing role by name
- Add data sources for subnet + security group lookup by name (no hardcoded IDs)
- Replace hardcoded 'ghe-runner/github-token' with var.tf_github_token_secret_name
- Fix IAM policy partition: arn:aws-us-gov: -> arn:${partition}:
- Remove aws_vpc_endpoint.codebuild (VPC endpoints already exist in the account)
- Remove codebuild_role_arn / codebuild_vpc_id / subnet_ids / security_group_ids
  variables; add codebuild_role_name, subnet_name, security_group_name,
  tf_github_token_secret_name, codebuild_assets_bucket

service-catalog/product-template.yaml:
- Fix ServiceToken partition: arn:aws-us-gov:lambda: -> arn:${AWS::Partition}:lambda:

csvd_config_packer.hcl:
- Annotate account-specific fields with (*) comments explaining they cannot be
  dynamically resolved by packer-pipeline HCL
- Rewrite architecture section to explicitly document cross-account flow:
  the SC product is shared to multiple accounts but all compute (Lambda,
  CodeBuild, Secrets Manager) runs centrally in csvd-dev (229685449397)
- Add cross-account ASCII diagram showing provisioner account vs csvd-dev
- Remove stale reference to eks_config.py (deleted in prior commit)
- Remove stale generic-mode fallback description (Lambda is EKS-only)
- Add EKS_SC_RESOURCE_INVENTORY.md: full catalog of resources by owner
  (census Terragrunt, direct Terraform, StackSet, pre-existing, ephemeral)
  with cross-account architecture section and StackSet note on launch role
…branch

The Lambda is EKS-only by design — the is_eks_deployment false-branch only
ever raised a ValueError with no alternative path. Remove it in favour of
Pydantic required fields, which surface missing fields with precise error
messages at validation time.

Required fields (was Optional): cluster_name, environment, aws_region,
account_name, aws_account_id, environment_abbr, vpc_name, vpc_domain_name.

Also simplifies start_codebuild_build() — 'or ""' fallbacks on required
fields are no longer needed.
…ters

Both values are now resolved automatically by CloudFormation:

This ensures the provisioner's own account ID and region are used,
not hardcoded values or user-supplied inputs that could mismatch.

Removes two parameters from the form users see when provisioning.
…M policies

Adds deploy/buildspec-eks-repo-creator.yml — the CodeBuild buildspec for
the eks-terragrunt-repo-creator project, versioned alongside the Terraform
that manages it rather than referencing terraform-eks-deployment.

Currently points at REPO_BRANCH=test_cluster (PR #16 under review) — must
be updated to 'main' once PR #16 merges.

Also fixes all IAM policy ARN constructions to use data.aws_partition.current
instead of data.aws_caller_identity.current (caller_identity has no
.partition attribute in this provider version).
- copilot-instructions: simplify rebuild steps to use packer-pipeline;
  add Python/CLI standards section; add AWS_DEFAULT_REGION reminder;
  add 'do not re-add AwsRegion/AWSAccountId' to What NOT to Do
- DEPLOYMENT.md, CLOUDFORMATION_CUSTOM_RESOURCE_MIGRATION.md: fix
  packer-pipeline invocation to use csvd_config_packer.hcl
- csvd_config_packer.hcl: remove duplicate stale comment block at EOF
- DEMO_SCRIPT.md: update param table and step 3 walkthrough to match
  the corrected SC form (no AwsRegion/AWSAccountId, correct field names)
- test_service_catalog.py: bump artifact to v2.1; remove AwsRegion and
- docs/SC-TEMPLATE-FIX-PLAN.md: add completed fix plan for reference
- deploy state: updated after today's tf apply
@arnol377 arnol377 changed the title feat: EKS repo creation via CodeBuild + terraform-eks-deployment fix: EKS-only Lambda cleanup + SC template AwsRegion/AWSAccountId removal Apr 21, 2026
@arnol377 arnol377 merged commit b247f0b into main Apr 21, 2026
Sign in to join this conversation on GitHub.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants