From c6ac44784cc929629342d184a9f1833f1a538d02 Mon Sep 17 00:00:00 2001
From: Dave Arnold
Date: Mon, 11 May 2026 17:52:47 -0400
Subject: [PATCH] docs: ADR-001 webhook auto-apply on merge to main (proposed)

---
 docs/decisions/001-webhook-auto-apply.md | 167 +++++++++++++++++++++++
 1 file changed, 167 insertions(+)
 create mode 100644 docs/decisions/001-webhook-auto-apply.md

diff --git a/docs/decisions/001-webhook-auto-apply.md b/docs/decisions/001-webhook-auto-apply.md
new file mode 100644
index 0000000..795833f
--- /dev/null
+++ b/docs/decisions/001-webhook-auto-apply.md
@@ -0,0 +1,167 @@
+# ADR-001: Webhook-Triggered Auto-Apply on Merge to Main
+
+**Status:** Proposed
+**Date:** 2026-05-11
+**Branch:** feature/template-repo-rendering
+
+---
+
+## Context
+
+The current two-product model requires a human to manually provision the
+`tf-run-executor` Service Catalog product after a Proposer PR is reviewed and
+merged. This adds unnecessary friction to the apply step:
+
+1. Platform engineer reviews and merges the PR opened by the Proposer
+2. Platform engineer opens Service Catalog, finds the executor product, fills in
+   the same parameters they already specified during the Propose step, and
+   clicks Launch
+
+Step 2 is pure operational overhead. The information needed to start the executor
+build (account repo, layer, region dir, target account) is already known at merge
+time and could be stored in the repo itself.
+
+---
+
+## Decision
+
+Add a **GitHub Enterprise webhook handler** to the Lambda that automatically
+starts an executor CodeBuild build whenever a push event lands on `main` in a
+watched account repo.
+
+Target apply configuration is stored in a `.sc-automation.yml` file committed to
+the root of each account repo by the Proposer (or manually by a platform engineer).
+
+---
+
+## Proposed Design
+
+### `.sc-automation.yml`: committed to the account repo root
+
+```yaml
+# Written by the Proposer CodeBuild build or manually by a platform engineer.
+# Each entry triggers one executor CodeBuild build when changes land on main.
+apply_on_merge:
+  - layer: infrastructure
+    region_dir: west
+    target_account_id: "229685449397"
+  - layer: infrastructure
+    region_dir: east
+    target_account_id: "229685449397"
+  - layer: vpc
+    region_dir: west
+    target_account_id: "229685449397"
+```
+
+Fields per entry:
+
+| Field | Required | Description |
+|---|---|---|
+| `layer` | yes | `common`, `infrastructure`, or `vpc` |
+| `region_dir` | yes | `east`, `west`, or `global` |
+| `target_account_id` | no | 12-digit AWS account ID; omit to run in csvd-dev |
+| `tf_run_start_tag` | no | tf-run TAG label to start from |
+| `dry_run` | no | `true` to plan only (default: `false`) |
+
+### Lambda changes
+
+Add a `/webhook` path handler alongside the existing CFN handler in
+`lambda/app.py`.
+
+**Invocation:** Lambda Function URL (no API Gateway needed; GHE can POST to
+a Function URL directly). The URL is added to the GHE org or repo webhook
+settings.
+
+**Request flow:**
+
+```
+GHE push event (main branch, account repo)
+  → Lambda Function URL POST /webhook
+      → verify HMAC-SHA256 signature (secret in SM: ghe-runner/webhook-secret)
+      → parse X-GitHub-Event: push
+      → filter: ref == refs/heads/main
+      → filter: repo name matches account repo pattern
+      → fetch .sc-automation.yml via GHE API (no clone needed)
+      → for each entry in apply_on_merge:
+            start_codebuild_build(action="apply", account_repo=..., layer=..., ...)
+            (fire-and-forget: do NOT block on CodeBuild completion)
+      → return 200 OK immediately
+```
+
+**Key differences from the CFN handler:**
+
+- **No polling.** The webhook handler starts builds and returns immediately.
+  Build results are visible in CodeBuild logs and CloudWatch. There is no CFN
+  stack to signal.
+- **No CFN resource.** The executor product is still available for manual use,
+  but webhook-triggered runs bypass Service Catalog entirely.
+- **Idempotent in effect.** If GHE retries a delivery (for example after a
+  network blip), a duplicate build is started. This is acceptable because
+  `tf-run apply` on an already-applied state is a no-op.
+
+### Infrastructure changes
+
+| Resource | Change |
+|---|---|
+| Lambda Function URL | Add `aws_lambda_function_url` resource in `deploy/lambda.tf` |
+| Lambda invoke permission | Add `aws_lambda_permission` allowing `lambda:InvokeFunctionUrl` from `*` (HMAC signature is the auth mechanism) |
+| Secrets Manager | Add a `ghe-runner/webhook-secret` secret for HMAC verification |
+| Lambda IAM | No change: existing `codebuild:StartBuild` permission covers webhook-triggered builds |
+| GHE Webhook | Manual one-time setup: org or per-repo webhook → Function URL, content-type `application/json`, events: `push` |
+
+### `.sc-automation.yml` lifecycle
+
+- **Proposer writes it** when it first creates the branch (if the file doesn't
+  exist yet). The Proposer knows `layer`, `region_dir`, and `target_account_id`
+  from its build environment variables. It commits `.sc-automation.yml` alongside
+  the rendered template files.
+- **Platform engineers edit it** directly via PR if they need to add or remove
+  apply targets.
+- **Writes are idempotent.** A later Proposer run, including its
+  `--force-with-lease` push, leaves the file untouched because the Proposer
+  writes it only if it does not already exist, so manual edits are not clobbered.
+
+---
+
+## Consequences
+
+### Benefits
+
+- Eliminates the manual "provision executor product" step after PR merge
+- Apply is fully traceable: GHE push event → CloudWatch Logs → CodeBuild build ID
+- No new infrastructure services (no EventBridge, no SQS, no API Gateway)
+- The executor SC product remains available for manual one-off runs and
+  day-2 operations (re-run from a specific tag, dry-run, etc.)
+
+### Trade-offs
+
+- Build results are no longer surfaced in a CloudFormation stack output; users
+  must check CodeBuild or CloudWatch Logs directly
+- GHE webhook requires a one-time manual setup per org (or per repo for
+  fine-grained control)
+- A merge to `main` that does not involve Terraform changes (e.g. README edit)
+  will still trigger executor builds. Mitigation: add a `paths` filter in
+  `.sc-automation.yml` (future enhancement) or rely on `tf-run apply` being a
+  safe no-op
+
+### Out of scope for this ADR
+
+- Result notification (Slack, email) after a webhook-triggered apply: tracked
+  separately
+- Path filtering (only trigger on changes under `{layer}/{region_dir}/`):
+  tracked separately
+
+---
+
+## Alternatives Considered
+
+**CodeStar connection + CodePipeline watch:** Requires CodePipeline infrastructure
+per repo, CodeStar connector host setup for GHE on-prem, and loses the per-run
+environment variable flexibility that the Lambda `StartBuild` override model
+provides. Rejected.
+
+**EventBridge + S3 source:** Would require mirroring the GHE repo to CodeCommit
+or S3 to get an EventBridge trigger. Adds a sync layer with no benefit. Rejected.
+
+**Poll-based apply (Lambda on schedule):** Adds latency and unnecessary API calls.
+Rejected.
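
A minimal Python sketch of the verify, filter, and fan-out steps from the ADR's request flow, assuming the handler lives in `lambda/app.py`. The function names, the `ACCOUNT_REPO_PATTERN` regex, and the shape of the returned request dicts are hypothetical; the ADR does not define them. The sketch relies only on GitHub sending the HMAC-SHA256 digest in the `X-Hub-Signature-256` header as `sha256=<hex>` and on push payloads carrying `ref` and `repository.name`. Reading the secret, fetching and parsing `.sc-automation.yml`, and calling CodeBuild are left out.

```python
import hashlib
import hmac
import re

# Hypothetical naming scheme; the ADR only says "account repo pattern".
ACCOUNT_REPO_PATTERN = re.compile(r"^account-[a-z0-9-]+$")


def signature_is_valid(secret, body, header):
    """Compare the X-Hub-Signature-256 header with our own digest of the raw body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time; a missing header is treated as empty.
    return hmac.compare_digest(expected.encode(), (header or "").encode())


def should_apply(event_type, payload):
    """Accept only push events on main from repos matching the account pattern."""
    if event_type != "push":
        return False
    if payload.get("ref") != "refs/heads/main":
        return False
    repo_name = (payload.get("repository") or {}).get("name", "")
    return bool(ACCOUNT_REPO_PATTERN.match(repo_name))


def build_requests(account_repo, config):
    """Expand parsed .sc-automation.yml content into one build request per entry."""
    requests = []
    for entry in (config or {}).get("apply_on_merge") or []:
        requests.append({
            "action": "apply",
            "account_repo": account_repo,
            "layer": entry["layer"],            # required field
            "region_dir": entry["region_dir"],  # required field
            "target_account_id": entry.get("target_account_id"),
            "tf_run_start_tag": entry.get("tf_run_start_tag"),
            "dry_run": bool(entry.get("dry_run", False)),
        })
    return requests
```

The signature has to be computed over the raw request body bytes, before any JSON parsing; if the Function URL event sets `isBase64Encoded`, the body needs decoding first. Each dict returned by `build_requests` would then be handed to whatever wrapper the Lambda already uses to start a CodeBuild build.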