docs: update CHECKPOINT with Jira sub-tasks and add decision documents index; create ADR-003 for Vault cluster topology; create ADR-004 for account baseline IAM role; create ADR-005 for Service Catalog portfolio sharing strategy
Dave Arnold committed May 28, 2026
1 parent b14a084 commit 66838cf
Showing 5 changed files with 516 additions and 20 deletions.
29 changes: 28 additions & 1 deletion design-docs/CHECKPOINT.md
@@ -2,7 +2,34 @@

## 1. Last Updated

**2026-05-06** — Implementation complete: Phases 1–3 fully built and committed.
**2026-05-28** — Jira sub-tasks created under CSC-1341; three new ADRs added to `docs/decisions/`.

---

## 1a. Jira Ticket Index

Parent: **[CSC-1341](https://jira.it.census.gov/browse/CSC-1341)**: [sc-lambda-ghactions] Design & implement next-gen SC automation system

| Key | Summary | Priority | Status | ADR |
|-----|---------|----------|--------|-----|
| [CSC-1342](https://jira.it.census.gov/browse/CSC-1342) | Build and push Lambda container image to ECR (via packer-pipeline) | High | To Do | |
| [CSC-1343](https://jira.it.census.gov/browse/CSC-1343) | End-to-end test: SC provision → CodeBuild → tf-run → PR → CFN SUCCESS | High | To Do | |
| [CSC-1344](https://jira.it.census.gov/browse/CSC-1344) | Provision account baseline IAM role (sc-automation-codebuild-role) | High | To Do | [ADR-004](../docs/decisions/004-account-baseline-iam-role.md) |
| [CSC-1345](https://jira.it.census.gov/browse/CSC-1345) | ADR-002: Implement Vault AWS Secrets Engine for cross-account credentials | High | To Do | [ADR-002](../docs/decisions/002-vault-aws-secrets-engine.md) |
| [CSC-1346](https://jira.it.census.gov/browse/CSC-1346) | Vault cluster topology decision | Medium | To Do | [ADR-003](../docs/decisions/003-vault-cluster-topology.md) |
| [CSC-1348](https://jira.it.census.gov/browse/CSC-1348) | OU sharing and StackSet for Service Catalog portfolio | Medium | To Do | [ADR-005](../docs/decisions/005-portfolio-org-sharing.md) |
| [CSC-1349](https://jira.it.census.gov/browse/CSC-1349) | Migration runbook: lambda-template-repo-generator → sc-lambda-ghactions | Medium | To Do | |
| [CSC-1350](https://jira.it.census.gov/browse/CSC-1350) | Phase 4 observability: CloudWatch dashboard + SNS alerts on FAILED builds | Low | To Do | |

**Decision documents index:**

| ADR | File | Status | Linked tickets |
|-----|------|--------|---------------|
| ADR-001 | [docs/decisions/001-webhook-auto-apply.md](../docs/decisions/001-webhook-auto-apply.md) | Accepted | |
| ADR-002 | [docs/decisions/002-vault-aws-secrets-engine.md](../docs/decisions/002-vault-aws-secrets-engine.md) | Proposed | CSC-1345 |
| ADR-003 | [docs/decisions/003-vault-cluster-topology.md](../docs/decisions/003-vault-cluster-topology.md) | Proposed | CSC-1346 |
| ADR-004 | [docs/decisions/004-account-baseline-iam-role.md](../docs/decisions/004-account-baseline-iam-role.md) | Accepted | CSC-1344, CSC-1348 |
| ADR-005 | [docs/decisions/005-portfolio-org-sharing.md](../docs/decisions/005-portfolio-org-sharing.md) | Proposed | CSC-1348 |

---

68 changes: 49 additions & 19 deletions docs/account-bootstrap-analysis.md
@@ -46,10 +46,10 @@ top-level items. No account was missing any of these.
├── edl-automation/ # EDL-specific automation (EDL accounts only)
├── includes.d/ # shared variable definitions (tags)
├── infrastructure/ # TF state backend, S3 logs, CloudTrail, Config
├── init/ # git repo setup, git-secret, GPG key
│ ├── git-secret/ # team-member GPG public keys (.gpg.asc)
├── init/ # git repo setup; git-secret/gpg-setup present in legacy repos only
│ ├── git-secret/ # ⚠️ legacy — team-member GPG public keys; eliminated with Vault
│ ├── git-setup/ # IaC to create/configure the GitHub repo
│ └── gpg-setup/ # account-specific GPG key generation
│ └── gpg-setup/ # ⚠️ legacy — account-specific GPG key generation; eliminated with Vault
├── provider_configs.d/ # provider secrets: GitHub, LDAP, Infoblox, DNS
├── variables.d/ # variables.common.tf, variables.tfstate.tf, per-region .tfvars
├── vpc/ # VPC resources per region
@@ -146,7 +146,7 @@ and `tf-run.data`. The sequence is:

```
Phase 0: MANUAL — AWS account creation, initial bootstrap IAM user
Phase 1: init/ — GPG key, git-secret, GitHub repo
Phase 1: init/ — GitHub repo creation (GPG/git-secret eliminated with Vault)
Phase 2: provider_configs.d/ — provider secret initialization
Phase 3: infrastructure/ (partial) — TF state backend (S3 + DynamoDB)
Phase 4: infrastructure/{region}/ — S3 access log buckets per region
@@ -231,7 +231,6 @@ repo from scratch:
| `program` | `edl`, `ent`, `ma`, `lab`, etc. | Controls edl-automation inclusion |
| `environment` | `dev`, `nonprod`, `prod`, `common` | Tags and policy scoping |
| `admin_users` | `[badra001, dwara001, ...]` | Generates `INF.admin-user.*.tf` files |
| `team_gpg_keys` | Map of username → GPG public key | Populates `init/git-secret/` |
| `github_org` | `SCT-Engineering` or specific org | For `init/git-setup/` |
| `github_repo_name` | `{account_id}-{alias}` | Usually derived from above |
| `tfstate_bucket` | `inf-tfstate-{account_id}` | S3 bucket for remote state |
@@ -326,25 +325,56 @@ replace git-secret for all provider credentials as well. The mapping is direct:
| `provider.github.auto.tfvars.secret` → `git secret reveal` | `vault kv get secret/accounts/{alias}/github` → write `.auto.tfvars` at build time |
| `provider.ldap.auto.tfvars.secret` → `git secret reveal` | `vault kv get secret/accounts/{alias}/ldap` → write `.auto.tfvars` at build time |
| `provider.infoblox.auto.tfvars.secret` → `git secret reveal` | `vault kv get secret/accounts/{alias}/infoblox` → write `.auto.tfvars` at build time |
| Account GPG private key → `git secret reveal` | `vault kv get secret/accounts/{alias}/gpg-private-key` → decrypt IAM passwords at build time |
| Account GPG private key (encrypts IAM passwords in repo) | **Eliminated** — admin-user module writes passwords directly to `vault kv put secret/accounts/{alias}/users/{username}`; no GPG, no `.secret` files |

The executor buildspec would add a `vault kv get` call per needed provider before
The executor buildspec adds a `vault kv get` call per needed provider before
running `tf-init`/`tf-run`, injecting the plaintext credentials as temporary
files that are never committed. This replaces the entire `git secret reveal`
ceremony and eliminates the need for any team member to maintain GPG keys in a
git repo.
ceremony. **The `.gitsecret/` directory, `init/gpg-setup/`, and `init/git-secret/`
are eliminated from all new account repos** — they are artifacts of the old system
and have no role in Vault-managed accounts.
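
A minimal sketch of the per-provider fetch described above, assuming one field per provider secret and an output file named `provider.<name>.auto.tfvars` that is git-ignored. The function name `write_provider_tfvars`, its arguments, and the output directory are illustrative and are not taken from the actual buildspec.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not from the real buildspec): fetch one field from
# Vault KV and write it to a tfvars file that is never committed.
# Usage: write_provider_tfvars <account-alias> <provider> <field> <out-dir>
write_provider_tfvars() {
  local account_alias="$1" provider="$2" field="$3" out_dir="$4"
  local value out_file
  out_file="${out_dir}/provider.${provider}.auto.tfvars"

  # Path layout follows the mapping table: secret/accounts/{alias}/{provider}
  value="$(vault kv get -field="${field}" "secret/accounts/${account_alias}/${provider}")"

  # Escape backslashes and double quotes so the value is a valid HCL string.
  value=${value//\\/\\\\}
  value=${value//\"/\\\"}

  # Create the file with owner-only permissions before writing the secret.
  (
    umask 077
    printf '%s = "%s"\n' "${field}" "${value}" > "${out_file}"
  )

  # Print the path so the caller can clean it up after tf-run.
  printf '%s\n' "${out_file}"
}
```

In a buildspec this could be called once per provider before `tf-init`, for example `write_provider_tfvars "${ACCOUNT_ALIAS}" github github_token .`, with the returned path deleted once the build finishes.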

#### CodeBuild authentication to Vault

CodeBuild authenticates to Vault using the **AWS auth method** — no credentials
are injected, stored, or rotated. CodeBuild proves its identity via
`sts:GetCallerIdentity`; Vault verifies the IAM role ARN directly with AWS.

The proposer's `tf apply` provisions the Vault auth role for each new account:

```hcl
resource "vault_aws_auth_backend_role" "codebuild" {
  backend   = "aws"
  role      = "sc-automation-${var.account_id}"
  auth_type = "iam"
  bound_iam_principal_arns = [
    "arn:${var.partition}:iam::${var.account_id}:role/sc-automation-codebuild-role"
  ]
  token_policies = ["sc-automation-${var.account_id}"]
  token_ttl      = 900
}
```

Then in the executor buildspec:

```bash
# Log in with the build's IAM identity; -no-print keeps the token out of the build log.
vault login -method=aws -no-print role="sc-automation-${ACCOUNT_ID}"
GITHUB_TOKEN="$(vault kv get -field=github_token "secret/accounts/${ACCOUNT_ALIAS}/github")"
```

No AppRole, no Secret IDs, nothing in Secrets Manager. The IAM role *is* the credential.
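
The two buildspec lines above could be wrapped with basic input validation, as sketched here. The function names and the 12-digit account ID check are assumptions added for illustration; the role name and secret path follow the examples above.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: derive the Vault role name for an account,
# rejecting anything that is not a 12-digit AWS account ID.
vault_role_for_account() {
  local account_id="$1"
  if [[ ! "${account_id}" =~ ^[0-9]{12}$ ]]; then
    echo "invalid AWS account id: ${account_id}" >&2
    return 1
  fi
  printf 'sc-automation-%s\n' "${account_id}"
}

# Hypothetical helper: log in with the build's IAM identity, then print the
# GitHub token for the account on stdout.
# Usage: fetch_github_token <account-id> <account-alias>
fetch_github_token() {
  local account_id="$1" account_alias="$2" role
  role="$(vault_role_for_account "${account_id}")" || return 1
  vault login -method=aws -no-print "role=${role}"
  vault kv get -field=github_token "secret/accounts/${account_alias}/github"
}
```

Validating the account ID first means a mistyped or empty `ACCOUNT_ID` fails with a clear message instead of a Vault "role not found" error.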

### 6.5 What Vault Cannot Eliminate

Even with Vault managing all secrets, two manual steps survive:
With Vault managing all secrets and CodeBuild authenticating via the AWS auth
method, one manual step survives:

1. **Account-specific GPG keypair generation (M4):** The `init/gpg-setup/`
module still generates a keypair used to encrypt IAM passwords that Terraform
outputs. If the Terraform `admin-user` module is redesigned to deliver
passwords via Vault KV (i.e., `vault kv put secret/accounts/{alias}/users/
{username}/password $(tf output password)`) rather than GPG-encrypted files
in the repo, this step becomes unnecessary. This is an account-module change,
not a sc-lambda-ghactions change.
1. **Account-specific GPG keypair generation (M4): Eliminated.** The
`init/gpg-setup/` directory and the entire `.gitsecret/` tree are dropped from
new account repos. The `admin-user` module delivers IAM passwords directly to
`vault kv put secret/accounts/{alias}/users/{username}` rather than to
GPG-encrypted files. Operators retrieve passwords via `vault kv get` using
their own IAM credentials — no GPG toolchain required at any point.

2. **Bootstrapping Vault itself with the first account credential:** The very
first time a new account is bootstrapped, Vault does not yet have that
@@ -360,8 +390,8 @@ If ADR-002 is implemented and extended to cover provider credentials via Vault K

| Manual step | Current status | With Vault |
|---|---|---|
| M4 — GPG keypair generation | Required per account | Eliminated if admin-user module writes passwords to Vault KV |
| M5 — Team member GPG key collection | Required per account per new team member | Eliminated — no git-secret recipients needed |
| M4 — GPG keypair generation | Required per account | **Eliminated**: `init/gpg-setup/` and `.gitsecret/` removed from new repos; IAM passwords go to Vault KV |
| M5 — Team member GPG key collection | Required per account per new team member | **Eliminated** — no git-secret recipients; operators access secrets via IAM + Vault policy |
| M6 — `*.auto.tfvars.secret` encryption | Required per credential per account | Replaced by one `vault kv put` per credential (one-time, central team) |
| M10 — LDAP objects in `common/` | Currently blocked for CodeBuild | Unblocked — executor reads LDAP credentials from Vault at build time |
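
The one-time `vault kv put` per credential (M6) might look like the following sketch. The helper name and its arguments are hypothetical. The secret is read from stdin through Vault's `key=-` form so that it does not appear in shell history or the process list.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: store one provider credential for an account.
# The secret value is read from this function's stdin and handed to Vault
# with the "field=-" form, which tells `vault kv put` to read from stdin.
# Note: `kv put` writes a complete new secret containing only the fields
# given here; use `vault kv patch` to change one field of an existing secret.
# Usage: printf '%s' "$TOKEN" | seed_provider_secret <alias> <provider> <field>
seed_provider_secret() {
  local account_alias="$1" provider="$2" field="$3"
  vault kv put "secret/accounts/${account_alias}/${provider}" "${field}=-"
}
```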

134 changes: 134 additions & 0 deletions docs/decisions/003-vault-cluster-topology.md
@@ -0,0 +1,134 @@
# ADR-003: Vault Cluster Topology for SC Automation

## In Plain Language

Before we can implement ADR-002 (dynamic AWS credentials from Vault), we need to decide
*which* Vault cluster the SC automation system will talk to, how that cluster is organized,
and how CodeBuild builds will authenticate to it.

This document records the topology decision: existing shared cluster vs. dedicated cluster,
namespace layout, and the auth method CodeBuild will use to prove its identity to Vault.

**Status:** Proposed
**Date:** 2026-05-28
**Depends on:** ADR-002 (`002-vault-aws-secrets-engine.md`)
**Jira:** [CSC-1346](https://jira.it.census.gov/browse/CSC-1346)

---

## Context

ADR-002 specifies that the CodeBuild executor will authenticate to Vault and request
short-lived AWS credentials from the Vault AWS Secrets Engine. But it deliberately
defers the question of *which* Vault cluster to use. Three viable topologies exist:

### Option A — Shared Census Vault cluster, dedicated namespace

Use an existing Census-managed Vault cluster (e.g. the platform Vault in csvd-prod
or a shared non-prod instance). Create a dedicated namespace (`sc-automation/`) so
that all SC automation policies, roles, and secrets engine mounts are isolated from
other tenants.

**Pros:**
- No new cluster to operate or HA-tune
- Shared cluster is already monitored, patched, and backed up
- Cost is shared across all tenants

**Cons:**
- Dependency on another team's change-management cadence
- Namespace-level isolation is good but not complete cluster isolation
- Shared cluster outage affects all tenants simultaneously

### Option B — Dedicated Vault cluster in csvd-dev

Deploy a standalone Vault cluster (Integrated Storage / Raft, 3-node) in csvd-dev
`us-gov-west-1` specifically for SC automation.

**Pros:**
- Full operational control; can tune lease TTLs, auth policies, and HA config
without coordinating with other teams
- Complete isolation — a misconfiguration in SC automation cannot affect other workloads
- Can be versioned and upgraded on our own schedule

**Cons:**
- New operational burden: cluster patching, unseal key rotation, backup scheduling
- Requires 3 EC2 instances (or ECS tasks) and associated IAM/networking
- Higher cost for a single-tenant cluster

### Option C — Vault on Kubernetes (ECS/EKS sidecar pattern)

Run Vault as a sidecar container alongside CodeBuild tasks (dev/agent pattern), using
`vault agent` injector to deliver credentials to the build environment.

**Pros:** No persistent cluster to manage
**Cons:** CodeBuild does not support sidecars natively; requires workaround; substantially
more complex than Options A or B. **Not recommended.**

---

## Auth Method Decision

Regardless of cluster topology, CodeBuild will authenticate to Vault using the
**AWS IAM auth method** (`auth/aws`). The CodeBuild service role ARN
(`arn:${AWS::Partition}:iam::229685449397:role/sc-automation-codebuild-role`) is
bound to a Vault role. When the executor build starts, `vault login` presents the
current IAM identity (via `GetCallerIdentity`) — no static tokens or secrets are
needed inside the build environment.

```hcl
# Vault IAM auth role (managed in sc-lambda-ghactions deploy/)
resource "vault_aws_auth_backend_role" "codebuild_executor" {
  backend                  = "aws"
  role                     = "sc-automation-codebuild"
  auth_type                = "iam"
  bound_iam_principal_arns = ["arn:aws-us-gov:iam::229685449397:role/sc-automation-codebuild-role"]
  token_ttl                = 900 # 15 min; matches max CodeBuild build window
  token_policies           = ["sc-automation-executor"]
}
```

---

## Decision

> **TO BE DECIDED** — this ADR is in Proposed state pending discussion with the
> platform / Vault operations team.

Questions to answer before closing this ADR:

1. Is there an existing Census Vault cluster available for non-prod workloads that
the SC automation team can use? What is its SLA?
2. Does the Census Vault team support dedicated namespaces for product teams?
3. What is the blast-radius / approval process for cluster-level changes on a
shared cluster that affect us?
4. Are there cost / account placement constraints that favour one topology?

**Recommended default (pending discussion): Option A** — shared Census cluster with
a dedicated `sc-automation/` namespace. This avoids new operational burden while
still providing tenant isolation. Revisit if the shared cluster proves too slow to
change or if an outage directly impacts SC automation SLA.

---

## Consequences

### If Option A (shared cluster, dedicated namespace)

- Platform team must grant namespace admin rights to the SC automation team
- SC automation `deploy/` Terraform must include Vault provider config pointing at
the shared cluster
- Vault cluster URL and namespace become required Terraform variables
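
If Option A is chosen, the executor could fail fast when the cluster URL or namespace is missing. A minimal sketch, assuming the build receives them through the standard Vault CLI environment variables `VAULT_ADDR` and `VAULT_NAMESPACE`; the function name is a placeholder.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical pre-flight check for the executor build.
# VAULT_ADDR and VAULT_NAMESPACE are the standard Vault CLI variables.
require_vault_env() {
  local name missing=0
  for name in VAULT_ADDR VAULT_NAMESPACE; do
    # ${!name:-} is bash indirect expansion: the value of the variable
    # whose name is stored in $name, or empty if that variable is unset.
    if [[ -z "${!name:-}" ]]; then
      echo "missing required environment variable: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}
```

Calling `require_vault_env` at the top of the build reports every missing variable at once, before any `vault login` attempt is made.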

### If Option B (dedicated cluster in csvd-dev)

- New Terraform module required to stand up 3-node Raft cluster in csvd-dev
- Unseal key escrow procedure must be documented and tested
- Adds ~$X/month to csvd-dev bill (to be estimated)

---

## Related

- [ADR-002: Vault AWS Secrets Engine](./002-vault-aws-secrets-engine.md) — upstream decision
- [CSC-1345](https://jira.it.census.gov/browse/CSC-1345) — ADR-002 implementation ticket
- [CSC-1346](https://jira.it.census.gov/browse/CSC-1346) — this topology decision ticket