Commit f3e80d9 (1 parent: fa18679), committed by Your Name on Jan 16, 2026
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Showing 43 changed files with 2,375 additions and 146 deletions.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,131 @@ | ||
| # GitHub Copilot Instructions for ghe-runner Repository | ||
|
|
||
| ## General Guidelines | ||
|
|
||
| ### Terraform Commands | ||
| - **ALWAYS use the `tf` alias instead of `terraform` command** | ||
| - The `tf` alias performs important behind-the-scenes operations required for this environment | ||
| - Examples: | ||
| - ✅ `tf plan` (correct) | ||
| - ✅ `tf apply` (correct) | ||
| - ❌ `terraform plan` (incorrect) | ||
| - ❌ `terraform apply` (incorrect) | ||
|
|
||
| ### Terminal Commands | ||
| - When running terminal commands, always use the `run_in_terminal` tool | ||
| - Set `isBackground=false` for commands that need output | ||
| - Set `isBackground=true` for long-running processes (servers, watches) | ||
|
|
||
| ### AWS Authentication | ||
| - AWS credentials may expire during a session | ||
| - The user refreshes credentials manually with the `awscreds` command | ||
| - Do not attempt to source AWS credentials automatically | ||
|
|
||
| ### GitHub Authentication | ||
| - This project uses **token-only authentication** (GITHUB_TOKEN environment variable) | ||
| - GitHub App authentication is optional (variables have default = null) | ||
| - Never require GitHub App variables unless explicitly requested | ||
|
|
||
| ## Project-Specific Context | ||
|
|
||
| ### Infrastructure | ||
| - **Region**: us-gov-west-1 (AWS GovCloud) | ||
| - **ECS Cluster**: ecs-ghe-runners-us-gov-west-1 | ||
| - **GitHub Enterprise**: github.e.it.census.gov | ||
| - **Organization**: SCT-Engineering | ||
| - **Proxy**: proxy.tco.census.gov:3128 (required for outbound traffic) | ||
|
|
||
| ### Critical Understanding: Persistent Runners & Token Lifecycle | ||
| ⚠️ **IMPORTANT**: Runners are **persistent, long-running containers** (not ephemeral): | ||
| - Runners run continuously 24/7, handling multiple jobs over their lifetime | ||
| - Registration token is used **only during container startup** (one-time registration) | ||
| - A Lambda function refreshes the token every 30 minutes so that a valid token is available whenever ECS restarts a task | ||
| - **Deadlock risk**: If all runners die AND the token has expired, ECS cannot auto-recover | ||
| - Running tasks don't need token refresh (already registered) | ||
| - Failed tasks being restarted by ECS need valid token from Secrets Manager | ||
| - This is why monitoring and quick response are essential | ||
|
|
||
| ### File Conventions | ||
| - Main configuration: `default.auto.tfvars` | ||
| - Example template: `example.tfvars.template` (do NOT rename to `.auto.tfvars`) | ||
| - Monitoring: `monitoring.tf` | ||
| - Emergency procedures: `RUNBOOK.md` | ||
|
|
||
| ### Terraform Modules | ||
| - Primary module: `HappyPathway/github-runner/ecs` | ||
| - Optional ECR clone: `HappyPathway/ecr-clone/aws` | ||
| - Module outputs: Check `outputs.tf` before referencing module attributes | ||
|
|
||
| ## Code Editing Guidelines | ||
|
|
||
| ### When Making File Changes | ||
| 1. Always read sufficient context before editing (5+ lines before/after) | ||
| 2. Use `replace_string_in_file` with exact matches including whitespace | ||
| 3. Never use placeholder comments like `...existing code...` in edits | ||
| 4. Verify changes with `tf plan` after modifications | ||
|
|
||
| ### When Implementing Features | ||
| 1. Create a todo list for multi-step work | ||
| 2. Mark items in-progress before starting | ||
| 3. Mark items completed immediately after finishing | ||
| 4. Update the list as new tasks are discovered | ||
|
|
||
| ## Monitoring & Alerting | ||
|
|
||
| ### Alert Configuration | ||
| - Alert email: david.j.arnold.jr@census.gov | ||
| - SNS topic: github-runner-critical-alerts | ||
| - Critical alarms: runners < 50% capacity, all runners down | ||
| - Dashboard: CloudWatch dashboard for visibility | ||
|
|
||
| ### Emergency Response | ||
| - Refer to `RUNBOOK.md` for incident procedures | ||
| - Three critical scenarios documented: | ||
| 1. Lambda token refresh failing | ||
| 2. Runners below 50% capacity | ||
| 3. All runners down (EMERGENCY) | ||
|
|
||
| ## Testing & Validation | ||
|
|
||
| ### Before Committing | ||
| 1. Run `tf plan` to validate configuration | ||
| 2. Check for errors with `get_errors` tool if available | ||
| 3. Verify outputs are as expected | ||
| 4. Review changes in context of overall system | ||
|
|
||
| ### After Deployment | ||
| 1. Verify SNS email subscription confirmation | ||
| 2. Check CloudWatch alarms are configured | ||
| 3. Test dashboard accessibility | ||
| 4. Document any lessons learned | ||
|
|
||
| ## Common Issues & Solutions | ||
|
|
||
| ### "Invalid AWS Region" Error | ||
| - Ensure `providers.tf` has `region = "us-gov-west-1"` | ||
|
|
||
| ### "Unsupported attribute" on Module Outputs | ||
| - Check `outputs.tf` for available module outputs | ||
| - Use `var.repo_org` for service name, not `module.github-runner.service_name` | ||
|
|
||
| ### Image Pull Failures | ||
| - Enable ECR clone: `enable_ecr_clone = true` | ||
| - Verify image version exists in source registry | ||
|
|
||
| ### Token Expiration Risk | ||
| - Monitor Lambda execution via CloudWatch Logs | ||
| - Check token age in Secrets Manager | ||
| - Manual refresh available via Lambda invoke | ||
|
|
||
| ## Resources | ||
|
|
||
| - [Monitoring Plan](./MONITORING_IMPLEMENTATION_PLAN.md) | ||
| - [Emergency Runbook](./RUNBOOK.md) | ||
| - [GitHub App Setup](./GITHUB_APP_SETUP.md) | ||
| - [AWS Permissions](./AWS_PERMISSIONS.md) | ||
| - [Security Review](./SECURITY_REVIEW.md) | ||
|
|
||
| --- | ||
|
|
||
| **Last Updated**: January 15, 2026 | ||
| **Maintainer**: CSVD Team |
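
The alarm thresholds and the deadlock condition described in the instructions above can be sketched as a small classifier. This is a minimal illustration, not code from the repository: `FleetStatus`, `classify`, the returned labels, and the one-hour `TOKEN_TTL` are all assumptions introduced here, and the real alarms are defined in `monitoring.tf`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumption: a registration token stays valid for about one hour after issue.
TOKEN_TTL = timedelta(hours=1)


@dataclass
class FleetStatus:
    """Hypothetical snapshot of the runner fleet (not a type from the repo)."""
    desired: int               # runner tasks the ECS service should keep running
    running: int               # runner tasks currently running
    token_issued_at: datetime  # when the stored token was last refreshed


def classify(status: FleetStatus, now: datetime) -> str:
    """Map a fleet snapshot to one of the conditions described above."""
    token_valid = now - status.token_issued_at < TOKEN_TTL
    if status.running == 0:
        # No runners and no valid token: restarted tasks cannot register.
        return "EMERGENCY" if token_valid else "DEADLOCK"
    if status.running * 2 < status.desired:
        # Fewer than half of the desired runners are up.
        return "CRITICAL_BELOW_50_PERCENT"
    if not token_valid:
        # Running tasks are unaffected, but any restart would fail to register.
        return "TOKEN_STALE"
    return "OK"


if __name__ == "__main__":
    now = datetime(2026, 1, 15, 12, 0, tzinfo=timezone.utc)
    snapshot = FleetStatus(desired=4, running=0,
                           token_issued_at=now - timedelta(hours=2))
    print(classify(snapshot, now))
```

Separating `TOKEN_STALE` from `DEADLOCK` mirrors the distinction the instructions draw: an expired token alone does not affect running tasks, and it only becomes an outage once no registered runner is left.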
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| { | ||
| "github.copilot.chat.mcpServers": { | ||
| "terraform": { | ||
| "command": "/home/a/arnol377/.local/bin/terraform-mcp-server", | ||
| "args": [], | ||
| "env": { | ||
| "TF_WORKSPACE_DIR": "/home/a/arnol377/git/ghe-runner" | ||
| } | ||
| } | ||
| } | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,159 @@ | ||
| # Documentation Review - January 15, 2026 | ||
|
|
||
| ## Summary of Updates | ||
|
|
||
| Updated documentation to accurately reflect the **persistent, long-running runner architecture** rather than describing them as ephemeral/dynamic containers. | ||
|
|
||
| ## Key Architectural Clarifications | ||
|
|
||
| ### Runner Model | ||
| - ✅ **CORRECT**: Runners are persistent, long-running ECS Fargate containers | ||
| - ✅ **CORRECT**: Runners stay active 24/7, polling GitHub for jobs | ||
| - ✅ **CORRECT**: Same runner handles multiple workflow jobs over its lifetime | ||
| - ✅ **CORRECT**: Runners only restart on: crash, manual stop, service deployment | ||
| - ❌ **INCORRECT** (Previous): Runners spin up dynamically per job | ||
|
|
||
| ### Token Lifecycle Understanding | ||
| - ✅ **CORRECT**: Registration token is ONLY used during container startup | ||
| - ✅ **CORRECT**: Running runners don't need token refresh (already registered) | ||
| - ✅ **CORRECT**: Lambda token refresh is insurance for ECS task restarts | ||
| - ✅ **CORRECT**: Deadlock occurs when all runners are down AND the token has expired | ||
| - ❌ **INCORRECT** (Previous): Implied tokens are needed continuously | ||
|
|
||
| ### ECS Auto-Recovery Behavior | ||
| - ✅ **CORRECT**: If a task dies, ECS automatically starts a replacement | ||
| - ✅ **CORRECT**: Replacement task needs valid token from Secrets Manager | ||
| - ✅ **CORRECT**: Lambda ensures fresh token available for automatic recovery | ||
| - ✅ **CORRECT**: Without a valid token, replacement tasks fail to register and the ECS service enters a crash loop | ||
|
|
||
| ## Files Updated | ||
|
|
||
| ### 1. README.md | ||
| **Changes:** | ||
| - Added "Runner Model" note emphasizing persistent containers | ||
| - Updated "Key Features" to include "Persistent Runners" and "Automated Token Refresh" | ||
| - Rewrote "Architecture" section with "Runner Lifecycle Model" | ||
| - Added detailed explanation of startup → active → job execution → restart cycle | ||
| - Updated architecture diagram to show "Persistent Runner" and lifecycle states | ||
| - Added Lambda Token Refresh component to diagram | ||
|
|
||
| **Key Additions:** | ||
| ``` | ||
| Runner Lifecycle Model: | ||
| 1. Startup: Reads token, registers with GitHub | ||
| 2. Active State: Stays online, polls for jobs | ||
| 3. Job Execution: Executes jobs, returns to polling | ||
| 4. Restart: Only on failure, manual stop, or update | ||
| 5. Auto-Recovery: ECS restarts tasks (requires valid token) | ||
| ``` | ||
|
|
||
| ### 2. RUNBOOK.md | ||
| **Changes:** | ||
| - Renamed section from "Token Lifecycle Dependency" to "Persistent Runners & Token Lifecycle" | ||
| - Added "Runner Architecture" subsection explaining 24/7 operation | ||
| - Clarified "Token Lifecycle & Deadlock Risk" with focus on startup-only token use | ||
| - Added "Why Lambda Token Refresh Matters" section | ||
| - Updated Scenario 1 impact assessment to clarify running vs. new runners | ||
| - Updated Scenario 2 impact assessment to explain reduced capacity implications | ||
| - Expanded Scenario 3 deadlock warning with detailed explanation | ||
| - Added "Task Crash Loop" to common root causes table | ||
|
|
||
| **Key Additions:** | ||
| ``` | ||
| Running runners: Already registered, don't need token refresh | ||
| Token refresh purpose: Ensures valid token for ECS task restarts | ||
| Deadlock scenario: ECS tries to restart → token expired → tasks fail → retry loop | ||
| ``` | ||
|
|
||
| ### 3. .github/copilot-instructions.md | ||
| **Changes:** | ||
| - Updated "Critical Understanding" section title and content | ||
| - Clarified that runners are "persistent, long-running containers (not ephemeral)" | ||
| - Explained registration token is "only during container startup (one-time registration)" | ||
| - Specified Lambda refreshes for "ECS task restarts" not continuous runner operation | ||
| - Detailed deadlock risk with distinction between running vs. restarting tasks | ||
|
|
||
| **Key Additions:** | ||
| ``` | ||
| Runners run continuously 24/7, handling multiple jobs | ||
| Registration token used only during container startup | ||
| Running tasks don't need token refresh (already registered) | ||
| Failed tasks being restarted by ECS need valid token | ||
| ``` | ||
|
|
||
| ### 4. lambda_token_refresh.tf | ||
| **Changes:** | ||
| - Expanded header comment from 2 lines to 12 lines | ||
| - Added "IMPORTANT" note about persistent runners | ||
| - Explained token lifecycle in detail | ||
| - Clarified purpose as "insurance for ECS automatic task recovery" | ||
| - Added critical scenario explanation | ||
|
|
||
| **Key Additions:** | ||
| ``` | ||
| IMPORTANT: Runners are persistent, long-running containers | ||
| Registration token ONLY needed during container startup | ||
| Token refresh purpose: Insurance for ECS automatic task recovery | ||
| Critical for: Preventing deadlock when all runners down + token expires | ||
| ``` | ||
|
|
||
| ### 5. lambda/token_refresh_pat.py | ||
| **Changes:** | ||
| - Expanded docstring from 7 lines to 17 lines | ||
| - Added "CRITICAL CONTEXT" section | ||
| - Detailed persistent runner architecture | ||
| - Explained deadlock scenario step-by-step | ||
| - Added schedule and authentication details | ||
|
|
||
| **Key Additions:** | ||
| ``` | ||
| CRITICAL CONTEXT: | ||
| - Runners are persistent, long-running ECS containers (not ephemeral) | ||
| - Registration tokens ONLY used during container startup | ||
| - Running runners don't need token refresh | ||
| - Purpose: Prevent deadlock scenario [detailed explanation] | ||
| ``` | ||
|
|
||
| ### 6. monitoring.tf | ||
| **Changes:** | ||
| - Expanded header comment from 7 lines to 13 lines | ||
| - Added "RUNNER MODEL" section | ||
| - Clarified monitoring tracks container health, not job execution | ||
| - Updated monitoring area descriptions | ||
|
|
||
| **Key Additions:** | ||
| ``` | ||
| RUNNER MODEL: Persistent, long-running containers (not ephemeral) | ||
| - Runners stay online 24/7, handling multiple jobs | ||
| - Only restart on: task failure, manual stop, service deployment | ||
| - Monitoring tracks runner CONTAINER health, not individual job execution | ||
| ``` | ||
|
|
||
| ## Documentation Consistency | ||
|
|
||
| All documentation now consistently reflects: | ||
|
|
||
| 1. **Runner Persistence**: Emphasized that runners are NOT ephemeral | ||
| 2. **Token Usage**: Clear that tokens are only for startup, not continuous operation | ||
| 3. **Lambda Purpose**: Reframed as "insurance" for ECS auto-recovery | ||
| 4. **Deadlock Risk**: Detailed explanation with precise conditions | ||
| 5. **ECS Behavior**: Clarified automatic task replacement mechanism | ||
| 6. **Monitoring Context**: Metrics track container health, not job execution | ||
|
|
||
| ## Benefits of These Updates | ||
|
|
||
| 1. **Operational Understanding**: Clearer picture of how the system actually works | ||
| 2. **Troubleshooting**: Better context for investigating runner issues | ||
| 3. **Cost Implications**: Understanding that runners run 24/7 (not per-job) | ||
| 4. **Monitoring Interpretation**: Metrics represent container state, not workflow state | ||
| 5. **Emergency Response**: More accurate mental model for incident response | ||
|
|
||
| ## No Configuration Changes | ||
|
|
||
| These updates are **documentation-only**. No infrastructure, code logic, or configuration was modified. The system operates exactly as before; the documentation has simply been corrected to match reality. | ||
|
|
||
| --- | ||
|
|
||
| **Review Date**: January 15, 2026 | ||
| **Reviewer**: GitHub Copilot (with user guidance) | ||
| **Status**: Complete ✅ |
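
The runner lifecycle described in this review (startup, active polling, job execution, restart, auto-recovery) can be read as a small state machine. The sketch below is illustrative only: the `State` enum, the event names, and `next_state` are hypothetical and do not come from the repository.

```python
from enum import Enum


class State(Enum):
    STARTUP = "startup"      # container reads the token and registers
    ACTIVE = "active"        # registered, polling GitHub for jobs
    EXECUTING = "executing"  # running a workflow job
    STOPPED = "stopped"      # task exited; ECS will try to replace it


# Events that take a registered runner offline, per the lifecycle model.
STOP_EVENTS = {"failure", "manual_stop", "deployment"}


def next_state(state: State, event: str, token_valid: bool = True) -> State:
    """Return the state that follows `event`; the token matters only at startup."""
    if state is State.STARTUP and event == "register":
        # Registration is the single point where the token is consumed.
        return State.ACTIVE if token_valid else State.STOPPED
    if state is State.ACTIVE and event == "job_assigned":
        return State.EXECUTING
    if state is State.EXECUTING and event == "job_finished":
        return State.ACTIVE
    if state in (State.ACTIVE, State.EXECUTING) and event in STOP_EVENTS:
        return State.STOPPED
    if state is State.STOPPED and event == "ecs_restart":
        return State.STARTUP
    raise ValueError(f"unexpected event {event!r} in state {state.name}")


if __name__ == "__main__":
    # Crash loop: with an expired token, every restart ends back in STOPPED.
    state = State.STOPPED
    for _ in range(3):
        state = next_state(state, "ecs_restart")
        state = next_state(state, "register", token_valid=False)
    print(state.name)
```

Only the `register` transition consults `token_valid`, which matches the point made throughout the review: running runners never need a fresh token, while every ECS replacement does.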