adding monitoring
Your Name committed Jan 16, 2026
1 parent fa18679 commit f3e80d9
Showing 43 changed files with 2,375 additions and 146 deletions.
131 changes: 131 additions & 0 deletions .github/copilot-instructions.md
@@ -0,0 +1,131 @@
# GitHub Copilot Instructions for ghe-runner Repository

## General Guidelines

### Terraform Commands
- **ALWAYS use the `tf` alias instead of `terraform` command**
- The `tf` alias performs important behind-the-scenes operations required for this environment
- Examples:
  - `tf plan` (correct)
  - `tf apply` (correct)
  - `terraform plan` (incorrect)
  - `terraform apply` (incorrect)

### Terminal Commands
- When running terminal commands, always use the `run_in_terminal` tool
- Set `isBackground=false` for commands that need output
- Set `isBackground=true` for long-running processes (servers, watches)

### AWS Authentication
- AWS credentials may expire during sessions
- The user refreshes credentials manually with the `awscreds` command
- Do not attempt to source AWS credentials automatically

### GitHub Authentication
- This project uses **token-only authentication** (GITHUB_TOKEN environment variable)
- GitHub App authentication is optional (variables have default = null)
- Never require GitHub App variables unless explicitly requested

## Project-Specific Context

### Infrastructure
- **Region**: us-gov-west-1 (AWS GovCloud)
- **ECS Cluster**: ecs-ghe-runners-us-gov-west-1
- **GitHub Enterprise**: github.e.it.census.gov
- **Organization**: SCT-Engineering
- **Proxy**: proxy.tco.census.gov:3128 (required for outbound traffic)

### Critical Understanding: Persistent Runners & Token Lifecycle
⚠️ **IMPORTANT**: Runners are **persistent, long-running containers** (not ephemeral):
- Runners run continuously 24/7, handling multiple jobs over their lifetime
- Registration token is used **only during container startup** (one-time registration)
- Lambda refreshes token every 30 min to ensure valid token for ECS task restarts
- **Deadlock risk**: If all runners die AND token expires, ECS cannot auto-recover
- Running tasks don't need token refresh (already registered)
- Failed tasks being restarted by ECS need valid token from Secrets Manager
- This is why monitoring and quick response are essential
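
The deadlock condition above can be written down as a small check. This is a minimal sketch, not code from this repository: the function names are hypothetical, and the one-hour token lifetime is an assumption that should be confirmed against the GitHub Enterprise instance.

```python
from datetime import timedelta

# Assumption: a runner registration token stays valid for one hour after issue.
TOKEN_TTL = timedelta(hours=1)


def can_auto_recover(token_age: timedelta, token_ttl: timedelta = TOKEN_TTL) -> bool:
    """A restarted task can register only while the stored token is still valid."""
    return token_age < token_ttl


def is_deadlocked(running_runners: int, token_age: timedelta,
                  token_ttl: timedelta = TOKEN_TTL) -> bool:
    """Deadlock: no runner is left AND a replacement task cannot register."""
    return running_runners == 0 and not can_auto_recover(token_age, token_ttl)
```

With a 30-minute refresh schedule the stored token should normally stay well under the assumed lifetime, so `is_deadlocked` only becomes true when the refresh has been failing for some time and every runner has stopped.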

### File Conventions
- Main configuration: `default.auto.tfvars`
- Example template: `example.tfvars.template` (do NOT rename to `.auto.tfvars`)
- Monitoring: `monitoring.tf`
- Emergency procedures: `RUNBOOK.md`

### Terraform Modules
- Primary module: `HappyPathway/github-runner/ecs`
- Optional ECR clone: `HappyPathway/ecr-clone/aws`
- Module outputs: Check `outputs.tf` before referencing module attributes

## Code Editing Guidelines

### When Making File Changes
1. Always read sufficient context before editing (5+ lines before/after)
2. Use `replace_string_in_file` with exact matches including whitespace
3. Never use placeholder comments like `...existing code...` in edits
4. Verify changes with `tf plan` after modifications

### When Implementing Features
1. Create a todo list for multi-step work
2. Mark items in-progress before starting
3. Mark items completed immediately after finishing
4. Update the list as new tasks are discovered

## Monitoring & Alerting

### Alert Configuration
- Alert email: david.j.arnold.jr@census.gov
- SNS topic: github-runner-critical-alerts
- Critical alarms: runners < 50% capacity, all runners down
- Dashboard: CloudWatch dashboard for visibility
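
The two critical alarm conditions can be sketched as a classification over the running and desired task counts. The function name and the severity labels are hypothetical; the real thresholds live in `monitoring.tf`, and this sketch assumes "< 50%" means strictly below half of the desired count.

```python
def classify_capacity(running: int, desired: int) -> str:
    """Map runner counts to an alert level (labels are illustrative only)."""
    if desired <= 0:
        raise ValueError("desired must be a positive task count")
    if running == 0:
        return "EMERGENCY"   # all runners down
    if running * 2 < desired:
        return "CRITICAL"    # strictly below 50% of desired capacity
    return "OK"
```

Integer arithmetic (`running * 2 < desired`) avoids floating-point comparisons at the exact 50% boundary.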

### Emergency Response
- Refer to `RUNBOOK.md` for incident procedures
- Three critical scenarios documented:
1. Lambda token refresh failing
2. Runners at 50% capacity
3. All runners down (EMERGENCY)

## Testing & Validation

### Before Committing
1. Run `tf plan` to validate configuration
2. Check for errors with `get_errors` tool if available
3. Verify outputs are as expected
4. Review changes in context of overall system

### After Deployment
1. Verify SNS email subscription confirmation
2. Check CloudWatch alarms are configured
3. Test dashboard accessibility
4. Document any lessons learned

## Common Issues & Solutions

### "Invalid AWS Region" Error
- Ensure `providers.tf` has `region = "us-gov-west-1"`

### "Unsupported attribute" on Module Outputs
- Check `outputs.tf` for available module outputs
- Use `var.repo_org` for service name, not `module.github-runner.service_name`

### Image Pull Failures
- Enable ECR clone: `enable_ecr_clone = true`
- Verify image version exists in source registry

### Token Expiration Risk
- Monitor Lambda execution via CloudWatch Logs
- Check token age in Secrets Manager
- Manual refresh available via Lambda invoke
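
A rough sketch of the last two checks follows. It assumes the clients behave like `boto3.client("secretsmanager")` and `boto3.client("lambda")`, and that the secret's `LastChangedDate` is a fair proxy for token age. The secret ID and function name are deliberately left as parameters because their real values are not stated here.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def token_age(secrets_client, secret_id: str,
              now: Optional[datetime] = None) -> timedelta:
    """Return the time elapsed since the secret was last changed.

    Assumes describe_secret returns a timezone-aware `LastChangedDate`.
    """
    now = now or datetime.now(timezone.utc)
    meta = secrets_client.describe_secret(SecretId=secret_id)
    return now - meta["LastChangedDate"]


def trigger_refresh(lambda_client, function_name: str) -> int:
    """Invoke the refresh Lambda synchronously and return the HTTP status code."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",
    )
    return response["StatusCode"]
```

Passing the clients in as arguments keeps the helpers easy to exercise with stand-in objects when credentials have expired.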

## Resources

- [Monitoring Plan](./MONITORING_IMPLEMENTATION_PLAN.md)
- [Emergency Runbook](./RUNBOOK.md)
- [GitHub App Setup](./GITHUB_APP_SETUP.md)
- [AWS Permissions](./AWS_PERMISSIONS.md)
- [Security Review](./SECURITY_REVIEW.md)

---

**Last Updated**: January 15, 2026
**Maintainer**: CSVD Team
2 changes: 2 additions & 0 deletions .gitignore
@@ -54,3 +54,5 @@ aws-image-pipeline/aws-image-pipeline
automation-repos/automation-repos
ghe-runners/ghe-runners
docker-image-pipeline/docker-image-pipeline

terraform_data_dirs
12 changes: 12 additions & 0 deletions .terraform_commits
@@ -88,5 +88,17 @@
"commit_message": "Add GitHub Actions Runner Setup Guide to README.md",
"author": "Your Name",
"timestamp": "2025-10-31T13:13:21.490997"
},
{
"commit_hash": "fa186792281de61333a09ed8477d865d96cb3ae8",
"commit_message": "feat(lambda): Implement GitHub Actions runner token refresh Lambda function\n\n- Added `token_refresh.py` to handle the token refresh logic.\n- Integrated AWS Secrets Manager for storing the GitHub registration token.\n- Utilized GitHub App authentication for secure API access.\n- Scheduled Lambda function to run every 30 minutes using CloudWatch Events.\n- Created necessary IAM roles and policies for Lambda execution.\n\nchore(lambda): Add requirements for token refresh Lambda\n\n- Added `requirements.txt` with dependencies: PyJWT and cryptography.\n\nfeat(terraform): Configure Lambda function and CloudWatch Events\n\n- Created Terraform configuration for the Lambda function and its dependencies.\n- Set up CloudWatch Event Rule to trigger the Lambda function every 30 minutes.\n- Configured IAM roles and policies for Lambda execution and Secrets Manager access.\n\ndocs(scripts): Add monitoring tools for GitHub Runner ECS services\n\n- Created monitoring scripts to track ECS service health and CloudWatch logs.\n- Added README with usage instructions and troubleshooting tips.\n- Implemented a continuous monitoring script using rich for better output formatting.\n\nchore(scripts): Add requirements for monitoring scripts\n\n- Added `requirements.txt` for monitoring scripts with dependencies: boto3, botocore, and rich.\n\nfix(scripts): Update monitoring script to use Terraform outputs\n\n- Modified `monitor_runners.py` to fetch necessary configuration from Terraform outputs.\n- Improved error handling and logging for better visibility.\n\nfeat(varfiles): Add configuration files for Terraform modules\n\n- Created JSON and TFVars files for managing Terraform workspace and GitHub organization settings.",
"author": "Your Name",
"timestamp": "2026-01-12T14:58:24.831561"
},
{
"commit_hash": "fa186792281de61333a09ed8477d865d96cb3ae8",
"commit_message": "feat(lambda): Implement GitHub Actions runner token refresh Lambda function\n\n- Added `token_refresh.py` to handle the token refresh logic.\n- Integrated AWS Secrets Manager for storing the GitHub registration token.\n- Utilized GitHub App authentication for secure API access.\n- Scheduled Lambda function to run every 30 minutes using CloudWatch Events.\n- Created necessary IAM roles and policies for Lambda execution.\n\nchore(lambda): Add requirements for token refresh Lambda\n\n- Added `requirements.txt` with dependencies: PyJWT and cryptography.\n\nfeat(terraform): Configure Lambda function and CloudWatch Events\n\n- Created Terraform configuration for the Lambda function and its dependencies.\n- Set up CloudWatch Event Rule to trigger the Lambda function every 30 minutes.\n- Configured IAM roles and policies for Lambda execution and Secrets Manager access.\n\ndocs(scripts): Add monitoring tools for GitHub Runner ECS services\n\n- Created monitoring scripts to track ECS service health and CloudWatch logs.\n- Added README with usage instructions and troubleshooting tips.\n- Implemented a continuous monitoring script using rich for better output formatting.\n\nchore(scripts): Add requirements for monitoring scripts\n\n- Added `requirements.txt` for monitoring scripts with dependencies: boto3, botocore, and rich.\n\nfix(scripts): Update monitoring script to use Terraform outputs\n\n- Modified `monitor_runners.py` to fetch necessary configuration from Terraform outputs.\n- Improved error handling and logging for better visibility.\n\nfeat(varfiles): Add configuration files for Terraform modules\n\n- Created JSON and TFVars files for managing Terraform workspace and GitHub organization settings.",
"author": "Your Name",
"timestamp": "2026-01-15T17:53:12.576503"
}
]
11 changes: 11 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,11 @@
{
"github.copilot.chat.mcpServers": {
"terraform": {
"command": "/home/a/arnol377/.local/bin/terraform-mcp-server",
"args": [],
"env": {
"TF_WORKSPACE_DIR": "/home/a/arnol377/git/ghe-runner"
}
}
}
}
159 changes: 159 additions & 0 deletions DOCUMENTATION_REVIEW_2026-01-15.md
@@ -0,0 +1,159 @@
# Documentation Review - January 15, 2026

## Summary of Updates

Updated documentation to accurately reflect the **persistent, long-running runner architecture** rather than describing them as ephemeral/dynamic containers.

## Key Architectural Clarifications

### Runner Model
- **CORRECT**: Runners are persistent, long-running ECS Fargate containers
- **CORRECT**: Runners stay active 24/7, polling GitHub for jobs
- **CORRECT**: Same runner handles multiple workflow jobs over its lifetime
- **CORRECT**: Runners only restart on: crash, manual stop, service deployment
- **INCORRECT** (Previous): Runners spin up dynamically per job

### Token Lifecycle Understanding
- **CORRECT**: Registration token is ONLY used during container startup
- **CORRECT**: Running runners don't need token refresh (already registered)
- **CORRECT**: Lambda token refresh is insurance for ECS task restarts
- **CORRECT**: Deadlock occurs when: all runners down + token expired
- **INCORRECT** (Previous): Implied tokens are needed continuously

### ECS Auto-Recovery Behavior
- **CORRECT**: If a task dies, ECS automatically starts a replacement
- **CORRECT**: Replacement task needs valid token from Secrets Manager
- **CORRECT**: Lambda ensures fresh token available for automatic recovery
- **CORRECT**: Without valid token, ECS enters crash loop

## Files Updated

### 1. README.md
**Changes:**
- Added "Runner Model" note emphasizing persistent containers
- Updated "Key Features" to include "Persistent Runners" and "Automated Token Refresh"
- Rewrote "Architecture" section with "Runner Lifecycle Model"
- Added detailed explanation of startup → active → job execution → restart cycle
- Updated architecture diagram to show "Persistent Runner" and lifecycle states
- Added Lambda Token Refresh component to diagram

**Key Additions:**
```
Runner Lifecycle Model:
1. Startup: Reads token, registers with GitHub
2. Active State: Stays online, polls for jobs
3. Job Execution: Executes jobs, returns to polling
4. Restart: Only on failure, manual stop, or update
5. Auto-Recovery: ECS restarts tasks (requires valid token)
```
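
The lifecycle above can be read as a small state machine. The sketch below is illustrative only: the state and event names are hypothetical and do not correspond to identifiers in the runner image or in ECS.

```python
from enum import Enum


class RunnerState(Enum):
    STARTUP = "startup"      # reads token, registers with GitHub
    ACTIVE = "active"        # online, polling for jobs
    EXECUTING = "executing"  # running a workflow job
    STOPPED = "stopped"      # failed, manually stopped, or replaced by an update


# Allowed transitions, keyed by (current state, event).
TRANSITIONS = {
    (RunnerState.STARTUP, "registered"): RunnerState.ACTIVE,
    (RunnerState.STARTUP, "registration_failed"): RunnerState.STOPPED,
    (RunnerState.ACTIVE, "job_assigned"): RunnerState.EXECUTING,
    (RunnerState.EXECUTING, "job_finished"): RunnerState.ACTIVE,
    (RunnerState.ACTIVE, "task_stopped"): RunnerState.STOPPED,
    (RunnerState.EXECUTING, "task_stopped"): RunnerState.STOPPED,
    (RunnerState.STOPPED, "ecs_restart"): RunnerState.STARTUP,
}


def next_state(state: RunnerState, event: str) -> RunnerState:
    """Return the state reached from `state` on `event`, or raise ValueError."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state.name}") from None
```

The only path back to `ACTIVE` after a stop runs through `STARTUP`, which is the step that needs a valid registration token. A `registration_failed` event followed by `ecs_restart` models the crash loop.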

### 2. RUNBOOK.md
**Changes:**
- Renamed section from "Token Lifecycle Dependency" to "Persistent Runners & Token Lifecycle"
- Added "Runner Architecture" subsection explaining 24/7 operation
- Clarified "Token Lifecycle & Deadlock Risk" with focus on startup-only token use
- Added "Why Lambda Token Refresh Matters" section
- Updated Scenario 1 impact assessment to clarify running vs. new runners
- Updated Scenario 2 impact assessment to explain reduced capacity implications
- Expanded Scenario 3 deadlock warning with detailed explanation
- Added "Task Crash Loop" to common root causes table

**Key Additions:**
```
Running runners: Already registered, don't need token refresh
Token refresh purpose: Ensures valid token for ECS task restarts
Deadlock scenario: ECS tries to restart → token expired → tasks fail → retry loop
```

### 3. .github/copilot-instructions.md
**Changes:**
- Updated "Critical Understanding" section title and content
- Clarified that runners are "persistent, long-running containers (not ephemeral)"
- Explained registration token is "only during container startup (one-time registration)"
- Specified Lambda refreshes for "ECS task restarts" not continuous runner operation
- Detailed deadlock risk with distinction between running vs. restarting tasks

**Key Additions:**
```
Runners run continuously 24/7, handling multiple jobs
Registration token used only during container startup
Running tasks don't need token refresh (already registered)
Failed tasks being restarted by ECS need valid token
```

### 4. lambda_token_refresh.tf
**Changes:**
- Expanded header comment from 2 lines to 12 lines
- Added "IMPORTANT" note about persistent runners
- Explained token lifecycle in detail
- Clarified purpose as "insurance for ECS automatic task recovery"
- Added critical scenario explanation

**Key Additions:**
```
IMPORTANT: Runners are persistent, long-running containers
Registration token ONLY needed during container startup
Token refresh purpose: Insurance for ECS automatic task recovery
Critical for: Preventing deadlock when all runners down + token expires
```

### 5. lambda/token_refresh_pat.py
**Changes:**
- Expanded docstring from 7 lines to 17 lines
- Added "CRITICAL CONTEXT" section
- Detailed persistent runner architecture
- Explained deadlock scenario step-by-step
- Added schedule and authentication details

**Key Additions:**
```
CRITICAL CONTEXT:
- Runners are persistent, long-running ECS containers (not ephemeral)
- Registration tokens ONLY used during container startup
- Running runners don't need token refresh
- Purpose: Prevent deadlock scenario [detailed explanation]
```

### 6. monitoring.tf
**Changes:**
- Expanded header comment from 7 lines to 13 lines
- Added "RUNNER MODEL" section
- Clarified monitoring tracks container health, not job execution
- Updated monitoring area descriptions

**Key Additions:**
```
RUNNER MODEL: Persistent, long-running containers (not ephemeral)
- Runners stay online 24/7, handling multiple jobs
- Only restart on: task failure, manual stop, service deployment
- Monitoring tracks runner CONTAINER health, not individual job execution
```

## Documentation Consistency

All documentation now consistently reflects:

1. **Runner Persistence**: Emphasized that runners are NOT ephemeral
2. **Token Usage**: Clear that tokens are only for startup, not continuous operation
3. **Lambda Purpose**: Reframed as "insurance" for ECS auto-recovery
4. **Deadlock Risk**: Detailed explanation with precise conditions
5. **ECS Behavior**: Clarified automatic task replacement mechanism
6. **Monitoring Context**: Metrics track container health, not job execution

## Benefits of These Updates

1. **Operational Understanding**: Clearer picture of how the system actually works
2. **Troubleshooting**: Better context for investigating runner issues
3. **Cost Implications**: Understanding that runners run 24/7 (not per-job)
4. **Monitoring Interpretation**: Metrics represent container state, not workflow state
5. **Emergency Response**: More accurate mental model for incident response

## No Configuration Changes

These updates are **documentation-only**. No infrastructure, code logic, or configuration was modified. The system operates exactly as before; the documentation has simply been corrected to match reality.

---

**Review Date**: January 15, 2026
**Reviewer**: GitHub Copilot (with user guidance)
**Status**: Complete ✅