Runbook: [Component] - [Failure Scenario]
Sprint: SPRINT_20260117_029_DOCS_runbook_coverage Task: RUN-001 - Runbook Template
Metadata
| Field | Value |
|---|---|
| Component | [Module name: Scanner, Policy, Orchestrator, Attestor, etc.] |
| Severity | Critical / High / Medium / Low |
| On-call scope | [Who should be paged: Platform team, Security team, etc.] |
| Last updated | [YYYY-MM-DD] |
| Doctor check | [Check ID if applicable, e.g., check.scanner.worker-health] |
Symptoms
Observable indicators that this failure is occurring:
- [Symptom 1: e.g., "Scan jobs stuck in pending state for >5 minutes"]
- [Symptom 2: e.g., "Error logs contain 'worker timeout exceeded'"]
- [Metric/alert that fires: e.g., "Alert ScannerWorkerStuck firing"]
Impact
| Impact Type | Description |
|---|---|
| User-facing | [e.g., "New scans cannot complete, blocking CI/CD pipelines"] |
| Data integrity | [e.g., "No data loss, but stale scan results may be served"] |
| SLA impact | [e.g., "Scan latency SLO violated if not resolved within 15 minutes"] |
Diagnosis
Quick checks (< 2 minutes)
Run these first to confirm the failure:
- Check Doctor diagnostics:
    stella doctor --check [relevant-check-id]
- Check service status:
    stella [component] status
- Check recent logs:
    stella [component] logs --tail 50 --level error
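The three quick checks above can be chained into a single pass so that on-call gets one verdict. The following is a minimal sketch, not part of the stella tooling: it assumes each `stella` subcommand exits non-zero on failure (the template does not state this), and `quick_checks` and the `STELLA` variable are hypothetical names. The two arguments stand in for the `[component]` and `[relevant-check-id]` placeholders.

```shell
#!/usr/bin/env bash
# Sketch only: assumes each stella subcommand exits non-zero on failure.
STELLA="${STELLA:-stella}"   # hypothetical override, e.g. for a stub in tests

quick_checks() {
  local component="$1" check_id="$2" failed=0

  echo "== Doctor check: ${check_id}"
  "$STELLA" doctor --check "$check_id" || failed=1

  echo "== Service status: ${component}"
  "$STELLA" "$component" status || failed=1

  echo "== Recent error logs: ${component}"
  "$STELLA" "$component" logs --tail 50 --level error || failed=1

  # Non-zero if any of the three checks failed; all three always run.
  return "$failed"
}
```

A call might look like `quick_checks scanner check.scanner.worker-health`, where the lowercase component name `scanner` is an assumption about how the CLI spells it.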
Deep diagnosis (if quick checks inconclusive)
- [Investigation step 1]:
    [command]
  Expected output: [description]
  If unexpected: [what it means]
- [Investigation step 2]:
    [command]
- Check related services:
  - Postgres connectivity:
      stella doctor --check check.storage.postgres
  - Valkey connectivity:
      stella doctor --check check.storage.valkey
  - Network connectivity:
      stella doctor --check check.network.[target]
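The related-service checks can be looped over so that a failing dependency stands out at a glance. This is a rough sketch under the assumption that `stella doctor --check` exits non-zero when a check fails; `check_dependencies` and the `STELLA` variable are hypothetical names introduced here for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: assumes `stella doctor --check` exits non-zero when a check fails.
STELLA="${STELLA:-stella}"   # hypothetical override, e.g. for a stub in tests

check_dependencies() {
  # Takes check IDs as arguments and prints one OK/FAIL line per check.
  local failures=0 check_id
  for check_id in "$@"; do
    if "$STELLA" doctor --check "$check_id" >/dev/null 2>&1; then
      echo "OK    ${check_id}"
    else
      echo "FAIL  ${check_id}"
      failures=$((failures + 1))
    fi
  done
  # Succeeds only when every check passed.
  [ "$failures" -eq 0 ]
}
```

For example: `check_dependencies check.storage.postgres check.storage.valkey`. The command output is discarded here to keep the summary short, so re-run any failing check by hand to see its details.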
Resolution
Immediate mitigation (restore service quickly)
Use these steps to restore service, even if root cause isn't fixed yet:
- [Mitigation step 1]:
    [command]
  This will: [explanation]
- [Mitigation step 2]:
    [command]
Root cause fix
Once service is restored, address the underlying issue:
- [Fix step 1]:
    [command]
- [Fix step 2]:
    [command]
- Verify fix is complete:
    stella doctor --check [relevant-check-id]
Verification
Confirm the issue is fully resolved:
# Re-run the failing operation
stella [component] [test-command]
# Verify metrics are healthy
stella obs metrics --filter [component] --last 5m
# Verify no new errors in logs
stella [component] logs --tail 20 --level error
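A service may need some time to settle after a fix, so it can help to poll the Doctor check rather than run it once. The sketch below is one possible approach, assuming `stella doctor --check` exits zero once the check passes. `wait_until_healthy`, the `STELLA` variable, and the default attempt count and interval are all hypothetical choices, not values from this template.

```shell
#!/usr/bin/env bash
# Sketch only: assumes `stella doctor --check` exits zero once the check passes.
STELLA="${STELLA:-stella}"   # hypothetical override, e.g. for a stub in tests

wait_until_healthy() {
  # Arguments: check ID, max attempts (default 10), seconds between attempts (default 30).
  local check_id="$1" attempts="${2:-10}" interval="${3:-30}" i=1
  while [ "$i" -le "$attempts" ]; do
    if "$STELLA" doctor --check "$check_id" >/dev/null 2>&1; then
      echo "healthy after ${i} attempt(s)"
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    [ "$i" -lt "$attempts" ] && sleep "$interval"
    i=$((i + 1))
  done
  echo "still failing after ${attempts} attempt(s)" >&2
  return 1
}
```

For example, `wait_until_healthy check.scanner.worker-health 10 30` would poll for up to roughly five minutes before giving up.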
Prevention
How to prevent this failure from recurring:
- Monitoring: [e.g., "Add alert for queue depth > 100"]
- Configuration: [e.g., "Increase worker count in high-volume environments"]
- Code change: [e.g., "Implement circuit breaker for external service calls"]
- Documentation: [e.g., "Update capacity planning guide"]
Related Resources
- Architecture doc: [Link to relevant architecture documentation]
- Related runbooks: [Links to related failure scenarios]
- Doctor check source: [Link to Doctor check implementation]
- Grafana dashboard: [Link to relevant dashboard]
Revision History
| Date | Author | Changes |
|---|---|---|
| YYYY-MM-DD | [Name] | Initial version |