# Runbook: [Component] - [Failure Scenario] > **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage > **Task:** RUN-001 - Runbook Template ## Metadata | Field | Value | |-------|-------| | **Component** | [Module name: Scanner, Policy, Orchestrator, Attestor, etc.] | | **Severity** | Critical / High / Medium / Low | | **On-call scope** | [Who should be paged: Platform team, Security team, etc.] | | **Last updated** | [YYYY-MM-DD] | | **Doctor check** | [Check ID if applicable, e.g., `check.scanner.worker-health`] | --- ## Symptoms Observable indicators that this failure is occurring: - [ ] [Symptom 1: e.g., "Scan jobs stuck in pending state for >5 minutes"] - [ ] [Symptom 2: e.g., "Error logs contain 'worker timeout exceeded'"] - [ ] [Metric/alert that fires: e.g., "Alert `ScannerWorkerStuck` firing"] --- ## Impact | Impact Type | Description | |-------------|-------------| | **User-facing** | [e.g., "New scans cannot complete, blocking CI/CD pipelines"] | | **Data integrity** | [e.g., "No data loss, but stale scan results may be served"] | | **SLA impact** | [e.g., "Scan latency SLO violated if not resolved within 15 minutes"] | --- ## Diagnosis ### Quick checks (< 2 minutes) Run these first to confirm the failure: 1. **Check Doctor diagnostics:** ```bash stella doctor --check [relevant-check-id] ``` 2. **Check service status:** ```bash stella [component] status ``` 3. **Check recent logs:** ```bash stella [component] logs --tail 50 --level error ``` ### Deep diagnosis (if quick checks inconclusive) 1. **[Investigation step 1]:** ```bash [command] ``` Expected output: [description] If unexpected: [what it means] 2. **[Investigation step 2]:** ```bash [command] ``` 3. **Check related services:** - Postgres connectivity: `stella doctor --check check.storage.postgres` - Valkey connectivity: `stella doctor --check check.storage.valkey` - Network connectivity: `stella doctor --check check.network.[target]` --- ## Resolution ### Immediate mitigation (restore service quickly) Use these steps to restore service, even if root cause isn't fixed yet: 1. **[Mitigation step 1]:** ```bash [command] ``` This will: [explanation] 2. **[Mitigation step 2]:** ```bash [command] ``` ### Root cause fix Once service is restored, address the underlying issue: 1. **[Fix step 1]:** ```bash [command] ``` 2. **[Fix step 2]:** ```bash [command] ``` 3. **Verify fix is complete:** ```bash stella doctor --check [relevant-check-id] ``` ### Verification Confirm the issue is fully resolved: ```bash # Re-run the failing operation stella [component] [test-command] # Verify metrics are healthy stella obs metrics --filter [component] --last 5m # Verify no new errors in logs stella [component] logs --tail 20 --level error ``` --- ## Prevention How to prevent this failure from recurring: - [ ] **Monitoring:** [e.g., "Add alert for queue depth > 100"] - [ ] **Configuration:** [e.g., "Increase worker count in high-volume environments"] - [ ] **Code change:** [e.g., "Implement circuit breaker for external service calls"] - [ ] **Documentation:** [e.g., "Update capacity planning guide"] --- ## Related Resources - **Architecture doc:** [Link to relevant architecture documentation] - **Related runbooks:** [Links to related failure scenarios] - **Doctor check source:** [Link to Doctor check implementation] - **Grafana dashboard:** [Link to relevant dashboard] --- ## Revision History | Date | Author | Changes | |------|--------|---------| | YYYY-MM-DD | [Name] | Initial version |