synergy moats product advisory implementations

2026-01-17 01:30:03 +02:00
parent 77ff029205
commit d8d9c0a6e3
106 changed files with 20603 additions and 123 deletions
--- a/docs/operations/runbooks/_template.md
+++ b/docs/operations/runbooks/_template.md
@@ -0,0 +1,157 @@
+# Runbook: [Component] - [Failure Scenario]
+
+> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
+> **Task:** RUN-001 - Runbook Template
+
+## Metadata
+
+| Field | Value |
+|-------|-------|
+| **Component** | [Module name: Scanner, Policy, Orchestrator, Attestor, etc.] |
+| **Severity** | Critical / High / Medium / Low |
+| **On-call scope** | [Who should be paged: Platform team, Security team, etc.] |
+| **Last updated** | [YYYY-MM-DD] |
+| **Doctor check** | [Check ID if applicable, e.g., `check.scanner.worker-health`] |
+
+---
+
+## Symptoms
+
+Observable indicators that this failure is occurring:
+
+- [ ] [Symptom 1: e.g., "Scan jobs stuck in pending state for >5 minutes"]
+- [ ] [Symptom 2: e.g., "Error logs contain 'worker timeout exceeded'"]
+- [ ] [Metric/alert that fires: e.g., "Alert `ScannerWorkerStuck` firing"]
+
+---
+
+## Impact
+
+| Impact Type | Description |
+|-------------|-------------|
+| **User-facing** | [e.g., "New scans cannot complete, blocking CI/CD pipelines"] |
+| **Data integrity** | [e.g., "No data loss, but stale scan results may be served"] |
+| **SLA impact** | [e.g., "Scan latency SLO violated if not resolved within 15 minutes"] |
+
+---
+
+## Diagnosis
+
+### Quick checks (< 2 minutes)
+
+Run these first to confirm the failure:
+
+1. **Check Doctor diagnostics:**
+   ```bash
+   stella doctor --check [relevant-check-id]
+   ```
+
+2. **Check service status:**
+   ```bash
+   stella [component] status
+   ```
+
+3. **Check recent logs:**
+   ```bash
+   stella [component] logs --tail 50 --level error
+   ```
+
+### Deep diagnosis (if quick checks are inconclusive)
+
+1. **[Investigation step 1]:**
+   ```bash
+   [command]
+   ```
+   - Expected output: [description]
+   - If unexpected: [what it means]
+
+2. **[Investigation step 2]:**
+   ```bash
+   [command]
+   ```
+
+3. **Check related services:**
+   - Postgres connectivity: `stella doctor --check check.storage.postgres`
+   - Valkey connectivity: `stella doctor --check check.storage.valkey`
+   - Network connectivity: `stella doctor --check check.network.[target]`
+
+---
+
+## Resolution
+
+### Immediate mitigation (restore service quickly)
+
+Use these steps to restore service, even if the root cause isn't fixed yet:
+
+1. **[Mitigation step 1]:**
+   ```bash
+   [command]
+   ```
+   This will: [explanation]
+
+2. **[Mitigation step 2]:**
+   ```bash
+   [command]
+   ```
+
+### Root cause fix
+
+Once service is restored, address the underlying issue:
+
+1. **[Fix step 1]:**
+   ```bash
+   [command]
+   ```
+
+2. **[Fix step 2]:**
+   ```bash
+   [command]
+   ```
+
+3. **Verify the fix is complete:**
+   ```bash
+   stella doctor --check [relevant-check-id]
+   ```
+
+### Verification
+
+Confirm the issue is fully resolved:
+
+```bash
+# Re-run the failing operation
+stella [component] [test-command]
+
+# Verify metrics are healthy
+stella obs metrics --filter [component] --last 5m
+
+# Verify no new errors in logs
+stella [component] logs --tail 20 --level error
+```
+
+---
+
+## Prevention
+
+How to prevent this failure from recurring:
+
+- [ ] **Monitoring:** [e.g., "Add alert for queue depth > 100"]
+- [ ] **Configuration:** [e.g., "Increase worker count in high-volume environments"]
+- [ ] **Code change:** [e.g., "Implement circuit breaker for external service calls"]
+- [ ] **Documentation:** [e.g., "Update capacity planning guide"]
+
+---
+
+## Related Resources
+
+- **Architecture doc:** [Link to relevant architecture documentation]
+- **Related runbooks:** [Links to related failure scenarios]
+- **Doctor check source:** [Link to Doctor check implementation]
+- **Grafana dashboard:** [Link to relevant dashboard]
+
+---
+
+## Revision History
+
+| Date | Author | Changes |
+|------|--------|---------|
+| YYYY-MM-DD | [Name] | Initial version |