synergy moats product advisory implementations
157
docs/operations/runbooks/_template.md
Normal file
@@ -0,0 +1,157 @@
# Runbook: [Component] - [Failure Scenario]

> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-001 - Runbook Template

## Metadata

| Field | Value |
|-------|-------|
| **Component** | [Module name: Scanner, Policy, Orchestrator, Attestor, etc.] |
| **Severity** | Critical / High / Medium / Low |
| **On-call scope** | [Who should be paged: Platform team, Security team, etc.] |
| **Last updated** | [YYYY-MM-DD] |
| **Doctor check** | [Check ID if applicable, e.g., `check.scanner.worker-health`] |

---

## Symptoms

Observable indicators that this failure is occurring:

- [ ] [Symptom 1: e.g., "Scan jobs stuck in pending state for >5 minutes"]
- [ ] [Symptom 2: e.g., "Error logs contain 'worker timeout exceeded'"]
- [ ] [Metric/alert that fires: e.g., "Alert `ScannerWorkerStuck` firing"]

---

## Impact

| Impact Type | Description |
|-------------|-------------|
| **User-facing** | [e.g., "New scans cannot complete, blocking CI/CD pipelines"] |
| **Data integrity** | [e.g., "No data loss, but stale scan results may be served"] |
| **SLA impact** | [e.g., "Scan latency SLO violated if not resolved within 15 minutes"] |

---

## Diagnosis

### Quick checks (< 2 minutes)

Run these first to confirm the failure:

1. **Check Doctor diagnostics:**
   ```bash
   stella doctor --check [relevant-check-id]
   ```

2. **Check service status:**
   ```bash
   stella [component] status
   ```

3. **Check recent logs:**
   ```bash
   stella [component] logs --tail 50 --level error
   ```

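The three quick checks above can be wrapped in one function so they run in a fixed order and leave a timestamped record for the incident timeline. This is a minimal sketch, not part of the `stella` CLI: the function name `quick_checks`, the `STELLA` override variable, and the assumption that each `stella` command exits non-zero on failure are all illustrative.

```bash
#!/usr/bin/env bash
# Sketch only: runs the template's three quick checks in order.
# STELLA is a hypothetical override so the function can be dry-run with a stub.
STELLA="${STELLA:-stella}"

# quick_checks COMPONENT CHECK_ID
# Returns 1 if any of the three commands exits non-zero (assumed failure signal).
quick_checks() {
  local component="$1" check_id="$2"
  local rc=0

  echo "== $(date -u +%Y-%m-%dT%H:%M:%SZ) quick checks for ${component} =="

  echo "-- doctor check: ${check_id}"
  "$STELLA" doctor --check "$check_id" || rc=1

  echo "-- service status"
  "$STELLA" "$component" status || rc=1

  echo "-- recent error logs"
  "$STELLA" "$component" logs --tail 50 --level error || rc=1

  return "$rc"
}
```

A call such as `quick_checks [component] [relevant-check-id] 2>&1 | tee quick-checks.log` keeps a copy of the output to attach to the incident record.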
### Deep diagnosis (if quick checks are inconclusive)

1. **[Investigation step 1]:**
   ```bash
   [command]
   ```
   Expected output: [description]

   If unexpected: [what it means]

2. **[Investigation step 2]:**
   ```bash
   [command]
   ```

3. **Check related services:**
   - Postgres connectivity: `stella doctor --check check.storage.postgres`
   - Valkey connectivity: `stella doctor --check check.storage.valkey`
   - Network connectivity: `stella doctor --check check.network.[target]`

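The related-service checks above can be run as a single pass that prints one PASS/FAIL line per dependency. This is a hedged sketch: `check_dependencies` and the `STELLA` override are hypothetical names, and it assumes `stella doctor --check` exits non-zero when a check fails, which should be confirmed against the real CLI before relying on it.

```bash
#!/usr/bin/env bash
# Sketch only: loops over Doctor check IDs and summarizes the results.
# STELLA is a hypothetical override so the function can be dry-run with a stub.
STELLA="${STELLA:-stella}"

# check_dependencies CHECK_ID...
# Prints PASS/FAIL per check; returns non-zero if any check failed.
check_dependencies() {
  local failed=0 check_id
  for check_id in "$@"; do
    # Assumption: a non-zero exit status means the check failed.
    if "$STELLA" doctor --check "$check_id" >/dev/null 2>&1; then
      echo "PASS ${check_id}"
    else
      echo "FAIL ${check_id}"
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]
}

# Example call (check IDs taken from the list above):
#   check_dependencies check.storage.postgres check.storage.valkey
```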
---

## Resolution

### Immediate mitigation (restore service quickly)

Use these steps to restore service, even if the root cause isn't fixed yet:

1. **[Mitigation step 1]:**
   ```bash
   [command]
   ```
   This will: [explanation]

2. **[Mitigation step 2]:**
   ```bash
   [command]
   ```

### Root cause fix

Once service is restored, address the underlying issue:

1. **[Fix step 1]:**
   ```bash
   [command]
   ```

2. **[Fix step 2]:**
   ```bash
   [command]
   ```

3. **Verify fix is complete:**
   ```bash
   stella doctor --check [relevant-check-id]
   ```

### Verification

Confirm the issue is fully resolved:

```bash
# Re-run the failing operation
stella [component] [test-command]

# Verify metrics are healthy
stella obs metrics --filter [component] --last 5m

# Verify no new errors in logs
stella [component] logs --tail 20 --level error
```

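Recovery is often not immediate after a fix, so the Doctor check may need to be polled rather than run once. The sketch below retries the check a bounded number of times. The function name `wait_until_healthy`, its default attempt count and delay, the `STELLA` override, and the reliance on a non-zero exit status for a failing check are all assumptions for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: polls a Doctor check until it passes or attempts run out.
# STELLA is a hypothetical override so the function can be dry-run with a stub.
STELLA="${STELLA:-stella}"

# wait_until_healthy CHECK_ID [ATTEMPTS] [DELAY_SECONDS]
wait_until_healthy() {
  local check_id="$1" attempts="${2:-10}" delay="${3:-30}"
  local i
  for ((i = 1; i <= attempts; i++)); do
    # Assumption: exit status 0 means the check passed.
    if "$STELLA" doctor --check "$check_id" >/dev/null 2>&1; then
      echo "healthy after ${i} attempt(s)"
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  echo "still failing after ${attempts} attempt(s)" >&2
  return 1
}

# Example call: up to 10 attempts, 30 seconds apart.
#   wait_until_healthy [relevant-check-id] 10 30
```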
---

## Prevention

How to prevent this failure from recurring:

- [ ] **Monitoring:** [e.g., "Add alert for queue depth > 100"]
- [ ] **Configuration:** [e.g., "Increase worker count in high-volume environments"]
- [ ] **Code change:** [e.g., "Implement circuit breaker for external service calls"]
- [ ] **Documentation:** [e.g., "Update capacity planning guide"]

---

## Related Resources

- **Architecture doc:** [Link to relevant architecture documentation]
- **Related runbooks:** [Links to related failure scenarios]
- **Doctor check source:** [Link to Doctor check implementation]
- **Grafana dashboard:** [Link to relevant dashboard]

---

## Revision History

| Date | Author | Changes |
|------|--------|---------|
| YYYY-MM-DD | [Name] | Initial version |