stella-ops.org/git.stella-ops.org

Fork 0

Files

master 702a27ac83 synergy moats product advisory implementations

2026-01-17 01:32:20 +02:00

3.6 KiB

Raw Blame History

Runbook: [Component] - [Failure Scenario]

Sprint: SPRINT_20260117_029_DOCS_runbook_coverage Task: RUN-001 - Runbook Template

Metadata

Field	Value
Component	[Module name: Scanner, Policy, Orchestrator, Attestor, etc.]
Severity	Critical / High / Medium / Low
On-call scope	[Who should be paged: Platform team, Security team, etc.]
Last updated	[YYYY-MM-DD]
Doctor check	[Check ID if applicable, e.g., `check.scanner.worker-health`]

Symptoms

Observable indicators that this failure is occurring:

[Symptom 1: e.g., "Scan jobs stuck in pending state for >5 minutes"]
[Symptom 2: e.g., "Error logs contain 'worker timeout exceeded'"]
[Metric/alert that fires: e.g., "Alert ScannerWorkerStuck firing"]

Impact

Impact Type	Description
User-facing	[e.g., "New scans cannot complete, blocking CI/CD pipelines"]
Data integrity	[e.g., "No data loss, but stale scan results may be served"]
SLA impact	[e.g., "Scan latency SLO violated if not resolved within 15 minutes"]

Diagnosis

Quick checks (< 2 minutes)

Run these first to confirm the failure:

Check Doctor diagnostics:

stella doctor --check [relevant-check-id]

Check service status:
```
stella [component] status
```

Check recent logs:

stella [component] logs --tail 50 --level error

Deep diagnosis (if quick checks inconclusive)

[Investigation step 1]:
```
[command]
```
Expected output: [description] If unexpected: [what it means]
[Investigation step 2]:
```
[command]
```
Check related services:
- Postgres connectivity: stella doctor --check check.storage.postgres
- Valkey connectivity: stella doctor --check check.storage.valkey
- Network connectivity: stella doctor --check check.network.[target]

Resolution

Immediate mitigation (restore service quickly)

Use these steps to restore service, even if root cause isn't fixed yet:

[Mitigation step 1]:
```
[command]
```
This will: [explanation]
[Mitigation step 2]:
```
[command]
```

Root cause fix

Once service is restored, address the underlying issue:

[Fix step 1]:
```
[command]
```
[Fix step 2]:
```
[command]
```

Verify fix is complete:

stella doctor --check [relevant-check-id]

Verification

Confirm the issue is fully resolved:

# Re-run the failing operation
stella [component] [test-command]

# Verify metrics are healthy
stella obs metrics --filter [component] --last 5m

# Verify no new errors in logs
stella [component] logs --tail 20 --level error

Prevention

How to prevent this failure from recurring:

Monitoring: [e.g., "Add alert for queue depth > 100"]
Configuration: [e.g., "Increase worker count in high-volume environments"]
Code change: [e.g., "Implement circuit breaker for external service calls"]
Documentation: [e.g., "Update capacity planning guide"]

Architecture doc: [Link to relevant architecture documentation]
Related runbooks: [Links to related failure scenarios]
Doctor check source: [Link to Doctor check implementation]
Grafana dashboard: [Link to relevant dashboard]

Revision History

Date	Author	Changes
YYYY-MM-DD	[Name]	Initial version

3.6 KiB Raw Blame History