Files
git.stella-ops.org/docs/operations/runbooks/_template.md

3.6 KiB

Runbook: [Component] - [Failure Scenario]

Sprint: SPRINT_20260117_029_DOCS_runbook_coverage Task: RUN-001 - Runbook Template

Metadata

Field Value
Component [Module name: Scanner, Policy, Orchestrator, Attestor, etc.]
Severity Critical / High / Medium / Low
On-call scope [Who should be paged: Platform team, Security team, etc.]
Last updated [YYYY-MM-DD]
Doctor check [Check ID if applicable, e.g., check.scanner.worker-health]

Symptoms

Observable indicators that this failure is occurring:

  • [Symptom 1: e.g., "Scan jobs stuck in pending state for >5 minutes"]
  • [Symptom 2: e.g., "Error logs contain 'worker timeout exceeded'"]
  • [Metric/alert that fires: e.g., "Alert ScannerWorkerStuck firing"]

Impact

Impact Type Description
User-facing [e.g., "New scans cannot complete, blocking CI/CD pipelines"]
Data integrity [e.g., "No data loss, but stale scan results may be served"]
SLA impact [e.g., "Scan latency SLO violated if not resolved within 15 minutes"]

Diagnosis

Quick checks (< 2 minutes)

Run these first to confirm the failure:

  1. Check Doctor diagnostics:

    stella doctor --check [relevant-check-id]
    
  2. Check service status:

    stella [component] status
    
  3. Check recent logs:

    stella [component] logs --tail 50 --level error
    

Deep diagnosis (if quick checks inconclusive)

  1. [Investigation step 1]:

    [command]
    

    Expected output: [description] If unexpected: [what it means]

  2. [Investigation step 2]:

    [command]
    
  3. Check related services:

    • Postgres connectivity: stella doctor --check check.storage.postgres
    • Valkey connectivity: stella doctor --check check.storage.valkey
    • Network connectivity: stella doctor --check check.network.[target]

Resolution

Immediate mitigation (restore service quickly)

Use these steps to restore service, even if root cause isn't fixed yet:

  1. [Mitigation step 1]:

    [command]
    

    This will: [explanation]

  2. [Mitigation step 2]:

    [command]
    

Root cause fix

Once service is restored, address the underlying issue:

  1. [Fix step 1]:

    [command]
    
  2. [Fix step 2]:

    [command]
    
  3. Verify fix is complete:

    stella doctor --check [relevant-check-id]
    

Verification

Confirm the issue is fully resolved:

# Re-run the failing operation
stella [component] [test-command]

# Verify metrics are healthy
stella obs metrics --filter [component] --last 5m

# Verify no new errors in logs
stella [component] logs --tail 20 --level error

Prevention

How to prevent this failure from recurring:

  • Monitoring: [e.g., "Add alert for queue depth > 100"]
  • Configuration: [e.g., "Increase worker count in high-volume environments"]
  • Code change: [e.g., "Implement circuit breaker for external service calls"]
  • Documentation: [e.g., "Update capacity planning guide"]

  • Architecture doc: [Link to relevant architecture documentation]
  • Related runbooks: [Links to related failure scenarios]
  • Doctor check source: [Link to Doctor check implementation]
  • Grafana dashboard: [Link to relevant dashboard]

Revision History

Date Author Changes
YYYY-MM-DD [Name] Initial version