git.stella-ops.org/devops/observability/signals-playbook.md
2025-12-26 18:11:06 +02:00

Signals Pipeline Playbook

Scope: Signals ingestion, cache, scoring, and sensor freshness.

Dashboards

  • Grafana: import ops/devops/observability/grafana/signals-pipeline.json (datasource Prometheus).
  • Key tiles: Scoring p95, Cache hit ratio, Sensor staleness, Ingestion outcomes.

Alerts

  • Rules: ops/devops/observability/signals-alerts.yaml
    • SignalsScoringLatencyP95High (p95 > 2s for 10m)
    • SignalsCacheMissRateHigh (miss ratio >30% for 10m)
    • SignalsCacheDown
    • SignalsSensorStaleness (no update >15m)
    • SignalsIngestionErrorRate (failures >5%)
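
As a rough illustration of the two ratio-based rules above, the sketch below computes a miss ratio and an ingestion failure ratio from counter increases over the alert window. It assumes the miss ratio is misses / (hits + misses) and that signals_ingestion_total counts every attempt, failures included; neither definition is confirmed by the rules file. The function and constant names are hypothetical, and the "for 10m" hold duration is not modelled.

```python
# Illustrative only: the authoritative thresholds live in signals-alerts.yaml.
MISS_RATIO_THRESHOLD = 0.30         # SignalsCacheMissRateHigh
INGESTION_FAILURE_THRESHOLD = 0.05  # SignalsIngestionErrorRate


def safe_ratio(part: float, whole: float) -> float:
    """Return part / whole, or 0.0 when the window saw no traffic."""
    return part / whole if whole > 0 else 0.0


def cache_miss_ratio(hits: float, misses: float) -> float:
    """Assumed definition: misses / (hits + misses) over the window."""
    return safe_ratio(misses, hits + misses)


def ingestion_failure_ratio(total: float, failures: float) -> float:
    """Assumes `total` already includes failed attempts."""
    return safe_ratio(failures, total)


def breaches(hits: float, misses: float, total: float, failures: float) -> list:
    """Return the names of ratio-based alerts whose threshold is exceeded."""
    fired = []
    if cache_miss_ratio(hits, misses) > MISS_RATIO_THRESHOLD:
        fired.append("SignalsCacheMissRateHigh")
    if ingestion_failure_ratio(total, failures) > INGESTION_FAILURE_THRESHOLD:
        fired.append("SignalsIngestionErrorRate")
    return fired
```

Treating an empty window as a ratio of 0.0 keeps a quiet pipeline from tripping either rule; whether the real rules behave that way is an assumption.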

Runbook

  1. Scoring latency high
    • Check Mongo/Redis health; inspect CPU on workers.
    • Scale Signals API pods or increase cache TTL to reduce load.
  2. Cache miss rate / cache down
    • Validate Redis connectivity and ACLs; flushing the cache is not recommended unless there is a key explosion.
    • Increase cache TTL; ensure connection string matches deployment.
  3. Sensor staleness
    • Identify stale sensors from alert label; verify upstream pipeline/log shipping.
    • If a sensor has been retired, update the allowlist to silence the expected gaps.
  4. Ingestion errors
    • Tail ingestion logs; classify errors (schema vs. storage).
    • If artifacts are rejected, check the storage path and disk usage; add capacity or rotate.
  5. Verification
    • After mitigation, confirm that cache hit ratio is >90%, scoring p95 is <2s, and the staleness panel is back near baseline (<5m).
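
The verification step can be sketched as a small check over values read off the dashboard. This is a minimal illustration, not part of the Signals tooling: PipelineSnapshot and failed_checks are hypothetical names, and the values are assumed to be entered by hand or fetched by whatever query tooling the team already uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PipelineSnapshot:
    """Values read from the dashboard after mitigation (hypothetical type)."""
    cache_hit_ratio: float           # fraction between 0.0 and 1.0
    scoring_p95_seconds: float
    sensor_staleness_seconds: float  # worst staleness shown on the panel


def failed_checks(snapshot: PipelineSnapshot) -> List[str]:
    """Return a description of every verification target that is not met."""
    failures = []
    if not snapshot.cache_hit_ratio > 0.90:
        failures.append("cache hit ratio is not above 90%")
    if not snapshot.scoring_p95_seconds < 2.0:
        failures.append("scoring p95 is not below 2s")
    if not snapshot.sensor_staleness_seconds < 5 * 60:
        failures.append("sensor staleness is not below 5m")
    return failures
```

An empty list means all three targets from the verification step are met; any entries name the areas that still need attention.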

Escalation

  • Primary: Signals on-call.
  • Secondary: DevOps Guild (observability).
  • Page when critical alerts persist for more than 20m, or when the cache-down and scoring-latency alerts fire at the same time.
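
The paging rule above can be expressed as a short predicate. This is a sketch under stated assumptions: should_page is a hypothetical helper, the caller supplies which alert names count as critical (the playbook does not list severities), and firing_minutes maps each currently firing alert to how long it has been firing.

```python
from typing import Dict, Set


def should_page(firing_minutes: Dict[str, float], critical: Set[str]) -> bool:
    """Decide whether to page, given currently firing alerts and their age in minutes."""
    # Condition 1: a critical alert has been firing for more than 20 minutes.
    persistent = any(
        minutes > 20
        for name, minutes in firing_minutes.items()
        if name in critical
    )
    # Condition 2: cache down and scoring latency are firing at the same time.
    co_occurring = (
        "SignalsCacheDown" in firing_minutes
        and "SignalsScoringLatencyP95High" in firing_minutes
    )
    return persistent or co_occurring
```

For example, `should_page({"SignalsCacheDown": 3, "SignalsScoringLatencyP95High": 12}, critical={"SignalsCacheDown"})` returns True because of the co-occurrence condition, even though neither alert has passed 20 minutes.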

Notes

  • Metrics expected: signals_reachability_scoring_duration_seconds_bucket, signals_cache_hits_total, signals_cache_misses_total, signals_cache_available, signals_sensor_last_seen_timestamp_seconds, signals_ingestion_total, signals_ingestion_failures_total.
  • Keep thresholds version-controlled; align with Policy Engine consumers if scoring SLAs change.
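
The scoring p95 used throughout this playbook is derived from the signals_reachability_scoring_duration_seconds_bucket histogram listed above. The sketch below shows, in simplified form, how a quantile can be estimated from cumulative buckets by linear interpolation, in the spirit of Prometheus's histogram_quantile. It is not the expression used in the rules file; estimate_quantile and the sample bucket bounds are hypothetical.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    `buckets` holds (upper_bound_seconds, cumulative_count) pairs and is
    expected to end with a +Inf bucket, as Prometheus histograms do.
    """
    ordered = sorted(buckets)
    if not ordered or ordered[-1][1] == 0:
        return math.nan  # no observations in the window
    rank = q * ordered[-1][1]
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in ordered:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the overflow bucket: report the last finite bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical bucket layout; the real bounds depend on the service's histogram.
sample = [(0.5, 50), (1.0, 80), (2.0, 96), (math.inf, 100)]
p95 = estimate_quantile(0.95, sample)  # lands inside the 1.0s to 2.0s bucket
```

Because the estimate is interpolated within a bucket, its accuracy depends on how finely the buckets are spaced around the 2s threshold.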