# Signals Pipeline Playbook

Scope: Signals ingestion, cache, scoring, and sensor freshness.

## Dashboards
- Grafana: import `ops/devops/observability/grafana/signals-pipeline.json` (datasource `Prometheus`).
- Key tiles: Scoring p95, Cache hit ratio, Sensor staleness, Ingestion outcomes.

## Alerts
- Rules: `ops/devops/observability/signals-alerts.yaml`
  - `SignalsScoringLatencyP95High` (p95 > 2s for 10m)
  - `SignalsCacheMissRateHigh` (miss ratio >30% for 10m)
  - `SignalsCacheDown`
  - `SignalsSensorStaleness` (no update >15m)
  - `SignalsIngestionErrorRate` (failures >5%)

## Runbook
1. **Scoring latency high**
   - Check Mongo/Redis health; inspect CPU on workers.
   - Scale Signals API pods or increase cache TTL to reduce load.
2. **Cache miss rate / cache down**
   - Validate Redis connectivity/ACL; flush not recommended unless key explosion.
   - Increase cache TTL; ensure connection string matches deployment.
3. **Sensor staleness**
   - Identify stale sensors from alert label; verify upstream pipeline/log shipping.
   - If sensor retired, update allowlist to silence expected gaps.
4. **Ingestion errors**
   - Tail ingestion logs; classify errors (schema vs. storage).
   - If artifacts rejected, check storage path and disk fullness; add capacity or rotate.
5. **Verification**
   - Ensure cache hit ratio >90%, scoring p95 <2s, staleness panel near baseline (<5m) after mitigation.

## Escalation
- Primary: Signals on-call.
- Secondary: DevOps Guild (observability).
- Page when critical alerts persist >20m or when cache down + scoring latency co-occur.

## Notes
- Metrics expected: `signals_reachability_scoring_duration_seconds_bucket`, `signals_cache_hits_total`, `signals_cache_misses_total`, `signals_cache_available`, `signals_sensor_last_seen_timestamp_seconds`, `signals_ingestion_total`, `signals_ingestion_failures_total`.
- Keep thresholds version-controlled; align with Policy Engine consumers if scoring SLAs change.
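
The verification targets and the paging rule in this playbook can be expressed as a small check, sketched below under stated assumptions. The `PipelineSnapshot` type, its field names, and both function names are hypothetical and do not come from `signals-alerts.yaml`; the readings are assumed to have been collected already (for example, read off the dashboard tiles). Treating a window with no cache traffic as a failed check is also an assumption, not documented behavior.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PipelineSnapshot:
    """Point-in-time readings taken after mitigation (hypothetical shape)."""
    cache_hits: float                     # e.g. increase of signals_cache_hits_total
    cache_misses: float                   # e.g. increase of signals_cache_misses_total
    scoring_p95_seconds: float            # value of the "Scoring p95" tile
    max_sensor_staleness_seconds: float   # worst value on the staleness panel


def cache_hit_ratio(hits: float, misses: float) -> float:
    """Return hits / (hits + misses); 0.0 when there was no traffic (assumption)."""
    total = hits + misses
    if total <= 0:
        return 0.0
    return hits / total


def verify_recovery(snapshot: PipelineSnapshot) -> List[str]:
    """Return the names of verification targets that are still unmet.

    Targets mirror the Verification step: hit ratio > 90%,
    scoring p95 < 2s, staleness < 5m. An empty list means all targets are met.
    """
    unmet = []
    if cache_hit_ratio(snapshot.cache_hits, snapshot.cache_misses) <= 0.90:
        unmet.append("cache_hit_ratio")
    if snapshot.scoring_p95_seconds >= 2.0:
        unmet.append("scoring_p95")
    if snapshot.max_sensor_staleness_seconds >= 5 * 60:
        unmet.append("sensor_staleness")
    return unmet


def should_page(critical_alert_age_minutes: float,
                cache_down: bool,
                scoring_latency_high: bool) -> bool:
    """Mirror the Escalation rule: page when a critical alert has persisted
    for more than 20 minutes, or when cache down and scoring latency co-occur."""
    return critical_alert_age_minutes > 20 or (cache_down and scoring_latency_high)


if __name__ == "__main__":
    after_fix = PipelineSnapshot(cache_hits=950, cache_misses=50,
                                 scoring_p95_seconds=1.2,
                                 max_sensor_staleness_seconds=120)
    print(verify_recovery(after_fix))
    print(should_page(5, cache_down=True, scoring_latency_high=True))
```

Keeping the thresholds as literals next to the checks makes drift from the alert rules easy to spot in review; if the scoring SLA changes, both places need the same update.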