up
Some checks failed
Docs CI / lint-and-preview (push) Has been cancelled
Signals CI & Image / signals-ci (push) Has been cancelled
Policy Lint & Smoke / policy-lint (push) Has been cancelled
Policy Simulation / policy-simulate (push) Has been cancelled
SDK Publish & Sign / sdk-publish (push) Has been cancelled
AOC Guard CI / aoc-guard (push) Has been cancelled
AOC Guard CI / aoc-verify (push) Has been cancelled
Concelier Attestation Tests / attestation-tests (push) Has been cancelled
devportal-offline / build-offline (push) Has been cancelled
ops/devops/observability/signals-playbook.md
@@ -0,0 +1,40 @@
# Signals Pipeline Playbook

Scope: Signals ingestion, cache, scoring, and sensor freshness.

## Dashboards
- Grafana: import `ops/devops/observability/grafana/signals-pipeline.json` (datasource `Prometheus`).
- Key tiles: Scoring p95, Cache hit ratio, Sensor staleness, Ingestion outcomes.

## Alerts
- Rules: `ops/devops/observability/signals-alerts.yaml`
  - `SignalsScoringLatencyP95High` (p95 >2s for 10m)
  - `SignalsCacheMissRateHigh` (miss ratio >30% for 10m)
  - `SignalsCacheDown`
  - `SignalsSensorStaleness` (no update >15m)
  - `SignalsIngestionErrorRate` (failures >5%)
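
The thresholds above can be sketched as a small evaluator, which may help when sanity-checking a dashboard reading against the alert list. This is an illustration only: the authoritative expressions live in `signals-alerts.yaml`, the `Snapshot` type and `firing_alerts` function are hypothetical, the miss ratio is assumed to be misses / (hits + misses), `signals_ingestion_total` is assumed to count every attempt (including failures), and the "for 10m" hold durations are not modeled.

```python
from dataclasses import dataclass

# Thresholds copied from the alert list; hold durations ("for 10m") are
# deliberately not modeled in this sketch.
SCORING_P95_MAX_SECONDS = 2.0
MISS_RATIO_MAX = 0.30
STALENESS_MAX_SECONDS = 15 * 60
FAILURE_RATIO_MAX = 0.05


@dataclass
class Snapshot:
    """Hypothetical point-in-time values derived from the Signals metrics."""
    scoring_p95_seconds: float
    cache_hits: float                 # increase over the evaluation window
    cache_misses: float               # increase over the evaluation window
    cache_available: bool
    seconds_since_sensor_update: float
    ingestion_total: float            # assumed to include failed attempts
    ingestion_failures: float


def ratio(part: float, whole: float) -> float:
    """Return part/whole, treating an empty window as a ratio of zero."""
    return part / whole if whole > 0 else 0.0


def firing_alerts(s: Snapshot) -> list[str]:
    """Return the alert names whose threshold is exceeded by the snapshot."""
    alerts = []
    if s.scoring_p95_seconds > SCORING_P95_MAX_SECONDS:
        alerts.append("SignalsScoringLatencyP95High")
    if ratio(s.cache_misses, s.cache_hits + s.cache_misses) > MISS_RATIO_MAX:
        alerts.append("SignalsCacheMissRateHigh")
    if not s.cache_available:
        alerts.append("SignalsCacheDown")
    if s.seconds_since_sensor_update > STALENESS_MAX_SECONDS:
        alerts.append("SignalsSensorStaleness")
    if ratio(s.ingestion_failures, s.ingestion_total) > FAILURE_RATIO_MAX:
        alerts.append("SignalsIngestionErrorRate")
    return alerts
```

For example, a snapshot with a 2.5s p95 and 40 misses out of 100 lookups would report the latency and miss-rate alerts, while a 1% ingestion failure rate stays below the 5% threshold.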

## Runbook
1. **Scoring latency high**
   - Check Mongo/Redis health and inspect CPU usage on the workers.
   - Scale the Signals API pods or increase the cache TTL to reduce load.
2. **Cache miss rate / cache down**
   - Validate Redis connectivity and ACLs. Flushing the cache is not recommended unless there is a key explosion.
   - Increase the cache TTL and ensure the connection string matches the deployment.
3. **Sensor staleness**
   - Identify stale sensors from the alert label, then verify the upstream pipeline and log shipping.
   - If a sensor has been retired, update the allowlist to silence the expected gaps.
4. **Ingestion errors**
   - Tail the ingestion logs and classify errors (schema vs. storage).
   - If artifacts are rejected, check the storage path and disk fullness; add capacity or rotate.
5. **Verification**
   - After mitigation, ensure the cache hit ratio is >90%, scoring p95 is <2s, and the staleness panel is near baseline (<5m).
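
The verification step can be expressed as a checklist function. The sketch below is a hypothetical helper, not part of the Signals tooling: `verify_recovery` and its parameter names are assumptions, and the inputs are expected to be read off the dashboard tiles or queried separately.

```python
def verify_recovery(cache_hits: float,
                    cache_misses: float,
                    scoring_p95_seconds: float,
                    max_staleness_seconds: float) -> dict[str, bool]:
    """Return one pass/fail flag per post-mitigation target from the runbook."""
    lookups = cache_hits + cache_misses
    # With no lookups in the window there is no evidence the cache recovered.
    hit_ratio = cache_hits / lookups if lookups > 0 else 0.0
    return {
        "cache_hit_ratio_above_90pct": hit_ratio > 0.90,
        "scoring_p95_below_2s": scoring_p95_seconds < 2.0,
        "staleness_below_5m": max_staleness_seconds < 5 * 60,
    }


def recovered(checks: dict[str, bool]) -> bool:
    """Mitigation counts as verified only when every target is met."""
    return all(checks.values())
```

Returning the individual flags rather than a single boolean keeps it obvious which target is still failing when the incident cannot be closed yet.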

## Escalation
- Primary: Signals on-call.
- Secondary: DevOps Guild (observability).
- Page when critical alerts persist >20m or when `SignalsCacheDown` and `SignalsScoringLatencyP95High` fire together.
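
The paging rule can be sketched as a single predicate. This is a minimal illustration under stated assumptions: the playbook does not say which alerts carry a critical severity, so the severity values below are hypothetical inputs, as are the `should_page` name and the shape of the `firing` mapping.

```python
PAGE_AFTER_SECONDS = 20 * 60

# The pair of alerts that triggers an immediate page when both are firing.
CO_OCCURRING = {"SignalsCacheDown", "SignalsScoringLatencyP95High"}


def should_page(firing: dict[str, tuple[str, float]], now: float) -> bool:
    """Decide whether to page.

    `firing` maps alert name -> (severity, fired_at_epoch_seconds).
    Severity strings are an assumption; only "critical" is treated specially.
    """
    # Cache down together with high scoring latency pages immediately.
    if CO_OCCURRING <= set(firing):
        return True
    # Otherwise page only for critical alerts that have persisted >20m.
    return any(
        severity == "critical" and now - fired_at > PAGE_AFTER_SECONDS
        for severity, fired_at in firing.values()
    )
```

A non-critical alert never pages on its own in this sketch, however long it has been firing; whether that matches the intended policy depends on the severities set in `signals-alerts.yaml`.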

## Notes
- Metrics expected: `signals_reachability_scoring_duration_seconds_bucket`, `signals_cache_hits_total`, `signals_cache_misses_total`, `signals_cache_available`, `signals_sensor_last_seen_timestamp_seconds`, `signals_ingestion_total`, `signals_ingestion_failures_total`.
- Keep thresholds version-controlled; align with Policy Engine consumers if scoring SLAs change.
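
To show how two of these metrics relate to the dashboard tiles, the sketch below estimates a quantile from cumulative histogram buckets (the shape of a Prometheus `_bucket` series keyed by its `le` upper bound) and lists sensors whose last-seen timestamp is older than the staleness threshold. It uses linear interpolation within a bucket, similar in spirit to Prometheus's `histogram_quantile`; the function names and input shapes are hypothetical, and the actual panel queries may differ.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative buckets.

    `buckets` is a list of (upper_bound, cumulative_count) sorted by
    upper_bound, ending with the (math.inf, total_count) bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Falls in the +Inf bucket (or an empty one): report the
                # highest known finite bound instead of interpolating.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


def stale_sensors(last_seen: dict[str, float], now: float,
                  max_age_seconds: float = 15 * 60) -> list[str]:
    """Return sensor names whose last update is older than max_age_seconds."""
    return sorted(
        name for name, seen_at in last_seen.items()
        if now - seen_at > max_age_seconds
    )
```

With buckets `[(0.5, 50), (1.0, 80), (2.0, 90), (5.0, 100), (math.inf, 100)]`, the 0.95 rank of 95 falls halfway through the 2.0 to 5.0 bucket, so the estimate is 3.5 seconds, which would exceed the 2s latency threshold.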