StellaOps Bot 9f6e6f7fb3

2025-11-25 22:09:44 +02:00

Signals Pipeline Playbook

Scope: Signals ingestion, cache, scoring, and sensor freshness.

Dashboards

Grafana: import ops/devops/observability/grafana/signals-pipeline.json (datasource Prometheus).
Key tiles: Scoring p95, Cache hit ratio, Sensor staleness, Ingestion outcomes.

Scoring latency high
- Check Mongo/Redis health; inspect CPU on workers.
- Scale Signals API pods or increase cache TTL to reduce load.
Cache miss rate / cache down
- Validate Redis connectivity and ACLs; avoid flushing the cache unless the key count has exploded.
- Increase cache TTL; ensure connection string matches deployment.
Sensor staleness
- Identify stale sensors from alert label; verify upstream pipeline/log shipping.
- If the sensor has been retired, update the allowlist to silence the expected gaps.
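
As a rough illustration of the staleness triage above, the sketch below filters a map of last-seen timestamps (the kind of value signals_sensor_last_seen_timestamp_seconds reports) against a retired-sensor allowlist. The function name, the input shape, and the reuse of the 5-minute baseline as the cutoff are assumptions for illustration, not part of the Signals codebase.

```python
import time

# Assumed cutoff: the <5m staleness baseline used in the Verification step.
STALENESS_LIMIT_SECONDS = 5 * 60


def stale_sensors(last_seen, retired=frozenset(), now=None,
                  limit=STALENESS_LIMIT_SECONDS):
    """Return {sensor: age_seconds} for sensors that are stale and not retired.

    last_seen maps a sensor name to its last-seen Unix timestamp.
    retired is the allowlist of sensors whose gaps are expected.
    """
    now = time.time() if now is None else now
    return {
        sensor: now - ts
        for sensor, ts in last_seen.items()
        if sensor not in retired and now - ts > limit
    }
```

Sensors on the allowlist are skipped entirely, which mirrors the intent of silencing expected gaps rather than raising the threshold for everyone.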
Ingestion errors
- Tail ingestion logs; classify errors (schema vs. storage).
- If artifacts are rejected, check the storage path and disk usage; add capacity or rotate.
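
One way to do the schema-versus-storage split, sketched under loud assumptions: the hint substrings below are hypothetical placeholders, since the actual ingestion log messages are not documented here, and should be replaced with whatever the service really emits. The disk check uses the standard library's shutil.disk_usage against a path you supply.

```python
import shutil
from collections import Counter

# Hypothetical substrings; replace with messages the ingestion logs actually emit.
SCHEMA_HINTS = ("schema", "validation", "malformed")
STORAGE_HINTS = ("no space left", "permission denied", "i/o error")


def classify(line):
    """Bucket one log line as 'storage', 'schema', or 'other'."""
    text = line.lower()
    if any(hint in text for hint in STORAGE_HINTS):
        return "storage"
    if any(hint in text for hint in SCHEMA_HINTS):
        return "schema"
    return "other"


def summarize(lines):
    """Count log lines per error class."""
    return Counter(classify(line) for line in lines)


def disk_used_fraction(path):
    """Fraction of the filesystem holding `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total
```

Storage hints are checked first so that a line mentioning both a schema and a full disk is attributed to the condition that needs capacity work.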
Verification
- Ensure cache hit ratio >90%, scoring p95 <2s, staleness panel near baseline (<5m) after mitigation.
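
The verification targets can be expressed as a small check, shown here as a minimal sketch. The function name and argument names are illustrative; the thresholds are the ones stated above (hit ratio above 90%, scoring p95 below 2s, staleness below 5 minutes).

```python
def verify_recovery(cache_hit_ratio, scoring_p95_seconds, max_staleness_seconds):
    """Return (all_ok, per_check) for the post-mitigation targets.

    cache_hit_ratio is a fraction (0.0 to 1.0); the other two are in seconds.
    """
    checks = {
        "cache_hit_ratio": cache_hit_ratio > 0.90,
        "scoring_p95": scoring_p95_seconds < 2.0,
        "sensor_staleness": max_staleness_seconds < 5 * 60,
    }
    return all(checks.values()), checks
```

Returning the per-check map alongside the overall result makes it clear which target is still out of range when the mitigation only partly worked.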

Primary: Signals on-call.
Secondary: DevOps Guild (observability).
Page when critical alerts persist for more than 20 minutes, or when cache-down and high scoring latency occur together.
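
The paging rule reduces to a single boolean expression. The sketch below assumes the caller already knows how long the longest-firing critical alert has persisted and whether the two conditions are active; the function and parameter names are hypothetical.

```python
def should_page(critical_alert_minutes, cache_down, scoring_latency_high):
    """Apply the paging rule.

    critical_alert_minutes: how long the longest-firing critical alert has
    persisted, in minutes (0 if none is firing).
    """
    return critical_alert_minutes > 20 or (cache_down and scoring_latency_high)
```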

Metrics expected: signals_reachability_scoring_duration_seconds_bucket, signals_cache_hits_total, signals_cache_misses_total, signals_cache_available, signals_sensor_last_seen_timestamp_seconds, signals_ingestion_total, signals_ingestion_failures_total.
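
For ad hoc checks outside Grafana, the metrics above could be queried roughly as follows. This is a sketch under assumptions: that the `_bucket` series is a standard Prometheus histogram, that the `_total` series are counters, that the last-seen metric is a gauge holding a Unix timestamp, and that a 5-minute rate window is appropriate. The query names and the base URL in the usage line are placeholders. If signals_ingestion_total carries an outcome label, the failure-ratio query may need adjusting.

```python
from urllib.parse import urlencode

WINDOW = "5m"  # assumed rate window; tune to match the dashboard

# PromQL expressions built from the metric names listed above.
QUERIES = {
    "scoring_p95_seconds": (
        "histogram_quantile(0.95, sum by (le) "
        f"(rate(signals_reachability_scoring_duration_seconds_bucket[{WINDOW}])))"
    ),
    "cache_hit_ratio": (
        f"sum(rate(signals_cache_hits_total[{WINDOW}])) / "
        f"(sum(rate(signals_cache_hits_total[{WINDOW}])) + "
        f"sum(rate(signals_cache_misses_total[{WINDOW}])))"
    ),
    "max_sensor_staleness_seconds": (
        "max(time() - signals_sensor_last_seen_timestamp_seconds)"
    ),
    "ingestion_failure_ratio": (
        f"sum(rate(signals_ingestion_failures_total[{WINDOW}])) / "
        f"sum(rate(signals_ingestion_total[{WINDOW}]))"
    ),
}


def query_url(base_url, name):
    """Build a Prometheus instant-query URL (/api/v1/query) for a named query."""
    return f"{base_url.rstrip('/')}/api/v1/query?" + urlencode({"query": QUERIES[name]})
```

For example, `query_url("http://prometheus:9090", "cache_hit_ratio")` yields a URL that can be fetched with curl or any HTTP client; the host shown is a placeholder for your Prometheus endpoint.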
Keep thresholds version-controlled; align with Policy Engine consumers if scoring SLAs change.