Doctor enhancements, setup enhancements, UI functionality and design consolidation, test project fixes, product advisory attestation/Rekor and delta verification enhancements
@@ -494,71 +494,142 @@ stella unknowns resolve unk-... \

## 7. Monitoring & Alerting

> **Updated**: Sprint SPRINT_20260118_018_Unknowns_queue_enhancement (UQ-007)

### 7.1 Key Metrics

| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| `unknowns_total` | Total unknowns in queue | > 500 |
| `unknowns_hot_count` | HOT band count | > 20 |
| `unknowns_sla_breached` | SLA breaches | > 0 |
| `unknowns_resolution_rate` | Daily resolutions | < 5 |
| `unknowns_escalation_failures` | Failed escalations | > 0 |
| `unknowns_avg_age_hours` | Average unknown age | > 168 (1 week) |
| `unknowns_queue_depth_hot` | HOT band queue depth | > 5 critical, > 0 for 1h warning |
| `unknowns_queue_depth_warm` | WARM band queue depth | > 25 warning |
| `unknowns_queue_depth_cold` | COLD band queue depth | > 100 warning |
| `unknowns_sla_compliance` | SLA compliance rate (0-1) | < 0.80 critical, < 0.95 warning |
| `unknowns_sla_breach_total` | Total SLA breaches (counter) | increase > 0 |
| `unknowns_escalated_total` | Escalations (counter) | rate > 10/hour |
| `unknowns_demoted_total` | Demotions (counter) | - |
| `unknowns_expired_total` | Expirations (counter) | - |
| `unknowns_processing_time_seconds` | Processing time histogram | p95 > 30s |
| `unknowns_resolution_time_hours` | Resolution time by band | p95 > SLA |
| `unknowns_state_transitions_total` | State transitions (by from/to) | - |
| `greyqueue_stuck_total` | Stuck processing entries | > 0 |
| `greyqueue_timeout_total` | Processing timeouts | > 5/hour |
| `greyqueue_processing_count` | Currently processing | > 10 for 30m |

### 7.2 Grafana Dashboard

```
Dashboard: Unknowns Queue Health
Panels:
- Queue size by band (HOT/WARM/COLD)
- SLA compliance rate
- Unknowns by reason code
- Resolution velocity
- Escalation success rate
- Queue age distribution
- KEV item tracking
```

Import dashboard from: `devops/observability/grafana/dashboards/unknowns-queue-dashboard.json`

**Dashboard Panels:**

| Panel | Description |
|-------|-------------|
| Total Queue Depth | Stat showing total across all bands |
| HOT/WARM/COLD Unknowns | Individual band stats with thresholds |
| SLA Compliance | Gauge showing compliance percentage |
| Queue Depth Over Time | Time series by band |
| SLA Compliance Over Time | Trending compliance |
| State Transitions | Rate of state changes |
| Processing Time (p95) | Performance histogram |
| Escalations & Failures | Lifecycle events |
| Resolution Time by Band | Time-to-resolution |
| Stuck & Timeout Events | Watchdog metrics |
| SLA Breaches Today | 24h breach counter |

### 7.3 Alerting Rules

```yaml
groups:
  - name: unknowns-queue
    rules:
      - alert: UnknownsHotBandHigh
        expr: unknowns_hot_count > 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "HOT unknowns queue is high ({{ $value }} items)"

      - alert: UnknownsSLABreach
        expr: unknowns_sla_breached > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $value }} unknowns have breached SLA"

      - alert: UnknownsQueueGrowing
        expr: rate(unknowns_total[1h]) > 10
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Unknowns queue is growing rapidly"

      - alert: UnknownsKEVPending
        expr: unknowns_kev_count > 0 and unknowns_kev_unresolved_age_hours > 24
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "KEV unknown pending for over 24 hours"
```

Alert rules deployed from: `devops/observability/prometheus/rules/unknowns-queue-alerts.yaml`

**Critical Alerts:**

| Alert | Condition | Response |
|-------|-----------|----------|
| `UnknownsSlaBreachCritical` | compliance < 80% | Immediate escalation to security team |
| `UnknownsHotQueueHigh` | HOT > 5 for 10m | Prioritize resolution |
| `UnknownsProcessingFailures` | Failed entries in 1h | Manual intervention required |
| `UnknownsSlaMonitorDown` | No metrics for 5m | Check service health |
| `UnknownsHealthCheckUnhealthy` | Health check failing | Check SLA breaches |

**Warning Alerts:**

| Alert | Condition | Response |
|-------|-----------|----------|
| `UnknownsSlaBreachWarning` | 80% ≤ compliance < 95% | Review queue health |
| `UnknownsHotQueuePresent` | HOT > 0 for 1h | Check progress |
| `UnknownsQueueBacklog` | Total > 100 for 30m | Scale processing |
| `UnknownsStuckProcessing` | Processing > 10 for 30m | Check bottlenecks |
| `UnknownsProcessingTimeout` | Timeouts > 5/hour | Review automation |
| `UnknownsEscalationRate` | Escalations > 10/hour | Review criteria |

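The compliance thresholds in the tables above (critical below 80%, warning below 95%) can also be checked ad hoc from a shell. The following is a minimal sketch, not part of the product: `sla_severity` is a hypothetical helper name, and the commented usage assumes the standard Prometheus instant-query response shape (`.data.result[0].value[1]`) and the `http://prometheus:9090` address used elsewhere in this runbook.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map an SLA compliance ratio (0-1) to a severity label
# using the thresholds from the alert tables (< 0.80 critical, < 0.95 warning).
sla_severity() {
  LC_ALL=C awk -v c="$1" 'BEGIN {
    if (c + 0 < 0.80)      print "critical"
    else if (c + 0 < 0.95) print "warning"
    else                   print "ok"
  }'
}

# Example usage (assumes curl and jq are installed):
#   value=$(curl -sG "http://prometheus:9090/api/v1/query" \
#             --data-urlencode "query=unknowns_sla_compliance" |
#           jq -r '.data.result[0].value[1]')
#   sla_severity "$value"
```

The boundaries are treated as in the tables: exactly 0.80 counts as a warning and exactly 0.95 counts as healthy.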
### 7.4 Metric-Based Troubleshooting

#### SLA Breach Investigation

```bash
# 1. Check current breach status
curl -s "http://prometheus:9090/api/v1/query?query=unknowns_sla_compliance" | jq

# 2. Identify breached entries
curl -s "$UNKNOWNS_API/grey-queue?status=pending" | \
  jq '.items[] | select(.sla_breached == true)'

# 3. Check SLA health endpoint
curl -s "$UNKNOWNS_API/health/sla" | jq

# 4. Review breach timeline
# In Grafana: SLA Compliance Over Time panel, last 24h
```

#### Stuck Processing Investigation

```bash
# 1. Check processing count
curl -s "http://prometheus:9090/api/v1/query?query=greyqueue_processing_count" | jq

# 2. List stuck entries
curl -s "$UNKNOWNS_API/grey-queue?status=Processing" | \
  jq '.items[] | select((.last_processed_at | fromdateiso8601) < (now - 3600))'

# 3. Check watchdog metrics
# (-G with --data-urlencode stops curl from treating "[1h]" as a URL glob range)
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=rate(greyqueue_stuck_total[1h])" | jq

# 4. Force retry if needed (replace {id} with the entry ID)
curl -X POST "$UNKNOWNS_API/grey-queue/{id}/retry"
```

#### High Escalation Rate

```bash
# 1. Check escalation rate
# (-G with --data-urlencode stops curl from treating "[1h]" as a URL glob range)
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=rate(unknowns_escalated_total[1h])" | jq

# 2. Review escalation reasons
curl -s "$UNKNOWNS_API/grey-queue?status=Escalated" | \
  jq '.items | group_by(.escalation_reason) | map({reason: .[0].escalation_reason, count: length})'

# 3. Check for EPSS/KEV spikes
# Events triggering escalations:
# - epss.updated with score increase
# - kev.added events
# - deployment.created with affected components
```

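PromQL's `rate()` returns a per-second average, while the `UnknownsEscalationRate` threshold is stated per hour, so the value from step 1 needs scaling before it is compared. A small sketch of that conversion follows; `per_hour` and `escalation_rate_high` are hypothetical helper names, and the default threshold of 10 is taken from the alert table.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for interpreting a per-second rate() result.

# Convert a per-second rate to events per hour.
per_hour() {
  LC_ALL=C awk -v r="$1" 'BEGIN { printf "%.2f\n", r * 3600 }'
}

# Succeed (exit 0) when the hourly rate exceeds the threshold (default 10).
escalation_rate_high() {
  LC_ALL=C awk -v r="$1" -v t="${2:-10}" 'BEGIN { exit !(r * 3600 > t) }'
}

# Example: 0.005 escalations/second is 18 per hour, above the threshold.
per_hour 0.005
if escalation_rate_high 0.005; then
  echo "escalation rate above 10/hour"
fi
```

Querying `increase(unknowns_escalated_total[1h])` instead of `rate(...)` would give an hourly figure directly, at the cost of diverging from the query shown in step 1.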
#### Queue Growth Analysis

```bash
# 1. Check inflow rate
# (-G with --data-urlencode stops curl from treating "[1h]" as a URL glob range)
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=rate(unknowns_enqueued_total[1h])" | jq

# 2. Check resolution rate
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=rate(unknowns_resolved_total[1h])" | jq

# 3. Calculate net growth
# growth_rate = inflow_rate - resolution_rate

# 4. Review reasons for new unknowns
curl -s "$UNKNOWNS_API/grey-queue/summary" | jq '.by_reason'
```

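Step 3 above leaves the net growth calculation to the reader. A minimal sketch of it follows, assuming the two per-second rates from steps 1 and 2 have already been extracted as plain numbers; `net_growth_per_hour` is a hypothetical helper name, and the factor of 3600 converts PromQL's per-second `rate()` output to an hourly figure.

```bash
#!/usr/bin/env bash
# Hypothetical helper: net queue growth per hour from two per-second rates.
# growth_rate = inflow_rate - resolution_rate (step 3), scaled to hours.
net_growth_per_hour() {
  LC_ALL=C awk -v inflow="$1" -v resolved="$2" \
    'BEGIN { printf "%.2f\n", (inflow - resolved) * 3600 }'
}

# Example: an inflow of 0.01/s against 0.0075/s resolved grows the queue
# by about 9 items per hour; a negative result means the queue is draining.
net_growth_per_hour 0.01 0.0075
```

A sustained positive value points at the inflow reasons listed by step 4; a value near zero with a large backlog suggests the queue is stable but not recovering.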
### 7.5 Daily Report

```bash
# Generate daily report