# Unknowns Queue Management Runbook
> **Version**: 1.0.0
> **Sprint**: 3500.0004.0004
> **Last Updated**: 2025-12-20
This runbook covers operational procedures for managing the Unknowns queue, including triage, escalation, resolution, and queue health maintenance.
---
## Table of Contents
1. [Overview](#1-overview)
2. [Queue Operations](#2-queue-operations)
3. [Triage Procedures](#3-triage-procedures)
4. [Escalation Workflows](#4-escalation-workflows)
5. [Resolution Procedures](#5-resolution-procedures)
6. [Troubleshooting](#6-troubleshooting)
7. [Monitoring & Alerting](#7-monitoring--alerting)
---
## 1. Overview
### What are Unknowns?
Unknowns are items that could not be fully classified during scanning due to:
- Missing VEX statements
- Ambiguous indirect calls in call graphs
- Incomplete SBOM data
- Missing advisory information
- Conflicting evidence from multiple sources
### Unknown Ranking
Unknowns are ranked using a weighted scoring model with three additive factors and a containment deduction:
```
score = 0.60 × blast + 0.30 × scarcity + 0.30 × pressure + containment_deduction
```
| Factor | Weight | Description |
|--------|--------|-------------|
| Blast Radius | 0.60 | Impact scope (dependents, network exposure) |
| Evidence Scarcity | 0.30 | How much data is missing |
| Exploit Pressure | 0.30 | EPSS score, KEV status |
| Containment | up to -0.20 | Deduction for mitigation factors (seccomp, read-only FS) |
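The formula can be checked by hand against the score breakdown shown in section 2.3. The following is a minimal Python sketch of that arithmetic, not the scanner's implementation. The function name `unknown_score` and the assumption that each factor is normalised to the range 0 to 1 are illustrative and do not come from the product.

```python
# Weights copied from the factor table; everything else is illustrative.
WEIGHTS = {"blast": 0.60, "scarcity": 0.30, "pressure": 0.30}


def unknown_score(blast: float, scarcity: float, pressure: float,
                  containment_deduction: float = 0.0) -> float:
    """Weighted sum of the three factors plus a non-positive containment deduction."""
    if containment_deduction > 0:
        raise ValueError("containment_deduction must be zero or negative")
    return (WEIGHTS["blast"] * blast
            + WEIGHTS["scarcity"] * scarcity
            + WEIGHTS["pressure"] * pressure
            + containment_deduction)


# A scarcity of 0.7 contributes 0.30 * 0.7 = 0.21, the same scarcity
# component that appears in the section 2.3 breakdown.
example = unknown_score(blast=0.5, scarcity=0.7, pressure=0.5,
                        containment_deduction=-0.20)
print(round(example, 2))
```

Because the three weights sum to 1.20, a score can exceed 1.0 before the deduction is applied; the sketch does not clamp the result.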
### Band Assignment
| Band | Score Range | Priority | SLA |
|------|-------------|----------|-----|
| HOT | ≥ 0.70 | Critical | 24 hours |
| WARM | ≥ 0.40 and < 0.70 | Normal | 7 days |
| COLD | < 0.40 | Low | 30 days |
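As an illustration of the band table, the sketch below maps a score to its band and SLA. The thresholds and SLAs are taken from the table; the function name `band_for` and the tuple layout are assumptions made for this example.

```python
from datetime import timedelta

# (band, inclusive lower bound, SLA) in descending order of priority.
BANDS = [
    ("HOT", 0.70, timedelta(hours=24)),
    ("WARM", 0.40, timedelta(days=7)),
    ("COLD", float("-inf"), timedelta(days=30)),
]


def band_for(score: float) -> tuple[str, timedelta]:
    """Return the (band, SLA) pair for a score, checking the highest band first."""
    for name, lower_bound, sla in BANDS:
        if score >= lower_bound:
            return name, sla
    # Only reachable when the score is NaN, since every comparison is then False.
    raise ValueError("score is not a number")


print(band_for(0.62))
```

Checking bands from highest to lowest keeps the boundaries unambiguous: a score of exactly 0.70 is HOT and a score of 0.699 is WARM.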
---
## 2. Queue Operations
### 2.1 View Queue Status
```bash
# Get queue summary
stella unknowns summary
# Output:
# Total: 142 unknowns
# HOT: 12 (8%) - Requires immediate attention
# WARM: 85 (60%) - Normal priority
# COLD: 45 (32%) - Low priority
#
# KEV items: 3
# Average score: 0.52
# Get queue summary via API
curl "https://scanner.example.com/api/v1/unknowns/summary" \
  -H "Authorization: Bearer $TOKEN"
```
### 2.2 List Unknowns
```bash
# List all HOT unknowns
stella unknowns list --band HOT
# List by score (highest first)
stella unknowns list --sort score --order desc --limit 20
# Filter by reason
stella unknowns list --reason missing_vex
# Filter by artifact
stella unknowns list --artifact sha256:abc123...
# Filter by KEV status
stella unknowns list --kev true
```
### 2.3 View Unknown Details
```bash
# Get detailed view
stella unknowns show unk-12345678-abcd-1234-5678-abcdef123456
# Output:
# ID: unk-12345678-...
# Artifact: pkg:oci/myapp@sha256:abc123
# Reasons: [missing_vex, ambiguous_indirect_call]
#
# Blast Radius:
#   Dependents: 15 services
#   Network: internet-facing
#   Privilege: user
#
# Evidence Scarcity: 0.7 (high)
#
# Exploit Pressure:
#   EPSS: 0.45
#   KEV: false
#
# Containment:
#   Seccomp: enforced (-0.10)
#   Filesystem: read-only (-0.10)
#
# Score: 0.62 (WARM band)
# Score Breakdown:
#   Blast component: +0.35
#   Scarcity component: +0.21
#   Pressure component: +0.26
#   Containment deduction: -0.20
# Show proof tree
stella unknowns proof unk-12345678-...
```
### 2.4 Export Queue Data
```bash
# Export for analysis
stella unknowns export --format json --output unknowns.json
# Export HOT items for daily review
stella unknowns export --band HOT --format csv --output hot-unknowns.csv
# Export with full details
stella unknowns export --verbose --include-proofs --output full-export.json
```
---
## 3. Triage Procedures
### 3.1 Daily Triage Workflow
**Schedule**: Daily at 9:00 AM
**Duration**: 30 minutes
**Participants**: Security analyst, on-call engineer
**Process**:
```bash
# 1. Get today's queue snapshot
stella unknowns snapshot --output daily-$(date +%Y%m%d).json
# 2. Review all HOT items
stella unknowns list --band HOT --since 24h
# 3. For each HOT unknown, determine action:
# - Escalate: Trigger immediate rescan
# - Investigate: Needs manual analysis
# - Defer: Move to WARM (with justification)
# - Resolve: Evidence found, can close
# 4. Process each item
stella unknowns triage unk-12345678-... --action escalate
stella unknowns triage unk-87654321-... --action investigate --notes "Need VEX from vendor"
stella unknowns triage unk-11111111-... --action defer --reason "False positive suspected"
```
### 3.2 Triage Decision Matrix
| Reason Code | KEV | EPSS > 0.5 | Action |
|-------------|-----|------------|--------|
| `missing_vex` | Yes | Any | Escalate + Vendor outreach |
| `missing_vex` | No | Yes | Escalate |
| `missing_vex` | No | No | Request VEX |
| `ambiguous_indirect_call` | Any | Any | Manual code review |
| `incomplete_sbom` | Any | Any | Rescan with updated extractor |
| `conflicting_evidence` | Any | Any | Manual analysis |
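The decision matrix can be read as a small lookup. The sketch below encodes it in Python so the rows can be tested; the function name `triage_action` is hypothetical, and reason codes that the matrix does not list are rejected rather than guessed.

```python
# Rows of the matrix whose action does not depend on KEV or EPSS.
FIXED_ACTIONS = {
    "ambiguous_indirect_call": "Manual code review",
    "incomplete_sbom": "Rescan with updated extractor",
    "conflicting_evidence": "Manual analysis",
}


def triage_action(reason: str, kev: bool, epss: float) -> str:
    """Return the action from the triage decision matrix for one unknown."""
    if reason == "missing_vex":
        if kev:
            return "Escalate + Vendor outreach"
        return "Escalate" if epss > 0.5 else "Request VEX"
    if reason in FIXED_ACTIONS:
        return FIXED_ACTIONS[reason]
    raise ValueError(f"reason code not covered by the matrix: {reason}")


print(triage_action("missing_vex", kev=False, epss=0.62))
```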
### 3.3 Triage Templates
```bash
# Quick escalate (HOT + KEV)
stella unknowns triage unk-... --action escalate \
  --priority P1 \
  --notes "KEV item, requires immediate attention"
# Request vendor VEX
stella unknowns triage unk-... --action investigate \
  --notes "Requested VEX from vendor via security@vendor.com" \
  --due-date 7d
# Mark for code review
stella unknowns triage unk-... --action investigate \
  --notes "Requires manual code review to resolve indirect call" \
  --assign @code-review-team
# Defer with justification
stella unknowns triage unk-... --action defer \
  --reason "Component not deployed to production" \
  --evidence "deployment-manifest.yaml shows staging-only"
```
---
## 4. Escalation Workflows
### 4.1 Automatic Escalation
Unknowns are automatically escalated when:
- Score increases above HOT threshold (0.70)
- KEV status added to related CVE
- EPSS score increases significantly (> 0.2 delta)
- Blast radius increases (new dependents detected)
**Configure auto-escalation**:
```yaml
# policy.unknowns.escalation.yaml
autoEscalation:
  enabled: true
  triggers:
    - condition: score >= 0.70
      action: escalate
      notify: [security-team]
    - condition: kev == true
      action: escalate
      priority: P1
      notify: [security-team, management]
    - condition: epss_delta > 0.2
      action: escalate
      notify: [security-team]
```
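To reason about which triggers would fire for a given change, the three conditions in the policy file can be sketched as plain comparisons. This is an illustration only: how the policy engine actually parses and evaluates `condition` expressions is not described here, and the `Signals` class and `fired_triggers` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    """Snapshot of the inputs the trigger conditions refer to (illustrative)."""
    score: float
    kev: bool
    epss: float


def fired_triggers(previous: Signals, current: Signals) -> list[str]:
    """Return the policy conditions that hold after moving from previous to current."""
    fired = []
    if current.score >= 0.70:
        fired.append("score >= 0.70")
    if current.kev:
        fired.append("kev == true")
    # epss_delta is assumed to mean the increase since the previous snapshot.
    if current.epss - previous.epss > 0.2:
        fired.append("epss_delta > 0.2")
    return fired


before = Signals(score=0.62, kev=False, epss=0.45)
after = Signals(score=0.74, kev=False, epss=0.70)
print(fired_triggers(before, after))
```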
### 4.2 Manual Escalation
```bash
# Escalate via CLI
stella unknowns escalate unk-12345678-...
# Escalate with reason
stella unknowns escalate unk-12345678-... \
  --reason "Customer reported potential exploit"
# Escalate to trigger rescan
stella unknowns escalate unk-12345678-... --rescan
# Output:
# Escalated: unk-12345678-...
# Rescan job: rescan-job-001
# Status: queued
# ETA: 5 minutes
```
### 4.3 Bulk Escalation
```bash
# Escalate all KEV items
stella unknowns escalate --filter "kev=true" --reason "KEV bulk escalation"
# Escalate high-score items
stella unknowns escalate --filter "score>=0.8" --rescan
# Escalate by artifact
stella unknowns escalate --artifact sha256:abc123... --reason "Production incident"
```
### 4.4 Escalation SLA Tracking
```bash
# Check SLA status
stella unknowns sla-status
# Output:
# HOT unknowns SLA (24h):
#   In SLA: 10 (83%)
#   Breached: 2 (17%)
#
# Breached items:
#   unk-111... (26h old) - missing_vex
#   unk-222... (30h old) - conflicting_evidence
# Get SLA breach notifications
stella unknowns list --sla-breached
```
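The percentages in the sample output follow directly from each item's age and its band SLA. The sketch below reproduces that arithmetic under the assumption that age is measured in hours since the unknown was created; `is_breached` and `sla_compliance` are illustrative names, not CLI or API calls.

```python
# SLA per band in hours, from the band table (24 hours, 7 days, 30 days).
SLA_HOURS = {"HOT": 24, "WARM": 7 * 24, "COLD": 30 * 24}


def is_breached(band: str, age_hours: float) -> bool:
    """An item breaches its SLA once it is older than the band allows."""
    return age_hours > SLA_HOURS[band]


def sla_compliance(items: list[tuple[str, float]]) -> float:
    """Fraction of (band, age_hours) items still inside their SLA."""
    if not items:
        return 1.0
    in_sla = sum(1 for band, age in items if not is_breached(band, age))
    return in_sla / len(items)


# Ten HOT items inside the SLA plus the two breached ones from the sample output.
hot_items = [("HOT", 3.0)] * 10 + [("HOT", 26.0), ("HOT", 30.0)]
print(f"{sla_compliance(hot_items):.0%}")
```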
---
## 5. Resolution Procedures
### 5.1 Resolution Types
| Resolution | Description | Evidence Required |
|------------|-------------|-------------------|
| `not_affected` | Vulnerability doesn't apply | VEX statement or manual analysis |
| `fixed` | Vulnerability patched | Version upgrade confirmation |
| `mitigated` | Controls in place | Mitigation documentation |
| `false_positive` | Incorrect classification | Analysis report |
| `wont_fix` | Accepted risk | Risk acceptance form |
### 5.2 Resolve Unknown
```bash
# Resolve as not affected
stella unknowns resolve unk-12345678-... \
  --resolution not_affected \
  --justification "vulnerable_code_not_present" \
  --notes "Manual code review confirmed function not used"
# Resolve as fixed
stella unknowns resolve unk-12345678-... \
  --resolution fixed \
  --justification "version_upgraded" \
  --evidence "Upgraded lodash to 4.17.21, CVE patched"
# Resolve as mitigated
stella unknowns resolve unk-12345678-... \
  --resolution mitigated \
  --justification "inline_mitigations_exist" \
  --evidence "WAF rule WAF-001 blocks exploit pattern"
# Resolve as won't fix (risk accepted)
stella unknowns resolve unk-12345678-... \
  --resolution wont_fix \
  --justification "risk_accepted" \
  --evidence "Risk acceptance ticket RISK-123" \
  --expires 90d  # Re-evaluate in 90 days
```
### 5.3 Bulk Resolution
```bash
# Resolve all items for a fixed package version
stella unknowns resolve-batch \
  --filter "purl=pkg:npm/lodash@4.17.20" \
  --resolution fixed \
  --justification "Upgraded to 4.17.21 fleet-wide" \
  --evidence "Fleet upgrade ticket FLEET-456"
# Resolve false positives from analysis
stella unknowns resolve-batch \
  --file false-positives.json \
  --resolution false_positive
```
### 5.4 Resolution Audit Trail
```bash
# View resolution history
stella unknowns history unk-12345678-...
# Output:
# 2025-12-15 10:00:00 - Created (score: 0.62)
# 2025-12-16 09:30:00 - Triaged by analyst@example.com
# 2025-12-17 14:00:00 - Escalated (KEV added)
# 2025-12-18 11:00:00 - Resolved by security@example.com
#   Resolution: not_affected
#   Justification: vulnerable_code_not_present
#   Notes: Manual code review confirmed function not used
# Export audit trail
stella unknowns audit-export --from 2025-01-01 --to 2025-12-31 --output audit.json
```
---
## 6. Troubleshooting
### 6.1 Score Seems Wrong
**Symptom**: Unknown scored too high or too low.
**Diagnosis**:
```bash
# View score breakdown
stella unknowns show unk-... --score-details
# View proof tree
stella unknowns proof unk-... --verbose
```
**Common causes**:
1. **Stale EPSS data**: EPSS feed not updated
2. **Incorrect blast radius**: Dependency data outdated
3. **Missing containment data**: Seccomp/filesystem status unknown
**Resolution**:
```bash
# Trigger score recalculation
stella unknowns recalculate unk-...
# Force refresh of all input signals
stella unknowns refresh unk-... --force
```
### 6.2 Duplicate Unknowns
**Symptom**: Same issue appears multiple times.
**Diagnosis**:
```bash
# Find potential duplicates
stella unknowns duplicates --scan
# Output shows items with same CVE+PURL but different artifacts
```
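When working from an export rather than the CLI, the same duplicate criterion (same CVE and PURL, different artifacts) can be applied offline. The sketch below assumes each exported record carries `id`, `cve`, `purl`, and `artifact` keys; those field names are hypothetical and should be adjusted to the real export schema.

```python
from collections import defaultdict


def find_duplicates(unknowns: list[dict]) -> dict[tuple[str, str], list[str]]:
    """Group unknown IDs by (CVE, PURL) and keep groups spanning several artifacts."""
    groups = defaultdict(list)
    for unknown in unknowns:
        groups[(unknown["cve"], unknown["purl"])].append(unknown)
    return {
        key: [u["id"] for u in members]
        for key, members in groups.items()
        if len({u["artifact"] for u in members}) > 1
    }


sample = [
    {"id": "unk-111", "cve": "CVE-2021-23337", "purl": "pkg:npm/lodash@4.17.20",
     "artifact": "sha256:aaa"},
    {"id": "unk-222", "cve": "CVE-2021-23337", "purl": "pkg:npm/lodash@4.17.20",
     "artifact": "sha256:bbb"},
    {"id": "unk-333", "cve": "CVE-2021-23337", "purl": "pkg:npm/lodash@4.17.19",
     "artifact": "sha256:aaa"},
]
print(find_duplicates(sample))
```

Each resulting group is a candidate for a merge, with one ID chosen as the primary.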
**Resolution**:
```bash
# Merge duplicates
stella unknowns merge \
  --primary unk-111... \
  --secondary unk-222... \
  --reason "Same CVE across artifact versions"
```
### 6.3 Escalation Not Working
**Symptom**: Escalation doesn't trigger rescan.
**Diagnosis**:
```bash
# Check escalation status
stella unknowns escalation-status unk-...
# Check Scheduler connectivity
stella health check --service scheduler
# Check job queue
stella scheduler queue status rescan
```
**Resolution**:
```bash
# Retry escalation
stella unknowns escalate unk-... --force
# Manual rescan trigger
stella scan trigger --artifact sha256:abc123... --priority high
```
### 6.4 Resolution Rejected
**Symptom**: Resolution attempt fails validation.
**Diagnosis**:
```bash
# Check resolution requirements
stella unknowns resolution-requirements unk-...
# Output:
# Resolution requirements for unk-12345678-...
# - Justification: required
# - Evidence: required (reason: KEV item)
# - Approver: required (band: HOT)
```
**Resolution**:
```bash
# Provide required evidence
stella unknowns resolve unk-... \
  --resolution not_affected \
  --justification "vulnerable_code_not_present" \
  --evidence "Code review: CRV-123" \
  --approver security-lead@example.com
```
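A pre-flight check can catch a rejected resolution before it is submitted. The sketch below infers its rules from the sample diagnosis output only (justification always required, evidence required for KEV items, approver required for the HOT band); the real validation rules may differ, and `missing_requirements` is a hypothetical helper.

```python
def missing_requirements(band: str, kev: bool, provided: dict[str, str]) -> list[str]:
    """List the fields a resolution still needs, under the assumed rules above."""
    required = ["justification"]
    if kev:
        required.append("evidence")   # assumed: KEV items need evidence
    if band == "HOT":
        required.append("approver")   # assumed: HOT items need an approver
    # A field counts as provided only when it is present and non-empty.
    return [field for field in required if not provided.get(field)]


attempt = {"justification": "vulnerable_code_not_present"}
print(missing_requirements("HOT", kev=True, provided=attempt))
```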
---
## 7. Monitoring & Alerting
> **Updated**: Sprint SPRINT_20260118_018_Unknowns_queue_enhancement (UQ-007)
### 7.1 Key Metrics
| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| `unknowns_queue_depth_hot` | HOT band queue depth | > 5 critical, > 0 for 1h warning |
| `unknowns_queue_depth_warm` | WARM band queue depth | > 25 warning |
| `unknowns_queue_depth_cold` | COLD band queue depth | > 100 warning |
| `unknowns_sla_compliance` | SLA compliance rate (0-1) | < 0.80 critical, < 0.95 warning |
| `unknowns_sla_breach_total` | Total SLA breaches (counter) | increase > 0 |
| `unknowns_escalated_total` | Escalations (counter) | rate > 10/hour |
| `unknowns_demoted_total` | Demotions (counter) | - |
| `unknowns_expired_total` | Expirations (counter) | - |
| `unknowns_processing_time_seconds` | Processing time histogram | p95 > 30s |
| `unknowns_resolution_time_hours` | Resolution time by band | p95 > SLA |
| `unknowns_state_transitions_total` | State transitions (by from/to) | - |
| `greyqueue_stuck_total` | Stuck processing entries | > 0 |
| `greyqueue_timeout_total` | Processing timeouts | > 5/hour |
| `greyqueue_processing_count` | Currently processing | > 10 for 30m |
### 7.2 Grafana Dashboard
Import dashboard from: `devops/observability/grafana/dashboards/unknowns-queue-dashboard.json`
**Dashboard Panels:**
| Panel | Description |
|-------|-------------|
| Total Queue Depth | Stat showing total across all bands |
| HOT/WARM/COLD Unknowns | Individual band stats with thresholds |
| SLA Compliance | Gauge showing compliance percentage |
| Queue Depth Over Time | Time series by band |
| SLA Compliance Over Time | Trending compliance |
| State Transitions | Rate of state changes |
| Processing Time (p95) | Performance histogram |
| Escalations & Failures | Lifecycle events |
| Resolution Time by Band | Time-to-resolution |
| Stuck & Timeout Events | Watchdog metrics |
| SLA Breaches Today | 24h breach counter |
### 7.3 Alerting Rules
Alert rules deployed from: `devops/observability/prometheus/rules/unknowns-queue-alerts.yaml`
**Critical Alerts:**
| Alert | Condition | Response |
|-------|-----------|----------|
| `UnknownsSlaBreachCritical` | compliance < 80% | Immediate escalation to security team |
| `UnknownsHotQueueHigh` | HOT > 5 for 10m | Prioritize resolution |
| `UnknownsProcessingFailures` | Failed entries in 1h | Manual intervention required |
| `UnknownsSlaMonitorDown` | No metrics for 5m | Check service health |
| `UnknownsHealthCheckUnhealthy` | Health check failing | Check SLA breaches |
**Warning Alerts:**
| Alert | Condition | Response |
|-------|-----------|----------|
| `UnknownsSlaBreachWarning` | 80% ≤ compliance < 95% | Review queue health |
| `UnknownsHotQueuePresent` | HOT > 0 for 1h | Check progress |
| `UnknownsQueueBacklog` | Total > 100 for 30m | Scale processing |
| `UnknownsStuckProcessing` | Processing > 10 for 30m | Check bottlenecks |
| `UnknownsProcessingTimeout` | Timeouts > 5/hour | Review automation |
| `UnknownsEscalationRate` | Escalations > 10/hour | Review criteria |
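The two SLA alerts partition the compliance range, which the sketch below makes explicit. It mirrors the thresholds in the tables above and is not the Prometheus rule itself; `compliance_severity` is an illustrative name.

```python
def compliance_severity(compliance: float) -> str:
    """Map an SLA compliance rate (0 to 1) to the alert level it would raise."""
    if not 0.0 <= compliance <= 1.0:
        raise ValueError("compliance must be between 0 and 1")
    if compliance < 0.80:
        return "critical"   # UnknownsSlaBreachCritical
    if compliance < 0.95:
        return "warning"    # UnknownsSlaBreachWarning
    return "ok"


# 10 of 12 HOT items inside the SLA, as in the section 4.4 sample output.
print(compliance_severity(10 / 12))
```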
### 7.4 Metric-Based Troubleshooting
#### SLA Breach Investigation
```bash
# 1. Check current breach status
curl -s "http://prometheus:9090/api/v1/query?query=unknowns_sla_compliance" | jq
# 2. Identify breached entries
curl -s "$UNKNOWNS_API/grey-queue?status=pending" | \
  jq '.items[] | select(.sla_breached == true)'
# 3. Check SLA health endpoint
curl -s "$UNKNOWNS_API/health/sla" | jq
# 4. Review breach timeline
# In Grafana: SLA Compliance Over Time panel, last 24h
```
#### Stuck Processing Investigation
```bash
# 1. Check processing count
curl -s "http://prometheus:9090/api/v1/query?query=greyqueue_processing_count" | jq
# 2. List stuck entries
curl -s "$UNKNOWNS_API/grey-queue?status=Processing" | \
  jq '.items[] | select((.last_processed_at | fromdateiso8601) < (now - 3600))'
# 3. Check watchdog metrics
curl -s "http://prometheus:9090/api/v1/query?query=rate(greyqueue_stuck_total[1h])" | jq
# 4. Force retry if needed (set ENTRY_ID to the ID of the stuck entry)
curl -X POST "$UNKNOWNS_API/grey-queue/${ENTRY_ID}/retry"
```
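The jq filter in step 2 selects entries whose last processing attempt is more than an hour old. The sketch below is a Python equivalent for use on a saved response, assuming, as the jq filter does, that `last_processed_at` is a UTC timestamp ending in `Z`; the function name `stuck_entries` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stuck_entries(items: list[dict], now: datetime,
                  max_age: timedelta = timedelta(hours=1)) -> list[dict]:
    """Return entries whose last_processed_at is older than max_age."""
    cutoff = now - max_age
    stuck = []
    for item in items:
        # Rewrite the trailing "Z" so fromisoformat works on older Python versions.
        stamp = item["last_processed_at"].replace("Z", "+00:00")
        if datetime.fromisoformat(stamp) < cutoff:
            stuck.append(item)
    return stuck


now = datetime(2026, 1, 15, 12, 0, tzinfo=timezone.utc)
entries = [
    {"id": "a", "last_processed_at": "2026-01-15T10:30:00Z"},
    {"id": "b", "last_processed_at": "2026-01-15T11:30:00Z"},
]
print([entry["id"] for entry in stuck_entries(entries, now)])
```

Passing `now` explicitly keeps the check reproducible when replaying a captured response during a post-incident review.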
#### High Escalation Rate
```bash
# 1. Check escalation rate
curl -s "http://prometheus:9090/api/v1/query?query=rate(unknowns_escalated_total[1h])" | jq
# 2. Review escalation reasons
curl -s "$UNKNOWNS_API/grey-queue?status=Escalated" | \
  jq '.items | group_by(.escalation_reason) | map({reason: .[0].escalation_reason, count: length})'
# 3. Check for EPSS/KEV spikes
# Events triggering escalations:
# - epss.updated with score increase
# - kev.added events
# - deployment.created with affected components
```
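The per-reason breakdown from step 2 can also be computed from a saved response. The sketch below assumes the response wraps entries in an `items` array, as the other grey-queue queries in this section do; the reason strings in the sample data are invented for illustration.

```python
from collections import Counter


def escalation_counts(response: dict) -> list[dict]:
    """Count escalated entries per escalation_reason, most frequent first."""
    counts = Counter(item.get("escalation_reason")
                     for item in response.get("items", []))
    return [{"reason": reason, "count": count}
            for reason, count in counts.most_common()]


saved = {"items": [
    {"id": "a", "escalation_reason": "kev_added"},
    {"id": "b", "escalation_reason": "epss_increase"},
    {"id": "c", "escalation_reason": "kev_added"},
]}
print(escalation_counts(saved))
```

A single reason dominating the counts usually points at a feed update (EPSS or KEV) rather than at a change in escalation criteria.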
#### Queue Growth Analysis
```bash
# 1. Check inflow rate
curl -s "http://prometheus:9090/api/v1/query?query=rate(unknowns_enqueued_total[1h])" | jq
# 2. Check resolution rate
curl -s "http://prometheus:9090/api/v1/query?query=rate(unknowns_resolved_total[1h])" | jq
# 3. Calculate net growth
# growth_rate = inflow_rate - resolution_rate
# 4. Review reasons for new unknowns
curl -s "$UNKNOWNS_API/grey-queue/summary" | jq '.by_reason'
```
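Prometheus `rate()` returns a per-second average, so the two query results need scaling before they are compared with queue depth. The sketch below works through the net-growth arithmetic from step 3 and estimates the time until the backlog warning threshold (total above 100) is reached; both function names are illustrative.

```python
from typing import Optional


def net_growth_per_hour(inflow_per_second: float, resolved_per_second: float) -> float:
    """growth_rate = inflow_rate - resolution_rate, scaled from per second to per hour."""
    return (inflow_per_second - resolved_per_second) * 3600


def hours_until(depth: float, threshold: float, growth_per_hour: float) -> Optional[float]:
    """Hours until depth reaches threshold, or None if the queue is not growing."""
    if depth >= threshold:
        return 0.0
    if growth_per_hour <= 0:
        return None
    return (threshold - depth) / growth_per_hour


growth = net_growth_per_hour(inflow_per_second=0.005, resolved_per_second=0.0025)
print(round(growth, 2), hours_until(depth=82, threshold=100, growth_per_hour=9.0))
```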
### 7.5 Daily Report
```bash
# Generate daily report
stella unknowns report --format email --send-to security-team@example.com
# Report includes:
# - Queue summary (total, by band, by reason)
# - SLA status (in compliance, breaches)
# - Top 10 highest-scored items
# - Newly added items (last 24h)
# - Resolved items (last 24h)
# - KEV item status
# - Trends (7-day, 30-day)
```
---
## 8. Unknown Budgets
Unknown budgets enforce per-environment caps on unknowns by reason code. Budgets can warn or block when exceeded.
**Configuration**:
```yaml
# etc/policy.unknowns.budgets.yaml
unknownBudgets:
  enforceBudgets: true
  budgets:
    prod:
      environment: prod
      totalLimit: 3
      reasonLimits:
        Reachability: 0
        Provenance: 0
        VexConflict: 1
      action: Block
      exceededMessage: "Production requires zero reachability unknowns"
    stage:
      environment: stage
      totalLimit: 10
      reasonLimits:
        Reachability: 1
      action: WarnUnlessException
    dev:
      environment: dev
      totalLimit: null
      action: Warn
    default:
      environment: default
      totalLimit: 5
      action: Warn
```
**Exception coverage**:
To allow approved exceptions to cover specific unknown reason codes, set exception metadata
`unknown_reason_codes` (comma-separated). Example: `Reachability, U-VEX`.
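To see how a budget applies to a concrete set of counts, the sketch below checks one environment's limits and parses the comma-separated `unknown_reason_codes` value. It assumes a limit is exceeded only when the count is strictly greater than it and that a `null` total limit means unlimited; how the policy engine combines violations with exceptions is not modelled, and both function names are hypothetical.

```python
from typing import Optional


def budget_violations(budget: dict, counts: dict[str, int]) -> list[str]:
    """Describe every limit in the budget that the given counts exceed."""
    violations = []
    total = sum(counts.values())
    total_limit: Optional[int] = budget.get("totalLimit")
    if total_limit is not None and total > total_limit:
        violations.append(f"total {total} > {total_limit}")
    for reason, limit in budget.get("reasonLimits", {}).items():
        count = counts.get(reason, 0)
        if count > limit:
            violations.append(f"{reason} {count} > {limit}")
    return violations


def parse_reason_codes(raw: str) -> list[str]:
    """Split the comma-separated exception metadata value into reason codes."""
    return [code.strip() for code in raw.split(",") if code.strip()]


prod = {"totalLimit": 3, "action": "Block",
        "reasonLimits": {"Reachability": 0, "Provenance": 0, "VexConflict": 1}}
print(budget_violations(prod, {"Reachability": 1, "VexConflict": 1}))
print(parse_reason_codes("Reachability, U-VEX"))
```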
---
## Related Documentation
- [Unknowns API Reference](../api/score-proofs-reachability-api-reference.md#5-unknowns-api)
- [Triage Technical Reference](../product/advisories/14-Dec-2025%20-%20Triage%20and%20Unknowns%20Technical%20Reference.md)
- [Score Proofs Runbook](./score-proofs-runbook.md)
- [Policy Engine](../modules/policy/architecture.md)
- [Determinization API](../modules/policy/determinization-api.md)
- [VEX Consensus Guide](../VEX_CONSENSUS_GUIDE.md)
---
## 8. Grey Queue Operations
> **Sprint**: SPRINT_20260112_010_CLI_unknowns_grey_queue_cli
The Grey Queue handles observations with uncertain status requiring operator attention or additional evidence. These are distinct from standard HOT/WARM/COLD band unknowns.
### 8.1 Grey Queue Overview
Grey Queue items have:
- **Observation state**: `PendingDeterminization`, `Disputed`, or `GuardedPass`
- **Reanalysis fingerprint**: Deterministic ID for reproducible replays
- **Triggers**: Events that caused reanalysis
- **Conflicts**: Detected evidence disagreements
- **Next actions**: Suggested resolution paths
### 8.2 List Grey Queue Items
```bash
# List all grey queue items
stella unknowns list --state grey
# List by observation state
stella unknowns list --observation-state pending-determinization
stella unknowns list --observation-state disputed
stella unknowns list --observation-state guarded-pass
# List with fingerprint details
stella unknowns list --state grey --show-fingerprint
# List with conflict summary
stella unknowns list --state grey --show-conflicts
```
### 8.3 View Grey Queue Details
```bash
# Show grey queue item with full details
stella unknowns show unk-12345678-... --grey
# Output:
# ID: unk-12345678-...
# Observation State: Disputed
#
# Reanalysis Fingerprint:
#   ID: sha256:abc123...
#   Computed At: 2026-01-15T10:00:00Z
#   Policy Config Hash: sha256:def456...
#
# Triggers (2):
#   - epss.updated@1 (2026-01-15T09:55:00Z) delta=0.15
#   - vex.updated@1 (2026-01-15T09:50:00Z)
#
# Conflicts (1):
#   - VexStatusConflict: vendor-a reports 'not_affected', vendor-b reports 'affected'
#     Severity: high
#     Adjudication: manual_review
#
# Next Actions:
#   - trust_resolution: Resolve issuer trust conflict
#   - manual_review: Escalate to security team
# Show fingerprint only
stella unknowns fingerprint unk-12345678-...
# Show triggers only
stella unknowns triggers unk-12345678-...
```
### 8.4 Grey Queue Triage Actions
```bash
# Resolve a grey queue item (operator determination)
stella unknowns resolve unk-12345678-... \
  --status not_affected \
  --justification "Verified vendor VEX is authoritative" \
  --evidence-ref "vex-observation-id-123"
# Escalate for manual review
stella unknowns escalate unk-12345678-... \
  --priority P1 \
  --reason "Conflicting VEX requires security team decision"
# Defer pending additional evidence
stella unknowns defer unk-12345678-... \
  --await vex \
  --reason "Waiting for upstream vendor VEX statement"
```
### 8.5 Grey Queue Conflict Resolution
```bash
# List items with conflicts
stella unknowns list --has-conflicts
# Filter by conflict type
stella unknowns list --conflict-type vex-status-conflict
stella unknowns list --conflict-type vex-reachability-contradiction
stella unknowns list --conflict-type trust-tie
# Resolve a conflict manually
stella unknowns resolve-conflict unk-12345678-... \
  --winner vendor-a \
  --reason "vendor-a is the upstream maintainer"
```
### 8.6 Grey Queue Summary
```bash
# Get grey queue summary
stella unknowns summary --grey
# Output:
# Grey Queue: 23 items
#
# By State:
#   PendingDeterminization: 15 (65%)
#   Disputed: 5 (22%)
#   GuardedPass: 3 (13%)
#
# Conflicts: 8 items have conflicts
# Avg. Triggers: 2.3 per item
# Oldest: 7 days
```
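The by-state breakdown can be recomputed from an exported grey queue, for example to compare snapshots over time. The sketch below assumes only that each item's observation state is available as a string; `grey_summary` is an illustrative helper, not part of the CLI.

```python
from collections import Counter


def grey_summary(states: list[str]) -> dict[str, tuple[int, int]]:
    """Map each observation state to (count, whole-number percentage of the queue)."""
    counts = Counter(states)
    total = len(states)
    return {state: (count, round(100 * count / total))
            for state, count in counts.most_common()}


# The same distribution as the sample summary output: 15, 5, and 3 of 23 items.
states = (["PendingDeterminization"] * 15 + ["Disputed"] * 5 + ["GuardedPass"] * 3)
print(grey_summary(states))
```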
### 8.7 Grey Queue Export
```bash
# Export grey queue for analysis
stella unknowns export --state grey --format json --output grey-queue.json
# Export with full fingerprints and triggers
stella unknowns export --state grey --verbose --output grey-full.json
# Export conflicts only
stella unknowns export --has-conflicts --format csv --output conflicts.csv
```
---
**Last Updated**: 2026-01-16
**Version**: 1.1.0
**Sprint**: SPRINT_20260112_010_CLI_unknowns_grey_queue_cli