StellaOps Bot da315965ff feat: Add operations runbooks and UI API models for Sprint 3500.0004.x

Operations documentation:
- docs/operations/reachability-runbook.md - Reachability troubleshooting guide
- docs/operations/unknowns-queue-runbook.md - Unknowns queue management guide

UI TypeScript models:
- src/Web/StellaOps.Web/src/app/core/api/proof.models.ts - Proof ledger types
- src/Web/StellaOps.Web/src/app/core/api/reachability.models.ts - Reachability types
- src/Web/StellaOps.Web/src/app/core/api/unknowns.models.ts - Unknowns queue types

Sprint: SPRINT_3500_0004_0002 (UI), SPRINT_3500_0004_0004 (Docs)

2025-12-20 22:22:09 +02:00

Reachability Analysis Operations Runbook

Version: 1.0.0
Sprint: 3500.0004.0004
Last Updated: 2025-12-20

This runbook covers operational procedures for Reachability Analysis, including call graph management, analysis troubleshooting, and explain queries.

Table of Contents

1. Overview
2. Call Graph Operations
3. Reachability Computation
4. Explain Queries
5. Troubleshooting
6. Monitoring & Alerting
7. Escalation Procedures

1. Overview

What is Reachability Analysis?

Reachability Analysis determines whether vulnerable code is actually reachable from application entrypoints. This reduces false positives by filtering out vulnerabilities in code that cannot be executed.

Reachability Statuses

Status	Confidence	Description
`UNREACHABLE`	High	No path from entrypoints to vulnerable code
`POSSIBLY_REACHABLE`	Medium	Path exists but contains heuristic edges
`REACHABLE_STATIC`	High	Static analysis proves path exists
`REACHABLE_PROVEN`	Very High	Runtime evidence confirms execution
`UNKNOWN`	Low	Insufficient data to determine
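
The status table can be read as a small decision procedure. The sketch below is one hedged interpretation of it, not the analyzer's actual logic: the function name, its parameters, and the edge-kind strings `static` and `heuristic` are assumptions made for illustration.

```python
from typing import Optional, Sequence


def classify(path_edge_kinds: Optional[Sequence[str]],
             runtime_confirmed: bool = False,
             graph_complete: bool = True) -> str:
    """Map path evidence to a reachability status (illustrative rules only).

    path_edge_kinds is None when no path was found; otherwise it lists the
    kind of each edge on the best path ("static" or "heuristic", assumed names).
    """
    if runtime_confirmed:
        return "REACHABLE_PROVEN"        # runtime evidence confirms execution
    if path_edge_kinds is None:
        # No path: only trust "unreachable" when the graph is believed complete.
        return "UNREACHABLE" if graph_complete else "UNKNOWN"
    if any(kind == "heuristic" for kind in path_edge_kinds):
        return "POSSIBLY_REACHABLE"      # path depends on a heuristic edge
    return "REACHABLE_STATIC"            # every edge is statically proven
```

For example, `classify(["static", "heuristic"])` yields `POSSIBLY_REACHABLE`, while `classify(None, graph_complete=False)` yields `UNKNOWN`.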

Key Components

Component	Purpose	Location
Call Graph Extractor	Language-specific CG extraction	Scanner Worker plugins
Call Graph Store	Persistent graph storage	`scanner.cg_node`, `scanner.cg_edge`
Reachability Analyzer	BFS pathfinding algorithm	Scanner Core library
Entrypoint Detector	Identifies application entrypoints	Language-specific plugins
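
To make the "BFS pathfinding" row concrete, here is a minimal sketch of breadth-first search from a set of entrypoints to a vulnerable node. It assumes the graph is available as an adjacency map of node id to callee ids; the function name and the `max_depth` semantics (maximum number of edges on a path) are illustrative and not taken from the Scanner Core library.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional


def shortest_path(edges: Dict[str, List[str]],
                  entrypoints: Iterable[str],
                  target: str,
                  max_depth: int = 10) -> Optional[List[str]]:
    """Breadth-first search from all entrypoints at once.

    Returns the shortest entrypoint-to-target path as a list of node ids,
    or None if the target is not reachable within max_depth edges.
    """
    parent: Dict[str, Optional[str]] = {e: None for e in entrypoints}
    queue = deque((e, 0) for e in parent)
    while queue:
        node, depth = queue.popleft()
        if node == target:
            path: List[str] = []
            cursor: Optional[str] = node
            while cursor is not None:      # walk parent links back to the entrypoint
                path.append(cursor)
                cursor = parent[cursor]
            return path[::-1]
        if depth >= max_depth:
            continue                       # do not expand beyond the depth limit
        for callee in edges.get(node, []):
            if callee not in parent:       # first visit is the shortest route
                parent[callee] = node
                queue.append((callee, depth + 1))
    return None
```

Because BFS visits nodes in order of distance, the first time the target is dequeued the recorded parent chain is a shortest path, which is why a depth limit bounds both runtime and the longest path that can be reported.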

Prerequisites

- Access to Scanner WebService API
- `scanner.reachability` OAuth scope
- CLI access with `stella` configured
- Language-specific workers deployed (dotnet, java, etc.)

2. Call Graph Operations

2.1 Call Graph Upload

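# Upload via API (--data-binary sends the file byte-for-byte, so the body matches the digest;
# plain -d would strip newlines from the file before sending)
curl -X POST "https://scanner.example.com/api/v1/scanner/scans/$SCAN_ID/callgraphs" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "Content-Digest: sha256=$(sha256sum callgraph.json | cut -d' ' -f1)" \
  --data-binary @callgraph.json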

# Upload via CLI
stella scan graph upload --scan-id $SCAN_ID --file callgraph.json

# Upload streaming NDJSON (for large graphs)
stella scan graph upload --scan-id $SCAN_ID \
  --file callgraph.ndjson \
  --format ndjson \
  --streaming
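
The `Content-Digest` value in the curl example is the hex SHA-256 of the uploaded file. A minimal Python sketch of the same computation follows; it assumes the `sha256=<hex>` form shown in that example is what the API expects, and the helper name is hypothetical.

```python
import hashlib


def content_digest_header(body: bytes) -> str:
    """Return the header value in the `sha256=<hex>` form used by the curl example."""
    return "sha256=" + hashlib.sha256(body).hexdigest()
```

The digest must be computed over the exact bytes that are sent as the request body, so a client would typically read the file once in binary mode and use the same buffer for both hashing and upload.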

2.2 Call Graph Inspection

# Get call graph summary
stella scan graph summary --scan-id $SCAN_ID

# Output:
# Nodes: 12,345
# Edges: 56,789
# Entrypoints: 42
# Languages: [dotnet, java]
# Size: 15.2 MB

# List entrypoints
stella scan graph entrypoints --scan-id $SCAN_ID

# Export full graph (for debugging)
stella scan graph export --scan-id $SCAN_ID --output graph.json

# Visualize subgraph (requires GraphViz)
stella scan graph visualize --scan-id $SCAN_ID \
  --node sha256:node123... \
  --depth 3 \
  --output subgraph.svg

2.3 Call Graph Validation

# Validate graph structure
stella scan graph validate --scan-id $SCAN_ID

# Checks performed:
# - All edge targets exist as nodes
# - Entrypoints reference valid nodes
# - No orphan nodes
# - No cycles in entrypoint definitions
# - Schema compliance

# Validate before upload
stella scan graph validate --file callgraph.json --strict
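
The structural checks listed above can be approximated locally before an upload. The sketch below is an illustration under assumptions, not the CLI's implementation: it assumes a JSON shape with top-level `nodes` (objects with an `id`), `edges` (objects with `from` and `to`), and `entrypoints` (a list of node ids). It covers dangling edges, invalid entrypoints, and orphan nodes, and leaves out schema compliance and entrypoint cycle checks.

```python
from typing import Dict, List


def validate_graph(graph: Dict) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    node_ids = {n["id"] for n in graph.get("nodes", [])}
    edges = graph.get("edges", [])
    entrypoints = graph.get("entrypoints", [])
    problems: List[str] = []

    # Every edge endpoint must exist as a node.
    for e in edges:
        for end in ("from", "to"):
            if e[end] not in node_ids:
                problems.append(f"edge {e['from']} -> {e['to']}: unknown {end} node")

    # Entrypoints must reference valid nodes.
    for ep in entrypoints:
        if ep not in node_ids:
            problems.append(f"entrypoint {ep}: unknown node")

    # Orphans: nodes that appear in no edge and are not entrypoints.
    connected = {e["from"] for e in edges} | {e["to"] for e in edges} | set(entrypoints)
    for orphan in sorted(node_ids - connected):
        problems.append(f"orphan node {orphan}")
    return problems
```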

2.4 Call Graph Merging

When multiple language workers produce graphs, the per-language graphs are merged into a single graph for the scan:

# View merge status
stella scan graph merges --scan-id $SCAN_ID

# Output:
# Language   | Nodes  | Edges  | Status
# dotnet     | 8,234  | 34,567 | merged
# java       | 4,111  | 22,222 | merged
# Total      | 12,345 | 56,789 | complete

# Force re-merge (after fix)
stella scan graph merge --scan-id $SCAN_ID --force
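
Conceptually, a merge is a union of the per-language graphs. The sketch below shows one simple way to do that, assuming nodes carry a globally unique `id` and edges are identified by their `(from, to)` pair; the real merge may also bridge cross-language calls, which this illustration does not attempt.

```python
from typing import Dict, Iterable


def merge_graphs(graphs: Iterable[Dict]) -> Dict:
    """Union nodes by id and edges by (from, to); the first occurrence wins."""
    nodes: Dict[str, Dict] = {}
    edges: Dict[tuple, Dict] = {}
    for g in graphs:
        for n in g.get("nodes", []):
            nodes.setdefault(n["id"], n)
        for e in g.get("edges", []):
            edges.setdefault((e["from"], e["to"]), e)
    return {"nodes": list(nodes.values()), "edges": list(edges.values())}
```

When the inputs do not overlap, the merged totals equal the sums of the per-language counts, as in the sample output above.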

3. Reachability Computation

3.1 Triggering Computation

# Trigger via API
curl -X POST "https://scanner.example.com/api/v1/scanner/scans/$SCAN_ID/reachability/compute" \
  -H "Authorization: Bearer $TOKEN"

# Trigger via CLI
stella reachability compute --scan-id $SCAN_ID

# Trigger with options
stella reachability compute --scan-id $SCAN_ID \
  --max-depth 20 \
  --indirect-resolution conservative \
  --timeout 300s

3.2 Computation Options

Option	Default	Description
`max-depth`	10	Maximum path length to explore
`indirect-resolution`	`conservative`	How to handle indirect calls: `conservative`, `aggressive`, `skip`
`timeout`	300s	Maximum computation time
`parallel`	true	Parallel BFS from multiple entrypoints
`include-runtime`	true	Merge runtime evidence if available

3.3 Job Monitoring

# Check job status
stella reachability job-status --job-id reachability-job-001

# Output:
# Status: running
# Progress: 67% (8,234 / 12,345 nodes visited)
# Started: 2025-12-20T10:00:00Z
# Estimated completion: 2025-12-20T10:02:30Z

# Stream job logs
stella reachability job-logs --job-id reachability-job-001 --follow

# Cancel running job
stella reachability job-cancel --job-id reachability-job-001

3.4 Computation Results

# Get summary
stella reachability summary --scan-id $SCAN_ID

# Output:
# Total vulnerabilities: 45
# Unreachable: 38 (84%)
# Possibly reachable: 4 (9%)
# Reachable (static): 2 (4%)
# Reachable (proven): 1 (2%)
# Unknown: 0 (0%)

# Get detailed findings
stella reachability findings --scan-id $SCAN_ID --format json

# Filter by status
stella reachability findings --scan-id $SCAN_ID --status REACHABLE_STATIC

# Export for CI gate
stella reachability findings --scan-id $SCAN_ID \
  --status REACHABLE_STATIC,REACHABLE_PROVEN \
  --format sarif \
  --output findings.sarif
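
A CI gate built on these findings usually reduces to "fail the build if anything is statically or provably reachable". The sketch below assumes the JSON findings are a list of objects with a `status` field holding one of the statuses from section 1; that shape and the function names are assumptions, so adjust them to the actual output.

```python
from typing import Dict, Iterable, List

# Statuses that should block a release in this illustrative policy.
BLOCKING = {"REACHABLE_STATIC", "REACHABLE_PROVEN"}


def blocking_findings(findings: Iterable[Dict]) -> List[Dict]:
    """Return only the findings whose status should fail the gate."""
    return [f for f in findings if f.get("status") in BLOCKING]


def gate_exit_code(findings: Iterable[Dict]) -> int:
    """0 when the gate passes, 1 when at least one blocking finding exists."""
    return 1 if blocking_findings(findings) else 0
```

In a pipeline, the findings would be loaded with `json.load` and the result passed to `sys.exit`, so that a non-zero exit code stops the job.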

4. Explain Queries

4.1 Explain Single Finding

# Via API
curl "https://scanner.example.com/api/v1/scanner/scans/$SCAN_ID/reachability/explain?cve=CVE-2024-1234&purl=pkg:npm/lodash@4.17.20" \
  -H "Authorization: Bearer $TOKEN"

# Via CLI
stella reachability explain --scan-id $SCAN_ID \
  --cve CVE-2024-1234 \
  --purl "pkg:npm/lodash@4.17.20"

# Output:
# Status: REACHABLE_STATIC
# Confidence: 0.70
# 
# Shortest Path (depth=3):
# [0] MyApp.Controllers.OrdersController::Get(Guid)
#     Entrypoint: HTTP GET /api/orders/{id}
# [1] MyApp.Services.OrderService::Process(Order)
#     Edge: static (direct_call)
# [2] Lodash.merge(Object, Object) [VULNERABLE]
#     Edge: static (direct_call)
# 
# Why Reachable:
# - Static call path exists from HTTP entrypoint
# - All edges are statically proven
# - Vulnerable function is directly invoked

4.2 Explain with Alternatives

# Show all paths (not just shortest)
stella reachability explain --scan-id $SCAN_ID \
  --cve CVE-2024-1234 \
  --purl "pkg:npm/lodash@4.17.20" \
  --all-paths

# Output includes:
# Alternative paths found: 3
# Path 1 (depth=3): ... [shown above]
# Path 2 (depth=5): Controllers.UserController -> ... -> Lodash.merge
# Path 3 (depth=7): Background.JobProcessor -> ... -> Lodash.merge
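
Listing alternative paths is a different problem from finding the shortest one: it requires enumerating simple paths rather than stopping at the first hit. The following sketch uses a depth-limited depth-first search over an adjacency map; it is illustrative only, and the function name and `max_depth` meaning (maximum edges per path) are assumptions.

```python
from typing import Dict, List


def all_paths(edges: Dict[str, List[str]],
              entrypoints: List[str],
              target: str,
              max_depth: int = 10) -> List[List[str]]:
    """Enumerate simple paths from any entrypoint to target, shortest first."""
    found: List[List[str]] = []

    def walk(node: str, path: List[str]) -> None:
        if node == target:
            found.append(path)
            return
        if len(path) - 1 >= max_depth:     # path already uses max_depth edges
            return
        for callee in edges.get(node, []):
            if callee not in path:         # keep paths simple (no repeated nodes)
                walk(callee, path + [callee])

    for entry in entrypoints:
        walk(entry, [entry])
    return sorted(found, key=len)
```

The number of simple paths can grow exponentially with graph size, so a depth limit (and, in practice, a cap on the number of results) keeps this kind of query bounded.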

4.3 Why Unreachable

# Explain why vulnerability is unreachable
stella reachability explain --scan-id $SCAN_ID \
  --cve CVE-2024-5678 \
  --purl "pkg:npm/unused-lib@1.0.0"

# Output:
# Status: UNREACHABLE
# Confidence: 0.95
# 
# Why Unreachable:
# - No path found from any entrypoint
# - Vulnerable function: UnusedLib.dangerousMethod()
# - Function visibility: private
# - Callers found: 0
# - Dead code analysis: likely dead code

4.4 Batch Explain

# Export all reachability explanations
stella reachability explain-all --scan-id $SCAN_ID \
  --output explanations.json

# Explain only reachable findings
stella reachability explain-all --scan-id $SCAN_ID \
  --status REACHABLE_STATIC,REACHABLE_PROVEN \
  --output reachable-explanations.json

5. Troubleshooting

5.1 Call Graph Too Large

Symptom: Upload fails with "413 Payload Too Large".

Diagnosis:

# Check graph size
du -h callgraph.json
wc -l callgraph.json

# Count nodes/edges
jq '.nodes | length' callgraph.json
jq '.edges | length' callgraph.json

Resolution:

# Option 1: Use streaming upload
stella scan graph upload --scan-id $SCAN_ID \
  --file callgraph.json \
  --streaming

# Option 2: Convert to NDJSON
stella scan graph convert --input callgraph.json \
  --output callgraph.ndjson \
  --format ndjson

# Option 3: Partition by artifact
stella scan graph partition --input callgraph.json \
  --output-dir ./partitions/ \
  --by artifact
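
NDJSON helps with large graphs because each record is an independent line that can be parsed and sent without holding the whole document in memory. The sketch below shows the general idea; the per-line record shape (a `record` field set to `node` or `edge`) is an assumption for illustration, and the format actually accepted is whatever `stella scan graph convert` emits.

```python
import io
import json
from typing import Dict, TextIO


def write_ndjson(graph: Dict, out: TextIO) -> int:
    """Write one JSON object per line; returns the number of records written."""
    count = 0
    for record, key in (("node", "nodes"), ("edge", "edges")):
        for item in graph.get(key, []):
            # "record" is a hypothetical discriminator field, not a documented one.
            out.write(json.dumps({"record": record, **item}, sort_keys=True) + "\n")
            count += 1
    return count
```

A quick check with an in-memory buffer: `write_ndjson(graph, io.StringIO())` returns the combined number of nodes and edges.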

5.2 Missing Entrypoints

Symptom: "No entrypoints found" warning.

Diagnosis:

# Check entrypoint detection
stella scan graph entrypoints --scan-id $SCAN_ID --verbose

# Check for framework detection
stella scan graph detect-framework --scan-id $SCAN_ID

Common causes:

- Framework not detected: Add framework hints
- Custom entrypoints: Manually specify
- Wrong language worker: Check artifact analysis

Resolution:

# Specify framework explicitly
stella scan graph upload --scan-id $SCAN_ID \
  --file callgraph.json \
  --framework aspnetcore

# Add custom entrypoints
stella scan graph entrypoint add --scan-id $SCAN_ID \
  --node sha256:node123... \
  --kind http \
  --route "/api/custom"

5.3 Reachability Computation Timeout

Symptom: Job fails with "computation timeout".

Diagnosis:

# Check computation stats
stella reachability job-stats --job-id reachability-job-001

# Output:
# Nodes visited: 500,000
# Edges traversed: 2,500,000
# Time elapsed: 300s
# Memory used: 4.2 GB

Resolution:

# Option 1: Increase timeout
stella reachability compute --scan-id $SCAN_ID --timeout 600s

# Option 2: Reduce depth
stella reachability compute --scan-id $SCAN_ID --max-depth 5

# Option 3: Skip indirect calls
stella reachability compute --scan-id $SCAN_ID --indirect-resolution skip

# Option 4: Partition analysis
stella reachability compute --scan-id $SCAN_ID --partition-by artifact

5.4 Inconsistent Results

Symptom: Different results between runs.

Diagnosis:

# Check determinism settings
stella scan manifest --scan-id $SCAN_ID | jq '.deterministic, .seed'

# Compare graph hashes
stella scan graph hash --scan-id $SCAN_ID

Resolution:

# Ensure deterministic mode
stella reachability compute --scan-id $SCAN_ID \
  --deterministic \
  --seed "AQIDBA=="  # Fixed seed

# Use same graph version
stella reachability compute --scan-id $SCAN_ID \
  --graph-digest sha256:cg123...
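
Comparing graph hashes only helps if the hash is independent of incidental ordering. The sketch below shows one way to derive an order-insensitive digest by canonicalizing the graph before hashing. It is not a description of how `stella scan graph hash` works: the JSON shape (`nodes` with `id`, `edges` with `from` and `to`) and the `sha256:` prefix are assumptions based on the examples in this runbook.

```python
import hashlib
import json
from typing import Dict


def graph_digest(graph: Dict) -> str:
    """Hash a canonical form of the graph so that ordering does not matter."""
    nodes = sorted(graph.get("nodes", []), key=lambda n: n["id"])
    edges = sorted(graph.get("edges", []), key=lambda e: (e["from"], e["to"]))
    canonical = json.dumps({"nodes": nodes, "edges": edges},
                           sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two uploads of the same graph with nodes or edges listed in a different order then produce the same digest, so a digest mismatch between runs points to a real difference in graph content.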

5.5 False Positives/Negatives

Symptom: Reachability verdict seems incorrect.

Diagnosis:

# Get detailed explanation
stella reachability explain --scan-id $SCAN_ID \
  --cve CVE-2024-1234 \
  --purl "pkg:npm/lodash@4.17.20" \
  --verbose

# Check edge confidence
stella scan graph edge --scan-id $SCAN_ID \
  --from sha256:nodeA... \
  --to sha256:nodeB...

Common causes for false positives:

- Heuristic edges: Indirect call resolution too aggressive
- Reflection/dynamic calls: May create false paths
- Dead code not detected: Code exists but never executes

Common causes for false negatives:

- Missing edges: Call graph incomplete
- Indirect calls skipped: Resolution too conservative
- Cross-language calls: Language boundary not bridged

Resolution:

# Set the indirect call resolution mode explicitly (conservative, aggressive, or skip)
stella reachability compute --scan-id $SCAN_ID \
  --indirect-resolution conservative

# Add runtime evidence
stella scan evidence upload --scan-id $SCAN_ID \
  --file runtime-trace.json

# Report false positive/negative for ML training
stella reachability feedback --scan-id $SCAN_ID \
  --cve CVE-2024-1234 \
  --verdict false-positive \
  --reason "Dead code - feature flag disabled"

6. Monitoring & Alerting

6.1 Key Metrics

Metric	Description	Alert Threshold
`callgraph_upload_duration_seconds`	Time to upload call graph	> 60s
`callgraph_size_bytes`	Size of uploaded graphs	> 200MB
`reachability_computation_duration_seconds`	Time to compute reachability	> 300s
`reachability_nodes_visited`	Nodes visited during BFS	> 1M
`reachability_job_failures_total`	Failed computation jobs	> 0/hour
`entrypoint_detection_rate`	% of scans with entrypoints	< 90%

6.2 Grafana Dashboard

Dashboard: Reachability Operations
Panels:
- Call graph upload throughput
- Graph size distribution
- Computation duration (p50, p95, p99)
- Reachability verdict distribution
- Job queue depth
- Entrypoint detection rate

6.3 Alerting Rules

groups:
  - name: reachability
    rules:
      - alert: ReachabilityComputationSlow
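        # histogram_quantile needs per-bucket rates; assumes the metric is exported as a histogram
        expr: histogram_quantile(0.95, sum by (le) (rate(reachability_computation_duration_seconds_bucket[5m]))) > 300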
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Reachability computation is slow"
          
      - alert: ReachabilityJobFailures
        expr: increase(reachability_job_failures_total[1h]) > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Multiple reachability job failures"
          
      - alert: LowEntrypointDetectionRate
        expr: entrypoint_detection_rate < 0.8
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Entrypoint detection rate is low"

7. Escalation Procedures

7.1 Escalation Matrix

Severity	Condition	Response Time	Escalation Path
P1	Reachability failing for all scans	15 min	On-call → Team Lead
P2	Computation failures > 20%	1 hour	On-call → Team Lead
P3	Computation latency > 600s p95	4 hours	On-call
P4	Entrypoint detection < 70%	24 hours	Ticket
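
The matrix can be applied mechanically by checking the conditions from most to least severe. The sketch below encodes the four rows as written; the function and parameter names are hypothetical, rates are expressed as fractions between 0 and 1, and latency is in seconds.

```python
from typing import Optional


def severity(all_scans_failing: bool,
             failure_rate: float,
             p95_latency_s: float,
             entrypoint_detection_rate: float) -> Optional[str]:
    """Return the highest matching severity from the escalation matrix, or None."""
    if all_scans_failing:
        return "P1"                      # reachability failing for all scans
    if failure_rate > 0.20:
        return "P2"                      # computation failures > 20%
    if p95_latency_s > 600:
        return "P3"                      # p95 computation latency > 600s
    if entrypoint_detection_rate < 0.70:
        return "P4"                      # entrypoint detection < 70%
    return None
```

Evaluating the rows in this order means that an incident matching several conditions is always handled at the most urgent level.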

7.2 P1 Response Procedure

1. Acknowledge alert

2. Triage:

# Check worker health
stella scanner workers status

# Check graph store connectivity
stella health check --service graph-store

# Check recent failures
stella reachability jobs --status failed --last 10

3. Mitigate:

# Scale up workers if queue backlog
kubectl scale deployment scanner-worker --replicas=10

# Clear stuck jobs
stella reachability jobs cancel --status stuck

4. Communicate: Update status page
5. Resolve: Fix root cause
6. Postmortem: Document within 48 hours
