Reachability Drift Detection - Operations Guide

Module: Scanner | Version: 1.0 | Last Updated: 2025-12-22


1. Prerequisites

1.1 Infrastructure Requirements

| Component | Minimum | Recommended | Notes |
|-----------|---------|-------------|-------|
| CPU | 4 cores | 8 cores | For call graph extraction |
| Memory | 4 GB | 8 GB | Large projects need more |
| PostgreSQL | 16+ | 16+ | With RLS enabled |
| Valkey/Redis | 7.0+ | 7.0+ | For caching (optional) |
| .NET Runtime | 10.0 | 10.0 | Preview features enabled |

1.2 Network Requirements

| Direction | Endpoints | Notes |
|-----------|-----------|-------|
| Inbound | Scanner API (8080) | Load balancer health checks |
| Outbound | PostgreSQL (5432) | Database connections |
| Outbound | Valkey (6379) | Cache connections (optional) |
| Outbound | Signer service | For DSSE attestations |

1.3 Dependencies

  • Scanner WebService deployed and healthy
  • PostgreSQL database with Scanner schema migrations applied
  • (Optional) Valkey cluster for caching
  • (Optional) Signer service for attestation signing

2. Configuration

2.1 Scanner Service Configuration

File: etc/scanner.yaml

scanner:
  reachability:
    # Enable reachability drift detection
    enabled: true

    # Languages to analyze (empty = all supported)
    languages:
      - dotnet
      - java
      - node
      - python
      - go

    # Call graph extraction options
    extraction:
      max_depth: 100
      max_nodes: 100000
      timeout_seconds: 300
      include_test_code: false
      include_vendored: false

    # Drift detection options
    drift:
      # Auto-compute on scan completion
      auto_compute: true
      # Base scan selection (previous, tagged, specific)
      base_selection: previous
      # Emit VEX candidates for unreachable sinks
      emit_vex_candidates: true

  storage:
    postgres:
      connection_string: "Host=localhost;Database=stellaops;Username=scanner;Password=${SCANNER_DB_PASSWORD}"
      schema: scanner
      pool_size: 20

  cache:
    valkey:
      enabled: true
      connection: "localhost:6379"
      bucket: "stella-callgraph"
      ttl_hours: 24
      circuit_breaker:
        failure_threshold: 5
        timeout_seconds: 30
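
The extraction and drift options above can be pictured with a small model. The following is a minimal sketch, not the Scanner's implementation: it assumes the call graph is a plain adjacency list, that reachability is a breadth-first search from entrypoints to sinks, and that drift is the set difference between the base and head results. The graph contents, symbol names, and function names are hypothetical, and `max_depth` is applied during traversal here purely for illustration.

```python
from collections import deque


def reachable_sinks(graph, entrypoints, sinks, max_depth=100):
    """Breadth-first search from entrypoints; returns sinks reached within max_depth edges."""
    seen = set(entrypoints)
    queue = deque((node, 0) for node in entrypoints)
    found = set()
    while queue:
        node, depth = queue.popleft()
        if node in sinks:
            found.add(node)
        if depth >= max_depth:
            continue  # do not expand callees beyond the depth cap
        for callee in graph.get(node, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append((callee, depth + 1))
    return found


def compute_drift(base_reachable, head_reachable):
    """Drift is the difference between the two reachable-sink sets."""
    return {
        "newly_reachable": sorted(head_reachable - base_reachable),
        "newly_unreachable": sorted(base_reachable - head_reachable),
    }


# Hypothetical call graphs for a base scan and a head scan.
base_graph = {"Api.Get": ["Svc.Load"], "Svc.Load": []}
head_graph = {"Api.Get": ["Svc.Load"], "Svc.Load": ["Yaml.Deserialize"]}
sinks = {"Yaml.Deserialize"}

base = reachable_sinks(base_graph, ["Api.Get"], sinks)
head = reachable_sinks(head_graph, ["Api.Get"], sinks)
drift = compute_drift(base, head)
print(drift)  # {'newly_reachable': ['Yaml.Deserialize'], 'newly_unreachable': []}
```

In this model, lowering `max_depth` shrinks the explored graph at the cost of possibly missing deep paths, which is the trade-off to keep in mind when tuning the extraction limits.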

2.2 Valkey Cache Configuration

# Valkey-specific settings
cache:
  valkey:
    enabled: true
    connection: "valkey-cluster.internal:6379"
    bucket: "stella-callgraph"
    ttl_hours: 24

    # Circuit breaker prevents cache storms
    circuit_breaker:
      failure_threshold: 5
      timeout_seconds: 30
      half_open_max_attempts: 3

    # Compression reduces memory usage
    compression:
      enabled: true
      algorithm: gzip
      level: fastest
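
To illustrate how the three circuit breaker settings might interact, here is a minimal sketch of a closed/open/half-open state machine. It is an assumption about the semantics, not the Scanner's code: `timeout_seconds` is taken to be how long the breaker stays open, and `half_open_max_attempts` the number of probe requests allowed before a success or failure is recorded. The class and method names are hypothetical.

```python
import time


class CircuitBreaker:
    """Illustrative breaker: closed -> open after N failures -> half-open after a timeout."""

    def __init__(self, failure_threshold=5, timeout_seconds=30,
                 half_open_max_attempts=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.timeout_seconds = timeout_seconds
        self.half_open_max_attempts = half_open_max_attempts
        self._clock = clock  # injectable so the behavior can be tested without sleeping
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0
        self._probes = 0

    def allow_request(self):
        """Return True if a cache call may be attempted right now."""
        if self.state == "open":
            if self._clock() - self._opened_at < self.timeout_seconds:
                return False  # still cooling down; skip the cache
            self.state = "half_open"
            self._probes = 0
        if self.state == "half_open":
            if self._probes >= self.half_open_max_attempts:
                return False  # probe budget used up; wait for an outcome
            self._probes += 1
        return True

    def record_success(self):
        self.state = "closed"
        self._failures = 0

    def record_failure(self):
        if self.state == "half_open":
            self._trip()  # a failed probe reopens the breaker
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._trip()

    def _trip(self):
        self.state = "open"
        self._opened_at = self._clock()
        self._failures = 0


breaker = CircuitBreaker(failure_threshold=5, timeout_seconds=30, half_open_max_attempts=3)
for _ in range(5):
    breaker.record_failure()
print(breaker.state)            # open
print(breaker.allow_request())  # False
```

While the breaker is open, callers would fall back to recomputing or reading from the database rather than waiting on an unhealthy cache.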

2.3 Policy Gate Configuration

File: etc/policy.yaml

smart_diff:
  gates:
    # Block on KEV becoming reachable
    - id: drift_block_kev
      condition: "delta_reachable > 0 AND is_kev = true"
      action: block
      severity: critical
      message: "Known Exploited Vulnerability now reachable"

    # Block on high-severity sink becoming reachable
    - id: drift_block_critical
      condition: "delta_reachable > 0 AND max_cvss >= 9.0"
      action: block
      severity: critical
      message: "Critical vulnerability now reachable"

    # Warn on any new reachable paths
    - id: drift_warn_new_paths
      condition: "delta_reachable > 0"
      action: warn
      severity: medium
      message: "New reachable paths detected"

    # Auto-allow mitigated paths
    - id: drift_allow_mitigated
      condition: "delta_unreachable > 0 AND delta_reachable = 0"
      action: allow
      auto_approve: true

3. Deployment Modes

3.1 Standalone Deployment

# Run Scanner WebService with drift detection
docker run -d \
  --name scanner \
  -p 8080:8080 \
  -e SCANNER_DB_PASSWORD=secret \
  -v /etc/scanner:/etc/scanner:ro \
  stellaops/scanner:latest

# Verify health
curl http://localhost:8080/health

3.2 Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: scanner
  namespace: stellaops
spec:
  replicas: 3
  selector:
    matchLabels:
      app: scanner
  template:
    metadata:
      labels:
        app: scanner
    spec:
      containers:
        - name: scanner
          image: stellaops/scanner:latest
          ports:
            - containerPort: 8080
          env:
            - name: SCANNER_DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: scanner-secrets
                  key: db-password
          volumeMounts:
            - name: config
              mountPath: /etc/scanner
              readOnly: true
          resources:
            requests:
              memory: "4Gi"
              cpu: "2"
            limits:
              memory: "8Gi"
              cpu: "4"
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
      volumes:
        - name: config
          configMap:
            name: scanner-config

3.3 Air-Gapped Deployment

For air-gapped environments:

  1. Disable external lookups:

    scanner:
      reachability:
        offline_mode: true
        # No external advisory fetching
    
  2. Pre-load call graph caches:

    # Export from connected environment
    stella cache export --type callgraph --output graphs.tar.gz
    
    # Import in air-gapped environment
    stella cache import --input graphs.tar.gz
    
  3. Use local VEX sources:

    excititor:
      sources:
        - type: local
          path: /data/vex-bundles/
    

4. Monitoring & Metrics

4.1 Key Metrics

| Metric | Type | Description | Alert Threshold |
|--------|------|-------------|-----------------|
| scanner_callgraph_extraction_duration_seconds | histogram | Time to extract call graph | p99 > 300s |
| scanner_callgraph_node_count | gauge | Nodes in extracted graph | > 100,000 |
| scanner_reachability_analysis_duration_seconds | histogram | BFS analysis time | p99 > 30s |
| scanner_drift_newly_reachable_total | counter | Count of newly reachable sinks | > 0 (alert) |
| scanner_drift_newly_unreachable_total | counter | Count of mitigated sinks | (info) |
| scanner_cache_hit_ratio | gauge | Valkey cache hit rate | < 0.5 |
| scanner_cache_circuit_breaker_open | gauge | Circuit breaker state | = 1 (alert) |

4.2 Grafana Dashboard

Import dashboard JSON from: deploy/grafana/scanner-drift-dashboard.json

Key panels:

  • Drift detection rate over time
  • Newly reachable sinks by category
  • Call graph extraction latency
  • Cache hit/miss ratio
  • Circuit breaker state

4.3 Alert Rules

# Prometheus alerting rules
groups:
  - name: scanner-drift
    rules:
      - alert: KevBecameReachable
        expr: increase(scanner_drift_kev_reachable_total[5m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "KEV vulnerability became reachable"
          description: "A Known Exploited Vulnerability is now reachable from public entrypoints"

      - alert: HighDriftRate
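        # rate() is per-second: this fires only above 10 newly reachable sinks per second,
        # averaged over 1h. For a per-hour count, use increase(...[1h]) instead.
        expr: rate(scanner_drift_newly_reachable_total[1h]) > 10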
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High rate of new reachable vulnerabilities"

      - alert: CacheCircuitOpen
        expr: scanner_cache_circuit_breaker_open == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Valkey cache circuit breaker is open"

5. Troubleshooting

5.1 Call Graph Extraction Failures

Symptom: GRAPH_NOT_EXTRACTED error

Causes & Solutions:

| Cause | Solution |
|-------|----------|
| Missing SDK/runtime | Install required SDK (.NET, Node.js, JDK) |
| Build errors in project | Fix compilation errors first |
| Timeout exceeded | Increase extraction.timeout_seconds |
| Memory exhaustion | Increase container memory limits |
| Unsupported language | Check language support matrix |

Debugging:

# Check extraction logs
kubectl logs -f deployment/scanner | grep -i extraction

# Manual extraction test
stella scan callgraph \
  --project /path/to/project \
  --language dotnet \
  --verbose

5.2 Drift Detection Issues

Symptom: Drift not computed or incorrect results

Causes & Solutions:

| Cause | Solution |
|-------|----------|
| No base scan available | Ensure a previous scan exists |
| Different languages | Base and head scans must use the same language |
| Graph digest unchanged | No material code changes detected |
| Cache stale | Clear the Valkey cache for the scan |

Debugging:

# Check drift computation status
curl "http://scanner:8080/api/scanner/scans/{scanId}/drift"

# Force recomputation
curl -X POST \
  -H "Content-Type: application/json" \
  "http://scanner:8080/api/scanner/scans/{scanId}/compute-reachability" \
  -d '{"forceRecompute": true}'

# View graph digests
psql -c "SELECT scan_id, graph_digest FROM scanner.call_graph_snapshots ORDER BY extracted_at DESC LIMIT 10"
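
When recomputation needs to be scripted, the same request can be built with the Python standard library. This is a minimal sketch: the endpoint path and JSON body come from the curl example above, the Bearer header follows the authentication requirement in section 8.2, and the function name, scan ID, and token value are hypothetical placeholders.

```python
import json
import urllib.request
from urllib.parse import quote


def build_recompute_request(base_url, scan_id, token):
    """Build (but do not send) a POST that forces reachability recomputation."""
    url = f"{base_url}/api/scanner/scans/{quote(scan_id, safe='')}/compute-reachability"
    body = json.dumps({"forceRecompute": True}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # token with a suitable scanner scope
        },
    )


# Hypothetical scan ID and token, for illustration only.
req = build_recompute_request("http://scanner:8080", "scan-123", "example-token")
print(req.get_method(), req.full_url)
# POST http://scanner:8080/api/scanner/scans/scan-123/compute-reachability

# To send it against a live Scanner:
#   with urllib.request.urlopen(req, timeout=30) as resp:
#       print(resp.status)
```

Separating request construction from sending keeps the sketch runnable without a Scanner instance and makes the headers easy to inspect.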

5.3 Cache Problems

Symptom: Slow performance, cache misses, circuit breaker open

Solutions:

# Check Valkey connectivity
redis-cli -h valkey.internal ping

# Check circuit breaker state
curl "http://scanner:8080/health/ready" | jq '.checks.cache'

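# Clear specific scan cache (DEL does not expand wildcards, so list keys with SCAN first)
# Replace {scanId} with the scan's ID
redis-cli -h valkey.internal --scan --pattern "stella-callgraph:{scanId}:*" \
  | xargs -r redis-cli -h valkey.internal DEL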

# Reset circuit breaker (restart scanner)
kubectl rollout restart deployment/scanner
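
Clearing the cache for one scan means enumerating keys first, because the `DEL` command takes literal key names and does not expand patterns. The sketch below shows the scan-then-delete loop. It assumes a key layout of `stella-callgraph:<scanId>:*` and a client exposing `scan_iter(match=..., count=...)` and `delete(*names)`, as redis-py clients do. `FakeClient` is a hypothetical in-memory stand-in so the example runs without a Valkey server.

```python
import fnmatch


def clear_scan_cache(client, scan_id, bucket="stella-callgraph", batch_size=500):
    """Delete every cached key for one scan; returns the number of keys removed."""
    pattern = f"{bucket}:{scan_id}:*"
    deleted = 0
    batch = []
    for key in client.scan_iter(match=pattern, count=batch_size):
        batch.append(key)
        if len(batch) >= batch_size:
            deleted += client.delete(*batch)  # delete in batches to bound command size
            batch.clear()
    if batch:
        deleted += client.delete(*batch)
    return deleted


class FakeClient:
    """In-memory stand-in exposing only the two methods used above."""

    def __init__(self, keys):
        self.keys = set(keys)

    def scan_iter(self, match=None, count=None):
        # Snapshot matching keys so deletion during iteration is safe here.
        return iter([k for k in sorted(self.keys) if fnmatch.fnmatchcase(k, match)])

    def delete(self, *names):
        removed = [n for n in names if n in self.keys]
        self.keys.difference_update(removed)
        return len(removed)


client = FakeClient([
    "stella-callgraph:scan-1:graph",
    "stella-callgraph:scan-1:reachability",
    "stella-callgraph:scan-2:graph",
])
print(clear_scan_cache(client, "scan-1"))  # 2
print(sorted(client.keys))                 # ['stella-callgraph:scan-2:graph']
```

With a real deployment, `client` would be a connected redis-py client pointed at the Valkey endpoint; the sketch has only been exercised against the stand-in.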

5.4 Common Error Messages

| Error | Meaning | Action |
|-------|---------|--------|
| ERR_GRAPH_TOO_LARGE | Graph has more than 100K nodes (max_nodes limit) | Increase max_nodes or split project |
| ERR_EXTRACTION_TIMEOUT | Analysis timed out | Increase timeout or reduce scope |
| ERR_NO_ENTRYPOINTS | No public entrypoints found | Check framework detection |
| ERR_BASE_SCAN_MISSING | Base scan not found | Specify valid baseScanId |
| ERR_CACHE_UNAVAILABLE | Valkey unreachable | Check network; the circuit breaker will activate |

6. Performance Tuning

6.1 Call Graph Extraction

scanner:
  reachability:
    extraction:
      # Exclude test code (reduces graph size)
      include_test_code: false

      # Exclude vendored dependencies
      include_vendored: false

      # Limit analysis depth
      max_depth: 50  # Default: 100

      # Parallel project analysis
      parallelism: 4

6.2 Caching Strategy

cache:
  valkey:
    # Longer TTL for stable projects
    ttl_hours: 72

    # Aggressive compression for large graphs
    compression:
      level: optimal  # vs 'fastest'

    # Larger connection pool
    pool_size: 20

6.3 Database Optimization

-- Ensure indexes exist
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_callgraph_scan_lang
  ON scanner.call_graph_snapshots(scan_id, language);

CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_drift_head_scan
  ON scanner.reachability_drift_results(head_scan_id);

-- Vacuum after large imports
VACUUM ANALYZE scanner.call_graph_snapshots;
VACUUM ANALYZE scanner.reachability_drift_results;

7. Backup & Recovery

7.1 Database Backup

# Backup drift-related tables
pg_dump -h postgres.internal -U stellaops \
  -t scanner.call_graph_snapshots \
  -t scanner.reachability_results \
  -t scanner.reachability_drift_results \
  -t scanner.drifted_sinks \
  -t scanner.code_changes \
  > scanner_drift_backup.sql

7.2 Cache Recovery

# Export cache to file (if needed)
redis-cli -h valkey.internal --rdb /backup/callgraph-cache.rdb

# Cache is ephemeral - can be regenerated from database
# Recompute after cache loss:
stella scan recompute-reachability --all-pending

8. Security Considerations

8.1 Database Access

  • Scanner service uses dedicated PostgreSQL user with schema-limited permissions
  • Row-Level Security (RLS) enforces tenant isolation
  • Connection strings use secrets management (not plaintext)

8.2 API Authentication

  • All drift endpoints require valid Bearer token
  • Scopes: scanner:read, scanner:write, scanner:admin
  • Rate limiting prevents abuse

8.3 Attestation Signing

  • Drift results can be DSSE-signed for audit trails
  • Signing keys managed by Signer service
  • Optional Rekor transparency logging

9. References

  • Architecture: docs/modules/scanner/reachability-drift.md
  • API Reference: docs/api/scanner-drift-api.md
  • PostgreSQL Guide: docs/operations/postgresql-guide.md
  • Air-Gap Operations: docs/operations/airgap-operations-runbook.md
  • Reachability Runbook: docs/operations/reachability-runbook.md