Refactor code structure for improved readability and maintainability; optimize performance in key functions.

2025-12-22 19:06:31 +02:00
parent dfaa2079aa
commit 4602ccc3a3
1444 changed files with 109919 additions and 8058 deletions
--- a/docs/operations/reachability-drift-guide.md
+++ b/docs/operations/reachability-drift-guide.md
@@ -1,4 +1,4 @@
-# Reachability Drift Detection - Operations Guide
+# Reachability Drift Detection - Operations Guide

 **Module:** Scanner
 **Version:** 1.0
@@ -6,514 +6,142 @@

 ---

-## 1. Prerequisites
+## 1. Overview

-### 1.1 Infrastructure Requirements
-
-| Component | Minimum | Recommended | Notes |
-|-----------|---------|-------------|-------|
-| CPU | 4 cores | 8 cores | For call graph extraction |
-| Memory | 4 GB | 8 GB | Large projects need more |
-| PostgreSQL | 16+ | 16+ | With RLS enabled |
-| Valkey/Redis | 7.0+ | 7.0+ | For caching (optional) |
-| .NET Runtime | 10.0 | 10.0 | Preview features enabled |
-
-### 1.2 Network Requirements
-
-| Direction | Endpoints | Notes |
-|-----------|-----------|-------|
-| Inbound | Scanner API (8080) | Load balancer health checks |
-| Outbound | PostgreSQL (5432) | Database connections |
-| Outbound | Valkey (6379) | Cache connections (optional) |
-| Outbound | Signer service | For DSSE attestations |
-
-### 1.3 Dependencies
-
-- Scanner WebService deployed and healthy
-- PostgreSQL database with Scanner schema migrations applied
-- (Optional) Valkey cluster for caching
-- (Optional) Signer service for attestation signing
+Reachability Drift Detection compares call graph reachability between two scans and surfaces newly reachable or newly unreachable sinks. The API lives in the Scanner WebService and relies on call graph snapshots stored in PostgreSQL.

 ---

-## 2. Configuration
+## 2. Prerequisites

-### 2.1 Scanner Service Configuration
+### 2.1 Infrastructure Requirements

-**File:** `etc/scanner.yaml`
+| Component | Minimum | Recommended | Notes |
+|---|---|---|---|
+| CPU | 4 cores | 8 cores | Call graph extraction is CPU heavy. |
+| Memory | 4 GB | 8 GB | Large graphs need more memory. |
+| PostgreSQL | 16+ | 16+ | Required for call graph + drift tables. |
+| Valkey/Redis | 7.0+ | 7.0+ | Optional call graph cache. |
+| .NET Runtime | 10.0 | 10.0 | Scanner WebService runtime. |
+
+### 2.2 Required Services
+
+- Scanner WebService running with storage configured.
+- Call graph ingestion pipeline populating `call_graph_snapshots` (Scanner Worker or external ingestion).
+- PostgreSQL migrations for call graph and drift tables applied (auto-migrate is enabled by default).
+
+Optional:
+- Valkey call graph cache (`CallGraph:Cache`).
+- Signer service for drift attestations (if enabled by the integration layer).
+
+---
+
+## 3. Configuration
+
+### 3.1 Scanner WebService
+
+**File:** `etc/scanner.yaml` (path depends on deployment)

 ```yaml
 scanner:
-  reachability:
-    # Enable reachability drift detection
-    enabled: true
-
-    # Languages to analyze (empty = all supported)
-    languages:
-      - dotnet
-      - java
-      - node
-      - python
-      - go
-
-    # Call graph extraction options
-    extraction:
-      max_depth: 100
-      max_nodes: 100000
-      timeout_seconds: 300
-      include_test_code: false
-      include_vendored: false
-
-    # Drift detection options
-    drift:
-      # Auto-compute on scan completion
-      auto_compute: true
-      # Base scan selection (previous, tagged, specific)
-      base_selection: previous
-      # Emit VEX candidates for unreachable sinks
-      emit_vex_candidates: true
-
   storage:
-    postgres:
-      connection_string: "Host=localhost;Database=stellaops;Username=scanner;Password=${SCANNER_DB_PASSWORD}"
-      schema: scanner
-      pool_size: 20
+    dsn: "Host=postgres;Database=stellaops;Username=scanner;Password=${SCANNER_DB_PASSWORD}"
+    database: "scanner"
+    commandTimeoutSeconds: 30
+    autoMigrate: true

-  cache:
-    valkey:
-      enabled: true
-      connection: "localhost:6379"
-      bucket: "stella-callgraph"
-      ttl_hours: 24
-      circuit_breaker:
-        failure_threshold: 5
-        timeout_seconds: 30
+  api:
+    basePath: "/api/v1"
+    scansSegment: "scans"
 ```

-### 2.2 Valkey Cache Configuration
+### 3.2 Call Graph Cache (Optional)

 ```yaml
-# Valkey-specific settings
-cache:
-  valkey:
+CallGraph:
+  Cache:
     enabled: true
-    connection: "valkey-cluster.internal:6379"
-    bucket: "stella-callgraph"
-    ttl_hours: 24
-
-    # Circuit breaker prevents cache storms
+    connection_string: "valkey:6379"
+    key_prefix: "callgraph:"
+    ttl_seconds: 3600
+    gzip: true
     circuit_breaker:
       failure_threshold: 5
       timeout_seconds: 30
-      half_open_max_attempts: 3
-
-    # Compression reduces memory usage
-    compression:
-      enabled: true
-      algorithm: gzip
-      level: fastest
+      half_open_timeout: 10
 ```

-### 2.3 Policy Gate Configuration
-
-**File:** `etc/policy.yaml`
-
-```yaml
-smart_diff:
-  gates:
-    # Block on KEV becoming reachable
-    - id: drift_block_kev
-      condition: "delta_reachable > 0 AND is_kev = true"
-      action: block
-      severity: critical
-      message: "Known Exploited Vulnerability now reachable"
-
-    # Block on high-severity sink becoming reachable
-    - id: drift_block_critical
-      condition: "delta_reachable > 0 AND max_cvss >= 9.0"
-      action: block
-      severity: critical
-      message: "Critical vulnerability now reachable"
-
-    # Warn on any new reachable paths
-    - id: drift_warn_new_paths
-      condition: "delta_reachable > 0"
-      action: warn
-      severity: medium
-      message: "New reachable paths detected"
-
-    # Auto-allow mitigated paths
-    - id: drift_allow_mitigated
-      condition: "delta_unreachable > 0 AND delta_reachable = 0"
-      action: allow
-      auto_approve: true
-```
-
----
-
-## 3. Deployment Modes
-
-### 3.1 Standalone Deployment
-
-```bash
-# Run Scanner WebService with drift detection
-docker run -d \
-  --name scanner \
-  -p 8080:8080 \
-  -e SCANNER_DB_PASSWORD=secret \
-  -v /etc/scanner:/etc/scanner:ro \
-  stellaops/scanner:latest
-
-# Verify health
-curl http://localhost:8080/health
-```
-
-### 3.2 Kubernetes Deployment
-
-```yaml
-apiVersion: apps/v1
-kind: Deployment
-metadata:
-  name: scanner
-  namespace: stellaops
-spec:
-  replicas: 3
-  selector:
-    matchLabels:
-      app: scanner
-  template:
-    metadata:
-      labels:
-        app: scanner
-    spec:
-      containers:
-        - name: scanner
-          image: stellaops/scanner:latest
-          ports:
-            - containerPort: 8080
-          env:
-            - name: SCANNER_DB_PASSWORD
-              valueFrom:
-                secretKeyRef:
-                  name: scanner-secrets
-                  key: db-password
-          volumeMounts:
-            - name: config
-              mountPath: /etc/scanner
-              readOnly: true
-          resources:
-            requests:
-              memory: "4Gi"
-              cpu: "2"
-            limits:
-              memory: "8Gi"
-              cpu: "4"
-          livenessProbe:
-            httpGet:
-              path: /health/live
-              port: 8080
-            initialDelaySeconds: 30
-            periodSeconds: 10
-          readinessProbe:
-            httpGet:
-              path: /health/ready
-              port: 8080
-            initialDelaySeconds: 10
-            periodSeconds: 5
-      volumes:
-        - name: config
-          configMap:
-            name: scanner-config
-```
-
-### 3.3 Air-Gapped Deployment
-
-For air-gapped environments:
-
-1. **Disable external lookups:**
-   ```yaml
-   scanner:
-     reachability:
-       offline_mode: true
-       # No external advisory fetching
-   ```
-
-2. **Pre-load call graph caches:**
-   ```bash
-   # Export from connected environment
-   stella cache export --type callgraph --output graphs.tar.gz
-
-   # Import in air-gapped environment
-   stella cache import --input graphs.tar.gz
-   ```
-
-3. **Use local VEX sources:**
-   ```yaml
-   excititor:
-     sources:
-       - type: local
-         path: /data/vex-bundles/
-   ```
-
----
-
-## 4. Monitoring & Metrics
-
-### 4.1 Key Metrics
-
-| Metric | Type | Description | Alert Threshold |
-|--------|------|-------------|-----------------|
-| `scanner_callgraph_extraction_duration_seconds` | histogram | Time to extract call graph | p99 > 300s |
-| `scanner_callgraph_node_count` | gauge | Nodes in extracted graph | > 100,000 |
-| `scanner_reachability_analysis_duration_seconds` | histogram | BFS analysis time | p99 > 30s |
-| `scanner_drift_newly_reachable_total` | counter | Count of newly reachable sinks | > 0 (alert) |
-| `scanner_drift_newly_unreachable_total` | counter | Count of mitigated sinks | (info) |
-| `scanner_cache_hit_ratio` | gauge | Valkey cache hit rate | < 0.5 |
-| `scanner_cache_circuit_breaker_open` | gauge | Circuit breaker state | = 1 (alert) |
-
-### 4.2 Grafana Dashboard
-
-Import dashboard JSON from: `deploy/grafana/scanner-drift-dashboard.json`
-
-Key panels:
-- Drift detection rate over time
-- Newly reachable sinks by category
-- Call graph extraction latency
-- Cache hit/miss ratio
-- Circuit breaker state
-
-### 4.3 Alert Rules
-
-```yaml
-# Prometheus alerting rules
-groups:
-  - name: scanner-drift
-    rules:
-      - alert: KevBecameReachable
-        expr: increase(scanner_drift_kev_reachable_total[5m]) > 0
-        for: 0m
-        labels:
-          severity: critical
-        annotations:
-          summary: "KEV vulnerability became reachable"
-          description: "A Known Exploited Vulnerability is now reachable from public entrypoints"
-
-      - alert: HighDriftRate
-        expr: rate(scanner_drift_newly_reachable_total[1h]) > 10
-        for: 15m
-        labels:
-          severity: warning
-        annotations:
-          summary: "High rate of new reachable vulnerabilities"
-
-      - alert: CacheCircuitOpen
-        expr: scanner_cache_circuit_breaker_open == 1
-        for: 5m
-        labels:
-          severity: warning
-        annotations:
-          summary: "Valkey cache circuit breaker is open"
-```
-
----
-
-## 5. Troubleshooting
-
-### 5.1 Call Graph Extraction Failures
-
-**Symptom:** `GRAPH_NOT_EXTRACTED` error
-
-**Causes & Solutions:**
-
-| Cause | Solution |
-|-------|----------|
-| Missing SDK/runtime | Install required SDK (.NET, Node.js, JDK) |
-| Build errors in project | Fix compilation errors first |
-| Timeout exceeded | Increase `extraction.timeout_seconds` |
-| Memory exhaustion | Increase container memory limits |
-| Unsupported language | Check language support matrix |
-
-**Debugging:**
-
-```bash
-# Check extraction logs
-kubectl logs -f deployment/scanner | grep -i extraction
-
-# Manual extraction test
-stella scan callgraph \
-  --project /path/to/project \
-  --language dotnet \
-  --verbose
-```
-
-### 5.2 Drift Detection Issues
-
-**Symptom:** Drift not computed or incorrect results
-
-**Causes & Solutions:**
-
-| Cause | Solution |
-|-------|----------|
-| No base scan available | Ensure previous scan exists |
-| Different languages | Base and head must have same language |
-| Graph digest unchanged | No material code changes detected |
-| Cache stale | Clear Valkey cache for scan |
-
-**Debugging:**
-
-```bash
-# Check drift computation status
-curl "http://scanner:8080/api/scanner/scans/{scanId}/drift"
-
-# Force recomputation
-curl -X POST \
-  "http://scanner:8080/api/scanner/scans/{scanId}/compute-reachability" \
-  -d '{"forceRecompute": true}'
-
-# View graph digests
-psql -c "SELECT scan_id, graph_digest FROM scanner.call_graph_snapshots ORDER BY extracted_at DESC LIMIT 10"
-```
-
-### 5.3 Cache Problems
-
-**Symptom:** Slow performance, cache misses, circuit breaker open
-
-**Solutions:**
-
-```bash
-# Check Valkey connectivity
-redis-cli -h valkey.internal ping
-
-# Check circuit breaker state
-curl "http://scanner:8080/health/ready" | jq '.checks.cache'
-
-# Clear specific scan cache
-redis-cli DEL "stella-callgraph:scanId:*"
-
-# Reset circuit breaker (restart scanner)
-kubectl rollout restart deployment/scanner
-```
-
-### 5.4 Common Error Messages
-
-| Error | Meaning | Action |
-|-------|---------|--------|
-| `ERR_GRAPH_TOO_LARGE` | > 100K nodes | Increase `max_nodes` or split project |
-| `ERR_EXTRACTION_TIMEOUT` | Analysis timed out | Increase timeout or reduce scope |
-| `ERR_NO_ENTRYPOINTS` | No public entrypoints found | Check framework detection |
-| `ERR_BASE_SCAN_MISSING` | Base scan not found | Specify valid `baseScanId` |
-| `ERR_CACHE_UNAVAILABLE` | Valkey unreachable | Check network, circuit breaker will activate |
-
----
-
-## 6. Performance Tuning
-
-### 6.1 Call Graph Extraction
+### 3.3 Authorization (Optional)

 ```yaml
 scanner:
-  reachability:
-    extraction:
-      # Exclude test code (reduces graph size)
-      include_test_code: false
-
-      # Exclude vendored dependencies
-      include_vendored: false
-
-      # Limit analysis depth
-      max_depth: 50  # Default: 100
-
-      # Parallel project analysis
-      parallelism: 4
-```
-
-### 6.2 Caching Strategy
-
-```yaml
-cache:
-  valkey:
-    # Longer TTL for stable projects
-    ttl_hours: 72
-
-    # Aggressive compression for large graphs
-    compression:
-      level: optimal  # vs 'fastest'
-
-    # Larger connection pool
-    pool_size: 20
-```
-
-### 6.3 Database Optimization
-
-```sql
-- Ensure indexes exist
-CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_callgraph_scan_lang
-  ON scanner.call_graph_snapshots(scan_id, language);
-
-CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_drift_head_scan
-  ON scanner.reachability_drift_results(head_scan_id);
-
-- Vacuum after large imports
-VACUUM ANALYZE scanner.call_graph_snapshots;
-VACUUM ANALYZE scanner.reachability_drift_results;
+  authority:
+    enabled: true
+    issuer: "https://authority.local"
+    requiredScopes:
+      - "scanner.scans.read"
+      - "scanner.scans.write"
 ```

 ---

-## 7. Backup & Recovery
+## 4. Running Drift Analysis

-### 7.1 Database Backup
+1. Ensure call graph snapshots exist for base and head scans.
+2. Compute drift by providing the base scan ID:
+   - `GET /api/v1/scans/{scanId}/drift?baseScanId={baseScanId}&language=dotnet`
+3. Page through sinks:
+   - `GET /api/v1/drift/{driftId}/sinks?direction=became_reachable&offset=0&limit=100`

-```bash
-# Backup drift-related tables
-pg_dump -h postgres.internal -U stellaops \
-  -t scanner.call_graph_snapshots \
-  -t scanner.reachability_results \
-  -t scanner.reachability_drift_results \
-  -t scanner.drifted_sinks \
-  -t scanner.code_changes \
-  > scanner_drift_backup.sql
-```
-
-### 7.2 Cache Recovery
-
-```bash
-# Export cache to file (if needed)
-redis-cli -h valkey.internal --rdb /backup/callgraph-cache.rdb
-
-# Cache is ephemeral - can be regenerated from database
-# Recompute after cache loss:
-stella scan recompute-reachability --all-pending
-```
+If `baseScanId` is omitted, the API returns the most recent stored drift result for the head scan.

 ---

-## 8. Security Considerations
+## 5. Deployment Modes

-### 8.1 Database Access
+### 5.1 Standalone

-- Scanner service uses dedicated PostgreSQL user with schema-limited permissions
-- Row-Level Security (RLS) enforces tenant isolation
-- Connection strings use secrets management (not plaintext)
+- Run Scanner WebService with PostgreSQL reachable.
+- Provide `scanner.storage.dsn` and `scanner.api.basePath`.

-### 8.2 API Authentication
+### 5.2 Kubernetes

-- All drift endpoints require valid Bearer token
-- Scopes: `scanner:read`, `scanner:write`, `scanner:admin`
-- Rate limiting prevents abuse
+- Configure readiness and liveness probes (`/health/ready`, `/health/live`).
+- Mount `scanner.yaml` via ConfigMap or Secret.
+- Ensure Postgres connectivity and schema migrations are enabled.

-### 8.3 Attestation Signing
+### 5.3 Air-Gapped

-- Drift results can be DSSE-signed for audit trails
-- Signing keys managed by Signer service
-- Optional Rekor transparency logging
+- Use Offline Kit flows for advisory data and signatures.
+- Avoid external endpoints; configure any optional integrations to local services.

 ---

-## 9. References
+## 6. Monitoring and Metrics

-- **Architecture:** `docs/modules/scanner/reachability-drift.md`
-- **API Reference:** `docs/api/scanner-drift-api.md`
-- **PostgreSQL Guide:** `docs/operations/postgresql-guide.md`
-- **Air-Gap Operations:** `docs/operations/airgap-operations-runbook.md`
-- **Reachability Runbook:** `docs/operations/reachability-runbook.md`
+There are no drift-specific metrics emitted by the drift endpoints yet. Recommended operational checks:
+- API logs for `/api/v1/scans/{scanId}/drift` and `/api/v1/drift/{driftId}/sinks`.
+- PostgreSQL table sizes and growth for `call_graph_snapshots`, `reachability_drift_results`, `drifted_sinks`.
+- Valkey connectivity and cache hit rates if `CallGraph:Cache` is enabled.
+
+---
+
+## 7. Troubleshooting
+
+| Symptom | Likely Cause | Resolution |
+|---|---|---|
+| 404 scan not found | Invalid scan ID | Verify scan ID or resolve by image reference. |
+| 404 call graph not found | Call graph not ingested | Ingest call graph snapshot before running drift. |
+| 404 drift result not found | No stored drift and no base scan provided | Provide `baseScanId` to compute drift. |
+| 400 invalid direction | Unsupported direction value | Use `became_reachable` or `became_unreachable`. |
+| 409 computation already in progress | Reachability job already running | Wait or retry later. |
+
+---
+
+## 8. References
+
+- `docs/modules/scanner/reachability-drift.md`
+- `docs/api/scanner-drift-api.md`
+- `docs/airgap/reachability-drift-airgap-workflows.md`
+- `src/Scanner/__Libraries/StellaOps.Scanner.Storage/Postgres/Migrations/009_call_graph_tables.sql`
+- `src/Scanner/__Libraries/StellaOps.Scanner.Storage/Postgres/Migrations/010_reachability_drift_tables.sql`
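
Reviewer note on the "Running Drift Analysis" section added in the hunk above: the two endpoints and the `direction` values from the troubleshooting table could be exercised from a client roughly as follows. This is a minimal sketch, not the Scanner client. The helper names (`drift_url`, `sinks_url`, `iter_sinks`), the caller-supplied `fetch_page` callable, and the assumption that a page of sinks comes back as a plain list (with a short page marking the end) are all hypothetical; the actual response shape is defined in `docs/api/scanner-drift-api.md`.

```python
from typing import Callable, Dict, Iterator, List, Optional
from urllib.parse import urlencode

# Direction values listed in the guide's troubleshooting table.
VALID_DIRECTIONS = ("became_reachable", "became_unreachable")


def drift_url(base_path: str, scan_id: str,
              base_scan_id: Optional[str] = None,
              language: Optional[str] = None) -> str:
    """Build the drift URL; without baseScanId the API returns the stored result."""
    params: Dict[str, str] = {}
    if base_scan_id:
        params["baseScanId"] = base_scan_id
    if language:
        params["language"] = language
    query = f"?{urlencode(params)}" if params else ""
    return f"{base_path}/scans/{scan_id}/drift{query}"


def sinks_url(base_path: str, drift_id: str, direction: str,
              offset: int = 0, limit: int = 100) -> str:
    """Build one page URL for drifted sinks, rejecting unsupported directions early."""
    if direction not in VALID_DIRECTIONS:
        raise ValueError(f"direction must be one of {VALID_DIRECTIONS}")
    query = urlencode({"direction": direction, "offset": offset, "limit": limit})
    return f"{base_path}/drift/{drift_id}/sinks?{query}"


def iter_sinks(fetch_page: Callable[[str], List[dict]], base_path: str,
               drift_id: str, direction: str, limit: int = 100) -> Iterator[dict]:
    """Yield sinks page by page.

    Assumption: fetch_page(url) returns the page's sinks as a list, and a page
    shorter than `limit` is the last one.
    """
    offset = 0
    while True:
        page = fetch_page(sinks_url(base_path, drift_id, direction, offset, limit))
        yield from page
        if len(page) < limit:
            return
        offset += limit
```

Keeping the HTTP call behind `fetch_page` lets the paging logic be tested without a running Scanner WebService; a real caller would pass a function that issues the GET with a Bearer token and decodes the body.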
--- a/docs/operations/unknowns-queue-runbook.md
+++ b/docs/operations/unknowns-queue-runbook.md
@@ -576,6 +576,52 @@ stella unknowns report --format email --send-to security-team@example.com

 ---

+## 8. Unknown Budgets
+
+Unknown budgets enforce per-environment caps on unknowns by reason code. Budgets can warn or block when exceeded.
+
+**Configuration**:
+
+```yaml
+# etc/policy.unknowns.budgets.yaml
+unknownBudgets:
+  enforceBudgets: true
+  budgets:
+    prod:
+      environment: prod
+      totalLimit: 3
+      reasonLimits:
+        Reachability: 0
+        Provenance: 0
+        VexConflict: 1
+      action: Block
+      exceededMessage: "Production requires zero reachability unknowns"
+
+    stage:
+      environment: stage
+      totalLimit: 10
+      reasonLimits:
+        Reachability: 1
+      action: WarnUnlessException
+
+    dev:
+      environment: dev
+      totalLimit: null
+      action: Warn
+
+    default:
+      environment: default
+      totalLimit: 5
+      action: Warn
+```
+
+**Exception coverage**:
+
+To allow approved exceptions to cover specific unknown reason codes, set exception metadata
+`unknown_reason_codes` (comma-separated). Example: `Reachability, U-VEX`.
+
+---
+
 ## Related Documentation

 - [Unknowns API Reference](../api/score-proofs-reachability-api-reference.md#5-unknowns-api)
@@ -585,6 +631,6 @@ stella unknowns report --format email --send-to security-team@example.com

 ---

-**Last Updated**: 2025-12-20  
+**Last Updated**: 2025-12-22  
 **Version**: 1.0.0  
 **Sprint**: 3500.0004.0004
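
Reviewer note on the "Unknown Budgets" section added to the runbook above: the sketch below shows one way the per-environment caps could be evaluated. It is illustrative only and is not the policy engine's implementation. The function name `check_budget` and the following semantics are assumptions not stated in the runbook: a `totalLimit` of `null` means no total cap, reason codes without an entry in `reasonLimits` have no per-reason cap, and environments without their own budget fall back to `default`. The handling of `WarnUnlessException` is left to the caller.

```python
from typing import Dict, List, Optional, Tuple

# Mirrors the YAML example from the runbook (exceededMessage omitted).
BUDGETS: Dict[str, dict] = {
    "prod": {"totalLimit": 3,
             "reasonLimits": {"Reachability": 0, "Provenance": 0, "VexConflict": 1},
             "action": "Block"},
    "stage": {"totalLimit": 10,
              "reasonLimits": {"Reachability": 1},
              "action": "WarnUnlessException"},
    "dev": {"totalLimit": None, "reasonLimits": {}, "action": "Warn"},
    "default": {"totalLimit": 5, "reasonLimits": {}, "action": "Warn"},
}


def check_budget(environment: str, counts: Dict[str, int],
                 budgets: Dict[str, dict] = BUDGETS) -> Tuple[Optional[str], List[str]]:
    """Return (action, violations); action is None when the budget is respected.

    `counts` maps reason code -> number of unknowns with that reason.
    """
    # Assumed fallback: unknown environments use the "default" budget.
    budget = budgets.get(environment, budgets["default"])
    violations: List[str] = []

    total = sum(counts.values())
    total_limit = budget["totalLimit"]
    # Assumed: a null totalLimit disables the total cap.
    if total_limit is not None and total > total_limit:
        violations.append(f"total {total} > {total_limit}")

    for reason, cap in budget.get("reasonLimits", {}).items():
        observed = counts.get(reason, 0)
        if observed > cap:
            violations.append(f"{reason} {observed} > {cap}")

    return (budget["action"] if violations else None, violations)
```

For example, a single `Reachability` unknown in `prod` exceeds the cap of 0 and yields the `Block` action, while the same count in `stage` stays within its cap of 1.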