# Reachability Drift Detection - Operations Guide

**Module:** Scanner
**Version:** 1.0
**Last Updated:** 2025-12-22

---

## 1. Prerequisites

### 1.1 Infrastructure Requirements

| Component | Minimum | Recommended | Notes |
|-----------|---------|-------------|-------|
| CPU | 4 cores | 8 cores | For call graph extraction |
| Memory | 4 GB | 8 GB | Large projects need more |
| PostgreSQL | 16+ | 16+ | With RLS enabled |
| Valkey/Redis | 7.0+ | 7.0+ | For caching (optional) |
| .NET Runtime | 10.0 | 10.0 | Preview features enabled |

### 1.2 Network Requirements

| Direction | Endpoints | Notes |
|-----------|-----------|-------|
| Inbound | Scanner API (8080) | Load balancer health checks |
| Outbound | PostgreSQL (5432) | Database connections |
| Outbound | Valkey (6379) | Cache connections (optional) |
| Outbound | Signer service | For DSSE attestations |

### 1.3 Dependencies

- Scanner WebService deployed and healthy
- PostgreSQL database with Scanner schema migrations applied
- (Optional) Valkey cluster for caching
- (Optional) Signer service for attestation signing

---

## 2. Configuration

### 2.1 Scanner Service Configuration

**File:** `etc/scanner.yaml`

```yaml
scanner:
  reachability:
    # Enable reachability drift detection
    enabled: true

    # Languages to analyze (empty = all supported)
    languages:
      - dotnet
      - java
      - node
      - python
      - go

    # Call graph extraction options
    extraction:
      max_depth: 100
      max_nodes: 100000
      timeout_seconds: 300
      include_test_code: false
      include_vendored: false

    # Drift detection options
    drift:
      # Auto-compute on scan completion
      auto_compute: true
      # Base scan selection (previous, tagged, specific)
      base_selection: previous
      # Emit VEX candidates for unreachable sinks
      emit_vex_candidates: true

  storage:
    postgres:
      connection_string: "Host=localhost;Database=stellaops;Username=scanner;Password=${SCANNER_DB_PASSWORD}"
      schema: scanner
      pool_size: 20

  cache:
    valkey:
      enabled: true
      connection: "localhost:6379"
      bucket: "stella-callgraph"
      ttl_hours: 24
      circuit_breaker:
        failure_threshold: 5
        timeout_seconds: 30
```
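
To make the `max_depth` limit concrete, here is a minimal sketch of a depth-bounded breadth-first search from entrypoints to sinks. It is illustrative only and is not the Scanner's implementation; the function name, the graph shape (a dict of caller to callees), and the sample symbols are assumptions.

```python
from collections import deque

def reachable_sinks(graph, entrypoints, sinks, max_depth=100):
    """Return the sinks reachable from any entrypoint within max_depth calls."""
    seen = set(entrypoints)
    queue = deque((node, 0) for node in entrypoints)
    found = set()
    while queue:
        node, depth = queue.popleft()
        if node in sinks:
            found.add(node)
        if depth >= max_depth:
            continue  # depth budget exhausted; do not expand this node
        for callee in graph.get(node, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append((callee, depth + 1))
    return found

# Hypothetical call graph: caller -> callees
graph = {
    "Api.Get": ["Service.Load"],
    "Service.Load": ["Json.Deserialize"],
    "Job.Run": ["Shell.Exec"],
}
sinks = {"Json.Deserialize", "Shell.Exec"}

print(reachable_sinks(graph, ["Api.Get"], sinks))               # {'Json.Deserialize'}
print(reachable_sinks(graph, ["Api.Get"], sinks, max_depth=1))  # set()
```

The second call shows the trade-off: a lower `max_depth` shortens analysis but can hide sinks that sit deeper in the call chain.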

### 2.2 Valkey Cache Configuration

```yaml
# Valkey-specific settings
cache:
  valkey:
    enabled: true
    connection: "valkey-cluster.internal:6379"
    bucket: "stella-callgraph"
    ttl_hours: 24

    # Circuit breaker prevents cache storms
    circuit_breaker:
      failure_threshold: 5
      timeout_seconds: 30
      half_open_max_attempts: 3

    # Compression reduces memory usage
    compression:
      enabled: true
      algorithm: gzip
      level: fastest
```
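
The three circuit breaker settings map onto a conventional closed / open / half-open state machine. The sketch below shows one common interpretation of them; the class and method names are hypothetical, and the Scanner's actual behavior may differ in detail.

```python
import time

class CircuitBreaker:
    """Closed -> open after N consecutive failures; half-open after a timeout."""

    def __init__(self, failure_threshold=5, timeout_seconds=30,
                 half_open_max_attempts=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.timeout_seconds = timeout_seconds
        self.half_open_max_attempts = half_open_max_attempts
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0
        self.probes = 0

    def allow(self):
        """Return True if a cache call may be attempted right now."""
        if self.state == "open":
            if self.clock() - self.opened_at < self.timeout_seconds:
                return False  # still cooling down; skip the cache
            self.state, self.probes = "half_open", 0
        if self.state == "half_open":
            if self.probes >= self.half_open_max_attempts:
                return False  # probe budget used up; wait for an outcome
            self.probes += 1
        return True

    def record_success(self):
        """A successful call closes the breaker and resets the failure count."""
        self.state, self.failures = "closed", 0

    def record_failure(self):
        """Open on a failed probe or once the failure threshold is reached."""
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state, self.opened_at = "open", self.clock()
```

While the breaker is open, callers skip the cache and fall back to the database, which is why an open breaker shows up as slower scans rather than failed ones.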

### 2.3 Policy Gate Configuration

**File:** `etc/policy.yaml`

```yaml
smart_diff:
  gates:
    # Block on KEV becoming reachable
    - id: drift_block_kev
      condition: "delta_reachable > 0 AND is_kev = true"
      action: block
      severity: critical
      message: "Known Exploited Vulnerability now reachable"

    # Block on high-severity sink becoming reachable
    - id: drift_block_critical
      condition: "delta_reachable > 0 AND max_cvss >= 9.0"
      action: block
      severity: critical
      message: "Critical vulnerability now reachable"

    # Warn on any new reachable paths
    - id: drift_warn_new_paths
      condition: "delta_reachable > 0"
      action: warn
      severity: medium
      message: "New reachable paths detected"

    # Auto-allow mitigated paths
    - id: drift_allow_mitigated
      condition: "delta_unreachable > 0 AND delta_reachable = 0"
      action: allow
      auto_approve: true
```
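
To reason about which gate fires for a given drift result, the following sketch mirrors the four gates as Python predicates. It does not parse the policy expression syntax, and it assumes that when several gates match, the strictest action wins and that no match means no objection. Both are assumptions, so check the policy engine documentation for the real resolution order.

```python
# Gates mirrored from the YAML above, with conditions rewritten as predicates.
GATES = [
    ("drift_block_kev",
     lambda d: d["delta_reachable"] > 0 and d["is_kev"], "block"),
    ("drift_block_critical",
     lambda d: d["delta_reachable"] > 0 and d["max_cvss"] >= 9.0, "block"),
    ("drift_warn_new_paths",
     lambda d: d["delta_reachable"] > 0, "warn"),
    ("drift_allow_mitigated",
     lambda d: d["delta_unreachable"] > 0 and d["delta_reachable"] == 0, "allow"),
]
ACTION_RANK = {"allow": 0, "warn": 1, "block": 2}  # assumption: strictest wins

def evaluate(drift):
    """Return (action, ids of matching gates) for one drift result."""
    matched = [(gate_id, action) for gate_id, cond, action in GATES if cond(drift)]
    if not matched:
        return "allow", []  # assumption: no matching gate means no objection
    action = max((a for _, a in matched), key=ACTION_RANK.__getitem__)
    return action, [gate_id for gate_id, _ in matched]

result = evaluate({"delta_reachable": 2, "delta_unreachable": 0,
                   "is_kev": False, "max_cvss": 7.5})
print(result)  # ('warn', ['drift_warn_new_paths'])
```

Note that `drift_warn_new_paths` matches every result that a block gate matches, so under these assumptions a blocked scan reports both gate IDs.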

---

## 3. Deployment Modes

### 3.1 Standalone Deployment

```bash
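# Run Scanner WebService with drift detection
# (SCANNER_DB_PASSWORD is passed through from the host environment
# so the secret does not appear in the command line)
docker run -d \
  --name scanner \
  -p 8080:8080 \
  -e SCANNER_DB_PASSWORD \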
  -v /etc/scanner:/etc/scanner:ro \
  stellaops/scanner:latest

# Verify health
curl http://localhost:8080/health
```

### 3.2 Kubernetes Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scanner
  namespace: stellaops
spec:
  replicas: 3
  selector:
    matchLabels:
      app: scanner
  template:
    metadata:
      labels:
        app: scanner
    spec:
      containers:
        - name: scanner
          image: stellaops/scanner:latest
          ports:
            - containerPort: 8080
          env:
            - name: SCANNER_DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: scanner-secrets
                  key: db-password
          volumeMounts:
            - name: config
              mountPath: /etc/scanner
              readOnly: true
          resources:
            requests:
              memory: "4Gi"
              cpu: "2"
            limits:
              memory: "8Gi"
              cpu: "4"
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
      volumes:
        - name: config
          configMap:
            name: scanner-config
```

### 3.3 Air-Gapped Deployment

For air-gapped environments:

1. **Disable external lookups:**
   ```yaml
   scanner:
     reachability:
       offline_mode: true
       # No external advisory fetching
   ```

2. **Pre-load call graph caches:**
   ```bash
   # Export from connected environment
   stella cache export --type callgraph --output graphs.tar.gz

   # Import in air-gapped environment
   stella cache import --input graphs.tar.gz
   ```

3. **Use local VEX sources:**
   ```yaml
   excititor:
     sources:
       - type: local
         path: /data/vex-bundles/
   ```

---

## 4. Monitoring & Metrics

### 4.1 Key Metrics

| Metric | Type | Description | Alert Threshold |
|--------|------|-------------|-----------------|
| `scanner_callgraph_extraction_duration_seconds` | histogram | Time to extract call graph | p99 > 300s |
| `scanner_callgraph_node_count` | gauge | Nodes in extracted graph | > 100,000 |
| `scanner_reachability_analysis_duration_seconds` | histogram | BFS analysis time | p99 > 30s |
| `scanner_drift_newly_reachable_total` | counter | Count of newly reachable sinks | > 0 (alert) |
| `scanner_drift_newly_unreachable_total` | counter | Count of mitigated sinks | (info) |
| `scanner_cache_hit_ratio` | gauge | Valkey cache hit rate | < 0.5 |
| `scanner_cache_circuit_breaker_open` | gauge | Circuit breaker state | = 1 (alert) |

### 4.2 Grafana Dashboard

Import dashboard JSON from: `deploy/grafana/scanner-drift-dashboard.json`

Key panels:
- Drift detection rate over time
- Newly reachable sinks by category
- Call graph extraction latency
- Cache hit/miss ratio
- Circuit breaker state

### 4.3 Alert Rules

```yaml
# Prometheus alerting rules
groups:
  - name: scanner-drift
    rules:
      - alert: KevBecameReachable
        expr: increase(scanner_drift_kev_reachable_total[5m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "KEV vulnerability became reachable"
          description: "A Known Exploited Vulnerability is now reachable from public entrypoints"

      - alert: HighDriftRate
        expr: increase(scanner_drift_newly_reachable_total[1h]) > 10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High rate of new reachable vulnerabilities"

      - alert: CacheCircuitOpen
        expr: scanner_cache_circuit_breaker_open == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Valkey cache circuit breaker is open"
```

---

## 5. Troubleshooting

### 5.1 Call Graph Extraction Failures

**Symptom:** `GRAPH_NOT_EXTRACTED` error

**Causes & Solutions:**

| Cause | Solution |
|-------|----------|
| Missing SDK/runtime | Install required SDK (.NET, Node.js, JDK) |
| Build errors in project | Fix compilation errors first |
| Timeout exceeded | Increase `extraction.timeout_seconds` |
| Memory exhaustion | Increase container memory limits |
| Unsupported language | Check language support matrix |

**Debugging:**

```bash
# Check extraction logs
kubectl logs -n stellaops -f deployment/scanner | grep -i extraction

# Manual extraction test
stella scan callgraph \
  --project /path/to/project \
  --language dotnet \
  --verbose
```

### 5.2 Drift Detection Issues

**Symptom:** Drift not computed or incorrect results

**Causes & Solutions:**

| Cause | Solution |
|-------|----------|
| No base scan available | Ensure previous scan exists |
| Different languages | Base and head scans must target the same language |
| Graph digest unchanged | No material code changes detected |
| Cache stale | Clear Valkey cache for scan |

**Debugging:**

```bash
# Check drift computation status
curl "http://scanner:8080/api/scanner/scans/{scanId}/drift"

# Force recomputation
curl -X POST \
  -H "Content-Type: application/json" \
  "http://scanner:8080/api/scanner/scans/{scanId}/compute-reachability" \
  -d '{"forceRecompute": true}'

# View graph digests
psql -c "SELECT scan_id, graph_digest FROM scanner.call_graph_snapshots ORDER BY extracted_at DESC LIMIT 10"
```
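
When results look wrong, it helps to recall what drift means: a comparison of the reachable sinks in the base scan against those in the head scan, skipped when the graph digest is unchanged. The sketch below illustrates that logic under stated assumptions. The digest scheme (SHA-256 over sorted edges) and the input shape are hypothetical and do not describe the Scanner's actual format.

```python
import hashlib

def graph_digest(edges):
    """Order-independent SHA-256 over (caller, callee) edges (illustrative scheme)."""
    h = hashlib.sha256()
    for caller, callee in sorted(set(edges)):
        h.update(f"{caller}->{callee}\n".encode("utf-8"))
    return h.hexdigest()

def compute_drift(base, head):
    """Compare two scans; each is a dict with 'edges' and 'reachable_sinks'."""
    if graph_digest(base["edges"]) == graph_digest(head["edges"]):
        # Identical graphs: no material code change, so no drift to report.
        return {"newly_reachable": [], "newly_unreachable": []}
    base_sinks = set(base["reachable_sinks"])
    head_sinks = set(head["reachable_sinks"])
    return {
        "newly_reachable": sorted(head_sinks - base_sinks),
        "newly_unreachable": sorted(base_sinks - head_sinks),
    }

base = {"edges": [("Api.Get", "Service.Load")], "reachable_sinks": {"Xml.Parse"}}
head = {"edges": [("Api.Get", "Service.Load"), ("Service.Load", "Shell.Exec")],
        "reachable_sinks": {"Shell.Exec"}}
print(compute_drift(base, head))
# {'newly_reachable': ['Shell.Exec'], 'newly_unreachable': ['Xml.Parse']}
```

This is also why a missing base scan, or a base scan in a different language, produces no drift: there is nothing comparable to subtract from.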

### 5.3 Cache Problems

**Symptom:** Slow performance, cache misses, circuit breaker open

**Solutions:**

```bash
# Check Valkey connectivity
redis-cli -h valkey.internal ping

# Check circuit breaker state
curl "http://scanner:8080/health/ready" | jq '.checks.cache'

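# Clear specific scan cache
# (DEL does not expand glob patterns, so list matching keys with SCAN first)
redis-cli -h valkey.internal --scan --pattern "stella-callgraph:scanId:*" \
  | xargs -r -n 1 redis-cli -h valkey.internal DEL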

# Reset circuit breaker (restart scanner)
kubectl rollout restart -n stellaops deployment/scanner
```

### 5.4 Common Error Messages

| Error | Meaning | Action |
|-------|---------|--------|
| `ERR_GRAPH_TOO_LARGE` | Graph exceeds `max_nodes` (100,000 in the sample configuration) | Increase `max_nodes` or split project |
| `ERR_EXTRACTION_TIMEOUT` | Analysis timed out | Increase timeout or reduce scope |
| `ERR_NO_ENTRYPOINTS` | No public entrypoints found | Check framework detection |
| `ERR_BASE_SCAN_MISSING` | Base scan not found | Specify valid `baseScanId` |
| `ERR_CACHE_UNAVAILABLE` | Valkey unreachable | Check network connectivity; the circuit breaker opens after repeated failures |

---

## 6. Performance Tuning

### 6.1 Call Graph Extraction

```yaml
scanner:
  reachability:
    extraction:
      # Exclude test code (reduces graph size)
      include_test_code: false

      # Exclude vendored dependencies
      include_vendored: false

      # Limit analysis depth
      max_depth: 50  # Default: 100

      # Parallel project analysis
      parallelism: 4
```

### 6.2 Caching Strategy

```yaml
cache:
  valkey:
    # Longer TTL for stable projects
    ttl_hours: 72

    # Aggressive compression for large graphs
    compression:
      level: optimal  # vs 'fastest'

    # Larger connection pool
    pool_size: 20
```

### 6.3 Database Optimization

```sql
-- Ensure indexes exist
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_callgraph_scan_lang
  ON scanner.call_graph_snapshots(scan_id, language);

CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_drift_head_scan
  ON scanner.reachability_drift_results(head_scan_id);

-- Vacuum after large imports
VACUUM ANALYZE scanner.call_graph_snapshots;
VACUUM ANALYZE scanner.reachability_drift_results;
```

---

## 7. Backup & Recovery

### 7.1 Database Backup

```bash
# Backup drift-related tables
pg_dump -h postgres.internal -U stellaops \
  -t scanner.call_graph_snapshots \
  -t scanner.reachability_results \
  -t scanner.reachability_drift_results \
  -t scanner.drifted_sinks \
  -t scanner.code_changes \
  > scanner_drift_backup.sql
```

### 7.2 Cache Recovery

```bash
# Export cache to file (if needed)
redis-cli -h valkey.internal --rdb /backup/callgraph-cache.rdb

# Cache is ephemeral - can be regenerated from database
# Recompute after cache loss:
stella scan recompute-reachability --all-pending
```

---

## 8. Security Considerations

### 8.1 Database Access

- Scanner service uses dedicated PostgreSQL user with schema-limited permissions
- Row-Level Security (RLS) enforces tenant isolation
- Connection strings use secrets management (not plaintext)

### 8.2 API Authentication

- All drift endpoints require valid Bearer token
- Scopes: `scanner:read`, `scanner:write`, `scanner:admin`
- Rate limiting prevents abuse
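
As an illustration of the Bearer token requirement, this sketch builds an authenticated request to the recompute endpoint used in the troubleshooting section. It uses only the Python standard library; the helper name, scan ID, and token value are placeholders, and the request is constructed but not sent.

```python
import json
import urllib.request

def build_recompute_request(base_url, scan_id, token):
    """Build (but do not send) an authenticated recompute request."""
    body = json.dumps({"forceRecompute": True}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/scanner/scans/{scan_id}/compute-reachability",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

# Placeholder values; a real token would come from your identity provider.
req = build_recompute_request("http://scanner:8080", "scan-123", "example-token")
print(req.get_method(), req.full_url)
# Sending it would be: urllib.request.urlopen(req)
```

Which scope this endpoint requires is not stated here; it is presumably `scanner:write`, but confirm against the API reference.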

### 8.3 Attestation Signing

- Drift results can be DSSE-signed for audit trails
- Signing keys managed by Signer service
- Optional Rekor transparency logging
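
For auditors verifying signatures by hand, it helps to know that a DSSE signature covers the Pre-Authentication Encoding (PAE) of the payload rather than the raw payload bytes. The sketch below implements PAE as defined by the DSSE specification and checks it against the specification's own example. The payload type that the Scanner uses for drift results is not stated in this guide, so none is assumed.

```python
def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE Pre-Authentication Encoding: the exact bytes that get signed."""
    type_bytes = payload_type.encode("utf-8")
    return b"DSSEv1 %d %b %d %b" % (
        len(type_bytes), type_bytes, len(payload), payload,
    )

# Example values taken from the DSSE specification, not from Scanner output.
encoded = pae("http://example.com/HelloWorld", b"hello world")
print(encoded)
# b'DSSEv1 29 http://example.com/HelloWorld 11 hello world'
```

Because the payload type is part of the signed bytes, a signature over a drift result cannot be replayed as a signature over a different document type.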

---

## 9. References

- **Architecture:** `docs/modules/scanner/reachability-drift.md`
- **API Reference:** `docs/api/scanner-drift-api.md`
- **PostgreSQL Guide:** `docs/operations/postgresql-guide.md`
- **Air-Gap Operations:** `docs/operations/airgap-operations-runbook.md`
- **Reachability Runbook:** `docs/operations/reachability-runbook.md`