# Reachability Drift Detection - Operations Guide
**Module:** Scanner  
**Version:** 1.0  
**Last Updated:** 2025-12-22

---
## 1. Prerequisites
### 1.1 Infrastructure Requirements
| Component | Minimum | Recommended | Notes |
|-----------|---------|-------------|-------|
| CPU | 4 cores | 8 cores | For call graph extraction |
| Memory | 4 GB | 8 GB | Large projects need more |
| PostgreSQL | 16+ | 16+ | With RLS enabled |
| Valkey/Redis | 7.0+ | 7.0+ | For caching (optional) |
| .NET Runtime | 10.0 | 10.0 | Preview features enabled |
### 1.2 Network Requirements
| Direction | Endpoints | Notes |
|-----------|-----------|-------|
| Inbound | Scanner API (8080) | Load balancer health checks |
| Outbound | PostgreSQL (5432) | Database connections |
| Outbound | Valkey (6379) | Cache connections (optional) |
| Outbound | Signer service | For DSSE attestations |
### 1.3 Dependencies
- Scanner WebService deployed and healthy
- PostgreSQL database with Scanner schema migrations applied
- (Optional) Valkey cluster for caching
- (Optional) Signer service for attestation signing
---
## 2. Configuration
### 2.1 Scanner Service Configuration
**File:** `etc/scanner.yaml`
```yaml
scanner:
  reachability:
    # Enable reachability drift detection
    enabled: true

    # Languages to analyze (empty = all supported)
    languages:
      - dotnet
      - java
      - node
      - python
      - go

    # Call graph extraction options
    extraction:
      max_depth: 100
      max_nodes: 100000
      timeout_seconds: 300
      include_test_code: false
      include_vendored: false

    # Drift detection options
    drift:
      # Auto-compute on scan completion
      auto_compute: true
      # Base scan selection (previous, tagged, specific)
      base_selection: previous
      # Emit VEX candidates for unreachable sinks
      emit_vex_candidates: true

storage:
  postgres:
    connection_string: "Host=localhost;Database=stellaops;Username=scanner;Password=${SCANNER_DB_PASSWORD}"
    schema: scanner
    pool_size: 20

cache:
  valkey:
    enabled: true
    connection: "localhost:6379"
    bucket: "stella-callgraph"
    ttl_hours: 24
    circuit_breaker:
      failure_threshold: 5
      timeout_seconds: 30
```
### 2.2 Valkey Cache Configuration
```yaml
# Valkey-specific settings
cache:
  valkey:
    enabled: true
    connection: "valkey-cluster.internal:6379"
    bucket: "stella-callgraph"
    ttl_hours: 24

    # Circuit breaker prevents cache storms
    circuit_breaker:
      failure_threshold: 5
      timeout_seconds: 30
      half_open_max_attempts: 3

    # Compression reduces memory usage
    compression:
      enabled: true
      algorithm: gzip
      level: fastest
```
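
The circuit breaker keys above map onto the usual closed / open / half-open state machine. The following is a minimal Python sketch of that pattern, not the Scanner's actual implementation: the class and method names are hypothetical, and the meaning given to `half_open_max_attempts` (the number of probe requests allowed while half-open) is an assumption.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    """Illustrative breaker; constructor arguments mirror the YAML keys."""

    def __init__(self, failure_threshold=5, timeout_seconds=30,
                 half_open_max_attempts=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.timeout_seconds = timeout_seconds
        self.half_open_max_attempts = half_open_max_attempts
        self._clock = clock  # injectable so the behaviour can be tested
        self._state = State.CLOSED
        self._failures = 0
        self._opened_at = 0.0
        self._probes = 0

    @property
    def state(self):
        # An open breaker becomes half-open once the timeout has elapsed.
        if (self._state is State.OPEN
                and self._clock() - self._opened_at >= self.timeout_seconds):
            self._state = State.HALF_OPEN
            self._probes = 0
        return self._state

    def allow_request(self):
        state = self.state
        if state is State.CLOSED:
            return True
        if state is State.HALF_OPEN and self._probes < self.half_open_max_attempts:
            self._probes += 1  # let a limited number of probes through
            return True
        return False  # open, or probe budget exhausted

    def record_success(self):
        self._state = State.CLOSED
        self._failures = 0

    def record_failure(self):
        if self.state is State.HALF_OPEN:
            self._trip()  # a failed probe re-opens the breaker immediately
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._trip()

    def _trip(self):
        self._state = State.OPEN
        self._opened_at = self._clock()
        self._failures = 0
```

With the values shown in the configuration, five consecutive cache failures would open the breaker, and requests would bypass the cache for 30 seconds before a few probes are let through.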
### 2.3 Policy Gate Configuration
**File:** `etc/policy.yaml`
```yaml
smart_diff:
  gates:
    # Block on KEV becoming reachable
    - id: drift_block_kev
      condition: "delta_reachable > 0 AND is_kev = true"
      action: block
      severity: critical
      message: "Known Exploited Vulnerability now reachable"

    # Block on high-severity sink becoming reachable
    - id: drift_block_critical
      condition: "delta_reachable > 0 AND max_cvss >= 9.0"
      action: block
      severity: critical
      message: "Critical vulnerability now reachable"

    # Warn on any new reachable paths
    - id: drift_warn_new_paths
      condition: "delta_reachable > 0"
      action: warn
      severity: medium
      message: "New reachable paths detected"

    # Auto-allow mitigated paths
    - id: drift_allow_mitigated
      condition: "delta_unreachable > 0 AND delta_reachable = 0"
      action: allow
      auto_approve: true
```
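
To see how the four gates above interact, the sketch below models them in Python and evaluates them against a set of drift facts. It is only an illustration: the guide does not state the evaluation order, so first-match-wins in listed order is an assumption, the conditions are hand-translated into lambdas rather than parsed from the condition strings, and `DriftFacts`, `Gate`, and `evaluate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class DriftFacts:
    delta_reachable: int
    delta_unreachable: int
    is_kev: bool
    max_cvss: float


@dataclass(frozen=True)
class Gate:
    id: str
    action: str
    condition: Callable[[DriftFacts], bool]


# Hand-translated from the YAML conditions above.
GATES = [
    Gate("drift_block_kev", "block",
         lambda f: f.delta_reachable > 0 and f.is_kev),
    Gate("drift_block_critical", "block",
         lambda f: f.delta_reachable > 0 and f.max_cvss >= 9.0),
    Gate("drift_warn_new_paths", "warn",
         lambda f: f.delta_reachable > 0),
    Gate("drift_allow_mitigated", "allow",
         lambda f: f.delta_unreachable > 0 and f.delta_reachable == 0),
]


def evaluate(facts: DriftFacts, gates: Sequence[Gate] = GATES) -> Optional[Gate]:
    """Return the first gate whose condition holds (assumed ordering), else None."""
    for gate in gates:
        if gate.condition(facts):
            return gate
    return None


# One newly reachable KEV sink: the KEV gate matches before the generic warning.
decision = evaluate(DriftFacts(delta_reachable=1, delta_unreachable=0,
                               is_kev=True, max_cvss=7.5))
print(decision.id, decision.action)
```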
---
## 3. Deployment Modes
### 3.1 Standalone Deployment
```bash
# Run Scanner WebService with drift detection
docker run -d \
  --name scanner \
  -p 8080:8080 \
  -e SCANNER_DB_PASSWORD=secret \
  -v /etc/scanner:/etc/scanner:ro \
  stellaops/scanner:latest

# Verify health
curl http://localhost:8080/health
```
### 3.2 Kubernetes Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scanner
  namespace: stellaops
spec:
  replicas: 3
  selector:
    matchLabels:
      app: scanner
  template:
    metadata:
      labels:
        app: scanner
    spec:
      containers:
        - name: scanner
          image: stellaops/scanner:latest
          ports:
            - containerPort: 8080
          env:
            - name: SCANNER_DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: scanner-secrets
                  key: db-password
          volumeMounts:
            - name: config
              mountPath: /etc/scanner
              readOnly: true
          resources:
            requests:
              memory: "4Gi"
              cpu: "2"
            limits:
              memory: "8Gi"
              cpu: "4"
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
      volumes:
        - name: config
          configMap:
            name: scanner-config
```
### 3.3 Air-Gapped Deployment
For air-gapped environments:
1. **Disable external lookups:**
   ```yaml
   scanner:
     reachability:
       offline_mode: true
       # No external advisory fetching
   ```

2. **Pre-load call graph caches:**
   ```bash
   # Export from connected environment
   stella cache export --type callgraph --output graphs.tar.gz

   # Import in air-gapped environment
   stella cache import --input graphs.tar.gz
   ```

3. **Use local VEX sources:**
   ```yaml
   excititor:
     sources:
       - type: local
         path: /data/vex-bundles/
   ```
---
## 4. Monitoring & Metrics
### 4.1 Key Metrics
| Metric | Type | Description | Alert Threshold |
|--------|------|-------------|-----------------|
| `scanner_callgraph_extraction_duration_seconds` | histogram | Time to extract call graph | p99 > 300s |
| `scanner_callgraph_node_count` | gauge | Nodes in extracted graph | > 100,000 |
| `scanner_reachability_analysis_duration_seconds` | histogram | BFS analysis time | p99 > 30s |
| `scanner_drift_newly_reachable_total` | counter | Count of newly reachable sinks | > 0 (alert) |
| `scanner_drift_newly_unreachable_total` | counter | Count of mitigated sinks | (info) |
| `scanner_cache_hit_ratio` | gauge | Valkey cache hit rate | < 0.5 |
| `scanner_cache_circuit_breaker_open` | gauge | Circuit breaker state | = 1 (alert) |
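
The `newly_reachable` and `newly_unreachable` counters above can be read as set differences between what is reachable in the base scan and in the head scan. The sketch below shows that idea with a plain breadth-first search. It is a simplified model under stated assumptions, not the Scanner's algorithm: graphs are adjacency dicts, both scans share the same entrypoints, and the function names are hypothetical.

```python
from collections import deque


def reachable_from(graph, entrypoints):
    """Breadth-first search over a call graph given as {caller: [callees]}."""
    seen = set(entrypoints)
    queue = deque(entrypoints)
    while queue:
        node = queue.popleft()
        for callee in graph.get(node, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


def compute_drift(base_graph, head_graph, entrypoints, sinks):
    """Compare which sinks are reachable in the base scan versus the head scan."""
    base = reachable_from(base_graph, entrypoints) & sinks
    head = reachable_from(head_graph, entrypoints) & sinks
    return {
        "newly_reachable": head - base,    # sinks that gained a path
        "newly_unreachable": base - head,  # sinks that lost every path
    }


# In the head scan, Parse starts calling Exec, so Exec gains a path from Main.
base = {"Main": ["Parse"], "Parse": [], "Legacy": ["Exec"]}
head = {"Main": ["Parse"], "Parse": ["Exec"], "Legacy": []}
print(compute_drift(base, head, ["Main"], {"Exec", "Deserialize"}))
```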
### 4.2 Grafana Dashboard
Import dashboard JSON from: `deploy/grafana/scanner-drift-dashboard.json`

Key panels:
- Drift detection rate over time
- Newly reachable sinks by category
- Call graph extraction latency
- Cache hit/miss ratio
- Circuit breaker state
### 4.3 Alert Rules
```yaml
# Prometheus alerting rules
groups:
  - name: scanner-drift
    rules:
      - alert: KevBecameReachable
        expr: increase(scanner_drift_kev_reachable_total[5m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "KEV vulnerability became reachable"
          description: "A Known Exploited Vulnerability is now reachable from public entrypoints"

      - alert: HighDriftRate
        expr: rate(scanner_drift_newly_reachable_total[1h]) > 10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High rate of new reachable vulnerabilities"

      - alert: CacheCircuitOpen
        expr: scanner_cache_circuit_breaker_open == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Valkey cache circuit breaker is open"
```
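
When tuning the `HighDriftRate` threshold, keep in mind that PromQL's `rate()` returns a per-second average over the window, whereas `increase()` returns the total growth over the window. The arithmetic below shows what a threshold of 10 means under each reading. The helper names are hypothetical, and whether the rule is meant as "10 per second" or "10 per hour" is not stated in this guide.

```python
SECONDS_PER_HOUR = 3600


def hourly_equivalent(rate_per_second: float) -> float:
    """Total increase over one hour implied by a per-second rate."""
    return rate_per_second * SECONDS_PER_HOUR


def per_second_rate(increase_per_hour: float) -> float:
    """Per-second rate implied by a total increase over one hour."""
    return increase_per_hour / SECONDS_PER_HOUR


# rate(...[1h]) > 10 fires only above this many newly reachable sinks per hour:
print(hourly_equivalent(10))

# "More than 10 per hour" corresponds to this much smaller per-second rate:
print(per_second_rate(10))
```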
---
## 5. Troubleshooting
### 5.1 Call Graph Extraction Failures
**Symptom:** `GRAPH_NOT_EXTRACTED` error
**Causes & Solutions:**
| Cause | Solution |
|-------|----------|
| Missing SDK/runtime | Install required SDK (.NET, Node.js, JDK) |
| Build errors in project | Fix compilation errors first |
| Timeout exceeded | Increase `extraction.timeout_seconds` |
| Memory exhaustion | Increase container memory limits |
| Unsupported language | Check language support matrix |
**Debugging:**
```bash
# Check extraction logs
kubectl logs -f deployment/scanner | grep -i extraction

# Manual extraction test
stella scan callgraph \
  --project /path/to/project \
  --language dotnet \
  --verbose
```
### 5.2 Drift Detection Issues
**Symptom:** Drift not computed or incorrect results
**Causes & Solutions:**
| Cause | Solution |
|-------|----------|
| No base scan available | Ensure previous scan exists |
| Different languages | Base and head scans must use the same language |
| Graph digest unchanged | No material code changes detected |
| Cache stale | Clear Valkey cache for scan |
**Debugging:**
```bash
# Check drift computation status
curl "http://scanner:8080/api/scanner/scans/{scanId}/drift"

# Force recomputation (JSON body, so set the content type explicitly)
curl -X POST \
  -H "Content-Type: application/json" \
  "http://scanner:8080/api/scanner/scans/{scanId}/compute-reachability" \
  -d '{"forceRecompute": true}'

# View graph digests
psql -c "SELECT scan_id, graph_digest FROM scanner.call_graph_snapshots ORDER BY extracted_at DESC LIMIT 10"
```
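
The "Graph digest unchanged" cause above implies that drift is skipped when the base and head graphs hash to the same value. This guide does not describe how `graph_digest` is computed; the sketch below assumes a SHA-256 over a canonical, sorted JSON form, purely to illustrate why reordering nodes or edges should not change the digest while adding an edge should. The function name and the `sha256:` prefix are hypothetical.

```python
import hashlib
import json


def graph_digest(nodes, edges):
    """Order-independent digest of a call graph (illustrative scheme only)."""
    canonical = json.dumps(
        {
            "nodes": sorted(set(nodes)),
            # Edges are (caller, callee) pairs; sorting removes ordering noise.
            "edges": sorted(set(map(tuple, edges))),
        },
        separators=(",", ":"),
        sort_keys=True,
    )
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = graph_digest(["Main", "Parse"], [("Main", "Parse")])
b = graph_digest(["Parse", "Main"], [("Main", "Parse")])
c = graph_digest(["Main", "Parse", "Exec"], [("Main", "Parse"), ("Parse", "Exec")])
print(a == b)  # same graph, different input order
print(a == c)  # an added node and edge changes the digest
```

If two scans you expect to differ show identical digests in `scanner.call_graph_snapshots`, the change probably did not alter the extracted call graph (for example, an edit confined to excluded test or vendored code).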
### 5.3 Cache Problems
**Symptom:** Slow performance, cache misses, circuit breaker open
**Solutions:**
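```bash
# Check Valkey connectivity
redis-cli -h valkey.internal ping

# Check circuit breaker state
curl -s "http://scanner:8080/health/ready" | jq '.checks.cache'

# Clear specific scan cache (replace scanId with the actual scan ID).
# DEL does not expand glob patterns, so list the matching keys first.
redis-cli -h valkey.internal --scan --pattern "stella-callgraph:scanId:*" \
  | xargs -r -n 1 redis-cli -h valkey.internal DEL

# Reset circuit breaker (restart scanner)
kubectl rollout restart deployment/scanner
```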
### 5.4 Common Error Messages
| Error | Meaning | Action |
|-------|---------|--------|
| `ERR_GRAPH_TOO_LARGE` | > 100K nodes | Increase `max_nodes` or split project |
| `ERR_EXTRACTION_TIMEOUT` | Analysis timed out | Increase timeout or reduce scope |
| `ERR_NO_ENTRYPOINTS` | No public entrypoints found | Check framework detection |
| `ERR_BASE_SCAN_MISSING` | Base scan not found | Specify valid `baseScanId` |
| `ERR_CACHE_UNAVAILABLE` | Valkey unreachable | Check network connectivity; the circuit breaker will open after repeated failures |
---
## 6. Performance Tuning
### 6.1 Call Graph Extraction
```yaml
scanner:
  reachability:
    extraction:
      # Exclude test code (reduces graph size)
      include_test_code: false
      # Exclude vendored dependencies
      include_vendored: false
      # Limit analysis depth
      max_depth: 50 # Default: 100
      # Parallel project analysis
      parallelism: 4
```
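
Lowering `max_depth` trades completeness for speed: call chains longer than the limit are not followed, so sinks that sit deep in the graph may go unseen. The sketch below illustrates that trade-off, together with a node cap analogous to `max_nodes`. How the Scanner actually enforces these limits is not described in this guide, so the traversal and the `GraphTooLarge` exception (a stand-in for `ERR_GRAPH_TOO_LARGE`) are assumptions.

```python
from collections import deque


class GraphTooLarge(Exception):
    """Hypothetical stand-in for the ERR_GRAPH_TOO_LARGE condition."""


def bounded_reachable(graph, entrypoints, max_depth=100, max_nodes=100_000):
    """BFS that stops expanding at max_depth and fails beyond max_nodes."""
    depth_of = {entry: 0 for entry in entrypoints}
    queue = deque(entrypoints)
    while queue:
        node = queue.popleft()
        if depth_of[node] >= max_depth:
            continue  # nodes at the depth limit are recorded but not expanded
        for callee in graph.get(node, ()):
            if callee not in depth_of:
                if len(depth_of) >= max_nodes:
                    raise GraphTooLarge(f"more than {max_nodes} nodes")
                depth_of[callee] = depth_of[node] + 1
                queue.append(callee)
    return depth_of


# A four-call chain: with max_depth=2 the deepest function is never visited.
chain = {"A": ["B"], "B": ["C"], "C": ["D"], "D": []}
print(sorted(bounded_reachable(chain, ["A"], max_depth=2)))
```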
### 6.2 Caching Strategy
```yaml
cache:
  valkey:
    # Longer TTL for stable projects
    ttl_hours: 72
    # Aggressive compression for large graphs
    compression:
      level: optimal # vs 'fastest'
    # Larger connection pool
    pool_size: 20
```
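
The `fastest` versus `optimal` choice is a CPU-for-memory trade. The sketch below compares the two on a synthetic call graph using Python's `gzip` module. Mapping `fastest` to gzip level 1 and `optimal` to level 9 is an assumption made for illustration, as is the JSON serialization; the Scanner's actual encoding may differ.

```python
import gzip
import json

# Assumed mapping of the config values onto gzip levels.
FASTEST, OPTIMAL = 1, 9


def compress_graph(graph, level):
    """Serialize a call graph to JSON and gzip it at the given level."""
    raw = json.dumps(graph, sort_keys=True).encode("utf-8")
    return gzip.compress(raw, compresslevel=level)


# Synthetic graph with repetitive symbol names, as call graphs tend to have.
graph = {f"Ns.Type{i}.Method": [f"Ns.Type{i + 1}.Method"] for i in range(500)}
raw_size = len(json.dumps(graph, sort_keys=True).encode("utf-8"))

for name, level in (("fastest", FASTEST), ("optimal", OPTIMAL)):
    size = len(compress_graph(graph, level))
    print(f"{name}: {size} bytes ({size / raw_size:.1%} of raw)")
```

Measuring on a representative graph from your own projects gives a better basis for the choice than the synthetic data used here.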
### 6.3 Database Optimization
```sql
-- Ensure indexes exist
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_callgraph_scan_lang
    ON scanner.call_graph_snapshots(scan_id, language);

CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_drift_head_scan
    ON scanner.reachability_drift_results(head_scan_id);

-- Vacuum after large imports
VACUUM ANALYZE scanner.call_graph_snapshots;
VACUUM ANALYZE scanner.reachability_drift_results;
```
---
## 7. Backup & Recovery
### 7.1 Database Backup
```bash
# Backup drift-related tables
pg_dump -h postgres.internal -U stellaops \
  -t scanner.call_graph_snapshots \
  -t scanner.reachability_results \
  -t scanner.reachability_drift_results \
  -t scanner.drifted_sinks \
  -t scanner.code_changes \
  > scanner_drift_backup.sql
```
### 7.2 Cache Recovery
```bash
# Export cache to file (if needed)
redis-cli -h valkey.internal --rdb /backup/callgraph-cache.rdb
# Cache is ephemeral - can be regenerated from database
# Recompute after cache loss:
stella scan recompute-reachability --all-pending
```
---
## 8. Security Considerations
### 8.1 Database Access
- Scanner service uses dedicated PostgreSQL user with schema-limited permissions
- Row-Level Security (RLS) enforces tenant isolation
- Connection strings use secrets management (not plaintext)
### 8.2 API Authentication
- All drift endpoints require valid Bearer token
- Scopes: `scanner:read`, `scanner:write`, `scanner:admin`
- Rate limiting prevents abuse
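
Because every drift endpoint expects a Bearer token, scripted calls need an `Authorization` header. The sketch below builds (but does not send) an authenticated request for the recompute endpoint shown in the troubleshooting section, using only the Python standard library. The function name is hypothetical, and how a token is obtained is outside the scope of this guide.

```python
import json
import urllib.request
from urllib.parse import quote


def build_recompute_request(base_url: str, scan_id: str, token: str):
    """Build a POST request that forces reachability recomputation for a scan."""
    url = f"{base_url}/api/scanner/scans/{quote(scan_id, safe='')}/compute-reachability"
    body = json.dumps({"forceRecompute": True}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# The token value here is a placeholder; read it from a secret store in practice.
request = build_recompute_request("http://scanner:8080", "scan-123", "example-token")
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it.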
### 8.3 Attestation Signing
- Drift results can be DSSE-signed for audit trails
- Signing keys managed by Signer service
- Optional Rekor transparency logging
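
For auditors verifying a signed drift result, it helps to know that DSSE signatures cover a pre-authentication encoding (PAE) of the payload type and payload rather than the raw payload. The sketch below implements PAE and the envelope layout as defined by the DSSE specification. The signature bytes are taken as an input because signing is handled by the Signer service, and the payload type in the example is a made-up placeholder, not the type the Scanner emits.

```python
import base64


def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE pre-authentication encoding: the exact bytes that get signed."""
    type_bytes = payload_type.encode("utf-8")
    return b"DSSEv1 %d %b %d %b" % (
        len(type_bytes), type_bytes, len(payload), payload
    )


def envelope(payload_type: str, payload: bytes, signature: bytes, keyid: str = "") -> dict:
    """Assemble a DSSE envelope from an already-computed signature."""
    return {
        "payload": base64.b64encode(payload).decode("ascii"),
        "payloadType": payload_type,
        "signatures": [
            {"keyid": keyid, "sig": base64.b64encode(signature).decode("ascii")}
        ],
    }


# Placeholder payload type and body, for illustration only.
body = b'{"newly_reachable": 1}'
print(pae("application/vnd.example.drift+json", body))
```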
---
## 9. References
- **Architecture:** `docs/modules/scanner/reachability-drift.md`
- **API Reference:** `docs/api/scanner-drift-api.md`
- **PostgreSQL Guide:** `docs/operations/postgresql-guide.md`
- **Air-Gap Operations:** `docs/operations/airgap-operations-runbook.md`
- **Reachability Runbook:** `docs/operations/reachability-runbook.md`