Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Branch: master
Date: 2026-03-27 12:28:00 +02:00
Commit: c58a236d70 (parent: fbd24e71de)
326 changed files with 18,500 additions and 463 deletions

---
checkId: check.scanner.queue
plugin: stellaops.doctor.scanner
severity: warn
tags: [scanner, queue, jobs, processing]
---
# Scanner Queue Health
## What It Checks
Queries the Scanner service at `/api/v1/queue/stats` and evaluates job queue health across four dimensions:
- **Queue depth**: warn at 100+ pending jobs, fail at 500+.
- **Failure rate**: warn at 5%+ of processed jobs failing, fail at 15%+.
- **Stuck jobs**: any stuck jobs trigger an immediate fail.
- **Backlog growth**: a growing backlog triggers a warning.
Evidence collected: `queue_depth`, `processing_rate_per_min`, `stuck_jobs`, `failed_jobs`, `failure_rate`, `oldest_job_age_min`, `backlog_growing`.
The check requires `Scanner:Url` or `Services:Scanner:Url` to be configured; otherwise it is skipped.
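The documented thresholds can be sketched as a small shell snippet showing how the four dimensions combine into a single verdict. This is an illustrative sketch only; the variable values are sample data, not a real `/api/v1/queue/stats` response.

```shell
# Sample evidence values (illustrative, not real API output)
queue_depth=120
failure_rate=3     # percent of processed jobs
stuck_jobs=0

# Fail conditions take precedence over warn conditions
verdict=ok
if [ "$stuck_jobs" -gt 0 ] || [ "$queue_depth" -ge 500 ] || [ "$failure_rate" -ge 15 ]; then
  verdict=fail
elif [ "$queue_depth" -ge 100 ] || [ "$failure_rate" -ge 5 ]; then
  verdict=warn
fi
echo "$verdict"
```

With the sample values above, the queue depth of 120 crosses the 100-job warn threshold but no fail threshold, so the check reports `warn`.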
## Why It Matters
The scanner queue is the central work pipeline for SBOM generation, vulnerability scanning, and reachability analysis. A backlogged or stuck queue delays security findings, blocks release gates that depend on scan results, and can cascade into approval timeouts. Stuck jobs indicate a worker crash or resource failure that will not self-heal.
## Common Causes
- Scanner worker process crashed or was OOM-killed
- Job dependency (registry, database) became unavailable mid-scan
- Resource exhaustion (CPU, memory, disk) on the scanner host
- Database connection lost during job processing
- Sudden spike in image pushes overwhelming worker capacity
- Processing rate slower than ingest rate during bulk import
## How to Fix
### Docker Compose
Check scanner worker status and restart if needed:
```bash
# View scanner container logs for errors
docker compose -f docker-compose.stella-ops.yml logs --tail 200 scanner
# Restart the scanner service
docker compose -f docker-compose.stella-ops.yml restart scanner
# Scale scanner workers (if using replicas)
docker compose -f docker-compose.stella-ops.yml up -d --scale scanner=4
```
Adjust concurrency via environment variables:
```yaml
environment:
Scanner__Queue__MaxConcurrentJobs: "4"
Scanner__Queue__StuckJobTimeoutMinutes: "30"
```
### Bare Metal / systemd
```bash
# Check scanner service status
sudo systemctl status stellaops-scanner
# View recent logs
sudo journalctl -u stellaops-scanner --since "1 hour ago"
# Restart the service
sudo systemctl restart stellaops-scanner
```
Edit `/etc/stellaops/scanner/appsettings.json`:
```json
{
"Queue": {
"MaxConcurrentJobs": 4,
"StuckJobTimeoutMinutes": 30
}
}
```
### Kubernetes / Helm
```bash
# Check scanner pod status
kubectl get pods -l app=stellaops-scanner
# View logs for crash loops
kubectl logs -l app=stellaops-scanner --tail=200
# Scale scanner deployment
kubectl scale deployment stellaops-scanner --replicas=4
```
Set in Helm `values.yaml`:
```yaml
scanner:
replicas: 4
queue:
maxConcurrentJobs: 4
stuckJobTimeoutMinutes: 30
```
## Verification
```bash
stella doctor run --check check.scanner.queue
```
## Related Checks
- `check.scanner.resources` -- scanner CPU/memory utilization affecting processing rate
- `check.scanner.sbom` -- SBOM generation failures may originate from queue issues
- `check.scanner.vuln` -- vulnerability scan health depends on queue throughput
- `check.operations.job-queue` -- platform-wide job queue health

---
checkId: check.scanner.reachability
plugin: stellaops.doctor.scanner
severity: warn
tags: [scanner, reachability, analysis, performance]
---
# Reachability Computation Health
## What It Checks
Queries the Scanner service at `/api/v1/reachability/stats` and evaluates reachability analysis performance and accuracy:
- **Computation failures**: fail if failure rate exceeds 10% of total computations.
- **Average computation time**: warn at 5,000ms, fail at 30,000ms.
- **Vulnerability filtering effectiveness**: reported as evidence (ratio of unreachable to total vulnerabilities).
Evidence collected: `total_computations`, `computation_failures`, `failure_rate`, `avg_computation_time_ms`, `p95_computation_time_ms`, `reachable_vulns`, `unreachable_vulns`, `filter_rate`.
The check requires `Scanner:Url` or `Services:Scanner:Url` to be configured.
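The latency and failure thresholds above combine as sketched below. The values are illustrative sample data, not a real `/api/v1/reachability/stats` response.

```shell
# Sample evidence values (illustrative only)
avg_computation_time_ms=7200
failure_rate=4     # percent of total computations

verdict=ok
if [ "$failure_rate" -gt 10 ] || [ "$avg_computation_time_ms" -ge 30000 ]; then
  verdict=fail
elif [ "$avg_computation_time_ms" -ge 5000 ]; then
  verdict=warn
fi
echo "$verdict"
```

Here the 7,200 ms average exceeds the 5,000 ms warn threshold but stays under the 30,000 ms fail threshold, so the check reports `warn`.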
## Why It Matters
Reachability analysis is what separates actionable vulnerability findings from noise. It determines which vulnerabilities are actually reachable in the call graph, filtering out false positives that would otherwise block releases or waste triage time. Slow computations delay security feedback loops, and failures mean vulnerabilities are reported without reachability context, inflating finding counts and eroding operator trust.
## Common Causes
- Invalid or incomplete call graph data from the SBOM/slice pipeline
- Missing slice cache entries forcing full recomputation
- Timeout on large codebases with deep dependency trees
- Memory exhaustion during graph traversal on complex projects
- Complex call graphs with high fan-out or cyclical references
- Insufficient CPU/memory allocated to scanner workers
## How to Fix
### Docker Compose
```bash
# Check scanner logs for reachability errors
docker compose -f docker-compose.stella-ops.yml logs scanner | grep -i "reachability\|computation"
# Warm the slice cache to speed up subsequent computations
stella scanner cache warm
# Increase scanner resources
```
```yaml
services:
scanner:
deploy:
resources:
limits:
memory: 4G
cpus: "4.0"
environment:
Scanner__Reachability__TimeoutMs: "60000"
Scanner__Reachability__MaxGraphDepth: "100"
```
### Bare Metal / systemd
```bash
# View reachability computation errors
sudo journalctl -u stellaops-scanner --since "1 hour ago" | grep -i reachability
# Retry failed computations
stella scanner reachability retry --failed
# Warm the slice cache
stella scanner cache warm
```
Edit `/etc/stellaops/scanner/appsettings.json`:
```json
{
"Reachability": {
"TimeoutMs": 60000,
"MaxGraphDepth": 100,
"MaxConcurrentComputations": 4
}
}
```
### Kubernetes / Helm
```bash
# Check scanner pod resource usage
kubectl top pods -l app=stellaops-scanner
# Scale scanner workers for parallel computation
kubectl scale deployment stellaops-scanner --replicas=4
```
Set in Helm `values.yaml`:
```yaml
scanner:
replicas: 4
resources:
limits:
memory: 4Gi
cpu: "4"
reachability:
timeoutMs: 60000
maxGraphDepth: 100
```
## Verification
```bash
stella doctor run --check check.scanner.reachability
```
## Related Checks
- `check.scanner.slice.cache` -- cache misses are a primary cause of slow computations
- `check.scanner.witness.graph` -- reachability depends on witness graph integrity
- `check.scanner.sbom` -- SBOM quality directly affects reachability accuracy
- `check.scanner.resources` -- resource constraints cause computation timeouts

---
checkId: check.scanner.resources
plugin: stellaops.doctor.scanner
severity: warn
tags: [scanner, resources, cpu, memory, workers]
---
# Scanner Resource Utilization
## What It Checks
Queries the Scanner service at `/api/v1/resources/stats` and evaluates CPU, memory, and worker pool health:
- **CPU utilization**: warn at 75%, fail at 90%.
- **Memory utilization**: warn at 80%, fail at 95%.
- **Worker pool saturation**: warn when all workers are busy (zero idle workers).
Evidence collected: `cpu_utilization`, `memory_utilization`, `memory_used_mb`, `active_workers`, `total_workers`, `idle_workers`.
The check requires `Scanner:Url` or `Services:Scanner:Url` to be configured.
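As a sketch, the three resource dimensions combine as follows; the values are illustrative, not a real `/api/v1/resources/stats` payload.

```shell
# Sample evidence values (illustrative only)
cpu_utilization=82   # percent
memory_utilization=60
idle_workers=3

verdict=ok
if [ "$cpu_utilization" -ge 90 ] || [ "$memory_utilization" -ge 95 ]; then
  verdict=fail
elif [ "$cpu_utilization" -ge 75 ] || [ "$memory_utilization" -ge 80 ] || [ "$idle_workers" -eq 0 ]; then
  verdict=warn
fi
echo "$verdict"
```

CPU at 82% crosses the 75% warn line while memory and the worker pool stay healthy, so the sketch reports `warn`.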
## Why It Matters
The scanner is one of the most resource-intensive services in the Stella Ops stack. It processes container images, generates SBOMs, runs vulnerability matching, and performs reachability analysis. When scanner resources are exhausted, all downstream pipelines stall: queue depth grows, scan latency increases, and release gates time out waiting for scan results. Memory exhaustion can cause OOM kills that lose in-progress work.
## Common Causes
- High scan volume during bulk import or CI surge
- Memory leak from accumulated scan artifacts not being garbage collected
- Large container images (multi-GB layers) being processed concurrently
- Insufficient CPU/memory allocation relative to workload
- All workers busy with no capacity for new jobs
- Worker scaling not keeping up with demand
## How to Fix
### Docker Compose
```bash
# Check scanner resource usage
docker stats scanner --no-stream
# Reduce concurrent jobs to lower resource pressure
# In docker-compose.stella-ops.yml:
```
```yaml
services:
scanner:
deploy:
resources:
limits:
memory: 4G
cpus: "4.0"
environment:
Scanner__MaxConcurrentJobs: "2"
Scanner__Workers__Count: "4"
```
```bash
# Restart scanner to apply new resource limits
docker compose -f docker-compose.stella-ops.yml up -d scanner
```
### Bare Metal / systemd
```bash
# Check current resource usage
top -p "$(pgrep -d, -f stellaops-scanner)"
# Reduce concurrent processing
stella scanner config set MaxConcurrentJobs 2
```
Edit `/etc/stellaops/scanner/appsettings.json`:
```json
{
"Scanner": {
"MaxConcurrentJobs": 2,
"Workers": {
"Count": 4
}
}
}
```
```bash
sudo systemctl restart stellaops-scanner
```
### Kubernetes / Helm
```bash
# Check pod resource usage
kubectl top pods -l app=stellaops-scanner
# Scale horizontally instead of vertically
kubectl scale deployment stellaops-scanner --replicas=4
```
Set in Helm `values.yaml`:
```yaml
scanner:
replicas: 4
resources:
requests:
memory: 2Gi
cpu: "2"
limits:
memory: 4Gi
cpu: "4"
maxConcurrentJobs: 2
```
## Verification
```bash
stella doctor run --check check.scanner.resources
```
## Related Checks
- `check.scanner.queue` -- resource exhaustion causes queue backlog growth
- `check.scanner.sbom` -- memory exhaustion causes SBOM generation failures
- `check.scanner.reachability` -- CPU constraints slow computation times
- `check.scanner.slice.cache` -- cache effectiveness reduces resource demand

---
checkId: check.scanner.sbom
plugin: stellaops.doctor.scanner
severity: warn
tags: [scanner, sbom, cyclonedx, spdx, compliance]
---
# SBOM Generation Health
## What It Checks
Queries the Scanner service at `/api/v1/sbom/stats` and evaluates SBOM generation health:
- **Success rate**: warn when below 95%, fail when below 80%.
- **Validation failures**: any schema validation failures trigger a warning regardless of success rate.
Evidence collected: `total_generated`, `successful_generations`, `failed_generations`, `success_rate`, `format_cyclonedx`, `format_spdx`, `validation_failures`.
The check requires `Scanner:Url` or `Services:Scanner:Url` to be configured.
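The interplay of the two conditions (success rate bands, plus validation failures warning regardless) can be sketched as below, with sample values rather than a real `/api/v1/sbom/stats` response.

```shell
# Sample evidence values (illustrative only)
success_rate=97      # percent
validation_failures=2

verdict=ok
if [ "$success_rate" -lt 80 ]; then
  verdict=fail
elif [ "$success_rate" -lt 95 ] || [ "$validation_failures" -gt 0 ]; then
  verdict=warn
fi
echo "$verdict"
```

A 97% success rate alone would pass, but the two schema validation failures still pull the verdict down to `warn`.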
## Why It Matters
SBOMs are the foundation of the entire Stella Ops security pipeline. Without valid SBOMs, vulnerability scanning produces incomplete results, reachability analysis cannot run, and release gates that require an SBOM attestation will block promotions. Compliance frameworks (e.g., EO 14028, EU CRA) mandate accurate SBOMs for every shipped artifact.
## Common Causes
- Invalid or corrupted source artifacts (truncated layers, missing manifests)
- Parser errors for specific ecosystems (e.g., unsupported lockfile format)
- Memory exhaustion on large monorepo or multi-module projects
- SBOM schema validation failures due to generator version mismatch
- Unsupported container base image format
- Minor parsing issues in transitive dependency resolution
## How to Fix
### Docker Compose
```bash
# View recent SBOM generation failures
docker compose -f docker-compose.stella-ops.yml logs scanner | grep -i "sbom.*fail"
# Restart the scanner to clear any cached bad state
docker compose -f docker-compose.stella-ops.yml restart scanner
# Increase memory limit if OOM is suspected
# In docker-compose.stella-ops.yml:
```
```yaml
services:
scanner:
deploy:
resources:
limits:
memory: 4G
environment:
Scanner__Sbom__ValidationMode: "Strict"
Scanner__Sbom__MaxArtifactSizeMb: "500"
```
### Bare Metal / systemd
```bash
# Check scanner logs for SBOM errors
sudo journalctl -u stellaops-scanner --since "1 hour ago" | grep -i sbom
# Retry failed SBOMs
stella scanner sbom retry --failed
```
Edit `/etc/stellaops/scanner/appsettings.json`:
```json
{
"Sbom": {
"ValidationMode": "Strict",
"MaxArtifactSizeMb": 500
}
}
```
### Kubernetes / Helm
```bash
# Check for OOMKilled scanner pods
kubectl get pods -l app=stellaops-scanner -o wide
kubectl describe pod <scanner-pod> | grep -A 5 "Last State"
# View SBOM-related logs
kubectl logs -l app=stellaops-scanner --tail=200 | grep -i sbom
```
Set in Helm `values.yaml`:
```yaml
scanner:
resources:
limits:
memory: 4Gi
sbom:
validationMode: Strict
maxArtifactSizeMb: 500
```
## Verification
```bash
stella doctor run --check check.scanner.sbom
```
## Related Checks
- `check.scanner.queue` -- queue backlog can delay SBOM generation
- `check.scanner.witness.graph` -- witness graphs depend on successful SBOM output
- `check.scanner.resources` -- resource exhaustion is a top cause of SBOM failures

---
checkId: check.scanner.slice.cache
plugin: stellaops.doctor.scanner
severity: warn
tags: [scanner, cache, slice, performance]
---
# Slice Cache Health
## What It Checks
Queries the Scanner service at `/api/v1/cache/stats` and evaluates slice cache effectiveness:
- **Storage utilization**: warn at 80% full, fail at 95% full.
- **Hit rate**: warn below 50%, fail below 20%.
- **Eviction rate**: reported as evidence.
Evidence collected: `hit_rate`, `hits`, `misses`, `entry_count`, `used_bytes`, `total_bytes`, `storage_utilization`, `eviction_rate`.
The check requires `Scanner:Url` or `Services:Scanner:Url` to be configured.
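The storage and hit-rate bands combine as sketched below; the values are illustrative, not a real `/api/v1/cache/stats` response.

```shell
# Sample evidence values (illustrative only)
storage_utilization=70  # percent full
hit_rate=42             # percent of lookups served from cache

verdict=ok
if [ "$storage_utilization" -ge 95 ] || [ "$hit_rate" -lt 20 ]; then
  verdict=fail
elif [ "$storage_utilization" -ge 80 ] || [ "$hit_rate" -lt 50 ]; then
  verdict=warn
fi
echo "$verdict"
```

Storage is comfortably under its thresholds, but the 42% hit rate falls below the 50% warn line, so the sketch reports `warn`.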
## Why It Matters
The slice cache stores pre-computed code slices used by the reachability engine. A healthy cache avoids re-analyzing the same dependency trees on every scan, reducing computation time from seconds to milliseconds. When the cache hit rate drops, reachability computations slow dramatically, causing queue backlog and delayed security feedback. When storage fills up, evictions accelerate and the cache thrashes, making it effectively useless.
## Common Causes
- Cache size limit too small for the working set of scanned artifacts
- TTL configured too long, preventing eviction of stale entries
- Eviction policy not working (configuration error)
- Unexpected growth in the number of unique slices (new projects onboarded)
- Cache was recently cleared (restart, volume reset)
- Working set larger than cache capacity
## How to Fix
### Docker Compose
```bash
# Clear stale cache entries
stella scanner cache prune --stale
# Warm the cache for active projects
stella scanner cache warm
```
Increase cache size in `docker-compose.stella-ops.yml`:
```yaml
services:
scanner:
environment:
Scanner__Cache__MaxSizeBytes: "4294967296" # 4 GB
Scanner__Cache__TtlHours: "72"
Scanner__Cache__EvictionPolicy: "LRU"
volumes:
- scanner-cache:/data/cache
```
### Bare Metal / systemd
```bash
# Check cache directory size
du -sh /var/lib/stellaops/scanner/cache
# Prune stale entries
stella scanner cache prune --stale
# Warm cache
stella scanner cache warm
```
Edit `/etc/stellaops/scanner/appsettings.json`:
```json
{
"Cache": {
"MaxSizeBytes": 4294967296,
"TtlHours": 72,
"EvictionPolicy": "LRU",
"DataPath": "/var/lib/stellaops/scanner/cache"
}
}
```
```bash
sudo systemctl restart stellaops-scanner
```
### Kubernetes / Helm
```bash
# Check PVC usage for cache volume
kubectl exec -it <scanner-pod> -- df -h /data/cache
```
Set in Helm `values.yaml`:
```yaml
scanner:
cache:
maxSizeBytes: 4294967296 # 4 GB
ttlHours: 72
evictionPolicy: LRU
persistence:
enabled: true
size: 10Gi
storageClass: fast-ssd
```
## Verification
```bash
stella doctor run --check check.scanner.slice.cache
```
## Related Checks
- `check.scanner.reachability` -- cache misses directly increase computation time
- `check.scanner.resources` -- cache thrashing increases CPU and memory usage
- `check.scanner.queue` -- slow cache performance cascades into queue backlog

---
checkId: check.scanner.vuln
plugin: stellaops.doctor.scanner
severity: warn
tags: [scanner, vulnerability, cve, database]
---
# Vulnerability Scan Health
## What It Checks
Queries the Scanner service at `/api/v1/vuln/stats` and evaluates vulnerability scanning health, focusing on database freshness:
- **Database freshness**: warn when the vulnerability database is older than 24 hours, fail when older than 72 hours.
- **Scan failure rate**: warn when scan failure rate exceeds 10%.
Evidence collected: `database_age_hours`, `last_db_update`, `total_cves`, `scans_completed`, `scan_failures`, `failure_rate`, `vulnerabilities_found`.
The check requires `Scanner:Url` or `Services:Scanner:Url` to be configured.
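The freshness and failure-rate thresholds combine as sketched below, using illustrative values rather than a real `/api/v1/vuln/stats` payload.

```shell
# Sample evidence values (illustrative only)
database_age_hours=30
failure_rate=2       # percent of scans failing

verdict=ok
if [ "$database_age_hours" -gt 72 ]; then
  verdict=fail
elif [ "$database_age_hours" -gt 24 ] || [ "$failure_rate" -gt 10 ]; then
  verdict=warn
fi
echo "$verdict"
```

A 30-hour-old database is past the 24-hour warn threshold but inside the 72-hour fail window, so the check reports `warn`.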
## Why It Matters
A stale vulnerability database means newly disclosed CVEs are not detected in scans, creating a false sense of security. Artifacts that pass policy gates with an outdated database may contain exploitable vulnerabilities that would have been caught with current data. In regulated environments, scan freshness is an auditable compliance requirement. High scan failure rates mean some artifacts are not being scanned at all.
## Common Causes
- Vulnerability database sync job failed or is not scheduled
- Feed source (NVD, OSV, vendor advisory) unavailable or rate-limited
- Network connectivity issue preventing feed downloads
- Scheduled sync delayed due to system overload
- Parsing errors on specific artifact formats
- Unsupported package ecosystem or lockfile format
## How to Fix
### Docker Compose
```bash
# Trigger an immediate database sync
stella scanner db sync
# Check sync job status
stella scanner db status
# View scanner logs for sync errors
docker compose -f docker-compose.stella-ops.yml logs scanner | grep -i "sync\|feed\|vuln"
```
Configure sync schedule in `docker-compose.stella-ops.yml`:
```yaml
services:
scanner:
environment:
Scanner__VulnDb__SyncIntervalHours: "6"
Scanner__VulnDb__FeedSources: "nvd,osv,github"
Scanner__VulnDb__RetryCount: "3"
```
### Bare Metal / systemd
```bash
# Trigger manual sync
stella scanner db sync
# Check sync schedule
stella scanner db schedule
# View sync logs
sudo journalctl -u stellaops-scanner --since "24 hours ago" | grep -i "sync\|feed"
```
Edit `/etc/stellaops/scanner/appsettings.json`:
```json
{
"VulnDb": {
"SyncIntervalHours": 6,
"FeedSources": ["nvd", "osv", "github"],
"RetryCount": 3
}
}
```
### Kubernetes / Helm
```bash
# Check scanner pod logs for sync status
kubectl logs -l app=stellaops-scanner --tail=100 | grep -i sync
# Verify CronJob for database sync exists and is running
kubectl get cronjobs -l app=stellaops-scanner-sync
```
Set in Helm `values.yaml`:
```yaml
scanner:
vulnDb:
syncIntervalHours: 6
feedSources:
- nvd
- osv
- github
retryCount: 3
syncCronJob:
schedule: "0 */6 * * *"
```
## Verification
```bash
stella doctor run --check check.scanner.vuln
```
## Related Checks
- `check.scanner.queue` -- scan failures may originate from queue processing issues
- `check.scanner.sbom` -- vulnerability matching depends on SBOM quality
- `check.scanner.reachability` -- reachability analysis uses vulnerability data to filter findings

---
checkId: check.scanner.witness.graph
plugin: stellaops.doctor.scanner
severity: warn
tags: [scanner, witness, graph, reachability, evidence]
---
# Witness Graph Health
## What It Checks
Queries the Scanner service at `/api/v1/witness/stats` and evaluates witness graph construction health:
- **Construction failures**: fail if failure rate exceeds 10% of total constructions.
- **Incomplete graphs**: warn if any graphs are incomplete (missing nodes or edges).
- **Consistency errors**: warn if any consistency errors are detected (orphaned nodes, version mismatches).
Evidence collected: `total_constructed`, `construction_failures`, `failure_rate`, `incomplete_graphs`, `avg_nodes_per_graph`, `avg_edges_per_graph`, `avg_completeness`, `consistency_errors`.
The check requires `Scanner:Url` or `Services:Scanner:Url` to be configured.
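The three conditions combine as sketched below; values are illustrative sample data, not a real `/api/v1/witness/stats` response.

```shell
# Sample evidence values (illustrative only)
failure_rate=3        # percent of constructions failing
incomplete_graphs=1
consistency_errors=0

verdict=ok
if [ "$failure_rate" -gt 10 ]; then
  verdict=fail
elif [ "$incomplete_graphs" -gt 0 ] || [ "$consistency_errors" -gt 0 ]; then
  verdict=warn
fi
echo "$verdict"
```

The failure rate is within bounds, but the single incomplete graph is enough to trigger a `warn`.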
## Why It Matters
Witness graphs are the evidence artifacts that prove how a vulnerability reachability verdict was reached. They record the call chain from application entry point to vulnerable function. Without intact witness graphs, reachability findings lack provenance, attestation of scan results is weakened, and auditors cannot verify that "unreachable" verdicts are legitimate. Incomplete or inconsistent graphs can cause incorrect reachability conclusions.
## Common Causes
- Missing SBOM input (SBOM generation failed for the artifact)
- Parser error on specific artifact types or ecosystems
- Cyclical dependency detected causing infinite traversal
- Resource exhaustion during graph construction on large projects
- Partial SBOM data (some dependencies resolved, others missing)
- Missing transitive dependencies in the dependency tree
- Version mismatch between SBOM and slice data
- Orphaned nodes from stale cache entries
## How to Fix
### Docker Compose
```bash
# View recent construction failures
docker compose -f docker-compose.stella-ops.yml logs scanner | grep -i "witness.*fail\|graph.*error"
# Rebuild failed graphs
stella scanner witness rebuild --failed
# Check SBOM pipeline health (witness graphs depend on SBOMs)
stella doctor run --check check.scanner.sbom
```
```yaml
services:
scanner:
environment:
Scanner__WitnessGraph__MaxDepth: "50"
Scanner__WitnessGraph__TimeoutMs: "30000"
Scanner__WitnessGraph__ConsistencyCheckEnabled: "true"
```
### Bare Metal / systemd
```bash
# View construction errors
sudo journalctl -u stellaops-scanner --since "1 hour ago" | grep -i witness
# Rebuild failed graphs
stella scanner witness rebuild --failed
# View graph statistics
stella scanner witness stats
```
Edit `/etc/stellaops/scanner/appsettings.json`:
```json
{
"WitnessGraph": {
"MaxDepth": 50,
"TimeoutMs": 30000,
"ConsistencyCheckEnabled": true
}
}
```
### Kubernetes / Helm
```bash
# Check scanner logs for witness graph issues
kubectl logs -l app=stellaops-scanner --tail=200 | grep -i witness
# Rebuild failed graphs
kubectl exec -it <scanner-pod> -- stella scanner witness rebuild --failed
```
Set in Helm `values.yaml`:
```yaml
scanner:
witnessGraph:
maxDepth: 50
timeoutMs: 30000
consistencyCheckEnabled: true
```
## Verification
```bash
stella doctor run --check check.scanner.witness.graph
```
## Related Checks
- `check.scanner.sbom` -- witness graphs are constructed from SBOM data
- `check.scanner.reachability` -- reachability verdicts depend on witness graph integrity
- `check.scanner.slice.cache` -- stale cache entries can cause consistency errors
- `check.scanner.resources` -- resource exhaustion causes construction failures