# Accuracy Metrics Framework
## Overview
This document defines the accuracy metrics framework used to measure and track the correctness of StellaOps scanner results. All metrics are computed against ground truth datasets and published quarterly.
## Metric Definitions
### Confusion Matrix
For binary classification tasks (e.g., reachable vs unreachable):
| | Predicted Positive | Predicted Negative |
|--|-------------------|-------------------|
| **Actual Positive** | True Positive (TP) | False Negative (FN) |
| **Actual Negative** | False Positive (FP) | True Negative (TN) |
### Core Metrics
| Metric | Formula | Description | Target |
|--------|---------|-------------|--------|
| **Precision** | TP / (TP + FP) | Of items flagged, how many were correct | >= 90% |
| **Recall** | TP / (TP + FN) | Of actual positives, how many were found | >= 85% |
| **F1 Score** | 2 * (P * R) / (P + R) | Harmonic mean of precision and recall | >= 87% |
| **False Positive Rate** | FP / (FP + TN) | Rate of incorrect positive flags | <= 10% |
| **Accuracy** | (TP + TN) / Total | Overall correctness | >= 90% |
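The formulas above can be applied directly to confusion-matrix counts. A minimal sketch (illustrative only; the in-repo benchmark tooling may differ):

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the core accuracy metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positive_rate": fpr,
        "accuracy": accuracy,
    }

# Example: 90 TP, 10 FP, 85 TN, 15 FN
print(classification_metrics(90, 10, 85, 15))
```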
---
## Reachability Analysis Accuracy
### Definitions
- **True Positive (TP)**: Correctly identified as reachable (code path actually exists)
- **False Positive (FP)**: Incorrectly identified as reachable (no real code path)
- **True Negative (TN)**: Correctly identified as unreachable (no code path exists)
- **False Negative (FN)**: Incorrectly identified as unreachable (code path exists but missed)
### Target Metrics
| Metric | Target | Stretch Goal |
|--------|--------|--------------|
| Precision | >= 90% | >= 95% |
| Recall | >= 85% | >= 90% |
| F1 Score | >= 87% | >= 92% |
| False Positive Rate | <= 10% | <= 5% |
### Per-Language Targets
| Language | Precision | Recall | F1 | Notes |
|----------|-----------|--------|-----|-------|
| Java | >= 92% | >= 88% | >= 90% | Strong static analysis support |
| C# | >= 90% | >= 85% | >= 87% | Roslyn-based analysis |
| Go | >= 88% | >= 82% | >= 85% | Good call graph support |
| JavaScript | >= 85% | >= 78% | >= 81% | Dynamic typing challenges |
| Python | >= 83% | >= 75% | >= 79% | Dynamic typing challenges |
| TypeScript | >= 88% | >= 82% | >= 85% | Better than JS due to types |
---
## Lattice State Accuracy
VEX lattice states have different confidence requirements:
| State | Definition | Target Accuracy | Validation |
|-------|------------|-----------------|------------|
| **CR** (Confirmed Reachable) | Runtime evidence + static path | >= 95% | Runtime trace verification |
| **SR** (Static Reachable) | Static path only | >= 90% | Static analysis coverage |
| **SU** (Static Unreachable) | No static path found | >= 85% | Negative proof verification |
| **DT** (Denied by Tool) | Tool analysis confirms not affected | >= 90% | Tool output validation |
| **DV** (Denied by Vendor) | Vendor VEX statement | >= 95% | VEX signature verification |
| **U** (Unknown) | Insufficient evidence | Track % | Minimize unknowns |
### Lattice Transition Accuracy
Measure accuracy of automatic state transitions:
| Transition | Trigger | Target Accuracy |
|------------|---------|-----------------|
| U -> SR | Static analysis finds path | >= 90% |
| SR -> CR | Runtime evidence added | >= 95% |
| U -> SU | Static analysis proves unreachable | >= 85% |
| SR -> DT | Tool-specific analysis | >= 90% |
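A minimal sketch of how the tracked transitions in this table might be encoded for accuracy measurement (state names as defined above; the actual VEX lattice implementation may differ):

```python
from enum import Enum

class LatticeState(Enum):
    CR = "confirmed_reachable"
    SR = "static_reachable"
    SU = "static_unreachable"
    DT = "denied_by_tool"
    DV = "denied_by_vendor"
    U = "unknown"

# Automatic transitions whose accuracy is tracked (from the table above).
TRACKED_TRANSITIONS = {
    (LatticeState.U, LatticeState.SR): "static analysis finds path",
    (LatticeState.SR, LatticeState.CR): "runtime evidence added",
    (LatticeState.U, LatticeState.SU): "static analysis proves unreachable",
    (LatticeState.SR, LatticeState.DT): "tool-specific analysis",
}

def is_tracked_transition(old: LatticeState, new: LatticeState) -> bool:
    """Return True if the transition is one whose accuracy is measured."""
    return (old, new) in TRACKED_TRANSITIONS
```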
---
## SBOM Completeness Metrics
### Component Detection
| Metric | Formula | Target | Notes |
|--------|---------|--------|-------|
| **Component Recall** | Found / Total Actual | >= 98% | Find all real components |
| **Component Precision** | Real / Reported | >= 99% | Minimize phantom components |
| **Version Accuracy** | Correct Versions / Total | >= 95% | Version string correctness |
| **License Accuracy** | Correct Licenses / Total | >= 90% | License detection accuracy |
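These completeness metrics reduce to set comparisons between a ground truth SBOM and the scanner output. A hedged sketch (the component/version representation here is illustrative, not the actual data model):

```python
def sbom_completeness(ground_truth: dict[str, str], reported: dict[str, str]) -> dict:
    """ground_truth and reported map component name -> version string."""
    truth_names = set(ground_truth)
    reported_names = set(reported)
    found = truth_names & reported_names  # real components that were reported
    recall = len(found) / len(truth_names) if truth_names else 1.0
    precision = len(found) / len(reported_names) if reported_names else 1.0
    correct_versions = sum(
        1 for name in found if reported[name] == ground_truth[name]
    )
    version_accuracy = correct_versions / len(found) if found else 1.0
    return {
        "component_recall": recall,
        "component_precision": precision,
        "version_accuracy": version_accuracy,
    }
```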
### Per-Ecosystem Targets
| Ecosystem | Comp. Recall | Comp. Precision | Version Acc. |
|-----------|--------------|-----------------|--------------|
| Alpine APK | >= 99% | >= 99% | >= 98% |
| Debian DEB | >= 99% | >= 99% | >= 98% |
| npm | >= 97% | >= 98% | >= 95% |
| Maven | >= 98% | >= 99% | >= 96% |
| NuGet | >= 98% | >= 99% | >= 96% |
| PyPI | >= 96% | >= 98% | >= 94% |
| Go Modules | >= 97% | >= 98% | >= 95% |
| Cargo (Rust) | >= 98% | >= 99% | >= 96% |
---
## Vulnerability Detection Accuracy
### CVE Matching
| Metric | Formula | Target |
|--------|---------|--------|
| **CVE Recall** | Found CVEs / Actual CVEs | >= 95% |
| **CVE Precision** | Correct CVEs / Reported CVEs | >= 98% |
| **Version Range Accuracy** | Correct Affected / Total | >= 93% |
### False Positive Categories
Track and minimize specific FP types:
| FP Type | Description | Target Rate |
|---------|-------------|-------------|
| **Phantom Component** | CVE for component not present | <= 1% |
| **Version Mismatch** | CVE for wrong version | <= 3% |
| **Ecosystem Confusion** | Wrong package with same name | <= 1% |
| **Stale Advisory** | Already fixed but flagged | <= 2% |
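For illustration, these FP categories can be assigned by comparing a flagged finding against the ground truth SBOM and known fix state. The sketch below is a simplified decision order, not the production triage logic:

```python
def categorize_false_positive(finding: dict, truth_sbom: dict, fixed_cves: set) -> str:
    """finding: {"component", "ecosystem", "version", "cve"}.
    truth_sbom maps (component, ecosystem) -> installed version. Illustrative only."""
    key = (finding["component"], finding["ecosystem"])
    if finding["cve"] in fixed_cves:
        return "stale_advisory"  # already fixed but still flagged
    if key not in truth_sbom:
        # Same name in a different ecosystem => confusion; otherwise phantom.
        same_name_elsewhere = any(
            name == finding["component"] for (name, _eco) in truth_sbom
        )
        return "ecosystem_confusion" if same_name_elsewhere else "phantom_component"
    if truth_sbom[key] != finding["version"]:
        return "version_mismatch"  # CVE matched against the wrong version
    return "uncategorized"
```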
---
## Measurement Methodology
### Ground Truth Establishment
1. **Manual Curation**
   - Expert review of sample applications
   - Documented decision rationale
   - Multiple reviewer consensus
2. **Automated Verification**
   - Cross-reference with authoritative sources
   - NVD, OSV, GitHub Advisory Database
   - Vendor security bulletins
3. **Runtime Validation**
   - Dynamic analysis confirmation
   - Exploit proof-of-concept testing
   - Production monitoring correlation
### Test Corpus Requirements
| Category | Minimum Samples | Diversity Requirements |
|----------|-----------------|----------------------|
| Reachability | 50 per language | Mix of libraries, frameworks |
| SBOM | 100 images | All major ecosystems |
| CVE Detection | 500 CVEs | Mix of severities, ages |
| Performance | 10 reference images | Various sizes |
### Measurement Process
```
1. Select ground truth corpus
└── Minimum samples per category
└── Representative of production workloads
2. Run scanner with deterministic manifest
└── Fixed advisory database version
└── Reproducible configuration
3. Compare results to ground truth
└── Automated diff tooling
└── Manual review of discrepancies
4. Compute metrics per category
└── Generate confusion matrices
└── Calculate precision/recall/F1
5. Aggregate and publish
└── Per-ecosystem breakdown
└── Overall summary metrics
└── Trend analysis
```
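Step 3 of this process (comparing scanner output to ground truth) produces the confusion-matrix counts consumed by the metric formulas earlier in this document. A minimal sketch, assuming both sides are reduced to sets of finding identifiers:

```python
def confusion_counts(predicted_reachable: set, actual_reachable: set, universe: set) -> dict:
    """universe is the full set of evaluated findings (reachable or not)."""
    tp = len(predicted_reachable & actual_reachable)
    fp = len(predicted_reachable - actual_reachable)
    fn = len(actual_reachable - predicted_reachable)
    tn = len(universe - predicted_reachable - actual_reachable)
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}
```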
---
## Reporting Format
### Quarterly Benchmark Report
```json
{
"report_version": "1.0",
"scanner_version": "1.3.0",
"report_date": "2025-12-14",
"ground_truth_version": "2025-Q4",
"reachability": {
"overall": {
"precision": 0.91,
"recall": 0.86,
"f1": 0.88,
"samples": 450
},
"by_language": {
"java": {"precision": 0.93, "recall": 0.88, "f1": 0.90, "samples": 100},
"csharp": {"precision": 0.90, "recall": 0.85, "f1": 0.87, "samples": 80},
"go": {"precision": 0.89, "recall": 0.83, "f1": 0.86, "samples": 70}
}
},
"sbom": {
"component_recall": 0.98,
"component_precision": 0.99,
"version_accuracy": 0.96
},
"vulnerability": {
"cve_recall": 0.96,
"cve_precision": 0.98,
"false_positive_rate": 0.02
},
"lattice_states": {
"cr_accuracy": 0.96,
"sr_accuracy": 0.91,
"su_accuracy": 0.87
}
}
```
---
## Regression Detection
### Thresholds
A regression is flagged when:
| Metric | Regression Threshold | Action |
|--------|---------------------|--------|
| Precision | > 3% decrease | Block release |
| Recall | > 5% decrease | Block release |
| F1 | > 4% decrease | Block release |
| FPR | > 2% increase | Block release |
| Any metric | > 1% change | Investigate |
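A sketch of how these thresholds translate into a release gate. The actual check is performed by `stellaops benchmark compare` (shown below); this only illustrates the rule set, with deltas interpreted as absolute changes in the metric value:

```python
# Maximum allowed change before a release is blocked (from the table above).
BLOCKING_THRESHOLDS = {
    "precision": -0.03,  # > 3% decrease blocks
    "recall":    -0.05,  # > 5% decrease blocks
    "f1":        -0.04,  # > 4% decrease blocks
    "fpr":       +0.02,  # > 2% increase blocks
}

def blocking_regressions(baseline: dict, current: dict) -> list[str]:
    """Return the metrics whose change exceeds the blocking threshold."""
    blocked = []
    for metric, limit in BLOCKING_THRESHOLDS.items():
        delta = current[metric] - baseline[metric]
        if (limit < 0 and delta < limit) or (limit > 0 and delta > limit):
            blocked.append(metric)
    return blocked
```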
### CI Integration
```yaml
# .gitea/workflows/accuracy-check.yml
accuracy-benchmark:
runs-on: ubuntu-latest
steps:
- name: Run accuracy benchmark
run: make benchmark-accuracy
- name: Check for regressions
run: |
stellaops benchmark compare \
--baseline results/baseline.json \
--current results/current.json \
--threshold-precision 0.03 \
--threshold-recall 0.05 \
--fail-on-regression
```
---
## Ground Truth Sources
### Internal
- `datasets/reachability/samples/` - Reachability ground truth
- `datasets/sbom/reference/` - Known-good SBOMs
- `bench/findings/` - CVE finding ground truth
### External
- **NIST SARD** - Software Assurance Reference Dataset
- **OSV Test Suite** - Open Source Vulnerability test cases
- **OWASP Benchmark** - Security testing benchmark
- **Juliet Test Suite** - CWE coverage testing
---
## Improvement Tracking
### Gap Analysis
Identify and prioritize accuracy improvements:
| Gap | Current | Target | Priority | Improvement Plan |
|-----|---------|--------|----------|------------------|
| Python recall | 73% | 78% | High | Improve type inference |
| npm precision | 96% | 98% | Medium | Fix aliasing issues |
| Version accuracy | 94% | 96% | Medium | Better version parsing |
### Quarterly Goals
Track progress against improvement targets:
| Quarter | Focus Area | Metric | Target | Actual |
|---------|------------|--------|--------|--------|
| Q4 2025 | Java reachability | Recall | 88% | TBD |
| Q1 2026 | Python support | F1 | 80% | TBD |
| Q1 2026 | SBOM completeness | Recall | 99% | TBD |
---
## References
- [FIRST CVSS v4.0 Specification](https://www.first.org/cvss/v4.0/specification-document)
- [NIST NVD API](https://nvd.nist.gov/developers)
- [OSV Schema](https://ossf.github.io/osv-schema/)
- [StellaOps Reachability Architecture](../modules/scanner/reachability.md)

# Performance Baselines
## Overview
This document defines performance baselines for StellaOps scanner operations. All metrics are measured against reference images and workloads to ensure consistent, reproducible benchmarks.
**Last Updated:** 2025-12-14
**Next Review:** 2026-03-14
---
## Reference Images
Standard images used for performance benchmarking:
| Image | Size | Components | Expected Vulns | Category |
|-------|------|------------|----------------|----------|
| `alpine:3.19` | 7MB | ~15 | ~5 | Minimal |
| `debian:12-slim` | 75MB | ~90 | ~40 | Minimal |
| `ubuntu:22.04` | 77MB | ~100 | ~50 | Standard |
| `node:20-alpine` | 180MB | ~200 | ~100 | Application |
| `python:3.12` | 1GB | ~300 | ~150 | Application |
| `mcr.microsoft.com/dotnet/aspnet:8.0` | 220MB | ~150 | ~75 | Application |
| `nginx:1.25` | 190MB | ~120 | ~60 | Application |
| `postgres:16-alpine` | 240MB | ~140 | ~70 | Database |
---
## Scan Performance Targets
### Container Image Scanning
| Image Category | P50 Time | P95 Time | Max Memory | CPU Cores |
|---------------|----------|----------|------------|-----------|
| Minimal (<100MB) | < 5s | < 10s | < 256MB | 1 |
| Standard (100-500MB) | < 15s | < 30s | < 512MB | 2 |
| Large (500MB-2GB) | < 45s | < 90s | < 1.5GB | 2 |
| Very Large (>2GB) | < 120s | < 240s | < 2GB | 4 |
### Per-Image Targets
| Image | P50 Time | P95 Time | Max Memory |
|-------|----------|----------|------------|
| alpine:3.19 | < 3s | < 8s | < 200MB |
| debian:12-slim | < 8s | < 15s | < 300MB |
| ubuntu:22.04 | < 10s | < 20s | < 400MB |
| node:20-alpine | < 20s | < 40s | < 600MB |
| python:3.12 | < 35s | < 70s | < 1.2GB |
| dotnet/aspnet:8.0 | < 25s | < 50s | < 800MB |
| nginx:1.25 | < 18s | < 35s | < 500MB |
| postgres:16-alpine | < 22s | < 45s | < 600MB |
---
## Reachability Analysis Targets
### By Codebase Size
| Codebase Size | P50 Time | P95 Time | Memory | Notes |
|---------------|----------|----------|--------|-------|
| Tiny (<5k LOC) | < 10s | < 20s | < 256MB | Single service |
| Small (5-20k LOC) | < 30s | < 60s | < 512MB | Small service |
| Medium (20-50k LOC) | < 2min | < 4min | < 1GB | Typical microservice |
| Large (50-100k LOC) | < 5min | < 10min | < 2GB | Large service |
| Very Large (100-500k LOC) | < 15min | < 30min | < 4GB | Monolith |
| Monorepo (>500k LOC) | < 45min | < 90min | < 8GB | Enterprise monorepo |
### By Language
| Language | Relative Speed | Notes |
|----------|---------------|-------|
| Go | 1.0x (baseline) | Fast due to simple module system |
| Java | 1.2x | Maven/Gradle resolution adds overhead |
| C# | 1.3x | MSBuild/NuGet resolution |
| TypeScript | 1.5x | npm/yarn resolution, complex imports |
| Python | 1.8x | Virtual env resolution, dynamic imports |
| JavaScript | 2.0x | Complex bundler configurations |
---
## SBOM Generation Targets
| Format | P50 Time | P95 Time | Output Size | Notes |
|--------|----------|----------|-------------|-------|
| CycloneDX 1.6 (JSON) | < 1s | < 3s | ~50KB/100 components | Standard |
| CycloneDX 1.6 (XML) | < 1.5s | < 4s | ~80KB/100 components | Verbose |
| SPDX 3.0.1 (JSON) | < 1s | < 3s | ~60KB/100 components | Standard |
| SPDX 3.0.1 (Tag-Value) | < 1.2s | < 3.5s | ~70KB/100 components | Legacy format |
### Combined Operations
| Operation | P50 Time | P95 Time |
|-----------|----------|----------|
| Scan + SBOM | scan_time + 1s | scan_time + 3s |
| Scan + SBOM + Reachability | scan_time + reach_time + 2s | scan_time + reach_time + 5s |
| Full attestation pipeline | total_time + 2s | total_time + 5s |
---
## VEX Processing Targets
| Operation | P50 Time | P95 Time | Notes |
|-----------|----------|----------|-------|
| VEX document parsing | < 50ms | < 150ms | Per document |
| Lattice state computation | < 100ms | < 300ms | Per 100 vulnerabilities |
| VEX consensus merge | < 200ms | < 500ms | 3-5 sources |
| State transition | < 10ms | < 30ms | Single transition |
---
## CVSS Scoring Targets
| Operation | P50 Time | P95 Time | Notes |
|-----------|----------|----------|-------|
| MacroVector lookup | < 1μs | < 5μs | Dictionary lookup |
| CVSS v4.0 base score | < 10μs | < 50μs | Full computation |
| CVSS v4.0 full score | < 20μs | < 100μs | Base + threat + env |
| Vector parsing | < 5μs | < 20μs | String parsing |
| Receipt generation | < 100μs | < 500μs | Includes hashing |
| Batch scoring (100 vulns) | < 5ms | < 15ms | Parallel processing |
---
## Attestation Targets
| Operation | P50 Time | P95 Time | Notes |
|-----------|----------|----------|-------|
| DSSE envelope creation | < 50ms | < 150ms | Includes signing |
| DSSE verification | < 30ms | < 100ms | Signature check |
| Rekor submission | < 500ms | < 2s | Network dependent |
| Rekor verification | < 300ms | < 1s | Network dependent |
| in-toto predicate | < 20ms | < 80ms | JSON serialization |
---
## Database Operation Targets
| Operation | P50 Time | P95 Time | Notes |
|-----------|----------|----------|-------|
| Receipt insert | < 5ms | < 20ms | Single record |
| Receipt query (by ID) | < 2ms | < 10ms | Indexed lookup |
| Receipt query (by tenant) | < 10ms | < 50ms | Index scan |
| EPSS lookup (single) | < 1ms | < 5ms | Indexed |
| EPSS lookup (batch 100) | < 10ms | < 50ms | Batch query |
| Risk score insert | < 5ms | < 20ms | Single record |
| Risk score update | < 3ms | < 15ms | Single record |
---
## Regression Thresholds
Performance regression is detected when metrics exceed these thresholds compared to baseline:
| Metric | Warning Threshold | Blocking Threshold | Action |
|--------|------------------|-------------------|--------|
| P50 Time | > 15% increase | > 25% increase | Block release |
| P95 Time | > 20% increase | > 35% increase | Block release |
| Memory Usage | > 20% increase | > 30% increase | Block release |
| CPU Time | > 15% increase | > 25% increase | Investigate |
| Throughput | > 10% decrease | > 20% decrease | Block release |
### Regression Detection Rules
1. **Warning**: Alert engineering team, add to release notes
2. **Blocking**: Cannot merge/release until resolved or waived
3. **Waiver**: Requires documented justification and SME approval
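The warning/blocking tiers above can be expressed as a small classification step over the measured deltas. A sketch, assuming deltas are computed as relative change against the stored baseline:

```python
# (warning, blocking) relative-change thresholds per metric, from the table above.
# Positive values flag increases (time, memory); negative values flag decreases (throughput).
THRESHOLDS = {
    "p50_time":   (0.15, 0.25),
    "p95_time":   (0.20, 0.35),
    "memory":     (0.20, 0.30),
    "cpu_time":   (0.15, 0.25),
    "throughput": (-0.10, -0.20),
}

def classify(metric: str, baseline: float, current: float) -> str:
    """Classify a single metric's change as 'ok', 'warning', or 'blocking'."""
    warn, block = THRESHOLDS[metric]
    change = (current - baseline) / baseline
    if warn < 0:  # throughput-style metric: regressions are decreases
        change, warn, block = -change, -warn, -block
    if change > block:
        return "blocking"
    if change > warn:
        return "warning"
    return "ok"
```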
---
## Measurement Methodology
### Environment Setup
```bash
# Standard test environment
# - CPU: 8 cores (x86_64)
# - Memory: 16GB RAM
# - Storage: NVMe SSD
# - OS: Ubuntu 22.04 LTS
# - Docker: 24.x
# Clear caches before cold start tests (dropping page caches requires root)
docker system prune -af
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
```
### Scan Performance
```bash
# Cold start measurement
time stellaops scan --image alpine:3.19 --format json > /dev/null
# Warm cache measurement (run 3x, take average)
for i in {1..3}; do
time stellaops scan --image alpine:3.19 --format json > /dev/null
done
# Memory profiling
/usr/bin/time -v stellaops scan --image alpine:3.19 --format json 2>&1 | \
grep "Maximum resident set size"
# CPU profiling
perf stat stellaops scan --image alpine:3.19 --format json > /dev/null
```
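The P50/P95 figures in the target tables come from repeated timed runs. A small helper for reducing raw timings to those percentiles (a sketch; hyperfine and the in-repo Make targets produce equivalent summaries):

```python
import statistics

def summarize_runs(durations_ms: list[float]) -> dict:
    """Reduce repeated scan timings (milliseconds) to the reported percentiles."""
    ordered = sorted(durations_ms)
    # quantiles with n=20 yields 19 cut points; index 9 ~ P50, index 18 ~ P95.
    q = statistics.quantiles(ordered, n=20)
    return {
        "p50_ms": q[9],
        "p95_ms": q[18],
        "mean_ms": statistics.fmean(ordered),
        "runs": len(ordered),
    }

# Example: ten cold-start runs of the same image
print(summarize_runs([2750, 2810, 2790, 2900, 3050, 2880, 2840, 4100, 2990, 2830]))
```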
### Reachability Analysis
```bash
# Time measurement
time stellaops reach --project ./src --language csharp --out reach.json
# Memory profiling
/usr/bin/time -v stellaops reach --project ./src --language csharp --out reach.json 2>&1
# With detailed timing
stellaops reach --project ./src --language csharp --out reach.json --timing
```
### SBOM Generation
```bash
# Time measurement
time stellaops sbom --image node:20-alpine --format cyclonedx --out sbom.json
# Output size
stellaops sbom --image node:20-alpine --format cyclonedx --out sbom.json && \
ls -lh sbom.json
```
### Batch Operations
```bash
# Process multiple images in parallel
time stellaops scan --images images.txt --parallel 4 --format json --out-dir ./results
# Throughput test (images per minute)
START=$(date +%s)
for i in {1..10}; do
stellaops scan --image alpine:3.19 --format json > /dev/null
done
END=$(date +%s)
echo "Throughput: $(( 10 * 60 / (END - START) )) images/minute"
```
---
## CI Integration
### Benchmark Workflow
```yaml
# .gitea/workflows/performance-benchmark.yml
name: Performance Benchmark
on:
pull_request:
branches: [main]
schedule:
- cron: '0 2 * * 1' # Weekly Monday 2am
jobs:
benchmark:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run benchmarks
run: make benchmark-performance
- name: Check for regressions
run: |
stellaops benchmark compare \
--baseline results/baseline.json \
--current results/current.json \
--threshold-p50 0.15 \
--threshold-p95 0.20 \
--threshold-memory 0.20 \
--fail-on-regression
- name: Upload results
uses: actions/upload-artifact@v4
with:
name: benchmark-results
path: results/
```
### Local Testing
```bash
# Run full benchmark suite
make benchmark-performance
# Run specific image benchmark
make benchmark-image IMAGE=alpine:3.19
# Generate baseline
make benchmark-baseline
# Compare against baseline
make benchmark-compare
```
---
## Optimization Guidelines
### For Scan Performance
1. **Pre-pull images** for consistent timing
2. **Use layered caching** for repeat scans
3. **Enable parallel analysis** for multi-ecosystem images
4. **Consider selective scanning** for known-safe layers
### For Reachability
1. **Incremental analysis** for unchanged files
2. **Cache resolved dependencies**
3. **Use language-specific optimizations** (e.g., Roslyn for C#)
4. **Limit call graph depth** for very large codebases
### For Memory
1. **Stream large SBOMs** instead of loading fully
2. **Use batched database operations**
3. **Release intermediate data structures early**
4. **Configure GC appropriately for workload**
---
## Historical Baselines
### Version History
| Version | Date | P50 Scan (alpine) | P50 Reach (50k LOC) | Notes |
|---------|------|-------------------|---------------------|-------|
| 1.3.0 | 2025-12-14 | TBD | TBD | Current |
| 1.2.0 | 2025-09-01 | TBD | TBD | Previous |
| 1.1.0 | 2025-06-01 | TBD | TBD | Baseline |
### Improvement Targets
| Quarter | Focus Area | Target | Status |
|---------|------------|--------|--------|
| Q1 2026 | Scan cold start | -20% | Planned |
| Q1 2026 | Reachability memory | -15% | Planned |
| Q2 2026 | SBOM generation | -10% | Planned |
---
## References
- [Accuracy Metrics Framework](accuracy-metrics-framework.md)
- [Benchmark Submission Guide](submission-guide.md) (pending)
- [Scanner Architecture](../modules/scanner/architecture.md)
- [Reachability Module](../modules/scanner/reachability.md)

# Benchmark Submission Guide
**Last Updated:** 2025-12-14
**Next Review:** 2026-03-14
---
## Overview
StellaOps publishes benchmarks for:
- **Reachability Analysis** - Accuracy of static and runtime path detection
- **SBOM Completeness** - Component detection and version accuracy
- **Vulnerability Detection** - Precision, recall, and F1 scores
- **Scan Performance** - Time, memory, and CPU metrics
- **Determinism** - Reproducibility of scan outputs
This guide explains how to reproduce, validate, and submit benchmark results.
---
## 1. PREREQUISITES
### 1.1 System Requirements
| Requirement | Minimum | Recommended |
|-------------|---------|-------------|
| CPU | 4 cores | 8 cores |
| Memory | 8 GB | 16 GB |
| Storage | 50 GB SSD | 100 GB NVMe |
| OS | Ubuntu 22.04 LTS | Ubuntu 22.04 LTS |
| Docker | 24.x | 24.x |
| .NET | 10.0 | 10.0 |
### 1.2 Environment Setup
```bash
# Clone the repository
git clone https://git.stella-ops.org/stella-ops.org/git.stella-ops.org.git
cd git.stella-ops.org
# Install .NET 10 SDK
sudo apt-get update
sudo apt-get install -y dotnet-sdk-10.0
# Install Docker (if not present)
curl -fsSL https://get.docker.com | sh
# Install benchmark dependencies
sudo apt-get install -y \
jq \
b3sum \
hyperfine \
time
# Set determinism environment variables
export TZ=UTC
export LC_ALL=C
export STELLAOPS_DETERMINISM_SEED=42
export STELLAOPS_DETERMINISM_TIMESTAMP="2025-01-01T00:00:00Z"
```
### 1.3 Pull Reference Images
```bash
# Download standard benchmark images
make benchmark-pull-images
# Or manually:
docker pull alpine:3.19
docker pull debian:12-slim
docker pull ubuntu:22.04
docker pull node:20-alpine
docker pull python:3.12
docker pull mcr.microsoft.com/dotnet/aspnet:8.0
docker pull nginx:1.25
docker pull postgres:16-alpine
```
---
## 2. RUNNING BENCHMARKS
### 2.1 Full Benchmark Suite
```bash
# Run all benchmarks (takes ~30-60 minutes)
make benchmark-all
# Output: results/benchmark-all-$(date +%Y%m%d).json
```
### 2.2 Category-Specific Benchmarks
#### Reachability Benchmark
```bash
# Run reachability accuracy benchmarks
make benchmark-reachability
# With specific language filter
make benchmark-reachability LANG=csharp
# Output: results/reachability/benchmark-reachability-$(date +%Y%m%d).json
```
#### Performance Benchmark
```bash
# Run scan performance benchmarks
make benchmark-performance
# Single image
make benchmark-image IMAGE=alpine:3.19
# Output: results/performance/benchmark-performance-$(date +%Y%m%d).json
```
#### SBOM Benchmark
```bash
# Run SBOM completeness benchmarks
make benchmark-sbom
# Specific format
make benchmark-sbom FORMAT=cyclonedx
# Output: results/sbom/benchmark-sbom-$(date +%Y%m%d).json
```
#### Determinism Benchmark
```bash
# Run determinism verification
make benchmark-determinism
# Output: results/determinism/benchmark-determinism-$(date +%Y%m%d).json
```
### 2.3 CLI Benchmark Commands
```bash
# Performance timing with hyperfine (10 runs)
hyperfine --warmup 2 --runs 10 \
'stellaops scan --image alpine:3.19 --format json --output /dev/null'
# Memory profiling
/usr/bin/time -v stellaops scan --image alpine:3.19 --format json 2>&1 | \
grep "Maximum resident set size"
# CPU profiling (Linux)
perf stat stellaops scan --image alpine:3.19 --format json > /dev/null
# Determinism check (run twice, compare hashes)
stellaops scan --image alpine:3.19 --format json | sha256sum > run1.sha
stellaops scan --image alpine:3.19 --format json | sha256sum > run2.sha
diff run1.sha run2.sha && echo "DETERMINISTIC" || echo "NON-DETERMINISTIC"
```
---
## 3. OUTPUT FORMATS
### 3.1 Reachability Results Schema
```json
{
"benchmark": "reachability-v1",
"date": "2025-12-14T00:00:00Z",
"scanner_version": "1.3.0",
"scanner_commit": "abc123def",
"environment": {
"os": "ubuntu-22.04",
"arch": "amd64",
"cpu": "Intel Xeon E-2288G",
"memory_gb": 16
},
"summary": {
"total_samples": 200,
"precision": 0.92,
"recall": 0.87,
"f1": 0.894,
"false_positive_rate": 0.08,
"false_negative_rate": 0.13
},
"by_language": {
"java": {
"samples": 50,
"precision": 0.94,
"recall": 0.88,
"f1": 0.909,
"confusion_matrix": {
"tp": 44, "fp": 3, "tn": 2, "fn": 1
}
},
"csharp": {
"samples": 50,
"precision": 0.91,
"recall": 0.86,
"f1": 0.884,
"confusion_matrix": {
"tp": 43, "fp": 4, "tn": 2, "fn": 1
}
},
"typescript": {
"samples": 50,
"precision": 0.89,
"recall": 0.84,
"f1": 0.864,
"confusion_matrix": {
"tp": 42, "fp": 5, "tn": 2, "fn": 1
}
},
"python": {
"samples": 50,
"precision": 0.88,
"recall": 0.83,
"f1": 0.854,
"confusion_matrix": {
"tp": 41, "fp": 5, "tn": 3, "fn": 1
}
}
},
"ground_truth_ref": "datasets/reachability/v2025.12",
"raw_results_ref": "results/reachability/raw/2025-12-14/"
}
```
### 3.2 Performance Results Schema
```json
{
"benchmark": "performance-v1",
"date": "2025-12-14T00:00:00Z",
"scanner_version": "1.3.0",
"scanner_commit": "abc123def",
"environment": {
"os": "ubuntu-22.04",
"arch": "amd64",
"cpu": "Intel Xeon E-2288G",
"memory_gb": 16,
"storage": "nvme"
},
"images": [
{
"image": "alpine:3.19",
"size_mb": 7,
"components": 15,
"vulnerabilities": 5,
"runs": 10,
"cold_start": {
"p50_ms": 2800,
"p95_ms": 4200,
"mean_ms": 3100
},
"warm_cache": {
"p50_ms": 1500,
"p95_ms": 2100,
"mean_ms": 1650
},
"memory_peak_mb": 180,
"cpu_time_ms": 1200
},
{
"image": "python:3.12",
"size_mb": 1024,
"components": 300,
"vulnerabilities": 150,
"runs": 10,
"cold_start": {
"p50_ms": 32000,
"p95_ms": 48000,
"mean_ms": 35000
},
"warm_cache": {
"p50_ms": 18000,
"p95_ms": 25000,
"mean_ms": 19500
},
"memory_peak_mb": 1100,
"cpu_time_ms": 28000
}
],
"aggregated": {
"total_images": 8,
"total_runs": 80,
"avg_time_per_mb_ms": 35,
"avg_memory_per_component_kb": 400
}
}
```
### 3.3 SBOM Results Schema
```json
{
"benchmark": "sbom-v1",
"date": "2025-12-14T00:00:00Z",
"scanner_version": "1.3.0",
"summary": {
"total_images": 8,
"component_recall": 0.98,
"component_precision": 0.995,
"version_accuracy": 0.96
},
"by_ecosystem": {
"apk": {
"ground_truth_components": 100,
"detected_components": 99,
"correct_versions": 96,
"recall": 0.99,
"precision": 0.99,
"version_accuracy": 0.96
},
"npm": {
"ground_truth_components": 500,
"detected_components": 492,
"correct_versions": 475,
"recall": 0.984,
"precision": 0.998,
"version_accuracy": 0.965
}
},
"formats_tested": ["cyclonedx-1.6", "spdx-3.0.1"]
}
```
### 3.4 Determinism Results Schema
```json
{
"benchmark": "determinism-v1",
"date": "2025-12-14T00:00:00Z",
"scanner_version": "1.3.0",
"summary": {
"total_runs": 100,
"bitwise_identical": 100,
"bitwise_fidelity": 1.0,
"semantic_identical": 100,
"semantic_fidelity": 1.0
},
"by_image": {
"alpine:3.19": {
"runs": 20,
"bitwise_identical": 20,
"output_hash": "sha256:abc123..."
},
"python:3.12": {
"runs": 20,
"bitwise_identical": 20,
"output_hash": "sha256:def456..."
}
},
"seed": 42,
"timestamp_frozen": "2025-01-01T00:00:00Z"
}
```
---
## 4. SUBMISSION PROCESS
### 4.1 Internal Submission (StellaOps Team)
Benchmark results are automatically collected by CI:
```yaml
# .gitea/workflows/weekly-benchmark.yml triggers:
# - Weekly benchmark runs
# - Results stored in internal dashboard
# - Regression detection against baselines
```
Manual submission:
```bash
# Upload to internal dashboard
make benchmark-submit
# Or via CLI
stellaops benchmark submit \
--file results/benchmark-all-20251214.json \
--dashboard internal
```
### 4.2 External Validation Submission
Third parties can validate and submit benchmark results:
#### Step 1: Fork and Clone
```bash
# Fork the benchmark repository
# https://git.stella-ops.org/stella-ops.org/benchmarks
git clone https://git.stella-ops.org/<your-org>/benchmarks.git
cd benchmarks
```
#### Step 2: Run Benchmarks
```bash
# With StellaOps scanner
make benchmark-all SCANNER=stellaops
# Or with your own tool for comparison
make benchmark-all SCANNER=your-tool
```
#### Step 3: Prepare Submission
```bash
# Results directory structure
mkdir -p submissions/<your-org>/<date>
# Copy results
cp results/*.json submissions/<your-org>/<date>/
# Add reproduction README
cat > submissions/<your-org>/<date>/README.md <<EOF
# Benchmark Results: <Your Org>
**Date:** $(date -u +%Y-%m-%d)
**Scanner:** <tool-name>
**Version:** <version>
## Environment
- OS: <os>
- CPU: <cpu>
- Memory: <memory>
## Reproduction Steps
<steps>
## Notes
<any observations>
EOF
```
#### Step 4: Submit Pull Request
```bash
git checkout -b benchmark-results-$(date +%Y%m%d)
git add submissions/
git commit -m "Add benchmark results from <your-org> $(date +%Y-%m-%d)"
git push origin benchmark-results-$(date +%Y%m%d)
# Create PR via web interface or gh CLI
gh pr create --title "Benchmark: <your-org> $(date +%Y-%m-%d)" \
--body "Benchmark results for external validation"
```
### 4.3 Submission Review Process
| Step | Action | Timeline |
|------|--------|----------|
| 1 | PR submitted | Day 0 |
| 2 | Automated validation runs | Day 0 (CI) |
| 3 | Maintainer review | Day 1-3 |
| 4 | Results published (if valid) | Day 3-5 |
| 5 | Dashboard updated | Day 5 |
---
## 5. BENCHMARK CATEGORIES
### 5.1 Reachability Benchmark
**Purpose:** Measure accuracy of static and runtime reachability analysis.
**Ground Truth Source:** `datasets/reachability/`
**Test Cases:**
- 50+ samples per language (Java, C#, TypeScript, Python, Go)
- Known-reachable vulnerable paths
- Known-unreachable vulnerable code
- Runtime-only reachable code
**Scoring:**
```
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 * (Precision * Recall) / (Precision + Recall)
```
**Targets:**
| Metric | Target | Blocking |
|--------|--------|----------|
| Precision | >= 90% | >= 85% |
| Recall | >= 85% | >= 80% |
| F1 | >= 87% | >= 82% |
### 5.2 Performance Benchmark
**Purpose:** Measure scan time, memory usage, and CPU utilization.
**Reference Images:** See [Performance Baselines](performance-baselines.md)
**Metrics:**
- P50/P95 scan time (cold and warm)
- Peak memory usage
- CPU time
- Throughput (images/minute)
**Targets:**
| Image Category | P50 Time | P95 Time | Max Memory |
|----------------|----------|----------|------------|
| Minimal (<100MB) | < 5s | < 10s | < 256MB |
| Standard (100-500MB) | < 15s | < 30s | < 512MB |
| Large (500MB-2GB) | < 45s | < 90s | < 1.5GB |
### 5.3 SBOM Benchmark
**Purpose:** Measure component detection completeness and accuracy.
**Ground Truth Source:** Manual SBOM audits of reference images.
**Metrics:**
- Component recall (found / total)
- Component precision (real / reported)
- Version accuracy (correct / total)
**Targets:**
| Metric | Target |
|--------|--------|
| Component Recall | >= 98% |
| Component Precision | >= 99% |
| Version Accuracy | >= 95% |
### 5.4 Vulnerability Detection Benchmark
**Purpose:** Measure CVE detection accuracy against known-vulnerable images.
**Ground Truth Source:** `datasets/vulns/` curated CVE lists.
**Metrics:**
- True positive rate
- False positive rate
- False negative rate
- Precision/Recall/F1
**Targets:**
| Metric | Target |
|--------|--------|
| Precision | >= 95% |
| Recall | >= 90% |
| F1 | >= 92% |
### 5.5 Determinism Benchmark
**Purpose:** Verify reproducible scan outputs.
**Methodology:**
1. Run same scan N times (default: 20)
2. Compare output hashes
3. Calculate bitwise fidelity (see the sketch below)
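Bitwise fidelity from step 3 is simply the share of runs whose output hash matches the most common hash, consistent with the determinism results schema above. A sketch under that assumption:

```python
from collections import Counter

def bitwise_fidelity(output_hashes: list[str]) -> float:
    """Fraction of runs whose output hash equals the modal hash (1.0 = fully deterministic)."""
    if not output_hashes:
        return 0.0
    most_common_count = Counter(output_hashes).most_common(1)[0][1]
    return most_common_count / len(output_hashes)

# 20 runs, all identical => 1.0
print(bitwise_fidelity(["sha256:abc123"] * 20))
```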
**Targets:**
| Metric | Target |
|--------|--------|
| Bitwise Fidelity | 100% |
| Semantic Fidelity | 100% |
---
## 6. COMPARING RESULTS
### 6.1 Against Baselines
```bash
# Compare current run against stored baseline
stellaops benchmark compare \
--baseline results/baseline/2025-Q4.json \
--current results/benchmark-all-20251214.json \
--threshold-p50 0.15 \
--threshold-precision 0.02 \
--fail-on-regression
# Output:
# Performance: PASS (P50 within 15% of baseline)
# Accuracy: PASS (Precision within 2% of baseline)
# Determinism: PASS (100% fidelity)
```
### 6.2 Against Other Tools
```bash
# Generate comparison report
stellaops benchmark compare-tools \
--stellaops results/stellaops/2025-12-14.json \
--trivy results/trivy/2025-12-14.json \
--grype results/grype/2025-12-14.json \
--output comparison-report.html
```
### 6.3 Historical Trends
```bash
# Generate trend report (last 12 months)
stellaops benchmark trend \
--period 12m \
--metrics precision,recall,p50_time \
--output trend-report.html
```
---
## 7. TROUBLESHOOTING
### 7.1 Common Issues
| Issue | Cause | Resolution |
|-------|-------|------------|
| Non-deterministic output | Locale not set | Set `LC_ALL=C` |
| Memory OOM | Large image | Increase memory limit |
| Slow performance | Cold cache | Pre-pull images |
| Missing components | Ecosystem not supported | Check supported ecosystems |
### 7.2 Debug Mode
```bash
# Enable verbose benchmark logging
make benchmark-all DEBUG=1
# Enable timing breakdown
export STELLAOPS_BENCHMARK_TIMING=1
make benchmark-performance
```
### 7.3 Validation Failures
```bash
# Check result schema validity
stellaops benchmark validate --file results/benchmark-all.json
# Check against ground truth
stellaops benchmark validate-ground-truth \
--results results/reachability.json \
--ground-truth datasets/reachability/v2025.12
```
---
## 8. REFERENCES
- [Performance Baselines](performance-baselines.md)
- [Accuracy Metrics Framework](accuracy-metrics-framework.md)
- [Offline Parity Verification](../airgap/offline-parity-verification.md)
- [Determinism CI Harness](../modules/scanner/design/determinism-ci-harness.md)
- [Ground Truth Datasets](../datasets/README.md)
---
**Document Version**: 1.0
**Target Platform**: .NET 10, PostgreSQL >=16