# Benchmark Submission Guide

**Last Updated:** 2025-12-14
**Next Review:** 2026-03-14

---

## Overview

StellaOps publishes benchmarks for:

- **Reachability Analysis** - Accuracy of static and runtime path detection
- **SBOM Completeness** - Component detection and version accuracy
- **Vulnerability Detection** - Precision, recall, and F1 scores
- **Scan Performance** - Time, memory, and CPU metrics
- **Determinism** - Reproducibility of scan outputs

This guide explains how to reproduce, validate, and submit benchmark results.

---
## 1. PREREQUISITES

### 1.1 System Requirements

| Requirement | Minimum | Recommended |
|-------------|---------|-------------|
| CPU | 4 cores | 8 cores |
| Memory | 8 GB | 16 GB |
| Storage | 50 GB SSD | 100 GB NVMe |
| OS | Ubuntu 22.04 LTS | Ubuntu 22.04 LTS |
| Docker | 24.x | 24.x |
| .NET | 10.0 | 10.0 |

### 1.2 Environment Setup

```bash
# Clone the repository
git clone https://git.stella-ops.org/stella-ops.org/git.stella-ops.org.git
cd git.stella-ops.org

# Install .NET 10 SDK
sudo apt-get update
sudo apt-get install -y dotnet-sdk-10.0

# Install Docker (if not present)
curl -fsSL https://get.docker.com | sh

# Install benchmark dependencies
sudo apt-get install -y \
  jq \
  b3sum \
  hyperfine \
  time

# Set determinism environment variables
export TZ=UTC
export LC_ALL=C
export STELLAOPS_DETERMINISM_SEED=42
export STELLAOPS_DETERMINISM_TIMESTAMP="2025-01-01T00:00:00Z"
```
### 1.3 Pull Reference Images

```bash
# Download standard benchmark images
make benchmark-pull-images

# Or manually:
docker pull alpine:3.19
docker pull debian:12-slim
docker pull ubuntu:22.04
docker pull node:20-alpine
docker pull python:3.12
docker pull mcr.microsoft.com/dotnet/aspnet:8.0
docker pull nginx:1.25
docker pull postgres:16-alpine
```

---

## 2. RUNNING BENCHMARKS

### 2.1 Full Benchmark Suite

```bash
# Run all benchmarks (takes ~30-60 minutes)
make benchmark-all

# Output: results/benchmark-all-$(date +%Y%m%d).json
```
### 2.2 Category-Specific Benchmarks

#### Reachability Benchmark

```bash
# Run reachability accuracy benchmarks
make benchmark-reachability

# With a specific language filter
make benchmark-reachability LANG=csharp

# Output: results/reachability/benchmark-reachability-$(date +%Y%m%d).json
```

#### Performance Benchmark

```bash
# Run scan performance benchmarks
make benchmark-performance

# Single image
make benchmark-image IMAGE=alpine:3.19

# Output: results/performance/benchmark-performance-$(date +%Y%m%d).json
```

#### SBOM Benchmark

```bash
# Run SBOM completeness benchmarks
make benchmark-sbom

# Specific format
make benchmark-sbom FORMAT=cyclonedx

# Output: results/sbom/benchmark-sbom-$(date +%Y%m%d).json
```

#### Determinism Benchmark

```bash
# Run determinism verification
make benchmark-determinism

# Output: results/determinism/benchmark-determinism-$(date +%Y%m%d).json
```

### 2.3 CLI Benchmark Commands

```bash
# Performance timing with hyperfine (10 runs)
hyperfine --warmup 2 --runs 10 \
  'stellaops scan --image alpine:3.19 --format json --output /dev/null'

# Memory profiling
/usr/bin/time -v stellaops scan --image alpine:3.19 --format json 2>&1 | \
  grep "Maximum resident set size"

# CPU profiling (Linux)
perf stat stellaops scan --image alpine:3.19 --format json > /dev/null

# Determinism check (run twice, compare hashes)
stellaops scan --image alpine:3.19 --format json | sha256sum > run1.sha
stellaops scan --image alpine:3.19 --format json | sha256sum > run2.sha
diff run1.sha run2.sha && echo "DETERMINISTIC" || echo "NON-DETERMINISTIC"
```

---
## 3. OUTPUT FORMATS

### 3.1 Reachability Results Schema

```json
{
  "benchmark": "reachability-v1",
  "date": "2025-12-14T00:00:00Z",
  "scanner_version": "1.3.0",
  "scanner_commit": "abc123def",
  "environment": {
    "os": "ubuntu-22.04",
    "arch": "amd64",
    "cpu": "Intel Xeon E-2288G",
    "memory_gb": 16
  },
  "summary": {
    "total_samples": 200,
    "precision": 0.92,
    "recall": 0.87,
    "f1": 0.894,
    "false_positive_rate": 0.08,
    "false_negative_rate": 0.13
  },
  "by_language": {
    "java": {
      "samples": 50,
      "precision": 0.94,
      "recall": 0.88,
      "f1": 0.909,
      "confusion_matrix": { "tp": 44, "fp": 3, "tn": 2, "fn": 1 }
    },
    "csharp": {
      "samples": 50,
      "precision": 0.91,
      "recall": 0.86,
      "f1": 0.884,
      "confusion_matrix": { "tp": 43, "fp": 4, "tn": 2, "fn": 1 }
    },
    "typescript": {
      "samples": 50,
      "precision": 0.89,
      "recall": 0.84,
      "f1": 0.864,
      "confusion_matrix": { "tp": 42, "fp": 5, "tn": 2, "fn": 1 }
    },
    "python": {
      "samples": 50,
      "precision": 0.88,
      "recall": 0.83,
      "f1": 0.854,
      "confusion_matrix": { "tp": 41, "fp": 5, "tn": 3, "fn": 1 }
    }
  },
  "ground_truth_ref": "datasets/reachability/v2025.12",
  "raw_results_ref": "results/reachability/raw/2025-12-14/"
}
```
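
As a quick consistency check, the reported `f1` should match the value recomputed from `precision` and `recall`. A minimal Python sketch using the summary values above (the dict literal stands in for a parsed results file):

```python
# Illustrative results fragment matching the summary schema above
results = {"summary": {"precision": 0.92, "recall": 0.87, "f1": 0.894}}

s = results["summary"]
p, r = s["precision"], s["recall"]
f1 = 2 * p * r / (p + r)

# Reported f1 should agree with the recomputed value, up to rounding
assert abs(f1 - s["f1"]) < 0.0005
print(round(f1, 3))  # 0.894
```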

### 3.2 Performance Results Schema

```json
{
  "benchmark": "performance-v1",
  "date": "2025-12-14T00:00:00Z",
  "scanner_version": "1.3.0",
  "scanner_commit": "abc123def",
  "environment": {
    "os": "ubuntu-22.04",
    "arch": "amd64",
    "cpu": "Intel Xeon E-2288G",
    "memory_gb": 16,
    "storage": "nvme"
  },
  "images": [
    {
      "image": "alpine:3.19",
      "size_mb": 7,
      "components": 15,
      "vulnerabilities": 5,
      "runs": 10,
      "cold_start": { "p50_ms": 2800, "p95_ms": 4200, "mean_ms": 3100 },
      "warm_cache": { "p50_ms": 1500, "p95_ms": 2100, "mean_ms": 1650 },
      "memory_peak_mb": 180,
      "cpu_time_ms": 1200
    },
    {
      "image": "python:3.12",
      "size_mb": 1024,
      "components": 300,
      "vulnerabilities": 150,
      "runs": 10,
      "cold_start": { "p50_ms": 32000, "p95_ms": 48000, "mean_ms": 35000 },
      "warm_cache": { "p50_ms": 18000, "p95_ms": 25000, "mean_ms": 19500 },
      "memory_peak_mb": 1100,
      "cpu_time_ms": 28000
    }
  ],
  "aggregated": {
    "total_images": 8,
    "total_runs": 80,
    "avg_time_per_mb_ms": 35,
    "avg_memory_per_component_kb": 400
  }
}
```
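
Derived aggregates such as `avg_time_per_mb_ms` come from normalizing per-image timings by image size. A sketch for a single entry from the example above; note that the published aggregate averages across all eight reference images, so any one image will differ (small images skew high because of fixed startup cost):

```python
# Per-image entry taken from the example schema above
entry = {"image": "alpine:3.19", "size_mb": 7,
         "cold_start": {"mean_ms": 3100}, "warm_cache": {"mean_ms": 1650}}

# Normalize mean cold-start scan time by image size (ms per MB)
cold_per_mb = entry["cold_start"]["mean_ms"] / entry["size_mb"]
print(round(cold_per_mb))  # 443
```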

### 3.3 SBOM Results Schema

```json
{
  "benchmark": "sbom-v1",
  "date": "2025-12-14T00:00:00Z",
  "scanner_version": "1.3.0",
  "summary": {
    "total_images": 8,
    "component_recall": 0.98,
    "component_precision": 0.995,
    "version_accuracy": 0.96
  },
  "by_ecosystem": {
    "apk": {
      "ground_truth_components": 100,
      "detected_components": 99,
      "correct_versions": 96,
      "recall": 0.99,
      "precision": 0.99,
      "version_accuracy": 0.96
    },
    "npm": {
      "ground_truth_components": 500,
      "detected_components": 492,
      "correct_versions": 475,
      "recall": 0.984,
      "precision": 0.998,
      "version_accuracy": 0.965
    }
  },
  "formats_tested": ["cyclonedx-1.6", "spdx-3.0.1"]
}
```
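
The per-ecosystem ratios follow directly from the count fields. A minimal sketch (assuming `detected_components` counts true detections, as the recall figures above imply):

```python
# "apk" entry from the example schema above
eco = {"ground_truth_components": 100, "detected_components": 99,
       "correct_versions": 96}

# Recall: share of ground-truth components that were detected
recall = eco["detected_components"] / eco["ground_truth_components"]
# Version accuracy: correct versions judged against the full ground truth
version_accuracy = eco["correct_versions"] / eco["ground_truth_components"]

print(recall, version_accuracy)  # 0.99 0.96
```

Precision cannot be derived from these fields alone, since it also depends on the number of spurious components reported.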

### 3.4 Determinism Results Schema

```json
{
  "benchmark": "determinism-v1",
  "date": "2025-12-14T00:00:00Z",
  "scanner_version": "1.3.0",
  "summary": {
    "total_runs": 100,
    "bitwise_identical": 100,
    "bitwise_fidelity": 1.0,
    "semantic_identical": 100,
    "semantic_fidelity": 1.0
  },
  "by_image": {
    "alpine:3.19": {
      "runs": 20,
      "bitwise_identical": 20,
      "output_hash": "sha256:abc123..."
    },
    "python:3.12": {
      "runs": 20,
      "bitwise_identical": 20,
      "output_hash": "sha256:def456..."
    }
  },
  "seed": 42,
  "timestamp_frozen": "2025-01-01T00:00:00Z"
}
```
---

## 4. SUBMISSION PROCESS

### 4.1 Internal Submission (StellaOps Team)

Benchmark results are automatically collected by CI:

```yaml
# .gitea/workflows/weekly-benchmark.yml triggers:
# - Weekly benchmark runs
# - Results stored in internal dashboard
# - Regression detection against baselines
```

Manual submission:

```bash
# Upload to internal dashboard
make benchmark-submit

# Or via CLI
stellaops benchmark submit \
  --file results/benchmark-all-20251214.json \
  --dashboard internal
```
### 4.2 External Validation Submission

Third parties can validate and submit benchmark results:

#### Step 1: Fork and Clone

```bash
# Fork the benchmark repository:
# https://git.stella-ops.org/stella-ops.org/benchmarks

git clone https://git.stella-ops.org/<your-org>/benchmarks.git
cd benchmarks
```

#### Step 2: Run Benchmarks

```bash
# With the StellaOps scanner
make benchmark-all SCANNER=stellaops

# Or with your own tool for comparison
make benchmark-all SCANNER=your-tool
```

#### Step 3: Prepare Submission

```bash
# Results directory structure
mkdir -p submissions/<your-org>/<date>

# Copy results
cp results/*.json submissions/<your-org>/<date>/

# Add a reproduction README
cat > submissions/<your-org>/<date>/README.md <<EOF
# Benchmark Results: <Your Org>

**Date:** $(date -u +%Y-%m-%d)
**Scanner:** <tool-name>
**Version:** <version>

## Environment
- OS: <os>
- CPU: <cpu>
- Memory: <memory>

## Reproduction Steps
<steps>

## Notes
<any observations>
EOF
```

#### Step 4: Submit Pull Request

```bash
git checkout -b benchmark-results-$(date +%Y%m%d)
git add submissions/
git commit -m "Add benchmark results from <your-org> $(date +%Y-%m-%d)"
git push origin benchmark-results-$(date +%Y%m%d)

# Create the PR via the web interface or the gh CLI
gh pr create --title "Benchmark: <your-org> $(date +%Y-%m-%d)" \
  --body "Benchmark results for external validation"
```
### 4.3 Submission Review Process

| Step | Action | Timeline |
|------|--------|----------|
| 1 | PR submitted | Day 0 |
| 2 | Automated validation runs | Day 0 (CI) |
| 3 | Maintainer review | Day 1-3 |
| 4 | Results published (if valid) | Day 3-5 |
| 5 | Dashboard updated | Day 5 |

---
## 5. BENCHMARK CATEGORIES

### 5.1 Reachability Benchmark

**Purpose:** Measure accuracy of static and runtime reachability analysis.

**Ground Truth Source:** `datasets/reachability/`

**Test Cases:**
- 50+ samples per language (Java, C#, TypeScript, Python, Go)
- Known-reachable vulnerable paths
- Known-unreachable vulnerable code
- Runtime-only reachable code

**Scoring:**
```
Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)
F1        = 2 * (Precision * Recall) / (Precision + Recall)
```
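
These formulas translate directly into a small helper (a generic sketch; the counts below are illustrative, not taken from the benchmark tables):

```python
def score(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts: 90 true positives, 10 false positives, 10 false negatives
print(tuple(round(x, 3) for x in score(90, 10, 10)))  # (0.9, 0.9, 0.9)
```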

**Targets:**

| Metric | Target | Blocking |
|--------|--------|----------|
| Precision | >= 90% | >= 85% |
| Recall | >= 85% | >= 80% |
| F1 | >= 87% | >= 82% |

### 5.2 Performance Benchmark

**Purpose:** Measure scan time, memory usage, and CPU utilization.

**Reference Images:** See [Performance Baselines](performance-baselines.md)

**Metrics:**
- P50/P95 scan time (cold and warm)
- Peak memory usage
- CPU time
- Throughput (images/minute)

**Targets:**

| Image Category | P50 Time | P95 Time | Max Memory |
|----------------|----------|----------|------------|
| Minimal (<100MB) | < 5s | < 10s | < 256MB |
| Standard (100-500MB) | < 15s | < 30s | < 512MB |
| Large (500MB-2GB) | < 45s | < 90s | < 1.5GB |
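
A local gate against these targets can be sketched as follows (thresholds copied from the table; the sample result is hypothetical):

```python
# (max size MB, p50 ms, p95 ms, peak memory MB) per the targets table
TARGETS = [(100, 5_000, 10_000, 256),
           (500, 15_000, 30_000, 512),
           (2_048, 45_000, 90_000, 1_536)]

def meets_targets(size_mb, p50_ms, p95_ms, mem_mb):
    """Check a scan result against the target row for its size category."""
    for max_size, t50, t95, tmem in TARGETS:
        if size_mb < max_size:
            return p50_ms < t50 and p95_ms < t95 and mem_mb < tmem
    return True  # images over 2 GB have no published target

# Hypothetical alpine-sized result: 7 MB image, p50 2.8 s, p95 4.2 s, 180 MB peak
print(meets_targets(7, 2_800, 4_200, 180))  # True
```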

### 5.3 SBOM Benchmark

**Purpose:** Measure component detection completeness and accuracy.

**Ground Truth Source:** Manual SBOM audits of reference images.

**Metrics:**
- Component recall (found / total)
- Component precision (real / reported)
- Version accuracy (correct / total)

**Targets:**

| Metric | Target |
|--------|--------|
| Component Recall | >= 98% |
| Component Precision | >= 99% |
| Version Accuracy | >= 95% |
### 5.4 Vulnerability Detection Benchmark

**Purpose:** Measure CVE detection accuracy against known-vulnerable images.

**Ground Truth Source:** `datasets/vulns/` curated CVE lists.

**Metrics:**
- True positive rate
- False positive rate
- False negative rate
- Precision/Recall/F1

**Targets:**

| Metric | Target |
|--------|--------|
| Precision | >= 95% |
| Recall | >= 90% |
| F1 | >= 92% |
### 5.5 Determinism Benchmark

**Purpose:** Verify reproducible scan outputs.

**Methodology:**
1. Run the same scan N times (default: 20)
2. Compare output hashes
3. Calculate bitwise fidelity

**Targets:**

| Metric | Target |
|--------|--------|
| Bitwise Fidelity | 100% |
| Semantic Fidelity | 100% |
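
Step 3 amounts to counting how many runs reproduce the most common output hash. A minimal sketch (the hash values are hypothetical):

```python
from collections import Counter

def bitwise_fidelity(output_hashes):
    """Fraction of runs whose output hash equals the modal hash."""
    if not output_hashes:
        return 0.0
    modal_count = Counter(output_hashes).most_common(1)[0][1]
    return modal_count / len(output_hashes)

# Hypothetical: 20 runs, one diverging output
hashes = ["sha256:abc"] * 19 + ["sha256:xyz"]
print(bitwise_fidelity(hashes))  # 0.95
```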

---

## 6. COMPARING RESULTS

### 6.1 Against Baselines

```bash
# Compare the current run against a stored baseline
stellaops benchmark compare \
  --baseline results/baseline/2025-Q4.json \
  --current results/benchmark-all-20251214.json \
  --threshold-p50 0.15 \
  --threshold-precision 0.02 \
  --fail-on-regression

# Output:
# Performance: PASS (P50 within 15% of baseline)
# Accuracy: PASS (Precision within 2% of baseline)
# Determinism: PASS (100% fidelity)
```
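
The threshold flags express relative regression limits; a sketch of the `--threshold-p50 0.15` semantics (illustrative only, the actual CLI logic may differ):

```python
def p50_regressed(baseline_ms: float, current_ms: float,
                  threshold: float = 0.15) -> bool:
    """True if the current P50 exceeds the baseline by more than the threshold."""
    return current_ms > baseline_ms * (1 + threshold)

baseline = 1_500                          # warm-cache P50, illustrative
print(p50_regressed(baseline, 1_650))     # False: within 15% of baseline
print(p50_regressed(baseline, 1_800))     # True: more than 15% over baseline
```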

### 6.2 Against Other Tools

```bash
# Generate a comparison report
stellaops benchmark compare-tools \
  --stellaops results/stellaops/2025-12-14.json \
  --trivy results/trivy/2025-12-14.json \
  --grype results/grype/2025-12-14.json \
  --output comparison-report.html
```

### 6.3 Historical Trends

```bash
# Generate a trend report (last 12 months)
stellaops benchmark trend \
  --period 12m \
  --metrics precision,recall,p50_time \
  --output trend-report.html
```
---

## 7. TROUBLESHOOTING

### 7.1 Common Issues

| Issue | Cause | Resolution |
|-------|-------|------------|
| Non-deterministic output | Locale not set | Set `LC_ALL=C` |
| Memory OOM | Large image | Increase memory limit |
| Slow performance | Cold cache | Pre-pull images |
| Missing components | Ecosystem not supported | Check supported ecosystems |

### 7.2 Debug Mode

```bash
# Enable verbose benchmark logging
make benchmark-all DEBUG=1

# Enable timing breakdown
export STELLAOPS_BENCHMARK_TIMING=1
make benchmark-performance
```
### 7.3 Validation Failures

```bash
# Check result schema validity
stellaops benchmark validate --file results/benchmark-all.json

# Check against ground truth
stellaops benchmark validate-ground-truth \
  --results results/reachability.json \
  --ground-truth datasets/reachability/v2025.12
```

---

## 8. REFERENCES

- [Performance Baselines](performance-baselines.md)
- [Accuracy Metrics Framework](accuracy-metrics-framework.md)
- [Offline Parity Verification](../airgap/offline-parity-verification.md)
- [Determinism CI Harness](../modules/scanner/design/determinism-ci-harness.md)
- [Ground Truth Datasets](../datasets/README.md)

---

**Document Version:** 1.0
**Target Platform:** .NET 10, PostgreSQL >= 16