# Benchmark Submission Guide

**Last Updated:** 2025-12-14
**Next Review:** 2026-03-14

---

## Overview

StellaOps publishes benchmarks for:

- **Reachability Analysis** - Accuracy of static and runtime path detection
- **SBOM Completeness** - Component detection and version accuracy
- **Vulnerability Detection** - Precision, recall, and F1 scores
- **Scan Performance** - Time, memory, and CPU metrics
- **Determinism** - Reproducibility of scan outputs

This guide explains how to reproduce, validate, and submit benchmark results.

---

## 1. PREREQUISITES

### 1.1 System Requirements

| Requirement | Minimum | Recommended |
|-------------|---------|-------------|
| CPU | 4 cores | 8 cores |
| Memory | 8 GB | 16 GB |
| Storage | 50 GB SSD | 100 GB NVMe |
| OS | Ubuntu 22.04 LTS | Ubuntu 22.04 LTS |
| Docker | 24.x | 24.x |
| .NET | 10.0 | 10.0 |

### 1.2 Environment Setup

```bash
# Clone the repository
git clone https://git.stella-ops.org/stella-ops.org/git.stella-ops.org.git
cd git.stella-ops.org

# Install .NET 10 SDK
sudo apt-get update
sudo apt-get install -y dotnet-sdk-10.0

# Install Docker (if not present)
curl -fsSL https://get.docker.com | sh

# Install benchmark dependencies
sudo apt-get install -y \
  jq \
  b3sum \
  hyperfine \
  time

# Set determinism environment variables
export TZ=UTC
export LC_ALL=C
export STELLAOPS_DETERMINISM_SEED=42
export STELLAOPS_DETERMINISM_TIMESTAMP="2025-01-01T00:00:00Z"
```

### 1.3 Pull Reference Images

```bash
# Download standard benchmark images
make benchmark-pull-images

# Or manually:
docker pull alpine:3.19
docker pull debian:12-slim
docker pull ubuntu:22.04
docker pull node:20-alpine
docker pull python:3.12
docker pull mcr.microsoft.com/dotnet/aspnet:8.0
docker pull nginx:1.25
docker pull postgres:16-alpine
```

---

## 2. RUNNING BENCHMARKS

### 2.1 Full Benchmark Suite

```bash
# Run all benchmarks (takes ~30-60 minutes)
make benchmark-all

# Output: results/benchmark-all-$(date +%Y%m%d).json
```

### 2.2 Category-Specific Benchmarks

#### Reachability Benchmark

```bash
# Run reachability accuracy benchmarks
make benchmark-reachability

# With specific language filter
make benchmark-reachability LANG=csharp

# Output: results/reachability/benchmark-reachability-$(date +%Y%m%d).json
```

#### Performance Benchmark

```bash
# Run scan performance benchmarks
make benchmark-performance

# Single image
make benchmark-image IMAGE=alpine:3.19

# Output: results/performance/benchmark-performance-$(date +%Y%m%d).json
```

#### SBOM Benchmark

```bash
# Run SBOM completeness benchmarks
make benchmark-sbom

# Specific format
make benchmark-sbom FORMAT=cyclonedx

# Output: results/sbom/benchmark-sbom-$(date +%Y%m%d).json
```

#### Determinism Benchmark

```bash
# Run determinism verification
make benchmark-determinism

# Output: results/determinism/benchmark-determinism-$(date +%Y%m%d).json
```

### 2.3 CLI Benchmark Commands

```bash
# Performance timing with hyperfine (10 runs)
hyperfine --warmup 2 --runs 10 \
  'stellaops scan --image alpine:3.19 --format json --output /dev/null'

# Memory profiling
/usr/bin/time -v stellaops scan --image alpine:3.19 --format json 2>&1 | \
  grep "Maximum resident set size"

# CPU profiling (Linux)
perf stat stellaops scan --image alpine:3.19 --format json > /dev/null

# Determinism check (run twice, compare hashes)
stellaops scan --image alpine:3.19 --format json | sha256sum > run1.sha
stellaops scan --image alpine:3.19 --format json | sha256sum > run2.sha
diff run1.sha run2.sha && echo "DETERMINISTIC" || echo "NON-DETERMINISTIC"
```
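The two-run hash comparison above is only a spot check. To approximate the full determinism benchmark (see section 5.5) from the CLI, repeat the scan N times and compute bitwise fidelity as the fraction of runs that produce the most common output hash. The loop below is a minimal sketch, assuming `stellaops scan` is on `PATH` and the environment variables from section 1.2 are exported; it is not one of the supported Make targets.

```bash
# Minimal N-run determinism sketch (illustrative; not a Make target).
RUNS=${RUNS:-20}
IMAGE=${IMAGE:-alpine:3.19}

# Hash every run's output and count how often each distinct hash appears
for i in $(seq 1 "$RUNS"); do
  stellaops scan --image "$IMAGE" --format json | sha256sum | cut -d' ' -f1
done | sort | uniq -c | sort -rn > hashes.txt

# Bitwise fidelity = runs producing the most common hash / total runs
MODE=$(head -n1 hashes.txt | awk '{print $1}')
echo "distinct hashes: $(wc -l < hashes.txt)"
awk -v m="$MODE" -v r="$RUNS" 'BEGIN { printf "bitwise fidelity: %.2f\n", m / r }'
```

A fully deterministic scanner yields one distinct hash and a fidelity of 1.00, matching the `bitwise_fidelity` field in the section 3.4 schema.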
---

## 3. OUTPUT FORMATS

### 3.1 Reachability Results Schema

```json
{
  "benchmark": "reachability-v1",
  "date": "2025-12-14T00:00:00Z",
  "scanner_version": "1.3.0",
  "scanner_commit": "abc123def",
  "environment": {
    "os": "ubuntu-22.04",
    "arch": "amd64",
    "cpu": "Intel Xeon E-2288G",
    "memory_gb": 16
  },
  "summary": {
    "total_samples": 200,
    "precision": 0.92,
    "recall": 0.87,
    "f1": 0.894,
    "false_positive_rate": 0.08,
    "false_negative_rate": 0.13
  },
  "by_language": {
    "java": {
      "samples": 50,
      "precision": 0.94,
      "recall": 0.88,
      "f1": 0.909,
      "confusion_matrix": { "tp": 44, "fp": 3, "tn": 2, "fn": 1 }
    },
    "csharp": {
      "samples": 50,
      "precision": 0.91,
      "recall": 0.86,
      "f1": 0.884,
      "confusion_matrix": { "tp": 43, "fp": 4, "tn": 2, "fn": 1 }
    },
    "typescript": {
      "samples": 50,
      "precision": 0.89,
      "recall": 0.84,
      "f1": 0.864,
      "confusion_matrix": { "tp": 42, "fp": 5, "tn": 2, "fn": 1 }
    },
    "python": {
      "samples": 50,
      "precision": 0.88,
      "recall": 0.83,
      "f1": 0.854,
      "confusion_matrix": { "tp": 41, "fp": 5, "tn": 3, "fn": 1 }
    }
  },
  "ground_truth_ref": "datasets/reachability/v2025.12",
  "raw_results_ref": "results/reachability/raw/2025-12-14/"
}
```

### 3.2 Performance Results Schema

```json
{
  "benchmark": "performance-v1",
  "date": "2025-12-14T00:00:00Z",
  "scanner_version": "1.3.0",
  "scanner_commit": "abc123def",
  "environment": {
    "os": "ubuntu-22.04",
    "arch": "amd64",
    "cpu": "Intel Xeon E-2288G",
    "memory_gb": 16,
    "storage": "nvme"
  },
  "images": [
    {
      "image": "alpine:3.19",
      "size_mb": 7,
      "components": 15,
      "vulnerabilities": 5,
      "runs": 10,
      "cold_start": { "p50_ms": 2800, "p95_ms": 4200, "mean_ms": 3100 },
      "warm_cache": { "p50_ms": 1500, "p95_ms": 2100, "mean_ms": 1650 },
      "memory_peak_mb": 180,
      "cpu_time_ms": 1200
    },
    {
      "image": "python:3.12",
      "size_mb": 1024,
      "components": 300,
      "vulnerabilities": 150,
      "runs": 10,
      "cold_start": { "p50_ms": 32000, "p95_ms": 48000, "mean_ms": 35000 },
      "warm_cache": { "p50_ms": 18000, "p95_ms": 25000, "mean_ms": 19500 },
      "memory_peak_mb": 1100,
      "cpu_time_ms": 28000
    }
  ],
  "aggregated": {
    "total_images": 8,
    "total_runs": 80,
    "avg_time_per_mb_ms": 35,
    "avg_memory_per_component_kb": 400
  }
}
```

### 3.3 SBOM Results Schema

```json
{
  "benchmark": "sbom-v1",
  "date": "2025-12-14T00:00:00Z",
  "scanner_version": "1.3.0",
  "summary": {
    "total_images": 8,
    "component_recall": 0.98,
    "component_precision": 0.995,
    "version_accuracy": 0.96
  },
  "by_ecosystem": {
    "apk": {
      "ground_truth_components": 100,
      "detected_components": 99,
      "correct_versions": 96,
      "recall": 0.99,
      "precision": 0.99,
      "version_accuracy": 0.96
    },
    "npm": {
      "ground_truth_components": 500,
      "detected_components": 492,
      "correct_versions": 475,
      "recall": 0.984,
      "precision": 0.998,
      "version_accuracy": 0.965
    }
  },
  "formats_tested": ["cyclonedx-1.6", "spdx-3.0.1"]
}
```

### 3.4 Determinism Results Schema

```json
{
  "benchmark": "determinism-v1",
  "date": "2025-12-14T00:00:00Z",
  "scanner_version": "1.3.0",
  "summary": {
    "total_runs": 100,
    "bitwise_identical": 100,
    "bitwise_fidelity": 1.0,
    "semantic_identical": 100,
    "semantic_fidelity": 1.0
  },
  "by_image": {
    "alpine:3.19": {
      "runs": 20,
      "bitwise_identical": 20,
      "output_hash": "sha256:abc123..."
    },
    "python:3.12": {
      "runs": 20,
      "bitwise_identical": 20,
      "output_hash": "sha256:def456..."
    }
  },
  "seed": 42,
  "timestamp_frozen": "2025-01-01T00:00:00Z"
}
```
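Before submitting, it is worth spot-checking a results file against these schemas. The `jq` snippet below is a sketch that recomputes F1 from the reported precision and recall in a reachability results file (field names per section 3.1) and flags any mismatch; the authoritative check remains `stellaops benchmark validate` (section 7.3).

```bash
# Sanity-check a reachability results file: recompute F1 from the
# reported precision/recall and compare with the stored value.
# Note: `length` applied to a number in jq returns its absolute value.
jq -r '
  .summary as $s
  | (2 * $s.precision * $s.recall / ($s.precision + $s.recall)) as $f1
  | if (($f1 - $s.f1) | length) < 0.005
    then "F1 consistent: \($s.f1)"
    else "F1 MISMATCH: reported \($s.f1), recomputed \($f1)"
    end
' results/reachability/benchmark-reachability-20251214.json
```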
---

## 4. SUBMISSION PROCESS

### 4.1 Internal Submission (StellaOps Team)

Benchmark results are automatically collected by CI:

```yaml
# .gitea/workflows/weekly-benchmark.yml
triggers:
  # - Weekly benchmark runs
  # - Results stored in internal dashboard
  # - Regression detection against baselines
```

Manual submission:

```bash
# Upload to internal dashboard
make benchmark-submit

# Or via CLI
stellaops benchmark submit \
  --file results/benchmark-all-20251214.json \
  --dashboard internal
```

### 4.2 External Validation Submission

Third parties can validate and submit benchmark results:

#### Step 1: Fork and Clone

```bash
# Fork the benchmark repository
# https://git.stella-ops.org/stella-ops.org/benchmarks
git clone https://git.stella-ops.org/<your-org>/benchmarks.git
cd benchmarks
```

#### Step 2: Run Benchmarks

```bash
# With StellaOps scanner
make benchmark-all SCANNER=stellaops

# Or with your own tool for comparison
make benchmark-all SCANNER=your-tool
```

#### Step 3: Prepare Submission

```bash
# Results directory structure
mkdir -p submissions/<your-org>/<date>/

# Copy results
cp results/*.json submissions/<your-org>/<date>/

# Add reproduction README
cat > submissions/<your-org>/<date>/README.md <<EOF
**Date:** $(date -u +%Y-%m-%d)
**Scanner:**
**Version:**

## Environment

- OS:
- CPU:
- Memory:

## Reproduction Steps

## Notes

EOF
```

#### Step 4: Submit Pull Request

```bash
git checkout -b benchmark-results-$(date +%Y%m%d)
git add submissions/
git commit -m "Add benchmark results from $(date +%Y-%m-%d)"
git push origin benchmark-results-$(date +%Y%m%d)

# Create PR via web interface or gh CLI
gh pr create --title "Benchmark: $(date +%Y-%m-%d)" \
  --body "Benchmark results for external validation"
```

### 4.3 Submission Review Process

| Step | Action | Timeline |
|------|--------|----------|
| 1 | PR submitted | Day 0 |
| 2 | Automated validation runs | Day 0 (CI) |
| 3 | Maintainer review | Day 1-3 |
| 4 | Results published (if valid) | Day 3-5 |
| 5 | Dashboard updated | Day 5 |

---

## 5. BENCHMARK CATEGORIES

### 5.1 Reachability Benchmark

**Purpose:** Measure accuracy of static and runtime reachability analysis.

**Ground Truth Source:** `datasets/reachability/`

**Test Cases:**

- 50+ samples per language (Java, C#, TypeScript, Python, Go)
- Known-reachable vulnerable paths
- Known-unreachable vulnerable code
- Runtime-only reachable code

**Scoring:**

```
Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)
F1        = 2 * (Precision * Recall) / (Precision + Recall)
```

**Targets:**

| Metric | Target | Blocking Threshold |
|--------|--------|--------------------|
| Precision | >= 90% | >= 85% |
| Recall | >= 85% | >= 80% |
| F1 | >= 87% | >= 82% |
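As a worked example of the scoring formulas above, the hypothetical shell helper below computes all three metrics from raw confusion-matrix counts. The counts in the usage line are illustrative only, not official results.

```bash
# Hypothetical scoring helper for the formulas above.
# Usage: score <tp> <fp> <fn>
score() {
  awk -v tp="$1" -v fp="$2" -v fn="$3" 'BEGIN {
    p  = tp / (tp + fp)         # Precision = TP / (TP + FP)
    r  = tp / (tp + fn)         # Recall    = TP / (TP + FN)
    f1 = 2 * p * r / (p + r)    # Harmonic mean of precision and recall
    printf "precision=%.3f recall=%.3f f1=%.3f\n", p, r, f1
  }'
}

score 44 3 6   # precision=0.936 recall=0.880 f1=0.907
```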
### 5.2 Performance Benchmark

**Purpose:** Measure scan time, memory usage, and CPU utilization.

**Reference Images:** See [Performance Baselines](performance-baselines.md)

**Metrics:**

- P50/P95 scan time (cold and warm)
- Peak memory usage
- CPU time
- Throughput (images/minute)

**Targets:**

| Image Category | P50 Time | P95 Time | Max Memory |
|----------------|----------|----------|------------|
| Minimal (<100MB) | < 5s | < 10s | < 256MB |
| Standard (100-500MB) | < 15s | < 30s | < 512MB |
| Large (500MB-2GB) | < 45s | < 90s | < 1.5GB |

### 5.3 SBOM Benchmark

**Purpose:** Measure component detection completeness and accuracy.

**Ground Truth Source:** Manual SBOM audits of reference images.

**Metrics:**

- Component recall (found / total)
- Component precision (real / reported)
- Version accuracy (correct / total)

**Targets:**

| Metric | Target |
|--------|--------|
| Component Recall | >= 98% |
| Component Precision | >= 99% |
| Version Accuracy | >= 95% |

### 5.4 Vulnerability Detection Benchmark

**Purpose:** Measure CVE detection accuracy against known-vulnerable images.

**Ground Truth Source:** `datasets/vulns/` curated CVE lists.

**Metrics:**

- True positive rate
- False positive rate
- False negative rate
- Precision/Recall/F1

**Targets:**

| Metric | Target |
|--------|--------|
| Precision | >= 95% |
| Recall | >= 90% |
| F1 | >= 92% |

### 5.5 Determinism Benchmark

**Purpose:** Verify reproducible scan outputs.

**Methodology:**

1. Run the same scan N times (default: 20)
2. Compare output hashes
3. Calculate bitwise fidelity

**Targets:**

| Metric | Target |
|--------|--------|
| Bitwise Fidelity | 100% |
| Semantic Fidelity | 100% |

---

## 6. COMPARING RESULTS

### 6.1 Against Baselines

```bash
# Compare current run against stored baseline
stellaops benchmark compare \
  --baseline results/baseline/2025-Q4.json \
  --current results/benchmark-all-20251214.json \
  --threshold-p50 0.15 \
  --threshold-precision 0.02 \
  --fail-on-regression

# Output:
# Performance: PASS (P50 within 15% of baseline)
# Accuracy:    PASS (Precision within 2% of baseline)
# Determinism: PASS (100% fidelity)
```

### 6.2 Against Other Tools

```bash
# Generate comparison report
stellaops benchmark compare-tools \
  --stellaops results/stellaops/2025-12-14.json \
  --trivy results/trivy/2025-12-14.json \
  --grype results/grype/2025-12-14.json \
  --output comparison-report.html
```

### 6.3 Historical Trends

```bash
# Generate trend report (last 12 months)
stellaops benchmark trend \
  --period 12m \
  --metrics precision,recall,p50_time \
  --output trend-report.html
```

---

## 7. TROUBLESHOOTING

### 7.1 Common Issues

| Issue | Cause | Resolution |
|-------|-------|------------|
| Non-deterministic output | Locale not set | Set `LC_ALL=C` |
| Out-of-memory (OOM) kill | Large image | Increase memory limit |
| Slow performance | Cold cache | Pre-pull images |
| Missing components | Ecosystem not supported | Check supported ecosystems |

### 7.2 Debug Mode

```bash
# Enable verbose benchmark logging
make benchmark-all DEBUG=1

# Enable timing breakdown
export STELLAOPS_BENCHMARK_TIMING=1
make benchmark-performance
```

### 7.3 Validation Failures

```bash
# Check result schema validity
stellaops benchmark validate --file results/benchmark-all.json

# Check against ground truth
stellaops benchmark validate-ground-truth \
  --results results/reachability.json \
  --ground-truth datasets/reachability/v2025.12
```

---

## 8. REFERENCES

- [Performance Baselines](performance-baselines.md)
- [Accuracy Metrics Framework](accuracy-metrics-framework.md)
- [Offline Parity Verification](../airgap/offline-parity-verification.md)
- [Determinism CI Harness](../modules/scanner/design/determinism-ci-harness.md)
- [Ground Truth Datasets](../datasets/README.md)

---

**Document Version:** 1.0
**Target Platform:** .NET 10, PostgreSQL >= 16