Benchmark Submission Guide

Last Updated: 2025-12-14 | Next Review: 2026-03-14


Overview

StellaOps publishes benchmarks for:

  • Reachability Analysis - Accuracy of static and runtime path detection
  • SBOM Completeness - Component detection and version accuracy
  • Vulnerability Detection - Precision, recall, and F1 scores
  • Scan Performance - Time, memory, and CPU metrics
  • Determinism - Reproducibility of scan outputs

This guide explains how to reproduce, validate, and submit benchmark results.


1. PREREQUISITES

1.1 System Requirements

| Requirement | Minimum | Recommended |
|-------------|---------|-------------|
| CPU | 4 cores | 8 cores |
| Memory | 8 GB | 16 GB |
| Storage | 50 GB SSD | 100 GB NVMe |
| OS | Ubuntu 22.04 LTS | Ubuntu 22.04 LTS |
| Docker | 24.x | 24.x |
| .NET | 10.0 | 10.0 |

1.2 Environment Setup

# Clone the repository
git clone https://git.stella-ops.org/stella-ops.org/git.stella-ops.org.git
cd git.stella-ops.org

# Install .NET 10 SDK
sudo apt-get update
sudo apt-get install -y dotnet-sdk-10.0

# Install Docker (if not present)
curl -fsSL https://get.docker.com | sh

# Install benchmark dependencies
sudo apt-get install -y \
    jq \
    b3sum \
    hyperfine \
    time

# Set determinism environment variables
export TZ=UTC
export LC_ALL=C
export STELLAOPS_DETERMINISM_SEED=42
export STELLAOPS_DETERMINISM_TIMESTAMP="2025-01-01T00:00:00Z"
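
Before pulling images, it is worth a quick sanity check that the toolchain and determinism variables are actually in place. A minimal sketch using only standard tools (not part of the official Make targets):

# Verify toolchain versions
dotnet --version        # expect 10.0.x
docker --version        # expect 24.x
jq --version && hyperfine --version

# Confirm the determinism variables are exported
env | grep -E '^(TZ|LC_ALL|STELLAOPS_DETERMINISM_(SEED|TIMESTAMP))='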

1.3 Pull Reference Images

# Download standard benchmark images
make benchmark-pull-images

# Or manually:
docker pull alpine:3.19
docker pull debian:12-slim
docker pull ubuntu:22.04
docker pull node:20-alpine
docker pull python:3.12
docker pull mcr.microsoft.com/dotnet/aspnet:8.0
docker pull nginx:1.25
docker pull postgres:16-alpine
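
Tags are mutable, so runs are only comparable when everyone scans the same bytes. A small sketch for recording what was actually pulled (standard Docker CLI; the results/ path is just a suggestion):

# Record the exact digests of the pulled reference images
mkdir -p results
docker images --digests --format '{{.Repository}}:{{.Tag}} {{.Digest}}' > results/image-digests.txt

# A later rerun can then pull by digest instead of tag, e.g.:
# docker pull alpine@sha256:<digest-from-the-file>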

2. RUNNING BENCHMARKS

2.1 Full Benchmark Suite

# Run all benchmarks (takes ~30-60 minutes)
make benchmark-all

# Output: results/benchmark-all-$(date +%Y%m%d).json

2.2 Category-Specific Benchmarks

Reachability Benchmark

# Run reachability accuracy benchmarks
make benchmark-reachability

# With specific language filter
make benchmark-reachability LANG=csharp

# Output: results/reachability/benchmark-reachability-$(date +%Y%m%d).json

Performance Benchmark

# Run scan performance benchmarks
make benchmark-performance

# Single image
make benchmark-image IMAGE=alpine:3.19

# Output: results/performance/benchmark-performance-$(date +%Y%m%d).json

SBOM Benchmark

# Run SBOM completeness benchmarks
make benchmark-sbom

# Specific format
make benchmark-sbom FORMAT=cyclonedx

# Output: results/sbom/benchmark-sbom-$(date +%Y%m%d).json

Determinism Benchmark

# Run determinism verification
make benchmark-determinism

# Output: results/determinism/benchmark-determinism-$(date +%Y%m%d).json

2.3 CLI Benchmark Commands

# Performance timing with hyperfine (10 runs)
hyperfine --warmup 2 --runs 10 \
  'stellaops scan --image alpine:3.19 --format json --output /dev/null'

# Memory profiling
/usr/bin/time -v stellaops scan --image alpine:3.19 --format json 2>&1 | \
  grep "Maximum resident set size"

# CPU profiling (Linux)
perf stat stellaops scan --image alpine:3.19 --format json > /dev/null

# Determinism check (run twice, compare hashes)
stellaops scan --image alpine:3.19 --format json | sha256sum > run1.sha
stellaops scan --image alpine:3.19 --format json | sha256sum > run2.sha
diff run1.sha run2.sha && echo "DETERMINISTIC" || echo "NON-DETERMINISTIC"
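
The sha256 comparison above tests bitwise identity. For the semantic fidelity reported in Section 3.4, one option is to normalize the JSON before hashing; a sketch using jq key sorting (the scanner's internal normalization may differ):

# Semantic check: sort object keys so key order alone cannot cause a mismatch
# (note: jq -S sorts keys only; array ordering still matters)
stellaops scan --image alpine:3.19 --format json | jq -S . | sha256sum > run1.sem.sha
stellaops scan --image alpine:3.19 --format json | jq -S . | sha256sum > run2.sem.sha
diff run1.sem.sha run2.sem.sha && echo "SEMANTICALLY IDENTICAL"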

3. OUTPUT FORMATS

3.1 Reachability Results Schema

{
  "benchmark": "reachability-v1",
  "date": "2025-12-14T00:00:00Z",
  "scanner_version": "1.3.0",
  "scanner_commit": "abc123def",
  "environment": {
    "os": "ubuntu-22.04",
    "arch": "amd64",
    "cpu": "Intel Xeon E-2288G",
    "memory_gb": 16
  },
  "summary": {
    "total_samples": 200,
    "precision": 0.92,
    "recall": 0.87,
    "f1": 0.894,
    "false_positive_rate": 0.08,
    "false_negative_rate": 0.13
  },
  "by_language": {
    "java": {
      "samples": 50,
      "precision": 0.94,
      "recall": 0.88,
      "f1": 0.909,
      "confusion_matrix": {
        "tp": 44, "fp": 3, "tn": 2, "fn": 1
      }
    },
    "csharp": {
      "samples": 50,
      "precision": 0.91,
      "recall": 0.86,
      "f1": 0.884,
      "confusion_matrix": {
        "tp": 43, "fp": 4, "tn": 2, "fn": 1
      }
    },
    "typescript": {
      "samples": 50,
      "precision": 0.89,
      "recall": 0.84,
      "f1": 0.864,
      "confusion_matrix": {
        "tp": 42, "fp": 5, "tn": 2, "fn": 1
      }
    },
    "python": {
      "samples": 50,
      "precision": 0.88,
      "recall": 0.83,
      "f1": 0.854,
      "confusion_matrix": {
        "tp": 41, "fp": 5, "tn": 3, "fn": 1
      }
    }
  },
  "ground_truth_ref": "datasets/reachability/v2025.12",
  "raw_results_ref": "results/reachability/raw/2025-12-14/"
}
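
Given a file in this shape, headline and per-language metrics can be pulled out directly with jq (the filename follows the pattern from Section 2.2):

# Headline metrics
jq '{precision: .summary.precision, recall: .summary.recall, f1: .summary.f1}' \
  results/reachability/benchmark-reachability-20251214.json

# Per-language F1 scores
jq -r '.by_language | to_entries[] | "\(.key): f1=\(.value.f1)"' \
  results/reachability/benchmark-reachability-20251214.json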

3.2 Performance Results Schema

{
  "benchmark": "performance-v1",
  "date": "2025-12-14T00:00:00Z",
  "scanner_version": "1.3.0",
  "scanner_commit": "abc123def",
  "environment": {
    "os": "ubuntu-22.04",
    "arch": "amd64",
    "cpu": "Intel Xeon E-2288G",
    "memory_gb": 16,
    "storage": "nvme"
  },
  "images": [
    {
      "image": "alpine:3.19",
      "size_mb": 7,
      "components": 15,
      "vulnerabilities": 5,
      "runs": 10,
      "cold_start": {
        "p50_ms": 2800,
        "p95_ms": 4200,
        "mean_ms": 3100
      },
      "warm_cache": {
        "p50_ms": 1500,
        "p95_ms": 2100,
        "mean_ms": 1650
      },
      "memory_peak_mb": 180,
      "cpu_time_ms": 1200
    },
    {
      "image": "python:3.12",
      "size_mb": 1024,
      "components": 300,
      "vulnerabilities": 150,
      "runs": 10,
      "cold_start": {
        "p50_ms": 32000,
        "p95_ms": 48000,
        "mean_ms": 35000
      },
      "warm_cache": {
        "p50_ms": 18000,
        "p95_ms": 25000,
        "mean_ms": 19500
      },
      "memory_peak_mb": 1100,
      "cpu_time_ms": 28000
    }
  ],
  "aggregated": {
    "total_images": 8,
    "total_runs": 80,
    "avg_time_per_mb_ms": 35,
    "avg_memory_per_component_kb": 400
  }
}
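
A quick way to tabulate cold versus warm latency per image from a file in this format:

# Columns: image, cold P50 (ms), warm P50 (ms)
jq -r '.images[] | [.image, .cold_start.p50_ms, .warm_cache.p50_ms] | @tsv' \
  results/performance/benchmark-performance-20251214.json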

3.3 SBOM Results Schema

{
  "benchmark": "sbom-v1",
  "date": "2025-12-14T00:00:00Z",
  "scanner_version": "1.3.0",
  "summary": {
    "total_images": 8,
    "component_recall": 0.98,
    "component_precision": 0.995,
    "version_accuracy": 0.96
  },
  "by_ecosystem": {
    "apk": {
      "ground_truth_components": 100,
      "detected_components": 99,
      "correct_versions": 96,
      "recall": 0.99,
      "precision": 0.99,
      "version_accuracy": 0.96
    },
    "npm": {
      "ground_truth_components": 500,
      "detected_components": 492,
      "correct_versions": 475,
      "recall": 0.984,
      "precision": 0.998,
      "version_accuracy": 0.965
    }
  },
  "formats_tested": ["cyclonedx-1.6", "spdx-3.0.1"]
}
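
To spot ecosystems that miss the Section 5.3 recall target, a small jq filter over this schema works (a convenience sketch, not official tooling):

# Print any ecosystem whose recall falls below the 98% target
jq -r '.by_ecosystem | to_entries[] | select(.value.recall < 0.98) |
  "\(.key): recall \(.value.recall) is below the 0.98 target"' \
  results/sbom/benchmark-sbom-20251214.json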

3.4 Determinism Results Schema

{
  "benchmark": "determinism-v1",
  "date": "2025-12-14T00:00:00Z",
  "scanner_version": "1.3.0",
  "summary": {
    "total_runs": 100,
    "bitwise_identical": 100,
    "bitwise_fidelity": 1.0,
    "semantic_identical": 100,
    "semantic_fidelity": 1.0
  },
  "by_image": {
    "alpine:3.19": {
      "runs": 20,
      "bitwise_identical": 20,
      "output_hash": "sha256:abc123..."
    },
    "python:3.12": {
      "runs": 20,
      "bitwise_identical": 20,
      "output_hash": "sha256:def456..."
    }
  },
  "seed": 42,
  "timestamp_frozen": "2025-01-01T00:00:00Z"
}

4. SUBMISSION PROCESS

4.1 Internal Submission (StellaOps Team)

Benchmark results are automatically collected by CI:

# .gitea/workflows/weekly-benchmark.yml triggers:
# - Weekly benchmark runs
# - Results stored in internal dashboard
# - Regression detection against baselines

Manual submission:

# Upload to internal dashboard
make benchmark-submit

# Or via CLI
stellaops benchmark submit \
  --file results/benchmark-all-20251214.json \
  --dashboard internal

4.2 External Validation Submission

Third parties can validate and submit benchmark results:

Step 1: Fork and Clone

# Fork the benchmark repository
# https://git.stella-ops.org/stella-ops.org/benchmarks

git clone https://git.stella-ops.org/<your-org>/benchmarks.git
cd benchmarks

Step 2: Run Benchmarks

# With StellaOps scanner
make benchmark-all SCANNER=stellaops

# Or with your own tool for comparison
make benchmark-all SCANNER=your-tool

Step 3: Prepare Submission

# Results directory structure
mkdir -p submissions/<your-org>/<date>

# Copy results
cp results/*.json submissions/<your-org>/<date>/

# Add reproduction README
cat > submissions/<your-org>/<date>/README.md <<EOF
# Benchmark Results: <Your Org>

**Date:** $(date -u +%Y-%m-%d)
**Scanner:** <tool-name>
**Version:** <version>

## Environment
- OS: <os>
- CPU: <cpu>
- Memory: <memory>

## Reproduction Steps
<steps>

## Notes
<any observations>
EOF

Step 4: Submit Pull Request

git checkout -b benchmark-results-$(date +%Y%m%d)
git add submissions/
git commit -m "Add benchmark results from <your-org> $(date +%Y-%m-%d)"
git push origin benchmark-results-$(date +%Y%m%d)

# Create PR via web interface or gh CLI
gh pr create --title "Benchmark: <your-org> $(date +%Y-%m-%d)" \
  --body "Benchmark results for external validation"

4.3 Submission Review Process

| Step | Action | Timeline |
|------|--------|----------|
| 1 | PR submitted | Day 0 |
| 2 | Automated validation runs | Day 0 (CI) |
| 3 | Maintainer review | Day 1-3 |
| 4 | Results published (if valid) | Day 3-5 |
| 5 | Dashboard updated | Day 5 |

5. BENCHMARK CATEGORIES

5.1 Reachability Benchmark

Purpose: Measure accuracy of static and runtime reachability analysis.

Ground Truth Source: datasets/reachability/

Test Cases:

  • 50+ samples per language (Java, C#, TypeScript, Python, Go)
  • Known-reachable vulnerable paths
  • Known-unreachable vulnerable code
  • Runtime-only reachable code

Scoring:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 * (Precision * Recall) / (Precision + Recall)
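
Applied to a confusion_matrix block from the Section 3.1 schema, these formulas translate to jq like this (a verification sketch, not part of the official tooling):

# Recompute precision, recall, and F1 from a stored confusion matrix
jq '.by_language.java.confusion_matrix
    | (.tp / (.tp + .fp)) as $p
    | (.tp / (.tp + .fn)) as $r
    | {precision: $p, recall: $r, f1: (2 * $p * $r / ($p + $r))}' \
  results/reachability/benchmark-reachability-20251214.json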

Targets:

| Metric | Target | Blocking Threshold |
|--------|--------|--------------------|
| Precision | >= 90% | >= 85% |
| Recall | >= 85% | >= 80% |
| F1 | >= 87% | >= 82% |

5.2 Performance Benchmark

Purpose: Measure scan time, memory usage, and CPU utilization.

Reference Images: See Performance Baselines

Metrics:

  • P50/P95 scan time (cold and warm)
  • Peak memory usage
  • CPU time
  • Throughput (images/minute)

Targets:

| Image Category | P50 Time | P95 Time | Max Memory |
|----------------|----------|----------|------------|
| Minimal (<100MB) | < 5s | < 10s | < 256MB |
| Standard (100-500MB) | < 15s | < 30s | < 512MB |
| Large (500MB-2GB) | < 45s | < 90s | < 1.5GB |

5.3 SBOM Benchmark

Purpose: Measure component detection completeness and accuracy.

Ground Truth Source: Manual SBOM audits of reference images.

Metrics:

  • Component recall (found / total)
  • Component precision (real / reported)
  • Version accuracy (correct / total)

Targets:

| Metric | Target |
|--------|--------|
| Component Recall | >= 98% |
| Component Precision | >= 99% |
| Version Accuracy | >= 95% |

5.4 Vulnerability Detection Benchmark

Purpose: Measure CVE detection accuracy against known-vulnerable images.

Ground Truth Source: datasets/vulns/ curated CVE lists.

Metrics:

  • True positive rate
  • False positive rate
  • False negative rate
  • Precision/Recall/F1

Targets:

| Metric | Target |
|--------|--------|
| Precision | >= 95% |
| Recall | >= 90% |
| F1 | >= 92% |

5.5 Determinism Benchmark

Purpose: Verify reproducible scan outputs.

Methodology:

  1. Run the same scan N times (default: 20)
  2. Compare output hashes across runs
  3. Calculate bitwise fidelity (a minimal loop sketch follows below)
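
A minimal version of that loop in plain shell (the Make target wraps something equivalent; this is for manual spot checks):

# Run the same scan N times and count distinct output hashes
N=20
for i in $(seq 1 "$N"); do
  stellaops scan --image alpine:3.19 --format json | sha256sum
done | sort | uniq -c | sort -rn
# A single output line with count N means 100% bitwise fidelity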

Targets:

| Metric | Target |
|--------|--------|
| Bitwise Fidelity | 100% |
| Semantic Fidelity | 100% |

6. COMPARING RESULTS

6.1 Against Baselines

# Compare current run against stored baseline
stellaops benchmark compare \
  --baseline results/baseline/2025-Q4.json \
  --current results/benchmark-all-20251214.json \
  --threshold-p50 0.15 \
  --threshold-precision 0.02 \
  --fail-on-regression

# Output:
# Performance: PASS (P50 within 15% of baseline)
# Accuracy: PASS (Precision within 2% of baseline)
# Determinism: PASS (100% fidelity)

6.2 Against Other Tools

# Generate comparison report
stellaops benchmark compare-tools \
  --stellaops results/stellaops/2025-12-14.json \
  --trivy results/trivy/2025-12-14.json \
  --grype results/grype/2025-12-14.json \
  --output comparison-report.html

# Generate trend report (last 12 months)
stellaops benchmark trend \
  --period 12m \
  --metrics precision,recall,p50_time \
  --output trend-report.html

7. TROUBLESHOOTING

7.1 Common Issues

| Issue | Cause | Resolution |
|-------|-------|------------|
| Non-deterministic output | Locale not set | Set LC_ALL=C |
| Out-of-memory (OOM) | Large image | Increase the memory limit |
| Slow performance | Cold cache | Pre-pull images |
| Missing components | Ecosystem not supported | Check the list of supported ecosystems |
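
For the non-deterministic-output row in particular, pinning the environment inline for a one-off rerun helps isolate locale or timezone drift:

# Rule out environment drift: pin locale and timezone for a single rerun
TZ=UTC LC_ALL=C stellaops scan --image alpine:3.19 --format json | sha256sum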

7.2 Debug Mode

# Enable verbose benchmark logging
make benchmark-all DEBUG=1

# Enable timing breakdown
export STELLAOPS_BENCHMARK_TIMING=1
make benchmark-performance

7.3 Validation Failures

# Check result schema validity
stellaops benchmark validate --file results/benchmark-all.json

# Check against ground truth
stellaops benchmark validate-ground-truth \
  --results results/reachability.json \
  --ground-truth datasets/reachability/v2025.12

8. REFERENCES


Document Version: 1.0 | Target Platform: .NET 10, PostgreSQL >= 16