Benchmark Submission Guide

Last Updated: 2025-12-14 | Next Review: 2026-03-14


Overview

StellaOps publishes benchmarks for:

  • Reachability Analysis - Accuracy of static and runtime path detection
  • SBOM Completeness - Component detection and version accuracy
  • Vulnerability Detection - Precision, recall, and F1 scores
  • Scan Performance - Time, memory, and CPU metrics
  • Determinism - Reproducibility of scan outputs

This guide explains how to reproduce, validate, and submit benchmark results.


1. PREREQUISITES

1.1 System Requirements

| Requirement | Minimum | Recommended |
|-------------|---------|-------------|
| CPU | 4 cores | 8 cores |
| Memory | 8 GB | 16 GB |
| Storage | 50 GB SSD | 100 GB NVMe |
| OS | Ubuntu 22.04 LTS | Ubuntu 22.04 LTS |
| Docker | 24.x | 24.x |
| .NET | 10.0 | 10.0 |

1.2 Environment Setup

# Clone the repository
git clone https://git.stella-ops.org/stella-ops.org/git.stella-ops.org.git
cd git.stella-ops.org

# Install .NET 10 SDK
sudo apt-get update
sudo apt-get install -y dotnet-sdk-10.0

# Install Docker (if not present)
curl -fsSL https://get.docker.com | sh

# Install benchmark dependencies
sudo apt-get install -y \
    jq \
    b3sum \
    hyperfine \
    time

# Set determinism environment variables
export TZ=UTC
export LC_ALL=C
export STELLAOPS_DETERMINISM_SEED=42
export STELLAOPS_DETERMINISM_TIMESTAMP="2025-01-01T00:00:00Z"
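
Before pulling images, it is worth a quick sanity check that the toolchain and determinism variables are actually in place. A minimal sketch using only standard tools (not part of the official Make targets):

# Verify toolchain versions
dotnet --version        # expect 10.0.x
docker --version        # expect 24.x
jq --version && hyperfine --version

# Confirm the determinism variables are exported
env | grep -E '^(TZ|LC_ALL|STELLAOPS_DETERMINISM_(SEED|TIMESTAMP))='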

1.3 Pull Reference Images

# Download standard benchmark images
make benchmark-pull-images

# Or manually:
docker pull alpine:3.19
docker pull debian:12-slim
docker pull ubuntu:22.04
docker pull node:20-alpine
docker pull python:3.12
docker pull mcr.microsoft.com/dotnet/aspnet:8.0
docker pull nginx:1.25
docker pull postgres:16-alpine
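
Tags are mutable, so runs are only comparable when everyone scans the same bytes. A small sketch for recording what was actually pulled (standard Docker CLI; the results/ path is just a suggestion):

# Record the exact digests of the pulled reference images
mkdir -p results
docker images --digests --format '{{.Repository}}:{{.Tag}} {{.Digest}}' > results/image-digests.txt

# A later rerun can then pull by digest instead of tag, e.g.:
# docker pull alpine@sha256:<digest-from-the-file>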

2. RUNNING BENCHMARKS

2.1 Full Benchmark Suite

# Run all benchmarks (takes ~30-60 minutes)
make benchmark-all

# Output: results/benchmark-all-$(date +%Y%m%d).json

2.2 Category-Specific Benchmarks

Reachability Benchmark

# Run reachability accuracy benchmarks
make benchmark-reachability

# With specific language filter
make benchmark-reachability LANG=csharp

# Output: results/reachability/benchmark-reachability-$(date +%Y%m%d).json

Performance Benchmark

# Run scan performance benchmarks
make benchmark-performance

# Single image
make benchmark-image IMAGE=alpine:3.19

# Output: results/performance/benchmark-performance-$(date +%Y%m%d).json

SBOM Benchmark

# Run SBOM completeness benchmarks
make benchmark-sbom

# Specific format
make benchmark-sbom FORMAT=cyclonedx

# Output: results/sbom/benchmark-sbom-$(date +%Y%m%d).json

Determinism Benchmark

# Run determinism verification
make benchmark-determinism

# Output: results/determinism/benchmark-determinism-$(date +%Y%m%d).json

2.3 CLI Benchmark Commands

# Performance timing with hyperfine (10 runs)
hyperfine --warmup 2 --runs 10 \
  'stellaops scan --image alpine:3.19 --format json --output /dev/null'

# Memory profiling
/usr/bin/time -v stellaops scan --image alpine:3.19 --format json 2>&1 | \
  grep "Maximum resident set size"

# CPU profiling (Linux)
perf stat stellaops scan --image alpine:3.19 --format json > /dev/null

# Determinism check (run twice, compare hashes)
stellaops scan --image alpine:3.19 --format json | sha256sum > run1.sha
stellaops scan --image alpine:3.19 --format json | sha256sum > run2.sha
diff run1.sha run2.sha && echo "DETERMINISTIC" || echo "NON-DETERMINISTIC"
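
The sha256 comparison above tests bitwise identity. For the semantic fidelity reported in Section 3.4, one option is to normalize the JSON before hashing; a sketch using jq key sorting (the scanner's internal normalization may differ):

# Semantic check: sort object keys so key order alone cannot cause a mismatch
# (note: jq -S sorts keys only; array ordering still matters)
stellaops scan --image alpine:3.19 --format json | jq -S . | sha256sum > run1.sem.sha
stellaops scan --image alpine:3.19 --format json | jq -S . | sha256sum > run2.sem.sha
diff run1.sem.sha run2.sem.sha && echo "SEMANTICALLY IDENTICAL"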

3. OUTPUT FORMATS

3.1 Reachability Results Schema

{
  "benchmark": "reachability-v1",
  "date": "2025-12-14T00:00:00Z",
  "scanner_version": "1.3.0",
  "scanner_commit": "abc123def",
  "environment": {
    "os": "ubuntu-22.04",
    "arch": "amd64",
    "cpu": "Intel Xeon E-2288G",
    "memory_gb": 16
  },
  "summary": {
    "total_samples": 200,
    "precision": 0.92,
    "recall": 0.87,
    "f1": 0.894,
    "false_positive_rate": 0.08,
    "false_negative_rate": 0.13
  },
  "by_language": {
    "java": {
      "samples": 50,
      "precision": 0.94,
      "recall": 0.88,
      "f1": 0.909,
      "confusion_matrix": {
        "tp": 44, "fp": 3, "tn": 2, "fn": 1
      }
    },
    "csharp": {
      "samples": 50,
      "precision": 0.91,
      "recall": 0.86,
      "f1": 0.884,
      "confusion_matrix": {
        "tp": 43, "fp": 4, "tn": 2, "fn": 1
      }
    },
    "typescript": {
      "samples": 50,
      "precision": 0.89,
      "recall": 0.84,
      "f1": 0.864,
      "confusion_matrix": {
        "tp": 42, "fp": 5, "tn": 2, "fn": 1
      }
    },
    "python": {
      "samples": 50,
      "precision": 0.88,
      "recall": 0.83,
      "f1": 0.854,
      "confusion_matrix": {
        "tp": 41, "fp": 5, "tn": 3, "fn": 1
      }
    }
  },
  "ground_truth_ref": "datasets/reachability/v2025.12",
  "raw_results_ref": "results/reachability/raw/2025-12-14/"
}
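
Given a file in this shape, headline and per-language metrics can be pulled out directly with jq (the filename follows the pattern from Section 2.2):

# Headline metrics
jq '{precision: .summary.precision, recall: .summary.recall, f1: .summary.f1}' \
  results/reachability/benchmark-reachability-20251214.json

# Per-language F1 scores
jq -r '.by_language | to_entries[] | "\(.key): f1=\(.value.f1)"' \
  results/reachability/benchmark-reachability-20251214.json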

3.2 Performance Results Schema

{
  "benchmark": "performance-v1",
  "date": "2025-12-14T00:00:00Z",
  "scanner_version": "1.3.0",
  "scanner_commit": "abc123def",
  "environment": {
    "os": "ubuntu-22.04",
    "arch": "amd64",
    "cpu": "Intel Xeon E-2288G",
    "memory_gb": 16,
    "storage": "nvme"
  },
  "images": [
    {
      "image": "alpine:3.19",
      "size_mb": 7,
      "components": 15,
      "vulnerabilities": 5,
      "runs": 10,
      "cold_start": {
        "p50_ms": 2800,
        "p95_ms": 4200,
        "mean_ms": 3100
      },
      "warm_cache": {
        "p50_ms": 1500,
        "p95_ms": 2100,
        "mean_ms": 1650
      },
      "memory_peak_mb": 180,
      "cpu_time_ms": 1200
    },
    {
      "image": "python:3.12",
      "size_mb": 1024,
      "components": 300,
      "vulnerabilities": 150,
      "runs": 10,
      "cold_start": {
        "p50_ms": 32000,
        "p95_ms": 48000,
        "mean_ms": 35000
      },
      "warm_cache": {
        "p50_ms": 18000,
        "p95_ms": 25000,
        "mean_ms": 19500
      },
      "memory_peak_mb": 1100,
      "cpu_time_ms": 28000
    }
  ],
  "aggregated": {
    "total_images": 8,
    "total_runs": 80,
    "avg_time_per_mb_ms": 35,
    "avg_memory_per_component_kb": 400
  }
}
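
A quick way to tabulate cold versus warm latency per image from a file in this format:

# Columns: image, cold P50 (ms), warm P50 (ms)
jq -r '.images[] | [.image, .cold_start.p50_ms, .warm_cache.p50_ms] | @tsv' \
  results/performance/benchmark-performance-20251214.json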

3.3 SBOM Results Schema

{
  "benchmark": "sbom-v1",
  "date": "2025-12-14T00:00:00Z",
  "scanner_version": "1.3.0",
  "summary": {
    "total_images": 8,
    "component_recall": 0.98,
    "component_precision": 0.995,
    "version_accuracy": 0.96
  },
  "by_ecosystem": {
    "apk": {
      "ground_truth_components": 100,
      "detected_components": 99,
      "correct_versions": 96,
      "recall": 0.99,
      "precision": 0.99,
      "version_accuracy": 0.96
    },
    "npm": {
      "ground_truth_components": 500,
      "detected_components": 492,
      "correct_versions": 475,
      "recall": 0.984,
      "precision": 0.998,
      "version_accuracy": 0.965
    }
  },
  "formats_tested": ["cyclonedx-1.6", "spdx-3.0.1"]
}
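
To spot ecosystems that miss the Section 5.3 recall target, a small jq filter over this schema works (a convenience sketch, not official tooling):

# Print any ecosystem whose recall falls below the 98% target
jq -r '.by_ecosystem | to_entries[] | select(.value.recall < 0.98) |
  "\(.key): recall \(.value.recall) is below the 0.98 target"' \
  results/sbom/benchmark-sbom-20251214.json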

3.4 Determinism Results Schema

{
  "benchmark": "determinism-v1",
  "date": "2025-12-14T00:00:00Z",
  "scanner_version": "1.3.0",
  "summary": {
    "total_runs": 100,
    "bitwise_identical": 100,
    "bitwise_fidelity": 1.0,
    "semantic_identical": 100,
    "semantic_fidelity": 1.0
  },
  "by_image": {
    "alpine:3.19": {
      "runs": 20,
      "bitwise_identical": 20,
      "output_hash": "sha256:abc123..."
    },
    "python:3.12": {
      "runs": 20,
      "bitwise_identical": 20,
      "output_hash": "sha256:def456..."
    }
  },
  "seed": 42,
  "timestamp_frozen": "2025-01-01T00:00:00Z"
}

4. SUBMISSION PROCESS

4.1 Internal Submission (StellaOps Team)

Benchmark results are automatically collected by CI:

# .gitea/workflows/weekly-benchmark.yml triggers:
# - Weekly benchmark runs
# - Results stored in internal dashboard
# - Regression detection against baselines

Manual submission:

# Upload to internal dashboard
make benchmark-submit

# Or via CLI
stellaops benchmark submit \
  --file results/benchmark-all-20251214.json \
  --dashboard internal

4.2 External Validation Submission

Third parties can validate and submit benchmark results:

Step 1: Fork and Clone

# Fork the benchmark repository
# https://git.stella-ops.org/stella-ops.org/benchmarks

git clone https://git.stella-ops.org/<your-org>/benchmarks.git
cd benchmarks

Step 2: Run Benchmarks

# With StellaOps scanner
make benchmark-all SCANNER=stellaops

# Or with your own tool for comparison
make benchmark-all SCANNER=your-tool

Step 3: Prepare Submission

# Results directory structure
mkdir -p submissions/<your-org>/<date>

# Copy results
cp results/*.json submissions/<your-org>/<date>/

# Add reproduction README
cat > submissions/<your-org>/<date>/README.md <<EOF
# Benchmark Results: <Your Org>

**Date:** $(date -u +%Y-%m-%d)
**Scanner:** <tool-name>
**Version:** <version>

## Environment
- OS: <os>
- CPU: <cpu>
- Memory: <memory>

## Reproduction Steps
<steps>

## Notes
<any observations>
EOF

Step 4: Submit Pull Request

git checkout -b benchmark-results-$(date +%Y%m%d)
git add submissions/
git commit -m "Add benchmark results from <your-org> $(date +%Y-%m-%d)"
git push origin benchmark-results-$(date +%Y%m%d)

# Create PR via web interface or gh CLI
gh pr create --title "Benchmark: <your-org> $(date +%Y-%m-%d)" \
  --body "Benchmark results for external validation"

4.3 Submission Review Process

| Step | Action | Timeline |
|------|--------|----------|
| 1 | PR submitted | Day 0 |
| 2 | Automated validation runs | Day 0 (CI) |
| 3 | Maintainer review | Day 1-3 |
| 4 | Results published (if valid) | Day 3-5 |
| 5 | Dashboard updated | Day 5 |

5. BENCHMARK CATEGORIES

5.1 Reachability Benchmark

Purpose: Measure accuracy of static and runtime reachability analysis.

Ground Truth Source: datasets/reachability/

Test Cases:

  • 50+ samples per language (Java, C#, TypeScript, Python, Go)
  • Known-reachable vulnerable paths
  • Known-unreachable vulnerable code
  • Runtime-only reachable code

Scoring:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 * (Precision * Recall) / (Precision + Recall)
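
Applied to a confusion_matrix block from the Section 3.1 schema, these formulas translate to jq like this (a verification sketch, not part of the official tooling):

# Recompute precision, recall, and F1 from a stored confusion matrix
jq '.by_language.java.confusion_matrix
    | (.tp / (.tp + .fp)) as $p
    | (.tp / (.tp + .fn)) as $r
    | {precision: $p, recall: $r, f1: (2 * $p * $r / ($p + $r))}' \
  results/reachability/benchmark-reachability-20251214.json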

Targets:

| Metric | Target | Blocking Threshold |
|--------|--------|--------------------|
| Precision | >= 90% | >= 85% |
| Recall | >= 85% | >= 80% |
| F1 | >= 87% | >= 82% |

5.2 Performance Benchmark

Purpose: Measure scan time, memory usage, and CPU utilization.

Reference Images: See Performance Baselines

Metrics:

  • P50/P95 scan time (cold and warm)
  • Peak memory usage
  • CPU time
  • Throughput (images/minute)

Targets:

| Image Category | P50 Time | P95 Time | Max Memory |
|----------------|----------|----------|------------|
| Minimal (<100MB) | < 5s | < 10s | < 256MB |
| Standard (100-500MB) | < 15s | < 30s | < 512MB |
| Large (500MB-2GB) | < 45s | < 90s | < 1.5GB |

5.3 SBOM Benchmark

Purpose: Measure component detection completeness and accuracy.

Ground Truth Source: Manual SBOM audits of reference images.

Metrics:

  • Component recall (found / total)
  • Component precision (real / reported)
  • Version accuracy (correct / total)

Targets:

| Metric | Target |
|--------|--------|
| Component Recall | >= 98% |
| Component Precision | >= 99% |
| Version Accuracy | >= 95% |

5.4 Vulnerability Detection Benchmark

Purpose: Measure CVE detection accuracy against known-vulnerable images.

Ground Truth Source: datasets/vulns/ curated CVE lists.

Metrics:

  • True positive rate
  • False positive rate
  • False negative rate
  • Precision/Recall/F1

Targets:

| Metric | Target |
|--------|--------|
| Precision | >= 95% |
| Recall | >= 90% |
| F1 | >= 92% |

5.5 Determinism Benchmark

Purpose: Verify reproducible scan outputs.

Methodology:

  1. Run the same scan N times (default: 20)
  2. Compare output hashes across runs
  3. Calculate bitwise fidelity (a minimal loop sketch follows below)
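
A minimal version of that loop in plain shell (the Make target wraps something equivalent; this is for manual spot checks):

# Run the same scan N times and count distinct output hashes
N=20
for i in $(seq 1 "$N"); do
  stellaops scan --image alpine:3.19 --format json | sha256sum
done | sort | uniq -c | sort -rn
# A single output line with count N means 100% bitwise fidelity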

Targets:

| Metric | Target |
|--------|--------|
| Bitwise Fidelity | 100% |
| Semantic Fidelity | 100% |

6. COMPARING RESULTS

6.1 Against Baselines

# Compare current run against stored baseline
stellaops benchmark compare \
  --baseline results/baseline/2025-Q4.json \
  --current results/benchmark-all-20251214.json \
  --threshold-p50 0.15 \
  --threshold-precision 0.02 \
  --fail-on-regression

# Output:
# Performance: PASS (P50 within 15% of baseline)
# Accuracy: PASS (Precision within 2% of baseline)
# Determinism: PASS (100% fidelity)

6.2 Against Other Tools

# Generate comparison report
stellaops benchmark compare-tools \
  --stellaops results/stellaops/2025-12-14.json \
  --trivy results/trivy/2025-12-14.json \
  --grype results/grype/2025-12-14.json \
  --output comparison-report.html

# Generate trend report (last 12 months)
stellaops benchmark trend \
  --period 12m \
  --metrics precision,recall,p50_time \
  --output trend-report.html

7. TROUBLESHOOTING

7.1 Common Issues

| Issue | Cause | Resolution |
|-------|-------|------------|
| Non-deterministic output | Locale not set | Set LC_ALL=C |
| Out-of-memory (OOM) | Large image | Increase the memory limit |
| Slow performance | Cold cache | Pre-pull images |
| Missing components | Ecosystem not supported | Check the list of supported ecosystems |
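
For the non-deterministic-output row in particular, pinning the environment inline for a one-off rerun helps isolate locale or timezone drift:

# Rule out environment drift: pin locale and timezone for a single rerun
TZ=UTC LC_ALL=C stellaops scan --image alpine:3.19 --format json | sha256sum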

7.2 Debug Mode

# Enable verbose benchmark logging
make benchmark-all DEBUG=1

# Enable timing breakdown
export STELLAOPS_BENCHMARK_TIMING=1
make benchmark-performance

7.3 Validation Failures

# Check result schema validity
stellaops benchmark validate --file results/benchmark-all.json

# Check against ground truth
stellaops benchmark validate-ground-truth \
  --results results/reachability.json \
  --ground-truth datasets/reachability/v2025.12

8. REFERENCES


Document Version: 1.0 | Target Platform: .NET 10, PostgreSQL >= 16