# Benchmark Submission Guide

**Last Updated:** 2025-12-14
**Next Review:** 2026-03-14

---

## Overview

StellaOps publishes benchmarks for:

- **Reachability Analysis** - Accuracy of static and runtime path detection
- **SBOM Completeness** - Component detection and version accuracy
- **Vulnerability Detection** - Precision, recall, and F1 scores
- **Scan Performance** - Time, memory, and CPU metrics
- **Determinism** - Reproducibility of scan outputs

This guide explains how to reproduce, validate, and submit benchmark results.

---
## 1. PREREQUISITES

### 1.1 System Requirements

| Requirement | Minimum | Recommended |
|-------------|---------|-------------|
| CPU | 4 cores | 8 cores |
| Memory | 8 GB | 16 GB |
| Storage | 50 GB SSD | 100 GB NVMe |
| OS | Ubuntu 22.04 LTS | Ubuntu 22.04 LTS |
| Docker | 24.x | 24.x |
| .NET | 10.0 | 10.0 |

### 1.2 Environment Setup

```bash
# Clone the repository
git clone https://git.stella-ops.org/stella-ops.org/git.stella-ops.org.git
cd git.stella-ops.org

# Install .NET 10 SDK
sudo apt-get update
sudo apt-get install -y dotnet-sdk-10.0

# Install Docker (if not present)
curl -fsSL https://get.docker.com | sh

# Install benchmark dependencies
sudo apt-get install -y \
  jq \
  b3sum \
  hyperfine \
  time

# Set determinism environment variables
export TZ=UTC
export LC_ALL=C
export STELLAOPS_DETERMINISM_SEED=42
export STELLAOPS_DETERMINISM_TIMESTAMP="2025-01-01T00:00:00Z"
```
### 1.3 Pull Reference Images

```bash
# Download standard benchmark images
make benchmark-pull-images

# Or manually:
docker pull alpine:3.19
docker pull debian:12-slim
docker pull ubuntu:22.04
docker pull node:20-alpine
docker pull python:3.12
docker pull mcr.microsoft.com/dotnet/aspnet:8.0
docker pull nginx:1.25
docker pull postgres:16-alpine
```

---

## 2. RUNNING BENCHMARKS

### 2.1 Full Benchmark Suite

```bash
# Run all benchmarks (takes ~30-60 minutes)
make benchmark-all

# Output: results/benchmark-all-$(date +%Y%m%d).json
```
### 2.2 Category-Specific Benchmarks

#### Reachability Benchmark

```bash
# Run reachability accuracy benchmarks
make benchmark-reachability

# With a specific language filter
make benchmark-reachability LANG=csharp

# Output: results/reachability/benchmark-reachability-$(date +%Y%m%d).json
```

#### Performance Benchmark

```bash
# Run scan performance benchmarks
make benchmark-performance

# Single image
make benchmark-image IMAGE=alpine:3.19

# Output: results/performance/benchmark-performance-$(date +%Y%m%d).json
```

#### SBOM Benchmark

```bash
# Run SBOM completeness benchmarks
make benchmark-sbom

# Specific format
make benchmark-sbom FORMAT=cyclonedx

# Output: results/sbom/benchmark-sbom-$(date +%Y%m%d).json
```

#### Determinism Benchmark

```bash
# Run determinism verification
make benchmark-determinism

# Output: results/determinism/benchmark-determinism-$(date +%Y%m%d).json
```

### 2.3 CLI Benchmark Commands

```bash
# Performance timing with hyperfine (10 runs)
hyperfine --warmup 2 --runs 10 \
  'stellaops scan --image alpine:3.19 --format json --output /dev/null'

# Memory profiling
/usr/bin/time -v stellaops scan --image alpine:3.19 --format json 2>&1 | \
  grep "Maximum resident set size"

# CPU profiling (Linux)
perf stat stellaops scan --image alpine:3.19 --format json > /dev/null

# Determinism check (run twice, compare hashes)
stellaops scan --image alpine:3.19 --format json | sha256sum > run1.sha
stellaops scan --image alpine:3.19 --format json | sha256sum > run2.sha
diff run1.sha run2.sha && echo "DETERMINISTIC" || echo "NON-DETERMINISTIC"
```

---
## 3. OUTPUT FORMATS

### 3.1 Reachability Results Schema

```json
{
  "benchmark": "reachability-v1",
  "date": "2025-12-14T00:00:00Z",
  "scanner_version": "1.3.0",
  "scanner_commit": "abc123def",
  "environment": {
    "os": "ubuntu-22.04",
    "arch": "amd64",
    "cpu": "Intel Xeon E-2288G",
    "memory_gb": 16
  },
  "summary": {
    "total_samples": 200,
    "precision": 0.92,
    "recall": 0.87,
    "f1": 0.894,
    "false_positive_rate": 0.08,
    "false_negative_rate": 0.13
  },
  "by_language": {
    "java": {
      "samples": 50,
      "precision": 0.94,
      "recall": 0.88,
      "f1": 0.909,
      "confusion_matrix": { "tp": 44, "fp": 3, "tn": 2, "fn": 1 }
    },
    "csharp": {
      "samples": 50,
      "precision": 0.91,
      "recall": 0.86,
      "f1": 0.884,
      "confusion_matrix": { "tp": 43, "fp": 4, "tn": 2, "fn": 1 }
    },
    "typescript": {
      "samples": 50,
      "precision": 0.89,
      "recall": 0.84,
      "f1": 0.864,
      "confusion_matrix": { "tp": 42, "fp": 5, "tn": 2, "fn": 1 }
    },
    "python": {
      "samples": 50,
      "precision": 0.88,
      "recall": 0.83,
      "f1": 0.854,
      "confusion_matrix": { "tp": 41, "fp": 5, "tn": 3, "fn": 1 }
    }
  },
  "ground_truth_ref": "datasets/reachability/v2025.12",
  "raw_results_ref": "results/reachability/raw/2025-12-14/"
}
```
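
As a quick consistency check, the reported `f1` should match the value recomputed from `precision` and `recall`. A minimal Python sketch using the summary values above (the dict literal stands in for a parsed results file):

```python
# Illustrative results fragment matching the summary schema above
results = {"summary": {"precision": 0.92, "recall": 0.87, "f1": 0.894}}

s = results["summary"]
p, r = s["precision"], s["recall"]
f1 = 2 * p * r / (p + r)

# Reported f1 should agree with the recomputed value, up to rounding
assert abs(f1 - s["f1"]) < 0.0005
print(round(f1, 3))  # 0.894
```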

### 3.2 Performance Results Schema

```json
{
  "benchmark": "performance-v1",
  "date": "2025-12-14T00:00:00Z",
  "scanner_version": "1.3.0",
  "scanner_commit": "abc123def",
  "environment": {
    "os": "ubuntu-22.04",
    "arch": "amd64",
    "cpu": "Intel Xeon E-2288G",
    "memory_gb": 16,
    "storage": "nvme"
  },
  "images": [
    {
      "image": "alpine:3.19",
      "size_mb": 7,
      "components": 15,
      "vulnerabilities": 5,
      "runs": 10,
      "cold_start": { "p50_ms": 2800, "p95_ms": 4200, "mean_ms": 3100 },
      "warm_cache": { "p50_ms": 1500, "p95_ms": 2100, "mean_ms": 1650 },
      "memory_peak_mb": 180,
      "cpu_time_ms": 1200
    },
    {
      "image": "python:3.12",
      "size_mb": 1024,
      "components": 300,
      "vulnerabilities": 150,
      "runs": 10,
      "cold_start": { "p50_ms": 32000, "p95_ms": 48000, "mean_ms": 35000 },
      "warm_cache": { "p50_ms": 18000, "p95_ms": 25000, "mean_ms": 19500 },
      "memory_peak_mb": 1100,
      "cpu_time_ms": 28000
    }
  ],
  "aggregated": {
    "total_images": 8,
    "total_runs": 80,
    "avg_time_per_mb_ms": 35,
    "avg_memory_per_component_kb": 400
  }
}
```
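
Derived aggregates such as `avg_time_per_mb_ms` come from normalizing per-image timings by image size. A sketch for a single entry from the example above; note that the published aggregate averages across all eight reference images, so any one image will differ (small images skew high because of fixed startup cost):

```python
# Per-image entry taken from the example schema above
entry = {"image": "alpine:3.19", "size_mb": 7,
         "cold_start": {"mean_ms": 3100}, "warm_cache": {"mean_ms": 1650}}

# Normalize mean cold-start scan time by image size (ms per MB)
cold_per_mb = entry["cold_start"]["mean_ms"] / entry["size_mb"]
print(round(cold_per_mb))  # 443
```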

### 3.3 SBOM Results Schema

```json
{
  "benchmark": "sbom-v1",
  "date": "2025-12-14T00:00:00Z",
  "scanner_version": "1.3.0",
  "summary": {
    "total_images": 8,
    "component_recall": 0.98,
    "component_precision": 0.995,
    "version_accuracy": 0.96
  },
  "by_ecosystem": {
    "apk": {
      "ground_truth_components": 100,
      "detected_components": 99,
      "correct_versions": 96,
      "recall": 0.99,
      "precision": 0.99,
      "version_accuracy": 0.96
    },
    "npm": {
      "ground_truth_components": 500,
      "detected_components": 492,
      "correct_versions": 475,
      "recall": 0.984,
      "precision": 0.998,
      "version_accuracy": 0.965
    }
  },
  "formats_tested": ["cyclonedx-1.6", "spdx-3.0.1"]
}
```
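
The per-ecosystem ratios follow directly from the count fields. A minimal sketch (assuming `detected_components` counts true detections, as the recall figures above imply):

```python
# "apk" entry from the example schema above
eco = {"ground_truth_components": 100, "detected_components": 99,
       "correct_versions": 96}

# Recall: share of ground-truth components that were detected
recall = eco["detected_components"] / eco["ground_truth_components"]
# Version accuracy: correct versions judged against the full ground truth
version_accuracy = eco["correct_versions"] / eco["ground_truth_components"]

print(recall, version_accuracy)  # 0.99 0.96
```

Precision cannot be derived from these fields alone, since it also depends on the number of spurious components reported.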

### 3.4 Determinism Results Schema

```json
{
  "benchmark": "determinism-v1",
  "date": "2025-12-14T00:00:00Z",
  "scanner_version": "1.3.0",
  "summary": {
    "total_runs": 100,
    "bitwise_identical": 100,
    "bitwise_fidelity": 1.0,
    "semantic_identical": 100,
    "semantic_fidelity": 1.0
  },
  "by_image": {
    "alpine:3.19": {
      "runs": 20,
      "bitwise_identical": 20,
      "output_hash": "sha256:abc123..."
    },
    "python:3.12": {
      "runs": 20,
      "bitwise_identical": 20,
      "output_hash": "sha256:def456..."
    }
  },
  "seed": 42,
  "timestamp_frozen": "2025-01-01T00:00:00Z"
}
```
---

## 4. SUBMISSION PROCESS

### 4.1 Internal Submission (StellaOps Team)

Benchmark results are automatically collected by CI:

```yaml
# .gitea/workflows/weekly-benchmark.yml triggers:
# - Weekly benchmark runs
# - Results stored in internal dashboard
# - Regression detection against baselines
```

Manual submission:

```bash
# Upload to internal dashboard
make benchmark-submit

# Or via CLI
stellaops benchmark submit \
  --file results/benchmark-all-20251214.json \
  --dashboard internal
```
### 4.2 External Validation Submission

Third parties can validate and submit benchmark results:

#### Step 1: Fork and Clone

```bash
# Fork the benchmark repository:
# https://git.stella-ops.org/stella-ops.org/benchmarks

git clone https://git.stella-ops.org/<your-org>/benchmarks.git
cd benchmarks
```

#### Step 2: Run Benchmarks

```bash
# With the StellaOps scanner
make benchmark-all SCANNER=stellaops

# Or with your own tool for comparison
make benchmark-all SCANNER=your-tool
```

#### Step 3: Prepare Submission

```bash
# Results directory structure
mkdir -p submissions/<your-org>/<date>

# Copy results
cp results/*.json submissions/<your-org>/<date>/

# Add a reproduction README
cat > submissions/<your-org>/<date>/README.md <<EOF
# Benchmark Results: <Your Org>

**Date:** $(date -u +%Y-%m-%d)
**Scanner:** <tool-name>
**Version:** <version>

## Environment
- OS: <os>
- CPU: <cpu>
- Memory: <memory>

## Reproduction Steps
<steps>

## Notes
<any observations>
EOF
```

#### Step 4: Submit Pull Request

```bash
git checkout -b benchmark-results-$(date +%Y%m%d)
git add submissions/
git commit -m "Add benchmark results from <your-org> $(date +%Y-%m-%d)"
git push origin benchmark-results-$(date +%Y%m%d)

# Create the PR via the web interface or the gh CLI
gh pr create --title "Benchmark: <your-org> $(date +%Y-%m-%d)" \
  --body "Benchmark results for external validation"
```
### 4.3 Submission Review Process

| Step | Action | Timeline |
|------|--------|----------|
| 1 | PR submitted | Day 0 |
| 2 | Automated validation runs | Day 0 (CI) |
| 3 | Maintainer review | Day 1-3 |
| 4 | Results published (if valid) | Day 3-5 |
| 5 | Dashboard updated | Day 5 |

---
## 5. BENCHMARK CATEGORIES

### 5.1 Reachability Benchmark

**Purpose:** Measure accuracy of static and runtime reachability analysis.

**Ground Truth Source:** `datasets/reachability/`

**Test Cases:**
- 50+ samples per language (Java, C#, TypeScript, Python, Go)
- Known-reachable vulnerable paths
- Known-unreachable vulnerable code
- Runtime-only reachable code

**Scoring:**
```
Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)
F1        = 2 * (Precision * Recall) / (Precision + Recall)
```
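
These formulas translate directly into a small helper (a generic sketch; the counts below are illustrative, not taken from the benchmark tables):

```python
def score(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts: 90 true positives, 10 false positives, 10 false negatives
print(tuple(round(x, 3) for x in score(90, 10, 10)))  # (0.9, 0.9, 0.9)
```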

**Targets:**

| Metric | Target | Blocking |
|--------|--------|----------|
| Precision | >= 90% | >= 85% |
| Recall | >= 85% | >= 80% |
| F1 | >= 87% | >= 82% |

### 5.2 Performance Benchmark

**Purpose:** Measure scan time, memory usage, and CPU utilization.

**Reference Images:** See [Performance Baselines](performance-baselines.md)

**Metrics:**
- P50/P95 scan time (cold and warm)
- Peak memory usage
- CPU time
- Throughput (images/minute)

**Targets:**

| Image Category | P50 Time | P95 Time | Max Memory |
|----------------|----------|----------|------------|
| Minimal (<100MB) | < 5s | < 10s | < 256MB |
| Standard (100-500MB) | < 15s | < 30s | < 512MB |
| Large (500MB-2GB) | < 45s | < 90s | < 1.5GB |
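
A local gate against these targets can be sketched as follows (thresholds copied from the table; the sample result is hypothetical):

```python
# (max size MB, p50 ms, p95 ms, peak memory MB) per the targets table
TARGETS = [(100, 5_000, 10_000, 256),
           (500, 15_000, 30_000, 512),
           (2_048, 45_000, 90_000, 1_536)]

def meets_targets(size_mb, p50_ms, p95_ms, mem_mb):
    """Check a scan result against the target row for its size category."""
    for max_size, t50, t95, tmem in TARGETS:
        if size_mb < max_size:
            return p50_ms < t50 and p95_ms < t95 and mem_mb < tmem
    return True  # images over 2 GB have no published target

# Hypothetical alpine-sized result: 7 MB image, p50 2.8 s, p95 4.2 s, 180 MB peak
print(meets_targets(7, 2_800, 4_200, 180))  # True
```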

### 5.3 SBOM Benchmark

**Purpose:** Measure component detection completeness and accuracy.

**Ground Truth Source:** Manual SBOM audits of reference images.

**Metrics:**
- Component recall (found / total)
- Component precision (real / reported)
- Version accuracy (correct / total)

**Targets:**

| Metric | Target |
|--------|--------|
| Component Recall | >= 98% |
| Component Precision | >= 99% |
| Version Accuracy | >= 95% |
### 5.4 Vulnerability Detection Benchmark

**Purpose:** Measure CVE detection accuracy against known-vulnerable images.

**Ground Truth Source:** `datasets/vulns/` curated CVE lists.

**Metrics:**
- True positive rate
- False positive rate
- False negative rate
- Precision/Recall/F1

**Targets:**

| Metric | Target |
|--------|--------|
| Precision | >= 95% |
| Recall | >= 90% |
| F1 | >= 92% |
### 5.5 Determinism Benchmark

**Purpose:** Verify reproducible scan outputs.

**Methodology:**
1. Run the same scan N times (default: 20)
2. Compare output hashes
3. Calculate bitwise fidelity

**Targets:**

| Metric | Target |
|--------|--------|
| Bitwise Fidelity | 100% |
| Semantic Fidelity | 100% |
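
Step 3 amounts to counting how many runs reproduce the most common output hash. A minimal sketch (the hash values are hypothetical):

```python
from collections import Counter

def bitwise_fidelity(output_hashes):
    """Fraction of runs whose output hash equals the modal hash."""
    if not output_hashes:
        return 0.0
    modal_count = Counter(output_hashes).most_common(1)[0][1]
    return modal_count / len(output_hashes)

# Hypothetical: 20 runs, one diverging output
hashes = ["sha256:abc"] * 19 + ["sha256:xyz"]
print(bitwise_fidelity(hashes))  # 0.95
```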

---

## 6. COMPARING RESULTS

### 6.1 Against Baselines

```bash
# Compare the current run against a stored baseline
stellaops benchmark compare \
  --baseline results/baseline/2025-Q4.json \
  --current results/benchmark-all-20251214.json \
  --threshold-p50 0.15 \
  --threshold-precision 0.02 \
  --fail-on-regression

# Output:
# Performance: PASS (P50 within 15% of baseline)
# Accuracy: PASS (Precision within 2% of baseline)
# Determinism: PASS (100% fidelity)
```
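
The threshold flags express relative regression limits; a sketch of the `--threshold-p50 0.15` semantics (illustrative only, the actual CLI logic may differ):

```python
def p50_regressed(baseline_ms: float, current_ms: float,
                  threshold: float = 0.15) -> bool:
    """True if the current P50 exceeds the baseline by more than the threshold."""
    return current_ms > baseline_ms * (1 + threshold)

baseline = 1_500                          # warm-cache P50, illustrative
print(p50_regressed(baseline, 1_650))     # False: within 15% of baseline
print(p50_regressed(baseline, 1_800))     # True: more than 15% over baseline
```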

### 6.2 Against Other Tools

```bash
# Generate a comparison report
stellaops benchmark compare-tools \
  --stellaops results/stellaops/2025-12-14.json \
  --trivy results/trivy/2025-12-14.json \
  --grype results/grype/2025-12-14.json \
  --output comparison-report.html
```

### 6.3 Historical Trends

```bash
# Generate a trend report (last 12 months)
stellaops benchmark trend \
  --period 12m \
  --metrics precision,recall,p50_time \
  --output trend-report.html
```
---

## 7. TROUBLESHOOTING

### 7.1 Common Issues

| Issue | Cause | Resolution |
|-------|-------|------------|
| Non-deterministic output | Locale not set | Set `LC_ALL=C` |
| Memory OOM | Large image | Increase memory limit |
| Slow performance | Cold cache | Pre-pull images |
| Missing components | Ecosystem not supported | Check supported ecosystems |

### 7.2 Debug Mode

```bash
# Enable verbose benchmark logging
make benchmark-all DEBUG=1

# Enable timing breakdown
export STELLAOPS_BENCHMARK_TIMING=1
make benchmark-performance
```
### 7.3 Validation Failures

```bash
# Check result schema validity
stellaops benchmark validate --file results/benchmark-all.json

# Check against ground truth
stellaops benchmark validate-ground-truth \
  --results results/reachability.json \
  --ground-truth datasets/reachability/v2025.12
```

---

## 8. REFERENCES

- [Performance Baselines](performance-baselines.md)
- [Accuracy Metrics Framework](accuracy-metrics-framework.md)
- [Offline Parity Verification](../airgap/offline-parity-verification.md)
- [Determinism CI Harness](../modules/scanner/design/determinism-ci-harness.md)
- [Ground Truth Datasets](../datasets/README.md)

---

**Document Version:** 1.0
**Target Platform:** .NET 10, PostgreSQL >= 16