# Accuracy Metrics Framework
## Overview
This document defines the accuracy metrics framework used to measure and track the correctness of StellaOps scanner results. All metrics are computed against ground truth datasets and published quarterly.
## Metric Definitions
### Confusion Matrix
For binary classification tasks (e.g., reachable vs unreachable):
| | Predicted Positive | Predicted Negative |
|--|-------------------|-------------------|
| **Actual Positive** | True Positive (TP) | False Negative (FN) |
| **Actual Negative** | False Positive (FP) | True Negative (TN) |
### Core Metrics
| Metric | Formula | Description | Target |
|--------|---------|-------------|--------|
| **Precision** | TP / (TP + FP) | Of items flagged, how many were correct | >= 90% |
| **Recall** | TP / (TP + FN) | Of actual positives, how many were found | >= 85% |
| **F1 Score** | 2 * (P * R) / (P + R) | Harmonic mean of precision and recall | >= 87% |
| **False Positive Rate** | FP / (FP + TN) | Rate of incorrect positive flags | <= 10% |
| **Accuracy** | (TP + TN) / Total | Overall correctness | >= 90% |
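The formulas above can be applied directly to confusion-matrix counts. A minimal sketch (illustrative only; the in-repo benchmark tooling may differ):

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the core accuracy metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positive_rate": fpr,
        "accuracy": accuracy,
    }

# Example: 90 TP, 10 FP, 85 TN, 15 FN
print(classification_metrics(90, 10, 85, 15))
```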
---
## Reachability Analysis Accuracy
### Definitions
- **True Positive (TP)**: Correctly identified as reachable (code path actually exists)
- **False Positive (FP)**: Incorrectly identified as reachable (no real code path)
- **True Negative (TN)**: Correctly identified as unreachable (no code path exists)
- **False Negative (FN)**: Incorrectly identified as unreachable (code path exists but missed)
### Target Metrics
| Metric | Target | Stretch Goal |
|--------|--------|--------------|
| Precision | >= 90% | >= 95% |
| Recall | >= 85% | >= 90% |
| F1 Score | >= 87% | >= 92% |
| False Positive Rate | <= 10% | <= 5% |
### Per-Language Targets
| Language | Precision | Recall | F1 | Notes |
|----------|-----------|--------|-----|-------|
| Java | >= 92% | >= 88% | >= 90% | Strong static analysis support |
| C# | >= 90% | >= 85% | >= 87% | Roslyn-based analysis |
| Go | >= 88% | >= 82% | >= 85% | Good call graph support |
| JavaScript | >= 85% | >= 78% | >= 81% | Dynamic typing challenges |
| Python | >= 83% | >= 75% | >= 79% | Dynamic typing challenges |
| TypeScript | >= 88% | >= 82% | >= 85% | Better than JS due to types |
---
## Lattice State Accuracy
VEX lattice states have different confidence requirements:
| State | Definition | Target Accuracy | Validation |
|-------|------------|-----------------|------------|
| **CR** (Confirmed Reachable) | Runtime evidence + static path | >= 95% | Runtime trace verification |
| **SR** (Static Reachable) | Static path only | >= 90% | Static analysis coverage |
| **SU** (Static Unreachable) | No static path found | >= 85% | Negative proof verification |
| **DT** (Denied by Tool) | Tool analysis confirms not affected | >= 90% | Tool output validation |
| **DV** (Denied by Vendor) | Vendor VEX statement | >= 95% | VEX signature verification |
| **U** (Unknown) | Insufficient evidence | Track % | Minimize unknowns |
### Lattice Transition Accuracy
Measure accuracy of automatic state transitions:
| Transition | Trigger | Target Accuracy |
|------------|---------|-----------------|
| U -> SR | Static analysis finds path | >= 90% |
| SR -> CR | Runtime evidence added | >= 95% |
| U -> SU | Static analysis proves unreachable | >= 85% |
| SR -> DT | Tool-specific analysis | >= 90% |
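A minimal sketch of how the tracked transitions in this table might be encoded for accuracy measurement (state names as defined above; the actual VEX lattice implementation may differ):

```python
from enum import Enum

class LatticeState(Enum):
    CR = "confirmed_reachable"
    SR = "static_reachable"
    SU = "static_unreachable"
    DT = "denied_by_tool"
    DV = "denied_by_vendor"
    U = "unknown"

# Automatic transitions whose accuracy is tracked (from the table above).
TRACKED_TRANSITIONS = {
    (LatticeState.U, LatticeState.SR): "static analysis finds path",
    (LatticeState.SR, LatticeState.CR): "runtime evidence added",
    (LatticeState.U, LatticeState.SU): "static analysis proves unreachable",
    (LatticeState.SR, LatticeState.DT): "tool-specific analysis",
}

def is_tracked_transition(old: LatticeState, new: LatticeState) -> bool:
    """Return True if the transition is one whose accuracy is measured."""
    return (old, new) in TRACKED_TRANSITIONS
```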
---
## SBOM Completeness Metrics
### Component Detection
| Metric | Formula | Target | Notes |
|--------|---------|--------|-------|
| **Component Recall** | Found / Total Actual | >= 98% | Find all real components |
| **Component Precision** | Real / Reported | >= 99% | Minimize phantom components |
| **Version Accuracy** | Correct Versions / Total | >= 95% | Version string correctness |
| **License Accuracy** | Correct Licenses / Total | >= 90% | License detection accuracy |
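These completeness metrics reduce to set comparisons between a ground truth SBOM and the scanner output. A hedged sketch (the component/version representation here is illustrative, not the actual data model):

```python
def sbom_completeness(ground_truth: dict[str, str], reported: dict[str, str]) -> dict:
    """ground_truth and reported map component name -> version string."""
    truth_names = set(ground_truth)
    reported_names = set(reported)
    found = truth_names & reported_names  # real components that were reported
    recall = len(found) / len(truth_names) if truth_names else 1.0
    precision = len(found) / len(reported_names) if reported_names else 1.0
    correct_versions = sum(
        1 for name in found if reported[name] == ground_truth[name]
    )
    version_accuracy = correct_versions / len(found) if found else 1.0
    return {
        "component_recall": recall,
        "component_precision": precision,
        "version_accuracy": version_accuracy,
    }
```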
### Per-Ecosystem Targets
| Ecosystem | Comp. Recall | Comp. Precision | Version Acc. |
|-----------|--------------|-----------------|--------------|
| Alpine APK | >= 99% | >= 99% | >= 98% |
| Debian DEB | >= 99% | >= 99% | >= 98% |
| npm | >= 97% | >= 98% | >= 95% |
| Maven | >= 98% | >= 99% | >= 96% |
| NuGet | >= 98% | >= 99% | >= 96% |
| PyPI | >= 96% | >= 98% | >= 94% |
| Go Modules | >= 97% | >= 98% | >= 95% |
| Cargo (Rust) | >= 98% | >= 99% | >= 96% |
---
## Vulnerability Detection Accuracy
### CVE Matching
| Metric | Formula | Target |
|--------|---------|--------|
| **CVE Recall** | Found CVEs / Actual CVEs | >= 95% |
| **CVE Precision** | Correct CVEs / Reported CVEs | >= 98% |
| **Version Range Accuracy** | Correct Affected / Total | >= 93% |
### False Positive Categories
Track and minimize specific FP types:
| FP Type | Description | Target Rate |
|---------|-------------|-------------|
| **Phantom Component** | CVE for component not present | <= 1% |
| **Version Mismatch** | CVE for wrong version | <= 3% |
| **Ecosystem Confusion** | Wrong package with same name | <= 1% |
| **Stale Advisory** | Already fixed but flagged | <= 2% |
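For illustration, these FP categories can be assigned by comparing a flagged finding against the ground truth SBOM and known fix state. The sketch below is a simplified decision order, not the production triage logic:

```python
def categorize_false_positive(finding: dict, truth_sbom: dict, fixed_cves: set) -> str:
    """finding: {"component", "ecosystem", "version", "cve"}.
    truth_sbom maps (component, ecosystem) -> installed version. Illustrative only."""
    key = (finding["component"], finding["ecosystem"])
    if finding["cve"] in fixed_cves:
        return "stale_advisory"  # already fixed but still flagged
    if key not in truth_sbom:
        # Same name in a different ecosystem => confusion; otherwise phantom.
        same_name_elsewhere = any(
            name == finding["component"] for (name, _eco) in truth_sbom
        )
        return "ecosystem_confusion" if same_name_elsewhere else "phantom_component"
    if truth_sbom[key] != finding["version"]:
        return "version_mismatch"  # CVE matched against the wrong version
    return "uncategorized"
```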
---
## Measurement Methodology
### Ground Truth Establishment
1. **Manual Curation**
   - Expert review of sample applications
   - Documented decision rationale
   - Multiple reviewer consensus
2. **Automated Verification**
   - Cross-reference with authoritative sources
   - NVD, OSV, GitHub Advisory Database
   - Vendor security bulletins
3. **Runtime Validation**
   - Dynamic analysis confirmation
   - Exploit proof-of-concept testing
   - Production monitoring correlation
### Test Corpus Requirements
| Category | Minimum Samples | Diversity Requirements |
|----------|-----------------|----------------------|
| Reachability | 50 per language | Mix of libraries, frameworks |
| SBOM | 100 images | All major ecosystems |
| CVE Detection | 500 CVEs | Mix of severities, ages |
| Performance | 10 reference images | Various sizes |
### Measurement Process
```
1. Select ground truth corpus
└── Minimum samples per category
└── Representative of production workloads
2. Run scanner with deterministic manifest
└── Fixed advisory database version
└── Reproducible configuration
3. Compare results to ground truth
└── Automated diff tooling
└── Manual review of discrepancies
4. Compute metrics per category
└── Generate confusion matrices
└── Calculate precision/recall/F1
5. Aggregate and publish
└── Per-ecosystem breakdown
└── Overall summary metrics
└── Trend analysis
```
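Step 3 of this process (comparing scanner output to ground truth) produces the confusion-matrix counts consumed by the metric formulas earlier in this document. A minimal sketch, assuming both sides are reduced to sets of finding identifiers:

```python
def confusion_counts(predicted_reachable: set, actual_reachable: set, universe: set) -> dict:
    """universe is the full set of evaluated findings (reachable or not)."""
    tp = len(predicted_reachable & actual_reachable)
    fp = len(predicted_reachable - actual_reachable)
    fn = len(actual_reachable - predicted_reachable)
    tn = len(universe - predicted_reachable - actual_reachable)
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}
```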
---
## Reporting Format
### Quarterly Benchmark Report
```json
{
"report_version": "1.0",
"scanner_version": "1.3.0",
"report_date": "2025-12-14",
"ground_truth_version": "2025-Q4",
"reachability": {
"overall": {
"precision": 0.91,
"recall": 0.86,
"f1": 0.88,
"samples": 450
},
"by_language": {
"java": {"precision": 0.93, "recall": 0.88, "f1": 0.90, "samples": 100},
"csharp": {"precision": 0.90, "recall": 0.85, "f1": 0.87, "samples": 80},
"go": {"precision": 0.89, "recall": 0.83, "f1": 0.86, "samples": 70}
}
},
"sbom": {
"component_recall": 0.98,
"component_precision": 0.99,
"version_accuracy": 0.96
},
"vulnerability": {
"cve_recall": 0.96,
"cve_precision": 0.98,
"false_positive_rate": 0.02
},
"lattice_states": {
"cr_accuracy": 0.96,
"sr_accuracy": 0.91,
"su_accuracy": 0.87
}
}
```
---
## Regression Detection
### Thresholds
A regression is flagged when:
| Metric | Regression Threshold | Action |
|--------|---------------------|--------|
| Precision | > 3% decrease | Block release |
| Recall | > 5% decrease | Block release |
| F1 | > 4% decrease | Block release |
| FPR | > 2% increase | Block release |
| Any metric | > 1% change | Investigate |
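A sketch of how these thresholds translate into a release gate. The actual check is performed by `stellaops benchmark compare` (shown below); this only illustrates the rule set, with deltas interpreted as absolute changes in the metric value:

```python
# Maximum allowed change before a release is blocked (from the table above).
BLOCKING_THRESHOLDS = {
    "precision": -0.03,  # > 3% decrease blocks
    "recall":    -0.05,  # > 5% decrease blocks
    "f1":        -0.04,  # > 4% decrease blocks
    "fpr":       +0.02,  # > 2% increase blocks
}

def blocking_regressions(baseline: dict, current: dict) -> list[str]:
    """Return the metrics whose change exceeds the blocking threshold."""
    blocked = []
    for metric, limit in BLOCKING_THRESHOLDS.items():
        delta = current[metric] - baseline[metric]
        if (limit < 0 and delta < limit) or (limit > 0 and delta > limit):
            blocked.append(metric)
    return blocked
```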
### CI Integration
```yaml
# .gitea/workflows/accuracy-check.yml
accuracy-benchmark:
runs-on: ubuntu-latest
steps:
- name: Run accuracy benchmark
run: make benchmark-accuracy
- name: Check for regressions
run: |
stellaops benchmark compare \
--baseline results/baseline.json \
--current results/current.json \
--threshold-precision 0.03 \
--threshold-recall 0.05 \
--fail-on-regression
```
---
## Ground Truth Sources
### Internal
- `datasets/reachability/samples/` - Reachability ground truth
- `datasets/sbom/reference/` - Known-good SBOMs
- `bench/findings/` - CVE finding ground truth
### External
- **NIST SARD** - Software Assurance Reference Dataset
- **OSV Test Suite** - Open Source Vulnerability test cases
- **OWASP Benchmark** - Security testing benchmark
- **Juliet Test Suite** - CWE coverage testing
---
## Improvement Tracking
### Gap Analysis
Identify and prioritize accuracy improvements:
| Gap | Current | Target | Priority | Improvement Plan |
|-----|---------|--------|----------|------------------|
| Python recall | 73% | 78% | High | Improve type inference |
| npm precision | 96% | 98% | Medium | Fix aliasing issues |
| Version accuracy | 94% | 96% | Medium | Better version parsing |
### Quarterly Goals
Track progress against improvement targets:
| Quarter | Focus Area | Metric | Target | Actual |
|---------|------------|--------|--------|--------|
| Q4 2025 | Java reachability | Recall | 88% | TBD |
| Q1 2026 | Python support | F1 | 80% | TBD |
| Q1 2026 | SBOM completeness | Recall | 99% | TBD |
---
## References
- [FIRST CVSS v4.0 Specification](https://www.first.org/cvss/v4.0/specification-document)
- [NIST NVD API](https://nvd.nist.gov/developers)
- [OSV Schema](https://ossf.github.io/osv-schema/)
- [StellaOps Reachability Architecture](../modules/scanner/reachability.md)

# Performance Baselines
## Overview
This document defines performance baselines for StellaOps scanner operations. All metrics are measured against reference images and workloads to ensure consistent, reproducible benchmarks.
**Last Updated:** 2025-12-14
**Next Review:** 2026-03-14
---
## Reference Images
Standard images used for performance benchmarking:
| Image | Size | Components | Expected Vulns | Category |
|-------|------|------------|----------------|----------|
| `alpine:3.19` | 7MB | ~15 | ~5 | Minimal |
| `debian:12-slim` | 75MB | ~90 | ~40 | Minimal |
| `ubuntu:22.04` | 77MB | ~100 | ~50 | Standard |
| `node:20-alpine` | 180MB | ~200 | ~100 | Application |
| `python:3.12` | 1GB | ~300 | ~150 | Application |
| `mcr.microsoft.com/dotnet/aspnet:8.0` | 220MB | ~150 | ~75 | Application |
| `nginx:1.25` | 190MB | ~120 | ~60 | Application |
| `postgres:16-alpine` | 240MB | ~140 | ~70 | Database |
---
## Scan Performance Targets
### Container Image Scanning
| Image Category | P50 Time | P95 Time | Max Memory | CPU Cores |
|---------------|----------|----------|------------|-----------|
| Minimal (<100MB) | < 5s | < 10s | < 256MB | 1 |
| Standard (100-500MB) | < 15s | < 30s | < 512MB | 2 |
| Large (500MB-2GB) | < 45s | < 90s | < 1.5GB | 2 |
| Very Large (>2GB) | < 120s | < 240s | < 2GB | 4 |
### Per-Image Targets
| Image | P50 Time | P95 Time | Max Memory |
|-------|----------|----------|------------|
| alpine:3.19 | < 3s | < 8s | < 200MB |
| debian:12-slim | < 8s | < 15s | < 300MB |
| ubuntu:22.04 | < 10s | < 20s | < 400MB |
| node:20-alpine | < 20s | < 40s | < 600MB |
| python:3.12 | < 35s | < 70s | < 1.2GB |
| dotnet/aspnet:8.0 | < 25s | < 50s | < 800MB |
| nginx:1.25 | < 18s | < 35s | < 500MB |
| postgres:16-alpine | < 22s | < 45s | < 600MB |
---
## Reachability Analysis Targets
### By Codebase Size
| Codebase Size | P50 Time | P95 Time | Memory | Notes |
|---------------|----------|----------|--------|-------|
| Tiny (<5k LOC) | < 10s | < 20s | < 256MB | Single service |
| Small (5-20k LOC) | < 30s | < 60s | < 512MB | Small service |
| Medium (20-50k LOC) | < 2min | < 4min | < 1GB | Typical microservice |
| Large (50-100k LOC) | < 5min | < 10min | < 2GB | Large service |
| Very Large (100-500k LOC) | < 15min | < 30min | < 4GB | Monolith |
| Monorepo (>500k LOC) | < 45min | < 90min | < 8GB | Enterprise monorepo |
### By Language
| Language | Relative Speed | Notes |
|----------|---------------|-------|
| Go | 1.0x (baseline) | Fast due to simple module system |
| Java | 1.2x | Maven/Gradle resolution adds overhead |
| C# | 1.3x | MSBuild/NuGet resolution |
| TypeScript | 1.5x | npm/yarn resolution, complex imports |
| Python | 1.8x | Virtual env resolution, dynamic imports |
| JavaScript | 2.0x | Complex bundler configurations |
---
## SBOM Generation Targets
| Format | P50 Time | P95 Time | Output Size | Notes |
|--------|----------|----------|-------------|-------|
| CycloneDX 1.6 (JSON) | < 1s | < 3s | ~50KB/100 components | Standard |
| CycloneDX 1.6 (XML) | < 1.5s | < 4s | ~80KB/100 components | Verbose |
| SPDX 3.0.1 (JSON) | < 1s | < 3s | ~60KB/100 components | Standard |
| SPDX 3.0.1 (Tag-Value) | < 1.2s | < 3.5s | ~70KB/100 components | Legacy format |
### Combined Operations
| Operation | P50 Time | P95 Time |
|-----------|----------|----------|
| Scan + SBOM | scan_time + 1s | scan_time + 3s |
| Scan + SBOM + Reachability | scan_time + reach_time + 2s | scan_time + reach_time + 5s |
| Full attestation pipeline | total_time + 2s | total_time + 5s |
---
## VEX Processing Targets
| Operation | P50 Time | P95 Time | Notes |
|-----------|----------|----------|-------|
| VEX document parsing | < 50ms | < 150ms | Per document |
| Lattice state computation | < 100ms | < 300ms | Per 100 vulnerabilities |
| VEX consensus merge | < 200ms | < 500ms | 3-5 sources |
| State transition | < 10ms | < 30ms | Single transition |
---
## CVSS Scoring Targets
| Operation | P50 Time | P95 Time | Notes |
|-----------|----------|----------|-------|
| MacroVector lookup | < 1μs | < 5μs | Dictionary lookup |
| CVSS v4.0 base score | < 10μs | < 50μs | Full computation |
| CVSS v4.0 full score | < 20μs | < 100μs | Base + threat + env |
| Vector parsing | < 5μs | < 20μs | String parsing |
| Receipt generation | < 100μs | < 500μs | Includes hashing |
| Batch scoring (100 vulns) | < 5ms | < 15ms | Parallel processing |
---
## Attestation Targets
| Operation | P50 Time | P95 Time | Notes |
|-----------|----------|----------|-------|
| DSSE envelope creation | < 50ms | < 150ms | Includes signing |
| DSSE verification | < 30ms | < 100ms | Signature check |
| Rekor submission | < 500ms | < 2s | Network dependent |
| Rekor verification | < 300ms | < 1s | Network dependent |
| in-toto predicate | < 20ms | < 80ms | JSON serialization |
---
## Database Operation Targets
| Operation | P50 Time | P95 Time | Notes |
|-----------|----------|----------|-------|
| Receipt insert | < 5ms | < 20ms | Single record |
| Receipt query (by ID) | < 2ms | < 10ms | Indexed lookup |
| Receipt query (by tenant) | < 10ms | < 50ms | Index scan |
| EPSS lookup (single) | < 1ms | < 5ms | Indexed |
| EPSS lookup (batch 100) | < 10ms | < 50ms | Batch query |
| Risk score insert | < 5ms | < 20ms | Single record |
| Risk score update | < 3ms | < 15ms | Single record |
---
## Regression Thresholds
Performance regression is detected when metrics exceed these thresholds compared to baseline:
| Metric | Warning Threshold | Blocking Threshold | Action |
|--------|------------------|-------------------|--------|
| P50 Time | > 15% increase | > 25% increase | Block release |
| P95 Time | > 20% increase | > 35% increase | Block release |
| Memory Usage | > 20% increase | > 30% increase | Block release |
| CPU Time | > 15% increase | > 25% increase | Investigate |
| Throughput | > 10% decrease | > 20% decrease | Block release |
### Regression Detection Rules
1. **Warning**: Alert engineering team, add to release notes
2. **Blocking**: Cannot merge/release until resolved or waived
3. **Waiver**: Requires documented justification and SME approval
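The warning/blocking tiers above can be expressed as a small classification step over the measured deltas. A sketch, assuming deltas are computed as relative change against the stored baseline:

```python
# (warning, blocking) relative-change thresholds per metric, from the table above.
# Positive values flag increases (time, memory); negative values flag decreases (throughput).
THRESHOLDS = {
    "p50_time":   (0.15, 0.25),
    "p95_time":   (0.20, 0.35),
    "memory":     (0.20, 0.30),
    "cpu_time":   (0.15, 0.25),
    "throughput": (-0.10, -0.20),
}

def classify(metric: str, baseline: float, current: float) -> str:
    """Classify a single metric's change as 'ok', 'warning', or 'blocking'."""
    warn, block = THRESHOLDS[metric]
    change = (current - baseline) / baseline
    if warn < 0:  # throughput-style metric: regressions are decreases
        change, warn, block = -change, -warn, -block
    if change > block:
        return "blocking"
    if change > warn:
        return "warning"
    return "ok"
```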
---
## Measurement Methodology
### Environment Setup
```bash
# Standard test environment
# - CPU: 8 cores (x86_64)
# - Memory: 16GB RAM
# - Storage: NVMe SSD
# - OS: Ubuntu 22.04 LTS
# - Docker: 24.x
# Clear caches before cold start tests (dropping page caches requires root)
docker system prune -af
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
```
### Scan Performance
```bash
# Cold start measurement
time stellaops scan --image alpine:3.19 --format json > /dev/null
# Warm cache measurement (run 3x, take average)
for i in {1..3}; do
time stellaops scan --image alpine:3.19 --format json > /dev/null
done
# Memory profiling
/usr/bin/time -v stellaops scan --image alpine:3.19 --format json 2>&1 | \
grep "Maximum resident set size"
# CPU profiling
perf stat stellaops scan --image alpine:3.19 --format json > /dev/null
```
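The P50/P95 figures in the target tables come from repeated timed runs. A small helper for reducing raw timings to those percentiles (a sketch; hyperfine and the in-repo Make targets produce equivalent summaries):

```python
import statistics

def summarize_runs(durations_ms: list[float]) -> dict:
    """Reduce repeated scan timings (milliseconds) to the reported percentiles."""
    ordered = sorted(durations_ms)
    # quantiles with n=20 yields 19 cut points; index 9 ~ P50, index 18 ~ P95.
    q = statistics.quantiles(ordered, n=20)
    return {
        "p50_ms": q[9],
        "p95_ms": q[18],
        "mean_ms": statistics.fmean(ordered),
        "runs": len(ordered),
    }

# Example: ten cold-start runs of the same image
print(summarize_runs([2750, 2810, 2790, 2900, 3050, 2880, 2840, 4100, 2990, 2830]))
```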
### Reachability Analysis
```bash
# Time measurement
time stellaops reach --project ./src --language csharp --out reach.json
# Memory profiling
/usr/bin/time -v stellaops reach --project ./src --language csharp --out reach.json 2>&1
# With detailed timing
stellaops reach --project ./src --language csharp --out reach.json --timing
```
### SBOM Generation
```bash
# Time measurement
time stellaops sbom --image node:20-alpine --format cyclonedx --out sbom.json
# Output size
stellaops sbom --image node:20-alpine --format cyclonedx --out sbom.json && \
ls -lh sbom.json
```
### Batch Operations
```bash
# Process multiple images in parallel
time stellaops scan --images images.txt --parallel 4 --format json --out-dir ./results
# Throughput test (images per minute)
START=$(date +%s)
for i in {1..10}; do
stellaops scan --image alpine:3.19 --format json > /dev/null
done
END=$(date +%s)
echo "Throughput: $(( 10 * 60 / (END - START) )) images/minute"
```
---
## CI Integration
### Benchmark Workflow
```yaml
# .gitea/workflows/performance-benchmark.yml
name: Performance Benchmark
on:
pull_request:
branches: [main]
schedule:
- cron: '0 2 * * 1' # Weekly Monday 2am
jobs:
benchmark:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run benchmarks
run: make benchmark-performance
- name: Check for regressions
run: |
stellaops benchmark compare \
--baseline results/baseline.json \
--current results/current.json \
--threshold-p50 0.15 \
--threshold-p95 0.20 \
--threshold-memory 0.20 \
--fail-on-regression
- name: Upload results
uses: actions/upload-artifact@v4
with:
name: benchmark-results
path: results/
```
### Local Testing
```bash
# Run full benchmark suite
make benchmark-performance
# Run specific image benchmark
make benchmark-image IMAGE=alpine:3.19
# Generate baseline
make benchmark-baseline
# Compare against baseline
make benchmark-compare
```
---
## Optimization Guidelines
### For Scan Performance
1. **Pre-pull images** for consistent timing
2. **Use layered caching** for repeat scans
3. **Enable parallel analysis** for multi-ecosystem images
4. **Consider selective scanning** for known-safe layers
### For Reachability
1. **Incremental analysis** for unchanged files
2. **Cache resolved dependencies**
3. **Use language-specific optimizations** (e.g., Roslyn for C#)
4. **Limit call graph depth** for very large codebases
### For Memory
1. **Stream large SBOMs** instead of loading fully
2. **Use batched database operations**
3. **Release intermediate data structures early**
4. **Configure GC appropriately for workload**
---
## Historical Baselines
### Version History
| Version | Date | P50 Scan (alpine) | P50 Reach (50k LOC) | Notes |
|---------|------|-------------------|---------------------|-------|
| 1.3.0 | 2025-12-14 | TBD | TBD | Current |
| 1.2.0 | 2025-09-01 | TBD | TBD | Previous |
| 1.1.0 | 2025-06-01 | TBD | TBD | Baseline |
### Improvement Targets
| Quarter | Focus Area | Target | Status |
|---------|------------|--------|--------|
| Q1 2026 | Scan cold start | -20% | Planned |
| Q1 2026 | Reachability memory | -15% | Planned |
| Q2 2026 | SBOM generation | -10% | Planned |
---
## References
- [Accuracy Metrics Framework](accuracy-metrics-framework.md)
- [Benchmark Submission Guide](submission-guide.md) (pending)
- [Scanner Architecture](../modules/scanner/architecture.md)
- [Reachability Module](../modules/scanner/reachability.md)

# Benchmark Submission Guide
**Last Updated:** 2025-12-14
**Next Review:** 2026-03-14
---
## Overview
StellaOps publishes benchmarks for:
- **Reachability Analysis** - Accuracy of static and runtime path detection
- **SBOM Completeness** - Component detection and version accuracy
- **Vulnerability Detection** - Precision, recall, and F1 scores
- **Scan Performance** - Time, memory, and CPU metrics
- **Determinism** - Reproducibility of scan outputs
This guide explains how to reproduce, validate, and submit benchmark results.
---
## 1. PREREQUISITES
### 1.1 System Requirements
| Requirement | Minimum | Recommended |
|-------------|---------|-------------|
| CPU | 4 cores | 8 cores |
| Memory | 8 GB | 16 GB |
| Storage | 50 GB SSD | 100 GB NVMe |
| OS | Ubuntu 22.04 LTS | Ubuntu 22.04 LTS |
| Docker | 24.x | 24.x |
| .NET | 10.0 | 10.0 |
### 1.2 Environment Setup
```bash
# Clone the repository
git clone https://git.stella-ops.org/stella-ops.org/git.stella-ops.org.git
cd git.stella-ops.org
# Install .NET 10 SDK
sudo apt-get update
sudo apt-get install -y dotnet-sdk-10.0
# Install Docker (if not present)
curl -fsSL https://get.docker.com | sh
# Install benchmark dependencies
sudo apt-get install -y \
jq \
b3sum \
hyperfine \
time
# Set determinism environment variables
export TZ=UTC
export LC_ALL=C
export STELLAOPS_DETERMINISM_SEED=42
export STELLAOPS_DETERMINISM_TIMESTAMP="2025-01-01T00:00:00Z"
```
### 1.3 Pull Reference Images
```bash
# Download standard benchmark images
make benchmark-pull-images
# Or manually:
docker pull alpine:3.19
docker pull debian:12-slim
docker pull ubuntu:22.04
docker pull node:20-alpine
docker pull python:3.12
docker pull mcr.microsoft.com/dotnet/aspnet:8.0
docker pull nginx:1.25
docker pull postgres:16-alpine
```
---
## 2. RUNNING BENCHMARKS
### 2.1 Full Benchmark Suite
```bash
# Run all benchmarks (takes ~30-60 minutes)
make benchmark-all
# Output: results/benchmark-all-$(date +%Y%m%d).json
```
### 2.2 Category-Specific Benchmarks
#### Reachability Benchmark
```bash
# Run reachability accuracy benchmarks
make benchmark-reachability
# With specific language filter
make benchmark-reachability LANG=csharp
# Output: results/reachability/benchmark-reachability-$(date +%Y%m%d).json
```
#### Performance Benchmark
```bash
# Run scan performance benchmarks
make benchmark-performance
# Single image
make benchmark-image IMAGE=alpine:3.19
# Output: results/performance/benchmark-performance-$(date +%Y%m%d).json
```
#### SBOM Benchmark
```bash
# Run SBOM completeness benchmarks
make benchmark-sbom
# Specific format
make benchmark-sbom FORMAT=cyclonedx
# Output: results/sbom/benchmark-sbom-$(date +%Y%m%d).json
```
#### Determinism Benchmark
```bash
# Run determinism verification
make benchmark-determinism
# Output: results/determinism/benchmark-determinism-$(date +%Y%m%d).json
```
### 2.3 CLI Benchmark Commands
```bash
# Performance timing with hyperfine (10 runs)
hyperfine --warmup 2 --runs 10 \
'stellaops scan --image alpine:3.19 --format json --output /dev/null'
# Memory profiling
/usr/bin/time -v stellaops scan --image alpine:3.19 --format json 2>&1 | \
grep "Maximum resident set size"
# CPU profiling (Linux)
perf stat stellaops scan --image alpine:3.19 --format json > /dev/null
# Determinism check (run twice, compare hashes)
stellaops scan --image alpine:3.19 --format json | sha256sum > run1.sha
stellaops scan --image alpine:3.19 --format json | sha256sum > run2.sha
diff run1.sha run2.sha && echo "DETERMINISTIC" || echo "NON-DETERMINISTIC"
```
---
## 3. OUTPUT FORMATS
### 3.1 Reachability Results Schema
```json
{
"benchmark": "reachability-v1",
"date": "2025-12-14T00:00:00Z",
"scanner_version": "1.3.0",
"scanner_commit": "abc123def",
"environment": {
"os": "ubuntu-22.04",
"arch": "amd64",
"cpu": "Intel Xeon E-2288G",
"memory_gb": 16
},
"summary": {
"total_samples": 200,
"precision": 0.92,
"recall": 0.87,
"f1": 0.894,
"false_positive_rate": 0.08,
"false_negative_rate": 0.13
},
"by_language": {
"java": {
"samples": 50,
"precision": 0.94,
"recall": 0.88,
"f1": 0.909,
"confusion_matrix": {
"tp": 44, "fp": 3, "tn": 2, "fn": 1
}
},
"csharp": {
"samples": 50,
"precision": 0.91,
"recall": 0.86,
"f1": 0.884,
"confusion_matrix": {
"tp": 43, "fp": 4, "tn": 2, "fn": 1
}
},
"typescript": {
"samples": 50,
"precision": 0.89,
"recall": 0.84,
"f1": 0.864,
"confusion_matrix": {
"tp": 42, "fp": 5, "tn": 2, "fn": 1
}
},
"python": {
"samples": 50,
"precision": 0.88,
"recall": 0.83,
"f1": 0.854,
"confusion_matrix": {
"tp": 41, "fp": 5, "tn": 3, "fn": 1
}
}
},
"ground_truth_ref": "datasets/reachability/v2025.12",
"raw_results_ref": "results/reachability/raw/2025-12-14/"
}
```
### 3.2 Performance Results Schema
```json
{
"benchmark": "performance-v1",
"date": "2025-12-14T00:00:00Z",
"scanner_version": "1.3.0",
"scanner_commit": "abc123def",
"environment": {
"os": "ubuntu-22.04",
"arch": "amd64",
"cpu": "Intel Xeon E-2288G",
"memory_gb": 16,
"storage": "nvme"
},
"images": [
{
"image": "alpine:3.19",
"size_mb": 7,
"components": 15,
"vulnerabilities": 5,
"runs": 10,
"cold_start": {
"p50_ms": 2800,
"p95_ms": 4200,
"mean_ms": 3100
},
"warm_cache": {
"p50_ms": 1500,
"p95_ms": 2100,
"mean_ms": 1650
},
"memory_peak_mb": 180,
"cpu_time_ms": 1200
},
{
"image": "python:3.12",
"size_mb": 1024,
"components": 300,
"vulnerabilities": 150,
"runs": 10,
"cold_start": {
"p50_ms": 32000,
"p95_ms": 48000,
"mean_ms": 35000
},
"warm_cache": {
"p50_ms": 18000,
"p95_ms": 25000,
"mean_ms": 19500
},
"memory_peak_mb": 1100,
"cpu_time_ms": 28000
}
],
"aggregated": {
"total_images": 8,
"total_runs": 80,
"avg_time_per_mb_ms": 35,
"avg_memory_per_component_kb": 400
}
}
```
### 3.3 SBOM Results Schema
```json
{
"benchmark": "sbom-v1",
"date": "2025-12-14T00:00:00Z",
"scanner_version": "1.3.0",
"summary": {
"total_images": 8,
"component_recall": 0.98,
"component_precision": 0.995,
"version_accuracy": 0.96
},
"by_ecosystem": {
"apk": {
"ground_truth_components": 100,
"detected_components": 99,
"correct_versions": 96,
"recall": 0.99,
"precision": 0.99,
"version_accuracy": 0.96
},
"npm": {
"ground_truth_components": 500,
"detected_components": 492,
"correct_versions": 475,
"recall": 0.984,
"precision": 0.998,
"version_accuracy": 0.965
}
},
"formats_tested": ["cyclonedx-1.6", "spdx-3.0.1"]
}
```
### 3.4 Determinism Results Schema
```json
{
"benchmark": "determinism-v1",
"date": "2025-12-14T00:00:00Z",
"scanner_version": "1.3.0",
"summary": {
"total_runs": 100,
"bitwise_identical": 100,
"bitwise_fidelity": 1.0,
"semantic_identical": 100,
"semantic_fidelity": 1.0
},
"by_image": {
"alpine:3.19": {
"runs": 20,
"bitwise_identical": 20,
"output_hash": "sha256:abc123..."
},
"python:3.12": {
"runs": 20,
"bitwise_identical": 20,
"output_hash": "sha256:def456..."
}
},
"seed": 42,
"timestamp_frozen": "2025-01-01T00:00:00Z"
}
```
---
## 4. SUBMISSION PROCESS
### 4.1 Internal Submission (StellaOps Team)
Benchmark results are automatically collected by CI:
```yaml
# .gitea/workflows/weekly-benchmark.yml triggers:
# - Weekly benchmark runs
# - Results stored in internal dashboard
# - Regression detection against baselines
```
Manual submission:
```bash
# Upload to internal dashboard
make benchmark-submit
# Or via CLI
stellaops benchmark submit \
--file results/benchmark-all-20251214.json \
--dashboard internal
```
### 4.2 External Validation Submission
Third parties can validate and submit benchmark results:
#### Step 1: Fork and Clone
```bash
# Fork the benchmark repository
# https://git.stella-ops.org/stella-ops.org/benchmarks
git clone https://git.stella-ops.org/<your-org>/benchmarks.git
cd benchmarks
```
#### Step 2: Run Benchmarks
```bash
# With StellaOps scanner
make benchmark-all SCANNER=stellaops
# Or with your own tool for comparison
make benchmark-all SCANNER=your-tool
```
#### Step 3: Prepare Submission
```bash
# Results directory structure
mkdir -p submissions/<your-org>/<date>
# Copy results
cp results/*.json submissions/<your-org>/<date>/
# Add reproduction README
cat > submissions/<your-org>/<date>/README.md <<EOF
# Benchmark Results: <Your Org>
**Date:** $(date -u +%Y-%m-%d)
**Scanner:** <tool-name>
**Version:** <version>
## Environment
- OS: <os>
- CPU: <cpu>
- Memory: <memory>
## Reproduction Steps
<steps>
## Notes
<any observations>
EOF
```
#### Step 4: Submit Pull Request
```bash
git checkout -b benchmark-results-$(date +%Y%m%d)
git add submissions/
git commit -m "Add benchmark results from <your-org> $(date +%Y-%m-%d)"
git push origin benchmark-results-$(date +%Y%m%d)
# Create PR via web interface or gh CLI
gh pr create --title "Benchmark: <your-org> $(date +%Y-%m-%d)" \
--body "Benchmark results for external validation"
```
### 4.3 Submission Review Process
| Step | Action | Timeline |
|------|--------|----------|
| 1 | PR submitted | Day 0 |
| 2 | Automated validation runs | Day 0 (CI) |
| 3 | Maintainer review | Day 1-3 |
| 4 | Results published (if valid) | Day 3-5 |
| 5 | Dashboard updated | Day 5 |
---
## 5. BENCHMARK CATEGORIES
### 5.1 Reachability Benchmark
**Purpose:** Measure accuracy of static and runtime reachability analysis.
**Ground Truth Source:** `datasets/reachability/`
**Test Cases:**
- 50+ samples per language (Java, C#, TypeScript, Python, Go)
- Known-reachable vulnerable paths
- Known-unreachable vulnerable code
- Runtime-only reachable code
**Scoring:**
```
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 * (Precision * Recall) / (Precision + Recall)
```
**Targets:**
| Metric | Target | Blocking |
|--------|--------|----------|
| Precision | >= 90% | >= 85% |
| Recall | >= 85% | >= 80% |
| F1 | >= 87% | >= 82% |
### 5.2 Performance Benchmark
**Purpose:** Measure scan time, memory usage, and CPU utilization.
**Reference Images:** See [Performance Baselines](performance-baselines.md)
**Metrics:**
- P50/P95 scan time (cold and warm)
- Peak memory usage
- CPU time
- Throughput (images/minute)
**Targets:**
| Image Category | P50 Time | P95 Time | Max Memory |
|----------------|----------|----------|------------|
| Minimal (<100MB) | < 5s | < 10s | < 256MB |
| Standard (100-500MB) | < 15s | < 30s | < 512MB |
| Large (500MB-2GB) | < 45s | < 90s | < 1.5GB |
### 5.3 SBOM Benchmark
**Purpose:** Measure component detection completeness and accuracy.
**Ground Truth Source:** Manual SBOM audits of reference images.
**Metrics:**
- Component recall (found / total)
- Component precision (real / reported)
- Version accuracy (correct / total)
**Targets:**
| Metric | Target |
|--------|--------|
| Component Recall | >= 98% |
| Component Precision | >= 99% |
| Version Accuracy | >= 95% |
### 5.4 Vulnerability Detection Benchmark
**Purpose:** Measure CVE detection accuracy against known-vulnerable images.
**Ground Truth Source:** `datasets/vulns/` curated CVE lists.
**Metrics:**
- True positive rate
- False positive rate
- False negative rate
- Precision/Recall/F1
**Targets:**
| Metric | Target |
|--------|--------|
| Precision | >= 95% |
| Recall | >= 90% |
| F1 | >= 92% |
### 5.5 Determinism Benchmark
**Purpose:** Verify reproducible scan outputs.
**Methodology:**
1. Run same scan N times (default: 20)
2. Compare output hashes
3. Calculate bitwise fidelity (see the sketch below)
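Bitwise fidelity from step 3 is simply the share of runs whose output hash matches the most common hash, consistent with the determinism results schema above. A sketch under that assumption:

```python
from collections import Counter

def bitwise_fidelity(output_hashes: list[str]) -> float:
    """Fraction of runs whose output hash equals the modal hash (1.0 = fully deterministic)."""
    if not output_hashes:
        return 0.0
    most_common_count = Counter(output_hashes).most_common(1)[0][1]
    return most_common_count / len(output_hashes)

# 20 runs, all identical => 1.0
print(bitwise_fidelity(["sha256:abc123"] * 20))
```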
**Targets:**
| Metric | Target |
|--------|--------|
| Bitwise Fidelity | 100% |
| Semantic Fidelity | 100% |
---
## 6. COMPARING RESULTS
### 6.1 Against Baselines
```bash
# Compare current run against stored baseline
stellaops benchmark compare \
--baseline results/baseline/2025-Q4.json \
--current results/benchmark-all-20251214.json \
--threshold-p50 0.15 \
--threshold-precision 0.02 \
--fail-on-regression
# Output:
# Performance: PASS (P50 within 15% of baseline)
# Accuracy: PASS (Precision within 2% of baseline)
# Determinism: PASS (100% fidelity)
```
### 6.2 Against Other Tools
```bash
# Generate comparison report
stellaops benchmark compare-tools \
--stellaops results/stellaops/2025-12-14.json \
--trivy results/trivy/2025-12-14.json \
--grype results/grype/2025-12-14.json \
--output comparison-report.html
```
### 6.3 Historical Trends
```bash
# Generate trend report (last 12 months)
stellaops benchmark trend \
--period 12m \
--metrics precision,recall,p50_time \
--output trend-report.html
```
---
## 7. TROUBLESHOOTING
### 7.1 Common Issues
| Issue | Cause | Resolution |
|-------|-------|------------|
| Non-deterministic output | Locale not set | Set `LC_ALL=C` |
| Memory OOM | Large image | Increase memory limit |
| Slow performance | Cold cache | Pre-pull images |
| Missing components | Ecosystem not supported | Check supported ecosystems |
### 7.2 Debug Mode
```bash
# Enable verbose benchmark logging
make benchmark-all DEBUG=1
# Enable timing breakdown
export STELLAOPS_BENCHMARK_TIMING=1
make benchmark-performance
```
### 7.3 Validation Failures
```bash
# Check result schema validity
stellaops benchmark validate --file results/benchmark-all.json
# Check against ground truth
stellaops benchmark validate-ground-truth \
--results results/reachability.json \
--ground-truth datasets/reachability/v2025.12
```
---
## 8. REFERENCES
- [Performance Baselines](performance-baselines.md)
- [Accuracy Metrics Framework](accuracy-metrics-framework.md)
- [Offline Parity Verification](../airgap/offline-parity-verification.md)
- [Determinism CI Harness](../modules/scanner/design/determinism-ci-harness.md)
- [Ground Truth Datasets](../datasets/README.md)
---
**Document Version**: 1.0
**Target Platform**: .NET 10, PostgreSQL >=16