docs/benchmarks/accuracy-metrics-framework.md

# Accuracy Metrics Framework

## Overview

This document defines the accuracy metrics framework used to measure and track StellaOps scanner performance. All metrics are computed against ground-truth datasets and published quarterly.

## Metric Definitions

### Confusion Matrix

For binary classification tasks (e.g., reachable vs unreachable):

| | Predicted Positive | Predicted Negative |
|--|-------------------|-------------------|
| **Actual Positive** | True Positive (TP) | False Negative (FN) |
| **Actual Negative** | False Positive (FP) | True Negative (TN) |

### Core Metrics

| Metric | Formula | Description | Target |
|--------|---------|-------------|--------|
| **Precision** | TP / (TP + FP) | Of items flagged, how many were correct | >= 90% |
| **Recall** | TP / (TP + FN) | Of actual positives, how many were found | >= 85% |
| **F1 Score** | 2 * (P * R) / (P + R) | Harmonic mean of precision and recall | >= 87% |
| **False Positive Rate** | FP / (FP + TN) | Rate of incorrect positive flags | <= 10% |
| **Accuracy** | (TP + TN) / Total | Overall correctness | >= 90% |
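
The formulas in this table map directly to code. A minimal Python helper, illustrative only and not part of the scanner CLI:

```python
def accuracy_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the core metrics from a binary confusion matrix."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "false_positive_rate": fpr, "accuracy": accuracy}

# Example: 90 TP, 5 FP, 100 TN, 10 FN
m = accuracy_metrics(tp=90, fp=5, tn=100, fn=10)
print({k: round(v, 3) for k, v in m.items()})
```

The guarded divisions matter in practice: small ground-truth slices can produce empty denominators.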

---

## Reachability Analysis Accuracy

### Definitions

- **True Positive (TP)**: Correctly identified as reachable (code path actually exists)
- **False Positive (FP)**: Incorrectly identified as reachable (no real code path)
- **True Negative (TN)**: Correctly identified as unreachable (no code path exists)
- **False Negative (FN)**: Incorrectly identified as unreachable (code path exists but missed)

### Target Metrics

| Metric | Target | Stretch Goal |
|--------|--------|--------------|
| Precision | >= 90% | >= 95% |
| Recall | >= 85% | >= 90% |
| F1 Score | >= 87% | >= 92% |
| False Positive Rate | <= 10% | <= 5% |

### Per-Language Targets

| Language | Precision | Recall | F1 | Notes |
|----------|-----------|--------|-----|-------|
| Java | >= 92% | >= 88% | >= 90% | Strong static analysis support |
| C# | >= 90% | >= 85% | >= 87% | Roslyn-based analysis |
| Go | >= 88% | >= 82% | >= 85% | Good call graph support |
| JavaScript | >= 85% | >= 78% | >= 81% | Dynamic typing challenges |
| Python | >= 83% | >= 75% | >= 79% | Dynamic typing challenges |
| TypeScript | >= 88% | >= 82% | >= 85% | Better than JS due to types |

---

## Lattice State Accuracy

VEX lattice states have different confidence requirements:

| State | Definition | Target Accuracy | Validation |
|-------|------------|-----------------|------------|
| **CR** (Confirmed Reachable) | Runtime evidence + static path | >= 95% | Runtime trace verification |
| **SR** (Static Reachable) | Static path only | >= 90% | Static analysis coverage |
| **SU** (Static Unreachable) | No static path found | >= 85% | Negative proof verification |
| **DT** (Denied by Tool) | Tool analysis confirms not affected | >= 90% | Tool output validation |
| **DV** (Denied by Vendor) | Vendor VEX statement | >= 95% | VEX signature verification |
| **U** (Unknown) | Insufficient evidence | Track % | Minimize unknowns |

### Lattice Transition Accuracy

Measure accuracy of automatic state transitions:

| Transition | Trigger | Target Accuracy |
|------------|---------|-----------------|
| U -> SR | Static analysis finds path | >= 90% |
| SR -> CR | Runtime evidence added | >= 95% |
| U -> SU | Static analysis proves unreachable | >= 85% |
| SR -> DT | Tool-specific analysis | >= 90% |
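
The transition table reads naturally as an allow-list. A minimal Python sketch, under the assumption (hypothetical) that only the four automatic transitions listed here are valid; the scanner's real lattice may permit more:

```python
# Automatic VEX lattice transitions taken from the table above (assumed exhaustive).
ALLOWED_TRANSITIONS = {
    ("U", "SR"),   # static analysis finds a path
    ("SR", "CR"),  # runtime evidence added
    ("U", "SU"),   # static analysis proves unreachable
    ("SR", "DT"),  # tool-specific analysis
}

def can_transition(current: str, proposed: str) -> bool:
    """Return True if the automatic state transition is permitted."""
    return (current, proposed) in ALLOWED_TRANSITIONS

print(can_transition("U", "SR"))
print(can_transition("CR", "U"))
```

Keeping the allow-list explicit makes transition-accuracy measurement straightforward: every observed transition either matches an entry or is flagged for review.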

---

## SBOM Completeness Metrics

### Component Detection

| Metric | Formula | Target | Notes |
|--------|---------|--------|-------|
| **Component Recall** | Found / Total Actual | >= 98% | Find all real components |
| **Component Precision** | Real / Reported | >= 99% | Minimize phantom components |
| **Version Accuracy** | Correct Versions / Total | >= 95% | Version string correctness |
| **License Accuracy** | Correct Licenses / Total | >= 90% | License detection accuracy |

### Per-Ecosystem Targets

| Ecosystem | Comp. Recall | Comp. Precision | Version Acc. |
|-----------|--------------|-----------------|--------------|
| Alpine APK | >= 99% | >= 99% | >= 98% |
| Debian DEB | >= 99% | >= 99% | >= 98% |
| npm | >= 97% | >= 98% | >= 95% |
| Maven | >= 98% | >= 99% | >= 96% |
| NuGet | >= 98% | >= 99% | >= 96% |
| PyPI | >= 96% | >= 98% | >= 94% |
| Go Modules | >= 97% | >= 98% | >= 95% |
| Cargo (Rust) | >= 98% | >= 99% | >= 96% |
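
Component recall and precision reduce to set comparisons between a ground-truth SBOM and the scanner's output. A minimal sketch; the (name, version) tuple used for component identity here is an illustrative simplification (real matching is typically purl-based):

```python
def sbom_completeness(ground_truth: set, reported: set) -> dict:
    """Compare reported components against a ground-truth component set."""
    found = ground_truth & reported      # real components that were detected
    phantom = reported - ground_truth    # reported but not actually present
    recall = len(found) / len(ground_truth) if ground_truth else 1.0
    precision = len(found) / len(reported) if reported else 1.0
    return {"recall": recall, "precision": precision, "phantom": sorted(phantom)}

# Tiny made-up example: three real components, one phantom in the scan output
truth = {("musl", "1.2.4"), ("zlib", "1.3"), ("busybox", "1.36.1")}
scan = {("musl", "1.2.4"), ("zlib", "1.3"), ("openssl", "3.1.0")}
print(sbom_completeness(truth, scan))
```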

---

## Vulnerability Detection Accuracy

### CVE Matching

| Metric | Formula | Target |
|--------|---------|--------|
| **CVE Recall** | Found CVEs / Actual CVEs | >= 95% |
| **CVE Precision** | Correct CVEs / Reported CVEs | >= 98% |
| **Version Range Accuracy** | Correct Affected / Total | >= 93% |

### False Positive Categories

Track and minimize specific FP types:

| FP Type | Description | Target Rate |
|---------|-------------|-------------|
| **Phantom Component** | CVE for component not present | <= 1% |
| **Version Mismatch** | CVE for wrong version | <= 3% |
| **Ecosystem Confusion** | Wrong package with same name | <= 1% |
| **Stale Advisory** | Already fixed but flagged | <= 2% |

---

## Measurement Methodology

### Ground Truth Establishment

1. **Manual Curation**
   - Expert review of sample applications
   - Documented decision rationale
   - Multiple reviewer consensus

2. **Automated Verification**
   - Cross-reference with authoritative sources
   - NVD, OSV, GitHub Advisory Database
   - Vendor security bulletins

3. **Runtime Validation**
   - Dynamic analysis confirmation
   - Exploit proof-of-concept testing
   - Production monitoring correlation

### Test Corpus Requirements

| Category | Minimum Samples | Diversity Requirements |
|----------|-----------------|----------------------|
| Reachability | 50 per language | Mix of libraries, frameworks |
| SBOM | 100 images | All major ecosystems |
| CVE Detection | 500 CVEs | Mix of severities, ages |
| Performance | 10 reference images | Various sizes |

### Measurement Process

```
1. Select ground truth corpus
   └── Minimum samples per category
   └── Representative of production workloads

2. Run scanner with deterministic manifest
   └── Fixed advisory database version
   └── Reproducible configuration

3. Compare results to ground truth
   └── Automated diff tooling
   └── Manual review of discrepancies

4. Compute metrics per category
   └── Generate confusion matrices
   └── Calculate precision/recall/F1

5. Aggregate and publish
   └── Per-ecosystem breakdown
   └── Overall summary metrics
   └── Trend analysis
```

---

## Reporting Format

### Quarterly Benchmark Report

```json
{
  "report_version": "1.0",
  "scanner_version": "1.3.0",
  "report_date": "2025-12-14",
  "ground_truth_version": "2025-Q4",

  "reachability": {
    "overall": {
      "precision": 0.91,
      "recall": 0.86,
      "f1": 0.88,
      "samples": 450
    },
    "by_language": {
      "java": {"precision": 0.93, "recall": 0.88, "f1": 0.90, "samples": 100},
      "csharp": {"precision": 0.90, "recall": 0.85, "f1": 0.87, "samples": 80},
      "go": {"precision": 0.89, "recall": 0.83, "f1": 0.86, "samples": 70}
    }
  },

  "sbom": {
    "component_recall": 0.98,
    "component_precision": 0.99,
    "version_accuracy": 0.96
  },

  "vulnerability": {
    "cve_recall": 0.96,
    "cve_precision": 0.98,
    "false_positive_rate": 0.02
  },

  "lattice_states": {
    "cr_accuracy": 0.96,
    "sr_accuracy": 0.91,
    "su_accuracy": 0.87
  }
}
```

---

## Regression Detection

### Thresholds

A regression is flagged when:

| Metric | Regression Threshold | Action |
|--------|---------------------|--------|
| Precision | > 3% decrease | Block release |
| Recall | > 5% decrease | Block release |
| F1 | > 4% decrease | Block release |
| FPR | > 2% increase | Block release |
| Any metric | > 1% change | Investigate |

### CI Integration

```yaml
# .gitea/workflows/accuracy-check.yml
accuracy-benchmark:
  runs-on: ubuntu-latest
  steps:
    - name: Run accuracy benchmark
      run: make benchmark-accuracy

    - name: Check for regressions
      run: |
        stellaops benchmark compare \
          --baseline results/baseline.json \
          --current results/current.json \
          --threshold-precision 0.03 \
          --threshold-recall 0.05 \
          --fail-on-regression
```
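
Outside CI, the comparison logic is small enough to sketch. This is illustrative only, interpreting the thresholds as absolute percentage-point deltas (matching the CLI flags such as `--threshold-precision 0.03`); the `stellaops benchmark compare` command remains the authoritative implementation:

```python
# Blocking thresholds from the table above, as absolute percentage-point drops.
BLOCKING = {"precision": 0.03, "recall": 0.05, "f1": 0.04}
FPR_INCREASE_LIMIT = 0.02   # FPR blocks on increase, not decrease
INVESTIGATE_DELTA = 0.01    # any metric moving more than 1 point is investigated

def check_regressions(baseline: dict, current: dict) -> dict:
    """Classify each metric delta as 'block', 'investigate', or 'ok'."""
    verdicts = {}
    for metric, limit in BLOCKING.items():
        drop = baseline[metric] - current[metric]
        if drop > limit:
            verdicts[metric] = "block"
        elif abs(drop) > INVESTIGATE_DELTA:
            verdicts[metric] = "investigate"
        else:
            verdicts[metric] = "ok"
    rise = current["fpr"] - baseline["fpr"]
    if rise > FPR_INCREASE_LIMIT:
        verdicts["fpr"] = "block"
    elif abs(rise) > INVESTIGATE_DELTA:
        verdicts["fpr"] = "investigate"
    else:
        verdicts["fpr"] = "ok"
    return verdicts

baseline = {"precision": 0.92, "recall": 0.87, "f1": 0.894, "fpr": 0.08}
current = {"precision": 0.88, "recall": 0.865, "f1": 0.872, "fpr": 0.085}
print(check_regressions(baseline, current))
```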

---

## Ground Truth Sources

### Internal

- `datasets/reachability/samples/` - Reachability ground truth
- `datasets/sbom/reference/` - Known-good SBOMs
- `bench/findings/` - CVE finding ground truth

### External

- **NIST SARD** - Software Assurance Reference Dataset
- **OSV Test Suite** - Open Source Vulnerability test cases
- **OWASP Benchmark** - Security testing benchmark
- **Juliet Test Suite** - CWE coverage testing

---

## Improvement Tracking

### Gap Analysis

Identify and prioritize accuracy improvements:

| Gap | Current | Target | Priority | Improvement Plan |
|-----|---------|--------|----------|------------------|
| Python recall | 73% | 78% | High | Improve type inference |
| npm precision | 96% | 98% | Medium | Fix aliasing issues |
| Version accuracy | 94% | 96% | Medium | Better version parsing |

### Quarterly Goals

Track progress against improvement targets:

| Quarter | Focus Area | Metric | Target | Actual |
|---------|------------|--------|--------|--------|
| Q4 2025 | Java reachability | Recall | 88% | TBD |
| Q1 2026 | Python support | F1 | 80% | TBD |
| Q1 2026 | SBOM completeness | Recall | 99% | TBD |

---

## References

- [FIRST CVSS v4.0 Specification](https://www.first.org/cvss/v4.0/specification-document)
- [NIST NVD API](https://nvd.nist.gov/developers)
- [OSV Schema](https://ossf.github.io/osv-schema/)
- [StellaOps Reachability Architecture](../modules/scanner/reachability.md)

docs/benchmarks/performance-baselines.md

# Performance Baselines

## Overview

This document defines performance baselines for StellaOps scanner operations. All metrics are measured against reference images and workloads to ensure consistent, reproducible benchmarks.

**Last Updated:** 2025-12-14
**Next Review:** 2026-03-14

---

## Reference Images

Standard images used for performance benchmarking:

| Image | Size | Components | Expected Vulns | Category |
|-------|------|------------|----------------|----------|
| `alpine:3.19` | 7MB | ~15 | ~5 | Minimal |
| `debian:12-slim` | 75MB | ~90 | ~40 | Minimal |
| `ubuntu:22.04` | 77MB | ~100 | ~50 | Standard |
| `node:20-alpine` | 180MB | ~200 | ~100 | Application |
| `python:3.12` | 1GB | ~300 | ~150 | Application |
| `mcr.microsoft.com/dotnet/aspnet:8.0` | 220MB | ~150 | ~75 | Application |
| `nginx:1.25` | 190MB | ~120 | ~60 | Application |
| `postgres:16-alpine` | 240MB | ~140 | ~70 | Database |

---

## Scan Performance Targets

### Container Image Scanning

| Image Category | P50 Time | P95 Time | Max Memory | CPU Cores |
|---------------|----------|----------|------------|-----------|
| Minimal (<100MB) | < 5s | < 10s | < 256MB | 1 |
| Standard (100-500MB) | < 15s | < 30s | < 512MB | 2 |
| Large (500MB-2GB) | < 45s | < 90s | < 1.5GB | 2 |
| Very Large (>2GB) | < 120s | < 240s | < 2GB | 4 |

### Per-Image Targets

| Image | P50 Time | P95 Time | Max Memory |
|-------|----------|----------|------------|
| alpine:3.19 | < 3s | < 8s | < 200MB |
| debian:12-slim | < 8s | < 15s | < 300MB |
| ubuntu:22.04 | < 10s | < 20s | < 400MB |
| node:20-alpine | < 20s | < 40s | < 600MB |
| python:3.12 | < 35s | < 70s | < 1.2GB |
| dotnet/aspnet:8.0 | < 25s | < 50s | < 800MB |
| nginx:1.25 | < 18s | < 35s | < 500MB |
| postgres:16-alpine | < 22s | < 45s | < 600MB |
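
The P50 and P95 figures in these tables are percentiles over repeated runs. A minimal sketch using the nearest-rank method (the benchmark harness may interpolate instead):

```python
import math

def percentile(samples_ms: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample covering p percent of runs."""
    ordered = sorted(samples_ms)
    rank = max(math.ceil(p / 100 * len(ordered)), 1)  # 1-based rank
    return ordered[rank - 1]

# Ten cold-start timings in milliseconds (made-up sample data)
timings = [2800, 2900, 2750, 3100, 2850, 4100, 2950, 3000, 2880, 2910]
print("p50:", percentile(timings, 50), "ms")
print("p95:", percentile(timings, 95), "ms")
```

Note that with only ten runs, P95 is effectively the worst observation, which is why the tables pair P50 with P95 rather than relying on means.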

---

## Reachability Analysis Targets

### By Codebase Size

| Codebase Size | P50 Time | P95 Time | Memory | Notes |
|---------------|----------|----------|--------|-------|
| Tiny (<5k LOC) | < 10s | < 20s | < 256MB | Single service |
| Small (5-20k LOC) | < 30s | < 60s | < 512MB | Small service |
| Medium (20-50k LOC) | < 2min | < 4min | < 1GB | Typical microservice |
| Large (50-100k LOC) | < 5min | < 10min | < 2GB | Large service |
| Very Large (100-500k LOC) | < 15min | < 30min | < 4GB | Monolith |
| Monorepo (>500k LOC) | < 45min | < 90min | < 8GB | Enterprise monorepo |

### By Language

| Language | Relative Speed | Notes |
|----------|---------------|-------|
| Go | 1.0x (baseline) | Fast due to simple module system |
| Java | 1.2x | Maven/Gradle resolution adds overhead |
| C# | 1.3x | MSBuild/NuGet resolution |
| TypeScript | 1.5x | npm/yarn resolution, complex imports |
| Python | 1.8x | Virtual env resolution, dynamic imports |
| JavaScript | 2.0x | Complex bundler configurations |

---

## SBOM Generation Targets

| Format | P50 Time | P95 Time | Output Size | Notes |
|--------|----------|----------|-------------|-------|
| CycloneDX 1.6 (JSON) | < 1s | < 3s | ~50KB/100 components | Standard |
| CycloneDX 1.6 (XML) | < 1.5s | < 4s | ~80KB/100 components | Verbose |
| SPDX 3.0.1 (JSON) | < 1s | < 3s | ~60KB/100 components | Standard |
| SPDX 3.0.1 (Tag-Value) | < 1.2s | < 3.5s | ~70KB/100 components | Legacy format |

### Combined Operations

| Operation | P50 Time | P95 Time |
|-----------|----------|----------|
| Scan + SBOM | scan_time + 1s | scan_time + 3s |
| Scan + SBOM + Reachability | scan_time + reach_time + 2s | scan_time + reach_time + 5s |
| Full attestation pipeline | total_time + 2s | total_time + 5s |

---

## VEX Processing Targets

| Operation | P50 Time | P95 Time | Notes |
|-----------|----------|----------|-------|
| VEX document parsing | < 50ms | < 150ms | Per document |
| Lattice state computation | < 100ms | < 300ms | Per 100 vulnerabilities |
| VEX consensus merge | < 200ms | < 500ms | 3-5 sources |
| State transition | < 10ms | < 30ms | Single transition |

---

## CVSS Scoring Targets

| Operation | P50 Time | P95 Time | Notes |
|-----------|----------|----------|-------|
| MacroVector lookup | < 1μs | < 5μs | Dictionary lookup |
| CVSS v4.0 base score | < 10μs | < 50μs | Full computation |
| CVSS v4.0 full score | < 20μs | < 100μs | Base + threat + env |
| Vector parsing | < 5μs | < 20μs | String parsing |
| Receipt generation | < 100μs | < 500μs | Includes hashing |
| Batch scoring (100 vulns) | < 5ms | < 15ms | Parallel processing |

---

## Attestation Targets

| Operation | P50 Time | P95 Time | Notes |
|-----------|----------|----------|-------|
| DSSE envelope creation | < 50ms | < 150ms | Includes signing |
| DSSE verification | < 30ms | < 100ms | Signature check |
| Rekor submission | < 500ms | < 2s | Network dependent |
| Rekor verification | < 300ms | < 1s | Network dependent |
| in-toto predicate | < 20ms | < 80ms | JSON serialization |

---

## Database Operation Targets

| Operation | P50 Time | P95 Time | Notes |
|-----------|----------|----------|-------|
| Receipt insert | < 5ms | < 20ms | Single record |
| Receipt query (by ID) | < 2ms | < 10ms | Indexed lookup |
| Receipt query (by tenant) | < 10ms | < 50ms | Index scan |
| EPSS lookup (single) | < 1ms | < 5ms | Indexed |
| EPSS lookup (batch 100) | < 10ms | < 50ms | Batch query |
| Risk score insert | < 5ms | < 20ms | Single record |
| Risk score update | < 3ms | < 15ms | Single record |

---

## Regression Thresholds

Performance regression is detected when metrics exceed these thresholds compared to baseline:

| Metric | Warning Threshold | Blocking Threshold | Action |
|--------|------------------|-------------------|--------|
| P50 Time | > 15% increase | > 25% increase | Block release |
| P95 Time | > 20% increase | > 35% increase | Block release |
| Memory Usage | > 20% increase | > 30% increase | Block release |
| CPU Time | > 15% increase | > 25% increase | Investigate |
| Throughput | > 10% decrease | > 20% decrease | Block release |

### Regression Detection Rules

1. **Warning**: Alert engineering team, add to release notes
2. **Blocking**: Cannot merge/release until resolved or waived
3. **Waiver**: Requires documented justification and SME approval
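
The warning/blocking tiers can be expressed as a small classifier. An illustrative sketch, interpreting each threshold as a relative change against the baseline value for a higher-is-worse metric:

```python
# (warning, blocking) relative-increase thresholds from the table above.
THRESHOLDS = {
    "p50_ms": (0.15, 0.25),
    "p95_ms": (0.20, 0.35),
    "memory_mb": (0.20, 0.30),
}

def classify(metric: str, baseline: float, current: float) -> str:
    """Return 'blocking', 'warning', or 'ok' for a higher-is-worse metric."""
    warn, block = THRESHOLDS[metric]
    increase = (current - baseline) / baseline
    if increase > block:
        return "blocking"
    if increase > warn:
        return "warning"
    return "ok"

print(classify("p50_ms", baseline=2800, current=3300))
print(classify("memory_mb", baseline=200, current=280))
```

Throughput would invert the comparison (a decrease is the regression); it is omitted here to keep the sketch minimal.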

---

## Measurement Methodology

### Environment Setup

```bash
# Standard test environment
# - CPU: 8 cores (x86_64)
# - Memory: 16GB RAM
# - Storage: NVMe SSD
# - OS: Ubuntu 22.04 LTS
# - Docker: 24.x

# Clear caches before cold start tests (dropping page caches requires root)
docker system prune -af
sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
```

### Scan Performance

```bash
# Cold start measurement
time stellaops scan --image alpine:3.19 --format json > /dev/null

# Warm cache measurement (run 3x, take average)
for i in {1..3}; do
  time stellaops scan --image alpine:3.19 --format json > /dev/null
done

# Memory profiling
/usr/bin/time -v stellaops scan --image alpine:3.19 --format json 2>&1 | \
  grep "Maximum resident set size"

# CPU profiling
perf stat stellaops scan --image alpine:3.19 --format json > /dev/null
```

### Reachability Analysis

```bash
# Time measurement
time stellaops reach --project ./src --language csharp --out reach.json

# Memory profiling
/usr/bin/time -v stellaops reach --project ./src --language csharp --out reach.json 2>&1

# With detailed timing
stellaops reach --project ./src --language csharp --out reach.json --timing
```

### SBOM Generation

```bash
# Time measurement
time stellaops sbom --image node:20-alpine --format cyclonedx --out sbom.json

# Output size
stellaops sbom --image node:20-alpine --format cyclonedx --out sbom.json && \
  ls -lh sbom.json
```
### Batch Operations

```bash
# Process multiple images in parallel
time stellaops scan --images images.txt --parallel 4 --format json --out-dir ./results

# Throughput test (images per minute)
START=$(date +%s)
for i in {1..10}; do
  stellaops scan --image alpine:3.19 --format json > /dev/null
done
END=$(date +%s)
echo "Throughput: $(( 10 * 60 / (END - START) )) images/minute"
```

---

## CI Integration

### Benchmark Workflow

```yaml
# .gitea/workflows/performance-benchmark.yml
name: Performance Benchmark

on:
  pull_request:
    branches: [main]
  schedule:
    - cron: '0 2 * * 1' # Weekly Monday 2am

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run benchmarks
        run: make benchmark-performance

      - name: Check for regressions
        run: |
          stellaops benchmark compare \
            --baseline results/baseline.json \
            --current results/current.json \
            --threshold-p50 0.15 \
            --threshold-p95 0.20 \
            --threshold-memory 0.20 \
            --fail-on-regression

      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: results/
```

### Local Testing

```bash
# Run full benchmark suite
make benchmark-performance

# Run specific image benchmark
make benchmark-image IMAGE=alpine:3.19

# Generate baseline
make benchmark-baseline

# Compare against baseline
make benchmark-compare
```

---

## Optimization Guidelines

### For Scan Performance

1. **Pre-pull images** for consistent timing
2. **Use layered caching** for repeat scans
3. **Enable parallel analysis** for multi-ecosystem images
4. **Consider selective scanning** for known-safe layers

### For Reachability

1. **Incremental analysis** for unchanged files
2. **Cache resolved dependencies**
3. **Use language-specific optimizations** (e.g., Roslyn for C#)
4. **Limit call graph depth** for very large codebases

### For Memory

1. **Stream large SBOMs** instead of loading fully
2. **Use batched database operations**
3. **Release intermediate data structures early**
4. **Configure GC appropriately for workload**

---

## Historical Baselines

### Version History

| Version | Date | P50 Scan (alpine) | P50 Reach (50k LOC) | Notes |
|---------|------|-------------------|---------------------|-------|
| 1.3.0 | 2025-12-14 | TBD | TBD | Current |
| 1.2.0 | 2025-09-01 | TBD | TBD | Previous |
| 1.1.0 | 2025-06-01 | TBD | TBD | Baseline |

### Improvement Targets

| Quarter | Focus Area | Target | Status |
|---------|------------|--------|--------|
| Q1 2026 | Scan cold start | -20% | Planned |
| Q1 2026 | Reachability memory | -15% | Planned |
| Q2 2026 | SBOM generation | -10% | Planned |

---

## References

- [Accuracy Metrics Framework](accuracy-metrics-framework.md)
- [Benchmark Submission Guide](submission-guide.md) (pending)
- [Scanner Architecture](../modules/scanner/architecture.md)
- [Reachability Module](../modules/scanner/reachability.md)

docs/benchmarks/submission-guide.md

# Benchmark Submission Guide

**Last Updated:** 2025-12-14
**Next Review:** 2026-03-14

---

## Overview

StellaOps publishes benchmarks for:

- **Reachability Analysis** - Accuracy of static and runtime path detection
- **SBOM Completeness** - Component detection and version accuracy
- **Vulnerability Detection** - Precision, recall, and F1 scores
- **Scan Performance** - Time, memory, and CPU metrics
- **Determinism** - Reproducibility of scan outputs

This guide explains how to reproduce, validate, and submit benchmark results.

---

## 1. PREREQUISITES

### 1.1 System Requirements

| Requirement | Minimum | Recommended |
|-------------|---------|-------------|
| CPU | 4 cores | 8 cores |
| Memory | 8 GB | 16 GB |
| Storage | 50 GB SSD | 100 GB NVMe |
| OS | Ubuntu 22.04 LTS | Ubuntu 22.04 LTS |
| Docker | 24.x | 24.x |
| .NET | 10.0 | 10.0 |

### 1.2 Environment Setup

```bash
# Clone the repository
git clone https://git.stella-ops.org/stella-ops.org/git.stella-ops.org.git
cd git.stella-ops.org

# Install .NET 10 SDK
sudo apt-get update
sudo apt-get install -y dotnet-sdk-10.0

# Install Docker (if not present)
curl -fsSL https://get.docker.com | sh

# Install benchmark dependencies
sudo apt-get install -y \
  jq \
  b3sum \
  hyperfine \
  time

# Set determinism environment variables
export TZ=UTC
export LC_ALL=C
export STELLAOPS_DETERMINISM_SEED=42
export STELLAOPS_DETERMINISM_TIMESTAMP="2025-01-01T00:00:00Z"
```

### 1.3 Pull Reference Images

```bash
# Download standard benchmark images
make benchmark-pull-images

# Or manually:
docker pull alpine:3.19
docker pull debian:12-slim
docker pull ubuntu:22.04
docker pull node:20-alpine
docker pull python:3.12
docker pull mcr.microsoft.com/dotnet/aspnet:8.0
docker pull nginx:1.25
docker pull postgres:16-alpine
```

---

## 2. RUNNING BENCHMARKS

### 2.1 Full Benchmark Suite

```bash
# Run all benchmarks (takes ~30-60 minutes)
make benchmark-all

# Output: results/benchmark-all-$(date +%Y%m%d).json
```

### 2.2 Category-Specific Benchmarks

#### Reachability Benchmark

```bash
# Run reachability accuracy benchmarks
make benchmark-reachability

# With specific language filter
make benchmark-reachability LANG=csharp

# Output: results/reachability/benchmark-reachability-$(date +%Y%m%d).json
```

#### Performance Benchmark

```bash
# Run scan performance benchmarks
make benchmark-performance

# Single image
make benchmark-image IMAGE=alpine:3.19

# Output: results/performance/benchmark-performance-$(date +%Y%m%d).json
```

#### SBOM Benchmark

```bash
# Run SBOM completeness benchmarks
make benchmark-sbom

# Specific format
make benchmark-sbom FORMAT=cyclonedx

# Output: results/sbom/benchmark-sbom-$(date +%Y%m%d).json
```

#### Determinism Benchmark

```bash
# Run determinism verification
make benchmark-determinism

# Output: results/determinism/benchmark-determinism-$(date +%Y%m%d).json
```

### 2.3 CLI Benchmark Commands

```bash
# Performance timing with hyperfine (10 runs)
hyperfine --warmup 2 --runs 10 \
  'stellaops scan --image alpine:3.19 --format json --output /dev/null'

# Memory profiling
/usr/bin/time -v stellaops scan --image alpine:3.19 --format json 2>&1 | \
  grep "Maximum resident set size"

# CPU profiling (Linux)
perf stat stellaops scan --image alpine:3.19 --format json > /dev/null

# Determinism check (run twice, compare hashes)
stellaops scan --image alpine:3.19 --format json | sha256sum > run1.sha
stellaops scan --image alpine:3.19 --format json | sha256sum > run2.sha
diff run1.sha run2.sha && echo "DETERMINISTIC" || echo "NON-DETERMINISTIC"
```

---

## 3. OUTPUT FORMATS

### 3.1 Reachability Results Schema

```json
{
  "benchmark": "reachability-v1",
  "date": "2025-12-14T00:00:00Z",
  "scanner_version": "1.3.0",
  "scanner_commit": "abc123def",
  "environment": {
    "os": "ubuntu-22.04",
    "arch": "amd64",
    "cpu": "Intel Xeon E-2288G",
    "memory_gb": 16
  },
  "summary": {
    "total_samples": 200,
    "precision": 0.90,
    "recall": 0.86,
    "f1": 0.88,
    "false_positive_rate": 0.06,
    "false_negative_rate": 0.14
  },
  "by_language": {
    "java": {
      "samples": 50,
      "precision": 0.94,
      "recall": 0.88,
      "f1": 0.909,
      "confusion_matrix": {
        "tp": 15, "fp": 1, "tn": 32, "fn": 2
      }
    },
    "csharp": {
      "samples": 50,
      "precision": 0.91,
      "recall": 0.86,
      "f1": 0.884,
      "confusion_matrix": {
        "tp": 19, "fp": 2, "tn": 26, "fn": 3
      }
    },
    "typescript": {
      "samples": 50,
      "precision": 0.89,
      "recall": 0.84,
      "f1": 0.864,
      "confusion_matrix": {
        "tp": 16, "fp": 2, "tn": 29, "fn": 3
      }
    },
    "python": {
      "samples": 50,
      "precision": 0.88,
      "recall": 0.83,
      "f1": 0.854,
      "confusion_matrix": {
        "tp": 15, "fp": 2, "tn": 30, "fn": 3
      }
    }
  },
  "ground_truth_ref": "datasets/reachability/v2025.12",
  "raw_results_ref": "results/reachability/raw/2025-12-14/"
}
```
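
Consumers of this schema can sanity-check a submission by recomputing the headline metrics from each `confusion_matrix`. A minimal sketch; the 0.005 tolerance is an assumption to absorb rounding in the published values:

```python
import json

def validate_entry(entry: dict, tol: float = 0.005) -> bool:
    """Recompute precision/recall from the confusion matrix and compare."""
    cm = entry["confusion_matrix"]
    precision = cm["tp"] / (cm["tp"] + cm["fp"])
    recall = cm["tp"] / (cm["tp"] + cm["fn"])
    return (abs(precision - entry["precision"]) <= tol
            and abs(recall - entry["recall"]) <= tol)

# A per-language entry shaped like the schema above
java = json.loads("""{
  "samples": 50, "precision": 0.94, "recall": 0.88, "f1": 0.909,
  "confusion_matrix": {"tp": 15, "fp": 1, "tn": 32, "fn": 2}
}""")
print(validate_entry(java))
```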

### 3.2 Performance Results Schema

```json
{
  "benchmark": "performance-v1",
  "date": "2025-12-14T00:00:00Z",
  "scanner_version": "1.3.0",
  "scanner_commit": "abc123def",
  "environment": {
    "os": "ubuntu-22.04",
    "arch": "amd64",
    "cpu": "Intel Xeon E-2288G",
    "memory_gb": 16,
    "storage": "nvme"
  },
  "images": [
    {
      "image": "alpine:3.19",
      "size_mb": 7,
      "components": 15,
      "vulnerabilities": 5,
      "runs": 10,
      "cold_start": {
        "p50_ms": 2800,
        "p95_ms": 4200,
        "mean_ms": 3100
      },
      "warm_cache": {
        "p50_ms": 1500,
        "p95_ms": 2100,
        "mean_ms": 1650
      },
      "memory_peak_mb": 180,
      "cpu_time_ms": 1200
    },
    {
      "image": "python:3.12",
      "size_mb": 1024,
      "components": 300,
      "vulnerabilities": 150,
      "runs": 10,
      "cold_start": {
        "p50_ms": 32000,
        "p95_ms": 48000,
        "mean_ms": 35000
      },
      "warm_cache": {
        "p50_ms": 18000,
        "p95_ms": 25000,
        "mean_ms": 19500
      },
      "memory_peak_mb": 1100,
      "cpu_time_ms": 28000
    }
  ],
  "aggregated": {
    "total_images": 8,
    "total_runs": 80,
    "avg_time_per_mb_ms": 35,
    "avg_memory_per_component_kb": 400
  }
}
```
|
||||
|
||||
### 3.3 SBOM Results Schema

```json
{
  "benchmark": "sbom-v1",
  "date": "2025-12-14T00:00:00Z",
  "scanner_version": "1.3.0",
  "summary": {
    "total_images": 8,
    "component_recall": 0.98,
    "component_precision": 0.995,
    "version_accuracy": 0.96
  },
  "by_ecosystem": {
    "apk": {
      "ground_truth_components": 100,
      "detected_components": 99,
      "correct_versions": 96,
      "recall": 0.99,
      "precision": 0.99,
      "version_accuracy": 0.96
    },
    "npm": {
      "ground_truth_components": 500,
      "detected_components": 492,
      "correct_versions": 475,
      "recall": 0.984,
      "precision": 0.998,
      "version_accuracy": 0.965
    }
  },
  "formats_tested": ["cyclonedx-1.6", "spdx-3.0.1"]
}
```
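
The per-ecosystem ratios in this schema follow from the raw counts with a few divisions. The sketch below is illustrative, not the harness's actual code; in particular, `false_positives` is a hypothetical parameter, since the schema above carries no explicit false-positive count, and which denominator `version_accuracy` uses is an assumption:

```python
def ecosystem_metrics(ground_truth: int, detected: int,
                      correct_versions: int, false_positives: int = 0) -> dict:
    """Derive per-ecosystem SBOM accuracy ratios from raw counts.

    false_positives is a hypothetical input (the schema above does not
    carry an explicit FP count), so it defaults to zero here.
    """
    true_positives = detected - false_positives
    return {
        "recall": round(true_positives / ground_truth, 3),
        "precision": round(true_positives / detected, 3),
        "version_accuracy": round(correct_versions / detected, 3),
    }

# npm row from the example above: 492 detected of 500, 475 correct versions
npm = ecosystem_metrics(500, 492, 475)
```

With the npm counts from the example, this reproduces the recall (0.984) and version accuracy (0.965) shown above; precision additionally requires the false-positive count, which is why it appears as an explicit parameter here.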

### 3.4 Determinism Results Schema

```json
{
  "benchmark": "determinism-v1",
  "date": "2025-12-14T00:00:00Z",
  "scanner_version": "1.3.0",
  "summary": {
    "total_runs": 100,
    "bitwise_identical": 100,
    "bitwise_fidelity": 1.0,
    "semantic_identical": 100,
    "semantic_fidelity": 1.0
  },
  "by_image": {
    "alpine:3.19": {
      "runs": 20,
      "bitwise_identical": 20,
      "output_hash": "sha256:abc123..."
    },
    "python:3.12": {
      "runs": 20,
      "bitwise_identical": 20,
      "output_hash": "sha256:def456..."
    }
  },
  "seed": 42,
  "timestamp_frozen": "2025-01-01T00:00:00Z"
}
```

---

## 4. SUBMISSION PROCESS

### 4.1 Internal Submission (StellaOps Team)

Benchmark results are automatically collected by CI:

```yaml
# .gitea/workflows/weekly-benchmark.yml triggers:
# - Weekly benchmark runs
# - Results stored in internal dashboard
# - Regression detection against baselines
```

Manual submission:
```bash
# Upload to internal dashboard
make benchmark-submit

# Or via CLI
stellaops benchmark submit \
  --file results/benchmark-all-20251214.json \
  --dashboard internal
```

### 4.2 External Validation Submission

Third parties can validate and submit benchmark results:

#### Step 1: Fork and Clone

```bash
# Fork the benchmark repository
# https://git.stella-ops.org/stella-ops.org/benchmarks

git clone https://git.stella-ops.org/<your-org>/benchmarks.git
cd benchmarks
```

#### Step 2: Run Benchmarks

```bash
# With StellaOps scanner
make benchmark-all SCANNER=stellaops

# Or with your own tool for comparison
make benchmark-all SCANNER=your-tool
```

#### Step 3: Prepare Submission

```bash
# Results directory structure
mkdir -p submissions/<your-org>/<date>

# Copy results
cp results/*.json submissions/<your-org>/<date>/

# Add reproduction README
cat > submissions/<your-org>/<date>/README.md <<EOF
# Benchmark Results: <Your Org>

**Date:** $(date -u +%Y-%m-%d)
**Scanner:** <tool-name>
**Version:** <version>

## Environment
- OS: <os>
- CPU: <cpu>
- Memory: <memory>

## Reproduction Steps
<steps>

## Notes
<any observations>
EOF
```

#### Step 4: Submit Pull Request

```bash
git checkout -b benchmark-results-$(date +%Y%m%d)
git add submissions/
git commit -m "Add benchmark results from <your-org> $(date +%Y-%m-%d)"
git push origin benchmark-results-$(date +%Y%m%d)

# Create PR via web interface or gh CLI
gh pr create --title "Benchmark: <your-org> $(date +%Y-%m-%d)" \
  --body "Benchmark results for external validation"
```

### 4.3 Submission Review Process

| Step | Action | Timeline |
|------|--------|----------|
| 1 | PR submitted | Day 0 |
| 2 | Automated validation runs | Day 0 (CI) |
| 3 | Maintainer review | Day 1-3 |
| 4 | Results published (if valid) | Day 3-5 |
| 5 | Dashboard updated | Day 5 |

---

## 5. BENCHMARK CATEGORIES

### 5.1 Reachability Benchmark

**Purpose:** Measure accuracy of static and runtime reachability analysis.

**Ground Truth Source:** `datasets/reachability/`

**Test Cases:**
- 50+ samples per language (Java, C#, TypeScript, Python, Go)
- Known-reachable vulnerable paths
- Known-unreachable vulnerable code
- Runtime-only reachable code

**Scoring:**
```
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 * (Precision * Recall) / (Precision + Recall)
```

**Targets:**
| Metric | Target | Blocking |
|--------|--------|----------|
| Precision | >= 90% | >= 85% |
| Recall | >= 85% | >= 80% |
| F1 | >= 87% | >= 82% |
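
The scoring formulas reduce to a few lines of code. This is a minimal sketch; the gate against the blocking thresholds illustrates how the targets table would be enforced, not how the harness actually implements it:

```python
def score(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Precision, recall, and F1 from a confusion matrix, per the formulas above."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# Blocking thresholds from the targets table
BLOCKING = {"precision": 0.85, "recall": 0.80, "f1": 0.82}

def passes_blocking(metrics: dict) -> bool:
    """True when every metric clears its blocking threshold."""
    return all(metrics[name] >= floor for name, floor in BLOCKING.items())
```

For example, the Java confusion matrix from the results schema (`tp=44, fp=3, tn=2, fn=1`) gives `precision = 44/47` and `recall = 44/45`, both well clear of the blocking floors.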

### 5.2 Performance Benchmark

**Purpose:** Measure scan time, memory usage, and CPU utilization.

**Reference Images:** See [Performance Baselines](performance-baselines.md)

**Metrics:**
- P50/P95 scan time (cold and warm)
- Peak memory usage
- CPU time
- Throughput (images/minute)

**Targets:**
| Image Category | P50 Time | P95 Time | Max Memory |
|----------------|----------|----------|------------|
| Minimal (<100MB) | < 5s | < 10s | < 256MB |
| Standard (100-500MB) | < 15s | < 30s | < 512MB |
| Large (500MB-2GB) | < 45s | < 90s | < 1.5GB |
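
The P50/P95 fields in the performance results schema can be computed from per-run timings. A sketch using Python's `statistics` module; the percentile method the benchmark harness actually uses is an assumption (`method="inclusive"` here):

```python
import statistics

def latency_summary(run_times_ms: list[float]) -> dict:
    """Summarize per-run scan times into the fields used by the performance schema."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile
    cuts = statistics.quantiles(run_times_ms, n=100, method="inclusive")
    return {
        "p50_ms": statistics.median(run_times_ms),
        "p95_ms": cuts[94],
        "mean_ms": statistics.fmean(run_times_ms),
    }
```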

### 5.3 SBOM Benchmark

**Purpose:** Measure component detection completeness and accuracy.

**Ground Truth Source:** Manual SBOM audits of reference images.

**Metrics:**
- Component recall (found / total)
- Component precision (real / reported)
- Version accuracy (correct / total)

**Targets:**
| Metric | Target |
|--------|--------|
| Component Recall | >= 98% |
| Component Precision | >= 99% |
| Version Accuracy | >= 95% |

### 5.4 Vulnerability Detection Benchmark

**Purpose:** Measure CVE detection accuracy against known-vulnerable images.

**Ground Truth Source:** `datasets/vulns/` curated CVE lists.

**Metrics:**
- True positive rate
- False positive rate
- False negative rate
- Precision/Recall/F1

**Targets:**
| Metric | Target |
|--------|--------|
| Precision | >= 95% |
| Recall | >= 90% |
| F1 | >= 92% |

### 5.5 Determinism Benchmark

**Purpose:** Verify reproducible scan outputs.

**Methodology:**
1. Run the same scan N times (default: 20)
2. Compare output hashes
3. Calculate bitwise fidelity

**Targets:**
| Metric | Target |
|--------|--------|
| Bitwise Fidelity | 100% |
| Semantic Fidelity | 100% |
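
Steps 2 and 3 of the methodology amount to hashing each run's output and counting matches. A minimal sketch; comparing every run against the first run's digest is an assumption (the harness could equally compare against a stored baseline hash):

```python
import hashlib

def bitwise_fidelity(outputs: list[bytes]) -> float:
    """Fraction of runs whose output is byte-identical to the first run."""
    digests = [hashlib.sha256(blob).hexdigest() for blob in outputs]
    return sum(d == digests[0] for d in digests) / len(digests)
```

Any value below 1.0 means at least one run diverged, which fails the 100% bitwise fidelity target.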

---

## 6. COMPARING RESULTS

### 6.1 Against Baselines

```bash
# Compare current run against stored baseline
stellaops benchmark compare \
  --baseline results/baseline/2025-Q4.json \
  --current results/benchmark-all-20251214.json \
  --threshold-p50 0.15 \
  --threshold-precision 0.02 \
  --fail-on-regression

# Output:
# Performance: PASS (P50 within 15% of baseline)
# Accuracy: PASS (Precision within 2% of baseline)
# Determinism: PASS (100% fidelity)
```
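
The pass/fail logic implied by these flags can be sketched as follows. The threshold semantics (relative growth for P50, absolute drop for precision) are assumptions inferred from the example output above, not the CLI's documented behavior:

```python
def regressions(baseline: dict, current: dict,
                threshold_p50: float = 0.15,
                threshold_precision: float = 0.02) -> list[str]:
    """Return a list of regression messages; an empty list means PASS."""
    found = []
    # P50 may grow at most 15% relative to baseline
    if current["p50_ms"] > baseline["p50_ms"] * (1 + threshold_p50):
        found.append("performance: P50 regressed beyond threshold")
    # Precision may drop at most 2 points absolute
    if current["precision"] < baseline["precision"] - threshold_precision:
        found.append("accuracy: precision regressed beyond threshold")
    return found
```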

### 6.2 Against Other Tools

```bash
# Generate comparison report
stellaops benchmark compare-tools \
  --stellaops results/stellaops/2025-12-14.json \
  --trivy results/trivy/2025-12-14.json \
  --grype results/grype/2025-12-14.json \
  --output comparison-report.html
```

### 6.3 Historical Trends

```bash
# Generate trend report (last 12 months)
stellaops benchmark trend \
  --period 12m \
  --metrics precision,recall,p50_time \
  --output trend-report.html
```

---

## 7. TROUBLESHOOTING

### 7.1 Common Issues

| Issue | Cause | Resolution |
|-------|-------|------------|
| Non-deterministic output | Locale not set | Set `LC_ALL=C` |
| Memory OOM | Large image | Increase memory limit |
| Slow performance | Cold cache | Pre-pull images |
| Missing components | Ecosystem not supported | Check supported ecosystems |

### 7.2 Debug Mode

```bash
# Enable verbose benchmark logging
make benchmark-all DEBUG=1

# Enable timing breakdown
export STELLAOPS_BENCHMARK_TIMING=1
make benchmark-performance
```

### 7.3 Validation Failures

```bash
# Check result schema validity
stellaops benchmark validate --file results/benchmark-all.json

# Check against ground truth
stellaops benchmark validate-ground-truth \
  --results results/reachability.json \
  --ground-truth datasets/reachability/v2025.12
```

---

## 8. REFERENCES

- [Performance Baselines](performance-baselines.md)
- [Accuracy Metrics Framework](accuracy-metrics-framework.md)
- [Offline Parity Verification](../airgap/offline-parity-verification.md)
- [Determinism CI Harness](../modules/scanner/design/determinism-ci-harness.md)
- [Ground Truth Datasets](../datasets/README.md)

---

**Document Version**: 1.0
**Target Platform**: .NET 10, PostgreSQL >=16