# Accuracy Metrics Framework

## Overview

This document defines the accuracy metrics framework used to measure and track StellaOps scanner performance. All metrics are computed against ground truth datasets and published quarterly.

## Metric Definitions

### Confusion Matrix

For binary classification tasks (e.g., reachable vs unreachable):

|                     | Predicted Positive | Predicted Negative |
|---------------------|--------------------|--------------------|
| **Actual Positive** | True Positive (TP) | False Negative (FN) |
| **Actual Negative** | False Positive (FP) | True Negative (TN) |

### Core Metrics

| Metric | Formula | Description | Target |
|--------|---------|-------------|--------|
| **Precision** | TP / (TP + FP) | Of items flagged, how many were correct | >= 90% |
| **Recall** | TP / (TP + FN) | Of actual positives, how many were found | >= 85% |
| **F1 Score** | 2 * (P * R) / (P + R) | Harmonic mean of precision and recall | >= 87% |
| **False Positive Rate** | FP / (FP + TN) | Rate of incorrect positive flags | <= 10% |
| **Accuracy** | (TP + TN) / Total | Overall correctness | >= 90% |
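For illustration, the sketch below computes these core metrics from raw confusion-matrix counts. The `ConfusionMatrix` class and the sample counts are hypothetical and are not part of the benchmark tooling.

```python
from dataclasses import dataclass


@dataclass
class ConfusionMatrix:
    """Raw counts for one binary classification task (e.g., reachability)."""
    tp: int
    fp: int
    tn: int
    fn: int

    @property
    def precision(self) -> float:
        return self.tp / (self.tp + self.fp) if (self.tp + self.fp) else 0.0

    @property
    def recall(self) -> float:
        return self.tp / (self.tp + self.fn) if (self.tp + self.fn) else 0.0

    @property
    def f1(self) -> float:
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r) if (p + r) else 0.0

    @property
    def false_positive_rate(self) -> float:
        return self.fp / (self.fp + self.tn) if (self.fp + self.tn) else 0.0

    @property
    def accuracy(self) -> float:
        total = self.tp + self.fp + self.tn + self.fn
        return (self.tp + self.tn) / total if total else 0.0


# Hypothetical counts for a 450-sample reachability run.
cm = ConfusionMatrix(tp=180, fp=15, tn=235, fn=20)
print(f"precision={cm.precision:.2f} recall={cm.recall:.2f} "
      f"f1={cm.f1:.2f} fpr={cm.false_positive_rate:.2f} accuracy={cm.accuracy:.2f}")
# precision=0.92 recall=0.90 f1=0.91 fpr=0.06 accuracy=0.92
```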
---

## Reachability Analysis Accuracy

### Definitions

- **True Positive (TP)**: Correctly identified as reachable (code path actually exists)
- **False Positive (FP)**: Incorrectly identified as reachable (no real code path)
- **True Negative (TN)**: Correctly identified as unreachable (no code path exists)
- **False Negative (FN)**: Incorrectly identified as unreachable (code path exists but missed)

### Target Metrics

| Metric | Target | Stretch Goal |
|--------|--------|--------------|
| Precision | >= 90% | >= 95% |
| Recall | >= 85% | >= 90% |
| F1 Score | >= 87% | >= 92% |
| False Positive Rate | <= 10% | <= 5% |

### Per-Language Targets

| Language | Precision | Recall | F1 | Notes |
|----------|-----------|--------|-----|-------|
| Java | >= 92% | >= 88% | >= 90% | Strong static analysis support |
| C# | >= 90% | >= 85% | >= 87% | Roslyn-based analysis |
| Go | >= 88% | >= 82% | >= 85% | Good call graph support |
| JavaScript | >= 85% | >= 78% | >= 81% | Dynamic typing challenges |
| Python | >= 83% | >= 75% | >= 79% | Dynamic typing challenges |
| TypeScript | >= 88% | >= 82% | >= 85% | Better than JS due to types |

---

## Lattice State Accuracy

VEX lattice states have different confidence requirements:

| State | Definition | Target Accuracy | Validation |
|-------|------------|-----------------|------------|
| **CR** (Confirmed Reachable) | Runtime evidence + static path | >= 95% | Runtime trace verification |
| **SR** (Static Reachable) | Static path only | >= 90% | Static analysis coverage |
| **SU** (Static Unreachable) | No static path found | >= 85% | Negative proof verification |
| **DT** (Denied by Tool) | Tool analysis confirms not affected | >= 90% | Tool output validation |
| **DV** (Denied by Vendor) | Vendor VEX statement | >= 95% | VEX signature verification |
| **U** (Unknown) | Insufficient evidence | Track % | Minimize unknowns |

### Lattice Transition Accuracy

Measure accuracy of automatic state transitions:

| Transition | Trigger | Target Accuracy |
|------------|---------|-----------------|
| U -> SR | Static analysis finds path | >= 90% |
| SR -> CR | Runtime evidence added | >= 95% |
| U -> SU | Static analysis proves unreachable | >= 85% |
| SR -> DT | Tool-specific analysis | >= 90% |

---

## SBOM Completeness Metrics

### Component Detection

| Metric | Formula | Target | Notes |
|--------|---------|--------|-------|
| **Component Recall** | Found / Total Actual | >= 98% | Find all real components |
| **Component Precision** | Real / Reported | >= 99% | Minimize phantom components |
| **Version Accuracy** | Correct Versions / Total | >= 95% | Version string correctness |
| **License Accuracy** | Correct Licenses / Total | >= 90% | License detection accuracy |

### Per-Ecosystem Targets

| Ecosystem | Comp. Recall | Comp. Precision | Version Acc. |
|-----------|--------------|-----------------|--------------|
| Alpine APK | >= 99% | >= 99% | >= 98% |
| Debian DEB | >= 99% | >= 99% | >= 98% |
| npm | >= 97% | >= 98% | >= 95% |
| Maven | >= 98% | >= 99% | >= 96% |
| NuGet | >= 98% | >= 99% | >= 96% |
| PyPI | >= 96% | >= 98% | >= 94% |
| Go Modules | >= 97% | >= 98% | >= 95% |
| Cargo (Rust) | >= 98% | >= 99% | >= 96% |

---

## Vulnerability Detection Accuracy

### CVE Matching

| Metric | Formula | Target |
|--------|---------|--------|
| **CVE Recall** | Found CVEs / Actual CVEs | >= 95% |
| **CVE Precision** | Correct CVEs / Reported CVEs | >= 98% |
| **Version Range Accuracy** | Correct Affected / Total | >= 93% |

### False Positive Categories

Track and minimize specific FP types:

| FP Type | Description | Target Rate |
|---------|-------------|-------------|
| **Phantom Component** | CVE for component not present | <= 1% |
| **Version Mismatch** | CVE for wrong version | <= 3% |
| **Ecosystem Confusion** | Wrong package with same name | <= 1% |
| **Stale Advisory** | Already fixed but flagged | <= 2% |

---

## Measurement Methodology

### Ground Truth Establishment

1. **Manual Curation**
   - Expert review of sample applications
   - Documented decision rationale
   - Multiple reviewer consensus

2. **Automated Verification**
   - Cross-reference with authoritative sources
   - NVD, OSV, GitHub Advisory Database
   - Vendor security bulletins

3. **Runtime Validation**
   - Dynamic analysis confirmation
   - Exploit proof-of-concept testing
   - Production monitoring correlation

### Test Corpus Requirements

| Category | Minimum Samples | Diversity Requirements |
|----------|-----------------|------------------------|
| Reachability | 50 per language | Mix of libraries, frameworks |
| SBOM | 100 images | All major ecosystems |
| CVE Detection | 500 CVEs | Mix of severities, ages |
| Performance | 10 reference images | Various sizes |

### Measurement Process

```
1. Select ground truth corpus
   └── Minimum samples per category
   └── Representative of production workloads

2. Run scanner with deterministic manifest
   └── Fixed advisory database version
   └── Reproducible configuration

3. Compare results to ground truth
   └── Automated diff tooling
   └── Manual review of discrepancies

4. Compute metrics per category
   └── Generate confusion matrices
   └── Calculate precision/recall/F1

5. Aggregate and publish
   └── Per-ecosystem breakdown
   └── Overall summary metrics
   └── Trend analysis
```
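As a rough sketch of steps 3 and 4 above, the helper below diffs predicted reachability labels against ground truth and tallies a confusion matrix for one category. The function name, dictionary shape, and sample identifiers are illustrative assumptions, not the actual diff tooling.

```python
from collections import Counter


def score_reachability(predicted: dict[str, bool], ground_truth: dict[str, bool]) -> Counter:
    """Tally TP/FP/TN/FN for one category.

    Keys identify a (CVE, application) sample; values are True when the
    finding is labelled reachable. Samples the scanner did not report are
    treated as negative predictions.
    """
    counts = Counter()
    for sample_id, actually_reachable in ground_truth.items():
        predicted_reachable = predicted.get(sample_id, False)
        if actually_reachable and predicted_reachable:
            counts["tp"] += 1
        elif actually_reachable and not predicted_reachable:
            counts["fn"] += 1
        elif not actually_reachable and predicted_reachable:
            counts["fp"] += 1
        else:
            counts["tn"] += 1
    return counts


# Hypothetical samples: one correct positive, one false positive.
truth = {"CVE-2025-0001@app-a": True, "CVE-2025-0002@app-a": False}
pred = {"CVE-2025-0001@app-a": True, "CVE-2025-0002@app-a": True}
print(score_reachability(pred, truth))  # Counter({'tp': 1, 'fp': 1})
```

Per-language or per-ecosystem breakdowns can then feed these counts into the core-metric formulas defined earlier.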
---

## Reporting Format

### Quarterly Benchmark Report

```json
{
  "report_version": "1.0",
  "scanner_version": "1.3.0",
  "report_date": "2025-12-14",
  "ground_truth_version": "2025-Q4",
  "reachability": {
    "overall": {"precision": 0.91, "recall": 0.86, "f1": 0.88, "samples": 450},
    "by_language": {
      "java": {"precision": 0.93, "recall": 0.88, "f1": 0.90, "samples": 100},
      "csharp": {"precision": 0.90, "recall": 0.85, "f1": 0.87, "samples": 80},
      "go": {"precision": 0.89, "recall": 0.83, "f1": 0.86, "samples": 70}
    }
  },
  "sbom": {
    "component_recall": 0.98,
    "component_precision": 0.99,
    "version_accuracy": 0.96
  },
  "vulnerability": {
    "cve_recall": 0.96,
    "cve_precision": 0.98,
    "false_positive_rate": 0.02
  },
  "lattice_states": {
    "cr_accuracy": 0.96,
    "sr_accuracy": 0.91,
    "su_accuracy": 0.87
  }
}
```

---

## Regression Detection

### Thresholds

A regression is flagged when:

| Metric | Regression Threshold | Action |
|--------|----------------------|--------|
| Precision | > 3% decrease | Block release |
| Recall | > 5% decrease | Block release |
| F1 | > 4% decrease | Block release |
| FPR | > 2% increase | Block release |
| Any metric | > 1% change | Investigate |

### CI Integration

```yaml
# .gitea/workflows/accuracy-check.yml
accuracy-benchmark:
  runs-on: ubuntu-latest
  steps:
    - name: Run accuracy benchmark
      run: make benchmark-accuracy
    - name: Check for regressions
      run: |
        stellaops benchmark compare \
          --baseline results/baseline.json \
          --current results/current.json \
          --threshold-precision 0.03 \
          --threshold-recall 0.05 \
          --fail-on-regression
```

---

## Ground Truth Sources

### Internal

- `datasets/reachability/samples/` - Reachability ground truth
- `datasets/sbom/reference/` - Known-good SBOMs
- `bench/findings/` - CVE finding ground truth

### External

- **NIST SARD** - Software Assurance Reference Dataset
- **OSV Test Suite** - Open Source Vulnerability test cases
- **OWASP Benchmark** - Security testing benchmark
- **Juliet Test Suite** - CWE coverage testing

---

## Improvement Tracking

### Gap Analysis

Identify and prioritize accuracy improvements:

| Gap | Current | Target | Priority | Improvement Plan |
|-----|---------|--------|----------|------------------|
| Python recall | 73% | 78% | High | Improve type inference |
| npm precision | 96% | 98% | Medium | Fix aliasing issues |
| Version accuracy | 94% | 96% | Medium | Better version parsing |

### Quarterly Goals

Track progress against improvement targets:

| Quarter | Focus Area | Metric | Target | Actual |
|---------|------------|--------|--------|--------|
| Q4 2025 | Java reachability | Recall | 88% | TBD |
| Q1 2026 | Python support | F1 | 80% | TBD |
| Q1 2026 | SBOM completeness | Recall | 99% | TBD |

---

## References

- [FIRST CVSS v4.0 Specification](https://www.first.org/cvss/v4.0/specification-document)
- [NIST NVD API](https://nvd.nist.gov/developers)
- [OSV Schema](https://ossf.github.io/osv-schema/)
- [StellaOps Reachability Architecture](../modules/scanner/reachability.md)