Accuracy Metrics Framework

Overview

This document defines the accuracy metrics framework used to measure and track the correctness of StellaOps scanner results. All metrics are computed against ground truth datasets and published quarterly.

Metric Definitions

Confusion Matrix

For binary classification tasks (e.g., reachable vs unreachable):

|                 | Predicted Positive  | Predicted Negative  |
| --------------- | ------------------- | ------------------- |
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |

Core Metrics

| Metric              | Formula               | Description                               | Target |
| ------------------- | --------------------- | ----------------------------------------- | ------ |
| Precision           | TP / (TP + FP)        | Of items flagged, how many were correct   | >= 90% |
| Recall              | TP / (TP + FN)        | Of actual positives, how many were found  | >= 85% |
| F1 Score            | 2 * (P * R) / (P + R) | Harmonic mean of precision and recall     | >= 87% |
| False Positive Rate | FP / (FP + TN)        | Rate of incorrect positive flags          | <= 10% |
| Accuracy            | (TP + TN) / Total     | Overall correctness                       | >= 90% |
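
The formulas above map directly onto confusion-matrix counts. The sketch below is illustrative only; the class and method names are not part of the StellaOps tooling:

```python
from dataclasses import dataclass

@dataclass
class ConfusionMatrix:
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives

    def precision(self) -> float:
        return self.tp / (self.tp + self.fp) if (self.tp + self.fp) else 0.0

    def recall(self) -> float:
        return self.tp / (self.tp + self.fn) if (self.tp + self.fn) else 0.0

    def f1(self) -> float:
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r) if (p + r) else 0.0

    def false_positive_rate(self) -> float:
        return self.fp / (self.fp + self.tn) if (self.fp + self.tn) else 0.0

    def accuracy(self) -> float:
        total = self.tp + self.fp + self.tn + self.fn
        return (self.tp + self.tn) / total if total else 0.0
```

For example, `ConfusionMatrix(tp=86, fp=9, tn=340, fn=15)` yields a precision of about 0.905 and a recall of about 0.851, both just above the baseline targets.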

Reachability Analysis Accuracy

Definitions

  • True Positive (TP): Correctly identified as reachable (code path actually exists)
  • False Positive (FP): Incorrectly identified as reachable (no real code path)
  • True Negative (TN): Correctly identified as unreachable (no code path exists)
  • False Negative (FN): Incorrectly identified as unreachable (code path exists but missed)

Target Metrics

| Metric              | Target | Stretch Goal |
| ------------------- | ------ | ------------ |
| Precision           | >= 90% | >= 95%       |
| Recall              | >= 85% | >= 90%       |
| F1 Score            | >= 87% | >= 92%       |
| False Positive Rate | <= 10% | <= 5%        |

Per-Language Targets

| Language   | Precision | Recall | F1     | Notes                          |
| ---------- | --------- | ------ | ------ | ------------------------------ |
| Java       | >= 92%    | >= 88% | >= 90% | Strong static analysis support |
| C#         | >= 90%    | >= 85% | >= 87% | Roslyn-based analysis          |
| Go         | >= 88%    | >= 82% | >= 85% | Good call graph support        |
| JavaScript | >= 85%    | >= 78% | >= 81% | Dynamic typing challenges      |
| Python     | >= 83%    | >= 75% | >= 79% | Dynamic typing challenges      |
| TypeScript | >= 88%    | >= 82% | >= 85% | Better than JS due to types    |

Lattice State Accuracy

VEX lattice states have different confidence requirements:

| State                    | Definition                           | Target Accuracy | Validation                   |
| ------------------------ | ------------------------------------ | --------------- | ---------------------------- |
| CR (Confirmed Reachable) | Runtime evidence + static path       | >= 95%          | Runtime trace verification   |
| SR (Static Reachable)    | Static path only                     | >= 90%          | Static analysis coverage     |
| SU (Static Unreachable)  | No static path found                 | >= 85%          | Negative proof verification  |
| DT (Denied by Tool)      | Tool analysis confirms not affected  | >= 90%          | Tool output validation       |
| DV (Denied by Vendor)    | Vendor VEX statement                 | >= 95%          | VEX signature verification   |
| U (Unknown)              | Insufficient evidence                | Track %         | Minimize unknowns            |
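
Per-state accuracy is the share of findings assigned to a given lattice state that the ground truth confirms. A minimal sketch, with the states modeled as an enum (the types and names are illustrative, not the scanner's internal representation):

```python
from enum import Enum

class LatticeState(Enum):
    CR = "confirmed_reachable"
    SR = "static_reachable"
    SU = "static_unreachable"
    DT = "denied_by_tool"
    DV = "denied_by_vendor"
    U = "unknown"

def per_state_accuracy(assertions: list[tuple[LatticeState, LatticeState]]) -> dict:
    """assertions: (assigned_state, ground_truth_state) pairs from the curated corpus."""
    totals: dict[LatticeState, int] = {}
    correct: dict[LatticeState, int] = {}
    for assigned, truth in assertions:
        totals[assigned] = totals.get(assigned, 0) + 1
        if assigned == truth:
            correct[assigned] = correct.get(assigned, 0) + 1
    return {state: correct.get(state, 0) / count for state, count in totals.items()}
```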

Lattice Transition Accuracy

Measure accuracy of automatic state transitions:

| Transition | Trigger                            | Target Accuracy |
| ---------- | ---------------------------------- | --------------- |
| U -> SR    | Static analysis finds path         | >= 90%          |
| SR -> CR   | Runtime evidence added             | >= 95%          |
| U -> SU    | Static analysis proves unreachable | >= 85%          |
| SR -> DT   | Tool-specific analysis             | >= 90%          |
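
Transition accuracy follows the same pattern, grouped by transition type: the share of automatic transitions whose resulting state matches the curated expectation. A small illustrative sketch (the data shapes are assumptions):

```python
from collections import defaultdict

def transition_accuracy(transitions) -> dict:
    """transitions: iterable of (kind, predicted_state, expected_state), e.g. ("U -> SR", "SR", "SR")."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for kind, predicted, expected in transitions:
        total[kind] += 1
        if predicted == expected:
            correct[kind] += 1
    return {kind: correct[kind] / total[kind] for kind in total}

# Example: 9 of 10 recorded "U -> SR" transitions match the curated state.
samples = [("U -> SR", "SR", "SR")] * 9 + [("U -> SR", "SR", "SU")]
assert transition_accuracy(samples)["U -> SR"] == 0.9
```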

SBOM Completeness Metrics

Component Detection

| Metric              | Formula                  | Target | Notes                       |
| ------------------- | ------------------------ | ------ | --------------------------- |
| Component Recall    | Found / Total Actual     | >= 98% | Find all real components    |
| Component Precision | Real / Reported          | >= 99% | Minimize phantom components |
| Version Accuracy    | Correct Versions / Total | >= 95% | Version string correctness  |
| License Accuracy    | Correct Licenses / Total | >= 90% | License detection accuracy  |
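
These metrics reduce to set comparisons between the reported SBOM and the reference SBOM. A minimal sketch, assuming components are keyed by (name, ecosystem) and carry a version string (the data shapes are illustrative):

```python
def sbom_metrics(reported: dict, reference: dict) -> dict:
    """reported/reference: {(name, ecosystem): version}"""
    found = reported.keys() & reference.keys()
    correct_versions = sum(1 for key in found if reported[key] == reference[key])
    return {
        "component_recall": len(found) / len(reference) if reference else 0.0,
        "component_precision": len(found) / len(reported) if reported else 0.0,
        "version_accuracy": correct_versions / len(found) if found else 0.0,
    }
```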

Per-Ecosystem Targets

| Ecosystem    | Comp. Recall | Comp. Precision | Version Acc. |
| ------------ | ------------ | --------------- | ------------ |
| Alpine APK   | >= 99%       | >= 99%          | >= 98%       |
| Debian DEB   | >= 99%       | >= 99%          | >= 98%       |
| npm          | >= 97%       | >= 98%          | >= 95%       |
| Maven        | >= 98%       | >= 99%          | >= 96%       |
| NuGet        | >= 98%       | >= 99%          | >= 96%       |
| PyPI         | >= 96%       | >= 98%          | >= 94%       |
| Go Modules   | >= 97%       | >= 98%          | >= 95%       |
| Cargo (Rust) | >= 98%       | >= 99%          | >= 96%       |

Vulnerability Detection Accuracy

CVE Matching

| Metric                 | Formula                      | Target |
| ---------------------- | ---------------------------- | ------ |
| CVE Recall             | Found CVEs / Actual CVEs     | >= 95% |
| CVE Precision          | Correct CVEs / Reported CVEs | >= 98% |
| Version Range Accuracy | Correct Affected / Total     | >= 93% |
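
CVE recall and precision follow the same set-comparison pattern, keyed here by component and CVE identifier, with version range accuracy checked over the matched findings. An illustrative sketch (the keying scheme is an assumption):

```python
def cve_metrics(reported: dict, expected: dict) -> dict:
    """reported/expected: {(component_purl, cve_id): affected_version_range}"""
    found = reported.keys() & expected.keys()
    correct_ranges = sum(1 for key in found if reported[key] == expected[key])
    return {
        "cve_recall": len(found) / len(expected) if expected else 0.0,
        "cve_precision": len(found) / len(reported) if reported else 0.0,
        "version_range_accuracy": correct_ranges / len(found) if found else 0.0,
    }
```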

False Positive Categories

Track and minimize specific FP types:

| FP Type             | Description                   | Target Rate |
| ------------------- | ----------------------------- | ----------- |
| Phantom Component   | CVE for component not present | <= 1%       |
| Version Mismatch    | CVE for wrong version         | <= 3%       |
| Ecosystem Confusion | Wrong package with same name  | <= 1%       |
| Stale Advisory      | Already fixed but flagged     | <= 2%       |
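
During review, each confirmed false positive is assigned one of these categories so that per-category rates can be tracked release over release. A small sketch of the tally, assuming reviewed findings carry an optional category label (field names are illustrative):

```python
from collections import Counter

FP_CATEGORIES = ("phantom_component", "version_mismatch", "ecosystem_confusion", "stale_advisory")

def fp_category_rates(findings: list[dict]) -> dict:
    """findings: reviewed findings, each with an optional "fp_category" label (None if a true positive)."""
    if not findings:
        return {}
    counts = Counter(f.get("fp_category") for f in findings)
    return {category: counts[category] / len(findings) for category in FP_CATEGORIES}
```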

Measurement Methodology

Ground Truth Establishment

  1. Manual Curation

    • Expert review of sample applications
    • Documented decision rationale
    • Multiple reviewer consensus
  2. Automated Verification

    • Cross-reference with authoritative sources
    • NVD, OSV, GitHub Advisory Database
    • Vendor security bulletins
  3. Runtime Validation

    • Dynamic analysis confirmation
    • Exploit proof-of-concept testing
    • Production monitoring correlation

Test Corpus Requirements

| Category      | Minimum Samples     | Diversity Requirements       |
| ------------- | ------------------- | ---------------------------- |
| Reachability  | 50 per language     | Mix of libraries, frameworks |
| SBOM          | 100 images          | All major ecosystems         |
| CVE Detection | 500 CVEs            | Mix of severities, ages      |
| Performance   | 10 reference images | Various sizes                |

Measurement Process

1. Select ground truth corpus
   └── Minimum samples per category
   └── Representative of production workloads

2. Run scanner with deterministic manifest
   └── Fixed advisory database version
   └── Reproducible configuration

3. Compare results to ground truth
   └── Automated diff tooling
   └── Manual review of discrepancies

4. Compute metrics per category
   └── Generate confusion matrices
   └── Calculate precision/recall/F1

5. Aggregate and publish
   └── Per-ecosystem breakdown
   └── Overall summary metrics
   └── Trend analysis
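
Steps 3 and 4 amount to diffing scanner verdicts against the curated labels and folding the differences into a confusion matrix. A hedged sketch that reuses the ConfusionMatrix class from the earlier example (the file layout and field names are assumptions):

```python
import json
from pathlib import Path

def compare_to_ground_truth(results_path: str, truth_path: str) -> ConfusionMatrix:
    """Both files are assumed to map sample IDs to a boolean reachability verdict."""
    results = json.loads(Path(results_path).read_text())
    truth = json.loads(Path(truth_path).read_text())
    cm = ConfusionMatrix(tp=0, fp=0, tn=0, fn=0)
    for sample_id, expected in truth.items():
        predicted = results.get(sample_id, False)  # a missing verdict counts as unreachable
        if expected and predicted:
            cm.tp += 1
        elif expected:
            cm.fn += 1
        elif predicted:
            cm.fp += 1
        else:
            cm.tn += 1
    return cm
```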

Reporting Format

Quarterly Benchmark Report

{
  "report_version": "1.0",
  "scanner_version": "1.3.0",
  "report_date": "2025-12-14",
  "ground_truth_version": "2025-Q4",

  "reachability": {
    "overall": {
      "precision": 0.91,
      "recall": 0.86,
      "f1": 0.88,
      "samples": 450
    },
    "by_language": {
      "java": {"precision": 0.93, "recall": 0.88, "f1": 0.90, "samples": 100},
      "csharp": {"precision": 0.90, "recall": 0.85, "f1": 0.87, "samples": 80},
      "go": {"precision": 0.89, "recall": 0.83, "f1": 0.86, "samples": 70}
    }
  },

  "sbom": {
    "component_recall": 0.98,
    "component_precision": 0.99,
    "version_accuracy": 0.96
  },

  "vulnerability": {
    "cve_recall": 0.96,
    "cve_precision": 0.98,
    "false_positive_rate": 0.02
  },

  "lattice_states": {
    "cr_accuracy": 0.96,
    "sr_accuracy": 0.91,
    "su_accuracy": 0.87
  }
}

Regression Detection

Thresholds

A regression is flagged when:

| Metric     | Regression Threshold | Action        |
| ---------- | -------------------- | ------------- |
| Precision  | > 3% decrease        | Block release |
| Recall     | > 5% decrease        | Block release |
| F1         | > 4% decrease        | Block release |
| FPR        | > 2% increase        | Block release |
| Any metric | > 1% change          | Investigate   |
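
The thresholds above can be checked mechanically against the previous baseline report. The standalone sketch below is illustrative only and is independent of the `stellaops benchmark compare` command used in CI:

```python
REGRESSION_THRESHOLDS = {
    "precision": 0.03,  # block release on a drop of more than 3 percentage points
    "recall": 0.05,
    "f1": 0.04,
}

def find_regressions(baseline: dict, current: dict) -> list[str]:
    """baseline/current: metric name -> value, expressed as fractions (0.0-1.0)."""
    regressions = []
    for metric, threshold in REGRESSION_THRESHOLDS.items():
        drop = baseline[metric] - current[metric]
        if drop > threshold:
            regressions.append(f"{metric} dropped by {drop:.3f} (threshold {threshold})")
    # FPR is gated on increases rather than decreases.
    if current["false_positive_rate"] - baseline["false_positive_rate"] > 0.02:
        regressions.append("false_positive_rate increased by more than 0.02")
    return regressions
```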

CI Integration

# .gitea/workflows/accuracy-check.yml
on: [push, pull_request]  # trigger shown for illustration; adjust to repository policy

jobs:
  accuracy-benchmark:
    runs-on: ubuntu-latest
    steps:
      - name: Run accuracy benchmark
        run: make benchmark-accuracy

      - name: Check for regressions
        run: |
          stellaops benchmark compare \
            --baseline results/baseline.json \
            --current results/current.json \
            --threshold-precision 0.03 \
            --threshold-recall 0.05 \
            --fail-on-regression

Ground Truth Sources

Internal

  • datasets/reachability/samples/ - Reachability ground truth
  • datasets/sbom/reference/ - Known-good SBOMs
  • bench/findings/ - CVE finding ground truth

External

  • NIST SARD - Software Assurance Reference Dataset
  • OSV Test Suite - Open Source Vulnerability test cases
  • OWASP Benchmark - Security testing benchmark
  • Juliet Test Suite - CWE coverage testing

Improvement Tracking

Gap Analysis

Identify and prioritize accuracy improvements:

| Gap              | Current | Target | Priority | Improvement Plan       |
| ---------------- | ------- | ------ | -------- | ---------------------- |
| Python recall    | 73%     | 78%    | High     | Improve type inference |
| npm precision    | 96%     | 98%    | Medium   | Fix aliasing issues    |
| Version accuracy | 94%     | 96%    | Medium   | Better version parsing |

Quarterly Goals

Track progress against improvement targets:

| Quarter | Focus Area        | Metric | Target | Actual |
| ------- | ----------------- | ------ | ------ | ------ |
| Q4 2025 | Java reachability | Recall | 88%    | TBD    |
| Q1 2026 | Python support    | F1     | 80%    | TBD    |
| Q1 2026 | SBOM completeness | Recall | 99%    | TBD    |

References