Accuracy Metrics Framework

Overview

This document defines the accuracy metrics framework used to measure and track the correctness of StellaOps scanner results. All metrics are computed against ground truth datasets and published quarterly.

Metric Definitions

Confusion Matrix

For binary classification tasks (e.g., reachable vs unreachable):

|                 | Predicted Positive  | Predicted Negative  |
| --------------- | ------------------- | ------------------- |
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |

Core Metrics

| Metric              | Formula               | Description                               | Target |
| ------------------- | --------------------- | ----------------------------------------- | ------ |
| Precision           | TP / (TP + FP)        | Of items flagged, how many were correct   | >= 90% |
| Recall              | TP / (TP + FN)        | Of actual positives, how many were found  | >= 85% |
| F1 Score            | 2 * (P * R) / (P + R) | Harmonic mean of precision and recall     | >= 87% |
| False Positive Rate | FP / (FP + TN)        | Rate of incorrect positive flags          | <= 10% |
| Accuracy            | (TP + TN) / Total     | Overall correctness                       | >= 90% |
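
The formulas above map directly onto confusion-matrix counts. The sketch below is illustrative only; the class and method names are not part of the StellaOps tooling:

```python
from dataclasses import dataclass

@dataclass
class ConfusionMatrix:
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives

    def precision(self) -> float:
        return self.tp / (self.tp + self.fp) if (self.tp + self.fp) else 0.0

    def recall(self) -> float:
        return self.tp / (self.tp + self.fn) if (self.tp + self.fn) else 0.0

    def f1(self) -> float:
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r) if (p + r) else 0.0

    def false_positive_rate(self) -> float:
        return self.fp / (self.fp + self.tn) if (self.fp + self.tn) else 0.0

    def accuracy(self) -> float:
        total = self.tp + self.fp + self.tn + self.fn
        return (self.tp + self.tn) / total if total else 0.0
```

For example, `ConfusionMatrix(tp=86, fp=9, tn=340, fn=15)` yields a precision of about 0.905 and a recall of about 0.851, both just above the baseline targets.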

Reachability Analysis Accuracy

Definitions

  • True Positive (TP): Correctly identified as reachable (code path actually exists)
  • False Positive (FP): Incorrectly identified as reachable (no real code path)
  • True Negative (TN): Correctly identified as unreachable (no code path exists)
  • False Negative (FN): Incorrectly identified as unreachable (code path exists but missed)

Target Metrics

| Metric              | Target | Stretch Goal |
| ------------------- | ------ | ------------ |
| Precision           | >= 90% | >= 95%       |
| Recall              | >= 85% | >= 90%       |
| F1 Score            | >= 87% | >= 92%       |
| False Positive Rate | <= 10% | <= 5%        |

Per-Language Targets

| Language   | Precision | Recall | F1     | Notes                          |
| ---------- | --------- | ------ | ------ | ------------------------------ |
| Java       | >= 92%    | >= 88% | >= 90% | Strong static analysis support |
| C#         | >= 90%    | >= 85% | >= 87% | Roslyn-based analysis          |
| Go         | >= 88%    | >= 82% | >= 85% | Good call graph support        |
| JavaScript | >= 85%    | >= 78% | >= 81% | Dynamic typing challenges      |
| Python     | >= 83%    | >= 75% | >= 79% | Dynamic typing challenges      |
| TypeScript | >= 88%    | >= 82% | >= 85% | Better than JS due to types    |

Lattice State Accuracy

VEX lattice states have different confidence requirements:

| State                    | Definition                           | Target Accuracy | Validation                   |
| ------------------------ | ------------------------------------ | --------------- | ---------------------------- |
| CR (Confirmed Reachable) | Runtime evidence + static path       | >= 95%          | Runtime trace verification   |
| SR (Static Reachable)    | Static path only                     | >= 90%          | Static analysis coverage     |
| SU (Static Unreachable)  | No static path found                 | >= 85%          | Negative proof verification  |
| DT (Denied by Tool)      | Tool analysis confirms not affected  | >= 90%          | Tool output validation       |
| DV (Denied by Vendor)    | Vendor VEX statement                 | >= 95%          | VEX signature verification   |
| U (Unknown)              | Insufficient evidence                | Track %         | Minimize unknowns            |
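
Per-state accuracy is the share of findings assigned to a given lattice state that the ground truth confirms. A minimal sketch, with the states modeled as an enum (the types and names are illustrative, not the scanner's internal representation):

```python
from enum import Enum

class LatticeState(Enum):
    CR = "confirmed_reachable"
    SR = "static_reachable"
    SU = "static_unreachable"
    DT = "denied_by_tool"
    DV = "denied_by_vendor"
    U = "unknown"

def per_state_accuracy(assertions: list[tuple[LatticeState, LatticeState]]) -> dict:
    """assertions: (assigned_state, ground_truth_state) pairs from the curated corpus."""
    totals: dict[LatticeState, int] = {}
    correct: dict[LatticeState, int] = {}
    for assigned, truth in assertions:
        totals[assigned] = totals.get(assigned, 0) + 1
        if assigned == truth:
            correct[assigned] = correct.get(assigned, 0) + 1
    return {state: correct.get(state, 0) / count for state, count in totals.items()}
```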

Lattice Transition Accuracy

Measure accuracy of automatic state transitions:

| Transition | Trigger                            | Target Accuracy |
| ---------- | ---------------------------------- | --------------- |
| U -> SR    | Static analysis finds path         | >= 90%          |
| SR -> CR   | Runtime evidence added             | >= 95%          |
| U -> SU    | Static analysis proves unreachable | >= 85%          |
| SR -> DT   | Tool-specific analysis             | >= 90%          |
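
Transition accuracy follows the same pattern, grouped by transition type: the share of automatic transitions whose resulting state matches the curated expectation. A small illustrative sketch (the data shapes are assumptions):

```python
from collections import defaultdict

def transition_accuracy(transitions) -> dict:
    """transitions: iterable of (kind, predicted_state, expected_state), e.g. ("U -> SR", "SR", "SR")."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for kind, predicted, expected in transitions:
        total[kind] += 1
        if predicted == expected:
            correct[kind] += 1
    return {kind: correct[kind] / total[kind] for kind in total}

# Example: 9 of 10 recorded "U -> SR" transitions match the curated state.
samples = [("U -> SR", "SR", "SR")] * 9 + [("U -> SR", "SR", "SU")]
assert transition_accuracy(samples)["U -> SR"] == 0.9
```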

SBOM Completeness Metrics

Component Detection

| Metric              | Formula                  | Target | Notes                       |
| ------------------- | ------------------------ | ------ | --------------------------- |
| Component Recall    | Found / Total Actual     | >= 98% | Find all real components    |
| Component Precision | Real / Reported          | >= 99% | Minimize phantom components |
| Version Accuracy    | Correct Versions / Total | >= 95% | Version string correctness  |
| License Accuracy    | Correct Licenses / Total | >= 90% | License detection accuracy  |
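
These metrics reduce to set comparisons between the reported SBOM and the reference SBOM. A minimal sketch, assuming components are keyed by (name, ecosystem) and carry a version string (the data shapes are illustrative):

```python
def sbom_metrics(reported: dict, reference: dict) -> dict:
    """reported/reference: {(name, ecosystem): version}"""
    found = reported.keys() & reference.keys()
    correct_versions = sum(1 for key in found if reported[key] == reference[key])
    return {
        "component_recall": len(found) / len(reference) if reference else 0.0,
        "component_precision": len(found) / len(reported) if reported else 0.0,
        "version_accuracy": correct_versions / len(found) if found else 0.0,
    }
```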

Per-Ecosystem Targets

| Ecosystem    | Comp. Recall | Comp. Precision | Version Acc. |
| ------------ | ------------ | --------------- | ------------ |
| Alpine APK   | >= 99%       | >= 99%          | >= 98%       |
| Debian DEB   | >= 99%       | >= 99%          | >= 98%       |
| npm          | >= 97%       | >= 98%          | >= 95%       |
| Maven        | >= 98%       | >= 99%          | >= 96%       |
| NuGet        | >= 98%       | >= 99%          | >= 96%       |
| PyPI         | >= 96%       | >= 98%          | >= 94%       |
| Go Modules   | >= 97%       | >= 98%          | >= 95%       |
| Cargo (Rust) | >= 98%       | >= 99%          | >= 96%       |

Vulnerability Detection Accuracy

CVE Matching

| Metric                 | Formula                      | Target |
| ---------------------- | ---------------------------- | ------ |
| CVE Recall             | Found CVEs / Actual CVEs     | >= 95% |
| CVE Precision          | Correct CVEs / Reported CVEs | >= 98% |
| Version Range Accuracy | Correct Affected / Total     | >= 93% |
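
CVE recall and precision follow the same set-comparison pattern, keyed here by component and CVE identifier, with version range accuracy checked over the matched findings. An illustrative sketch (the keying scheme is an assumption):

```python
def cve_metrics(reported: dict, expected: dict) -> dict:
    """reported/expected: {(component_purl, cve_id): affected_version_range}"""
    found = reported.keys() & expected.keys()
    correct_ranges = sum(1 for key in found if reported[key] == expected[key])
    return {
        "cve_recall": len(found) / len(expected) if expected else 0.0,
        "cve_precision": len(found) / len(reported) if reported else 0.0,
        "version_range_accuracy": correct_ranges / len(found) if found else 0.0,
    }
```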

False Positive Categories

Track and minimize specific FP types:

| FP Type             | Description                   | Target Rate |
| ------------------- | ----------------------------- | ----------- |
| Phantom Component   | CVE for component not present | <= 1%       |
| Version Mismatch    | CVE for wrong version         | <= 3%       |
| Ecosystem Confusion | Wrong package with same name  | <= 1%       |
| Stale Advisory      | Already fixed but flagged     | <= 2%       |
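
During review, each confirmed false positive is assigned one of these categories so that per-category rates can be tracked release over release. A small sketch of the tally, assuming reviewed findings carry an optional category label (field names are illustrative):

```python
from collections import Counter

FP_CATEGORIES = ("phantom_component", "version_mismatch", "ecosystem_confusion", "stale_advisory")

def fp_category_rates(findings: list[dict]) -> dict:
    """findings: reviewed findings, each with an optional "fp_category" label (None if a true positive)."""
    if not findings:
        return {}
    counts = Counter(f.get("fp_category") for f in findings)
    return {category: counts[category] / len(findings) for category in FP_CATEGORIES}
```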

Measurement Methodology

Ground Truth Establishment

  1. Manual Curation

    • Expert review of sample applications
    • Documented decision rationale
    • Multiple reviewer consensus
  2. Automated Verification

    • Cross-reference with authoritative sources
    • NVD, OSV, GitHub Advisory Database
    • Vendor security bulletins
  3. Runtime Validation

    • Dynamic analysis confirmation
    • Exploit proof-of-concept testing
    • Production monitoring correlation

Test Corpus Requirements

| Category      | Minimum Samples     | Diversity Requirements       |
| ------------- | ------------------- | ---------------------------- |
| Reachability  | 50 per language     | Mix of libraries, frameworks |
| SBOM          | 100 images          | All major ecosystems         |
| CVE Detection | 500 CVEs            | Mix of severities, ages      |
| Performance   | 10 reference images | Various sizes                |

Measurement Process

1. Select ground truth corpus
   └── Minimum samples per category
   └── Representative of production workloads

2. Run scanner with deterministic manifest
   └── Fixed advisory database version
   └── Reproducible configuration

3. Compare results to ground truth
   └── Automated diff tooling
   └── Manual review of discrepancies

4. Compute metrics per category
   └── Generate confusion matrices
   └── Calculate precision/recall/F1

5. Aggregate and publish
   └── Per-ecosystem breakdown
   └── Overall summary metrics
   └── Trend analysis
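
Steps 3 and 4 amount to diffing scanner verdicts against the curated labels and folding the differences into a confusion matrix. A hedged sketch that reuses the ConfusionMatrix class from the earlier example (the file layout and field names are assumptions):

```python
import json
from pathlib import Path

def compare_to_ground_truth(results_path: str, truth_path: str) -> ConfusionMatrix:
    """Both files are assumed to map sample IDs to a boolean reachability verdict."""
    results = json.loads(Path(results_path).read_text())
    truth = json.loads(Path(truth_path).read_text())
    cm = ConfusionMatrix(tp=0, fp=0, tn=0, fn=0)
    for sample_id, expected in truth.items():
        predicted = results.get(sample_id, False)  # a missing verdict counts as unreachable
        if expected and predicted:
            cm.tp += 1
        elif expected:
            cm.fn += 1
        elif predicted:
            cm.fp += 1
        else:
            cm.tn += 1
    return cm
```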

Reporting Format

Quarterly Benchmark Report

{
  "report_version": "1.0",
  "scanner_version": "1.3.0",
  "report_date": "2025-12-14",
  "ground_truth_version": "2025-Q4",

  "reachability": {
    "overall": {
      "precision": 0.91,
      "recall": 0.86,
      "f1": 0.88,
      "samples": 450
    },
    "by_language": {
      "java": {"precision": 0.93, "recall": 0.88, "f1": 0.90, "samples": 100},
      "csharp": {"precision": 0.90, "recall": 0.85, "f1": 0.87, "samples": 80},
      "go": {"precision": 0.89, "recall": 0.83, "f1": 0.86, "samples": 70}
    }
  },

  "sbom": {
    "component_recall": 0.98,
    "component_precision": 0.99,
    "version_accuracy": 0.96
  },

  "vulnerability": {
    "cve_recall": 0.96,
    "cve_precision": 0.98,
    "false_positive_rate": 0.02
  },

  "lattice_states": {
    "cr_accuracy": 0.96,
    "sr_accuracy": 0.91,
    "su_accuracy": 0.87
  }
}

Regression Detection

Thresholds

A regression is flagged when:

| Metric     | Regression Threshold | Action        |
| ---------- | -------------------- | ------------- |
| Precision  | > 3% decrease        | Block release |
| Recall     | > 5% decrease        | Block release |
| F1         | > 4% decrease        | Block release |
| FPR        | > 2% increase        | Block release |
| Any metric | > 1% change          | Investigate   |
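
The thresholds above can be checked mechanically against the previous baseline report. The standalone sketch below is illustrative only and is independent of the `stellaops benchmark compare` command used in CI:

```python
REGRESSION_THRESHOLDS = {
    "precision": 0.03,  # block release on a drop of more than 3 percentage points
    "recall": 0.05,
    "f1": 0.04,
}

def find_regressions(baseline: dict, current: dict) -> list[str]:
    """baseline/current: metric name -> value, expressed as fractions (0.0-1.0)."""
    regressions = []
    for metric, threshold in REGRESSION_THRESHOLDS.items():
        drop = baseline[metric] - current[metric]
        if drop > threshold:
            regressions.append(f"{metric} dropped by {drop:.3f} (threshold {threshold})")
    # FPR is gated on increases rather than decreases.
    if current["false_positive_rate"] - baseline["false_positive_rate"] > 0.02:
        regressions.append("false_positive_rate increased by more than 0.02")
    return regressions
```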

CI Integration

# .gitea/workflows/accuracy-check.yml
on: [push, pull_request]  # trigger shown for illustration; adjust to repository policy

jobs:
  accuracy-benchmark:
    runs-on: ubuntu-latest
    steps:
      - name: Run accuracy benchmark
        run: make benchmark-accuracy

      - name: Check for regressions
        run: |
          stellaops benchmark compare \
            --baseline results/baseline.json \
            --current results/current.json \
            --threshold-precision 0.03 \
            --threshold-recall 0.05 \
            --fail-on-regression

Ground Truth Sources

Internal

  • datasets/reachability/samples/ - Reachability ground truth
  • datasets/sbom/reference/ - Known-good SBOMs
  • bench/findings/ - CVE finding ground truth

External

  • NIST SARD - Software Assurance Reference Dataset
  • OSV Test Suite - Open Source Vulnerability test cases
  • OWASP Benchmark - Security testing benchmark
  • Juliet Test Suite - CWE coverage testing

Improvement Tracking

Gap Analysis

Identify and prioritize accuracy improvements:

| Gap              | Current | Target | Priority | Improvement Plan       |
| ---------------- | ------- | ------ | -------- | ---------------------- |
| Python recall    | 73%     | 78%    | High     | Improve type inference |
| npm precision    | 96%     | 98%    | Medium   | Fix aliasing issues    |
| Version accuracy | 94%     | 96%    | Medium   | Better version parsing |

Quarterly Goals

Track progress against improvement targets:

| Quarter | Focus Area        | Metric | Target | Actual |
| ------- | ----------------- | ------ | ------ | ------ |
| Q4 2025 | Java reachability | Recall | 88%    | TBD    |
| Q1 2026 | Python support    | F1     | 80%    | TBD    |
| Q1 2026 | SBOM completeness | Recall | 99%    | TBD    |

References