# Accuracy Metrics Framework

## Overview

This document defines the accuracy metrics framework used to measure and track StellaOps scanner performance. All metrics are computed against ground truth datasets and published quarterly.

## Metric Definitions

### Confusion Matrix

For binary classification tasks (e.g., reachable vs unreachable):

|                 | Predicted Positive  | Predicted Negative  |
|-----------------|---------------------|---------------------|
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |

### Core Metrics

| Metric | Formula | Description | Target |
|--------|---------|-------------|--------|
| Precision | TP / (TP + FP) | Of items flagged, how many were correct | >= 90% |
| Recall | TP / (TP + FN) | Of actual positives, how many were found | >= 85% |
| F1 Score | 2 * (P * R) / (P + R) | Harmonic mean of precision and recall | >= 87% |
| False Positive Rate | FP / (FP + TN) | Rate of incorrect positive flags | <= 10% |
| Accuracy | (TP + TN) / Total | Overall correctness | >= 90% |

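To make the formulas concrete, here is a small Python sketch (illustrative only, not part of the scanner codebase) that derives all five metrics from raw confusion-matrix counts, guarding against empty denominators:

```python
from dataclasses import dataclass

@dataclass
class ConfusionMatrix:
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives

    @property
    def precision(self) -> float:
        return self.tp / (self.tp + self.fp) if self.tp + self.fp else 0.0

    @property
    def recall(self) -> float:
        return self.tp / (self.tp + self.fn) if self.tp + self.fn else 0.0

    @property
    def f1(self) -> float:
        # Harmonic mean of precision and recall.
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r) if p + r else 0.0

    @property
    def false_positive_rate(self) -> float:
        return self.fp / (self.fp + self.tn) if self.fp + self.tn else 0.0

    @property
    def accuracy(self) -> float:
        total = self.tp + self.fp + self.tn + self.fn
        return (self.tp + self.tn) / total if total else 0.0

# Example: 90 true reachable findings, 10 spurious, 85 correctly excluded, 15 missed.
cm = ConfusionMatrix(tp=90, fp=10, tn=85, fn=15)
print(f"precision={cm.precision:.2%} recall={cm.recall:.2%} f1={cm.f1:.2%}")
```
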
## Reachability Analysis Accuracy

### Definitions

- True Positive (TP): Correctly identified as reachable (code path actually exists)
- False Positive (FP): Incorrectly identified as reachable (no real code path)
- True Negative (TN): Correctly identified as unreachable (no code path exists)
- False Negative (FN): Incorrectly identified as unreachable (code path exists but missed)

### Target Metrics

| Metric | Target | Stretch Goal |
|--------|--------|--------------|
| Precision | >= 90% | >= 95% |
| Recall | >= 85% | >= 90% |
| F1 Score | >= 87% | >= 92% |
| False Positive Rate | <= 10% | <= 5% |

### Per-Language Targets

| Language | Precision | Recall | F1 | Notes |
|----------|-----------|--------|----|-------|
| Java | >= 92% | >= 88% | >= 90% | Strong static analysis support |
| C# | >= 90% | >= 85% | >= 87% | Roslyn-based analysis |
| Go | >= 88% | >= 82% | >= 85% | Good call graph support |
| JavaScript | >= 85% | >= 78% | >= 81% | Dynamic typing challenges |
| Python | >= 83% | >= 75% | >= 79% | Dynamic typing challenges |
| TypeScript | >= 88% | >= 82% | >= 85% | Better than JS due to types |

## Lattice State Accuracy

VEX lattice states have different confidence requirements:

| State | Definition | Target Accuracy | Validation |
|-------|------------|-----------------|------------|
| CR (Confirmed Reachable) | Runtime evidence + static path | >= 95% | Runtime trace verification |
| SR (Static Reachable) | Static path only | >= 90% | Static analysis coverage |
| SU (Static Unreachable) | No static path found | >= 85% | Negative proof verification |
| DT (Denied by Tool) | Tool analysis confirms not affected | >= 90% | Tool output validation |
| DV (Denied by Vendor) | Vendor VEX statement | >= 95% | VEX signature verification |
| U (Unknown) | Insufficient evidence | Track % | Minimize unknowns |

### Lattice Transition Accuracy

Measure accuracy of automatic state transitions:

| Transition | Trigger | Target Accuracy |
|------------|---------|-----------------|
| U -> SR | Static analysis finds path | >= 90% |
| SR -> CR | Runtime evidence added | >= 95% |
| U -> SU | Static analysis proves unreachable | >= 85% |
| SR -> DT | Tool-specific analysis | >= 90% |

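As a rough illustration of how transition accuracy could be measured, the sketch below compares automatically recorded transitions against expert-labeled ground truth. The data shapes and finding IDs are hypothetical, not the scanner's actual schema:

```python
from collections import defaultdict

# Lattice states as defined in the table above.
STATES = {"CR", "SR", "SU", "DT", "DV", "U"}

def transition_accuracy(observed, ground_truth):
    """Per-transition accuracy: the fraction of automatic transitions that
    ground truth agrees with.  Both arguments map a finding ID to its
    (from_state, to_state) transition."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for finding_id, transition in observed.items():
        total[transition] += 1
        if ground_truth.get(finding_id) == transition:
            correct[transition] += 1
    return {t: correct[t] / n for t, n in total.items()}

# Hypothetical example: two U -> SR transitions, one of which ground truth rejects.
observed = {"CVE-1": ("U", "SR"), "CVE-2": ("U", "SR"), "CVE-3": ("SR", "CR")}
truth    = {"CVE-1": ("U", "SR"), "CVE-2": ("U", "SU"), "CVE-3": ("SR", "CR")}
print(transition_accuracy(observed, truth))  # {('U', 'SR'): 0.5, ('SR', 'CR'): 1.0}
```
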
## SBOM Completeness Metrics

### Component Detection

| Metric | Formula | Target | Notes |
|--------|---------|--------|-------|
| Component Recall | Found / Total Actual | >= 98% | Find all real components |
| Component Precision | Real / Reported | >= 99% | Minimize phantom components |
| Version Accuracy | Correct Versions / Total | >= 95% | Version string correctness |
| License Accuracy | Correct Licenses / Total | >= 90% | License detection accuracy |

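These formulas reduce to set comparisons between the reported SBOM and the reference inventory. A minimal sketch, assuming components are keyed by `(name, version)` tuples and that version accuracy is measured over the components found (the exact denominator is an assumption):

```python
def sbom_metrics(reported: set, actual: set) -> dict:
    """Compare a reported component set against a reference inventory.
    Components are (name, version) tuples: names drive recall/precision,
    versions drive version accuracy."""
    reported_names = {name for name, _ in reported}
    actual_names = {name for name, _ in actual}
    found = reported_names & actual_names
    # Of the components found, how many also carry the correct version.
    correct_versions = reported & actual
    return {
        "component_recall": len(found) / len(actual_names),
        "component_precision": len(found) / len(reported_names),
        "version_accuracy": len(correct_versions) / len(found) if found else 0.0,
    }

reported = {("openssl", "3.0.13"), ("zlib", "1.3"), ("phantom-lib", "1.0")}
actual   = {("openssl", "3.0.13"), ("zlib", "1.3.1")}
print(sbom_metrics(reported, actual))
# component_recall=1.0, component_precision=0.67, version_accuracy=0.5
```
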
### Per-Ecosystem Targets

| Ecosystem | Comp. Recall | Comp. Precision | Version Acc. |
|-----------|--------------|-----------------|--------------|
| Alpine APK | >= 99% | >= 99% | >= 98% |
| Debian DEB | >= 99% | >= 99% | >= 98% |
| npm | >= 97% | >= 98% | >= 95% |
| Maven | >= 98% | >= 99% | >= 96% |
| NuGet | >= 98% | >= 99% | >= 96% |
| PyPI | >= 96% | >= 98% | >= 94% |
| Go Modules | >= 97% | >= 98% | >= 95% |
| Cargo (Rust) | >= 98% | >= 99% | >= 96% |

## Vulnerability Detection Accuracy

### CVE Matching

| Metric | Formula | Target |
|--------|---------|--------|
| CVE Recall | Found CVEs / Actual CVEs | >= 95% |
| CVE Precision | Correct CVEs / Reported CVEs | >= 98% |
| Version Range Accuracy | Correct Affected / Total | >= 93% |

### False Positive Categories

Track and minimize specific FP types:

| FP Type | Description | Target Rate |
|---------|-------------|-------------|
| Phantom Component | CVE for component not present | <= 1% |
| Version Mismatch | CVE for wrong version | <= 3% |
| Ecosystem Confusion | Wrong package with same name | <= 1% |
| Stale Advisory | Already fixed but flagged | <= 2% |

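A sketch of how a confirmed false positive might be bucketed into these categories; the finding and inventory field names are illustrative, not the scanner's actual schema:

```python
def classify_false_positive(finding, inventory):
    """Bucket a confirmed false positive into one of the tracked categories.
    `finding` carries the reported package/version/ecosystem and fix status;
    `inventory` is the ground-truth component list."""
    matches = [c for c in inventory if c["name"] == finding["package"]]
    if not matches:
        return "phantom_component"    # CVE for a component not present
    if all(c["ecosystem"] != finding["ecosystem"] for c in matches):
        return "ecosystem_confusion"  # same name, wrong package ecosystem
    if all(c["version"] != finding["version"] for c in matches):
        return "version_mismatch"     # component present, wrong version flagged
    if finding.get("fixed_in_inventory"):
        return "stale_advisory"       # vulnerability already remediated
    return "other"

finding = {"package": "lodash", "version": "4.17.20", "ecosystem": "npm"}
inventory = [{"name": "lodash", "version": "4.17.21", "ecosystem": "npm"}]
print(classify_false_positive(finding, inventory))  # version_mismatch
```
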
## Measurement Methodology

### Ground Truth Establishment

- **Manual Curation**
  - Expert review of sample applications
  - Documented decision rationale
  - Multiple reviewer consensus
- **Automated Verification**
  - Cross-reference with authoritative sources: NVD, OSV, GitHub Advisory Database
  - Vendor security bulletins
- **Runtime Validation**
  - Dynamic analysis confirmation
  - Exploit proof-of-concept testing
  - Production monitoring correlation

### Test Corpus Requirements

| Category | Minimum Samples | Diversity Requirements |
|----------|-----------------|------------------------|
| Reachability | 50 per language | Mix of libraries, frameworks |
| SBOM | 100 images | All major ecosystems |
| CVE Detection | 500 CVEs | Mix of severities, ages |
| Performance | 10 reference images | Various sizes |

### Measurement Process

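Each benchmark run scans the corpus, diffs scanner output against the ground truth datasets, and aggregates confusion-matrix counts per category. A minimal sketch, with hypothetical stand-ins for the scanner and the ground-truth store:

```python
def run_benchmark(corpus, scanner, ground_truth):
    """One benchmark pass: scan every sample, diff the findings against
    the expert-labeled reference, and collect per-sample counts.
    `scanner.scan` and `ground_truth` are hypothetical stand-ins.
    True negatives are omitted here; they require the full candidate
    universe, which depends on the category being measured."""
    results = {}
    for sample in corpus:
        predicted = scanner.scan(sample)    # set of reported finding IDs
        expected = ground_truth[sample.id]  # set of ground-truth finding IDs
        results[sample.id] = {
            "tp": len(predicted & expected),
            "fp": len(predicted - expected),
            "fn": len(expected - predicted),
        }
    return results
```
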
### Reporting Format

#### Quarterly Benchmark Report

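The report template itself is not reproduced here; the fields below, sketched as a Python dataclass for consistency with the other examples, are inferred from the tracking tables in this document:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BenchmarkReportEntry:
    """One row of the quarterly benchmark report; the field set mirrors
    the Quarterly Goals table (quarter, focus area, metric, target, actual)."""
    quarter: str             # e.g. "Q4 2025"
    focus_area: str          # e.g. "Java reachability"
    metric: str              # e.g. "Recall"
    target: float            # e.g. 0.88
    actual: Optional[float]  # None until the quarter's measurement lands
```
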
## Regression Detection

### Thresholds

A regression is flagged when:

| Metric | Regression Threshold | Action |
|--------|----------------------|--------|
| Precision | > 3% decrease | Block release |
| Recall | > 5% decrease | Block release |
| F1 | > 4% decrease | Block release |
| FPR | > 2% increase | Block release |
| Any metric | > 1% change | Investigate |

### CI Integration

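The concrete CI wiring is not specified here; one possible gate script, sketched in Python with assumed JSON metric files and field names, applies the thresholds above and fails the build on a blocking regression:

```python
import json
import sys

# Blocking thresholds from the table above, expressed as signed deltas.
BLOCKING = {"precision": -0.03, "recall": -0.05, "f1": -0.04, "fpr": 0.02}
INVESTIGATE = 0.01  # any metric moving by more than 1% warrants investigation

def check_regressions(baseline_path: str, current_path: str) -> int:
    """Compare current metrics against the stored baseline; return a
    nonzero exit code if any blocking threshold is crossed."""
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(current_path) as f:
        current = json.load(f)
    exit_code = 0
    for metric, limit in BLOCKING.items():
        delta = current[metric] - baseline[metric]
        crossed = delta < limit if limit < 0 else delta > limit
        if crossed:
            print(f"BLOCK: {metric} changed by {delta:+.2%}")
            exit_code = 1
        elif abs(delta) > INVESTIGATE:
            print(f"INVESTIGATE: {metric} changed by {delta:+.2%}")
    return exit_code

if __name__ == "__main__":
    sys.exit(check_regressions(sys.argv[1], sys.argv[2]))
```
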
## Ground Truth Sources

### Internal

- `datasets/reachability/samples/` - Reachability ground truth
- `datasets/sbom/reference/` - Known-good SBOMs
- `bench/findings/` - CVE finding ground truth

### External

- NIST SARD - Software Assurance Reference Dataset
- OSV Test Suite - Open Source Vulnerability test cases
- OWASP Benchmark - Security testing benchmark
- Juliet Test Suite - CWE coverage testing

## Improvement Tracking

### Gap Analysis

Identify and prioritize accuracy improvements:

| Gap | Current | Target | Priority | Improvement Plan |
|-----|---------|--------|----------|------------------|
| Python recall | 73% | 78% | High | Improve type inference |
| npm precision | 96% | 98% | Medium | Fix aliasing issues |
| Version accuracy | 94% | 96% | Medium | Better version parsing |

### Quarterly Goals

Track progress against improvement targets:

| Quarter | Focus Area | Metric | Target | Actual |
|---------|------------|--------|--------|--------|
| Q4 2025 | Java reachability | Recall | 88% | TBD |
| Q1 2026 | Python support | F1 | 80% | TBD |
| Q1 2026 | SBOM completeness | Recall | 99% | TBD |