# Accuracy Metrics Framework

## Overview

This document defines the accuracy metrics framework used to measure and track StellaOps scanner performance. All metrics are computed against ground truth datasets and published quarterly.

## Metric Definitions

### Confusion Matrix

For binary classification tasks (e.g., reachable vs unreachable):

| | Predicted Positive | Predicted Negative |
|--|-------------------|-------------------|
| **Actual Positive** | True Positive (TP) | False Negative (FN) |
| **Actual Negative** | False Positive (FP) | True Negative (TN) |

### Core Metrics

| Metric | Formula | Description | Target |
|--------|---------|-------------|--------|
| **Precision** | TP / (TP + FP) | Of items flagged, how many were correct | >= 90% |
| **Recall** | TP / (TP + FN) | Of actual positives, how many were found | >= 85% |
| **F1 Score** | 2 * (P * R) / (P + R) | Harmonic mean of precision and recall | >= 87% |
| **False Positive Rate** | FP / (FP + TN) | Rate of incorrect positive flags | <= 10% |
| **Accuracy** | (TP + TN) / Total | Overall correctness | >= 90% |

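For reference, the core metrics can be computed directly from confusion-matrix counts. The following is a minimal sketch in Python; `core_metrics` is an illustrative helper, not part of the benchmark tooling:

```python
def core_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the core accuracy metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "false_positive_rate": fpr, "accuracy": accuracy}

# Example: 86 reachable findings confirmed, 9 spurious, 320 correctly ruled out, 14 missed.
print(core_metrics(tp=86, fp=9, tn=320, fn=14))
```
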
---

## Reachability Analysis Accuracy

### Definitions

- **True Positive (TP)**: Correctly identified as reachable (code path actually exists)
- **False Positive (FP)**: Incorrectly identified as reachable (no real code path)
- **True Negative (TN)**: Correctly identified as unreachable (no code path exists)
- **False Negative (FN)**: Incorrectly identified as unreachable (code path exists but missed)

### Target Metrics

| Metric | Target | Stretch Goal |
|--------|--------|--------------|
| Precision | >= 90% | >= 95% |
| Recall | >= 85% | >= 90% |
| F1 Score | >= 87% | >= 92% |
| False Positive Rate | <= 10% | <= 5% |

### Per-Language Targets

| Language | Precision | Recall | F1 | Notes |
|----------|-----------|--------|-----|-------|
| Java | >= 92% | >= 88% | >= 90% | Strong static analysis support |
| C# | >= 90% | >= 85% | >= 87% | Roslyn-based analysis |
| Go | >= 88% | >= 82% | >= 85% | Good call graph support |
| JavaScript | >= 85% | >= 78% | >= 81% | Dynamic typing challenges |
| Python | >= 83% | >= 75% | >= 79% | Dynamic typing challenges |
| TypeScript | >= 88% | >= 82% | >= 85% | Better than JS due to types |

---

## Lattice State Accuracy

VEX lattice states have different confidence requirements:

| State | Definition | Target Accuracy | Validation |
|-------|------------|-----------------|------------|
| **CR** (Confirmed Reachable) | Runtime evidence + static path | >= 95% | Runtime trace verification |
| **SR** (Static Reachable) | Static path only | >= 90% | Static analysis coverage |
| **SU** (Static Unreachable) | No static path found | >= 85% | Negative proof verification |
| **DT** (Denied by Tool) | Tool analysis confirms not affected | >= 90% | Tool output validation |
| **DV** (Denied by Vendor) | Vendor VEX statement | >= 95% | VEX signature verification |
| **U** (Unknown) | Insufficient evidence | Track % | Minimize unknowns |

### Lattice Transition Accuracy

Measure accuracy of automatic state transitions:

| Transition | Trigger | Target Accuracy |
|------------|---------|-----------------|
| U -> SR | Static analysis finds path | >= 90% |
| SR -> CR | Runtime evidence added | >= 95% |
| U -> SU | Static analysis proves unreachable | >= 85% |
| SR -> DT | Tool-specific analysis | >= 90% |

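To make the transition rules concrete, the sketch below encodes the four automatic transitions from the table as an allow-list. The state names follow the lattice table above; the `LatticeState` enum and `apply_transition` helper are hypothetical, not the scanner's implementation:

```python
from enum import Enum

class LatticeState(Enum):
    U = "unknown"
    SR = "static_reachable"
    SU = "static_unreachable"
    CR = "confirmed_reachable"
    DT = "denied_by_tool"
    DV = "denied_by_vendor"

# Automatic transitions from the table above; anything else needs manual review.
ALLOWED_AUTO_TRANSITIONS = {
    (LatticeState.U, LatticeState.SR),   # static analysis finds path
    (LatticeState.SR, LatticeState.CR),  # runtime evidence added
    (LatticeState.U, LatticeState.SU),   # static analysis proves unreachable
    (LatticeState.SR, LatticeState.DT),  # tool-specific analysis
}

def apply_transition(current: LatticeState, proposed: LatticeState) -> LatticeState:
    """Apply an automatic transition, rejecting anything outside the allow-list."""
    if (current, proposed) not in ALLOWED_AUTO_TRANSITIONS:
        raise ValueError(f"automatic transition {current.name} -> {proposed.name} not allowed")
    return proposed
```
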
---

## SBOM Completeness Metrics

### Component Detection

| Metric | Formula | Target | Notes |
|--------|---------|--------|-------|
| **Component Recall** | Found / Total Actual | >= 98% | Find all real components |
| **Component Precision** | Real / Reported | >= 99% | Minimize phantom components |
| **Version Accuracy** | Correct Versions / Total | >= 95% | Version string correctness |
| **License Accuracy** | Correct Licenses / Total | >= 90% | License detection accuracy |

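Component recall and precision reduce to set comparisons between the generated SBOM and a reference SBOM. A minimal sketch, assuming components are keyed by `(name, version)` pairs so that a wrong version counts as both a miss and a phantom; the helper is illustrative only:

```python
def sbom_detection_metrics(reported: set[tuple[str, str]],
                           actual: set[tuple[str, str]]) -> dict:
    """Compare reported vs. ground-truth components keyed by (name, version)."""
    found = reported & actual  # real components the scanner reported
    recall = len(found) / len(actual) if actual else 1.0
    precision = len(found) / len(reported) if reported else 1.0
    return {"component_recall": recall, "component_precision": precision}

reported = {("openssl", "3.0.13"), ("zlib", "1.3.1"), ("left-pad", "1.3.0")}
actual = {("openssl", "3.0.13"), ("zlib", "1.3.1"), ("musl", "1.2.4")}
print(sbom_detection_metrics(reported, actual))
# component_recall = 2/3, component_precision = 2/3
```

Version accuracy would instead key components by name alone and compare the version fields of matched pairs.
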
### Per-Ecosystem Targets

| Ecosystem | Comp. Recall | Comp. Precision | Version Acc. |
|-----------|--------------|-----------------|--------------|
| Alpine APK | >= 99% | >= 99% | >= 98% |
| Debian DEB | >= 99% | >= 99% | >= 98% |
| npm | >= 97% | >= 98% | >= 95% |
| Maven | >= 98% | >= 99% | >= 96% |
| NuGet | >= 98% | >= 99% | >= 96% |
| PyPI | >= 96% | >= 98% | >= 94% |
| Go Modules | >= 97% | >= 98% | >= 95% |
| Cargo (Rust) | >= 98% | >= 99% | >= 96% |

---

## Vulnerability Detection Accuracy

### CVE Matching

| Metric | Formula | Target |
|--------|---------|--------|
| **CVE Recall** | Found CVEs / Actual CVEs | >= 95% |
| **CVE Precision** | Correct CVEs / Reported CVEs | >= 98% |
| **Version Range Accuracy** | Correct Affected / Total | >= 93% |

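Version range accuracy hinges on evaluating affected ranges correctly; mistakes here drive the Version Mismatch FP category below. A minimal sketch using OSV-style half-open ranges over pre-parsed version tuples (real version schemes such as semver or Debian EVR need ecosystem-specific comparison; this helper is illustrative only):

```python
def in_affected_range(version: tuple[int, ...], introduced: tuple[int, ...],
                      fixed: tuple[int, ...]) -> bool:
    """OSV-style half-open range: affected if introduced <= version < fixed."""
    return introduced <= version < fixed

# CVE affecting >= 1.2.0, fixed in 1.2.9: 1.2.4 is affected, 1.2.9 is not.
print(in_affected_range((1, 2, 4), (1, 2, 0), (1, 2, 9)))  # True
print(in_affected_range((1, 2, 9), (1, 2, 0), (1, 2, 9)))  # False
```
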
### False Positive Categories

Track and minimize specific FP types:

| FP Type | Description | Target Rate |
|---------|-------------|-------------|
| **Phantom Component** | CVE for component not present | <= 1% |
| **Version Mismatch** | CVE for wrong version | <= 3% |
| **Ecosystem Confusion** | Wrong package with same name | <= 1% |
| **Stale Advisory** | Already fixed but flagged | <= 2% |

---

## Measurement Methodology

### Ground Truth Establishment

1. **Manual Curation**
   - Expert review of sample applications
   - Documented decision rationale
   - Multiple reviewer consensus

2. **Automated Verification**
   - Cross-reference with authoritative sources
   - NVD, OSV, GitHub Advisory Database
   - Vendor security bulletins

3. **Runtime Validation**
   - Dynamic analysis confirmation
   - Exploit proof-of-concept testing
   - Production monitoring correlation

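To make the curation output concrete, a hypothetical shape for a single ground-truth record is sketched below; the `GroundTruthSample` type and its field names are assumptions for illustration, not the schema of the published datasets:

```python
from dataclasses import dataclass, field

@dataclass
class GroundTruthSample:
    """One curated ground-truth decision, with provenance for audits."""
    sample_id: str
    artifact: str    # e.g. an image digest or package coordinate (hypothetical field)
    label: str       # e.g. "reachable" / "unreachable"
    rationale: str   # documented decision rationale
    reviewers: list[str] = field(default_factory=list)  # consensus requires multiple reviewers
    sources: list[str] = field(default_factory=list)    # NVD, OSV, vendor bulletins

sample = GroundTruthSample(
    sample_id="reach-java-0042",
    artifact="registry.example/app@sha256:...",
    label="reachable",
    rationale="Static path confirmed by manual review; exploit PoC reproduced.",
    reviewers=["alice", "bob"],
    sources=["NVD", "OSV"],
)
```
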
### Test Corpus Requirements

| Category | Minimum Samples | Diversity Requirements |
|----------|-----------------|----------------------|
| Reachability | 50 per language | Mix of libraries, frameworks |
| SBOM | 100 images | All major ecosystems |
| CVE Detection | 500 CVEs | Mix of severities, ages |
| Performance | 10 reference images | Various sizes |

### Measurement Process

```
1. Select ground truth corpus
   └── Minimum samples per category
   └── Representative of production workloads

2. Run scanner with deterministic manifest
   └── Fixed advisory database version
   └── Reproducible configuration

3. Compare results to ground truth
   └── Automated diff tooling
   └── Manual review of discrepancies

4. Compute metrics per category
   └── Generate confusion matrices
   └── Calculate precision/recall/F1

5. Aggregate and publish
   └── Per-ecosystem breakdown
   └── Overall summary metrics
   └── Trend analysis
```

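Steps 3 and 4 amount to joining scanner predictions with ground-truth labels and tallying a confusion matrix, whose counts feed directly into the `core_metrics` helper sketched earlier. A minimal sketch, assuming both sides are dictionaries keyed by sample ID with boolean labels (illustrative only; the real pipeline relies on the automated diff tooling noted above):

```python
def confusion_counts(predicted: dict[str, bool], truth: dict[str, bool]) -> dict:
    """Tally TP/FP/TN/FN over the samples present in the ground truth."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for sample_id, actual in truth.items():
        pred = predicted.get(sample_id, False)  # missing prediction counts as negative
        if actual and pred:
            counts["tp"] += 1
        elif actual and not pred:
            counts["fn"] += 1
        elif not actual and pred:
            counts["fp"] += 1
        else:
            counts["tn"] += 1
    return counts

counts = confusion_counts(
    predicted={"s1": True, "s2": True, "s3": False},
    truth={"s1": True, "s2": False, "s3": True},
)
print(counts)  # {'tp': 1, 'fp': 1, 'tn': 0, 'fn': 1}
```
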
---

## Reporting Format

### Quarterly Benchmark Report

```json
{
  "report_version": "1.0",
  "scanner_version": "1.3.0",
  "report_date": "2025-12-14",
  "ground_truth_version": "2025-Q4",

  "reachability": {
    "overall": {
      "precision": 0.91,
      "recall": 0.86,
      "f1": 0.88,
      "samples": 450
    },
    "by_language": {
      "java": {"precision": 0.93, "recall": 0.88, "f1": 0.90, "samples": 100},
      "csharp": {"precision": 0.90, "recall": 0.85, "f1": 0.87, "samples": 80},
      "go": {"precision": 0.89, "recall": 0.83, "f1": 0.86, "samples": 70}
    }
  },

  "sbom": {
    "component_recall": 0.98,
    "component_precision": 0.99,
    "version_accuracy": 0.96
  },

  "vulnerability": {
    "cve_recall": 0.96,
    "cve_precision": 0.98,
    "false_positive_rate": 0.02
  },

  "lattice_states": {
    "cr_accuracy": 0.96,
    "sr_accuracy": 0.91,
    "su_accuracy": 0.87
  }
}
```

---

## Regression Detection

### Thresholds

A regression is flagged when:

| Metric | Regression Threshold | Action |
|--------|---------------------|--------|
| Precision | > 3% decrease | Block release |
| Recall | > 5% decrease | Block release |
| F1 | > 4% decrease | Block release |
| FPR | > 2% increase | Block release |
| Any metric | > 1% change | Investigate |

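The threshold logic can be sketched as a standalone check, with thresholds expressed as absolute (percentage-point) deltas to match the `--threshold-*` flags in the CI example below. All names in this sketch are hypothetical; the real entry point is `stellaops benchmark compare`:

```python
# Thresholds from the table above, expressed as absolute deltas.
BLOCKING = {"precision": 0.03, "recall": 0.05, "f1": 0.04}
FPR_INCREASE = 0.02
INVESTIGATE = 0.01

def check_regression(baseline: dict[str, float], current: dict[str, float]) -> list[str]:
    """Return a list of findings; any 'BLOCK' entry should fail the release."""
    findings = []
    for metric, limit in BLOCKING.items():
        drop = baseline[metric] - current[metric]
        if drop > limit:
            findings.append(f"BLOCK: {metric} dropped {drop:.3f} (> {limit})")
    rise = current["fpr"] - baseline["fpr"]
    if rise > FPR_INCREASE:
        findings.append(f"BLOCK: fpr rose {rise:.3f} (> {FPR_INCREASE})")
    for metric in (*BLOCKING, "fpr"):
        if abs(baseline[metric] - current[metric]) > INVESTIGATE:
            findings.append(f"INVESTIGATE: {metric} changed by more than {INVESTIGATE}")
    return findings

print(check_regression(
    baseline={"precision": 0.91, "recall": 0.86, "f1": 0.88, "fpr": 0.05},
    current={"precision": 0.87, "recall": 0.85, "f1": 0.86, "fpr": 0.06},
))
```
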
### CI Integration

```yaml
# .gitea/workflows/accuracy-check.yml
accuracy-benchmark:
  runs-on: ubuntu-latest
  steps:
    - name: Run accuracy benchmark
      run: make benchmark-accuracy

    - name: Check for regressions
      run: |
        stellaops benchmark compare \
          --baseline results/baseline.json \
          --current results/current.json \
          --threshold-precision 0.03 \
          --threshold-recall 0.05 \
          --fail-on-regression
```

---

## Ground Truth Sources

### Internal

- `datasets/reachability/samples/` - Reachability ground truth
- `datasets/sbom/reference/` - Known-good SBOMs
- `bench/findings/` - CVE finding ground truth

### External

- **NIST SARD** - Software Assurance Reference Dataset
- **OSV Test Suite** - Open Source Vulnerability test cases
- **OWASP Benchmark** - Security testing benchmark
- **Juliet Test Suite** - CWE coverage testing

---

## Improvement Tracking

### Gap Analysis

Identify and prioritize accuracy improvements:

| Gap | Current | Target | Priority | Improvement Plan |
|-----|---------|--------|----------|------------------|
| Python recall | 73% | 78% | High | Improve type inference |
| npm precision | 96% | 98% | Medium | Fix aliasing issues |
| Version accuracy | 94% | 96% | Medium | Better version parsing |

### Quarterly Goals

Track progress against improvement targets:

| Quarter | Focus Area | Metric | Target | Actual |
|---------|------------|--------|--------|--------|
| Q4 2025 | Java reachability | Recall | 88% | TBD |
| Q1 2026 | Python support | F1 | 80% | TBD |
| Q1 2026 | SBOM completeness | Recall | 99% | TBD |

---

## References

- [FIRST CVSS v4.0 Specification](https://www.first.org/cvss/v4.0/specification-document)
- [NIST NVD API](https://nvd.nist.gov/developers)
- [OSV Schema](https://ossf.github.io/osv-schema/)
- [StellaOps Reachability Architecture](../modules/scanner/reachability.md)