# Confidence to Evidence-Weighted Score Migration Guide
> **Version:** 1.0
> **Status:** Active
> **Last Updated:** 2025-12-31
> **Sprint:** 8200.0012.0003 (Policy Engine Integration)
## Overview
This document describes the migration path from the legacy **Confidence** scoring system to the new **Evidence-Weighted Score (EWS)** system. The migration is designed to be gradual, low-risk, and fully reversible at each stage.
### Key Differences
| Aspect | Confidence (Legacy) | Evidence-Weighted Score |
|--------|---------------------|------------------------|
| **Score Range** | 0.0-1.0 (decimal) | 0-100 (integer) |
| **Direction** | Higher = more confident | Higher = higher risk/priority |
| **Basis** | Heuristic confidence in finding | Evidence-backed exploitability |
| **Breakdown** | Single value | 6 dimensions (Rch, Rts, Bkp, Xpl, Src, Mit) |
| **Determinism** | Limited | Fully deterministic with proofs |
| **Attestation** | Not attested | Included in verdict attestation |
### Semantic Inversion
The most important difference is **semantic inversion**:
- **Confidence**: Higher values indicate higher confidence that a finding is accurate
- **EWS**: Higher values indicate higher exploitability evidence (more urgent to fix)

A high-confidence finding may have a low EWS if evidence shows it's mitigated. Conversely, a low-confidence finding may have a high EWS if runtime signals indicate active exploitation.

---
## Migration Phases
### Phase 1: Feature Flag (Opt-In)
**Duration:** Immediate → 2 weeks
**Risk:** None (off by default)
Deploy the EWS configuration with every flag off, so that behavior does not change:
```json
{
  "Policy": {
    "EvidenceWeightedScore": {
      "Enabled": false,
      "DualEmitMode": false,
      "UseAsPrimaryScore": false
    }
  }
}
```
**What happens:**
- EWS infrastructure is loaded but dormant
- No performance impact
- No output changes

**When to proceed:** After infrastructure validation

---
### Phase 2: Dual-Emit Mode (Parallel Calculation)
**Duration:** 2-4 weeks
**Risk:** Low (additive only)
Enable both scoring systems in parallel:
```json
{
  "Policy": {
    "EvidenceWeightedScore": {
      "Enabled": true,
      "DualEmitMode": true,
      "UseAsPrimaryScore": false
    }
  }
}
```
**What happens:**
- Both Confidence AND EWS are calculated
- Both appear in verdicts and attestations
- Telemetry compares rankings
- Existing rules use Confidence (unchanged behavior)

**Verdict output example:**
```json
{
  "findingId": "CVE-2024-1234:pkg:npm/lodash@4.17.0",
  "status": "block",
  "confidence": {
    "value": 0.85,
    "tier": "High"
  },
  "evidenceWeightedScore": {
    "score": 72,
    "bucket": "ScheduleNext",
    "breakdown": {
      "rch": { "weighted": 18, "weight": 0.25, "raw": 0.72 },
      "rts": { "weighted": 24, "weight": 0.30, "raw": 0.80 },
      "bkp": { "weighted": 0, "weight": 0.10, "raw": 0.00 },
      "xpl": { "weighted": 10, "weight": 0.15, "raw": 0.67 },
      "src": { "weighted": 12, "weight": 0.15, "raw": 0.80 },
      "mit": { "weighted": 8, "weight": 0.05, "raw": 1.60 }
    },
    "flags": ["live-signal"],
    "explanations": ["Runtime signal detected (score +8)", "Reachable via call graph"]
  }
}
```
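In the sample verdict above, each `weighted` value matches `weight × raw × 100` rounded to the nearest integer, and the six values add up to the reported `score`. That relationship is inferred from the example numbers only, not from a stated formula. The Python sketch below checks the arithmetic; the helper name `recompute` is hypothetical.

```python
# Values copied from the sample verdict. The formula is an inference
# from these numbers, not a documented contract of the scoring engine.
breakdown = {
    "rch": {"weighted": 18, "weight": 0.25, "raw": 0.72},
    "rts": {"weighted": 24, "weight": 0.30, "raw": 0.80},
    "bkp": {"weighted": 0, "weight": 0.10, "raw": 0.00},
    "xpl": {"weighted": 10, "weight": 0.15, "raw": 0.67},
    "src": {"weighted": 12, "weight": 0.15, "raw": 0.80},
    "mit": {"weighted": 8, "weight": 0.05, "raw": 1.60},
}


def recompute(dimension: dict) -> int:
    """Recompute a weighted contribution as weight * raw * 100, rounded."""
    return round(dimension["weight"] * dimension["raw"] * 100)


# Every dimension in the sample is consistent with the inferred formula.
consistent = all(recompute(d) == d["weighted"] for d in breakdown.values())

# The contributions add up to the sample's top-level score.
total = sum(d["weighted"] for d in breakdown.values())
```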
**Monitoring during this phase:**
- Use `IMigrationTelemetryService` to track alignment
- Review `MigrationTelemetryStats.AlignmentRate`
- Investigate samples where rankings diverge significantly

**When to proceed:** When alignment rate > 80% or divergences are understood

---
### Phase 3: EWS as Primary (Shadow Confidence)
**Duration:** 2-4 weeks
**Risk:** Medium (behavior change possible)
Switch primary scoring while keeping Confidence for validation:
```json
{
  "Policy": {
    "EvidenceWeightedScore": {
      "Enabled": true,
      "DualEmitMode": true,
      "UseAsPrimaryScore": true
    }
  }
}
```
**What happens:**
- EWS is used for policy rule evaluation
- Confidence is still calculated and emitted (deprecated field)
- Policy rules should be migrated to use `score` instead of `confidence`

**Rule migration example:**

Before (Confidence-based):
```yaml
rules:
  - name: block-high-confidence
    when: confidence >= 0.9
    then: block
```
After (EWS-based):
```yaml
rules:
  - name: block-high-evidence
    when: score >= 85
    then: block
  # Or use bucket-based for clearer semantics:
  - name: block-act-now
    when: score.bucket == "ActNow"
    then: block
```
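When many rules need the same rewrite, the threshold pairs from the recommended rule patterns table that follows can be kept in a small lookup. This is a minimal Python sketch: the helper names `ews_threshold_for` and `migrate_condition` are hypothetical, and the pairs are copied from the table rather than derived from a formula, so unmapped thresholds are rejected instead of interpolated.

```python
# Confidence threshold -> suggested EWS threshold, copied from the table.
CONFIDENCE_TO_EWS = {0.9: 85, 0.7: 60, 0.5: 40, 0.3: 25}


def ews_threshold_for(confidence_threshold: float) -> int:
    """Return the documented EWS threshold, or raise if none is listed."""
    if confidence_threshold not in CONFIDENCE_TO_EWS:
        raise ValueError(
            f"no documented mapping for confidence {confidence_threshold}"
        )
    return CONFIDENCE_TO_EWS[confidence_threshold]


def migrate_condition(operator: str, confidence_threshold: float) -> str:
    """Rewrite a 'confidence <op> x' condition as a 'score <op> y' one."""
    return f"score {operator} {ews_threshold_for(confidence_threshold)}"


# Example: the rule from the "before" snippet.
migrated = migrate_condition(">=", 0.9)
```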
**Recommended rule patterns:**
| Confidence Rule | EWS Equivalent | Notes |
|----------------|----------------|-------|
| `confidence >= 0.9` | `score >= 85` or `score.bucket == "ActNow"` | Very high certainty |
| `confidence >= 0.7` | `score >= 60` or `score.bucket in ["ActNow", "ScheduleNext"]` | High certainty |
| `confidence >= 0.5` | `score >= 40` | Medium certainty |
| `confidence < 0.3` | `score < 25` | Low evidence |

**When to proceed:** After rule migration and 2+ weeks of stable operation

---
### Phase 4: EWS-Only (Deprecation Complete)
**Duration:** Permanent
**Risk:** Low (rollback path exists)
Disable legacy Confidence scoring:
```json
{
  "Policy": {
    "EvidenceWeightedScore": {
      "Enabled": true,
      "DualEmitMode": false,
      "UseAsPrimaryScore": true
    }
  }
}
```
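Taken together, the four phase configurations differ only in the three flags `Enabled`, `DualEmitMode`, and `UseAsPrimaryScore`. The Python sketch below maps a flag combination back to its phase, which can help when auditing deployed configs. The function name `migration_phase` is hypothetical, and combinations this guide does not describe are reported as `None` rather than guessed.

```python
from typing import Optional

# (Enabled, DualEmitMode, UseAsPrimaryScore) -> phase, per the four
# configuration snippets in this guide.
PHASES = {
    (False, False, False): 1,  # feature flag present, EWS dormant
    (True, True, False): 2,    # dual-emit, Confidence still primary
    (True, True, True): 3,     # EWS primary, Confidence shadowed
    (True, False, True): 4,    # EWS only
}


def migration_phase(
    enabled: bool, dual_emit: bool, use_as_primary: bool
) -> Optional[int]:
    """Return the phase for a flag combination, or None if undocumented."""
    return PHASES.get((enabled, dual_emit, use_as_primary))
```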
**What happens:**
- Only EWS is calculated
- Confidence field is null in verdicts
- Performance improvement (single calculation)
- Consumers must use EWS fields

**Breaking changes to document:**
- `Verdict.Confidence` returns null
- `ConfidenceScore` type is deprecated (will be removed in v3.0)
- Rules referencing `confidence` will fail validation
---
## Configuration Reference
### Full Configuration Schema
```json
{
  "Policy": {
    "EvidenceWeightedScore": {
      "Enabled": true,
      "DualEmitMode": true,
      "UseAsPrimaryScore": false,
      "EnableCaching": true,
      "CacheDurationSeconds": 300,
      "Weights": {
        "Reachability": 0.25,
        "RuntimeSignal": 0.30,
        "BackportStatus": 0.10,
        "ExploitMaturity": 0.15,
        "SourceTrust": 0.15,
        "MitigationStatus": 0.05
      },
      "BucketThresholds": {
        "ActNow": 85,
        "ScheduleNext": 60,
        "Investigate": 40
      },
      "Telemetry": {
        "EnableMigrationMetrics": true,
        "SampleRate": 0.1,
        "MaxSamples": 1000
      }
    }
  }
}
```
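Two properties of the schema above are easy to check before rolling out custom values: the default weights sum to 1.0, and the bucket thresholds partition the 0-100 range from the top down. The Python sketch below assumes thresholds are inclusive lower bounds (consistent with the `score >= 85` rule pattern); whether the engine enforces the weight sum is not stated here, so treat the check as a sanity test. The function names are hypothetical.

```python
import math
from typing import Optional

WEIGHTS = {
    "Reachability": 0.25,
    "RuntimeSignal": 0.30,
    "BackportStatus": 0.10,
    "ExploitMaturity": 0.15,
    "SourceTrust": 0.15,
    "MitigationStatus": 0.05,
}
BUCKET_THRESHOLDS = {"ActNow": 85, "ScheduleNext": 60, "Investigate": 40}


def weights_sum_to_one(weights: dict) -> bool:
    """Check that the weights add up to 1.0 within float tolerance."""
    return math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9)


def bucket_for(score: int, thresholds: dict = BUCKET_THRESHOLDS) -> Optional[str]:
    """Return the highest bucket whose threshold the score reaches.

    Returns None below the lowest threshold; the name of that lowest
    bucket is not given in this guide, so it is not invented here.
    """
    ordered = sorted(thresholds.items(), key=lambda kv: kv[1], reverse=True)
    for name, minimum in ordered:
        if score >= minimum:
            return name
    return None
```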
### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `POLICY_EWS_ENABLED` | `false` | Enable EWS calculation |
| `POLICY_EWS_DUAL_EMIT` | `false` | Emit both scores |
| `POLICY_EWS_PRIMARY` | `false` | Use EWS as primary score |
| `POLICY_EWS_CACHE_ENABLED` | `true` | Enable score caching |
---
## Telemetry & Monitoring
### Metrics
The migration telemetry service exposes these metrics:
| Metric | Type | Description |
|--------|------|-------------|
| `stellaops.policy.migration.comparisons_total` | Counter | Total comparisons made |
| `stellaops.policy.migration.aligned_total` | Counter | Comparisons where rankings aligned |
| `stellaops.policy.migration.score_difference` | Histogram | Distribution of score differences |
| `stellaops.policy.migration.tier_bucket_match_total` | Counter | Tier/bucket matches |
| `stellaops.policy.dual_emit.verdicts_total` | Counter | Dual-emit verdicts produced |
### Dashboard Queries
**Alignment rate over time:**
```promql
rate(stellaops_policy_migration_aligned_total[5m])
/ rate(stellaops_policy_migration_comparisons_total[5m])
```
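The same ratio can be computed from raw counter values, for example in a script that gates the move from Phase 2 to Phase 3. A minimal Python sketch: the function names are hypothetical, the inputs stand for the `aligned_total` and `comparisons_total` counters, and only the numeric half of the "alignment rate > 80% or divergences are understood" criterion is covered.

```python
from typing import Optional


def alignment_rate(aligned_total: int, comparisons_total: int) -> Optional[float]:
    """Fraction of comparisons whose rankings aligned; None if no data."""
    if comparisons_total == 0:
        return None
    return aligned_total / comparisons_total


def meets_alignment_target(
    aligned_total: int, comparisons_total: int, target: float = 0.80
) -> bool:
    """True when the alignment rate is strictly above the target."""
    rate = alignment_rate(aligned_total, comparisons_total)
    return rate is not None and rate > target
```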
**Score difference distribution:**
```promql
histogram_quantile(0.95, sum by (le) (rate(stellaops_policy_migration_score_difference_bucket[5m])))
```
### Sample Analysis
Use `IMigrationTelemetryService.GetRecentSamples()` to retrieve divergent samples:
```csharp
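using System;
using System.Linq;
using Microsoft.Extensions.DependencyInjection;
// Also import the namespace that declares IMigrationTelemetryService.

var telemetry = serviceProvider.GetRequiredService<IMigrationTelemetryService>();
var stats = telemetry.GetStats();

// Below the 80% alignment target: list the most divergent recent samples first.
if (stats.AlignmentRate < 0.8m)
{
    var samples = telemetry.GetRecentSamples(50)
        .Where(s => !s.IsAligned)
        .OrderByDescending(s => Math.Abs(s.ScoreDifference));

    foreach (var sample in samples)
    {
        Console.WriteLine($"{sample.FindingId}: Conf={sample.ConfidenceValue:F2} → EWS={sample.EwsScore} (Δ={sample.ScoreDifference})");
    }
}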
```
---
## Rollback Procedures
### Phase 4 → Phase 3 (Re-enable Dual-Emit)
```json
{
  "Policy": {
    "EvidenceWeightedScore": {
      "DualEmitMode": true
    }
  }
}
```
Restart services. Confidence will be calculated again.
### Phase 3 → Phase 2 (Revert to Confidence Primary)
```json
{
  "Policy": {
    "EvidenceWeightedScore": {
      "UseAsPrimaryScore": false
    }
  }
}
```
Rules using `confidence` will work again. Rules using `score` will still work.
### Phase 2 → Phase 1 (Disable EWS)
```json
{
  "Policy": {
    "EvidenceWeightedScore": {
      "Enabled": false
    }
  }
}
```
No EWS calculation, no performance impact.
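
Each rollback snippet above sets a single key and leaves the rest of the configuration in place. The Python sketch below illustrates that, assuming overrides are merged key by key onto the existing configuration (an assumption about how the configuration layers combine, not something this guide states). The helper name `apply_override` is hypothetical.

```python
import copy


def apply_override(base: dict, override: dict) -> dict:
    """Return a copy of base with override merged in, key by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_override(merged[key], value)
        else:
            merged[key] = value
    return merged


# Phase 4 flags, then the Phase 4 -> Phase 3 rollback override.
phase_4 = {
    "Policy": {
        "EvidenceWeightedScore": {
            "Enabled": True,
            "DualEmitMode": False,
            "UseAsPrimaryScore": True,
        }
    }
}
rollback_to_phase_3 = {"Policy": {"EvidenceWeightedScore": {"DualEmitMode": True}}}

effective = apply_override(phase_4, rollback_to_phase_3)
```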
### Emergency Rollback
Set the environment variable below. It takes effect without a restart only if hot-reload is enabled:
```bash
export POLICY_EWS_ENABLED=false
```
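For tooling that needs to see what such an override resolves to, the Python sketch below reads the documented variables with their documented defaults. The accepted spellings for "true" are an assumption, as is the helper name `env_flag`; the service's own parsing may differ.

```python
import os
from typing import Mapping

# Defaults copied from the Environment Variables table.
DEFAULTS = {
    "POLICY_EWS_ENABLED": False,
    "POLICY_EWS_DUAL_EMIT": False,
    "POLICY_EWS_PRIMARY": False,
    "POLICY_EWS_CACHE_ENABLED": True,
}

# Assumed truthy spellings; anything else is treated as false.
TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name: str, environ: Mapping[str, str] = os.environ) -> bool:
    """Resolve a documented flag from the environment, else its default."""
    raw = environ.get(name)
    if raw is None:
        return DEFAULTS[name]
    return raw.strip().lower() in TRUTHY


# Example: the emergency override wins over any other setting.
resolved = env_flag("POLICY_EWS_ENABLED", {"POLICY_EWS_ENABLED": "false"})
```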
---
## Rule Migration Checklist
- [ ] Inventory all policies using `confidence` field
- [ ] Map confidence thresholds to EWS thresholds (see table above)
- [ ] Update rules to use `score` syntax
- [ ] Consider using bucket-based rules for clearer semantics
- [ ] Test rules in dual-emit mode before switching primary
- [ ] Update documentation and runbooks
- [ ] Train operators on new score interpretation
- [ ] Update alerting thresholds
---
## FAQ
### Q: Will existing rules break?
**A:** Not during dual-emit mode; rules using `confidence` continue to work. Once `UseAsPrimaryScore` is `true`, new rules should use `score`. Old `confidence` rules emit deprecation warnings, and they fail validation in Phase 4.
### Q: How do I interpret the score difference?
**A:** The ConfidenceToEwsAdapter maps Confidence (0-1) to an approximate EWS (0-100) with semantic inversion. A "difference" of ±15 points is normal due to the different underlying models. Investigate differences > 30 points.
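
The two limits in that answer can be turned into a simple triage for divergent samples. A minimal Python sketch: the function name and the labels are hypothetical, and the middle band (more than 15, up to 30 points) is not classified by this guide, so it is marked for review as a neutral default.

```python
def triage_difference(score_difference: float) -> str:
    """Classify an adapter-vs-EWS score difference by its magnitude."""
    magnitude = abs(score_difference)
    if magnitude <= 15:
        return "normal"       # within the expected model-to-model spread
    if magnitude > 30:
        return "investigate"  # beyond the documented investigation limit
    return "review"           # hypothetical label for the unclassified band
```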
### Q: What if my rankings diverge significantly?
**A:** This is expected for findings where:
- Runtime signals (Rts) differ from static analysis
- Vendor VEX overrides traditional severity
- Reachability analysis shows unreachable code

Review these cases manually. EWS is likely more accurate due to evidence integration.
### Q: Can I customize the EWS weights?
**A:** Yes, via `Weights` configuration. However, changing weights affects determinism proofs. Document any changes and bump the policy version.
### Q: What about attestations?
**A:** During dual-emit, attestations include both scores. After Phase 4, only EWS is attested. Old attestations remain verifiable with their original scores.

---
## Related Documents
- [Evidence-Weighted Score Architecture](../../signals/architecture.md)
- [Policy DSL Reference](../contracts/policy-dsl.md)
- [Verdict Attestation](../verdict-attestation.md)
- [Sprint 8200.0012.0003](../../../../implplan/SPRINT_8200_0012_0003_policy_engine_integration.md)
---
## Revision History
| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2025-12-31 | Implementer | Initial migration guide |