Here’s a compact, low-friction way to tame “unknowns” in Stella Ops without boiling the ocean: two heuristics you can prototype this week—each yields one clear artifact you can show in the UI and wire into the next planning cycle.
1) Decaying Confidence (Half-Life) for Unknown Reachability
Idea: every “unknown” reachability/verdict starts with a confidence score that decays over time (exponential half-life). If no new evidence arrives, confidence naturally drifts toward “needs refresh,” preventing stale assumptions from lingering.
Why it helps (plain English): unknowns don’t stay “probably fine” forever—this makes them self-expiring, so triage resurfaces them at the right time instead of only when something breaks.
Minimal data model (UnknownsRegistry):
{
  "unknown_id": "URN:unknown:pkg/npm/lodash:4.17.21:CVE-2021-23337:reachability",
  "subject_ref": { "type": "package", "purl": "pkg:npm/lodash@4.17.21" },
  "vuln_id": "CVE-2021-23337",
  "dimension": "reachability",
  "confidence": { "value": 0.78, "method": "half_life", "t0": "2025-11-29T12:00:00Z", "half_life_days": 14 },
  "evidence": [{ "kind": "static_scan_hint", "hash": "…" }],
  "next_review_at": "2025-12-06T12:00:00Z",
  "status": "unknown"
}
Update rule (per tick or on read):
confidence_now = confidence_t0 * 0.5^(Δdays / half_life_days)

- When confidence_now < threshold_low → flag for human review (see Queue below).
- When fresh evidence arrives → reset t0 and optionally raise confidence.
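A quick sketch of that update rule (Python here purely for illustration; field names mirror the JSON above, and the 0.35 threshold is an assumed starting value):

```python
from datetime import datetime, timedelta, timezone

def confidence_now(value_at_t0: float, t0: datetime, half_life_days: float,
                   now: datetime) -> float:
    """Exponential half-life decay: confidence halves every half_life_days."""
    delta_days = (now - t0).total_seconds() / 86400.0
    if delta_days <= 0:
        return value_at_t0
    return value_at_t0 * 0.5 ** (delta_days / half_life_days)

t0 = datetime(2025, 11, 29, 12, 0, tzinfo=timezone.utc)
# One half-life (14 days) later, the score has exactly halved: 0.78 -> 0.39.
c = confidence_now(0.78, t0, 14, t0 + timedelta(days=14))
```

With threshold_low = 0.35, this unknown would be flagged for review a few days after its first half-life, with no cron job needed: the value is computed on read.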
One UI artifact: A “Confidence Decay Card” on each unknown, showing:
- sparkline of decay over time,
- next review ETA,
- button “Refresh with latest evidence” (re-run reachability probes).
One ops hook (planning): Export a daily CSV/JSON of unknowns whose confidence crossed threshold to feed the triage board.
2) Human-Review Queue for High-Impact Unknowns
Idea: only a subset of unknowns deserve people time. Auto-rank them by potential blast radius + decayed confidence.
Triage score (simple, transparent):
triage_score = impact_score * (1 - confidence_now)
- impact_score (0–1): runtime exposure, privilege, prevalence, SLA tier.
- confidence_now: from heuristic #1.
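The scoring rule is a one-liner; a small sketch (the 0.9/0.4 inputs are illustrative, matching the worked example later in this note):

```python
def triage_score(impact_score: float, confidence_now: float) -> float:
    """High impact combined with low (decayed) confidence rises to the top."""
    return impact_score * (1.0 - confidence_now)

# A prod-facing unknown with impact 0.9 that we have barely examined (confidence 0.4):
score = triage_score(0.9, 0.4)  # ≈ 0.54
```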
Queue item schema (artifact to display & act on):
{
  "queue_item_id": "TRIAGE:unknown:…",
  "unknown_id": "URN:unknown:…",
  "triage_score": 0.74,
  "impact_factors": { "runtime_presence": true, "privilege": "high", "fleet_prevalence": 0.62, "sla_tier": "gold" },
  "confidence_now": 0.28,
  "assigned_to": "unassigned",
  "due_by": "2025-12-02T17:00:00Z",
  "actions": [
    { "type": "collect_runtime_trace", "cost": "low" },
    { "type": "symbolic_slice_probe", "cost": "medium" },
    { "type": "vendor_VEX_request", "cost": "low" }
  ],
  "audit": [{ "at": "2025-11-29T12:05:00Z", "who": "system", "what": "enqueued: threshold_low crossed" }]
}
One UI artifact:
A “High-Impact Unknowns” queue view sorted by triage_score, showing:
- pill tags for impact factors,
- inline actions (Assign, Probe, Add Evidence, Mark Resolved),
- SLO badge showing due_by.
One ops hook (planning):
Pull top N items by triage_score at sprint start. Each resolved item must attach new evidence or a documented “Not Affected” rationale so the next decay cycle begins from stronger assumptions.
Wiring into Stella Ops quickly (dev notes)
- Storage: add an UnknownsRegistry collection/table; compute decay on read to avoid cron churn.
- Thresholds: start with half_life_days = 14, threshold_low = 0.35; tune later.
- Impact scoring: begin with simple weights in config (runtime_presence=0.4, privilege=0.3, prevalence=0.2, SLA=0.1).
- APIs:
  - GET /unknowns?stale=true (confidence < threshold)
  - POST /triage/enqueue (system-owned)
  - POST /unknowns/{id}/evidence (resets t0, recomputes next_review_at)
- Events: emit UnknownConfidenceCrossedLow → TriageItemCreated.
What you’ll have after a 1–2 day spike
- A decay card on each unknown + a simple, sortable triage queue.
- A daily export artifact to drive planning.
- A clear, auditable path from “we’re unsure” → “we gathered evidence” → “we’re confident (for now).”
If you want, I can generate:
- the C# POCOs/EF mappings,
- a minimal Controller set,
- Angular components (card + queue table),
- and seed data + an evaluator that computes confidence_now and triage_score from config.

Cool, let’s turn those two sketchy heuristics into something you can actually ship and iterate on in Stella Ops.
I’ll go deeper on:
- Decaying confidence as a proper first‑class concept
- The triage queue and workflow around “unknowns”
- A lightweight “unknown budget” / guardrail layer
- Concrete implementation sketches (data models, formulas, pseudo‑code)
- How this feeds planning & metrics
1) Decaying Confidence: From Idea to Mechanism
1.1 What “confidence” actually means
To keep semantics crisp, define confidence as:
“How fresh and well‑supported our knowledge is about this vulnerability in this subject along this dimension (reachability, exploitability, etc.).”
- 1.0 = recently assessed with strong evidence
- 0.0 = we basically haven’t looked / our info is ancient
This works for unknown, known‑affected, and known‑not‑affected; decay is about knowledge freshness, not the verdict itself.
For unknowns, confidence will usually be low and decaying → that’s what pushes them into the queue.
1.2 Data model v2 (UnknownsRegistry)
Extend the earlier object a bit:
{
  "unknown_id": "URN:unknown:pkg/npm/lodash:4.17.21:CVE-2021-23337:reachability",
  "subject_ref": {
    "type": "package",            // package | service | container | host | cluster
    "purl": "pkg:npm/lodash@4.17.21",
    "service_id": "checkout-api",
    "env": "prod"                 // prod | staging | dev
  },
  "vuln_id": "CVE-2021-23337",
  "dimension": "reachability",    // reachability | exploitability | fix_validity | other
  "state": "unknown",             // unknown | known_affected | known_not_affected | ignored
  "unknown_cause": "tooling_gap", // data_missing | vendor_silent | tooling_gap | conflicting_evidence
  "confidence": {
    "value": 0.62,                // computed on read
    "method": "half_life",
    "t0": "2025-11-29T12:00:00Z",
    "value_at_t0": 0.9,
    "half_life_days": 14,
    "threshold_low": 0.35,
    "threshold_high": 0.75
  },
  "impact": {
    "runtime_presence": true,
    "internet_exposed": true,
    "privilege_level": "high",
    "data_sensitivity": "pii",    // none | internal | pii | financial
    "fleet_prevalence": 0.62,     // fraction of services using this
    "sla_tier": "gold"            // bronze | silver | gold
  },
  "next_review_at": "2025-12-06T12:00:00Z",
  "owner": "team-checkout",
  "created_at": "2025-11-29T12:00:00Z",
  "updated_at": "2025-11-29T12:00:00Z",
  "evidence": [
    {
      "kind": "static_scan_hint",
      "summary": "No direct call from public handler to vulnerable sink found.",
      "created_at": "2025-11-29T12:00:00Z",
      "link": "https://stellaops/ui/evidence/123"
    }
  ]
}
Key points:
- unknown_cause helps you slice unknowns by “why do we not know?” (lack of data vs tooling vs vendor).
- impact is embedded here so triage scoring can be local without joining a ton of tables.
- half_life_days can be per dimension & per environment, e.g.:
  - prod + reachability → 7 days
  - staging + fix_validity → 30 days
1.3 Decay math & scheduling
Use exponential decay:
confidence(t) = value_at_t0 * 0.5^(Δdays / half_life_days)
Where:
Δdays = (now - t0) in days
On write (when you update or create the record), you:
1. Compute value_now from any previous state.
2. Apply a bump/delta based on new evidence (bounded to 0..1).
3. Set value_at_t0 = value_now_after_bump and t0 = now.
4. Precompute next_review_at = the time when confidence(t) will cross threshold_low.
Pseudo‑code for step 4:
double DaysUntilThreshold(double valueAtT0, double threshold, double halfLifeDays)
{
    if (valueAtT0 <= threshold) return 0;
    // threshold = valueAtT0 * 0.5^(Δ/halfLife)
    // Δ = halfLife * log(threshold/valueAtT0) / log(0.5)
    return halfLifeDays * Math.Log(threshold / valueAtT0) / Math.Log(0.5);
}
Then:
var days = DaysUntilThreshold(valueAtT0, thresholdLow, halfLifeDays);
nextReviewAt = now.AddDays(days);
Important: this gives you a cheap query to build the queue:
SELECT * FROM UnknownsRegistry
WHERE state = 'unknown'
AND next_review_at <= now();
No cron‑based bulk recomputation necessary.
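As a sanity check on the inversion, the same formula in Python: decaying for the computed number of days lands exactly on the threshold (0.9 / 0.35 / 14 are illustrative values):

```python
import math

def days_until_threshold(value_at_t0: float, threshold: float,
                         half_life_days: float) -> float:
    """Invert the decay curve: days until confidence crosses threshold."""
    if value_at_t0 <= threshold:
        return 0.0
    # Solve threshold = value_at_t0 * 0.5**(d / half_life_days) for d.
    return half_life_days * math.log(threshold / value_at_t0) / math.log(0.5)

d = days_until_threshold(0.9, 0.35, 14.0)  # ≈ 19.1 days
# Round trip: decaying 0.9 for d days yields exactly the threshold.
check = 0.9 * 0.5 ** (d / 14.0)
```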
1.4 Events that bump confidence
Any new evidence should “refresh” knowledge and adjust confidence:
Examples:
- Runtime traces showing the vulnerable function is never called in a hot path → bump reachability confidence up moderately (e.g. +0.2, capped at 0.9).
- A symbolic or fuzzing probe explicitly drives execution into the vulnerable code → flip state = known_affected, set confidence close to 1.0 with a longer half‑life.
- Vendor VEX: NOT AFFECTED → flip state = known_not_affected, long half‑life (60–90 days), high confidence.
- A new major release, infra changes, or new internet exposure → degrade confidence (e.g. −0.3) because the architecture changed.
Implement this as a simple rules table:
{
  "on_evidence": [
    {
      "when": { "kind": "runtime_trace", "result": "no_calls_observed" },
      "dimension": "reachability",
      "delta_confidence": 0.2,
      "half_life_days": 14
    },
    {
      "when": { "kind": "runtime_trace", "result": "calls_into_vuln" },
      "dimension": "reachability",
      "set_state": "known_affected",
      "set_confidence": 0.95,
      "half_life_days": 21
    },
    {
      "when": { "kind": "vendor_vex", "result": "not_affected" },
      "set_state": "known_not_affected",
      "set_confidence": 0.98,
      "half_life_days": 60
    }
  ]
}
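One way such a rules table could be evaluated against an incoming evidence event, sketched in Python (the dict shapes mirror the JSON above; first-match-wins and the clamp to 0..1 are assumptions):

```python
def apply_evidence(unknown: dict, event: dict, rules: list) -> dict:
    """Apply the first matching rule's effects to the unknown record in place."""
    for rule in rules:
        if all(event.get(k) == v for k, v in rule["when"].items()):
            if "set_state" in rule:
                unknown["state"] = rule["set_state"]
            if "set_confidence" in rule:
                unknown["confidence"] = rule["set_confidence"]
            elif "delta_confidence" in rule:
                bumped = unknown["confidence"] + rule["delta_confidence"]
                unknown["confidence"] = min(1.0, max(0.0, bumped))  # clamp 0..1
            if "half_life_days" in rule:
                unknown["half_life_days"] = rule["half_life_days"]
            break  # first matching rule wins
    return unknown

rules = [
    {"when": {"kind": "runtime_trace", "result": "no_calls_observed"},
     "delta_confidence": 0.2, "half_life_days": 14},
    {"when": {"kind": "vendor_vex", "result": "not_affected"},
     "set_state": "known_not_affected", "set_confidence": 0.98,
     "half_life_days": 60},
]
u = {"state": "unknown", "confidence": 0.62, "half_life_days": 14}
u = apply_evidence(u, {"kind": "runtime_trace", "result": "no_calls_observed"}, rules)
```

In the real system this would also reset t0 and recompute next_review_at, per section 1.3.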
1.5 UI for decaying confidence
On the Unknown Detail page, you can show:
- Confidence chip: “Knowledge freshness: 0.28 (stale)” with a color gradient.
- Decay sparkline: small chart showing confidence over the last 30 days.
- Next review: “Next review recommended by Dec 2, 2025 (in 3 days).”
- Evidence stack: timeline of evidence events with icons (static scan, runtime, vendor, etc.).
- Actions area: “Refresh now → trigger runtime probe / request VEX / open Jira.”
All of that makes the heuristic feel concrete and gives engineers a mental model: “this is decaying; here’s when we revisit; here’s how to add evidence.”
2) Triage Queue for High‑Impact Unknowns: Making It Useful
The goal: reduce an ocean of unknowns to a small, actionable queue that:
- Is ranked by risk, not noise
- Has clear owners and due dates
- Plugs cleanly into teams’ existing planning
2.1 Impact scoring, more formally
Define a normalized impact score I between 0 and 1:
I = w_env * EnvExposure
+ w_data * DataSensitivity
+ w_prevalence * Prevalence
+ w_sla * SlaCriticality
+ w_cvss * CvssSeverity
Where each factor is also 0–1:
- EnvExposure:
  - prod + internet_exposed → 1.0
  - prod + internal only → 0.7
  - non‑prod → 0.3
- DataSensitivity:
  - none → 0.0, internal → 0.3, pii → 0.7, financial/health → 1.0
- Prevalence:
  - fraction of services/assets affected (0..1)
- SlaCriticality:
  - bronze → 0.3, silver → 0.6, gold → 1.0
- CvssSeverity:
  - use CVSS normalized to 0..1 if you have it; otherwise approximate from critical/high/medium/low.
Weights w_* configurable, e.g.:
w_env = 0.3
w_data = 0.25
w_prevalence = 0.15
w_sla = 0.15
w_cvss = 0.15
These can live in a tenant‑level config.
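The weighted sum, sketched in Python with the config weights above (the factor values in the example are taken from the mapping tables: prod + internet, PII, 62% prevalence, gold tier, and an assumed CVSS of 9.0 normalized to 0.9):

```python
WEIGHTS = {"env": 0.30, "data": 0.25, "prevalence": 0.15, "sla": 0.15, "cvss": 0.15}

def impact_score(env_exposure: float, data_sensitivity: float, prevalence: float,
                 sla: float, cvss: float, weights: dict = WEIGHTS) -> float:
    """Weighted sum of normalized (0..1) impact factors; result is also 0..1."""
    return (weights["env"] * env_exposure
            + weights["data"] * data_sensitivity
            + weights["prevalence"] * prevalence
            + weights["sla"] * sla
            + weights["cvss"] * cvss)

i = impact_score(1.0, 0.7, 0.62, 1.0, 0.9)  # ≈ 0.85
```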
2.2 Triage score
You already had the core idea:
triage_score = Impact * (1 - ConfidenceNow)
You can enrich this slightly with recency:
RecencyBoost = min(1.2, 1.0 + DaysSinceCreated / 30 * 0.2)
triage_score = Impact * (1 - ConfidenceNow) * RecencyBoost
So very old unknowns with low confidence get a slight bump to avoid being buried forever.
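Putting the boost together with the base score (Python sketch; the 1.2 cap and 30-day ramp come from the formulas above):

```python
def triage_score_with_recency(impact: float, confidence_now: float,
                              days_since_created: float,
                              max_boost: float = 1.2) -> float:
    """Base triage score with a capped age boost so old unknowns aren't buried."""
    recency_boost = min(max_boost, 1.0 + days_since_created / 30.0 * 0.2)
    return impact * (1.0 - confidence_now) * recency_boost

fresh = triage_score_with_recency(0.9, 0.4, 0)    # no boost yet
stale = triage_score_with_recency(0.9, 0.4, 90)   # boost capped at 1.2
```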
2.3 Queue item lifecycle
Represent queue items as a simple workflow:
{
  "queue_item_id": "TRIAGE:unknown:…",
  "unknown_id": "URN:unknown:…",
  "triage_score": 0.81,
  "status": "open",                // open | in_progress | blocked | resolved | wont_fix
  "reason_blocked": null,
  "owner_team": "team-checkout",
  "assigned_to": "alice",
  "created_at": "2025-11-29T12:05:00Z",
  "due_by": "2025-12-02T17:00:00Z",
  "required_outcome": "add_evidence_or_verdict", // tasks that actually change state
  "suggested_actions": [
    { "type": "collect_runtime_trace", "cost": "low" },
    { "type": "symbolic_slice_probe", "cost": "medium" },
    { "type": "vendor_VEX_request", "cost": "low" }
  ],
  "audit": [
    {
      "at": "2025-11-29T12:05:00Z",
      "who": "system",
      "what": "enqueued: confidence below threshold_low; I=0.9, C=0.21"
    }
  ]
}
Rules:
- A queue item is (re)created automatically when the unknown’s next_review_at <= now and its impact is above a minimum threshold.
- When an engineer adds evidence or changes state on the underlying unknown, the system:
  - recomputes confidence, impact, and triage_score;
  - closes the queue item if confidence is now > threshold_high or state != unknown.
- You can allow re‑open if it decays again later.
2.4 Queue UI & ops hooks
In UI, the “High‑Impact Unknowns” view shows:
Columns:
- Unknown (vuln + subject)
- State (always “unknown” here, but future‑proof)
- Impact badge (Low/Med/High/Critical)
- Confidence chip
- Triage score (sortable)
- Owner team
- Due by
- Quick actions
Interactions:
- Default filter: impact >= High AND env = prod
- Per‑team view: filter owner_team = "team‑X"
- Bulk ops: “Assign top 10 to me”, “Open Jira for selected”, etc.
Ops hooks:
- Daily digest to each team: “You have 5 high‑impact unknowns due this week.”
- Planning export: per sprint, each team looks at “Top N unknowns by triage_score” and picks some into the sprint.
- SLO integration: if team’s “unknown budget” (see below) is overrun, they must schedule unknown work.
2.5 Example: one unknown from signal to closure
1. A new CVE hits; the SBOM says checkout-api uses the affected library. An unknown is created with:
   - Impact ≈ 0.9 (prod, internet, PII, critical CVE)
   - Confidence = 0.4 (all we know is “it exists”)
   - triage_score ≈ 0.9 * (1 - 0.4) = 0.54 → high enough to enqueue.
2. An engineer collects a runtime trace and sees no calls to the vulnerable path under normal traffic.
   - Evidence is added; confidence is bumped to 0.75 with a 14‑day half‑life.
   - The queue item auto‑resolves if your threshold_high is 0.7.
3. Two months later, the architecture changes and the service gets a new public endpoint.
   - The deployment event triggers an automatic “degrade confidence” rule (−0.2) and sets a new t0 with a shorter half‑life.
   - next_review_at moves closer; the unknown re‑enters the queue later.
This gives you continuously updating risk without manual spreadsheets.
3) Unknown Budget & Guardrails (Optional but Powerful)
To connect this to leadership/SRE conversations, define an “unknown budget” per service/team:
A target maximum risk mass of unknowns we’re willing to tolerate.
3.1 Per‑unknown “risk units”
For each unknown, define:
risk_units = Impact * (1 - ConfidenceNow)
(It’s literally the triage score, but aggregated differently.)
Per team or service:
unknown_risk_budget = sum(risk_units for that team/service)
You can then set guardrails, e.g.:
- Gold‑tier service: budget ≤ 5.0
- Silver: ≤ 15.0
- Bronze: ≤ 30.0
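Aggregation is trivial once risk units exist per unknown; a Python sketch (the budget numbers are the illustrative guardrails above):

```python
BUDGETS = {"gold": 5.0, "silver": 15.0, "bronze": 30.0}  # illustrative targets

def budget_status(unknowns: list, tier: str):
    """Sum risk units (impact * (1 - confidence)) and compare to the tier budget."""
    used = sum(u["impact"] * (1.0 - u["confidence"]) for u in unknowns)
    return used, used > BUDGETS[tier]

team_unknowns = [
    {"impact": 0.9, "confidence": 0.2},  # high impact, barely examined
    {"impact": 0.6, "confidence": 0.5},
]
used, exceeded = budget_status(team_unknowns, "gold")  # ≈ 1.02, within budget
```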
3.2 Guardrail behaviors
If a team exceeds its budget:
- Show warnings in the Stella Ops UI on the service details page.
- Add a banner in the high‑impact queue: “Unknown budget exceeded by 3.2 units.”
- Optionally, feed into deployment checks:
  - Above 2× budget → require security approval before prod deploy.
  - Above 1× budget → must plan unknown work in next sprint.
This ties the heuristics to behavior change without being draconian.
4) Implementation Sketch (API & Code)
4.1 C# model sketch
public enum UnknownState { Unknown, KnownAffected, KnownNotAffected, Ignored }
public sealed class UnknownRecord
{
    public string Id { get; set; } = default!;
    public string SubjectType { get; set; } = default!; // "package", "service", ...
    public string? Purl { get; set; }
    public string? ServiceId { get; set; }
    public string Env { get; set; } = "prod";

    public string? VulnId { get; set; } // CVE, GHSA, etc.
    public string Dimension { get; set; } = "reachability";
    public UnknownState State { get; set; } = UnknownState.Unknown;
    public string UnknownCause { get; set; } = "data_missing";

    // Confidence fields persisted
    public double ConfidenceValueAtT0 { get; set; }
    public DateTimeOffset ConfidenceT0 { get; set; }
    public double HalfLifeDays { get; set; }
    public double ThresholdLow { get; set; }
    public double ThresholdHigh { get; set; }
    public DateTimeOffset NextReviewAt { get; set; }

    // Impact factors
    public bool RuntimePresence { get; set; }
    public bool InternetExposed { get; set; }
    public string PrivilegeLevel { get; set; } = "low";
    public string DataSensitivity { get; set; } = "none";
    public double FleetPrevalence { get; set; }
    public string SlaTier { get; set; } = "bronze";

    // Ownership & audit
    public string OwnerTeam { get; set; } = default!;
    public DateTimeOffset CreatedAt { get; set; }
    public DateTimeOffset UpdatedAt { get; set; }
}
Helper to compute ConfidenceNow:
public static class ConfidenceCalculator
{
    public static double ComputeNow(UnknownRecord r, DateTimeOffset now)
    {
        var deltaDays = (now - r.ConfidenceT0).TotalDays;
        if (deltaDays <= 0) return Clamp01(r.ConfidenceValueAtT0);
        var factor = Math.Pow(0.5, deltaDays / r.HalfLifeDays);
        return Clamp01(r.ConfidenceValueAtT0 * factor);
    }

    public static (double valueAtT0, DateTimeOffset t0, DateTimeOffset nextReviewAt)
        ApplyEvidence(UnknownRecord r, double deltaConfidence, double? newHalfLifeDays, DateTimeOffset now)
    {
        var current = ComputeNow(r, now);
        var updated = Clamp01(current + deltaConfidence);
        var halfLife = newHalfLifeDays ?? r.HalfLifeDays;
        var daysToThreshold = DaysUntilThreshold(updated, r.ThresholdLow, halfLife);
        var nextReview = now.AddDays(daysToThreshold);
        return (updated, now, nextReview);
    }

    private static double DaysUntilThreshold(double valueAtT0, double threshold, double halfLifeDays)
    {
        if (valueAtT0 <= threshold) return 0;
        return halfLifeDays * Math.Log(threshold / valueAtT0) / Math.Log(0.5);
    }

    private static double Clamp01(double v) => v < 0 ? 0 : (v > 1 ? 1 : v);
}
Queue scoring:
public sealed class TriageConfig
{
    public double WEnv { get; set; } = 0.3;
    public double WData { get; set; } = 0.25;
    public double WPrev { get; set; } = 0.15;
    public double WSla { get; set; } = 0.15;
    public double WCvss { get; set; } = 0.15;

    public double MinImpactForQueue { get; set; } = 0.4;
    public double MaxRecencyBoost { get; set; } = 1.2;
}

public static class TriageScorer
{
    public static double ComputeImpact(UnknownRecord r, double cvssNorm, TriageConfig cfg)
    {
        var env = r.Env == "prod"
            ? (r.InternetExposed ? 1.0 : 0.7)
            : 0.3;

        var data = r.DataSensitivity switch
        {
            "none" => 0.0,
            "internal" => 0.3,
            "pii" => 0.7,
            "financial" => 1.0,
            _ => 0.3
        };

        var sla = r.SlaTier switch
        {
            "bronze" => 0.3,
            "silver" => 0.6,
            "gold" => 1.0,
            _ => 0.3
        };

        var prev = Math.Max(0, Math.Min(1, r.FleetPrevalence));

        return cfg.WEnv * env
             + cfg.WData * data
             + cfg.WPrev * prev
             + cfg.WSla * sla
             + cfg.WCvss * cvssNorm;
    }

    public static double ComputeTriageScore(
        UnknownRecord r,
        double cvssNorm,
        DateTimeOffset now,
        DateTimeOffset createdAt,
        TriageConfig cfg)
    {
        var impact = ComputeImpact(r, cvssNorm, cfg);
        if (impact < cfg.MinImpactForQueue) return 0;

        var confidence = ConfidenceCalculator.ComputeNow(r, now);
        var ageDays = (now - createdAt).TotalDays;
        var recencyBoost = Math.Min(cfg.MaxRecencyBoost, 1.0 + (ageDays / 30.0) * 0.2);

        return impact * (1 - confidence) * recencyBoost;
    }
}
This is all straightforward to wire into your existing C#/Angular stack.
5) How This Feeds Planning & Metrics
Once this is live, you get a bunch of useful knobs for product and leadership:
5.1 Per‑team dashboards
For each team/service, show:
- Unknown count (total & by dimension)
- Unknown risk budget (current vs target)
- Distribution of confidence (e.g., histogram buckets: 0–0.25, 0.25–0.5, etc.)
- Average age of unknowns
- Queue throughput:
  - # of unknowns investigated this sprint
  - average time from enqueued → evidence added / verdict

These tell you if teams are actually burning down epistemic risk or just tagging things.
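The confidence histogram is simple to compute from the decayed values; a Python sketch using half-open buckets (the bucket edges are the ones suggested above):

```python
def confidence_histogram(values, edges=(0.25, 0.5, 0.75)):
    """Count values into buckets [0,0.25), [0.25,0.5), [0.5,0.75), [0.75,1]."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        # Number of edges at or below v picks the bucket index.
        counts[sum(v >= e for e in edges)] += 1
    return counts

hist = confidence_histogram([0.1, 0.3, 0.3, 0.8])
```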
5.2 Process metrics to tune heuristics
Every quarter, look at:
- How many unknowns re‑enter the queue because decay hits the threshold?
- For unknowns that later become known‑affected incidents, what were their triage scores?
  - If many “incident‑causing unknowns” had low triage scores, adjust the weights.
- Are teams routinely ignoring certain impact factors (e.g., low data sensitivity)?
  - Maybe reduce the weight or adjust the scoring.
Because the heuristics are explicit and simple, you can iterate: tweak half‑lives and weights, observe effect on queue size and incident correlation.
If you’d like, next step I can sketch:
- A REST API surface (GET /unknowns, GET /unknowns/triage, POST /unknowns/{id}/evidence)
- Or specific Angular components for the Confidence Decay Card and High‑Impact Unknowns table, wired to these models.