Here’s a compact, low-friction way to tame “unknowns” in Stella Ops without boiling the ocean: two heuristics you can prototype this week—each yields one clear artifact you can show in the UI and wire into the next planning cycle. --- # 1) Decaying Confidence (Half-Life) for Unknown Reachability **Idea:** every “unknown” reachability/verdict starts with a confidence score that **decays over time** (exponential half-life). If no new evidence arrives, confidence naturally drifts toward “needs refresh,” preventing stale assumptions from lingering. **Why it helps (plain English):** unknowns don’t stay “probably fine” forever—this makes them self-expiring, so triage resurfaces them at the right time instead of only when something breaks. **Minimal data model (UnknownsRegistry):** ```json { "unknown_id": "URN:unknown:pkg/npm/lodash:4.17.21:CVE-2021-23337:reachability", "subject_ref": { "type": "package", "purl": "pkg:npm/lodash@4.17.21" }, "vuln_id": "CVE-2021-23337", "dimension": "reachability", "confidence": { "value": 0.78, "method": "half_life", "t0": "2025-11-29T12:00:00Z", "half_life_days": 14 }, "evidence": [{ "kind": "static_scan_hint", "hash": "…" }], "next_review_at": "2025-12-06T12:00:00Z", "status": "unknown" } ``` **Update rule (per tick or on read):** * `confidence_now = confidence_t0 * 0.5^(Δdays / half_life_days)` * When `confidence_now < threshold_low` → flag for human review (see Queue below). * When fresh evidence arrives → reset `t0`, optionally raise confidence. **One UI artifact:** A **“Confidence Decay Card”** on each unknown, showing: * sparkline of decay over time, * next review ETA, * button “Refresh with latest evidence” (re-run reachability probes). **One ops hook (planning):** Export a **daily CSV/JSON of unknowns whose confidence crossed threshold** to feed the triage board. --- # 2) Human-Review Queue for High-Impact Unknowns **Idea:** only a subset of unknowns deserve people time. Auto-rank them by potential blast radius + decayed confidence. **Triage score (simple, transparent):** `triage_score = impact_score * (1 - confidence_now)` * `impact_score` (0–1): runtime exposure, privilege, prevalence, SLA tier. * `confidence_now`: from heuristic #1. **Queue item schema (artifact to display & act on):** ```json { "queue_item_id": "TRIAGE:unknown:…", "unknown_id": "URN:unknown:…", "triage_score": 0.74, "impact_factors": { "runtime_presence": true, "privilege": "high", "fleet_prevalence": 0.62, "sla_tier": "gold" }, "confidence_now": 0.28, "assigned_to": "unassigned", "due_by": "2025-12-02T17:00:00Z", "actions": [ { "type": "collect_runtime_trace", "cost": "low" }, { "type": "symbolic_slice_probe", "cost": "medium" }, { "type": "vendor_VEX_request", "cost": "low" } ], "audit": [{ "at": "2025-11-29T12:05:00Z", "who": "system", "what": "enqueued: threshold_low crossed" }] } ``` **One UI artifact:** A **“High-Impact Unknowns” queue view** sorted by `triage_score`, showing: * pill tags for impact factors, * inline actions (Assign, Probe, Add Evidence, Mark Resolved), * SLO badge showing `due_by`. **One ops hook (planning):** Pull top N items by `triage_score` at sprint start. Each resolved item must attach new evidence or a documented “Not Affected” rationale so the next decay cycle begins from stronger assumptions. --- ## Wiring into Stella Ops quickly (dev notes) * **Storage:** add `UnknownsRegistry` collection/table; compute decay on read to avoid cron churn. * **Thresholds:** start with `half_life_days = 14`, `threshold_low = 0.35`; tune later. 
* **Impact scoring:** begin with simple weights in config (runtime_presence=0.4, privilege=0.3, prevalence=0.2, SLA=0.1).
* **APIs:**
  * `GET /unknowns?stale=true` (confidence < threshold)
  * `POST /triage/enqueue` (system-owned)
  * `POST /unknowns/{id}/evidence` (resets t0, recomputes next_review_at)
* **Events:** emit `UnknownConfidenceCrossedLow` → `TriageItemCreated`.

---

## What you’ll have after a 1–2 day spike

* A decay card on each unknown + a simple, sortable triage queue.
* A daily export artifact to drive planning.
* A clear, auditable path from “we’re unsure” → “we gathered evidence” → “we’re confident (for now).”

If you want, I can generate:

* the C# POCOs/EF mappings,
* a minimal Controller set,
* Angular components (card + queue table),
* and seed data + an evaluator that computes `confidence_now` and `triage_score` from config.

---

Cool, let’s turn those two sketchy heuristics into something you can actually ship and iterate on in Stella Ops. I’ll go deeper on:

1. Decaying confidence as a proper first‑class concept
2. The triage queue and workflow around “unknowns”
3. A lightweight “unknown budget” / guardrail layer
4. Concrete implementation sketches (data models, formulas, pseudo‑code)
5. How this feeds planning & metrics

---

## 1) Decaying Confidence: From Idea to Mechanism

### 1.1 What “confidence” actually means

To keep semantics crisp, define **confidence** as:

> “How fresh and well‑supported our knowledge is about this vulnerability in this subject along this dimension (reachability, exploitability, etc.).”

* `1.0` = Recently assessed with strong evidence
* `0.0` = We basically haven’t looked / our info is ancient

This works for **unknown**, **known‑affected**, and **known‑not‑affected**; decay is about **knowledge freshness**, not the verdict itself. For unknowns, confidence will usually be low and decaying → that’s what pushes them into the queue.

### 1.2 Data model v2 (UnknownsRegistry)

Extend the earlier object a bit:

```jsonc
{
  "unknown_id": "URN:unknown:pkg/npm/lodash:4.17.21:CVE-2021-23337:reachability",
  "subject_ref": {
    "type": "package",            // package | service | container | host | cluster
    "purl": "pkg:npm/lodash@4.17.21",
    "service_id": "checkout-api",
    "env": "prod"                 // prod | staging | dev
  },
  "vuln_id": "CVE-2021-23337",
  "dimension": "reachability",    // reachability | exploitability | fix_validity | other
  "state": "unknown",             // unknown | known_affected | known_not_affected | ignored
  "unknown_cause": "tooling_gap", // data_missing | vendor_silent | tooling_gap | conflicting_evidence
  "confidence": {
    "value": 0.62,                // computed on read
    "method": "half_life",
    "t0": "2025-11-29T12:00:00Z",
    "value_at_t0": 0.9,
    "half_life_days": 14,
    "threshold_low": 0.35,
    "threshold_high": 0.75
  },
  "impact": {
    "runtime_presence": true,
    "internet_exposed": true,
    "privilege_level": "high",
    "data_sensitivity": "pii",    // none | internal | pii | financial
    "fleet_prevalence": 0.62,     // fraction of services using this
    "sla_tier": "gold"            // bronze | silver | gold
  },
  "next_review_at": "2025-12-18T12:00:00Z", // precomputed: when confidence crosses threshold_low (§1.3)
  "owner": "team-checkout",
  "created_at": "2025-11-29T12:00:00Z",
  "updated_at": "2025-11-29T12:00:00Z",
  "evidence": [
    {
      "kind": "static_scan_hint",
      "summary": "No direct call from public handler to vulnerable sink found.",
      "created_at": "2025-11-29T12:00:00Z",
      "link": "https://stellaops/ui/evidence/123"
    }
  ]
}
```

Key points:

* `unknown_cause` helps you slice unknowns by “why do we not know?” (lack of data vs tooling vs vendor).
* `impact` is embedded here so triage scoring can be local without joining a ton of tables.
* `half_life_days` can be **per dimension & per environment** (see the lookup sketch below), e.g.:
  * prod + reachability → 7 days
  * staging + fix_validity → 30 days
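To make that last bullet concrete, here is a minimal sketch of a half‑life lookup; the `HalfLifePolicy` name, the override table, and the 14‑day default are illustrative assumptions, not part of the model above:

```csharp
using System.Collections.Generic;

public static class HalfLifePolicy
{
    // Overrides per (dimension, env); anything not listed falls back to the default.
    private static readonly Dictionary<(string Dimension, string Env), double> Overrides = new()
    {
        [("reachability", "prod")] = 7,      // prod reachability goes stale quickly
        [("fix_validity", "staging")] = 30,  // staging fix checks can age longer
    };

    public static double Resolve(string dimension, string env, double defaultDays = 14)
        => Overrides.TryGetValue((dimension, env), out var days) ? days : defaultDays;
}
```

One way to use it: stamp the resolved value into `half_life_days` at write time, so reads never need to consult the policy table.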
### 1.3 Decay math & scheduling

Use exponential decay:

```text
confidence(t) = value_at_t0 * 0.5^(Δdays / half_life_days)
```

Where:

* `Δdays = (now - t0) in days`

On write (when you update or create the record), you:

1. Compute `value_now` from any previous state.
2. Apply a bump/delta based on new evidence (bounded by 0..1).
3. Set `value_at_t0 = value_now_after_bump`, `t0 = now`.
4. Precompute `next_review_at` = when `confidence(t)` will cross `threshold_low`.

Pseudo‑code for step 4:

```csharp
double DaysUntilThreshold(double valueAtT0, double threshold, double halfLifeDays)
{
    if (valueAtT0 <= threshold) return 0;
    // threshold = valueAtT0 * 0.5^(Δ/halfLife)
    // Δ = halfLife * log(threshold/valueAtT0) / log(0.5)
    return halfLifeDays * Math.Log(threshold / valueAtT0) / Math.Log(0.5);
}
```

Then:

```csharp
var days = DaysUntilThreshold(valueAtT0, thresholdLow, halfLifeDays);
nextReviewAt = now.AddDays(days);
```

**Important:** this gives you a **cheap query** to build the queue:

```sql
SELECT *
FROM UnknownsRegistry
WHERE state = 'unknown'
  AND next_review_at <= now();
```

No cron‑based bulk recomputation necessary.

### 1.4 Events that bump confidence

Any new evidence should “refresh” knowledge and adjust confidence. Examples:

* **Runtime traces show the vulnerable function is never called in a hot path** → bump reachability confidence up moderately (e.g. +0.2, capped at 0.9).
* **A symbolic or fuzzing probe explicitly drives execution into the vulnerable code** → flip `state = known_affected`, set confidence close to 1.0 with a longer half‑life.
* **Vendor VEX: NOT AFFECTED** → flip `state = known_not_affected`, long half‑life (60–90 days), high confidence.
* **New major release, infra changes, or new internet exposure** → degrade confidence (e.g. −0.3) because the architecture changed.

Implement this as a simple rules table:

```jsonc
{
  "on_evidence": [
    {
      "when": { "kind": "runtime_trace", "result": "no_calls_observed" },
      "dimension": "reachability",
      "delta_confidence": 0.2,
      "half_life_days": 14
    },
    {
      "when": { "kind": "runtime_trace", "result": "calls_into_vuln" },
      "dimension": "reachability",
      "set_state": "known_affected",
      "set_confidence": 0.95,
      "half_life_days": 21
    },
    {
      "when": { "kind": "vendor_vex", "result": "not_affected" },
      "set_state": "known_not_affected",
      "set_confidence": 0.98,
      "half_life_days": 60
    }
  ]
}
```

### 1.5 UI for decaying confidence

On the **Unknown Detail page**, you can show:

* **Confidence chip**:
  * “Knowledge freshness: 0.28 (stale)” with a color gradient.
* **Decay sparkline**: small chart showing confidence over the last 30 days (computable from stored fields alone; see the sketch below).
* **Next review**: “Next review recommended by Dec 2, 2025 (in 3 days)”
* **Evidence stack**: timeline of evidence events with icons (static scan, runtime, vendor, etc.).
* **Actions area**: “Refresh now → Trigger runtime probe / request VEX / open Jira”.
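The sparkline is cheap because no history table is needed: the curve since the last reset is fully determined by the persisted `value_at_t0`, `t0`, and `half_life_days`. A minimal sketch follows; the `DecaySeries` helper is hypothetical, and points before `t0` are approximated as flat since earlier segments aren’t persisted:

```csharp
using System;
using System.Collections.Generic;

public static class DecaySeries
{
    // One point per day across [from, to], clamped to 0..1 for charting.
    public static IReadOnlyList<(DateTimeOffset At, double Confidence)> Build(
        double valueAtT0, DateTimeOffset t0, double halfLifeDays,
        DateTimeOffset from, DateTimeOffset to)
    {
        var points = new List<(DateTimeOffset, double)>();
        for (var at = from; at <= to; at = at.AddDays(1))
        {
            var deltaDays = (at - t0).TotalDays;
            var value = deltaDays <= 0
                ? valueAtT0 // flat approximation before the last evidence reset
                : valueAtT0 * Math.Pow(0.5, deltaDays / halfLifeDays);
            points.Add((at, Math.Clamp(value, 0.0, 1.0)));
        }
        return points;
    }
}
```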
All of that makes the heuristic feel concrete and gives engineers a mental model: “this is decaying; here’s when we revisit; here’s how to add evidence.”

---

## 2) Triage Queue for High‑Impact Unknowns: Making It Useful

The goal: **reduce an ocean of unknowns to a small, actionable queue** that:

* Is **ranked by risk**, not noise
* Has clear **owners and due dates**
* Plugs cleanly into teams’ existing planning

### 2.1 Impact scoring, more formally

Define a normalized **impact score** `I` between 0 and 1:

```text
I = w_env * EnvExposure
  + w_data * DataSensitivity
  + w_prevalence * Prevalence
  + w_sla * SlaCriticality
  + w_cvss * CvssSeverity
```

Where each factor is also 0–1:

* `EnvExposure`:
  * prod + internet_exposed → 1.0
  * prod + internal only → 0.7
  * non‑prod → 0.3
* `DataSensitivity`:
  * none → 0.0, internal → 0.3, pii → 0.7, financial/health → 1.0
* `Prevalence`:
  * fraction of services/assets affected (0..1)
* `SlaCriticality`:
  * bronze → 0.3, silver → 0.6, gold → 1.0
* `CvssSeverity`:
  * use CVSS normalized to 0..1 if you have it, otherwise approximate from “critical/high/med/low”.

Weights `w_*` are configurable, e.g.:

```text
w_env        = 0.3
w_data       = 0.25
w_prevalence = 0.15
w_sla        = 0.15
w_cvss       = 0.15
```

These can live in a tenant‑level config.

### 2.2 Triage score

You already had the core idea:

```text
triage_score = Impact * (1 - ConfidenceNow)
```

You can enrich this slightly with recency:

```text
RecencyBoost = min(1.2, 1.0 + DaysSinceCreated / 30 * 0.2)
triage_score = Impact * (1 - ConfidenceNow) * RecencyBoost
```

So very old unknowns with low confidence get a slight bump to avoid being buried forever.

### 2.3 Queue item lifecycle

Represent queue items as a simple workflow:

```jsonc
{
  "queue_item_id": "TRIAGE:unknown:…",
  "unknown_id": "URN:unknown:…",
  "triage_score": 0.81,
  "status": "open",              // open | in_progress | blocked | resolved | wont_fix
  "reason_blocked": null,
  "owner_team": "team-checkout",
  "assigned_to": "alice",
  "created_at": "2025-11-29T12:05:00Z",
  "due_by": "2025-12-02T17:00:00Z",
  "required_outcome": "add_evidence_or_verdict",
  // tasks that actually change state
  "suggested_actions": [
    { "type": "collect_runtime_trace", "cost": "low" },
    { "type": "symbolic_slice_probe", "cost": "medium" },
    { "type": "vendor_VEX_request", "cost": "low" }
  ],
  "audit": [
    {
      "at": "2025-11-29T12:05:00Z",
      "who": "system",
      "what": "enqueued: confidence below threshold_low; I=0.9, C=0.21"
    }
  ]
}
```

Rules (a minimal evaluator for these is sketched after the ops hooks below):

* A queue item is (re)created automatically when the unknown’s `next_review_at <= now` **and** impact is above a minimum threshold.
* When an engineer **adds evidence** or changes `state` on the underlying unknown, the system:
  * Recomputes confidence, impact, triage_score
  * Closes the queue item if confidence is now > `threshold_high` or state != unknown
* You can allow items to **re‑open** if confidence decays again later.

### 2.4 Queue UI & ops hooks

In UI, the **“High‑Impact Unknowns”** view shows:

Columns:

* Unknown (vuln + subject)
* State (always “unknown” here, but future‑proof)
* Impact badge (Low/Med/High/Critical)
* Confidence chip
* Triage score (sortable)
* Owner team
* Due by
* Quick actions

Interactions:

* Default filter: `impact >= High` AND `env = prod`
* Per‑team view: filter owner_team = “team‑X”
* Bulk ops: “Assign top 10 to me”, “Open Jira for selected”, etc.

Ops hooks:

* **Daily digest** to each team: “You have 5 high‑impact unknowns due this week.”
* **Planning export**: per sprint, each team looks at “Top N unknowns by triage_score” and picks some into the sprint.
* **SLO integration**: if a team’s “unknown budget” (see below) is overrun, they must schedule unknown work.
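As promised above, the §2.3 rules collapse into one small, order‑sensitive decision function. A minimal sketch; `QueueLifecycle` and `QueueDecision` are placeholder names, and the inputs mirror fields already present in the models:

```csharp
using System;

public enum QueueDecision { None, Enqueue, Close }

public static class QueueLifecycle
{
    public static QueueDecision Evaluate(
        bool stateIsUnknown, double impact, double confidenceNow,
        DateTimeOffset nextReviewAt, double thresholdHigh, double minImpactForQueue,
        bool hasOpenItem, DateTimeOffset now)
    {
        // Close first: a verdict, or confidence refreshed above threshold_high,
        // retires the open item.
        if (hasOpenItem && (!stateIsUnknown || confidenceNow > thresholdHigh))
            return QueueDecision.Close;

        // (Re)create only when the review is due AND the blast radius justifies people time.
        if (!hasOpenItem && stateIsUnknown
            && nextReviewAt <= now && impact >= minImpactForQueue)
            return QueueDecision.Enqueue;

        return QueueDecision.None;
    }
}
```

Running this on every evidence write plus the cheap `next_review_at` query from §1.3 keeps the queue consistent without a separate scheduler.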
### 2.5 Example: one unknown from signal to closure

1. New CVE hits; SBOM says `checkout-api` uses the affected library.
   * Unknown created with:
     * Impact ≈ 0.9 (prod, internet, PII, critical CVE)
     * Confidence = 0.4 (all we know is “it exists”).
   * `triage_score ≈ 0.9 * (1 - 0.4) = 0.54` → high enough to enqueue.
2. Engineer collects a runtime trace, sees no calls to the vulnerable path under normal traffic.
   * Evidence added, confidence bumped to 0.75, half‑life 14 days.
   * Queue item auto‑resolves if your `threshold_high` is 0.7.
3. Two months later, the architecture changes and the service gets a new public endpoint.
   * The deployment event triggers an automatic “degrade confidence” rule (−0.2), sets a new `t0` and a shorter half‑life.
   * `next_review_at` moves closer; the unknown re‑enters the queue later.

This gives you **continuously updating risk** without manual spreadsheets.

---

## 3) Unknown Budget & Guardrails (Optional but Powerful)

To connect this to leadership/SRE conversations, define an **“unknown budget”** per service/team:

> A target maximum risk mass of unknowns we’re willing to tolerate.

### 3.1 Per‑unknown “risk units”

For each unknown, define:

```text
risk_units = Impact * (1 - ConfidenceNow)
```

(It’s the triage score without the recency boost, aggregated instead of ranked.)

Per team or service, the current risk mass is:

```text
unknown_risk_mass = sum(risk_units for that team/service)
```

(A rollup sketch appears at the end of this message.) The budget is the cap you set on that mass as a **guardrail**, e.g.:

* Gold‑tier service: risk mass ≤ 5.0
* Silver: ≤ 15.0
* Bronze: ≤ 30.0

### 3.2 Guardrail behaviors

If a team’s risk mass exceeds its budget:

* Show warnings in the Stella Ops UI on the service details page.
* Add a banner in the high‑impact queue: “Unknown budget exceeded by 3.2 units.”
* Optional: feed into deployment checks:
  * Above 2× budget → require security approval before prod deploy.
  * Above 1× budget → must plan unknown work in next sprint.

This ties the heuristics to behavior change without being draconian.

---

## 4) Implementation Sketch (API & Code)

### 4.1 C# model sketch

```csharp
public enum UnknownState { Unknown, KnownAffected, KnownNotAffected, Ignored }

public sealed class UnknownRecord
{
    public string Id { get; set; } = default!;

    public string SubjectType { get; set; } = default!; // "package", "service", ...
    public string? Purl { get; set; }
    public string? ServiceId { get; set; }
    public string Env { get; set; } = "prod";

    public string? VulnId { get; set; } // CVE, GHSA, etc.
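    // --- Assessment dimension & verdict ---
    // Dimension is the knowledge axis being tracked; decay applies to
    // confidence (knowledge freshness), never to State itself (§1.1).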
public string Dimension { get; set; } = "reachability"; public UnknownState State { get; set; } = UnknownState.Unknown; public string UnknownCause { get; set; } = "data_missing"; // Confidence fields persisted public double ConfidenceValueAtT0 { get; set; } public DateTimeOffset ConfidenceT0 { get; set; } public double HalfLifeDays { get; set; } public double ThresholdLow { get; set; } public double ThresholdHigh { get; set; } public DateTimeOffset NextReviewAt { get; set; } // Impact factors public bool RuntimePresence { get; set; } public bool InternetExposed { get; set; } public string PrivilegeLevel { get; set; } = "low"; public string DataSensitivity { get; set; } = "none"; public double FleetPrevalence { get; set; } public string SlaTier { get; set; } = "bronze"; // Ownership & audit public string OwnerTeam { get; set; } = default!; public DateTimeOffset CreatedAt { get; set; } public DateTimeOffset UpdatedAt { get; set; } } ``` Helper to compute `ConfidenceNow`: ```csharp public static class ConfidenceCalculator { public static double ComputeNow(UnknownRecord r, DateTimeOffset now) { var deltaDays = (now - r.ConfidenceT0).TotalDays; if (deltaDays <= 0) return Clamp01(r.ConfidenceValueAtT0); var factor = Math.Pow(0.5, deltaDays / r.HalfLifeDays); return Clamp01(r.ConfidenceValueAtT0 * factor); } public static (double valueAtT0, DateTimeOffset t0, DateTimeOffset nextReviewAt) ApplyEvidence(UnknownRecord r, double deltaConfidence, double? newHalfLifeDays, DateTimeOffset now) { var current = ComputeNow(r, now); var updated = Clamp01(current + deltaConfidence); var halfLife = newHalfLifeDays ?? r.HalfLifeDays; var daysToThreshold = DaysUntilThreshold(updated, r.ThresholdLow, halfLife); var nextReview = now.AddDays(daysToThreshold); return (updated, now, nextReview); } private static double DaysUntilThreshold(double valueAtT0, double threshold, double halfLifeDays) { if (valueAtT0 <= threshold) return 0; return halfLifeDays * Math.Log(threshold / valueAtT0) / Math.Log(0.5); } private static double Clamp01(double v) => v < 0 ? 0 : (v > 1 ? 1 : v); } ``` Queue scoring: ```csharp public sealed class TriageConfig { public double WEnv { get; set; } = 0.3; public double WData { get; set; } = 0.25; public double WPrev { get; set; } = 0.15; public double WSla { get; set; } = 0.15; public double WCvss { get; set; } = 0.15; public double MinImpactForQueue { get; set; } = 0.4; public double MaxRecencyBoost { get; set; } = 1.2; } public static class TriageScorer { public static double ComputeImpact(UnknownRecord r, double cvssNorm, TriageConfig cfg) { var env = r.Env == "prod" ? (r.InternetExposed ? 
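            // §2.1 EnvExposure: prod + internet-exposed → 1.0; prod internal → 0.7; non-prod → 0.3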
            1.0 : 0.7) : 0.3;

        var data = r.DataSensitivity switch
        {
            "none" => 0.0,
            "internal" => 0.3,
            "pii" => 0.7,
            "financial" => 1.0,
            "health" => 1.0,
            _ => 0.3
        };

        var sla = r.SlaTier switch
        {
            "bronze" => 0.3,
            "silver" => 0.6,
            "gold" => 1.0,
            _ => 0.3
        };

        var prev = Math.Max(0, Math.Min(1, r.FleetPrevalence));

        return cfg.WEnv * env
             + cfg.WData * data
             + cfg.WPrev * prev
             + cfg.WSla * sla
             + cfg.WCvss * cvssNorm;
    }

    public static double ComputeTriageScore(
        UnknownRecord r,
        double cvssNorm,
        DateTimeOffset now,
        DateTimeOffset createdAt,
        TriageConfig cfg)
    {
        var impact = ComputeImpact(r, cvssNorm, cfg);
        var confidence = ConfidenceCalculator.ComputeNow(r, now);

        if (impact < cfg.MinImpactForQueue) return 0;

        var ageDays = (now - createdAt).TotalDays;
        var recencyBoost = Math.Min(cfg.MaxRecencyBoost, 1.0 + (ageDays / 30.0) * 0.2);

        return impact * (1 - confidence) * recencyBoost;
    }
}
```

This is all straightforward to wire into your existing C#/Angular stack.

---

## 5) How This Feeds Planning & Metrics

Once this is live, you get a bunch of useful knobs for product and leadership:

### 5.1 Per‑team dashboards

For each team/service, show:

* **Unknown count** (total & by dimension)
* **Unknown risk mass vs budget** (current vs target)
* **Distribution of confidence** (e.g., histogram buckets: 0–0.25, 0.25–0.5, etc.)
* **Average age of unknowns**
* **Queue throughput**:
  * # of unknowns investigated this sprint
  * Average time from `enqueued → evidence added / verdict`

These tell you if teams are actually burning down epistemic risk or just tagging things.

### 5.2 Process metrics to tune heuristics

Every quarter, look at:

* How many unknowns **re‑enter the queue** because decay hits the threshold?
* For unknowns that later become **known‑affected incidents**, what were their triage scores?
  * If many “incident‑causing unknowns” had low triage scores, adjust the weights.
* Are teams routinely ignoring certain impact factors (e.g., low data sensitivity)?
  * Maybe reduce the weight or adjust the scoring.

Because the heuristics are **explicit and simple**, you can iterate: tweak half‑lives and weights, observe the effect on queue size and incident correlation.

---

If you’d like, next step I can sketch:

* A REST API surface (`GET /unknowns`, `GET /unknowns/triage`, `POST /unknowns/{id}/evidence`)
* Or specific Angular components for the **Confidence Decay Card** and **High‑Impact Unknowns** table, wired to these models.
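P.S. One gap worth closing early: §3.1 defines the risk-mass rollup only in prose. Here is a minimal sketch, assuming the `UnknownRecord`, `TriageScorer`, and `ConfidenceCalculator` shapes above; the `UnknownBudget` name, the tier caps, the one-tier-per-team grouping, and the flat `cvssNorm = 0.5` placeholder are all illustrative assumptions:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class UnknownBudget
{
    // Illustrative caps; in practice these would live in tenant config (§3.1).
    private static readonly Dictionary<string, double> TierCaps = new()
    {
        ["gold"] = 5.0,
        ["silver"] = 15.0,
        ["bronze"] = 30.0,
    };

    public static IEnumerable<(string Team, double RiskMass, double Cap, bool OverBudget)>
        Rollup(IEnumerable<UnknownRecord> unknowns, TriageConfig cfg, DateTimeOffset now)
        => unknowns
            .Where(u => u.State == UnknownState.Unknown)
            .GroupBy(u => u.OwnerTeam)
            .Select(g =>
            {
                // risk_units = Impact * (1 - ConfidenceNow), summed into the team's risk mass
                var mass = g.Sum(u =>
                    TriageScorer.ComputeImpact(u, /* cvssNorm */ 0.5, cfg)
                    * (1 - ConfidenceCalculator.ComputeNow(u, now)));

                // Simplification: assume one SLA tier per team and take the cap from it.
                var cap = TierCaps.GetValueOrDefault(g.First().SlaTier, 30.0);

                return (Team: g.Key, RiskMass: mass, Cap: cap, OverBudget: mass > cap);
            });
}
```

A rollup like this can back both the per-team dashboard tiles in §5.1 and the over-budget banner in §3.2.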