git.stella-ops.org/docs/product-advisories/30-Nov-2025 - Unknowns Decay & Triage Heuristics.md
StellaOps Bot 25254e3831 news advisories
2025-11-30 21:00:38 +02:00


Here's a compact, low-friction way to tame “unknowns” in Stella Ops without boiling the ocean: two heuristics you can prototype this week. Each yields one clear artifact you can show in the UI and wire into the next planning cycle.


1) Decaying Confidence (Half-Life) for Unknown Reachability

Idea: every “unknown” reachability/verdict starts with a confidence score that decays over time (exponential half-life). If no new evidence arrives, confidence naturally drifts toward “needs refresh,” preventing stale assumptions from lingering.

Why it helps (plain English): unknowns don't stay “probably fine” forever. This makes them self-expiring, so triage resurfaces them at the right time instead of only when something breaks.

Minimal data model (UnknownsRegistry):

{
  "unknown_id": "URN:unknown:pkg/npm/lodash:4.17.21:CVE-2021-23337:reachability",
  "subject_ref": { "type": "package", "purl": "pkg:npm/lodash@4.17.21" },
  "vuln_id": "CVE-2021-23337",
  "dimension": "reachability", 
  "confidence": { "value": 0.78, "method": "half_life", "t0": "2025-11-29T12:00:00Z", "half_life_days": 14 },
  "evidence": [{ "kind": "static_scan_hint", "hash": "…" }],
  "next_review_at": "2025-12-06T12:00:00Z",
  "status": "unknown"
}

Update rule (per tick or on read):

  • confidence_now = confidence_t0 * 0.5^(Δdays / half_life_days)
  • When confidence_now < threshold_low → flag for human review (see Queue below).
  • When fresh evidence arrives → reset t0, optionally raise confidence.
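
The update rule is a few lines of code. Here it is as a quick Python sketch (field names mirror the registry object above; the C# models come later in these notes):

```python
from datetime import datetime, timedelta, timezone

def confidence_now(value_at_t0: float, t0: datetime, now: datetime,
                   half_life_days: float) -> float:
    """Exponential half-life decay: confidence halves every half_life_days."""
    delta_days = (now - t0).total_seconds() / 86400.0
    if delta_days <= 0:
        return value_at_t0
    return value_at_t0 * 0.5 ** (delta_days / half_life_days)

t0 = datetime(2025, 11, 29, 12, 0, tzinfo=timezone.utc)
one_half_life = confidence_now(0.78, t0, t0 + timedelta(days=14), half_life_days=14)
print(round(one_half_life, 2))  # 0.39, exactly half of the starting 0.78
```

Computing on read like this means no background job has to touch every record on every tick.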

One UI artifact: A “Confidence Decay Card” on each unknown, showing:

  • sparkline of decay over time,
  • next review ETA,
  • button “Refresh with latest evidence” (re-run reachability probes).

One ops hook (planning): Export a daily CSV/JSON of unknowns whose confidence crossed threshold to feed the triage board.


2) Human-Review Queue for High-Impact Unknowns

Idea: only a subset of unknowns deserves people's time. Auto-rank them by potential blast radius + decayed confidence.

Triage score (simple, transparent): triage_score = impact_score * (1 - confidence_now)

  • impact_score (0–1): runtime exposure, privilege, prevalence, SLA tier.
  • confidence_now: from heuristic #1.
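
In code that's a one-liner (Python sketch; the two inputs are the scores described above):

```python
def triage_score(impact_score: float, confidence_now: float) -> float:
    # High impact we know little about scores highest; high impact we are
    # already confident about (or low impact) sinks toward zero.
    return impact_score * (1.0 - confidence_now)

print(round(triage_score(0.9, 0.28), 3))   # 0.648: high impact, stale knowledge
print(round(triage_score(0.9, 0.95), 3))   # 0.045: same impact, fresh evidence
```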

Queue item schema (artifact to display & act on):

{
  "queue_item_id": "TRIAGE:unknown:…",
  "unknown_id": "URN:unknown:…",
  "triage_score": 0.74,
  "impact_factors": { "runtime_presence": true, "privilege": "high", "fleet_prevalence": 0.62, "sla_tier": "gold" },
  "confidence_now": 0.28,
  "assigned_to": "unassigned",
  "due_by": "2025-12-02T17:00:00Z",
  "actions": [
    { "type": "collect_runtime_trace", "cost": "low" },
    { "type": "symbolic_slice_probe", "cost": "medium" },
    { "type": "vendor_VEX_request", "cost": "low" }
  ],
  "audit": [{ "at": "2025-11-29T12:05:00Z", "who": "system", "what": "enqueued: threshold_low crossed" }]
}

One UI artifact: A “High-Impact Unknowns” queue view sorted by triage_score, showing:

  • pill tags for impact factors,
  • inline actions (Assign, Probe, Add Evidence, Mark Resolved),
  • SLO badge showing due_by.

One ops hook (planning): Pull top N items by triage_score at sprint start. Each resolved item must attach new evidence or a documented “Not Affected” rationale so the next decay cycle begins from stronger assumptions.


Wiring into Stella Ops quickly (dev notes)

  • Storage: add UnknownsRegistry collection/table; compute decay on read to avoid cron churn.
  • Thresholds: start with half_life_days = 14, threshold_low = 0.35; tune later.
  • Impact scoring: begin with simple weights in config (runtime_presence=0.4, privilege=0.3, prevalence=0.2, SLA=0.1).
  • APIs:
    • GET /unknowns?stale=true (confidence < threshold)
    • POST /triage/enqueue (system-owned)
    • POST /unknowns/{id}/evidence (resets t0, recomputes next_review_at)
  • Events: emit UnknownConfidenceCrossedLow → TriageItemCreated.

What you'll have after a 1–2 day spike

  • A decay card on each unknown + a simple, sortable triage queue.
  • A daily export artifact to drive planning.
  • A clear, auditable path from “we're unsure” → “we gathered evidence” → “we're confident (for now).”

If you want, I can generate:

  • the C# POCOs/EF mappings,
  • a minimal Controller set,
  • Angular components (card + queue table),
  • and seed data + an evaluator that computes confidence_now and triage_score from config.


Cool, let's turn those two sketchy heuristics into something you can actually ship and iterate on in StellaOps.

I'll go deeper on:

  1. Decaying confidence as a proper first-class concept
  2. The triage queue and workflow around “unknowns”
  3. A lightweight “unknown budget” / guardrail layer
  4. Concrete implementation sketches (data models, formulas, pseudocode)
  5. How this feeds planning & metrics

1) Decaying Confidence: From Idea to Mechanism

1.1 What “confidence” actually means

To keep semantics crisp, define confidence as:

“How fresh and well-supported our knowledge is about this vulnerability in this subject along this dimension (reachability, exploitability, etc.).”

  • 1.0 = Recently assessed with strong evidence
  • 0.0 = We basically haven't looked / our info is ancient

This works for unknown, known-affected, and known-not-affected; decay is about knowledge freshness, not the verdict itself.

For unknowns, confidence will usually be low and decaying → that's what pushes them into the queue.

1.2 Data model v2 (UnknownsRegistry)

Extend the earlier object a bit:

{
  "unknown_id": "URN:unknown:pkg/npm/lodash:4.17.21:CVE-2021-23337:reachability",

  "subject_ref": {
    "type": "package",        // package | service | container | host | cluster
    "purl": "pkg:npm/lodash@4.17.21",
    "service_id": "checkout-api",
    "env": "prod"             // prod | staging | dev
  },

  "vuln_id": "CVE-2021-23337",
  "dimension": "reachability", // reachability | exploitability | fix_validity | other

  "state": "unknown",          // unknown | known_affected | known_not_affected | ignored
  "unknown_cause": "tooling_gap", // data_missing | vendor_silent | tooling_gap | conflicting_evidence

  "confidence": {
    "value": 0.62,             // computed on read
    "method": "half_life",
    "t0": "2025-11-29T12:00:00Z",
    "value_at_t0": 0.9,
    "half_life_days": 14,
    "threshold_low": 0.35,
    "threshold_high": 0.75
  },

  "impact": {
    "runtime_presence": true,
    "internet_exposed": true,
    "privilege_level": "high",
    "data_sensitivity": "pii", // none | internal | pii | financial
    "fleet_prevalence": 0.62,  // fraction of services using this
    "sla_tier": "gold"         // bronze | silver | gold
  },

  "next_review_at": "2025-12-06T12:00:00Z",
  "owner": "team-checkout",
  "created_at": "2025-11-29T12:00:00Z",
  "updated_at": "2025-11-29T12:00:00Z",

  "evidence": [
    {
      "kind": "static_scan_hint",
      "summary": "No direct call from public handler to vulnerable sink found.",
      "created_at": "2025-11-29T12:00:00Z",
      "link": "https://stellaops/ui/evidence/123"
    }
  ]
}

Key points:

  • unknown_cause helps you slice unknowns by “why do we not know?” (lack of data vs tooling vs vendor).

  • impact is embedded here so triage scoring can be local without joining a ton of tables.

  • half_life_days can be per dimension & per environment, e.g.:

    • prod + reachability → 7 days
    • staging + fix_validity → 30 days
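
A minimal lookup table makes that concrete. This is a hypothetical config sketch; the keys and defaults are illustrative, not shipped Stella Ops settings:

```python
# Hypothetical (env, dimension) -> half-life overrides, with a global default.
HALF_LIFE_OVERRIDES = {
    ("prod", "reachability"): 7,
    ("staging", "fix_validity"): 30,
}
DEFAULT_HALF_LIFE_DAYS = 14

def half_life_for(env: str, dimension: str) -> int:
    return HALF_LIFE_OVERRIDES.get((env, dimension), DEFAULT_HALF_LIFE_DAYS)

print(half_life_for("prod", "reachability"))   # 7: prod knowledge goes stale fast
print(half_life_for("dev", "exploitability"))  # 14: falls back to the default
```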

1.3 Decay math & scheduling

Use exponential decay:

confidence(t) = value_at_t0 * 0.5^(Δdays / half_life_days)

Where:

  • Δdays = (now - t0) in days

On write (when you update or create the record), you:

  1. Compute value_now from any previous state.
  2. Apply bump/delta based on new evidence (bounded by 0..1).
  3. Set value_at_t0 = value_now_after_bump, t0 = now.
  4. Precompute next_review_at = when confidence(t) will cross threshold_low.

Pseudocode for step 4:

double DaysUntilThreshold(double valueAtT0, double threshold, double halfLifeDays)
{
    if (valueAtT0 <= threshold) return 0;

    // threshold = valueAtT0 * 0.5^(Δ/halfLife)
    // Δ = halfLife * log(threshold/valueAtT0) / log(0.5)
    return halfLifeDays * Math.Log(threshold / valueAtT0) / Math.Log(0.5);
}

Then:

var days = DaysUntilThreshold(valueAtT0, thresholdLow, halfLifeDays);
nextReviewAt = now.AddDays(days);

Important: this gives you a cheap query to build the queue:

SELECT * FROM UnknownsRegistry
WHERE state = 'unknown'
  AND next_review_at <= now();

No cron-based bulk recomputation necessary.

1.4 Events that bump confidence

Any new evidence should “refresh” knowledge and adjust confidence:

Examples:

  • Runtime traces showing the vulnerable function never called in a hot path → bump reachability confidence up moderately (e.g. +0.2, capped at 0.9).

  • Symbolic or fuzzing probe explicitly drives execution into the vulnerable code → flip state = known_affected, set confidence close to 1.0 with a longer half-life.

  • Vendor VEX: NOT AFFECTED → flip state = known_not_affected, long half-life (60–90 days), high confidence.

  • New major release, infra changes, or new internet exposure → degrade confidence (e.g. −0.3) because the architecture changed.

Implement this as a simple rules table:

{
  "on_evidence": [
    {
      "when": { "kind": "runtime_trace", "result": "no_calls_observed" },
      "dimension": "reachability",
      "delta_confidence": +0.2,
      "half_life_days": 14
    },
    {
      "when": { "kind": "runtime_trace", "result": "calls_into_vuln" },
      "dimension": "reachability",
      "set_state": "known_affected",
      "set_confidence": 0.95,
      "half_life_days": 21
    },
    {
      "when": { "kind": "vendor_vex", "result": "not_affected" },
      "set_state": "known_not_affected",
      "set_confidence": 0.98,
      "half_life_days": 60
    }
  ]
}

1.5 UI for decaying confidence

On the Unknown Detail page, you can show:

  • Confidence chip:

    • “Knowledge freshness: 0.28 (stale)” with a color gradient.
  • Decay sparkline: small chart showing confidence over the last 30 days.

  • Next review: “Next review recommended by Dec 2, 2025 (in 3 days)”

  • Evidence stack: timeline of evidence events with icons (static scan, runtime, vendor, etc.).

  • Actions area: “Refresh now → Trigger runtime probe / request VEX / open Jira”.

All of that makes the heuristic feel concrete and gives engineers a mental model: “this is decaying; here's when we revisit; here's how to add evidence.”


2) Triage Queue for High-Impact Unknowns: Making It Useful

The goal: reduce an ocean of unknowns to a small, actionable queue that:

  • Is ranked by risk, not noise
  • Has clear owners and due dates
  • Plugs cleanly into teams' existing planning

2.1 Impact scoring, more formally

Define a normalized impact score I between 0 and 1:

I = w_env * EnvExposure
  + w_data * DataSensitivity
  + w_prevalence * Prevalence
  + w_sla * SlaCriticality
  + w_cvss * CvssSeverity

Where each factor is also 0–1:

  • EnvExposure:

    • prod + internet_exposed → 1.0
    • prod + internal only → 0.7
    • non-prod → 0.3
  • DataSensitivity:

    • none → 0.0, internal → 0.3, pii → 0.7, financial/health → 1.0
  • Prevalence:

    • fraction of services/assets affected (0..1)
  • SlaCriticality:

    • bronze → 0.3, silver → 0.6, gold → 1.0
  • CvssSeverity:

    • use CVSS normalized to 0..1 if you have it, otherwise approximate from “critical/high/med/low”.

Weights w_* configurable, e.g.:

w_env = 0.3
w_data = 0.25
w_prevalence = 0.15
w_sla = 0.15
w_cvss = 0.15

These can live in a tenant-level config.
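
A worked example with those weights (Python sketch; the factor values follow the mapping above, and the sample inputs are made up):

```python
WEIGHTS = {"env": 0.3, "data": 0.25, "prevalence": 0.15, "sla": 0.15, "cvss": 0.15}

def impact(env_exposure, data_sensitivity, prevalence, sla_criticality, cvss_norm):
    # Weighted sum of normalized 0..1 factors; the weights sum to 1.0,
    # so the result is also in 0..1.
    return (WEIGHTS["env"] * env_exposure
            + WEIGHTS["data"] * data_sensitivity
            + WEIGHTS["prevalence"] * prevalence
            + WEIGHTS["sla"] * sla_criticality
            + WEIGHTS["cvss"] * cvss_norm)

# prod + internet-exposed (1.0), PII (0.7), 62% prevalence, gold SLA, CVSS 9.8/10:
print(round(impact(1.0, 0.7, 0.62, 1.0, 0.98), 3))  # 0.865
```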

2.2 Triage score

You already had the core idea:

triage_score = Impact * (1 - ConfidenceNow)

You can enrich this slightly with recency:

RecencyBoost = min(1.2, 1.0 + DaysSinceCreated / 30 * 0.2)
triage_score = Impact * (1 - ConfidenceNow) * RecencyBoost

So very old unknowns with low confidence get a slight bump to avoid being buried forever.
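
Putting the two formulas together (Python sketch; the 30-day ramp and 1.2 cap come straight from the formula above):

```python
def recency_boost(days_since_created: float) -> float:
    # Ramps linearly from 1.0 up to a 1.2 cap over 30 days of queue age.
    return min(1.2, 1.0 + days_since_created / 30.0 * 0.2)

def triage_score(impact, confidence_now, days_since_created):
    return impact * (1.0 - confidence_now) * recency_boost(days_since_created)

print(round(recency_boost(15), 2))            # 1.1: halfway up the ramp
print(round(recency_boost(90), 2))            # 1.2: capped, never runaway
print(round(triage_score(0.9, 0.28, 90), 3))  # 0.778: old + risky + unsure
```

Capping the boost keeps recency a tiebreaker rather than letting ancient low-impact items crowd out genuinely risky ones.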

2.3 Queue item lifecycle

Represent queue items as a simple workflow:

{
  "queue_item_id": "TRIAGE:unknown:…",
  "unknown_id": "URN:unknown:…",
  "triage_score": 0.81,
  "status": "open",          // open | in_progress | blocked | resolved | wont_fix
  "reason_blocked": null,
  "owner_team": "team-checkout",
  "assigned_to": "alice",
  "created_at": "2025-11-29T12:05:00Z",
  "due_by": "2025-12-02T17:00:00Z",

  "required_outcome": "add_evidence_or_verdict", // tasks that actually change state
  "suggested_actions": [
    { "type": "collect_runtime_trace", "cost": "low" },
    { "type": "symbolic_slice_probe", "cost": "medium" },
    { "type": "vendor_VEX_request", "cost": "low" }
  ],

  "audit": [
    {
      "at": "2025-11-29T12:05:00Z",
      "who": "system",
      "what": "enqueued: confidence below threshold_low; I=0.9, C=0.21"
    }
  ]
}

Rules:

  • A queue item is (re)created automatically when the unknown's next_review_at <= now and impact is above a minimum threshold.

  • When an engineer adds evidence or changes state on the underlying unknown, the system:

    • Recomputes confidence, impact, triage_score
    • Closes the queue item if confidence_now > threshold_high or state != unknown
  • You can allow reopen if it decays again later.
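
The enqueue/close rules are easy to express as two predicates (Python sketch; the 0.4 impact floor and 0.75 threshold_high mirror the example config values used elsewhere in these notes):

```python
from datetime import datetime, timezone

def should_enqueue(state, next_review_at, impact, now, min_impact=0.4):
    # Enqueue rule: a still-unknown item whose review is due and whose
    # impact clears the bar.
    return state == "unknown" and next_review_at <= now and impact >= min_impact

def should_close(state, confidence_now, threshold_high=0.75):
    # Close rule: a verdict was reached, or knowledge is fresh again.
    return state != "unknown" or confidence_now > threshold_high

now = datetime(2025, 12, 7, tzinfo=timezone.utc)
due = datetime(2025, 12, 6, 12, 0, tzinfo=timezone.utc)
print(should_enqueue("unknown", due, 0.9, now))  # True: overdue and high impact
print(should_close("unknown", 0.8))              # True: confidence back above 0.75
```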

2.4 Queue UI & ops hooks

In the UI, the “High-Impact Unknowns” view shows:

Columns:

  • Unknown (vuln + subject)
  • State (always “unknown” here, but future-proof)
  • Impact badge (Low/Med/High/Critical)
  • Confidence chip
  • Triage score (sortable)
  • Owner team
  • Due by
  • Quick actions

Interactions:

  • Default filter: impact >= High AND env = prod
  • Per-team view: filter owner_team = “team-X”
  • Bulk ops: “Assign top 10 to me”, “Open Jira for selected” etc.

Ops hooks:

  • Daily digest to each team: “You have 5 high-impact unknowns due this week.”
  • Planning export: per sprint, each team looks at “Top N unknowns by triage_score” and picks some into the sprint.
  • SLO integration: if a team's “unknown budget” (see below) is overrun, they must schedule unknown work.

2.5 Example: one unknown from signal to closure

  1. New CVE hits; SBOM says checkout-api uses affected library.

    • Unknown created with:

      • Impact ≈ 0.9 (prod, internet, PII, critical CVE)
      • Confidence = 0.4 (all we know is “it exists”).
    • triage_score ≈ 0.9 * (1 - 0.4) = 0.54 → high enough to enqueue.

  2. Engineer collects runtime trace, sees no calls to vulnerable path under normal traffic.

    • Evidence added, confidence bumped to 0.75, half-life 14 days.
    • Queue item auto-resolves if your threshold_high is 0.7.
  3. Two months later, architecture changes, service gets a new public endpoint.

    • Deployment event triggers an automatic “degrade confidence” rule (−0.2), sets a new t0 and a shorter half-life.
    • next_review_at moves closer; the unknown re-enters the queue later.

This gives you continuously updating risk without manual spreadsheets.
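
The arithmetic in that walkthrough checks out. A quick Python verification (the 60-day window is an assumption standing in for “two months later”):

```python
def decayed(value_at_t0, days, half_life_days):
    return value_at_t0 * 0.5 ** (days / half_life_days)

# Step 1: enqueue check for the fresh unknown.
impact, confidence = 0.9, 0.4
print(round(impact * (1 - confidence), 2))  # 0.54, above the enqueue bar

# Step 3 context: even without the architecture change, the 0.75 confidence
# set in step 2 (14-day half-life) would have decayed over 60 quiet days:
print(round(decayed(0.75, 60, 14), 2))      # ~0.04, far below threshold_low
```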


3) Unknown Budget & Guardrails (Optional but Powerful)

To connect this to leadership/SRE conversations, define an “unknown budget” per service/team:

A target maximum “risk mass” of unknowns we're willing to tolerate.

3.1 Perunknown “risk units”

For each unknown, define:

risk_units = Impact * (1 - ConfidenceNow)

(It's literally the triage score, but aggregated differently.)

Per team or service, the risk mass currently carried:

unknown_risk_mass = sum(risk_units for that team/service)

You can then set guardrails, e.g.:

  • Gold-tier service: budget ≤ 5.0
  • Silver: ≤ 15.0
  • Bronze: ≤ 30.0
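
Aggregation is then a straightforward sum (Python sketch; the service's unknowns and their numbers are made up):

```python
BUDGETS = {"gold": 5.0, "silver": 15.0, "bronze": 30.0}

# Hypothetical unknowns for one service: (impact, confidence_now) pairs.
unknowns = [(0.9, 0.2), (0.7, 0.5), (0.8, 0.1)]

def risk_mass(items):
    # Same per-item formula as the triage score, summed instead of ranked.
    return sum(impact * (1 - conf) for impact, conf in items)

used = risk_mass(unknowns)
print(round(used, 2))          # 1.79 risk units currently carried
print(used > BUDGETS["gold"])  # False: still inside a gold-tier budget
```

Note that gathering evidence (raising confidence) shrinks the risk mass just as effectively as resolving unknowns outright, which is exactly the behavior you want to reward.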

3.2 Guardrail behaviors

If a team exceeds its budget:

  • Show warnings in the StellaOps UI on the service details page.

  • Add a banner in the high-impact queue: “Unknown budget exceeded by 3.2 units.”

  • Optional: feed into deployment checks:

    • Above 2× budget → require security approval before prod deploy.
    • Above 1× budget → must plan unknown work in next sprint.

This ties the heuristics to behavior change without being draconian.


4) Implementation Sketch (API & Code)

4.1 C# model sketch

public enum UnknownState { Unknown, KnownAffected, KnownNotAffected, Ignored }

public sealed class UnknownRecord
{
    public string Id { get; set; } = default!;

    public string SubjectType { get; set; } = default!; // "package", "service", ...
    public string? Purl { get; set; }
    public string? ServiceId { get; set; }
    public string Env { get; set; } = "prod";

    public string? VulnId { get; set; }  // CVE, GHSA, etc.
    public string Dimension { get; set; } = "reachability";
    public UnknownState State { get; set; } = UnknownState.Unknown;
    public string UnknownCause { get; set; } = "data_missing";

    // Confidence fields persisted
    public double ConfidenceValueAtT0 { get; set; }
    public DateTimeOffset ConfidenceT0 { get; set; }
    public double HalfLifeDays { get; set; }
    public double ThresholdLow { get; set; }
    public double ThresholdHigh { get; set; }
    public DateTimeOffset NextReviewAt { get; set; }

    // Impact factors
    public bool RuntimePresence { get; set; }
    public bool InternetExposed { get; set; }
    public string PrivilegeLevel { get; set; } = "low";
    public string DataSensitivity { get; set; } = "none";
    public double FleetPrevalence { get; set; }
    public string SlaTier { get; set; } = "bronze";

    // Ownership & audit
    public string OwnerTeam { get; set; } = default!;
    public DateTimeOffset CreatedAt { get; set; }
    public DateTimeOffset UpdatedAt { get; set; }
}

Helper to compute ConfidenceNow:

public static class ConfidenceCalculator
{
    public static double ComputeNow(UnknownRecord r, DateTimeOffset now)
    {
        var deltaDays = (now - r.ConfidenceT0).TotalDays;
        if (deltaDays <= 0) return Clamp01(r.ConfidenceValueAtT0);

        var factor = Math.Pow(0.5, deltaDays / r.HalfLifeDays);
        return Clamp01(r.ConfidenceValueAtT0 * factor);
    }

    public static (double valueAtT0, DateTimeOffset t0, DateTimeOffset nextReviewAt)
        ApplyEvidence(UnknownRecord r, double deltaConfidence, double? newHalfLifeDays, DateTimeOffset now)
    {
        var current = ComputeNow(r, now);
        var updated = Clamp01(current + deltaConfidence);
        var halfLife = newHalfLifeDays ?? r.HalfLifeDays;

        var daysToThreshold = DaysUntilThreshold(updated, r.ThresholdLow, halfLife);
        var nextReview = now.AddDays(daysToThreshold);

        return (updated, now, nextReview);
    }

    private static double DaysUntilThreshold(double valueAtT0, double threshold, double halfLifeDays)
    {
        if (valueAtT0 <= threshold) return 0;
        return halfLifeDays * Math.Log(threshold / valueAtT0) / Math.Log(0.5);
    }

    private static double Clamp01(double v) => v < 0 ? 0 : (v > 1 ? 1 : v);
}

Queue scoring:

public sealed class TriageConfig
{
    public double WEnv { get; set; } = 0.3;
    public double WData { get; set; } = 0.25;
    public double WPrev { get; set; } = 0.15;
    public double WSla { get; set; } = 0.15;
    public double WCvss { get; set; } = 0.15;

    public double MinImpactForQueue { get; set; } = 0.4;
    public double MaxRecencyBoost { get; set; } = 1.2;
}

public static class TriageScorer
{
    public static double ComputeImpact(UnknownRecord r, double cvssNorm, TriageConfig cfg)
    {
        var env = r.Env == "prod"
            ? (r.InternetExposed ? 1.0 : 0.7)
            : 0.3;

        var data = r.DataSensitivity switch
        {
            "none" => 0.0,
            "internal" => 0.3,
            "pii" => 0.7,
            "financial" => 1.0,
            _ => 0.3
        };

        var sla = r.SlaTier switch
        {
            "bronze" => 0.3,
            "silver" => 0.6,
            "gold" => 1.0,
            _ => 0.3
        };

        var prev = Math.Max(0, Math.Min(1, r.FleetPrevalence));

        return cfg.WEnv * env
             + cfg.WData * data
             + cfg.WPrev * prev
             + cfg.WSla * sla
             + cfg.WCvss * cvssNorm;
    }

    public static double ComputeTriageScore(
        UnknownRecord r,
        double cvssNorm,
        DateTimeOffset now,
        DateTimeOffset createdAt,
        TriageConfig cfg)
    {
        var impact = ComputeImpact(r, cvssNorm, cfg);
        var confidence = ConfidenceCalculator.ComputeNow(r, now);

        if (impact < cfg.MinImpactForQueue) return 0;

        var ageDays = (now - createdAt).TotalDays;
        var recencyBoost = Math.Min(cfg.MaxRecencyBoost, 1.0 + (ageDays / 30.0) * 0.2);

        return impact * (1 - confidence) * recencyBoost;
    }
}

This is all straightforward to wire into your existing C#/Angular stack.


5) How This Feeds Planning & Metrics

Once this is live, you get a bunch of useful knobs for product and leadership:

5.1 Per-team dashboards

For each team/service, show:

  • Unknown count (total & by dimension)

  • Unknown risk budget (current vs target)

  • Distribution of confidence (e.g., histogram buckets: 0–0.25, 0.25–0.5, etc.)

  • Average age of unknowns

  • Queue throughput:

    • Number of unknowns investigated this sprint

    • Average time from enqueued → evidence added / verdict

These tell you if teams are actually burning down epistemic risk or just tagging things.

5.2 Process metrics to tune heuristics

Every quarter, look at:

  • How many unknowns re-enter the queue because decay hits the threshold?

  • For unknowns that later become known-affected incidents, what were their triage scores?

    • If many “incident-causing unknowns” had low triage scores, adjust weights.
  • Are teams routinely ignoring certain impact factors (e.g., low data sensitivity)?

    • Maybe reduce weight or adjust scoring.

Because the heuristics are explicit and simple, you can iterate: tweak half-lives and weights, and observe the effect on queue size and incident correlation.


If you'd like, as a next step I can sketch:

  • A REST API surface (GET /unknowns, GET /unknowns/triage, POST /unknowns/{id}/evidence)
  • Or specific Angular components for the Confidence Decay Card and High-Impact Unknowns table, wired to these models.