git.stella-ops.org/docs/product-advisories/30-Nov-2025 - Unknowns Decay & Triage Heuristics.md
StellaOps Bot 25254e3831 news advisories
2025-11-30 21:00:38 +02:00


Here's a compact, low-friction way to tame “unknowns” in Stella Ops without boiling the ocean: two heuristics you can prototype this week. Each yields one clear artifact you can show in the UI and wire into the next planning cycle.


1) Decaying Confidence (Half-Life) for Unknown Reachability

Idea: every “unknown” reachability/verdict starts with a confidence score that decays over time (exponential half-life). If no new evidence arrives, confidence naturally drifts toward “needs refresh,” preventing stale assumptions from lingering.

Why it helps (plain English): unknowns don't stay “probably fine” forever. This makes them self-expiring, so triage resurfaces them at the right time instead of only when something breaks.

Minimal data model (UnknownsRegistry):

{
  "unknown_id": "URN:unknown:pkg/npm/lodash:4.17.21:CVE-2021-23337:reachability",
  "subject_ref": { "type": "package", "purl": "pkg:npm/lodash@4.17.21" },
  "vuln_id": "CVE-2021-23337",
  "dimension": "reachability", 
  "confidence": { "value": 0.78, "method": "half_life", "t0": "2025-11-29T12:00:00Z", "half_life_days": 14 },
  "evidence": [{ "kind": "static_scan_hint", "hash": "…" }],
  "next_review_at": "2025-12-06T12:00:00Z",
  "status": "unknown"
}

Update rule (per tick or on read):

  • confidence_now = confidence_t0 * 0.5^(Δdays / half_life_days)
  • When confidence_now < threshold_low → flag for human review (see Queue below).
  • When fresh evidence arrives → reset t0, optionally raise confidence.
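
The update rule is a few lines of code. Here it is as a quick Python sketch (field names mirror the registry object above; the C# models come later in these notes):

```python
from datetime import datetime, timedelta, timezone

def confidence_now(value_at_t0: float, t0: datetime, now: datetime,
                   half_life_days: float) -> float:
    """Exponential half-life decay: confidence halves every half_life_days."""
    delta_days = (now - t0).total_seconds() / 86400.0
    if delta_days <= 0:
        return value_at_t0
    return value_at_t0 * 0.5 ** (delta_days / half_life_days)

t0 = datetime(2025, 11, 29, 12, 0, tzinfo=timezone.utc)
one_half_life = confidence_now(0.78, t0, t0 + timedelta(days=14), half_life_days=14)
print(round(one_half_life, 2))  # 0.39, exactly half of the starting 0.78
```

Computing on read like this means no background job has to touch every record on every tick.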

One UI artifact: A “Confidence Decay Card” on each unknown, showing:

  • sparkline of decay over time,
  • next review ETA,
  • button “Refresh with latest evidence” (re-run reachability probes).

One ops hook (planning): Export a daily CSV/JSON of unknowns whose confidence crossed threshold to feed the triage board.


2) Human-Review Queue for High-Impact Unknowns

Idea: only a subset of unknowns deserves people's time. Auto-rank them by potential blast radius + decayed confidence.

Triage score (simple, transparent): triage_score = impact_score * (1 - confidence_now)

  • impact_score (0–1): runtime exposure, privilege, prevalence, SLA tier.
  • confidence_now: from heuristic #1.
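
In code that's a one-liner (Python sketch; the two inputs are the scores described above):

```python
def triage_score(impact_score: float, confidence_now: float) -> float:
    # High impact we know little about scores highest; high impact we are
    # already confident about (or low impact) sinks toward zero.
    return impact_score * (1.0 - confidence_now)

print(round(triage_score(0.9, 0.28), 3))   # 0.648: high impact, stale knowledge
print(round(triage_score(0.9, 0.95), 3))   # 0.045: same impact, fresh evidence
```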

Queue item schema (artifact to display & act on):

{
  "queue_item_id": "TRIAGE:unknown:…",
  "unknown_id": "URN:unknown:…",
  "triage_score": 0.74,
  "impact_factors": { "runtime_presence": true, "privilege": "high", "fleet_prevalence": 0.62, "sla_tier": "gold" },
  "confidence_now": 0.28,
  "assigned_to": "unassigned",
  "due_by": "2025-12-02T17:00:00Z",
  "actions": [
    { "type": "collect_runtime_trace", "cost": "low" },
    { "type": "symbolic_slice_probe", "cost": "medium" },
    { "type": "vendor_VEX_request", "cost": "low" }
  ],
  "audit": [{ "at": "2025-11-29T12:05:00Z", "who": "system", "what": "enqueued: threshold_low crossed" }]
}

One UI artifact: A “High-Impact Unknowns” queue view sorted by triage_score, showing:

  • pill tags for impact factors,
  • inline actions (Assign, Probe, Add Evidence, Mark Resolved),
  • SLO badge showing due_by.

One ops hook (planning): Pull top N items by triage_score at sprint start. Each resolved item must attach new evidence or a documented “Not Affected” rationale so the next decay cycle begins from stronger assumptions.


Wiring into Stella Ops quickly (dev notes)

  • Storage: add UnknownsRegistry collection/table; compute decay on read to avoid cron churn.
  • Thresholds: start with half_life_days = 14, threshold_low = 0.35; tune later.
  • Impact scoring: begin with simple weights in config (runtime_presence=0.4, privilege=0.3, prevalence=0.2, SLA=0.1).
  • APIs:
    • GET /unknowns?stale=true (confidence < threshold)
    • POST /triage/enqueue (system-owned)
    • POST /unknowns/{id}/evidence (resets t0, recomputes next_review_at)
  • Events: emit UnknownConfidenceCrossedLow → TriageItemCreated.

What you'll have after a 1–2 day spike

  • A decay card on each unknown + a simple, sortable triage queue.
  • A daily export artifact to drive planning.
  • A clear, auditable path from “we're unsure” → “we gathered evidence” → “we're confident (for now).”

If you want, I can generate:

  • the C# POCOs/EF mappings,
  • a minimal Controller set,
  • Angular components (card + queue table),
  • and seed data + an evaluator that computes confidence_now and triage_score from config.


Cool, let's turn those two sketchy heuristics into something you can actually ship and iterate on in StellaOps.

I'll go deeper on:

  1. Decaying confidence as a proper first-class concept
  2. The triage queue and workflow around “unknowns”
  3. A lightweight “unknown budget” / guardrail layer
  4. Concrete implementation sketches (data models, formulas, pseudocode)
  5. How this feeds planning & metrics

1) Decaying Confidence: From Idea to Mechanism

1.1 What “confidence” actually means

To keep semantics crisp, define confidence as:

“How fresh and well-supported our knowledge is about this vulnerability in this subject along this dimension (reachability, exploitability, etc.).”

  • 1.0 = Recently assessed with strong evidence
  • 0.0 = We basically haven't looked / our info is ancient

This works for unknown, known-affected, and known-not-affected; decay is about knowledge freshness, not the verdict itself.

For unknowns, confidence will usually be low and decaying → that's what pushes them into the queue.

1.2 Data model v2 (UnknownsRegistry)

Extend the earlier object a bit:

{
  "unknown_id": "URN:unknown:pkg/npm/lodash:4.17.21:CVE-2021-23337:reachability",

  "subject_ref": {
    "type": "package",        // package | service | container | host | cluster
    "purl": "pkg:npm/lodash@4.17.21",
    "service_id": "checkout-api",
    "env": "prod"             // prod | staging | dev
  },

  "vuln_id": "CVE-2021-23337",
  "dimension": "reachability", // reachability | exploitability | fix_validity | other

  "state": "unknown",          // unknown | known_affected | known_not_affected | ignored
  "unknown_cause": "tooling_gap", // data_missing | vendor_silent | tooling_gap | conflicting_evidence

  "confidence": {
    "value": 0.62,             // computed on read
    "method": "half_life",
    "t0": "2025-11-29T12:00:00Z",
    "value_at_t0": 0.9,
    "half_life_days": 14,
    "threshold_low": 0.35,
    "threshold_high": 0.75
  },

  "impact": {
    "runtime_presence": true,
    "internet_exposed": true,
    "privilege_level": "high",
    "data_sensitivity": "pii", // none | internal | pii | financial
    "fleet_prevalence": 0.62,  // fraction of services using this
    "sla_tier": "gold"         // bronze | silver | gold
  },

  "next_review_at": "2025-12-06T12:00:00Z",
  "owner": "team-checkout",
  "created_at": "2025-11-29T12:00:00Z",
  "updated_at": "2025-11-29T12:00:00Z",

  "evidence": [
    {
      "kind": "static_scan_hint",
      "summary": "No direct call from public handler to vulnerable sink found.",
      "created_at": "2025-11-29T12:00:00Z",
      "link": "https://stellaops/ui/evidence/123"
    }
  ]
}

Key points:

  • unknown_cause helps you slice unknowns by “why do we not know?” (lack of data vs tooling vs vendor).

  • impact is embedded here so triage scoring can be local without joining a ton of tables.

  • half_life_days can be per dimension & per environment, e.g.:

    • prod + reachability → 7 days
    • staging + fix_validity → 30 days
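
A minimal lookup table makes that concrete. This is a hypothetical config sketch; the keys and defaults are illustrative, not shipped Stella Ops settings:

```python
# Hypothetical (env, dimension) -> half-life overrides, with a global default.
HALF_LIFE_OVERRIDES = {
    ("prod", "reachability"): 7,
    ("staging", "fix_validity"): 30,
}
DEFAULT_HALF_LIFE_DAYS = 14

def half_life_for(env: str, dimension: str) -> int:
    return HALF_LIFE_OVERRIDES.get((env, dimension), DEFAULT_HALF_LIFE_DAYS)

print(half_life_for("prod", "reachability"))   # 7: prod knowledge goes stale fast
print(half_life_for("dev", "exploitability"))  # 14: falls back to the default
```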

1.3 Decay math & scheduling

Use exponential decay:

confidence(t) = value_at_t0 * 0.5^(Δdays / half_life_days)

Where:

  • Δdays = (now - t0) in days

On write (when you update or create the record), you:

  1. Compute value_now from any previous state.
  2. Apply bump/delta based on new evidence (bounded by 0..1).
  3. Set value_at_t0 = value_now_after_bump, t0 = now.
  4. Precompute next_review_at = when confidence(t) will cross threshold_low.

Pseudocode for step 4:

double DaysUntilThreshold(double valueAtT0, double threshold, double halfLifeDays)
{
    if (valueAtT0 <= threshold) return 0;

    // threshold = valueAtT0 * 0.5^(Δ/halfLife)
    // Δ = halfLife * log(threshold/valueAtT0) / log(0.5)
    return halfLifeDays * Math.Log(threshold / valueAtT0) / Math.Log(0.5);
}

Then:

var days = DaysUntilThreshold(valueAtT0, thresholdLow, halfLifeDays);
nextReviewAt = now.AddDays(days);

Important: this gives you a cheap query to build the queue:

SELECT * FROM UnknownsRegistry
WHERE state = 'unknown'
  AND next_review_at <= now();

No cron-based bulk recomputation necessary.

1.4 Events that bump confidence

Any new evidence should “refresh” knowledge and adjust confidence:

Examples:

  • Runtime traces showing the vulnerable function never called in a hot path → bump reachability confidence up moderately (e.g. +0.2, capped at 0.9).

  • Symbolic or fuzzing probe explicitly drives execution into the vulnerable code → flip state = known_affected, set confidence close to 1.0 with a longer half-life.

  • Vendor VEX: NOT AFFECTED → flip state = known_not_affected, long half-life (60–90 days), high confidence.

  • New major release, infra changes, or new internet exposure → degrade confidence (e.g. −0.3) because the architecture changed.

Implement this as a simple rules table:

{
  "on_evidence": [
    {
      "when": { "kind": "runtime_trace", "result": "no_calls_observed" },
      "dimension": "reachability",
      "delta_confidence": +0.2,
      "half_life_days": 14
    },
    {
      "when": { "kind": "runtime_trace", "result": "calls_into_vuln" },
      "dimension": "reachability",
      "set_state": "known_affected",
      "set_confidence": 0.95,
      "half_life_days": 21
    },
    {
      "when": { "kind": "vendor_vex", "result": "not_affected" },
      "set_state": "known_not_affected",
      "set_confidence": 0.98,
      "half_life_days": 60
    }
  ]
}

1.5 UI for decaying confidence

On the Unknown Detail page, you can show:

  • Confidence chip:

    • “Knowledge freshness: 0.28 (stale)” with a color gradient.
  • Decay sparkline: small chart showing confidence over the last 30 days.

  • Next review: “Next review recommended by Dec 2, 2025 (in 3 days)”

  • Evidence stack: timeline of evidence events with icons (static scan, runtime, vendor, etc.).

  • Actions area: “Refresh now → Trigger runtime probe / request VEX / open Jira”.

All of that makes the heuristic feel concrete and gives engineers a mental model: “this is decaying; here's when we revisit; here's how to add evidence.”


2) Triage Queue for High-Impact Unknowns: Making It Useful

The goal: reduce an ocean of unknowns to a small, actionable queue that:

  • Is ranked by risk, not noise
  • Has clear owners and due dates
  • Plugs cleanly into teams' existing planning

2.1 Impact scoring, more formally

Define a normalized impact score I between 0 and 1:

I = w_env * EnvExposure
  + w_data * DataSensitivity
  + w_prevalence * Prevalence
  + w_sla * SlaCriticality
  + w_cvss * CvssSeverity

Where each factor is also 0–1:

  • EnvExposure:

    • prod + internet_exposed → 1.0
    • prod + internal only → 0.7
    • non-prod → 0.3
  • DataSensitivity:

    • none → 0.0, internal → 0.3, pii → 0.7, financial/health → 1.0
  • Prevalence:

    • fraction of services/assets affected (0..1)
  • SlaCriticality:

    • bronze → 0.3, silver → 0.6, gold → 1.0
  • CvssSeverity:

    • use CVSS normalized to 0..1 if you have it, otherwise approximate from “critical/high/med/low”.

Weights w_* configurable, e.g.:

w_env = 0.3
w_data = 0.25
w_prevalence = 0.15
w_sla = 0.15
w_cvss = 0.15

These can live in a tenant-level config.
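
A worked example with those weights (Python sketch; the factor values follow the mapping above, and the sample inputs are made up):

```python
WEIGHTS = {"env": 0.3, "data": 0.25, "prevalence": 0.15, "sla": 0.15, "cvss": 0.15}

def impact(env_exposure, data_sensitivity, prevalence, sla_criticality, cvss_norm):
    # Weighted sum of normalized 0..1 factors; the weights sum to 1.0,
    # so the result is also in 0..1.
    return (WEIGHTS["env"] * env_exposure
            + WEIGHTS["data"] * data_sensitivity
            + WEIGHTS["prevalence"] * prevalence
            + WEIGHTS["sla"] * sla_criticality
            + WEIGHTS["cvss"] * cvss_norm)

# prod + internet-exposed (1.0), PII (0.7), 62% prevalence, gold SLA, CVSS 9.8/10:
print(round(impact(1.0, 0.7, 0.62, 1.0, 0.98), 3))  # 0.865
```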

2.2 Triage score

You already had the core idea:

triage_score = Impact * (1 - ConfidenceNow)

You can enrich this slightly with recency:

RecencyBoost = min(1.2, 1.0 + DaysSinceCreated / 30 * 0.2)
triage_score = Impact * (1 - ConfidenceNow) * RecencyBoost

So very old unknowns with low confidence get a slight bump to avoid being buried forever.
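
Putting the two formulas together (Python sketch; the 30-day ramp and 1.2 cap come straight from the formula above):

```python
def recency_boost(days_since_created: float) -> float:
    # Ramps linearly from 1.0 up to a 1.2 cap over 30 days of queue age.
    return min(1.2, 1.0 + days_since_created / 30.0 * 0.2)

def triage_score(impact, confidence_now, days_since_created):
    return impact * (1.0 - confidence_now) * recency_boost(days_since_created)

print(round(recency_boost(15), 2))            # 1.1: halfway up the ramp
print(round(recency_boost(90), 2))            # 1.2: capped, never runaway
print(round(triage_score(0.9, 0.28, 90), 3))  # 0.778: old + risky + unsure
```

Capping the boost keeps recency a tiebreaker rather than letting ancient low-impact items crowd out genuinely risky ones.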

2.3 Queue item lifecycle

Represent queue items as a simple workflow:

{
  "queue_item_id": "TRIAGE:unknown:…",
  "unknown_id": "URN:unknown:…",
  "triage_score": 0.81,
  "status": "open",          // open | in_progress | blocked | resolved | wont_fix
  "reason_blocked": null,
  "owner_team": "team-checkout",
  "assigned_to": "alice",
  "created_at": "2025-11-29T12:05:00Z",
  "due_by": "2025-12-02T17:00:00Z",

  "required_outcome": "add_evidence_or_verdict", // tasks that actually change state
  "suggested_actions": [
    { "type": "collect_runtime_trace", "cost": "low" },
    { "type": "symbolic_slice_probe", "cost": "medium" },
    { "type": "vendor_VEX_request", "cost": "low" }
  ],

  "audit": [
    {
      "at": "2025-11-29T12:05:00Z",
      "who": "system",
      "what": "enqueued: confidence below threshold_low; I=0.9, C=0.21"
    }
  ]
}

Rules:

  • A queue item is (re)created automatically when the unknown's next_review_at <= now and impact is above a minimum threshold.

  • When an engineer adds evidence or changes state on the underlying unknown, the system:

    • Recomputes confidence, impact, triage_score
    • Closes the queue item if confidence_now > threshold_high or state != unknown
  • You can allow reopen if it decays again later.
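
The enqueue/close rules are easy to express as two predicates (Python sketch; the 0.4 impact floor and 0.75 threshold_high mirror the example config values used elsewhere in these notes):

```python
from datetime import datetime, timezone

def should_enqueue(state, next_review_at, impact, now, min_impact=0.4):
    # Enqueue rule: a still-unknown item whose review is due and whose
    # impact clears the bar.
    return state == "unknown" and next_review_at <= now and impact >= min_impact

def should_close(state, confidence_now, threshold_high=0.75):
    # Close rule: a verdict was reached, or knowledge is fresh again.
    return state != "unknown" or confidence_now > threshold_high

now = datetime(2025, 12, 7, tzinfo=timezone.utc)
due = datetime(2025, 12, 6, 12, 0, tzinfo=timezone.utc)
print(should_enqueue("unknown", due, 0.9, now))  # True: overdue and high impact
print(should_close("unknown", 0.8))              # True: confidence back above 0.75
```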

2.4 Queue UI & ops hooks

In the UI, the “High-Impact Unknowns” view shows:

Columns:

  • Unknown (vuln + subject)
  • State (always “unknown” here, but future-proof)
  • Impact badge (Low/Med/High/Critical)
  • Confidence chip
  • Triage score (sortable)
  • Owner team
  • Due by
  • Quick actions

Interactions:

  • Default filter: impact >= High AND env = prod
  • Per-team view: filter owner_team = “team-X”
  • Bulk ops: “Assign top 10 to me”, “Open Jira for selected” etc.

Ops hooks:

  • Daily digest to each team: “You have 5 high-impact unknowns due this week.”
  • Planning export: per sprint, each team looks at “Top N unknowns by triage_score” and picks some into the sprint.
  • SLO integration: if a team's “unknown budget” (see below) is overrun, they must schedule unknown work.

2.5 Example: one unknown from signal to closure

  1. New CVE hits; SBOM says checkout-api uses affected library.

    • Unknown created with:

      • Impact ≈ 0.9 (prod, internet, PII, critical CVE)
      • Confidence = 0.4 (all we know is “it exists”).
    • triage_score ≈ 0.9 * (1 - 0.4) = 0.54 → high enough to enqueue.

  2. Engineer collects runtime trace, sees no calls to vulnerable path under normal traffic.

    • Evidence added, confidence bumped to 0.75, half-life 14 days.
    • Queue item auto-resolves if your threshold_high is 0.7.
  3. Two months later, architecture changes, service gets a new public endpoint.

    • Deployment event triggers an automatic “degrade confidence” rule (−0.2), sets a new t0 and a shorter half-life.
    • next_review_at moves closer; the unknown re-enters the queue later.

This gives you continuously updating risk without manual spreadsheets.
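
The arithmetic in that walkthrough checks out. A quick Python verification (the 60-day window is an assumption standing in for “two months later”):

```python
def decayed(value_at_t0, days, half_life_days):
    return value_at_t0 * 0.5 ** (days / half_life_days)

# Step 1: enqueue check for the fresh unknown.
impact, confidence = 0.9, 0.4
print(round(impact * (1 - confidence), 2))  # 0.54, above the enqueue bar

# Step 3 context: even without the architecture change, the 0.75 confidence
# set in step 2 (14-day half-life) would have decayed over 60 quiet days:
print(round(decayed(0.75, 60, 14), 2))      # ~0.04, far below threshold_low
```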


3) Unknown Budget & Guardrails (Optional but Powerful)

To connect this to leadership/SRE conversations, define an “unknown budget” per service/team:

A target maximum “risk mass” of unknowns we're willing to tolerate.

3.1 Perunknown “risk units”

For each unknown, define:

risk_units = Impact * (1 - ConfidenceNow)

(It's literally the triage score, but aggregated differently.)

Per team or service, the risk mass currently carried:

unknown_risk_mass = sum(risk_units for that team/service)

You can then set guardrails, e.g.:

  • Gold-tier service: budget ≤ 5.0
  • Silver: ≤ 15.0
  • Bronze: ≤ 30.0
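
Aggregation is then a straightforward sum (Python sketch; the service's unknowns and their numbers are made up):

```python
BUDGETS = {"gold": 5.0, "silver": 15.0, "bronze": 30.0}

# Hypothetical unknowns for one service: (impact, confidence_now) pairs.
unknowns = [(0.9, 0.2), (0.7, 0.5), (0.8, 0.1)]

def risk_mass(items):
    # Same per-item formula as the triage score, summed instead of ranked.
    return sum(impact * (1 - conf) for impact, conf in items)

used = risk_mass(unknowns)
print(round(used, 2))          # 1.79 risk units currently carried
print(used > BUDGETS["gold"])  # False: still inside a gold-tier budget
```

Note that gathering evidence (raising confidence) shrinks the risk mass just as effectively as resolving unknowns outright, which is exactly the behavior you want to reward.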

3.2 Guardrail behaviors

If a team exceeds its budget:

  • Show warnings in the StellaOps UI on the service details page.

  • Add a banner in the high-impact queue: “Unknown budget exceeded by 3.2 units.”

  • Optional: feed into deployment checks:

    • Above 2× budget → require security approval before prod deploy.
    • Above 1× budget → must plan unknown work in next sprint.

This ties the heuristics to behavior change without being draconian.


4) Implementation Sketch (API & Code)

4.1 C# model sketch

public enum UnknownState { Unknown, KnownAffected, KnownNotAffected, Ignored }

public sealed class UnknownRecord
{
    public string Id { get; set; } = default!;

    public string SubjectType { get; set; } = default!; // "package", "service", ...
    public string? Purl { get; set; }
    public string? ServiceId { get; set; }
    public string Env { get; set; } = "prod";

    public string? VulnId { get; set; }  // CVE, GHSA, etc.
    public string Dimension { get; set; } = "reachability";
    public UnknownState State { get; set; } = UnknownState.Unknown;
    public string UnknownCause { get; set; } = "data_missing";

    // Confidence fields persisted
    public double ConfidenceValueAtT0 { get; set; }
    public DateTimeOffset ConfidenceT0 { get; set; }
    public double HalfLifeDays { get; set; }
    public double ThresholdLow { get; set; }
    public double ThresholdHigh { get; set; }
    public DateTimeOffset NextReviewAt { get; set; }

    // Impact factors
    public bool RuntimePresence { get; set; }
    public bool InternetExposed { get; set; }
    public string PrivilegeLevel { get; set; } = "low";
    public string DataSensitivity { get; set; } = "none";
    public double FleetPrevalence { get; set; }
    public string SlaTier { get; set; } = "bronze";

    // Ownership & audit
    public string OwnerTeam { get; set; } = default!;
    public DateTimeOffset CreatedAt { get; set; }
    public DateTimeOffset UpdatedAt { get; set; }
}

Helper to compute ConfidenceNow:

public static class ConfidenceCalculator
{
    public static double ComputeNow(UnknownRecord r, DateTimeOffset now)
    {
        var deltaDays = (now - r.ConfidenceT0).TotalDays;
        if (deltaDays <= 0) return Clamp01(r.ConfidenceValueAtT0);

        var factor = Math.Pow(0.5, deltaDays / r.HalfLifeDays);
        return Clamp01(r.ConfidenceValueAtT0 * factor);
    }

    public static (double valueAtT0, DateTimeOffset t0, DateTimeOffset nextReviewAt)
        ApplyEvidence(UnknownRecord r, double deltaConfidence, double? newHalfLifeDays, DateTimeOffset now)
    {
        var current = ComputeNow(r, now);
        var updated = Clamp01(current + deltaConfidence);
        var halfLife = newHalfLifeDays ?? r.HalfLifeDays;

        var daysToThreshold = DaysUntilThreshold(updated, r.ThresholdLow, halfLife);
        var nextReview = now.AddDays(daysToThreshold);

        return (updated, now, nextReview);
    }

    private static double DaysUntilThreshold(double valueAtT0, double threshold, double halfLifeDays)
    {
        if (valueAtT0 <= threshold) return 0;
        return halfLifeDays * Math.Log(threshold / valueAtT0) / Math.Log(0.5);
    }

    private static double Clamp01(double v) => v < 0 ? 0 : (v > 1 ? 1 : v);
}

Queue scoring:

public sealed class TriageConfig
{
    public double WEnv { get; set; } = 0.3;
    public double WData { get; set; } = 0.25;
    public double WPrev { get; set; } = 0.15;
    public double WSla { get; set; } = 0.15;
    public double WCvss { get; set; } = 0.15;

    public double MinImpactForQueue { get; set; } = 0.4;
    public double MaxRecencyBoost { get; set; } = 1.2;
}

public static class TriageScorer
{
    public static double ComputeImpact(UnknownRecord r, double cvssNorm, TriageConfig cfg)
    {
        var env = r.Env == "prod"
            ? (r.InternetExposed ? 1.0 : 0.7)
            : 0.3;

        var data = r.DataSensitivity switch
        {
            "none" => 0.0,
            "internal" => 0.3,
            "pii" => 0.7,
            "financial" => 1.0,
            _ => 0.3
        };

        var sla = r.SlaTier switch
        {
            "bronze" => 0.3,
            "silver" => 0.6,
            "gold" => 1.0,
            _ => 0.3
        };

        var prev = Math.Max(0, Math.Min(1, r.FleetPrevalence));

        return cfg.WEnv * env
             + cfg.WData * data
             + cfg.WPrev * prev
             + cfg.WSla * sla
             + cfg.WCvss * cvssNorm;
    }

    public static double ComputeTriageScore(
        UnknownRecord r,
        double cvssNorm,
        DateTimeOffset now,
        DateTimeOffset createdAt,
        TriageConfig cfg)
    {
        var impact = ComputeImpact(r, cvssNorm, cfg);
        var confidence = ConfidenceCalculator.ComputeNow(r, now);

        if (impact < cfg.MinImpactForQueue) return 0;

        var ageDays = (now - createdAt).TotalDays;
        var recencyBoost = Math.Min(cfg.MaxRecencyBoost, 1.0 + (ageDays / 30.0) * 0.2);

        return impact * (1 - confidence) * recencyBoost;
    }
}

This is all straightforward to wire into your existing C#/Angular stack.


5) How This Feeds Planning & Metrics

Once this is live, you get a bunch of useful knobs for product and leadership:

5.1 Per-team dashboards

For each team/service, show:

  • Unknown count (total & by dimension)

  • Unknown risk budget (current vs target)

  • Distribution of confidence (e.g., histogram buckets: 0–0.25, 0.25–0.5, etc.)

  • Average age of unknowns

  • Queue throughput:

    • Number of unknowns investigated this sprint

    • Average time from enqueued → evidence added / verdict

These tell you if teams are actually burning down epistemic risk or just tagging things.

5.2 Process metrics to tune heuristics

Every quarter, look at:

  • How many unknowns re-enter the queue because decay hits the threshold?

  • For unknowns that later become known-affected incidents, what were their triage scores?

    • If many “incident-causing unknowns” had low triage scores, adjust weights.
  • Are teams routinely ignoring certain impact factors (e.g., low data sensitivity)?

    • Maybe reduce weight or adjust scoring.

Because the heuristics are explicit and simple, you can iterate: tweak half-lives and weights, and observe the effect on queue size and incident correlation.


If you'd like, as a next step I can sketch:

  • A REST API surface (GET /unknowns, GET /unknowns/triage, POST /unknowns/{id}/evidence)
  • Or specific Angular components for the Confidence Decay Card and High-Impact Unknowns table, wired to these models.