git.stella-ops.org/docs-archived/product/advisories/20260226 - Deterministic call‑stack analysis and resolver strategy.md


Here's a clean, first-time-friendly blueprint for a deterministic crash analyzer pipeline you can drop into StellaOps (or any CI/CD + observability stack).


What this thing does (in plain words)

It ingests a crash "evidence tile" (a signed, canonical JSON blob plus its hash), looks up symbols from your chosen stores (ELF/PDB/dSYM), unwinds the stack deterministically, and returns a stable, symbol-pinned call stack plus a replay manifest so you can reproduce the exact same result later, bit-for-bit.


The contract

Input (strict, deterministic)

  • signed_evidence_tile: Canonical JSON (JCS) with your payload (e.g., OS, arch, registers, fault addr, module list) and its sha256.

    • Canonicalization must follow RFC 8785 JCS to make the hash verifiable.
  • symbol_pointers:

    • ELF: debuginfod build-id URIs
    • Windows: PDB GUID+Age
    • Apple: dSYM UUID
  • unwind_context: register snapshot, preference flags (e.g., "prefer unwind tables over frame pointers"), OS/ABI hints.

  • deterministic_seed: single source of truth for any randomized tiebreakers or heuristics.

Output

  • call_stack: ordered vector of frames

    • addr, symbol_id, optional file:line, symbol_resolution_confidence, and resolver (which backend won).
  • replay_manifest: { seed, env_knobs, symbol_bundle_pointer } so you (or CI) can rerun the exact same resolution later.
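For implementers, the output contract above can be sketched as TypeScript types. Field names mirror the wire-format examples later in this document; which fields are optional is an assumption where the text leaves it open.

```typescript
// Sketch of the analyzer output contract; optionality is an assumption
// where this document does not pin it down.
interface Frame {
  addr: string;                          // program counter, hex string
  symbol_id: string;                     // stable, ideally content-addressed id
  file?: string;                         // optional source file
  line?: number;                         // optional source line
  symbol_resolution_confidence: number;  // 0..1
  resolver: string;                      // backend that won, e.g. "debuginfod"
}

interface ReplayManifest {
  seed: string;
  env_knobs: Record<string, unknown>;
  symbol_bundle_pointer: string;
}

interface AnalyzerResponse {
  call_stack: Frame[];
  replay_manifest: ReplayManifest;
}
```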


Resolver abstraction (so CI can fan out)

Define a tiny interface and run resolvers in parallel; record who succeeded:

type Platform = "linux" | "windows" | "apple";

interface ResolveResult {
  symbol_id: string;       // stable id in your store
  file?: string;
  line?: number;
  confidence: number;      // 0..1
  resolver: string;        // e.g., "debuginfod", "dia", "dsymutil"
}

function resolve(address: string, platform: Platform, bundle_hint?: string): ResolveResult | null;

Backends:

  • Linux/ELF: debuginfod (by build-id), DWARF/unwind tables.
  • Windows: DIA/PDB (by GUID+Age).
  • Apple: dSYM/DWARF (by UUID), atos/llvm-symbolizer flow if desired.
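A sketch of the parallel fan-out over those backends, assuming async adapters (the `resolve` signature above is synchronous; the `backends` registry and its stub return values here are hypothetical, for illustration only):

```typescript
// From the resolver contract above.
interface ResolveResult {
  symbol_id: string;
  file?: string;
  line?: number;
  confidence: number; // 0..1
  resolver: string;
}

// Hypothetical async adapter signature; real adapters wrap debuginfod,
// DIA/PDB, or dSYM lookups. The stub data here is illustrative only.
type Resolver = (address: string, bundleHint?: string) => Promise<ResolveResult | null>;

const backends: Record<string, Resolver> = {
  debuginfod: async (addr) =>
    addr === "0x400abc"
      ? { symbol_id: "svc@a1b2c3:main", confidence: 0.98, resolver: "debuginfod" }
      : null,
  dia: async () => null,      // would query PDB by GUID+Age
  dsymutil: async () => null, // would query dSYM by UUID
};

// Fan out to all backends in parallel; keep only successful hits and
// stamp each with the backend that produced it.
async function fanOut(address: string, bundleHint?: string): Promise<ResolveResult[]> {
  const entries = Object.entries(backends);
  const settled = await Promise.allSettled(entries.map(([, r]) => r(address, bundleHint)));
  const hits: ResolveResult[] = [];
  settled.forEach((s, i) => {
    if (s.status === "fulfilled" && s.value !== null) {
      hits.push({ ...s.value, resolver: entries[i][0] });
    }
  });
  return hits;
}
```

`Promise.allSettled` (rather than `Promise.all`) keeps one failing backend from discarding the others' answers.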

Deterministic ingest & hashing

  • Parse incoming JSON → canonicalize via JCS → compute sha256 → verify signature → only then proceed.
  • Persist {canonical_json, sha256, signature, received_at} so downstream stages always pull the exact blob.
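A minimal sketch of that ingest gate, using Node's `node:crypto` for hashing. Note that full RFC 8785 also mandates a specific number serialization, and the DSSE signature check is elided here:

```typescript
import { createHash } from "node:crypto";

// Minimal JCS-style canonicalizer: sorts object keys recursively and
// serializes with no insignificant whitespace. Full RFC 8785 additionally
// pins number formatting (ECMAScript shortest round-trip), which
// JSON.stringify matches for integers and typical doubles.
function canonicalize(value: unknown): string {
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalize).join(",") + "]";
  }
  const obj = value as Record<string, unknown>;
  const keys = Object.keys(obj).sort(); // JCS: lexicographic by UTF-16 code units
  const body = keys.map((k) => JSON.stringify(k) + ":" + canonicalize(obj[k]));
  return "{" + body.join(",") + "}";
}

function sha256Hex(s: string): string {
  return createHash("sha256").update(s, "utf8").digest("hex");
}

// Ingest gate: canonicalize, hash, compare against the claimed digest.
// Signature verification would sit between the hash check and the return.
function verifyTile(tile: unknown, claimedSha256: string): string {
  const jcs = canonicalize(tile);
  if (sha256Hex(jcs) !== claimedSha256) throw new Error("evidence tile hash mismatch");
  return jcs; // persist this exact blob for downstream stages
}
```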

Unwinding & symbolization pipeline (deterministic)

  1. Normalize modules (match load addresses → build-ids/GUIDs/UUIDs).

  2. Unwind using the declared policy in unwind_context (frame pointers vs. EH/CFI tables).

  3. For each PC:

    • Parallel resolve via resolvers (debuginfod, DIA/PDB, dSYM).
    • Pick the winner by deterministic reducer: highest confidence, then lexical tiebreak using deterministic_seed.
  4. Emit frames with symbol_id (stable, content-addressed where possible) and optional file:line.
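The deterministic reducer in step 3 can be sketched as a pure function. The seeded-hash comparison below is one possible realization of "lexical tiebreak using deterministic_seed", not a scheme prescribed by the contract:

```typescript
import { createHash } from "node:crypto";

interface Candidate {
  symbol_id: string;
  confidence: number; // 0..1
  resolver: string;
}

// Highest confidence wins; exact ties are broken by comparing
// sha256(seed + ":" + resolver) lexically, so the same seed always picks
// the same winner regardless of the order candidates arrive in.
function pickWinner(candidates: Candidate[], seed: string): Candidate | null {
  if (candidates.length === 0) return null;
  const key = (c: Candidate) =>
    createHash("sha256").update(seed + ":" + c.resolver).digest("hex");
  return [...candidates].sort(
    (a, b) => b.confidence - a.confidence || (key(a) < key(b) ? -1 : 1)
  )[0];
}
```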


Telemetry & SLOs (what to measure)

  • replay_success_ratio (golden ≥ 95%) — same input → same output.
  • symbol_coverage_pct (prod ≥ 90%) — % of frames resolved to symbols.
  • verify_time_ms (median ≤ 3000ms) — signature + hash + canonicalization + core steps.
  • resolver_latency_ms per backend — for tuning caches and fallbacks.
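As one example, replay_success_ratio can be computed by comparing the canonical output hash of each replay against the original run's hash (a sketch; the pair structure is an assumption):

```typescript
// replay_success_ratio: fraction of replays whose canonical output hash
// matches the original run's hash (same input → same output).
interface ReplayPair {
  original_sha256: string;
  rerun_sha256: string;
}

function replaySuccessRatio(pairs: ReplayPair[]): number {
  if (pairs.length === 0) return 1; // vacuously successful
  const ok = pairs.filter((p) => p.original_sha256 === p.rerun_sha256).length;
  return ok / pairs.length;
}
```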

Tradeoffs (make them explicit)

  • On-demand decompilation / function matching

    • Higher confidence on stripped binaries
    • More CPU/latency; potentially leaks more symbol metadata (privacy)
  • Progressive fetch + partial symbolization

    • Lower latency, good UX under load
    • Lower confidence on some frames; weaker explainability (risk of false positives)

Pick per environment via env_knobs and record that in the replay_manifest.


Minimal wire formats (copy/paste ready)

Evidence tile (canonical, pre-hash)

{
  "evidence_version": 1,
  "platform": "linux",
  "arch": "x86_64",
  "fault_addr": "0x7f1a2b3c",
  "registers": { "rip": "0x7f1a2b3c", "rsp": "0x7ffd...", "rbp": "0x..." },
  "modules": [
    {"name":"svc","base":"0x400000","build_id":"a1b2c3..."},
    {"name":"libc.so.6","base":"0x7f...","build_id":"d4e5f6..."}
  ],
  "ts_unix_ms": 1739999999999
}

Analyzer request

{
  "signed_evidence_tile": {
    "jcs_json": "<the exact JCS-canonical JSON above>",
    "sha256": "f1c2...deadbeef",
    "signature": "dsse/…"
  },
  "symbol_pointers": {
    "linux": ["debuginfod:buildid:a1b2c3..."],
    "windows": ["pdb:GUID+Age:..."],
    "apple": ["dsym:UUID:..."]
  },
  "unwind_context": {
    "prefer_unwind_tables": true,
    "stack_limit_bytes": 262144
  },
  "deterministic_seed": "6f5d7d1e-..."
}

Analyzer response

{
  "call_stack": [
    {
      "addr": "0x400abc",
      "symbol_id": "svc@a1b2c3...:main",
      "file": "main.cpp",
      "line": 127,
      "symbol_resolution_confidence": 0.98,
      "resolver": "debuginfod"
    }
  ],
  "replay_manifest": {
    "seed": "6f5d7d1e-...",
    "env_knobs": { "progressive_fetch": true, "max_resolvers": 3 },
    "symbol_bundle_pointer": "bundle://a1b2c3.../svc.sym"
  }
}

How this plugs into StellaOps

  • EvidenceLocker: store the JCS-canonical tile + DSSE signature + sha256.
  • AdvisoryAI: consume symbol-pinned stacks as first-class facts for RCA, search, and explanations.
  • Attestor: sign analyzer outputs (DSSE) and attach to Releases/Incidents.
  • CI: on build, publish symbol bundles (ELF buildid / PDB GUID+Age / dSYM UUID) to your internal stores; register debuginfod endpoints.
  • SLO dashboards: show coverage, latency, and replay ratio by service and release.

Quick implementation checklist

  • JCS canonicalization + sha256 + DSSE verify gate
  • Resolver interface + parallel fanout + deterministic reducer
  • debuginfod client (ELF), DIA/PDB (Windows), dSYM/DWARF (Apple) adapters
  • Unwinder with policy switches (frame pointers vs. CFI)
  • Content-addressed symbol_id scheme
  • Replay harness honoring replay_manifest
  • Metrics emitters + SLO dashboards
  • Privacy guardrails (strip/leak-check symbol metadata by env)

If you want, I can generate a tiny reference service (Go or C#) with: JCS canonicalizer, debuginfod lookup, DIA shim, dSYM flow, and the exact JSON contracts above so you can drop it into your build & incident pipeline.