Here’s a clean, first‑time‑friendly blueprint for a deterministic crash analyzer pipeline you can drop into Stella Ops (or any CI/CD + observability stack).
What this thing does (in plain words)
It ingests a crash “evidence tile” (a signed, canonical JSON blob + hash), looks up symbols from your chosen stores (ELF/PDB/dSYM), unwinds the stack deterministically, and returns a stable, symbol‑pinned call stack plus a replay manifest so you can reproduce the exact same result later—bit‑for‑bit.
The contract
Input (strict, deterministic)
- signed_evidence_tile: Canonical JSON (JCS) with your payload (e.g., OS, arch, registers, fault addr, module list) and its sha256.
  - Canonicalization must follow RFC 8785 JCS to make the hash verifiable.
- symbol_pointers:
  - ELF: debuginfod build‑id URIs
  - Windows: PDB GUID+Age
  - Apple: dSYM UUID
- unwind_context: register snapshot, preference flags (e.g., “prefer unwind tables over frame‑pointers”), OS/ABI hints.
- deterministic_seed: single source of truth for any randomized tie‑breakers or heuristics.
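To make the input contract concrete, here is a minimal sketch of the request shape as TypeScript types, plus a small shape check for untrusted JSON. The field names follow the wire formats shown later in this document; the type names and the `validateRequest` / `parseRequest` helpers are illustrative assumptions, not an existing API.

```ts
// Illustrative request types; type names are assumptions, field names follow the wire format.
interface AnalyzerRequest {
  signed_evidence_tile: { jcs_json: string; sha256: string; signature: string };
  symbol_pointers: { linux?: string[]; windows?: string[]; apple?: string[] };
  unwind_context: Record<string, unknown>;
  deterministic_seed: string;
}

function asRecord(value: unknown): Record<string, unknown> | null {
  return typeof value === "object" && value !== null && !Array.isArray(value)
    ? (value as Record<string, unknown>)
    : null;
}

// Returns a list of problems; an empty list means the shape looks right.
// This checks structure only. Hash and signature verification are separate steps.
function validateRequest(value: unknown): string[] {
  const req = asRecord(value);
  if (req === null) return ["request must be a JSON object"];
  const problems: string[] = [];

  const tile = asRecord(req["signed_evidence_tile"]);
  if (tile === null) {
    problems.push("signed_evidence_tile is missing");
  } else {
    const hash = tile["sha256"];
    if (typeof tile["jcs_json"] !== "string") problems.push("jcs_json must be a string");
    if (typeof hash !== "string" || !/^[0-9a-f]{64}$/i.test(hash)) {
      problems.push("sha256 must be 64 hex characters");
    }
    if (typeof tile["signature"] !== "string") problems.push("signature must be a string");
  }

  const pointers = asRecord(req["symbol_pointers"]);
  if (pointers === null) {
    problems.push("symbol_pointers is missing");
  } else {
    for (const platform of ["linux", "windows", "apple"]) {
      const list = pointers[platform];
      const ok = Array.isArray(list) && list.every((p) => typeof p === "string");
      if (list !== undefined && !ok) {
        problems.push("symbol_pointers." + platform + " must be an array of strings");
      }
    }
  }

  if (asRecord(req["unwind_context"]) === null) problems.push("unwind_context is missing");

  const seed = req["deterministic_seed"];
  if (typeof seed !== "string" || seed === "") {
    problems.push("deterministic_seed must be a non-empty string");
  }
  return problems;
}

function parseRequest(value: unknown): AnalyzerRequest {
  const problems = validateRequest(value);
  if (problems.length > 0) throw new Error(problems.join("; "));
  return value as AnalyzerRequest;
}
```

Rejecting a request without a seed up front keeps any later tie-breaking from silently falling back to a non-reproducible default.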
Output
- call_stack: ordered vector of frames, each with addr, symbol_id, optional file:line, symbol_resolution_confidence, and resolver (which backend won).
- replay_manifest: { seed, env_knobs, symbol_bundle_pointer } so you (or CI) can re‑run the exact same resolution later.
Resolver abstraction (so CI can fan‑out)
Define a tiny interface and run resolvers in parallel; record who succeeded:
type Platform = "linux" | "windows" | "apple";
interface ResolveResult {
symbol_id: string; // stable id in your store
file?: string;
line?: number;
confidence: number; // 0..1
resolver: string; // e.g., "debuginfod", "dia", "dsymutil"
}
declare function resolve(address: string, platform: Platform, bundle_hint?: string): ResolveResult | null;
Backends:
- Linux/ELF: debuginfod (by build‑id), DWARF/unwind tables.
- Windows: DIA/PDB (by GUID+Age).
- Apple: dSYM/DWARF (by UUID), atos/llvm-symbolizer flow if desired.
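The parallel fan-out could look roughly like the sketch below. It assumes an async variant of the `resolve` signature (the `Resolver` type and `fanOut` function are hypothetical names introduced here) so each backend can do network or disk I/O. Results come back in the order the resolvers were listed, not the order they finished, which keeps the step deterministic.

```ts
type Platform = "linux" | "windows" | "apple";

interface ResolveResult {
  symbol_id: string;
  file?: string;
  line?: number;
  confidence: number; // 0..1
  resolver: string;
}

// Assumed async form of the resolver interface.
type Resolver = (
  address: string,
  platform: Platform,
  bundleHint?: string
) => Promise<ResolveResult | null>;

// Query every backend concurrently. A backend that throws or rejects is treated
// as "no result" so one unavailable symbol store cannot abort the whole lookup.
async function fanOut(
  resolvers: Resolver[],
  address: string,
  platform: Platform,
  bundleHint?: string
): Promise<ResolveResult[]> {
  const settled = await Promise.all(
    resolvers.map((r) =>
      Promise.resolve()
        .then(() => r(address, platform, bundleHint))
        .catch(() => null)
    )
  );
  // Promise.all preserves input order, so output order is independent of timing.
  return settled.filter((x): x is ResolveResult => x !== null);
}
```

Swallowing backend errors is a design choice for this sketch; a production version would probably record them alongside resolver_latency_ms.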
Deterministic ingest & hashing
- Parse incoming JSON → canonicalize via JCS → compute sha256 → verify signature → only then proceed.
- Persist {canonical_json, sha256, signature, received_at} so downstream stages always pull the exact blob.
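The gate above can be sketched as follows. This is a simplified JCS-style canonicalizer (sorted keys, no whitespace, ECMAScript number and string serialization via `JSON.stringify`), not a fully conformant RFC 8785 implementation, and `verifySignature` is a hypothetical callback standing in for DSSE verification. It assumes a Node.js runtime for `node:crypto`.

```ts
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Simplified JCS-style canonicalization; edge cases such as lone surrogates are not handled.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") {
    if (typeof value === "number" && !Number.isFinite(value)) {
      throw new Error("NaN and Infinity are not valid JSON numbers");
    }
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    return "[" + value.map((v) => canonicalize(v)).join(",") + "]";
  }
  // The default sort() compares UTF-16 code units, the key order RFC 8785 specifies.
  const members = Object.keys(value)
    .sort()
    .map((key) => JSON.stringify(key) + ":" + canonicalize(value[key] as Json));
  return "{" + members.join(",") + "}";
}

function sha256Hex(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

interface SignedTile {
  jcs_json: string;
  sha256: string;
  signature: string;
}

// Parse -> canonicalize -> hash -> verify signature, in that order.
// Returns the canonical JSON that is safe to persist and pass downstream.
function admitTile(
  tile: SignedTile,
  verifySignature: (canonicalJson: string, signature: string) => boolean
): string {
  const canonical = canonicalize(JSON.parse(tile.jcs_json) as Json);
  if (canonical !== tile.jcs_json) throw new Error("payload is not in canonical form");
  if (sha256Hex(canonical) !== tile.sha256.toLowerCase()) throw new Error("sha256 mismatch");
  if (!verifySignature(canonical, tile.signature)) throw new Error("signature rejected");
  return canonical;
}
```

Requiring the submitted string to equal its own canonical form means two senders can never produce different hashes for the same logical payload.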
Unwinding & symbolization pipeline (deterministic)
- Normalize modules (match load addresses → build‑ids/GUIDs/UUIDs).
- Unwind using the declared policy in unwind_context (frame pointers vs. EH/CFI tables).
- For each PC:
  - Parallel resolve via resolvers (debuginfod, DIA/PDB, dSYM).
  - Pick the winner by deterministic reducer: highest confidence, then lexical tie‑break using deterministic_seed.
- Emit frames with symbol_id (stable, content‑addressed if possible), and optional file:line.
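One way to read the "highest confidence, then lexical tie-break using deterministic_seed" rule is sketched below. The document does not say how the seed enters the comparison, so this assumes a seeded hash of each candidate, compared lexically; `fnv1aHex`, `tieBreakKey`, and `pickWinner` are hypothetical names. The result depends only on the candidate set and the seed, never on arrival order.

```ts
interface ResolveResult {
  symbol_id: string;
  file?: string;
  line?: number;
  confidence: number; // 0..1
  resolver: string;
}

// 32-bit FNV-1a over UTF-16 code units, rendered as 8 hex digits.
// Any stable hash would do; this one just avoids dependencies.
function fnv1aHex(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

// Assumed tie-break key: seeded hash first, then the raw fields so that
// a hash collision still yields a total order.
function tieBreakKey(seed: string, r: ResolveResult): string {
  const identity = r.resolver + "|" + r.symbol_id;
  return fnv1aHex(seed + "|" + identity) + "|" + identity;
}

function pickWinner(candidates: ResolveResult[], seed: string): ResolveResult | null {
  const ranked = candidates.slice().sort((a, b) => {
    // Exact comparison on purpose: rounding confidences would itself need a recorded policy.
    if (a.confidence !== b.confidence) return b.confidence - a.confidence;
    const ka = tieBreakKey(seed, a);
    const kb = tieBreakKey(seed, b);
    return ka < kb ? -1 : ka > kb ? 1 : 0;
  });
  return ranked[0] ?? null;
}
```

Because the seed is recorded in replay_manifest, a replay with the same seed and the same candidates picks the same winner.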
Telemetry & SLOs (what to measure)
- replay_success_ratio (golden ≥ 95%) — same input → same output.
- symbol_coverage_pct (prod ≥ 90%) — % of frames resolved to symbols.
- verify_time_ms (median ≤ 3000 ms) — signature + hash + canonicalization + core steps.
- resolver_latency_ms per backend — for tuning caches and fallbacks.
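As a rough illustration of how these numbers could be computed from raw samples, here is a small sketch. The function and field names are assumptions, and the thresholds are the ones listed above. The document scopes them to different environments (golden vs. prod); this sketch checks them together for brevity.

```ts
type FrameLike = { symbol_id?: string | null };

// Percentage of frames that resolved to a symbol.
function symbolCoveragePct(frames: FrameLike[]): number {
  if (frames.length === 0) return 0;
  const resolved = frames.filter(
    (f) => typeof f.symbol_id === "string" && f.symbol_id !== ""
  ).length;
  return (resolved * 100) / frames.length;
}

// Median of a sample set; averages the two middle values for even-sized input.
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty sample");
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

interface SloReport {
  replay_success_ratio: number; // 0..1
  symbol_coverage_pct: number; // 0..100
  verify_time_ms_median: number;
  within_slo: boolean;
}

function evaluateSlos(
  replayOutcomes: boolean[], // true = replay reproduced the recorded output
  frames: FrameLike[],
  verifyTimesMs: number[]
): SloReport {
  const ratio =
    replayOutcomes.length === 0
      ? 0
      : replayOutcomes.filter(Boolean).length / replayOutcomes.length;
  const coverage = symbolCoveragePct(frames);
  const medianMs = median(verifyTimesMs);
  return {
    replay_success_ratio: ratio,
    symbol_coverage_pct: coverage,
    verify_time_ms_median: medianMs,
    within_slo: ratio >= 0.95 && coverage >= 90 && medianMs <= 3000,
  };
}
```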
Trade‑offs (make them explicit)
- On‑demand decompilation / function‑matching
  - ✅ Higher confidence on stripped binaries
  - ❌ More CPU/latency; potentially leaks more symbol metadata (privacy)
- Progressive fetch + partial symbolization
  - ✅ Lower latency, good UX under load
  - ❌ Lower confidence on some frames; riskier explainability (false positives)
Pick per environment via env_knobs and record that in the replay_manifest.
Minimal wire formats (copy/paste ready)
Evidence tile (canonical, pre‑hash)
{
"evidence_version": 1,
"platform": "linux",
"arch": "x86_64",
"fault_addr": "0x7f1a2b3c",
"registers": { "rip": "0x7f1a2b3c", "rsp": "0x7ffd...", "rbp": "0x..." },
"modules": [
{"name":"svc","base":"0x400000","build_id":"a1b2c3..."},
{"name":"libc.so.6","base":"0x7f...","build_id":"d4e5f6..."}
],
"ts_unix_ms": 1739999999999
}
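For the module-normalization step, a tile like the one above is enough to map a program counter to a module and a module-relative offset. The sketch below is an assumption-heavy illustration: `findModule` is a hypothetical helper, and because the tile carries only base addresses (no sizes), it attributes an address to the nearest module loaded at or below it. It also parses addresses as JavaScript numbers, so it only covers addresses below 2^53.

```ts
interface Module {
  name: string;
  base: string; // hex string, e.g. "0x400000"
  build_id: string;
}

interface ModuleHit {
  module: Module;
  offset: string; // module-relative offset as a hex string
}

// Strict parse: rejects elided or malformed values instead of reading a prefix.
function parseAddr(hex: string): number {
  if (!/^0x[0-9a-fA-F]+$/.test(hex)) throw new Error("bad address: " + hex);
  const n = Number.parseInt(hex, 16);
  if (!Number.isSafeInteger(n)) throw new Error("address too large for this sketch: " + hex);
  return n;
}

// Pick the module with the highest base that is still <= pc.
// If two modules share a base, the first one listed wins.
function findModule(pc: string, modules: Module[]): ModuleHit | null {
  const addr = parseAddr(pc);
  let best: Module | null = null;
  let bestBase = -1;
  for (const m of modules) {
    const base = parseAddr(m.base);
    if (base <= addr && base > bestBase) {
      best = m;
      bestBase = base;
    }
  }
  if (best === null) return null;
  return { module: best, offset: "0x" + (addr - bestBase).toString(16) };
}
```

The resulting build_id plus offset pair is what a resolver would look up, which is why it stays stable across runs even when load addresses change.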
Analyzer request
{
"signed_evidence_tile": {
"jcs_json": "<the exact JCS-canonical JSON above>",
"sha256": "f1c2...deadbeef",
"signature": "dsse/…"
},
"symbol_pointers": {
"linux": ["debuginfod:buildid:a1b2c3..."],
"windows": ["pdb:GUID+Age:..."],
"apple": ["dsym:UUID:..."]
},
"unwind_context": {
"prefer_unwind_tables": true,
"stack_limit_bytes": 262144
},
"deterministic_seed": "6f5d7d1e-..."
}
Analyzer response
{
"call_stack": [
{
"addr": "0x400abc",
"symbol_id": "svc@a1b2c3...:main",
"file": "main.cpp",
"line": 127,
"symbol_resolution_confidence": 0.98,
"resolver": "debuginfod"
}
],
"replay_manifest": {
"seed": "6f5d7d1e-...",
"env_knobs": { "progressive_fetch": true, "max_resolvers": 3 },
"symbol_bundle_pointer": "bundle://a1b2c3.../svc.sym"
}
}
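A replay check against a recorded response could be as small as the sketch below. `Analyze` is a hypothetical entry point for the analyzer, and `replayMatches` is an illustrative helper; the idea is simply to re-run with the recorded replay_manifest and compare the call stacks field by field.

```ts
interface Frame {
  addr: string;
  symbol_id: string;
  file?: string;
  line?: number;
  symbol_resolution_confidence: number;
  resolver: string;
}

interface ReplayManifest {
  seed: string;
  env_knobs: Record<string, unknown>;
  symbol_bundle_pointer: string;
}

interface AnalyzerResponse {
  call_stack: Frame[];
  replay_manifest: ReplayManifest;
}

// Hypothetical analyzer entry point: same tile + same manifest should give the same stack.
type Analyze = (tileJson: string, manifest: ReplayManifest) => AnalyzerResponse;

// Serialize a frame with a fixed field order so the comparison ignores object key order.
function frameKey(f: Frame): string {
  return JSON.stringify([
    f.addr,
    f.symbol_id,
    f.file ?? null,
    f.line ?? null,
    f.symbol_resolution_confidence,
    f.resolver,
  ]);
}

// JSON.stringify escapes newlines, so joining on "\n" cannot produce ambiguous output.
function stackFingerprint(stack: Frame[]): string {
  return stack.map(frameKey).join("\n");
}

function replayMatches(
  tileJson: string,
  recorded: AnalyzerResponse,
  analyze: Analyze
): boolean {
  const rerun = analyze(tileJson, recorded.replay_manifest);
  return stackFingerprint(rerun.call_stack) === stackFingerprint(recorded.call_stack);
}
```

Counting the share of `true` results over a set of recorded incidents gives the replay_success_ratio metric.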
How this plugs into Stella Ops
- EvidenceLocker: store the JCS‑canonical tile + DSSE signature + sha256.
- AdvisoryAI: consume symbol‑pinned stacks as first‑class facts for RCA, search, and explanations.
- Attestor: sign analyzer outputs (DSSE) and attach to Releases/Incidents.
- CI: on build, publish symbol bundles (ELF build‑id / PDB GUID+Age / dSYM UUID) to your internal stores; register debuginfod endpoints.
- SLO dashboards: show coverage, latency, and replay ratio by service and release.
Quick implementation checklist
- JCS canonicalization + sha256 + DSSE verify gate
- Resolver interface + parallel fan‑out + deterministic reducer
- debuginfod client (ELF), DIA/PDB (Windows), dSYM/DWARF (Apple) adapters
- Unwinder with policy switches (frame‑ptr vs. CFI)
- Content‑addressed symbol_id scheme
- Replay harness honoring replay_manifest
- Metrics emitters + SLO dashboards
- Privacy guardrails (strip/leak‑check symbol metadata by env)
If you want, I can generate a tiny reference service (Go or C#) with: JCS canonicalizer, debuginfod lookup, DIA shim, dSYM flow, and the exact JSON contracts above so you can drop it into your build & incident pipeline.