Here’s a compact, end‑to‑end design you can drop into a repo: a cross‑platform call‑stack analyzer plus an offline capture/replay pipeline with provable symbol provenance—built to behave the same on Linux, Windows, and macOS, and to pass strict CI acceptance tests.
What this solves (quick context)
- Problem: stack unwinding differs by OS, binary format, runtime (signals/async/coroutines), and symbol sources—making incident triage noisy and non‑reproducible.
- Goal: one analyzer that normalizes unwinding invariants, records traces, resolves symbols offline, and replays to verify determinism and coverage—useful for Stella Ops evidence capture and air‑gapped flows.
Unwinding model (portable)
- Primary CFI: DWARF .eh_frame/.debug_frame (Linux/macOS), .pdata/unwind info (Windows).
- IDs for symbol lookup:
  - Linux: ELF build‑id (.note.gnu.build-id)
  - macOS: Mach‑O UUID (dSYM)
  - Windows: PDB GUID+Age
- Fallback chain per frame (strict order, record provenance):
  - CFI/CIE lookup (libunwind/LLVM, DIA on Windows, Apple DWARF tools)
  - Frame‑pointer walk if available
  - Language/runtime helpers (e.g., Go, Rust, JVM, .NET where present)
  - Heuristic last‑resort (conservative unwind, stop on ambiguity)
- Async/signal/coroutines: stitch segments by reading runtime metadata and signal trampolines, then join on saved contexts; tag boundaries so replay can validate.
- Kernel/eBPF contexts (Linux): optional BTF‑assisted unwind for kernel frames when traces cross user/kernel boundary.
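The strict-order fallback chain with per-frame provenance can be sketched as follows. This is a toy model, not a real unwinder: each strategy here maps an instruction pointer straight to its caller's instruction pointer, whereas a real CFI or frame-pointer step needs the full register state. The names `Frame`, `Strategy`, and `unwind` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Frame:
    ip: int
    unwinder: str  # provenance: which strategy produced this frame

# Toy strategy signature (assumption): current ip -> caller ip, or None if it cannot unwind.
Strategy = Callable[[int], Optional[int]]

def unwind(start_ip: int,
           chain: List[Tuple[str, Strategy]],
           max_frames: int = 64) -> List[Frame]:
    """Walk the stack, trying strategies in strict order for every frame."""
    frames = [Frame(start_ip, "capture")]
    ip = start_ip
    while len(frames) < max_frames:
        for name, step in chain:
            caller = step(ip)
            if caller is not None:
                # First strategy that succeeds wins; record which one it was.
                frames.append(Frame(caller, name))
                ip = caller
                break
        else:
            # No strategy could unwind this frame: stop rather than guess.
            break
    return frames
```

Passing the chain as an ordered list keeps the priority explicit, and the `unwinder` tag on every frame is what a later replay step would compare against.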
Offline symbol bundles (content‑addressed)
Required bundle contents (per‑OS id map + index):
- Content‑addressed index (sha256 keys)
- Per‑OS mapping:
  - Linux: build‑id → path/blob
  - Windows: PDB GUID+Age → PDB blob
  - macOS: UUID → dSYM blob
- symbol_index.json (addr → file:line + function)
- DSSE signature (+ signer)
- Rekor inclusion proof or embedded tile fragment (for transparency)
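A minimal sketch of the content-addressed index and per-OS mapping, assuming blobs are keyed by the SHA-256 of their bytes and the bundle digest is taken over a canonical JSON encoding of the id map. The function names and the in-memory layout are illustrative assumptions, not a fixed bundle format.

```python
import hashlib
import json

def content_key(blob: bytes) -> str:
    """Content address of a blob: 'sha256:' + hex digest of its bytes."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()

def build_index(entries):
    """entries: iterable of (os_name, debug_id, blob).

    Returns (id_map, blobs): id_map maps os -> debug id -> content key,
    blobs maps content key -> bytes (identical blobs are stored once).
    """
    id_map = {"linux": {}, "windows": {}, "macos": {}}
    blobs = {}
    for os_name, debug_id, blob in entries:
        key = content_key(blob)
        blobs[key] = blob
        id_map[os_name][debug_id] = key
    return id_map, blobs

def bundle_digest(id_map) -> str:
    """Digest over a canonical encoding, so insertion order does not matter."""
    canonical = json.dumps(id_map, sort_keys=True, separators=(",", ":"))
    return content_key(canonical.encode("utf-8"))
```

Sorting keys before hashing is the design choice that makes the digest reproducible across builders.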
Acceptance rules:
- symbol_coverage_pct ≥ 90% per trace (resolver chain: debuginfod → local bundle → heuristic demangle)
- Replay across 5 seeds: replay_success_ratio ≥ 0.95
- DSSE + Rekor proofs verify offline
- Platform checks:
  - ELF build‑id matches binary note
  - PDB GUID+Age matches module metadata
  - dSYM UUID matches Mach‑O UUID
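As one example of a platform check, the ELF build-id comparison could look like the sketch below. It assumes the raw bytes of the note section (for example `.note.gnu.build-id`) have already been extracted from the binary; locating the section is out of scope here. The note layout (three 32-bit words, then name and descriptor padded to 4 bytes) and the `NT_GNU_BUILD_ID` value of 3 follow the ELF note format; the function names are hypothetical.

```python
import struct
from typing import Optional

NT_GNU_BUILD_ID = 3  # note type used by GNU toolchains for the build-id

def _align4(n: int) -> int:
    return (n + 3) & ~3

def find_build_id(notes: bytes, little_endian: bool = True) -> Optional[str]:
    """Scan ELF note records and return the GNU build-id as lowercase hex."""
    fmt = "<III" if little_endian else ">III"
    off = 0
    while off + 12 <= len(notes):
        namesz, descsz, ntype = struct.unpack_from(fmt, notes, off)
        off += 12
        name = notes[off:off + namesz]
        off += _align4(namesz)
        desc = notes[off:off + descsz]
        off += _align4(descsz)
        if len(name) < namesz or len(desc) < descsz:
            return None  # truncated note: refuse to guess
        if ntype == NT_GNU_BUILD_ID and name == b"GNU\x00":
            return desc.hex()
    return None

def build_id_matches(notes: bytes, recorded: str) -> bool:
    """True only if the binary's note carries exactly the recorded build-id."""
    found = find_build_id(notes)
    return found is not None and found == recorded.lower()
```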
Minimal Postgres schema (ready to run)
CREATE TABLE traces (
  trace_id             UUID PRIMARY KEY,
  platform             TEXT,            -- linux | windows | macos
  captured_at          TIMESTAMPTZ,     -- time zone aware, so captures from different hosts compare correctly
  build_id             TEXT,
  symbol_bundle_sha256 TEXT,
  dsse_ref             TEXT
);

CREATE TABLE frames (
  trace_id        UUID REFERENCES traces,
  frame_index     INT,
  ip              NUMERIC(20,0),        -- unsigned 64-bit address; signed BIGINT overflows on kernel-range addresses
  module_path     TEXT,
  module_build_id TEXT,
  resolved_symbol TEXT,
  symbol_offset   BIGINT,
  resolver        TEXT,                 -- provenance: which resolver produced resolved_symbol
  PRIMARY KEY (trace_id, frame_index)
);

CREATE TABLE symbol_bundles (
  sha256         TEXT PRIMARY KEY,
  os             TEXT,
  bundle_blob    BYTEA,
  index_json     JSONB,
  signer         TEXT,
  rekor_tile_ref TEXT
);

CREATE TABLE replays (
  replay_id            UUID PRIMARY KEY,
  trace_id             UUID REFERENCES traces,
  seed                 BIGINT,
  started_at           TIMESTAMPTZ,
  finished_at          TIMESTAMPTZ,
  replay_success_ratio FLOAT,
  verify_time_ms       INT,
  verifier_version     TEXT,
  notes                JSONB
);
Event payloads (wire format)
{"event":"trace.capture","trace_id":"...","platform":"linux","build_id":"<gnu-build-id>","frames":[{"ip":"0x..","module":"/usr/bin/foo","module_build_id":"<id>"}],"symbol_bundle_ref":"sha256:...","dsse_ref":"dsse:..."}
{"event":"replay.result","replay_id":"...","trace_id":"...","seed":42,"replay_success_ratio":0.98,"symbol_coverage_pct":93,"verify_time_ms":8423}
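A reader for these events might look like the sketch below. It treats the log as JSON Lines, takes the required field sets from the two sample payloads above, and converts hex `ip` strings to integers. Treating every sample field as mandatory is an assumption, as are the names `REQUIRED` and `parse_events`.

```python
import json

# Required keys per event type, taken from the sample payloads (assumed mandatory).
REQUIRED = {
    "trace.capture": {"trace_id", "platform", "build_id", "frames",
                      "symbol_bundle_ref", "dsse_ref"},
    "replay.result": {"replay_id", "trace_id", "seed", "replay_success_ratio",
                      "symbol_coverage_pct", "verify_time_ms"},
}

def parse_events(lines):
    """Parse JSON Lines events, rejecting unknown types and missing fields."""
    events = []
    for lineno, line in enumerate(lines, 1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        event = json.loads(line)
        kind = event.get("event")
        if kind not in REQUIRED:
            raise ValueError(f"line {lineno}: unknown event {kind!r}")
        missing = REQUIRED[kind] - set(event)
        if missing:
            raise ValueError(f"line {lineno}: {kind} missing {sorted(missing)}")
        if kind == "trace.capture":
            for frame in event["frames"]:
                frame["ip"] = int(frame["ip"], 16)  # "0x..." hex string -> int
        events.append(event)
    return events
```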
Resolver policy (per‑OS, enforced)
- Linux: debuginfod → local bundle (build‑id) → DWARF CFI → FP → heuristic demangle
- Windows: local bundle (PDB GUID+Age via DIA) → .pdata unwind → FP → demangle
- macOS: local bundle (dSYM UUID) → DWARF CFI → FP → demangle
Record the resolver used on every frame.
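The per-OS policy can be expressed as ordered data and enforced in one place. The sketch below assumes each resolver is a callable from an address to a symbol name or `None`; the resolver identifiers (`local_bundle`, `dwarf_cfi`, and so on) are hypothetical labels for the stages listed above.

```python
# Ordered resolver names per platform, mirroring the policy above (labels are assumptions).
POLICY = {
    "linux":   ["debuginfod", "local_bundle", "dwarf_cfi", "frame_pointer",
                "heuristic_demangle"],
    "windows": ["local_bundle", "pdata_unwind", "frame_pointer", "demangle"],
    "macos":   ["local_bundle", "dwarf_cfi", "frame_pointer", "demangle"],
}

def resolve_frame(platform, ip, resolvers):
    """Try resolvers in policy order; return (symbol, resolver_name).

    `resolvers` maps a resolver name to a callable ip -> symbol or None.
    Stages with no registered callable are skipped.
    """
    for name in POLICY[platform]:
        fn = resolvers.get(name)
        if fn is None:
            continue
        symbol = fn(ip)
        if symbol is not None:
            return symbol, name
    return None, None

def policy_violations(platform, frames):
    """Indexes of frames whose recorded resolver is missing or not in the policy."""
    allowed = set(POLICY[platform])
    return [i for i, f in enumerate(frames) if f.get("resolver") not in allowed]
```

Keeping the order in data rather than in control flow makes the policy easy to audit and to diff between releases.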
CI acceptance scripts (tiny but strict)
- Run capture → resolve → replay across 5 seeds; fail merge if any SLO unmet.
- Verify DSSE signature and Rekor inclusion offline.
- Assert per‑platform ID matches (build‑id / GUID+Age / UUID).
- Emit a short JUnit‑style report plus % coverage and % success.
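A gate along these lines could evaluate the thresholds and emit the report. Two assumptions are baked in: coverage is the share of frames with a non-empty `resolved_symbol`, and every seed must individually reach the 0.95 ratio (the text does not say whether the bar is per seed or averaged). The XML is a minimal JUnit-style shape, not a complete schema.

```python
import xml.etree.ElementTree as ET

MIN_COVERAGE_PCT = 90.0
MIN_REPLAY_RATIO = 0.95

def symbol_coverage_pct(frames):
    """Percentage of frames that carry a resolved symbol."""
    if not frames:
        return 0.0
    resolved = sum(1 for f in frames if f.get("resolved_symbol"))
    return 100.0 * resolved / len(frames)

def evaluate(frames, replay_ratios):
    """Return a list of (check_name, passed, detail) tuples.

    replay_ratios maps seed -> replay_success_ratio for that seed.
    """
    cov = symbol_coverage_pct(frames)
    checks = [("symbol_coverage_pct", cov >= MIN_COVERAGE_PCT,
               f"{cov:.1f}% (need >= {MIN_COVERAGE_PCT}%)")]
    for seed, ratio in sorted(replay_ratios.items()):
        checks.append((f"replay_seed_{seed}", ratio >= MIN_REPLAY_RATIO,
                       f"{ratio:.2f} (need >= {MIN_REPLAY_RATIO})"))
    return checks

def junit_report(checks) -> str:
    """Serialize checks as a single JUnit-style <testsuite> element."""
    failures = sum(1 for _, ok, _ in checks if not ok)
    suite = ET.Element("testsuite", {"name": "trace-acceptance",
                                     "tests": str(len(checks)),
                                     "failures": str(failures)})
    for name, ok, detail in checks:
        case = ET.SubElement(suite, "testcase", {"name": name})
        if not ok:
            ET.SubElement(case, "failure", {"message": detail})
    return ET.tostring(suite, encoding="unicode")
```

A CI job would fail the merge whenever the `failures` attribute is non-zero.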
Implementation notes (drop‑in)
- Use libunwind/LLVM (Linux/macOS), DIA SDK (Windows).
- Add small shims for signal trampolines and runtime helpers (Go/Rust/JVM/.NET) when present.
- Protobuf or JSON Lines for event logs; gzip + content‑address everything (sha256).
- Store provenance per frame (resolver, source, bundle hash).
- Provide a tiny CLI:
  - trace-capture --with-btf --pid ...
  - trace-resolve --bundle sha256:...
  - trace-replay --trace ... --seeds 5
  - trace-verify --bundle sha256:... --dsse --rekor
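The "gzip + content‑address" note for JSON Lines logs can be sketched as below. It assumes the address is the SHA-256 of the uncompressed bytes and that files live in a flat directory named by digest; both are illustrative choices. The gzip timestamp is zeroed so identical input also yields identical compressed files.

```python
import gzip
import hashlib
import json
import tempfile
from pathlib import Path

def write_event_log(events, store: Path) -> str:
    """Write events as gzipped JSON Lines; return the 'sha256:<hex>' reference."""
    raw = "".join(
        json.dumps(e, sort_keys=True, separators=(",", ":")) + "\n" for e in events
    ).encode("utf-8")
    digest = hashlib.sha256(raw).hexdigest()
    path = store / f"{digest}.jsonl.gz"
    if not path.exists():  # same content, same name: writing twice is a no-op
        path.write_bytes(gzip.compress(raw, mtime=0))
    return "sha256:" + digest

def read_event_log(ref: str, store: Path):
    """Load a log by reference and verify that the content matches its address."""
    digest = ref.split(":", 1)[1]
    raw = gzip.decompress((store / f"{digest}.jsonl.gz").read_bytes())
    if hashlib.sha256(raw).hexdigest() != digest:
        raise ValueError("content hash mismatch")
    return [json.loads(line) for line in raw.splitlines()]

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ref = write_event_log([{"event": "trace.capture", "trace_id": "t1"}], Path(tmp))
        print(ref, read_event_log(ref, Path(tmp)))
```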
Why this fits your stack (Stella Ops)
- Air‑gap/attestation first: DSSE, Rekor tile fragments, offline verification—aligns with your evidence model.
- Deterministic evidence: replayable traces with SLOs → reliable RCA artifacts you can store beside SBOM/VEX.
- Provenance: per‑frame resolver trail supports auditor queries (“how was this line derived?”).
Next steps (ready‑made tasks)
- Add a SymbolBundleBuilder job to produce DSSE‑signed bundles per release.
- Integrate Capture→Resolve→Replay into CI and gate merges on SLOs above.
- Expose a Stella Ops Evidence card: coverage%, success ratio, verifier version, and links to frames.
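For the SymbolBundleBuilder task, the DSSE envelope step could start from a sketch like this one. The pre-authentication encoding (PAE) and the envelope field names follow the DSSE specification. The signer is a stand-in: HMAC-SHA256 with a shared key is used only to keep the example self-contained, where a real job would sign with its release key. The function names are hypothetical.

```python
import base64
import hashlib
import hmac

def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE pre-authentication encoding: the bytes that actually get signed."""
    t = payload_type.encode("utf-8")
    return b"DSSEv1 %d %b %d %b" % (len(t), t, len(payload), payload)

def sign_envelope(payload_type: str, payload: bytes, key: bytes, keyid: str) -> dict:
    """Build a DSSE envelope. HMAC is a placeholder for the real signature scheme."""
    sig = hmac.new(key, pae(payload_type, payload), hashlib.sha256).digest()
    return {
        "payload": base64.b64encode(payload).decode("ascii"),
        "payloadType": payload_type,
        "signatures": [{"keyid": keyid, "sig": base64.b64encode(sig).decode("ascii")}],
    }

def verify_envelope(envelope: dict, key: bytes) -> bytes:
    """Return the payload if any signature verifies; raise ValueError otherwise."""
    payload = base64.b64decode(envelope["payload"])
    expected = hmac.new(key, pae(envelope["payloadType"], payload),
                        hashlib.sha256).digest()
    for s in envelope["signatures"]:
        if hmac.compare_digest(base64.b64decode(s["sig"]), expected):
            return payload
    raise ValueError("no valid signature")
```

Signing the PAE rather than the raw payload binds the payload type into the signature, so a verifier cannot be tricked into interpreting the bytes as a different document type.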
If you want, I’ll generate a starter repo (CLI skeleton, DSSE/Rekor validators, Postgres migrations, CI workflow, and a tiny sample bundle) so you can try it immediately.