git.stella-ops.org/docs/runbooks/reachability-runtime.md
StellaOps Bot 1c782897f7
2025-11-26 07:47:08 +02:00


Runbook: Runtime Reachability Facts (Zastava → Signals)

Goal

Stream runtime symbol evidence from Zastava Observer to Signals in NDJSON batches that align with the runtime/static union schema, stay deterministic, and are replayable.

Endpoints

  • Signals structured ingest: POST /signals/runtime-facts
  • Signals NDJSON ingest: POST /signals/runtime-facts/ndjson
    • Headers: Content-Encoding: gzip (optional), Content-Type: application/x-ndjson
    • Query/header metadata: callgraphId (required), scanId|imageDigest|component+version, optional source

NDJSON event shape (one per line)

{
  "symbolId": "pkg:python/django.views:View.as_view",
  "codeId": "buildid-abc123",
  "purl": "pkg:pypi/django@4.2.7",
  "loaderBase": "0x7f23c01000",
  "processId": 214,
  "processName": "uwsgi",
  "containerId": "c123",
  "socketAddress": "10.0.0.5:8443",
  "hitCount": 3,
  "observedAt": "2025-11-26T12:00:00Z",
  "metadata": { "pid": "214" }
}

Required: symbolId, hitCount. callgraphId is supplied via query/header metadata rather than in the event body. The remaining fields shown are optional and aid correlation.
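A client-side check of the required fields can catch malformed events before they are sent. The sketch below is illustrative only: `validate_event_line` is a hypothetical helper, and the rule that hitCount must be a non-negative integer is an assumption, since the runbook only states that the field is required.

```python
import json


def validate_event_line(line: str) -> dict:
    """Parse one NDJSON line and check the two required fields."""
    event = json.loads(line)
    if not isinstance(event, dict):
        raise ValueError("event must be a JSON object")

    symbol_id = event.get("symbolId")
    if not isinstance(symbol_id, str) or not symbol_id:
        raise ValueError("symbolId is required")

    hit_count = event.get("hitCount")
    # bool is a subclass of int in Python, so exclude it explicitly.
    # Assumption: hitCount is a non-negative integer.
    if isinstance(hit_count, bool) or not isinstance(hit_count, int) or hit_count < 0:
        raise ValueError("hitCount must be a non-negative integer")

    return event
```

Running this over every line of a batch before upload gives a local failure instead of a rejected request.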

Batch rules

  • NDJSON MUST NOT be empty; empty streams are rejected.
  • Compress with gzip when batches are large (the Observer setup below uses a 64 KiB threshold); maintain stable line ordering.
  • Use UTC timestamps (ISO-8601 observedAt).
  • Avoid PII; redact process/user info before send.
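The batch rules above can be sketched as a small serializer. This is a minimal illustration, not the Observer's implementation: `to_utc_iso` and `serialize_batch` are hypothetical names, and the sorted-key JSON output and the symbolId tie-break are assumptions added to make the byte output repeatable.

```python
import json
from datetime import datetime, timezone


def to_utc_iso(ts: datetime) -> str:
    """Render a timezone-aware datetime as ISO-8601 UTC with a trailing Z."""
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def serialize_batch(events: list[dict]) -> bytes:
    """Serialize events as NDJSON with a stable line order."""
    if not events:
        # Mirrors the server-side rule: empty streams are rejected.
        raise ValueError("runtime fact stream was empty")
    # String comparison matches time order only if every observedAt uses
    # the same UTC format; symbolId breaks ties (an assumption).
    ordered = sorted(events, key=lambda e: (e.get("observedAt", ""), e["symbolId"]))
    lines = [json.dumps(e, sort_keys=True, separators=(",", ":")) for e in ordered]
    return ("\n".join(lines) + "\n").encode("utf-8")
```

With this shape, the same set of events yields the same bytes regardless of the order in which they were collected.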

CAS alignment

  • When runtime trace bundles are produced, store under cas://runtime_traces/<hh>/<sha>.tar.zst and include meta.json with analysisId.
  • Pass the same analysisId in X-Analysis-Id (if present) when uploading union bundles so replay manifests can link graphs+traces.
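
As an illustration of the CAS layout, the sketch below derives a URI from a bundle's bytes. The runbook does not define `<hh>` or the hash algorithm, so two things here are assumptions: that `<sha>` is a SHA-256 hex digest of the bundle, and that `<hh>` is its first two hex characters. `cas_trace_uri` is a hypothetical helper.

```python
import hashlib


def cas_trace_uri(bundle: bytes) -> str:
    """Build a cas://runtime_traces URI for a .tar.zst bundle.

    Assumptions: <sha> is the SHA-256 hex digest of the bundle bytes and
    <hh> is the first two hex characters of that digest.
    """
    sha = hashlib.sha256(bundle).hexdigest()
    return f"cas://runtime_traces/{sha[:2]}/{sha}.tar.zst"
```

If the platform's CAS uses a different digest or sharding rule, only the two marked assumptions need to change.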

Errors & remediation

  • 400 callgraphId is required → set callgraphId header/query.
  • 400 runtime fact stream was empty → ensure NDJSON has events.
  • 400 Subject must include scanId/imageDigest/component+version → populate subject metadata.
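
The three 400 responses above can be anticipated with a local preflight check. This sketch assumes the subject rule means any one of scanId, imageDigest, or the component+version pair is sufficient, which is how the endpoint metadata reads; `preflight_errors` is a hypothetical helper, not part of Signals.

```python
def preflight_errors(metadata: dict, event_count: int) -> list[str]:
    """Return the 400 messages a request would likely trigger, if any."""
    errors = []
    if not metadata.get("callgraphId"):
        errors.append("callgraphId is required")
    if event_count == 0:
        errors.append("runtime fact stream was empty")
    # Assumption: one of scanId, imageDigest, or component+version suffices.
    has_subject = (
        bool(metadata.get("scanId"))
        or bool(metadata.get("imageDigest"))
        or (bool(metadata.get("component")) and bool(metadata.get("version")))
    )
    if not has_subject:
        errors.append("Subject must include scanId/imageDigest/component+version")
    return errors
```

An empty list means the request passes these three checks; it does not guarantee the server accepts it.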

Determinism checklist

  • Stable ordering of NDJSON lines.
  • No host-dependent paths; only IDs/digests.
  • Fixed gzip level if used (suggest 6) to aid reproducibility.
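
For the gzip item in the checklist, one way to keep compressed output repeatable is to pin the level and zero the header timestamp. The sketch below uses Python's standard `gzip.compress` (the `mtime` argument needs Python 3.8 or later); `gzip_deterministic` is a hypothetical wrapper. Byte-identical output is only expected when the same zlib build is used on both sides.

```python
import gzip


def gzip_deterministic(payload: bytes, level: int = 6) -> bytes:
    """Compress with a fixed level and no embedded timestamp."""
    # mtime=0 keeps the current time out of the gzip header, which would
    # otherwise make two runs over the same input differ.
    return gzip.compress(payload, compresslevel=level, mtime=0)
```

On the command line, `gzip -n` serves a similar purpose by omitting the original name and timestamp.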

Zastava Observer setup (runtime sampler)

  • Sampling mode: deterministic EntryTrace sampler; default 1:1 (no drop) for pilot. Enable rate/CPU guard: Sampler:MaxEventsPerSecond (default 500), Sampler:MaxCpuPercent (default 35). When rates are exceeded, emit sampler.dropped counters with drop reason rate_limit/cpu_guard.
  • Symbol capture: enable build-id collection (SymbolCapture:CollectBuildIds=true) and loader base addresses (SymbolCapture:EmitLoaderBase=true) to match static graphs.
  • Batching: buffer up to 1,000 events or 2s, whichever comes first (Ingest:BatchSize, Ingest:FlushIntervalMs). Batches are sorted by observedAt before send to keep deterministic order.
  • Transport: POST NDJSON to Signals /signals/runtime-facts/ndjson with the X-Callgraph-Id header and, optionally, X-Analysis-Id. Set Content-Encoding: gzip when batches exceed 64 KiB.
  • CAS traces (optional): if EntryTrace raw traces are persisted, package as cas://runtime_traces/<hh>/<sha>.tar.zst with meta.json containing analysisId, nodeCount, edgeCount, traceVersion. Include the CAS URI in metadata.casUri on each NDJSON event.
  • Security/offline: disable egress by default; allowlist only the Signals host. TLS must be enabled; supply client certs per platform runbook if required. No PID/user names are emitted—only digests/IDs.
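
The batching and transport bullets above can be sketched as follows. This is an illustration under stated assumptions, not the Observer's code: `EventBatcher` and `needs_gzip` are hypothetical names, the flush timer is assumed to start at the first buffered event, and the clock is injectable so the behavior can be tested without sleeping.

```python
import time


class EventBatcher:
    """Buffer events and flush at batch_size events or flush_interval_ms."""

    def __init__(self, batch_size=1000, flush_interval_ms=2000, clock=time.monotonic):
        self.batch_size = batch_size
        self.flush_interval_s = flush_interval_ms / 1000.0
        self._clock = clock
        self._buffer = []
        self._first_at = None

    def add(self, event: dict):
        """Buffer one event; return a sorted batch if a flush is due, else None."""
        if not self._buffer:
            # Assumption: the interval is measured from the first buffered event.
            self._first_at = self._clock()
        self._buffer.append(event)
        return self.poll()

    def poll(self):
        """Return a sorted batch if either threshold is reached, else None."""
        if not self._buffer:
            return None
        full = len(self._buffer) >= self.batch_size
        expired = self._clock() - self._first_at >= self.flush_interval_s
        if not (full or expired):
            return None
        # Sort by observedAt before sending, as the runbook specifies.
        batch = sorted(self._buffer, key=lambda e: e["observedAt"])
        self._buffer = []
        self._first_at = None
        return batch


def needs_gzip(body: bytes, threshold: int = 64 * 1024) -> bool:
    """True when the serialized batch exceeds the 64 KiB threshold."""
    return len(body) > threshold
```

A caller would invoke `poll()` on a timer so that a quiet stream still flushes after the interval.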

Example appsettings (Observer)

{
  "Sampler": {
    "MaxEventsPerSecond": 500,
    "MaxCpuPercent": 35
  },
  "SymbolCapture": {
    "CollectBuildIds": true,
    "EmitLoaderBase": true
  },
  "Ingest": {
    "BatchSize": 1000,
    "FlushIntervalMs": 2000,
    "Endpoint": "https://signals.local/signals/runtime-facts/ndjson",
    "Headers": {
      "X-Callgraph-Id": "cg-123"
    }
  }
}
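
To make the two Sampler settings concrete, here is a rough sketch of a guard that admits or drops events and counts drops by reason. It is not the Observer's sampler: `SamplerGuard` is hypothetical, the fixed one-second window is an assumed rate-limiting strategy, and the CPU reading is passed in by the caller rather than measured.

```python
from collections import Counter


class SamplerGuard:
    """Admit events until the rate or CPU ceiling is exceeded."""

    def __init__(self, max_events_per_second=500, max_cpu_percent=35):
        self.max_eps = max_events_per_second
        self.max_cpu = max_cpu_percent
        self.dropped = Counter()  # drop reason -> count
        self._window = None
        self._count = 0

    def admit(self, now_s: float, cpu_percent: float) -> bool:
        """Return True to keep the event, False to drop it."""
        if cpu_percent > self.max_cpu:
            self.dropped["cpu_guard"] += 1
            return False
        # Assumption: a fixed one-second window keyed on the whole second.
        window = int(now_s)
        if window != self._window:
            self._window = window
            self._count = 0
        if self._count >= self.max_eps:
            self.dropped["rate_limit"] += 1
            return False
        self._count += 1
        return True
```

During a pilot, both counters in `dropped` would be expected to stay at zero.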

Operational steps

  1. Enable EntryTrace sampler in Zastava Observer with the config above; verify sampler.dropped stays at 0 during pilot.
  2. Run a 5-minute capture and send NDJSON to a staging Signals instance using the smoke test; confirm a 202 response and that CAS pointers are recorded.
  3. Correlate runtime facts to static graphs by callgraphId in Signals; ensure the event counts match the sampler totals.
  4. Promote config to prod/offline bundle; freeze config hashes for replay.
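
Steps 3 and 4 can be supported with two small helpers. Both are hypothetical sketches: the runbook does not say how config hashes are computed, so SHA-256 over key-sorted compact JSON is an assumption, and `total_hits` simply sums hitCount across a captured NDJSON file for comparison with the sampler totals.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """Hash a config in a key-order-independent way (assumed scheme)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def total_hits(ndjson_text: str) -> int:
    """Sum hitCount over all non-blank NDJSON lines."""
    return sum(
        json.loads(line)["hitCount"]
        for line in ndjson_text.splitlines()
        if line.strip()
    )
```

Recording the hash alongside the promoted bundle makes later drift in the config detectable.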

Smoke test

gzip -n -6 -c events.ndjson | \
  curl -X POST "https://signals.local/signals/runtime-facts/ndjson?callgraphId=cg-123&component=web&version=1.0.0" \
       -H "Content-Type: application/x-ndjson" \
       -H "Content-Encoding: gzip" \
       --data-binary @-

Expect 202 Accepted with a SubjectKey in the response; Signals will recompute reachability and emit signals.fact.updated@v1.
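
Where curl is not available, the same request can be assembled with the Python standard library. This is a sketch under the same assumptions as the smoke test (host signals.local, subject given as component and version); `build_ingest_request` is a hypothetical helper that only builds the request and does not send it.

```python
import gzip
import urllib.parse
import urllib.request


def build_ingest_request(base_url, ndjson, callgraph_id, component, version):
    """Build a gzip-compressed NDJSON POST for the Signals ingest endpoint."""
    query = urllib.parse.urlencode(
        {"callgraphId": callgraph_id, "component": component, "version": version}
    )
    # Fixed level and zeroed timestamp keep the body repeatable.
    body = gzip.compress(ndjson, compresslevel=6, mtime=0)
    return urllib.request.Request(
        url=f"{base_url}/signals/runtime-facts/ndjson?{query}",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/x-ndjson",
            "Content-Encoding": "gzip",
        },
    )
```

Passing the result to `urllib.request.urlopen` sends it; the response's `status` attribute can then be compared against 202.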