git.stella-ops.org/docs/runbooks/reachability-runtime.md
2025-11-12 10:01:54 +02:00


Runbook — Reachability Runtime Ingestion

Audience: Signals Guild · Zastava Guild · Scanner Guild · Ops Guild
Prereqs: docs/reachability/DELIVERY_GUIDE.md, docs/reachability/function-level-evidence.md, docs/modules/platform/architecture-overview.md §5

This runbook documents how to stage, ingest, and troubleshoot runtime evidence (/signals/runtime-facts) so function-level reachability data remains provable across online and air-gapped environments.


1 · Runtime capture pipeline

  1. Zastava Observer / runtime probes
    • Emit NDJSON lines with symbolId, codeId, loaderBase, hitCount, process{Id,Name}, socketAddress, containerId, optional evidenceUri, and metadata map.
    • Compress large batches with gzip (.ndjson.gz), keep each chunk at or below 10 MiB, and keep timestamps monotonic.
    • Attach subject context via HTTP query (scanId, imageDigest, component, version) when using the streaming endpoint.
  2. CAS staging (optional but recommended)
    • Upload raw batches to cas://reachability/runtime/<sha256> before ingestion.
    • Store CAS URIs alongside probe metadata so Signals can echo them in ReachabilityFactDocument.Metadata.
  3. Signals ingestion
    • POST /signals/runtime-facts (JSON) for one-off uploads or stream NDJSON to /signals/runtime-facts/ndjson (set Content-Encoding: gzip when applicable).
    • Signals validates schema, dedupes events by (symbolId, codeId, loaderBase), and updates runtimeFacts with cumulative hitCount.
  4. Reachability scoring
    • ReachabilityScoringService recomputes lattice states (Unknown → Observed), persists references to runtime CAS artifacts, and emits signals.fact.updated once GAP-SIG-003 lands.
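
The dedupe rule in step 3 can be illustrated with a minimal sketch. It is not the Signals implementation: the function name merge_runtime_events is hypothetical, the sample values are invented, and treating a missing hitCount as 0 is an assumption.

```python
import json


def merge_runtime_events(ndjson_text):
    """Merge events that share (symbolId, codeId, loaderBase), summing hitCount.

    Assumption: a missing hitCount contributes 0 to the cumulative total.
    """
    merged = {}
    for line in ndjson_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between NDJSON records
        event = json.loads(line)
        key = (event["symbolId"], event["codeId"], event.get("loaderBase"))
        if key in merged:
            merged[key]["hitCount"] += event.get("hitCount", 0)
        else:
            fact = dict(event)
            fact["hitCount"] = event.get("hitCount", 0)
            merged[key] = fact
    # dicts preserve insertion order, so facts come back in first-seen order
    return list(merged.values())


if __name__ == "__main__":
    # Invented sample values, for illustration only.
    sample = "\n".join([
        json.dumps({"symbolId": "sym:a", "codeId": "code:1", "loaderBase": "0x1000", "hitCount": 3}),
        json.dumps({"symbolId": "sym:a", "codeId": "code:1", "loaderBase": "0x1000", "hitCount": 2}),
        json.dumps({"symbolId": "sym:b", "codeId": "code:1", "loaderBase": "0x1000", "hitCount": 1}),
    ])
    for fact in merge_runtime_events(sample):
        print(fact["symbolId"], fact["hitCount"])
```

Running the same merge on the probe side before upload is one way to shrink batches, provided the probe keeps the raw batch for CAS staging.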

2 · Operator checklist

| Step | Action | Owner | Notes |
|------|--------|-------|-------|
| 1 | Verify probe health (zastava observer status) and confirm NDJSON batches include symbolId + codeId. | Runtime Guild | Reject batches missing symbolId; restart probe with debug logging. |
| 2 | Stage batches in CAS (stella cas put reachability/runtime ...) and record the returned URI. | Ops Guild | Required for replay-grade evidence. |
| 3 | Call /signals/runtime-facts/ndjson with tenant and callgraphId headers, streaming the gzip payload. | Signals Guild | Use service identity with signals.runtime:write. |
| 4 | Monitor ingestion metrics: signals_runtime_events_total, signals_runtime_ingest_failures_total. | Observability | Alert if failures exceed 1% over 5 min. |
| 5 | Trigger recompute (POST /signals/reachability/recompute) when new runtime batches arrive for an active scan. | Signals Guild | Provide callgraphId + subject tuple. |
| 6 | Validate Policy/UI surfaces by requesting /policy/findings?includeReachability=true and checking reachability.evidence. | Policy + UI Guilds | Ensure evidence references the CAS URIs from Step 2. |
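
The Step 4 alert threshold can be expressed as a small check over counter deltas. This is a sketch under assumptions: should_alert is a hypothetical helper, the deltas are the increase of each counter over the 5-minute window, and the ratio is taken as failures divided by signals_runtime_events_total (whether that counter includes failed events is not stated here).

```python
def should_alert(events_delta, failures_delta, threshold=0.01):
    """Return True when failures exceed `threshold` (1% by default) of events.

    events_delta / failures_delta: counter increases over the alert window
    (assumed to be 5 minutes, as in the checklist).
    """
    if events_delta <= 0:
        # No events recorded: any failure at all is worth an alert.
        return failures_delta > 0
    return failures_delta / events_delta > threshold


if __name__ == "__main__":
    print(should_alert(events_delta=1000, failures_delta=5))   # 0.5% of events
    print(should_alert(events_delta=1000, failures_delta=20))  # 2% of events
```

In practice the same condition would normally live in the alerting system's rule language; the function only makes the arithmetic explicit.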

3 · Air-gapped workflow

  1. Export runtime NDJSON batches via Offline Kit: offline/reachability/runtime/<scan-id>/<timestamp>.ndjson.gz + manifest.
  2. On the secure network, load CAS entries locally (stella cas load ...) and invoke stella signals runtime-facts ingest --from offline/....
  3. Re-run stella replay manifest.json --section reachability to ensure manifests cite the imported runtime digests.
  4. Sync ingestion receipts (signals-runtime-ingest.log) back to the air-gapped environment for audit.
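
Before importing a batch, it can help to confirm that its bytes still match the digest in its CAS URI. The sketch below assumes the <sha256> path segment is the lowercase hex SHA-256 of the gzip file exactly as staged; that derivation, and the helper names cas_uri_for and verify_batch, are assumptions rather than documented behaviour of stella cas.

```python
import gzip
import hashlib

CAS_PREFIX = "cas://reachability/runtime/"


def cas_uri_for(batch_bytes):
    """Build the CAS URI for a staged batch (assumed: hex SHA-256 of the raw bytes)."""
    return CAS_PREFIX + hashlib.sha256(batch_bytes).hexdigest()


def verify_batch(batch_bytes, expected_uri):
    """Return True when the batch bytes still hash to the recorded CAS URI."""
    return cas_uri_for(batch_bytes) == expected_uri


if __name__ == "__main__":
    # mtime=0 keeps the gzip header stable, so recompressing yields identical bytes.
    payload = gzip.compress(b'{"symbolId":"sym:a","codeId":"code:1","hitCount":1}\n', mtime=0)
    uri = cas_uri_for(payload)
    print(uri)
    print(verify_batch(payload, uri))            # unchanged batch
    print(verify_batch(payload + b"\x00", uri))  # mutated batch
```

Hashing the compressed file rather than the decompressed NDJSON means any recompression changes the digest, which is why the original gzip should be preserved for re-staging.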

4 · Troubleshooting

| Symptom | Cause | Resolution |
|---------|-------|------------|
| 422 Unprocessable Entity: missing symbolId | Probe emitted incomplete JSON. | Restart probe with --include-symbols, confirm symbol server availability, regenerate batch. |
| 403 Forbidden: sealed-mode evidence invalid | Signals sealed-mode verifier rejected payload (likely missing CAS proof). | Upload batch to CAS first, include X-Reachability-Cas-Uri header, or disable sealed-mode in non-prod. |
| Runtime facts missing from Policy/UI | Recompute not triggered or callgraphId mismatch. | List facts via /signals/reachability/facts?subject=..., confirm callgraphId, then POST recompute. |
| CAS hash mismatch during replay | Batch mutated post-ingestion. | Re-stage from original gzip, invalidate old CAS entry, rerun ingestion to regenerate manifest references. |
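
A pre-flight check on the NDJSON batch can catch the 422 case before upload. The sketch below is illustrative only: find_invalid_lines is a hypothetical helper, and it checks just symbolId and codeId (the two fields the operator checklist calls out), not the full Signals schema.

```python
import json

# Assumption: only the fields named in the operator checklist are enforced here.
REQUIRED_FIELDS = ("symbolId", "codeId")


def find_invalid_lines(ndjson_text):
    """Return (line_number, reason) tuples for lines likely to be rejected."""
    problems = []
    for number, line in enumerate(ndjson_text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are skipped, not reported
        try:
            event = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, "invalid JSON: " + exc.msg))
            continue
        if not isinstance(event, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = [name for name in REQUIRED_FIELDS if not event.get(name)]
        if missing:
            problems.append((number, "missing " + ", ".join(missing)))
    return problems


if __name__ == "__main__":
    batch = '{"symbolId":"sym:a","codeId":"code:1"}\n{"codeId":"code:1"}\n{broken'
    for number, reason in find_invalid_lines(batch):
        print(f"line {number}: {reason}")
```

An empty result does not guarantee acceptance by Signals; it only rules out the most common cause listed in the table.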

5 · Retention & observability

  • Default retention: 30 days hot in Signals Mongo, 180 days in CAS (match replay policy). Configure via signals.runtimeFacts.retentionDays.
  • Metrics to alert on:
    • signals_runtime_ingest_latency_seconds (P95 < 2s).
    • signals_runtime_cas_miss_total (should be 0 once CAS is mandatory).
  • Logs/traces:
    • Category Reachability.Runtime records ingestion batches and CAS URIs.
    • Trace attributes: callgraphId, subjectKey, casUri, eventCount.
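
The default retention windows imply three states for a runtime batch as it ages. The sketch below makes that explicit; retention_tier and its tier labels are hypothetical, and treating the day boundaries as inclusive is an assumption.

```python
from datetime import datetime, timedelta, timezone

HOT_RETENTION_DAYS = 30   # default hot retention in Signals Mongo
CAS_RETENTION_DAYS = 180  # default retention in CAS


def retention_tier(ingested_at, now):
    """Classify a batch by age: 'hot', 'cas-only', or 'expired'.

    Assumption: boundaries are inclusive (exactly 30 days old is still hot).
    """
    age = now - ingested_at
    if age <= timedelta(days=HOT_RETENTION_DAYS):
        return "hot"
    if age <= timedelta(days=CAS_RETENTION_DAYS):
        return "cas-only"
    return "expired"


if __name__ == "__main__":
    now = datetime(2025, 1, 1, tzinfo=timezone.utc)
    for days in (10, 90, 200):
        print(days, retention_tier(now - timedelta(days=days), now))
```

A batch in the 'cas-only' state can still back a replay, but its facts would need re-ingestion to reappear in Signals.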

6 · References

  • docs/reachability/DELIVERY_GUIDE.md
  • docs/reachability/function-level-evidence.md
  • docs/replay/DETERMINISTIC_REPLAY.md
  • docs/modules/platform/architecture-overview.md §5 (Replay CAS)
  • docs/runbooks/replay_ops.md

Update this runbook whenever endpoints, retention knobs, or CAS layouts change.