up
@@ -1,80 +1,95 @@
# Runbook — Reachability Runtime Ingestion
# Runbook: Runtime Reachability Facts (Zastava → Signals)

> **Audience:** Signals Guild · Zastava Guild · Scanner Guild · Ops Guild
> **Prereqs:** `docs/reachability/DELIVERY_GUIDE.md`, `docs/reachability/function-level-evidence.md`, `docs/modules/platform/architecture-overview.md` §5
## Goal
Stream runtime symbol evidence from Zastava Observer to Signals in NDJSON batches that align with the runtime/static union schema, stay deterministic, and are replayable.

This runbook documents how to stage, ingest, and troubleshoot runtime evidence (`/signals/runtime-facts`) so function-level reachability data remains provable across online and air-gapped environments.
## Endpoints
- Signals structured ingest: `POST /signals/runtime-facts`
- Signals NDJSON ingest: `POST /signals/runtime-facts/ndjson`
- Headers: `Content-Encoding: gzip` (optional), `Content-Type: application/x-ndjson`
- Query/header metadata: `callgraphId` (required), a subject (`scanId`, `imageDigest`, or `component` + `version`), optional `source`

---
## NDJSON event shape (one per line)
```json
{
  "symbolId": "pkg:python/django.views:View.as_view",
  "codeId": "buildid-abc123",
  "purl": "pkg:pypi/django@4.2.7",
  "loaderBase": "0x7f23c01000",
  "processId": 214,
  "processName": "uwsgi",
  "containerId": "c123",
  "socketAddress": "10.0.0.5:8443",
  "hitCount": 3,
  "observedAt": "2025-11-26T12:00:00Z",
  "metadata": { "pid": "214" }
}
```
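
## 1 · Runtime capture pipeline
Required fields per NDJSON event: `symbolId` and `hitCount`. `callgraphId` is supplied via query/header metadata rather than in the event body. The other fields in the event shape are optional and aid correlation. The example event is pretty-printed for readability; on the wire each event occupies a single line.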
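
A pre-send check for the required fields can be sketched as follows. This is a minimal illustration, not the Signals validator: the function name `validate_ndjson`, the message wording, and the rule that `hitCount` must be a non-negative integer are assumptions made for the example.

```python
import json

# Required per the runbook; callgraphId travels in query/header metadata instead.
REQUIRED_FIELDS = ("symbolId", "hitCount")


def validate_ndjson(payload: str) -> list[str]:
    """Return human-readable problems; an empty list means the batch looks valid."""
    lines = [line for line in payload.splitlines() if line.strip()]
    if not lines:
        return ["stream is empty"]

    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            event = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(event, dict):
            problems.append(f"line {number}: expected a JSON object")
            continue
        for field in REQUIRED_FIELDS:
            if field not in event:
                problems.append(f"line {number}: missing {field}")
        if "hitCount" in event:
            hits = event["hitCount"]
            # Assumption: counts are non-negative integers (bool is excluded explicitly).
            if isinstance(hits, bool) or not isinstance(hits, int) or hits < 0:
                problems.append(f"line {number}: hitCount must be a non-negative integer")
    return problems
```

Running this in the probe before upload catches malformed batches locally instead of after a rejected request.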

1. **Zastava Observer / runtime probes**
   - Emit NDJSON lines with `symbolId`, `codeId`, `loaderBase`, `hitCount`, `process{Id,Name}`, `socketAddress`, `containerId`, optional `evidenceUri`, and a `metadata` map.
   - Compress large batches with gzip (`.ndjson.gz`), keep each chunk at or below 10 MiB, and keep timestamps monotonic.
   - Attach subject context via HTTP query (`scanId`, `imageDigest`, `component`, `version`) when using the streaming endpoint.
2. **CAS staging (optional but recommended)**
   - Upload raw batches to `cas://reachability/runtime/<sha256>` before ingestion.
   - Store CAS URIs alongside probe metadata so Signals can echo them in `ReachabilityFactDocument.Metadata`.
3. **Signals ingestion**
   - POST to `/signals/runtime-facts` (JSON) for one-off uploads, or stream NDJSON to `/signals/runtime-facts/ndjson` (set `Content-Encoding: gzip` when applicable).
   - Signals validates the schema, dedupes events by `(symbolId, codeId, loaderBase)`, and updates `runtimeFacts` with the cumulative `hitCount`.
4. **Reachability scoring**
   - `ReachabilityScoringService` recomputes lattice states (`Unknown → Observed`), persists references to runtime CAS artifacts, and emits `signals.fact.updated` once `GAP-SIG-003` lands.
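
The dedupe step in item 3 can be pictured with the sketch below. It is an illustration of merging by the `(symbolId, codeId, loaderBase)` key with a summed `hitCount`, not the Signals implementation; the function name `merge_runtime_events` and the choice to keep the first-seen values of all other fields are assumptions.

```python
def merge_runtime_events(events):
    """Collapse events sharing (symbolId, codeId, loaderBase), summing hitCount."""
    merged = {}
    for event in events:
        key = (event["symbolId"], event.get("codeId"), event.get("loaderBase"))
        if key in merged:
            merged[key]["hitCount"] += event["hitCount"]
        else:
            # Copy so the caller's event dictionaries are left untouched.
            # Assumption: non-key fields keep their first-seen values.
            merged[key] = dict(event)
    # Dictionaries preserve insertion order, so output order is first-seen order.
    return list(merged.values())
```

For example, two events for the same symbol, build ID, and loader base with `hitCount` 3 and 2 collapse into one fact with `hitCount` 5, while an event with a different `codeId` stays separate.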
## Batch rules
- NDJSON MUST NOT be empty; empty streams are rejected.
- Compress large batches with gzip; keep line ordering stable.
- Use UTC timestamps (ISO-8601 `observedAt`).
- Avoid PII; redact process/user info before sending.
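
A probe-side helper that applies these rules before upload might look like the sketch below. The names `build_ndjson`, `to_utc_iso`, and `SENSITIVE_FIELDS` are hypothetical, and the set of fields treated as sensitive is an assumption; the actual redaction policy is not defined in this runbook.

```python
import json
from datetime import datetime, timezone

# Hypothetical redaction list; replace with the fields your policy treats as PII.
SENSITIVE_FIELDS = ("processName", "metadata")


def to_utc_iso(ts: datetime) -> str:
    """Render an aware datetime as an ISO-8601 UTC string with a trailing Z."""
    if ts.tzinfo is None:
        raise ValueError("naive timestamps are ambiguous; attach a timezone first")
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_ndjson(events) -> str:
    """Serialize events one per line, refusing empty batches."""
    if not events:
        raise ValueError("runtime fact stream was empty")
    lines = []
    for event in events:
        cleaned = {k: v for k, v in event.items() if k not in SENSITIVE_FIELDS}
        if isinstance(cleaned.get("observedAt"), datetime):
            cleaned["observedAt"] = to_utc_iso(cleaned["observedAt"])
        # Sorted keys and fixed separators keep each line byte-stable.
        lines.append(json.dumps(cleaned, sort_keys=True, separators=(",", ":")))
    return "\n".join(lines) + "\n"
```

Rejecting naive timestamps is a deliberate choice in this sketch: guessing a timezone would silently break the UTC rule.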

---
## CAS alignment
- When runtime trace bundles are produced, store them under `cas://runtime_traces/<hh>/<sha>.tar.zst` and include a `meta.json` carrying the `analysisId`.
- Pass the same `analysisId` in `X-Analysis-Id` (if present) when uploading union bundles so replay manifests can link graphs and traces.
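
The CAS layout above can be illustrated with a small helper. Two details are assumptions, since the runbook does not define them: that `<sha>` is the SHA-256 hex digest of the bundle bytes, and that `<hh>` is the first two hex characters of that digest. The function names are hypothetical.

```python
import hashlib
import json


def runtime_trace_cas_uri(bundle: bytes) -> str:
    """Derive a CAS URI for a trace bundle from its content digest."""
    digest = hashlib.sha256(bundle).hexdigest()  # assumption: <sha> is SHA-256
    shard = digest[:2]                           # assumption: <hh> is the digest prefix
    return f"cas://runtime_traces/{shard}/{digest}.tar.zst"


def build_meta_json(analysis_id: str) -> str:
    """Serialize the meta.json that ties the bundle to an analysis run."""
    return json.dumps({"analysisId": analysis_id}, sort_keys=True)
```

The same `analysis_id` value would then be sent as `X-Analysis-Id` on upload, so the manifest and the bundle agree.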

## 2 · Operator checklist
## Errors & remediation
- `400 callgraphId is required` → set `callgraphId` header/query.
- `400 runtime fact stream was empty` → ensure NDJSON has events.
- `400 Subject must include scanId/imageDigest/component+version` → populate subject metadata.
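
A client can catch these three rejections before sending anything. The sketch below mirrors the error messages listed above; the function name `preflight_errors` and the shape of `params` (a plain dictionary of query/header values) are assumptions, and the body is expected to be the uncompressed NDJSON.

```python
def preflight_errors(params: dict, body: bytes) -> list[str]:
    """Return the 400 messages a request would likely trigger, in listed order."""
    errors = []
    if not params.get("callgraphId"):
        errors.append("callgraphId is required")
    if not body.strip():
        errors.append("runtime fact stream was empty")
    # A subject is any one of: scanId, imageDigest, or component together with version.
    has_subject = bool(
        params.get("scanId")
        or params.get("imageDigest")
        or (params.get("component") and params.get("version"))
    )
    if not has_subject:
        errors.append("Subject must include scanId/imageDigest/component+version")
    return errors
```

Note that `component` alone is not enough in this sketch: it counts as a subject only when `version` is also present.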

| Step | Action | Owner | Notes |
|------|--------|-------|-------|
| 1 | Verify probe health (`zastava observer status`) and confirm NDJSON batches include `symbolId` + `codeId`. | Runtime Guild | Reject batches missing `symbolId`; restart probe with debug logging. |
| 2 | Stage batches in CAS (`stella cas put reachability/runtime ...`) and record the returned URI. | Ops Guild | Required for replay-grade evidence. |
| 3 | Call `/signals/runtime-facts/ndjson` with `tenant` and `callgraphId` headers, streaming the gzip payload. | Signals Guild | Use service identity with `signals.runtime:write`. |
| 4 | Monitor ingestion metrics: `signals_runtime_events_total`, `signals_runtime_ingest_failures_total`. | Observability | Alert if failures exceed 1% over 5 min. |
| 5 | Trigger recompute (`POST /signals/reachability/recompute`) when new runtime batches arrive for an active scan. | Signals Guild | Provide `callgraphId` + subject tuple. |
| 6 | Validate Policy/UI surfaces by requesting `/policy/findings?includeReachability=true` and checking `reachability.evidence`. | Policy + UI Guilds | Ensure evidence references the CAS URIs from Step 2. |
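
The alert threshold in step 4 reduces to a ratio over a five-minute window. The sketch below shows one way to evaluate it from counter deltas. It assumes `signals_runtime_events_total` counts only successfully ingested events, so attempts are the sum of both deltas; if the counter already includes failures, divide by the events delta instead. The function name is hypothetical.

```python
def failure_ratio_exceeded(events_delta: int, failures_delta: int, threshold: float = 0.01) -> bool:
    """Return True when failures exceed the threshold share of ingest attempts.

    Both arguments are counter increases over the same window (for example 5 min).
    """
    attempts = events_delta + failures_delta  # assumption: the counters are disjoint
    if attempts == 0:
        return False  # no traffic in the window, nothing to alert on
    return failures_delta / attempts > threshold
```

The comparison is strict, so exactly 1% failures does not fire, which matches "exceed 1%".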
## Determinism checklist
- Stable ordering of NDJSON lines.
- No host-dependent paths; only IDs/digests.
- Fixed gzip level if used (level 6 suggested) to aid reproducibility.
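
The checklist can be made concrete with a small packing routine. This is a sketch under stated assumptions: the sort key `(observedAt, symbolId)` and the function name are illustrative choices, and zeroing the gzip header timestamp with `mtime=0` is an addition of this example rather than a rule from the runbook. Byte-identical output also depends on using the same zlib build on both sides.

```python
import gzip
import json


def deterministic_gzip(events) -> bytes:
    """Serialize events in a stable order and gzip them reproducibly."""
    # Assumed ordering: by timestamp, with symbolId as a tie-breaker.
    ordered = sorted(events, key=lambda e: (e.get("observedAt", ""), e["symbolId"]))
    ndjson = "".join(
        json.dumps(e, sort_keys=True, separators=(",", ":")) + "\n" for e in ordered
    )
    # Fixed level 6 as suggested above; mtime=0 keeps the header timestamp constant.
    return gzip.compress(ndjson.encode("utf-8"), compresslevel=6, mtime=0)
```

With this in place, the same set of events produces the same bytes regardless of arrival order or wall-clock time, which keeps the CAS digest stable across re-exports.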

---
## Zastava Observer setup (runtime sampler)
- **Sampling mode:** deterministic EntryTrace sampler; default 1:1 (no drop) for pilot. Enable the rate/CPU guard: `Sampler:MaxEventsPerSecond` (default 500), `Sampler:MaxCpuPercent` (default 35). When either limit is exceeded, emit `sampler.dropped` counters with drop reason `rate_limit` or `cpu_guard`.
- **Symbol capture:** enable build-id collection (`SymbolCapture:CollectBuildIds=true`) and loader base addresses (`SymbolCapture:EmitLoaderBase=true`) to match static graphs.
- **Batching:** buffer up to 1,000 events or 2 s, whichever comes first (`Ingest:BatchSize`, `Ingest:FlushIntervalMs`). Batches are sorted by `observedAt` before sending to keep the order deterministic.
- **Transport:** NDJSON POST to Signals `/signals/runtime-facts/ndjson` with headers `X-Callgraph-Id` and, optionally, `X-Analysis-Id`. Set `Content-Encoding: gzip` when batches exceed 64 KiB.
- **CAS traces (optional):** if EntryTrace raw traces are persisted, package them as `cas://runtime_traces/<hh>/<sha>.tar.zst` with a `meta.json` containing `analysisId`, `nodeCount`, `edgeCount`, and `traceVersion`. Include the CAS URI in `metadata.casUri` on each NDJSON event.
- **Security/offline:** disable egress by default; allowlist only the Signals host. TLS must be enabled; supply client certs per the platform runbook if required. No PIDs or user names are emitted; only digests/IDs are sent.
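
The batching rule (flush at 1,000 events or 2 s, whichever comes first, sorted by `observedAt`) can be sketched as below. This is not the Observer's code: the class name `EventBatcher`, the `send` callback, the injectable `clock`, and the `tick` method are all illustrative assumptions.

```python
import time


class EventBatcher:
    """Buffer events and flush on a size or age limit, whichever comes first."""

    def __init__(self, send, batch_size=1000, flush_interval_s=2.0, clock=time.monotonic):
        self._send = send
        self._batch_size = batch_size
        self._flush_interval_s = flush_interval_s
        self._clock = clock
        self._buffer = []
        self._first_event_at = None

    def add(self, event):
        if not self._buffer:
            self._first_event_at = self._clock()  # age is measured from the oldest event
        self._buffer.append(event)
        self._flush_if_due()

    def tick(self):
        """Call periodically so a quiet stream still flushes on the time limit."""
        self._flush_if_due()

    def _flush_if_due(self):
        if not self._buffer:
            return
        full = len(self._buffer) >= self._batch_size
        stale = self._clock() - self._first_event_at >= self._flush_interval_s
        if full or stale:
            self.flush()

    def flush(self):
        if not self._buffer:
            return
        # Stable sort: events with equal observedAt keep their arrival order.
        batch = sorted(self._buffer, key=lambda e: e["observedAt"])
        self._buffer = []
        self._first_event_at = None
        self._send(batch)
```

Sorting works on the raw strings here because UTC ISO-8601 timestamps in one fixed format order lexicographically the same way they order in time. The injected clock exists so the time limit can be tested without sleeping.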

## 3 · Air-gapped workflow
### Example appsettings (Observer)
```json
{
  "Sampler": {
    "MaxEventsPerSecond": 500,
    "MaxCpuPercent": 35
  },
  "SymbolCapture": {
    "CollectBuildIds": true,
    "EmitLoaderBase": true
  },
  "Ingest": {
    "BatchSize": 1000,
    "FlushIntervalMs": 2000,
    "Endpoint": "https://signals.local/signals/runtime-facts/ndjson",
    "Headers": {
      "X-Callgraph-Id": "cg-123"
    }
  }
}
```

1. Export runtime NDJSON batches via Offline Kit: `offline/reachability/runtime/<scan-id>/<timestamp>.ndjson.gz` + manifest.
2. On the secure network, load CAS entries locally (`stella cas load ...`) and invoke `stella signals runtime-facts ingest --from offline/...`.
3. Re-run `stella replay manifest.json --section reachability` to ensure manifests cite the imported runtime digests.
4. Sync ingestion receipts (`signals-runtime-ingest.log`) back to the air-gapped environment for audit.
### Operational steps
1) Enable the EntryTrace sampler in Zastava Observer with the config above; verify `sampler.dropped` stays at 0 during the pilot.
2) Run a 5-minute capture and send NDJSON to a staging Signals instance using the smoke test; confirm a 202 response and that CAS pointers are recorded.
3) Correlate runtime facts to static graphs by `callgraphId` in Signals; ensure counts match sampler totals.
4) Promote the config to the prod/offline bundle; freeze config hashes for replay.

---

## 4 · Troubleshooting

| Symptom | Cause | Resolution |
|---------|-------|------------|
| `422 Unprocessable Entity: missing symbolId` | Probe emitted incomplete JSON. | Restart probe with `--include-symbols`, confirm symbol server availability, regenerate batch. |
| `403 Forbidden: sealed-mode evidence invalid` | Signals sealed-mode verifier rejected payload (likely missing CAS proof). | Upload batch to CAS first, include `X-Reachability-Cas-Uri` header, or disable sealed-mode in non-prod. |
| Runtime facts missing from Policy/UI | Recompute not triggered or `callgraphId` mismatch. | List facts via `/signals/reachability/facts?subject=...`, confirm `callgraphId`, then POST recompute. |
| CAS hash mismatch during replay | Batch mutated post-ingestion. | Re-stage from original gzip, invalidate old CAS entry, rerun ingestion to regenerate manifest references. |

---

## 5 · Retention & observability

- Default retention: 30 days hot in Signals Mongo, 180 days in CAS (match replay policy). Configure via `signals.runtimeFacts.retentionDays`.
- Metrics to alert on:
  - `signals_runtime_ingest_latency_seconds` (P95 < 2 s).
  - `signals_runtime_cas_miss_total` (should be 0 once CAS is mandatory).
- Logs/traces:
  - Category `Reachability.Runtime` records ingestion batches and CAS URIs.
  - Trace attributes: `callgraphId`, `subjectKey`, `casUri`, `eventCount`.

---

## 6 · References

- `docs/reachability/DELIVERY_GUIDE.md`
- `docs/reachability/function-level-evidence.md`
- `docs/replay/DETERMINISTIC_REPLAY.md`
- `docs/modules/platform/architecture-overview.md` §5 (Replay CAS)
- `docs/runbooks/replay_ops.md`

Update this runbook whenever endpoints, retention knobs, or CAS layouts change.
## Smoke test
```bash
# Compress the NDJSON file and stream it to the ingest endpoint via stdin.
gzip -c events.ndjson | \
  curl -X POST "https://signals.local/signals/runtime-facts/ndjson?callgraphId=cg-123&component=web&version=1.0.0" \
    -H "Content-Type: application/x-ndjson" \
    -H "Content-Encoding: gzip" \
    --data-binary @-
```
Expect `202 Accepted` with a `SubjectKey` in the response; Signals then recomputes reachability and emits `signals.fact.updated@v1`.
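
For scripted checks, the same request can be assembled with the Python standard library. The sketch below only builds the request object so it can be inspected without network access; the function name `build_ingest_request` and its parameters are hypothetical, and the host is the placeholder used in the smoke test.

```python
import gzip
from urllib.parse import urlencode
from urllib.request import Request


def build_ingest_request(base_url, ndjson: bytes, callgraph_id, component, version) -> Request:
    """Assemble the gzip NDJSON POST used by the smoke test (nothing is sent)."""
    query = urlencode(
        {"callgraphId": callgraph_id, "component": component, "version": version}
    )
    return Request(
        f"{base_url}/signals/runtime-facts/ndjson?{query}",
        data=gzip.compress(ndjson, compresslevel=6, mtime=0),
        headers={
            "Content-Type": "application/x-ndjson",
            "Content-Encoding": "gzip",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the upload; TLS settings and any client certificates would need to be configured separately.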