up
34
docs/reachability/callgraph-formats.md
Normal file
@@ -0,0 +1,34 @@
# Reachability Callgraph Formats (richgraph-v1)

## Purpose
Normalize static callgraphs across languages so Signals can merge them with runtime traces and replay bundles deterministically.

## Core fields (per node/edge)
- `nodes[].id` — canonical SymbolID (language-specific, stable, lowercase where applicable).
- `nodes[].kind` — node type, e.g. `method`, `function`, `class`, `file`.
- `edges[].sourceId` / `edges[].targetId` — SymbolIDs of the edge endpoints; edge types include `call`, `import`, `inherit`, `reference`.
- `artifact` — CAS paths for source graph files; include `sha256`, `uri`, and optionally `generator` (analyzer name/version).

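As a rough illustration of the fields above, the following Python sketch builds a tiny graph document and serializes it in a stable way. The keys `nodes`, `edges`, `sourceId`, `targetId`, and `artifact` come from the list; the edge key `type`, the Go-style example SymbolIDs, the placeholder digest, and the analyzer name are assumptions made for illustration and are not part of the documented schema.

```python
import json

# Minimal richgraph-v1-style document. The edge key "type" and every example
# value below are assumptions for illustration only.
graph = {
    "nodes": [
        {"id": "example.com/app/main.main", "kind": "function"},
        {"id": "example.com/app/util.parse", "kind": "function"},
    ],
    "edges": [
        {
            "sourceId": "example.com/app/main.main",
            "targetId": "example.com/app/util.parse",
            "type": "call",
        }
    ],
    "artifact": {
        "sha256": "0" * 64,  # placeholder digest, not a real hash
        "uri": "cas://reachability_graphs/00/" + "0" * 64 + ".tar.zst",
        "generator": "example-analyzer/0.0.0",  # hypothetical analyzer name/version
    },
}

# Sorted keys and fixed separators keep the serialized bytes stable across runs.
serialized = json.dumps(graph, sort_keys=True, separators=(",", ":"))
```

Serializing with sorted keys is one simple way to keep graph files byte-identical between runs, which matters once the files are hashed into CAS.
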
## Language-specific notes
- **JVM**: use JVM internal names; include the signature to disambiguate overloads.
- **.NET/Roslyn**: fully-qualified method token; include assembly and module for cross-assembly edges.
- **Go SSA**: package path + function; include the receiver for methods.
- **Node/Deno TS**: module path + exported symbol; ES module graph only.
- **Rust MIR**: `crate::module::symbol`; monomorphized forms are allowed if stable.
- **Swift SIL**: mangled name; the demangled name is kept in metadata only.
- **Shell/binaries**: when present, use ELF/PE symbol+offset; mark `kind=binary`.

## CAS layout
- Store graph bundles under `reachability_graphs/<hh>/<sha>.tar.zst`.
- Bundle SHOULD contain `meta.json` with analyzer, version, language, component, and entry points (array).
- File order inside the tar must be lexicographic to keep hashes stable.

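The layout rules above can be sketched in Python using only the standard library. This is a minimal sketch under stated assumptions: it writes an uncompressed tar (the zstd step is omitted because it needs a third-party package), it hashes the tar bytes, and it treats `<hh>` as the first two hex characters of `<sha>`. Neither of the last two points is confirmed by this document, and the function names are hypothetical.

```python
import hashlib
import io
import tarfile


def build_deterministic_tar(files: dict[str, bytes]) -> bytes:
    """Pack files into an uncompressed tar with lexicographic member order."""
    buf = io.BytesIO()
    # Fixed format and constant metadata so identical inputs give identical bytes.
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in sorted(files):  # lexicographic order by code point
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = 0
            info.uid = info.gid = 0
            info.uname = info.gname = ""
            info.mode = 0o644
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def cas_path(bundle: bytes) -> str:
    """Derive a CAS path; assumes <hh> is the first two hex chars of <sha>."""
    sha = hashlib.sha256(bundle).hexdigest()
    return f"reachability_graphs/{sha[:2]}/{sha}.tar.zst"
```

Because member order and metadata are fixed, passing the same files in a different dictionary order yields the same bytes and therefore the same path.
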
## Validation rules
- No duplicate node IDs; edges must reference existing nodes.
- Entry points list must be present (even if empty) for Signals recompute.
- Graph SHA256 must match tar content; Signals rejects mismatched SHA.
- ASCII only by default; paths may contain UTF-8 but must be NFC-normalized.

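A minimal Python sketch of the structural checks above, assuming the graph and `meta.json` have already been parsed into dictionaries. The key names `entryPoints` and `paths` and the function name are hypothetical, and the SHA256 check is left out because it depends on how the bundle bytes are obtained.

```python
import unicodedata


def validate_graph(graph: dict, meta: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the graph passed."""
    errors = []
    seen = set()
    for node in graph.get("nodes", []):
        node_id = node["id"]
        if node_id in seen:
            errors.append(f"duplicate node id: {node_id}")
        seen.add(node_id)
    for edge in graph.get("edges", []):
        for key in ("sourceId", "targetId"):
            if edge[key] not in seen:
                errors.append(f"edge {key} references unknown node: {edge[key]}")
    # The entry points list must exist, although it may be empty.
    # "entryPoints" is an assumed key name.
    if not isinstance(meta.get("entryPoints"), list):
        errors.append("meta.json is missing the entry points list")
    # Non-ASCII paths are accepted only in NFC form. "paths" is an assumed key.
    for path in meta.get("paths", []):
        if unicodedata.normalize("NFC", path) != path:
            errors.append(f"path is not NFC-normalized: {path}")
    return errors
```

Returning all violations at once, rather than raising on the first, makes it easier to report every problem in a rejected bundle.
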
## References
- Union schema: `docs/reachability/runtime-static-union-schema.md`
- Delivery guide: `docs/reachability/DELIVERY_GUIDE.md`
48
docs/reachability/reachability.md
Normal file
@@ -0,0 +1,48 @@
# Reachability · Runtime + Static Union (v0.1)

## What this covers
- End-to-end flow for combining static callgraphs (Scanner) and runtime traces (Zastava) into replayable reachability bundles.
- Storage layout (CAS namespaces), manifest fields, and Signals APIs that consume/emit reachability facts.
- How unknowns/pressure and scoring are derived so Policy/UI can explain outcomes.

## Pipeline (at a glance)
1. **Scanner** emits language-specific callgraphs as `richgraph-v1` and packs them into CAS under `reachability_graphs/<hh>/<sha>.tar.zst` with manifest `meta.json`.
2. **Zastava Observer** streams NDJSON runtime facts (`symbol_id`, `code_id`, `hit_count`, `loader_base`, `cas_uri`) to Signals via `POST /signals/runtime-facts` or `POST /signals/runtime-facts/ndjson`.
3. **Union bundles** (runtime + static) are uploaded as ZIP to `POST /signals/reachability/union` with optional `X-Analysis-Id`; Signals stores them under `reachability_graphs/{analysisId}/`.
4. **Signals scoring** consumes union data + runtime facts, computes per-target states (bucket, weight, confidence, score), the fact-level score, and unknowns pressure, and publishes `signals.fact.updated@v1` events.
5. **Replay** records provenance: the reachability section in the replay manifest lists CAS URIs (graphs + runtime traces), namespaces, analyzer/version, callgraphIds, and the shared `analysisId`.

## Storage & CAS namespaces
- Static graphs: `cas://reachability_graphs/<hh>/<sha>.tar.zst` (meta.json + graph files).
- Runtime traces: `cas://runtime_traces/<hh>/<sha>.tar.zst` (NDJSON or zipped stream).
- Replay manifest now includes `analysisId` to correlate graphs/traces; each reference also carries `namespace` and `callgraphId` (static) for unambiguous replay.

## Signals API quick reference
- `POST /signals/runtime-facts` — structured request body; recomputes reachability.
- `POST /signals/runtime-facts/ndjson` — streaming NDJSON/gzip; requires `callgraphId`, supplied via query or header parameters.
- `POST /signals/reachability/union` — upload ZIP bundle; optional `X-Analysis-Id`.
- `GET /signals/reachability/union/{analysisId}/meta` — returns meta.json.
- `GET /signals/reachability/union/{analysisId}/files/{fileName}` — download bundled graph/trace files.
- `GET /signals/facts/{subjectKey}` — fetch latest reachability fact (includes unknowns counters and targets).

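To make the union upload concrete, here is a small Python sketch that prepares the request without sending it. The path and the `X-Analysis-Id` header come from the list above; the base URL, the `application/zip` content type, and the function name are assumptions, and authentication is not covered.

```python
import urllib.request
from typing import Optional


def build_union_upload(
    base_url: str, bundle_zip: bytes, analysis_id: Optional[str] = None
) -> urllib.request.Request:
    """Prepare (but do not send) a union bundle upload request."""
    headers = {"Content-Type": "application/zip"}  # assumed media type
    if analysis_id is not None:
        # The header is optional, so it is only set when an ID is supplied.
        headers["X-Analysis-Id"] = analysis_id
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/signals/reachability/union",
        data=bundle_zip,
        headers=headers,
        method="POST",
    )
```

The returned object can be passed to `urllib.request.urlopen` once a real Signals endpoint and credentials are available.
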
## Scoring and unknowns
- Buckets (default weights): entrypoint 1.0, direct 0.85, runtime 0.45, unknown 0.5, unreachable 0.0.
- Confidence: starts from a base value that differs for reachable and unreachable targets, adds a bonus for runtime evidence, and is clamped between Min/Max (defaults 0.05–0.99).
- Unknowns: Signals counts unresolved symbols/edges per subject; `UnknownsPressure = unknowns / (states + unknowns)` (capped). Fact score is reduced by `UnknownsPenaltyCeiling` (default 0.35) × pressure.
- Events: `signals.fact.updated@v1` now emits `unknownsCount` and `unknownsPressure` plus bucket/weight/stateCount/targets.

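The pressure formula and penalty above can be worked through in a few lines of Python. This is a sketch, not the Signals implementation: the text does not say whether "reduced by" means subtraction or scaling, so subtraction with a floor at zero is assumed here, and the cap on pressure is assumed to be 1.0. All function names are hypothetical.

```python
def clamp(value: float, low: float, high: float) -> float:
    """Clamp a value into [low, high], as done for confidence (0.05 to 0.99)."""
    return max(low, min(high, value))


def unknowns_pressure(states: int, unknowns: int, cap: float = 1.0) -> float:
    """unknowns / (states + unknowns), capped; 0.0 when there is nothing to count."""
    total = states + unknowns
    if total == 0:
        return 0.0
    return min(unknowns / total, cap)


def penalized_score(score: float, pressure: float, penalty_ceiling: float = 0.35) -> float:
    """Subtract ceiling * pressure from the score (subtraction is an assumption)."""
    return max(0.0, score - penalty_ceiling * pressure)
```

With 8 states and 2 unknowns the pressure is 2 / 10 = 0.2, so under the subtraction assumption a score of 0.85 drops by 0.35 × 0.2 = 0.07 to 0.78.
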
## Replay contract changes (v0.1 add-ons)
- `reachability.analysisId` (string, optional) — ties to Signals union ingest.
- Graph refs include `namespace`, `callgraphId`, `analyzer`, `version`, `sha256`, `casUri`.
- Runtime trace refs include `namespace`, `recordedAt`, `sha256`, `casUri`.

## Operator checklist
- Use deterministic CAS paths; never embed absolute file paths.
- When emitting runtime NDJSON, include `loader_base` and `code_id` when available for de-dup.
- Ensure `analysisId` is propagated from Scanner/Zastava into Signals ingest to keep replay manifests linked.
- Keep feeds frozen for reproducibility; avoid external downloads in union preparation.

## References
- Schema: `docs/reachability/runtime-static-union-schema.md`
- Delivery guide: `docs/reachability/DELIVERY_GUIDE.md`
- Unknowns registry & scoring: Signals code (`ReachabilityScoringService`, `UnknownsIngestionService`) and events doc `docs/signals/events-24-005.md`.
38
docs/reachability/runtime-facts.md
Normal file
@@ -0,0 +1,38 @@
# Runtime Facts (Signals/Zastava) v0.1

## Payload shapes
- **Structured** (`POST /signals/runtime-facts`):
  - `subject` (imageDigest | scanId | component+version)
  - `callgraphId` (required)
  - `events[]`: `{ symbolId, codeId?, purl?, buildId?, loaderBase?, processId?, processName?, socketAddress?, containerId?, evidenceUri?, hitCount, observedAt?, metadata{} }`
- **Streaming NDJSON** (`POST /signals/runtime-facts/ndjson`): one JSON object per line with the same fields; supports `Content-Encoding: gzip`; `callgraphId` is provided via query/header metadata.

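A short Python sketch of the streaming shape described above: one JSON object per line, gzip-compressed. The field names follow the `events[]` shape in this section; the function name and the choice to sort keys and zero the gzip timestamp (so repeated runs give identical bytes) are assumptions rather than documented requirements.

```python
import gzip
import json


def encode_ndjson(events: list[dict]) -> bytes:
    """Serialize events as gzip-compressed NDJSON, one JSON object per line."""
    if not events:
        # An empty stream would be rejected on ingest, so fail early.
        raise ValueError("runtime stream must not be empty")
    lines = [json.dumps(e, sort_keys=True, separators=(",", ":")) for e in events]
    payload = ("\n".join(lines) + "\n").encode("utf-8")
    # mtime=0 keeps the gzip header identical across runs.
    return gzip.compress(payload, mtime=0)
```

The resulting bytes would be sent with `Content-Encoding: gzip`, with `callgraphId` supplied separately through the query string or a header.
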
## Provenance/metadata
- Signals stamps:
  - `provenance.source` (defaults to `runtime` unless provided in metadata)
  - `provenance.ingestedAt` (ISO-8601 UTC)
  - `provenance.callgraphId`
- Runtime hits are aggregated per `symbolId` (summing hitCount) before persisting and feeding scoring.

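The aggregation step above amounts to a group-and-sum over events. A minimal Python sketch, assuming events are dictionaries with `symbolId` and `hitCount` keys as in the payload shape; the function name is hypothetical.

```python
from collections import defaultdict


def aggregate_hits(events: list[dict]) -> dict[str, int]:
    """Sum hitCount per symbolId."""
    totals: dict[str, int] = defaultdict(int)
    for event in events:
        totals[event["symbolId"]] += event["hitCount"]
    return dict(totals)
```

For example, two events for the same symbol with hit counts 2 and 5 collapse into a single entry with a count of 7.
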
## Validation
- `symbolId` is required; the events list must not be empty.
- `callgraphId` is required and must resolve to a stored callgraph/union bundle.
- Subject must yield a non-empty `subjectKey`.
- Empty runtime stream is rejected.

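These rules can be expressed as a small pre-flight check in Python. This is a sketch under stated assumptions: how Signals derives `subjectKey` is not described here, so `subject_key` below is a hypothetical stand-in, and the set of known callgraph IDs stands in for a lookup against stored callgraphs.

```python
def subject_key(subject: dict) -> str:
    """Hypothetical derivation; the real subjectKey format is not documented here."""
    if subject.get("imageDigest"):
        return subject["imageDigest"]
    if subject.get("scanId"):
        return subject["scanId"]
    if subject.get("component") and subject.get("version"):
        return f"{subject['component']}@{subject['version']}"
    return ""


def validate_runtime_facts(request: dict, known_callgraph_ids: set[str]) -> list[str]:
    """Return a list of rule violations; an empty list means the request passed."""
    errors = []
    events = request.get("events") or []
    if not events:
        errors.append("events list must not be empty")
    for index, event in enumerate(events):
        if not event.get("symbolId"):
            errors.append(f"events[{index}]: symbolId is required")
    callgraph_id = request.get("callgraphId")
    if not callgraph_id:
        errors.append("callgraphId is required")
    elif callgraph_id not in known_callgraph_ids:
        errors.append(f"callgraphId does not resolve to a stored callgraph: {callgraph_id}")
    if not subject_key(request.get("subject") or {}):
        errors.append("subject must yield a non-empty subjectKey")
    return errors
```

Running a check like this on the producer side lets a sender catch malformed batches before they reach the ingest endpoint.
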
## Storage and cache
- Stored alongside reachability facts in Mongo collection `reachability_facts`.
- Runtime hits cached in Redis via `reachability_cache:*` entries; invalidated on ingest.

## Interaction with scoring
- Ingest triggers a recompute: runtime hits are added to prior facts’ hits, targets are set to the observed symbols, and entryPoints are taken from the callgraph.
- Reachability states include runtime evidence on the path; bucket/weight may be `runtime` when hits are present.
- Unknowns registry stays separate; the unknowns count still factors into the fact score via the pressure penalty.

## Replay alignment
- Runtime traces are packaged under CAS namespace `runtime_traces` and referenced in the replay manifest with `namespace` and `analysisId` to link to static graphs.

## Determinism rules
- Keep NDJSON ordering stable when generating bundles.
- Use UTC timestamps; avoid environment-dependent metadata values.
- No external network lookups during ingest.