Files
git.stella-ops.org/docs/reachability/runtime-static-union-schema.md
2025-11-24 00:34:20 +02:00

130 lines
5.1 KiB
Markdown

# Runtime + Static Reachability Union Schema (v0.1, 2025-11-23)
## Goals
- Provide a single, deterministic graph shape that merges static lifter output and runtime traces across languages.
- Keep SymbolID stable across hosts (path/location independent) so CAS lookups are reproducible and cacheable.
- Make outputs offline-friendly: line-delimited JSON, UTF-8, sorted, with explicit content hashes.
## File layout (CAS)
- Namespace root: `reachability_graphs/<analysis_id>/` (analysis_id is caller-supplied UUID or hash).
- Files (all NDJSON, UTF-8, newline terminated, sorted as noted):
- `nodes.ndjson` (sorted by `symbol_id`)
- `edges.ndjson` (sorted by `from` then `to` then `edge_type`)
- `facts_runtime.ndjson` (sorted by `symbol_id`, optional)
- `meta.json` (single JSON object; schema version, produced_by, timestamps, tool versions, hashes)
- Hashing: SHA-256 of each file recorded in `meta.json` under `files[]` with `path`, `sha256`, `records`.
- Compression/packaging is left to the CAS store; files must be valid uncompressed NDJSON first.
## SymbolID (language-agnostic envelope)
```
symbol_id = "sym:" + <lang> + ":" + <stable-fragment>
```
- `lang`: `java|dotnet|go|node|deno|rust|swift|shell|binary`
- `stable-fragment`: SHA-256(base64url-no-pad) of the canonical tuple per language:
- **java**: (`package`, `class`, `method`, `descriptor`) lowercased, descriptor in JVM format.
- **dotnet**: (`assembly_name`, `namespace`, `type`, `member_signature`) using ECMA-335 signature string.
- **node/deno**: (`pkg_name_or_path`, `export_path`, `kind`) where `export_path` is slash-joined ESM/CJS path; `pkg_name_or_path` uses npm name or normalized absolute path with drive stripped.
- **go**: (`module_path`, `package_path`, `receiver`, `func`), with receiver empty for functions.
- **rust**: (`crate`, `module_path`, `item_name`, `mangled`)
- **swift**: (`module`, `type`, `member`, `swift-mangled`)
- **shell**: (`script_relpath`, `function_or_cmd`)
- **binary**: (`binary_build_id`, `section`, `symbol_name`)
## nodes.ndjson
Each line:
```
{
"symbol_id": "sym:lang:...",
"lang": "dotnet",
"kind": "function|method|type|module|package|binary",
"display": "Human readable name",
"source": {
"file": "relative/or/pkg/path",
"line": 123,
"col": 1,
"digest": "sha256:<hex>"
},
"attributes": {
"visibility": "public|internal|private",
"async": true,
"static": false,
"generic_arity": 2
}
}
```
Fields are optional when not applicable; omit rather than null. Additional language-specific fields allowed inside `attributes` (e.g., `jvm_descriptor`, `dotnet_signature`).
## edges.ndjson
Each line (static or runtime-derived; see `source`):
```
{
"from": "sym:...",
"to": "sym:...",
"edge_type": "call|import|inherits|loads|dynamic|reflects|dlopen|ffi|wasm|spawn",
"confidence": "certain|high|medium|low",
"source": {
"origin": "static|runtime",
"provenance": "jvm-bytecode|il|ts-ast|ssa|ebpf|etw|jfr|hook",
"evidence": "file:path:line"
}
}
```
- Ordering: primary `from`, secondary `to`, tertiary `edge_type`.
- Duplicate edges with different provenance are allowed; consumers deduplicate by (`from`,`to`,`edge_type`,`provenance`).
## facts_runtime.ndjson (optional)
Runtime-only observations attached to symbols:
```
{
"symbol_id": "sym:...",
"samples": {
"call_count": 14,
"first_seen_utc": "2025-11-22T18:21:12Z",
"last_seen_utc": "2025-11-22T18:23:01Z"
},
"env": {
"pid": 1234,
"image": "sha256:...",
"entrypoint": "main",
"tags": ["sealed","offline"]
}
}
```
Sorting by `symbol_id`. Time fields must be UTC ISO-8601 with `Z`.
## meta.json
```
{
"schema": "reachability-union@0.1",
"generated_at": "2025-11-23T00:00:00Z",
"produced_by": {
"tool": "StellaOps.Scanner.Worker",
"version": "0.1.0",
"analyzers": ["dotnet-11.1.0","jvm-8.0.0","node-6.2.0"]
},
"files": [
{"path":"nodes.ndjson","sha256":"...","records":1234},
{"path":"edges.ndjson","sha256":"...","records":4567},
{"path":"facts_runtime.ndjson","sha256":"...","records":89}
],
"options": {
"dedupe_edges": false,
"include_runtime": true
}
}
```
## Determinism rules
- Sort order as noted; no nulls; omit empty objects/arrays.
- All strings UTF-8 NFC; booleans lower-case; edge_type enumerated list above.
- Hash inputs use exact serialized bytes (no trailing spaces, newline `\n` only).
## Validation
- JSON Schema draft 2020-12 available at `docs/reachability/runtime-static-union-schema.json` (to be generated from this spec; allowable values match enumerations above).
- Minimal required fields: `symbol_id`, `lang`, `kind` (nodes); `from`, `to`, `edge_type`, `source.origin` (edges).
## Integration guidance
- Static lifters must emit SymbolIDs using the language rules; runtime probes must map call targets to the same SymbolID space (via demangled names + package/module resolution).
- CAS writers store each file under the namespace path and return the root manifest path for downstream consumers (Signals, Replay, Policy).
- Consumers should treat runtime edges as additive; when both origins exist, prefer `origin=runtime` for exploitability scoring but keep static edges for coverage.