git.stella-ops.org/docs/reachability/evidence-schema.md
StellaOps Bot efaf3cb789
2025-12-12 09:35:37 +02:00

Reachability Evidence Schema (Draft v1, Nov 2026)

Purpose: define the canonical fields for reachability graph nodes/edges, runtime facts, and unknowns so Scanner, Signals, Policy, Replay, CLI/UI, and SbomService stay aligned. This replaces scattered notes in advisories.

1. Core identifiers

  • symbol_id: canonical ID for a function/symbol; includes {format, build_id?, file_hash?, section?, addr, length} plus optional code_block_hash. Always deterministic and lowercase.
  • code_id: {format, build_id?, file_hash?, start, length, code_block_hash?}; used when symbol names are absent.
  • symbol_digest: sha256 of normalized signature (demangled name + params + return type; strip addresses). For stripped code, combine synthetic name + block hash.
  • purl: package URL of the owning component (from SBOM resolver); pkg:unknown when unresolved.
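
The symbol_digest rule above can be sketched as follows. This is a minimal illustration, not the Scanner's implementation: the exact normalized string layout, the address-stripping regex, and the `|` separator used for stripped code are assumptions made for this example.

```python
import hashlib
import re

# Assumption: addresses appear as hex literals such as 0x401000 or @0x401000.
_ADDRESS_RE = re.compile(r"@?0x[0-9a-fA-F]+")


def normalize_signature(demangled: str, params: list[str], return_type: str) -> str:
    """Join return type, name, and params, then strip addresses and extra spaces."""
    raw = f"{return_type} {demangled}({', '.join(params)})"
    without_addresses = _ADDRESS_RE.sub("", raw)
    return " ".join(without_addresses.split())


def symbol_digest(demangled: str, params: list[str], return_type: str) -> str:
    """sha256 over the normalized signature, with an algorithm prefix."""
    signature = normalize_signature(demangled, params, return_type)
    return "sha256:" + hashlib.sha256(signature.encode("utf-8")).hexdigest()


def stripped_symbol_digest(synthetic_name: str, code_block_hash: str) -> str:
    """For stripped code: combine a synthetic name with the block hash.

    The '|' separator is a hypothetical choice for this sketch.
    """
    material = f"{synthetic_name}|{code_block_hash}"
    return "sha256:" + hashlib.sha256(material.encode("utf-8")).hexdigest()
```

Because addresses are removed before hashing, the same function observed at two different load addresses yields the same digest.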

2. Graph payload (richgraph-v1 additions)

{
  "nodes": [
    {
      "id": "sym:sha256:...",
      "symbol_id": "func:ELF:sha256:...",
      "code_id": "code:ELF:sha256:...",
      "code_block_hash": "sha256:deadbeef...",
      "purl": "pkg:deb/ubuntu/openssl@3.0.2?arch=amd64",
      "symbol": { "mangled": "_Z15ssl3_read_bytes", "demangled": "ssl3_read_bytes", "source": "DWARF", "confidence": 0.98 },
      "build_id": "a1b2c3...",
      "lang": "c",
      "evidence": ["dwarf", "dynsym"],
      "analyzer": { "name": "scanner.native", "version": "1.2.0", "toolchain": "ghidra-11" }
    }
  ],
  "edges": [
    {
      "from": "sym:sha256:caller",
      "to": "sym:sha256:callee",
      "kind": "direct|plt|indirect|runtime",
      "purl": "pkg:deb/ubuntu/openssl@3.0.2?arch=amd64",          // callee owner
      "symbol_digest": "sha256:...",                              // callee digest
      "candidates": ["pkg:deb/openssl@3.0.2", "pkg:deb/openssl@3.0.1"],
      "confidence": 0.92,
      "evidence": ["import", "reloc@GOT"]
    }
  ],
  "roots": [
    { "id": "init_array@0x401000", "phase": "load", "source": "DT_INIT_ARRAY" },
    { "id": "main", "phase": "runtime" }
  ],
  "graph_hash": "blake3:..."
}

2.5 Attestation levels (hybrid default)

  • Graph DSSE (required): one DSSE envelope over the canonical graph JSON (sorted arrays/keys), with graph_hash set to the BLAKE3 digest of the body. Always publish to Rekor (or to a mirror when offline).
  • Edge-bundle DSSE (optional): batches of ≤512 edges, emitted only for high-signal cases (runtime, init_array/TLS roots, contested/third-party edges). Each bundle carries graph_hash, bundle_reason, per-edge reason, symbol_digest, purl, confidence, and optional revoked=true for quarantine. Rekor publish is configurable; CAS storage is mandatory.
  • CAS layout additions:
    • Graph body: cas://reachability/graphs/{blake3}
    • Graph DSSE: cas://reachability/graphs/{blake3}.dsse
    • Edge bundle: cas://reachability/edges/{graph_hash}/{bundle_id} + .dsse
  • Determinism: bundle ordering by (bundle_reason, edge_id); arrays sorted before hashing.
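
The batching and ordering rules for edge bundles could look roughly like the sketch below. It assumes each high-signal edge is a dict that already carries `edge_id` and `bundle_reason`, and that one bundle holds edges of a single reason; the `bundle_id` format is hypothetical.

```python
from itertools import groupby

MAX_EDGES_PER_BUNDLE = 512  # limit stated in the schema


def build_edge_bundles(graph_hash: str, edges: list[dict]) -> list[dict]:
    """Group edges into deterministic bundles of at most 512 edges each."""
    # Deterministic order: (bundle_reason, edge_id), as the schema requires.
    ordered = sorted(edges, key=lambda e: (e["bundle_reason"], e["edge_id"]))
    bundles = []
    for reason, group in groupby(ordered, key=lambda e: e["bundle_reason"]):
        same_reason = list(group)
        for start in range(0, len(same_reason), MAX_EDGES_PER_BUNDLE):
            chunk = same_reason[start:start + MAX_EDGES_PER_BUNDLE]
            bundles.append({
                "graph_hash": graph_hash,
                "bundle_reason": reason,
                # Hypothetical ID scheme: reason plus a zero-padded chunk index.
                "bundle_id": f"{reason}-{start // MAX_EDGES_PER_BUNDLE:04d}",
                "edges": chunk,
            })
    return bundles
```

Sorting before chunking means the same input edge set always produces the same bundles in the same order, regardless of the order in which the analyzer discovered the edges.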

3. Runtime facts (Signals ingestion)

Fields per NDJSON event:

  • symbolId (required), codeId, symbolDigest?, purl?
  • hitCount, observedAt, loaderBase, processId, processName, containerId, socketAddress?
  • callgraphId or scanId, plus evidenceUri (CAS) if trace stored externally
  • Determinism: sort keys when persisting; timestamps UTC ISO-8601.
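
A minimal sketch of persisting one runtime fact as an NDJSON line, assuming events arrive as Python dicts and that second-level timestamp precision is acceptable (the schema only says UTC ISO-8601). The function name is hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone


def to_ndjson_line(event: dict) -> str:
    """Serialize one runtime fact with sorted keys and a UTC ISO-8601 timestamp."""
    if not event.get("symbolId"):
        raise ValueError("symbolId is required")
    record = dict(event)  # do not mutate the caller's dict
    observed = record.get("observedAt")
    if isinstance(observed, datetime):
        if observed.tzinfo is None:
            raise ValueError("observedAt must be timezone-aware")
        # Convert to UTC and render with a trailing 'Z'.
        record["observedAt"] = observed.astimezone(timezone.utc).strftime(
            "%Y-%m-%dT%H:%M:%SZ"
        )
    # sort_keys gives a stable byte representation for the same event.
    return json.dumps(record, sort_keys=True, separators=(",", ":"))


# Example event observed in a +02:00 zone; it is stored in UTC.
event = {
    "symbolId": "func:ELF:sha256:abc",
    "hitCount": 3,
    "observedAt": datetime(2025, 12, 12, 9, 35, 37, tzinfo=timezone(timedelta(hours=2))),
}
line = to_ndjson_line(event)
```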

4. Unknowns registry payload

See docs/signals/unknowns-registry.md; reachability producers emit Unknowns when:

  • symbol→purl unresolved,
  • call edge target unresolved,
  • the build-id is missing for an ELF binary and the file hash is used instead.

Unknowns must include unknown_type, scope, provenance, confidence.p, and labels.
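
A producer-side check for the required Unknowns fields might look like the sketch below. It assumes that confidence.p denotes a `p` key inside a `confidence` object; the authoritative shape lives in docs/signals/unknowns-registry.md, and the function name is hypothetical.

```python
REQUIRED_TOP_LEVEL = ("unknown_type", "scope", "provenance", "labels")


def missing_unknown_fields(unknown: dict) -> list[str]:
    """Return the names of required fields that are absent from an Unknown."""
    missing = [name for name in REQUIRED_TOP_LEVEL if name not in unknown]
    # Assumption: confidence.p is a nested key, i.e. {"confidence": {"p": 0.4}}.
    confidence = unknown.get("confidence")
    if not isinstance(confidence, dict) or "p" not in confidence:
        missing.append("confidence.p")
    return missing
```

An empty list means the payload carries every field named above; it says nothing about whether the values themselves are valid.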

5. CAS layout

  • Graphs: cas://reachability/graphs/{blake3} (canonical JSON, sorted keys/arrays)
  • Runtime traces: cas://reachability/runtime/{sha256}
  • Unknowns evidence (optional large blobs): cas://unknowns/{sha256}
  • Edge bundles: cas://reachability/edges/{graph_hash}/{bundle_id} (JSON + .dsse)

Metadata for each CAS object: { schema: "richgraph-v1", analyzer: {name,version}, createdAtUtc, toolchain_digest }. When analyzer metadata is supplied at ingest (Signals OpenAPI), persist it alongside parsed analyzer fields from the artifact.
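
The canonicalize-then-hash step behind the graph CAS key can be sketched as below. Several points are assumptions for illustration: that the graph_hash field is excluded from the hashed body, the sort keys chosen for nodes, edges, and roots, and the compact JSON separators. BLAKE3 is not in Python's standard library, so the hash function is passed in; the sha256 helper here is only a stand-in and does not produce a real graph_hash.

```python
import hashlib
import json
from typing import Callable

# Assumed sort keys per array; the schema only says arrays are sorted.
_SORT_KEYS = {
    "nodes": lambda n: n["id"],
    "edges": lambda e: (e["from"], e["to"], e["kind"]),
    "roots": lambda r: r["id"],
}


def _with_sorted_evidence(item: dict) -> dict:
    """Copy an item and sort its evidence array if it has one."""
    out = dict(item)
    if "evidence" in out:
        out["evidence"] = sorted(out["evidence"])
    return out


def canonical_bytes(graph: dict) -> bytes:
    """Canonical JSON body: sorted keys, sorted arrays, graph_hash left out."""
    body = {k: v for k, v in graph.items() if k != "graph_hash"}
    for name, key in _SORT_KEYS.items():
        if name in body:
            body[name] = sorted((_with_sorted_evidence(x) for x in body[name]), key=key)
    text = json.dumps(body, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def graph_cas_uri(graph: dict, digest_hex: Callable[[bytes], str]) -> str:
    """Build cas://reachability/graphs/{digest} from the canonical body."""
    return f"cas://reachability/graphs/{digest_hex(canonical_bytes(graph))}"


def sha256_hex(data: bytes) -> str:
    """Stand-in hasher for this sketch only; production would use BLAKE3."""
    return hashlib.sha256(data).hexdigest()
```

Two graphs that differ only in key order, array order, or a stale graph_hash value map to the same CAS key under this scheme.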

6. Validation rules

  • All edges must carry either purl or candidates[]; never leave both empty.
  • If build_id is present, symbol_id and code_id must store it; if it is absent, record build_id_source: "FileHash".
  • Evidence arrays sorted; confidence in [0,1].
  • code_block_hash (when present) must be lowercase hex with an algorithm prefix (e.g., sha256:) and may only accompany stripped/heuristic nodes.
  • Roots must include load-time constructors when present.
  • When edge_bundles are present, each edge in a bundle must also exist in the graph edge set; revoked=true bundles override graph edges for policy/scoring.
  • Graph DSSE is mandatory per scan; edge-bundle DSSEs are optional but must reference graph_hash and bundle_id.
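
A few of the rules above lend themselves to a mechanical check. The sketch below covers only the purl/candidates rule, sorted evidence arrays, the confidence range, and the code_block_hash format; it assumes node confidence lives under symbol.confidence as in the sample payload, and the function names and error strings are hypothetical.

```python
import re

# Lowercase algorithm prefix, colon, lowercase hex digest.
_HASH_RE = re.compile(r"^[a-z0-9]+:[0-9a-f]+$")


def _check_common(evidence, confidence, label: str, errors: list[str]) -> None:
    """Checks shared by nodes and edges: sorted evidence, confidence in [0,1]."""
    if evidence is not None and list(evidence) != sorted(evidence):
        errors.append(f"{label}: evidence not sorted")
    if confidence is not None and not 0 <= confidence <= 1:
        errors.append(f"{label}: confidence outside [0,1]")


def validate_graph(graph: dict) -> list[str]:
    """Return a list of human-readable rule violations (empty when none found)."""
    errors: list[str] = []
    for node in graph.get("nodes", []):
        label = f"node {node.get('id')}"
        confidence = node.get("symbol", {}).get("confidence")
        _check_common(node.get("evidence"), confidence, label, errors)
        block_hash = node.get("code_block_hash")
        if block_hash is not None and not _HASH_RE.match(block_hash):
            errors.append(f"{label}: code_block_hash must be '<alg>:<lowercase hex>'")
    for edge in graph.get("edges", []):
        label = f"edge {edge.get('from')}->{edge.get('to')}"
        # Every edge needs an owner purl or a non-empty candidate list.
        if not edge.get("purl") and not edge.get("candidates"):
            errors.append(f"{label}: needs purl or candidates[]")
        _check_common(edge.get("evidence"), edge.get("confidence"), label, errors)
    return errors
```

Returning all violations at once, rather than raising on the first, makes it easier to report every problem in a fixture in a single pass.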

7. Acceptance checklist

  • Schema reflected in Scanner/Signals DTOs and OpenAPI responses.
  • CAS writers enforce canonicalization before hashing.
  • Fixtures include: build-id present/absent, init-array roots, purl-resolved imports-only edge, stripped binary with block-hash symbol digest, and an Unknowns case.