Graph Index Canonical Schema

Ownership: Graph Indexer Guild • Version 2025-11-03 (Sprint 140)
Scope: Canonical node and edge schemas, attribute dictionary, identity rules, and fixture references for the Graph Indexer foundations (GRAPH-INDEX-28-001).

1. Purpose

  • Provide a deterministic schema contract for graph indexing pipelines.
  • Document the attribute dictionary consumed by SBOM, Advisory, VEX, Policy, and Runtime signal feeds.
  • Define the identity rules that guarantee stable node and edge identifiers across rebuilds.
  • Point implementers and QA to the seed fixtures used in unit/integration tests.

2. Node taxonomy

| Node kind | Identity tuple (ordered) | Required attributes | Primary sources |
|---|---|---|---|
| artifact | tenant, artifact_digest, sbom_digest | display_name, artifact_digest, sbom_digest, environment, labels[], origin_registry, supply_chain_stage | Scanner WebService, SBOM Service |
| component | tenant, purl, source_type | purl, version, ecosystem, scope, license_spdx, usage | SBOM Service analyzers |
| file | tenant, artifact_digest, normalized_path, content_sha256 | normalized_path, content_sha256, language_hint, size_bytes, scope | SBOM layer analyzers |
| license | tenant, license_spdx, source_digest | license_spdx, name, classification, notice_uri | SBOM Service, Concelier |
| advisory | tenant, advisory_source, advisory_id, content_hash | advisory_source, advisory_id, severity, published_at, content_hash, linkset_digest | Concelier |
| vex_statement | tenant, vex_source, statement_id, content_hash | status, statement_id, justification, issued_at, expires_at, content_hash | Excititor |
| policy_version | tenant, policy_pack_digest, effective_from | policy_pack_digest, policy_name, effective_from, expires_at, explain_hash | Policy Engine |
| runtime_context | tenant, runtime_fingerprint, collector, observed_at | runtime_fingerprint, collector, observed_at, cluster, namespace, workload_kind, runtime_state | Signals, Zastava |

3. Edge taxonomy

| Edge kind | Source → Target | Identity tuple (ordered) | Required attributes | Default validity |
|---|---|---|---|---|
| CONTAINS | artifact → component | tenant, artifact_node_id, component_node_id, sbom_digest | detected_by, layer_digest, scope, evidence_digest | valid_from = sbom_collected_at, valid_to = null |
| DEPENDS_ON | component → component | tenant, component_node_id, dependency_purl, sbom_digest | dependency_purl, dependency_version, relationship, evidence_digest | Derived from SBOM dependency graph |
| DECLARED_IN | component → file | tenant, component_node_id, file_node_id, sbom_digest | detected_by, scope, evidence_digest | Mirrors SBOM declaration |
| BUILT_FROM | artifact → artifact | tenant, parent_artifact_node_id, child_artifact_digest | build_type, builder_id, attestation_digest | Derived from provenance attestations |
| AFFECTED_BY | component → advisory | tenant, component_node_id, advisory_node_id, linkset_digest | evidence_digest, matched_versions, cvss, confidence | Concelier overlays |
| VEX_EXEMPTS | component → vex_statement | tenant, component_node_id, vex_node_id, statement_hash | status, justification, impact_statement, evidence_digest | Excititor overlays |
| GOVERNS_WITH | policy_version → component | tenant, policy_node_id, component_node_id, finding_explain_hash | verdict, explain_hash, policy_rule_id, evaluation_timestamp | Policy Engine overlays |
| OBSERVED_RUNTIME | runtime_context → component | tenant, runtime_node_id, component_node_id, runtime_fingerprint | process_name, entrypoint_kind, runtime_evidence_digest, confidence | Signals/Zastava ingestion |

4. Attribute dictionary

| Attribute | Type | Applies to | Description |
|---|---|---|---|
| tenant | string | nodes, edges | Tenant identifier (enforced on storage and query). |
| kind | string | nodes, edges | One of the values listed in the taxonomy tables. |
| canonical_key | object | nodes | Ordered tuple persisted as a JSON object matching the identity tuple components. |
| id | string | nodes, edges | Deterministic identifier (gn: or ge: prefix + Base32-encoded SHA-256). |
| hash | string | nodes, edges | SHA-256 of the canonical JSON representation (normalized by sorted keys). |
| attributes | object | nodes, edges | Domain-specific attributes (all dictionary keys snake_case). |
| provenance | object | nodes, edges | Includes source, collected_at, sbom_digest, attestation_digest, event_offset. |
| valid_from | string (ISO-8601) | nodes, edges | Inclusive timestamp describing when the record became effective. |
| valid_to | string (ISO-8601 or null) | nodes, edges | Exclusive timestamp; null means open-ended. |
| scope | string | nodes, edges | Scope label (e.g., runtime, build, dev-dependency). |
| labels | array[string] | nodes | Free-form values stored in deterministic order (ASCII sort). |
| confidence | number | edges | Numeric confidence score between 0 and 1 for overlay-derived edges. |
| evidence_digest | string | edges | SHA-256 digest referencing the immutable evidence payload. |
| linkset_digest | string | nodes, edges | SHA-256 digest referencing Concelier linkset documents. |
| explain_hash | string | nodes, edges | Hash of the Policy Engine explain trace payload. |
| runtime_state | string | runtime_context nodes | Aggregated runtime state (e.g., Running, Terminated). |
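
As an illustration of how canonical_key relates to the node taxonomy, the following Python sketch builds the key object by walking a node kind's identity tuple in its documented order. The tuple definitions are copied from the node taxonomy table; the helper name `canonical_key`, the mapping name, and all sample values are hypothetical and are not taken from the Graph Indexer code base.

```python
import json

# Identity tuples copied from the node taxonomy table (order matters).
NODE_IDENTITY_TUPLES = {
    "artifact": ("tenant", "artifact_digest", "sbom_digest"),
    "component": ("tenant", "purl", "source_type"),
    "file": ("tenant", "artifact_digest", "normalized_path", "content_sha256"),
    "license": ("tenant", "license_spdx", "source_digest"),
    "advisory": ("tenant", "advisory_source", "advisory_id", "content_hash"),
    "vex_statement": ("tenant", "vex_source", "statement_id", "content_hash"),
    "policy_version": ("tenant", "policy_pack_digest", "effective_from"),
    "runtime_context": ("tenant", "runtime_fingerprint", "collector", "observed_at"),
}


def canonical_key(kind, values):
    """Return the identity components for `kind` in tuple order.

    Raises KeyError if the kind is unknown or a component is missing.
    Attributes that are not part of the identity tuple are ignored.
    """
    return {name: values[name] for name in NODE_IDENTITY_TUPLES[kind]}


# Placeholder values; input order and extra attributes do not affect the result.
key = canonical_key("license", {
    "source_digest": "sha256:placeholder",
    "name": "MIT License",
    "tenant": "tenant-a",
    "license_spdx": "MIT",
})
print(json.dumps(key))
```

Python dictionaries preserve insertion order, so the serialized object lists the components in identity-tuple order regardless of how the input was assembled.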

5. Identity rules

  1. Node IDs (gn: prefix).
    id = "gn:" + tenant + ":" + kind + ":" + base32(sha256(identity_tuple))
    identity_tuple joins the tuple components in their listed order with | (no escaping); keys and values are lower-cased unless the component is a hash or digest.
  2. Edge IDs (ge: prefix).
    id = "ge:" + tenant + ":" + kind + ":" + base32(sha256(identity_tuple))
    Edge tuples must include the resolved node IDs rather than only the canonical keys to ensure immutability under re-key events.
  3. Hashes.
    hash is computed by serializing the canonical document with:
    • UTF-8 JSON
    • Object keys sorted lexicographically
    • Arrays sorted where semantics allow (e.g., labels, matched_versions)
    • Timestamps normalized to UTC ISO-8601 (YYYY-MM-DDTHH:MM:SSZ)
  4. Deterministic provenance.
    provenance.source is a dotted string (scanner.sbom.v1, concelier.linkset.v1) and provenance.event_offset is a monotonic integer for replay.
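
The ID rules above can be sketched in Python as follows. This is a minimal illustration, not the Graph Indexer implementation. It makes several assumptions the schema does not spell out: each tuple component is rendered as `key=value`, components whose names end in `_digest`, `_hash`, or `_sha256` are the ones that keep their case, and the Base32 output is RFC 4648, lower-cased, with padding stripped. All sample values are placeholders.

```python
import base64
import hashlib

# Assumption: components named with these suffixes are hashes or digests and
# keep their original case. The schema does not enumerate them explicitly.
CASE_PRESERVED_SUFFIXES = ("_digest", "_hash", "_sha256")


def identity_string(components):
    """Join ordered (key, value) pairs with '|', without escaping."""
    parts = []
    for key, value in components:
        key = key.lower()
        if not key.endswith(CASE_PRESERVED_SUFFIXES):
            value = value.lower()
        # Assumption: each component is rendered as key=value.
        parts.append(f"{key}={value}")
    return "|".join(parts)


def make_id(prefix, tenant, kind, components):
    """Build '<prefix>:<tenant>:<kind>:<base32(sha256(identity_tuple))>'."""
    digest = hashlib.sha256(identity_string(components).encode("utf-8")).digest()
    # Assumption: RFC 4648 Base32, lower-cased, padding stripped.
    encoded = base64.b32encode(digest).decode("ascii").rstrip("=").lower()
    return f"{prefix}:{tenant}:{kind}:{encoded}"


# Node IDs (gn: prefix).
component_id = make_id("gn", "tenant-a", "component", [
    ("tenant", "tenant-a"),
    ("purl", "pkg:npm/Example@1.0.0"),
    ("source_type", "inventory"),
])
artifact_id = make_id("gn", "tenant-a", "artifact", [
    ("tenant", "tenant-a"),
    ("artifact_digest", "sha256:AAAA"),
    ("sbom_digest", "sha256:BBBB"),
])

# Edge ID (ge: prefix): the tuple carries the resolved node IDs, per rule 2.
contains_id = make_id("ge", "tenant-a", "CONTAINS", [
    ("tenant", "tenant-a"),
    ("artifact_node_id", artifact_id),
    ("component_node_id", component_id),
    ("sbom_digest", "sha256:BBBB"),
])
```

Because the edge tuple embeds the node IDs, any change to a node's identity produces a different edge ID rather than silently repointing an existing edge.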

6. Validity window semantics

  • valid_from is set at ingestion time to the upstream event timestamp (SBOM collected timestamp, advisory published timestamp, policy evaluation timestamp, or runtime observation timestamp).
  • valid_to stays null until a newer version supersedes the record. Superseding records carry a supersedes reference in attributes.
  • Snapshots freeze the set of nodes/edges with valid_from <= snapshot_at < coalesce(valid_to, +∞).
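
The snapshot predicate can be expressed as a small filter. The Python sketch below assumes records are plain dictionaries carrying valid_from and valid_to in the normalized YYYY-MM-DDTHH:MM:SSZ form; the function names are hypothetical.

```python
from datetime import datetime, timezone


def parse_ts(value):
    """Parse a normalized UTC timestamp (YYYY-MM-DDTHH:MM:SSZ)."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def in_snapshot(record, snapshot_at):
    """True when valid_from <= snapshot_at < coalesce(valid_to, +infinity)."""
    if snapshot_at < parse_ts(record["valid_from"]):
        return False
    valid_to = record.get("valid_to")
    # A null valid_to means the record is still open-ended.
    return valid_to is None or snapshot_at < parse_ts(valid_to)


snapshot_at = parse_ts("2025-11-03T00:00:00Z")
records = [
    {"id": "open", "valid_from": "2025-11-01T00:00:00Z", "valid_to": None},
    {"id": "superseded", "valid_from": "2025-10-01T00:00:00Z", "valid_to": "2025-11-03T00:00:00Z"},
    {"id": "future", "valid_from": "2025-11-04T00:00:00Z", "valid_to": None},
]
frozen = [r["id"] for r in records if in_snapshot(r, snapshot_at)]
```

Here only the open-ended record is included: the superseded record is excluded because valid_to is exclusive, and the future record has not yet become effective.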

7. Fixtures & verification

  • Seed fixtures live under tests/Graph/StellaOps.Graph.Indexer.Tests/Fixtures/v1/.
  • Fixture files:
    • nodes.json — canonical node samples (per node kind).
    • edges.json — canonical edge samples including overlay references.
    • schema-matrix.json — lists attribute coverage per node/edge kind for regression tests.
  • Unit tests assert:
    • Identifier determinism (GraphIdentityTests.NodeIds_are_stable).
    • Hash determinism under property ordering variations.
    • Attribute coverage against schema-matrix.json.
  • Fixtures follow the attribute dictionary above; new attributes require dictionary updates and fixture refresh.
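
A hash-determinism check of the kind listed above might look like the following Python sketch, which applies the canonicalization rules from the identity section (sorted keys, sorted arrays where allowed, UTC timestamps). It is an illustration only, not the project's test code. The set of sortable arrays comes from the examples in this document; the set of timestamp fields, the compact JSON separators, and the hex output encoding are assumptions.

```python
import hashlib
import json
from datetime import datetime, timezone

SORTABLE_ARRAYS = {"labels", "matched_versions"}  # examples named in the schema
# Assumption: which fields are treated as timestamps.
TIMESTAMP_KEYS = {"valid_from", "valid_to", "collected_at"}


def normalize_timestamp(value):
    """Convert an offset-aware ISO-8601 string to YYYY-MM-DDTHH:MM:SSZ in UTC."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def canonicalize(value, key=None):
    """Recursively sort permitted arrays and normalize timestamp fields."""
    if isinstance(value, dict):
        return {k: canonicalize(v, k) for k, v in value.items()}
    if isinstance(value, list):
        items = [canonicalize(v) for v in value]
        return sorted(items) if key in SORTABLE_ARRAYS else items
    if isinstance(value, str) and key in TIMESTAMP_KEYS:
        return normalize_timestamp(value)
    return value


def canonical_hash(document):
    """SHA-256 (hex, by assumption) over UTF-8 JSON with sorted object keys."""
    payload = json.dumps(canonicalize(document), sort_keys=True,
                         separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Same content, different key order, label order, and timestamp offset.
doc_a = {"kind": "artifact", "attributes": {"labels": ["b", "a"]},
         "valid_from": "2025-11-04T07:49:39+02:00"}
doc_b = {"valid_from": "2025-11-04T05:49:39Z",
         "attributes": {"labels": ["a", "b"]}, "kind": "artifact"}
assert canonical_hash(doc_a) == canonical_hash(doc_b)
```

Timestamps without an offset would be interpreted in the local time zone by `astimezone`, so a real pipeline would need to reject or explicitly qualify such inputs.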

8. Change control

  • Increment the schema version in the fixture folder name (v1, v2, …) when making breaking changes.
  • Update this document and the JSON fixtures together; do not ship mismatched versions.
  • Notify SBOM Service, Concelier, Excititor, Policy, Signals, and Zastava owners before promoting changes to DOING/DONE state.

9. References

  • docs/modules/graph/architecture.md — high-level architecture.
  • docs/modules/platform/architecture-overview.md — platform context.
  • src/Graph/StellaOps.Graph.Indexer/TASKS.md — task tracking.
  • seed-data/ — additional sample payloads for offline kit packaging (future work).