Add unit tests for SBOM ingestion and transformation

- Implement `SbomIngestServiceCollectionExtensionsTests` to verify the SBOM ingestion pipeline exports snapshots correctly.
- Create `SbomIngestTransformerTests` to ensure the transformation produces expected nodes and edges, including deduplication of license nodes and normalization of timestamps.
- Add `SbomSnapshotExporterTests` to test the export functionality for manifest, adjacency, nodes, and edges.
- Introduce `VexOverlayTransformerTests` to validate the transformation of VEX nodes and edges.
- Set up project file for the test project with necessary dependencies and configurations.
- Include JSON fixture files for testing purposes.
2025-11-04 07:49:39 +02:00
parent f72c5c513a
commit 2eb6852d34
491 changed files with 39445 additions and 3917 deletions

@@ -0,0 +1,98 @@
# Graph Index Canonical Schema
> Ownership: Graph Indexer Guild • Version 2025-11-03 (Sprint 140)\
> Scope: Canonical node and edge schemas, attribute dictionary, identity rules, and fixture references for the Graph Indexer foundations (GRAPH-INDEX-28-001).
## 1. Purpose
- Provide a deterministic schema contract for graph indexing pipelines.
- Document the attribute dictionary consumed by SBOM, Advisory, VEX, Policy, and Runtime signal feeds.
- Define the identity rules that guarantee stable node and edge identifiers across rebuilds.
- Point implementers and QA to the seed fixtures used in unit/integration tests.
## 2. Node taxonomy
| Node kind | Identity tuple (ordered) | Required attributes | Primary sources |
|-----------|--------------------------|---------------------|-----------------|
| `artifact` | `tenant`, `artifact_digest`, `sbom_digest` | `display_name`, `artifact_digest`, `sbom_digest`, `environment`, `labels[]`, `origin_registry`, `supply_chain_stage` | Scanner WebService, SBOM Service |
| `component` | `tenant`, `purl`, `source_type` | `purl`, `version`, `ecosystem`, `scope`, `license_spdx`, `usage` | SBOM Service analyzers |
| `file` | `tenant`, `artifact_digest`, `normalized_path`, `content_sha256` | `normalized_path`, `content_sha256`, `language_hint`, `size_bytes`, `scope` | SBOM layer analyzers |
| `license` | `tenant`, `license_spdx`, `source_digest` | `license_spdx`, `name`, `classification`, `notice_uri` | SBOM Service, Concelier |
| `advisory` | `tenant`, `advisory_source`, `advisory_id`, `content_hash` | `advisory_source`, `advisory_id`, `severity`, `published_at`, `content_hash`, `linkset_digest` | Concelier |
| `vex_statement` | `tenant`, `vex_source`, `statement_id`, `content_hash` | `status`, `statement_id`, `justification`, `issued_at`, `expires_at`, `content_hash` | Excititor |
| `policy_version` | `tenant`, `policy_pack_digest`, `effective_from` | `policy_pack_digest`, `policy_name`, `effective_from`, `expires_at`, `explain_hash` | Policy Engine |
| `runtime_context` | `tenant`, `runtime_fingerprint`, `collector`, `observed_at` | `runtime_fingerprint`, `collector`, `observed_at`, `cluster`, `namespace`, `workload_kind`, `runtime_state` | Signals, Zastava |
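As a rough illustration of how the table above could be enforced, the sketch below reports which required attributes a node is missing for its kind. The `REQUIRED_NODE_ATTRIBUTES` mapping and the `missing_attributes` helper are hypothetical names, only three node kinds are transcribed, and the node shape (a dict with `kind` and `attributes`) is an assumption rather than the indexer's actual model.

```python
# Hypothetical sketch: required attributes per node kind, transcribed from the
# node taxonomy table. Only three kinds are shown; the rest follow the same shape.
REQUIRED_NODE_ATTRIBUTES = {
    "component": {"purl", "version", "ecosystem", "scope", "license_spdx", "usage"},
    "license": {"license_spdx", "name", "classification", "notice_uri"},
    "vex_statement": {
        "status", "statement_id", "justification",
        "issued_at", "expires_at", "content_hash",
    },
}


def missing_attributes(node: dict) -> list[str]:
    """Return the required attributes absent from `node`, sorted for stable output."""
    kind = node["kind"]
    if kind not in REQUIRED_NODE_ATTRIBUTES:
        raise ValueError(f"unknown or untranscribed node kind: {kind}")
    # Assumption: domain attributes live under the `attributes` object.
    present = set(node.get("attributes", {}))
    return sorted(REQUIRED_NODE_ATTRIBUTES[kind] - present)
```

A node that passes this check yields an empty list, which makes the helper easy to use inside a fixture regression test.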
## 3. Edge taxonomy
| Edge kind | Source → Target | Identity tuple (ordered) | Required attributes | Default validity |
|-----------|-----------------|--------------------------|---------------------|------------------|
| `CONTAINS` | `artifact` → `component` | `tenant`, `artifact_node_id`, `component_node_id`, `sbom_digest` | `detected_by`, `layer_digest`, `scope`, `evidence_digest` | `valid_from = sbom_collected_at`, `valid_to = null` |
| `DEPENDS_ON` | `component` → `component` | `tenant`, `component_node_id`, `dependency_purl`, `sbom_digest` | `dependency_purl`, `dependency_version`, `relationship`, `evidence_digest` | Derived from SBOM dependency graph |
| `DECLARED_IN` | `component` → `file` | `tenant`, `component_node_id`, `file_node_id`, `sbom_digest` | `detected_by`, `scope`, `evidence_digest` | Mirrors SBOM declaration |
| `BUILT_FROM` | `artifact` → `artifact` | `tenant`, `parent_artifact_node_id`, `child_artifact_digest` | `build_type`, `builder_id`, `attestation_digest` | Derived from provenance attestations |
| `AFFECTED_BY` | `component` → `advisory` | `tenant`, `component_node_id`, `advisory_node_id`, `linkset_digest` | `evidence_digest`, `matched_versions`, `cvss`, `confidence` | Concelier overlays |
| `VEX_EXEMPTS` | `component` → `vex_statement` | `tenant`, `component_node_id`, `vex_node_id`, `statement_hash` | `status`, `justification`, `impact_statement`, `evidence_digest` | Excititor overlays |
| `GOVERNS_WITH` | `policy_version` → `component` | `tenant`, `policy_node_id`, `component_node_id`, `finding_explain_hash` | `verdict`, `explain_hash`, `policy_rule_id`, `evaluation_timestamp` | Policy Engine overlays |
| `OBSERVED_RUNTIME` | `runtime_context` → `component` | `tenant`, `runtime_node_id`, `component_node_id`, `runtime_fingerprint` | `process_name`, `entrypoint_kind`, `runtime_evidence_digest`, `confidence` | Signals/Zastava ingestion |
## 4. Attribute dictionary
| Attribute | Type | Applies to | Description |
|-----------|------|------------|-------------|
| `tenant` | `string` | nodes, edges | Tenant identifier (enforced on storage and query). |
| `kind` | `string` | nodes, edges | One of the values listed in the taxonomy tables. |
| `canonical_key` | `object` | nodes | Ordered tuple persisted as a JSON object matching the identity tuple components. |
| `id` | `string` | nodes, edges | Deterministic identifier (`gn:` or `ge:` prefix, then tenant, kind, and the Base32-encoded SHA-256 of the identity tuple; see section 5). |
| `hash` | `string` | nodes, edges | SHA-256 of the canonical JSON representation (normalized by sorted keys). |
| `attributes` | `object` | nodes, edges | Domain-specific attributes (all dictionary keys snake_case, as in the taxonomy tables). |
| `provenance` | `object` | nodes, edges | Includes `source`, `collected_at`, `sbom_digest`, `attestation_digest`, `event_offset`. |
| `valid_from` | `string (ISO-8601)` | nodes, edges | Inclusive timestamp describing when the record became effective. |
| `valid_to` | `string (ISO-8601 or null)` | nodes, edges | Exclusive timestamp; `null` means open-ended. |
| `scope` | `string` | nodes, edges | Scope label (e.g., `runtime`, `build`, `dev-dependency`). |
| `labels` | `array[string]` | nodes | Free-form but deterministic ordering (ASCII sort). |
| `confidence` | `number` | edges | Numeric confidence score in the range 0 to 1 for overlay-derived edges. |
| `evidence_digest` | `string` | edges | SHA-256 digest referencing the immutable evidence payload. |
| `linkset_digest` | `string` | nodes, edges | SHA-256 digest referencing the Concelier linkset document. |
| `explain_hash` | `string` | nodes, edges | Hash of Policy Engine explain trace payload. |
| `runtime_state` | `string` | `runtime_context` nodes | Aggregated runtime state (e.g., `Running`, `Terminated`). |
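The common fields in the dictionary above suggest a shared record envelope. The sketch below is one possible shape, not the indexer's actual model: `GraphRecord`, `normalize_labels`, and `check_confidence` are hypothetical names, and leaving `labels` and `confidence` outside the envelope (to be stored wherever the implementation keeps them) is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class GraphRecord:
    """Hypothetical envelope shared by nodes and edges."""
    tenant: str
    kind: str
    id: str    # "gn:" (node) or "ge:" (edge) prefixed identifier
    hash: str  # SHA-256 of the canonical JSON representation
    attributes: dict[str, Any] = field(default_factory=dict)
    provenance: dict[str, Any] = field(default_factory=dict)
    valid_from: str = ""            # ISO-8601, inclusive
    valid_to: Optional[str] = None  # ISO-8601, exclusive; None means open-ended


def normalize_labels(labels: list[str]) -> list[str]:
    """Sort labels by code point, which matches ASCII order for ASCII input."""
    return sorted(labels)


def check_confidence(value: float) -> float:
    """Reject confidence scores outside the 0 to 1 range."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"confidence out of range: {value}")
    return value
```

Note that an ASCII sort places upper-case labels before lower-case ones, so `["prod", "Beta", "alpha"]` normalizes to `["Beta", "alpha", "prod"]`.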
## 5. Identity rules
1. **Node IDs (`gn:` prefix).**
   `id = "gn:" + tenant + ":" + kind + ":" + base32(sha256(identity_tuple))`\
   `identity_tuple` concatenates tuple components with `|` (no escaping) and lower-cases both keys and values unless the component is a hash or digest.
2. **Edge IDs (`ge:` prefix).**
   `id = "ge:" + tenant + ":" + kind + ":" + base32(sha256(identity_tuple))`\
   Edge tuples must include the resolved node IDs rather than only the canonical keys to ensure immutability under re-key events.
3. **Hashes.**
   `hash` is computed by serializing the canonical document with:
   - UTF-8 JSON
   - Object keys sorted lexicographically
   - Arrays sorted where semantics allow (e.g., `labels`, `matched_versions`)
   - Timestamps normalized to UTC ISO-8601 (`YYYY-MM-DDTHH:MM:SSZ`)
4. **Deterministic provenance.**
   `provenance.source` is a dotted string (`scanner.sbom.v1`, `concelier.linkset.v1`) and `provenance.event_offset` is a monotonic integer for replay.
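The identity and hash rules above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the indexer's implementation: the helper names are hypothetical, the tuple is built from values only, a component counts as a hash or digest when its key ends in `_hash`, `_digest`, or `sha256`, Base32 uses the RFC 4648 alphabet with padding stripped, and the canonical JSON uses compact separators with a hex-encoded result. None of those details are fixed by this document.

```python
import base64
import hashlib
import json
from datetime import datetime, timezone


def identity_tuple(components: list[tuple[str, str]]) -> str:
    """Join ordered (key, value) components with '|', lower-casing non-digest values."""
    parts = []
    for key, value in components:
        # Assumption: digest-like components are recognized by their key suffix.
        is_digest = key.endswith(("_hash", "_digest", "sha256"))
        parts.append(value if is_digest else value.lower())
    return "|".join(parts)


def graph_id(prefix: str, tenant: str, kind: str, components: list[tuple[str, str]]) -> str:
    """Build a 'gn:' or 'ge:' identifier from the ordered identity tuple."""
    digest = hashlib.sha256(identity_tuple(components).encode("utf-8")).digest()
    # Assumption: standard Base32 alphabet, '=' padding removed.
    encoded = base64.b32encode(digest).decode("ascii").rstrip("=")
    return f"{prefix}:{tenant}:{kind}:{encoded}"


def normalize_timestamp(value: str) -> str:
    """Convert an ISO-8601 timestamp to UTC in YYYY-MM-DDTHH:MM:SSZ form."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def canonical_hash(document: dict) -> str:
    """SHA-256 over UTF-8 JSON with sorted keys.

    Array sorting and timestamp normalization are left to the caller here,
    since which arrays may be reordered depends on their semantics.
    """
    payload = json.dumps(document, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

For a `component` node, the call would look like `graph_id("gn", "tenant-a", "component", [("tenant", "tenant-a"), ("purl", "pkg:npm/left-pad@1.3.0"), ("source_type", "inventory")])`, where the tenant, purl, and source type values are made up for illustration. An edge ID uses the `ge` prefix and passes the resolved node IDs as tuple components.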
## 6. Validity window semantics
- `valid_from` equals the upstream event timestamp at ingestion time (SBOM collected timestamp, advisory published timestamp, policy evaluation timestamp, runtime observation timestamp).
- `valid_to` stays `null` until a newer version supersedes the record. Superseding records carry a `supersedes` reference in `attributes`.
- Snapshots freeze the set of nodes/edges with `valid_from <= snapshot_at < coalesce(valid_to, +∞)`.
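The snapshot predicate above translates directly into a filter. The sketch below assumes every timestamp is already normalized to `YYYY-MM-DDTHH:MM:SSZ`, so plain string comparison is chronological; `in_snapshot` and `freeze_snapshot` are hypothetical helper names, and records are assumed to be dicts carrying `valid_from` and `valid_to`.

```python
from typing import Optional


def in_snapshot(valid_from: str, valid_to: Optional[str], snapshot_at: str) -> bool:
    """True when valid_from <= snapshot_at < coalesce(valid_to, +infinity)."""
    # valid_from is inclusive; valid_to is exclusive and None means open-ended.
    return valid_from <= snapshot_at and (valid_to is None or snapshot_at < valid_to)


def freeze_snapshot(records: list[dict], snapshot_at: str) -> list[dict]:
    """Select the records visible at `snapshot_at`, preserving input order."""
    return [
        record
        for record in records
        if in_snapshot(record["valid_from"], record.get("valid_to"), snapshot_at)
    ]
```

Because `valid_to` is exclusive, a record superseded at exactly `snapshot_at` is left out and its successor, whose `valid_from` equals that instant, is included.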
## 7. Fixtures & verification
- Seed fixtures live under `tests/Graph/StellaOps.Graph.Indexer.Tests/Fixtures/v1/`.
- Fixture files:
  - `nodes.json` — canonical node samples (per node kind).
  - `edges.json` — canonical edge samples including overlay references.
  - `schema-matrix.json` — lists attribute coverage per node/edge kind for regression tests.
- Unit tests assert:
  - Identifier determinism (`GraphIdentityTests.NodeIds_are_stable`).
  - Hash determinism under property ordering variations.
  - Attribute coverage against `schema-matrix.json`.
- Fixtures follow the attribute dictionary above; new attributes require dictionary updates and fixture refresh.
## 8. Change control
- Increment schema version in fixture folder (`v1`, `v2`, …) when making breaking changes.
- Update this document and the JSON fixtures together; do not ship mismatched versions.
- Notify SBOM Service, Concelier, Excititor, Policy, Signals, and Zastava owners before promoting changes to DOING/DONE state.
## 9. References
- `docs/modules/graph/architecture.md` — high-level architecture.
- `docs/modules/platform/architecture-overview.md` — platform context.
- `src/Graph/StellaOps.Graph.Indexer/TASKS.md` — task tracking.
- `seed-data/` — additional sample payloads for offline kit packaging (future work).