Add unit tests for SBOM ingestion and transformation

- Implement `SbomIngestServiceCollectionExtensionsTests` to verify the SBOM ingestion pipeline exports snapshots correctly.
- Create `SbomIngestTransformerTests` to ensure the transformation produces expected nodes and edges, including deduplication of license nodes and normalization of timestamps.
- Add `SbomSnapshotExporterTests` to test the export functionality for manifest, adjacency, nodes, and edges.
- Introduce `VexOverlayTransformerTests` to validate the transformation of VEX nodes and edges.
- Set up project file for the test project with necessary dependencies and configurations.
- Include JSON fixture files for testing purposes.
This commit is contained in:
master
2025-11-04 07:49:39 +02:00
parent f72c5c513a
commit 2eb6852d34
491 changed files with 39445 additions and 3917 deletions


@@ -2,11 +2,12 @@
The Graph module (upcoming) will power graph-indexed queries for SBOM relationships, lineage, and blast-radius analysis.
## Responsibilities
- Model SBOM and advisory entities as a navigable graph.
- Provide APIs for dependency impact, provenance chains, and reachability analysis.
- Integrate with Scheduler/Policy for graph-driven re-evaluation.
- Expose tooling for offline explorers.
- Maintain [Graph Index Canonical Schema](schema.md) with deterministic identities, fixtures, and attribute dictionary.
### Domain highlights (Epic5)
- **Nodes:** artifacts/images, SBOM components, packages/versions, files/paths, licences, advisories, VEX statements, provenance attestations, policy versions.


@@ -38,8 +38,9 @@
## 5) Offline & export
- Each snapshot packages `nodes.jsonl`, `edges.jsonl`, `overlays/` plus manifest with hash, counts, and provenance. Export Center consumes these artefacts for graph-specific bundles.
- Saved queries and overlays include deterministic IDs so Offline Kit consumers can import and replay results.
- Runtime hosts register the SBOM ingest pipeline via `services.AddSbomIngestPipeline(...)`. Snapshot exports default to `./artifacts/graph-snapshots` but can be redirected with `STELLAOPS_GRAPH_SNAPSHOT_DIR` or by setting `SbomIngestOptions.SnapshotRootDirectory` in the options callback.
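
The snapshot directory selection described above can be sketched as follows. This is an illustrative Python model, not the .NET implementation: the function name `resolve_snapshot_dir` is hypothetical, and the precedence order (explicit option, then environment variable, then default) is an assumption, since the text does not say which override wins.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_SNAPSHOT_DIR = "./artifacts/graph-snapshots"
SNAPSHOT_DIR_ENV = "STELLAOPS_GRAPH_SNAPSHOT_DIR"


def resolve_snapshot_dir(
    configured: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Path:
    """Pick the snapshot root directory.

    Assumed precedence (not confirmed by the docs):
    explicit option > environment variable > built-in default.
    """
    env = os.environ if env is None else env
    candidate = configured or env.get(SNAPSHOT_DIR_ENV) or DEFAULT_SNAPSHOT_DIR
    return Path(candidate)
```

Passing `env` explicitly keeps the function easy to exercise in tests without mutating the process environment.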
## 6) Observability
@@ -47,10 +48,14 @@
- Logs: structured events for ETL stages and query execution (with trace IDs).
- Traces: ETL pipeline spans, query engine spans.
## 7) Rollout notes
- Phase 1: ingest SBOM + advisories, deliver impact queries.
- Phase 2: add VEX overlays, policy overlays, diff tooling.
- Phase 3: expose runtime/Zastava edges and AI-assisted recommendations (future).
### Local testing note
Set `STELLAOPS_TEST_MONGO_URI` to a reachable MongoDB instance before running `tests/Graph/StellaOps.Graph.Indexer.Tests`. Without it, the test harness falls back to `mongodb://127.0.0.1:27017` and then to Mongo2Go. The CI workflow, however, requires the variable to be set so that upsert coverage runs against a managed database. Use `STELLAOPS_GRAPH_SNAPSHOT_DIR` (or the `AddSbomIngestPipeline` options callback) to control where graph snapshot artefacts land during local runs.
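
The fallback order in this note can be modelled as a small pure function. This is a sketch only: `choose_mongo_target`, the `is_reachable` probe, and the `EMBEDDED` marker (standing in for a Mongo2Go instance) are hypothetical names, and the reachability check before using the local URI is an assumption about how the harness decides to move on.

```python
from typing import Callable, Mapping

MONGO_URI_ENV = "STELLAOPS_TEST_MONGO_URI"
LOCAL_MONGO_URI = "mongodb://127.0.0.1:27017"
EMBEDDED = "embedded"  # hypothetical marker standing in for Mongo2Go


def choose_mongo_target(
    env: Mapping[str, str],
    is_reachable: Callable[[str], bool],
    require_env: bool = False,
) -> str:
    """Return the MongoDB target the tests would use.

    require_env=True models the CI rule that the variable must be present.
    """
    configured = env.get(MONGO_URI_ENV)
    if configured:
        return configured
    if require_env:
        raise RuntimeError(f"{MONGO_URI_ENV} must be set")
    # Assumption: the harness probes the local instance before falling back.
    if is_reachable(LOCAL_MONGO_URI):
        return LOCAL_MONGO_URI
    return EMBEDDED
```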
Refer to the module README and implementation plan for immediate context, and update this document once component boundaries and data flows are finalised.


@@ -0,0 +1,98 @@
# Graph Index Canonical Schema
> Ownership: Graph Indexer Guild • Version 2025-11-03 (Sprint 140)\
> Scope: Canonical node and edge schemas, attribute dictionary, identity rules, and fixture references for the Graph Indexer foundations (GRAPH-INDEX-28-001).
## 1. Purpose
- Provide a deterministic schema contract for graph indexing pipelines.
- Document the attribute dictionary consumed by SBOM, Advisory, VEX, Policy, and Runtime signal feeds.
- Define the identity rules that guarantee stable node and edge identifiers across rebuilds.
- Point implementers and QA to the seed fixtures used in unit/integration tests.
## 2. Node taxonomy
| Node kind | Identity tuple (ordered) | Required attributes | Primary sources |
|-----------|--------------------------|---------------------|-----------------|
| `artifact` | `tenant`, `artifact_digest`, `sbom_digest` | `display_name`, `artifact_digest`, `sbom_digest`, `environment`, `labels[]`, `origin_registry`, `supply_chain_stage` | Scanner WebService, SBOM Service |
| `component` | `tenant`, `purl`, `source_type` | `purl`, `version`, `ecosystem`, `scope`, `license_spdx`, `usage` | SBOM Service analyzers |
| `file` | `tenant`, `artifact_digest`, `normalized_path`, `content_sha256` | `normalized_path`, `content_sha256`, `language_hint`, `size_bytes`, `scope` | SBOM layer analyzers |
| `license` | `tenant`, `license_spdx`, `source_digest` | `license_spdx`, `name`, `classification`, `notice_uri` | SBOM Service, Concelier |
| `advisory` | `tenant`, `advisory_source`, `advisory_id`, `content_hash` | `advisory_source`, `advisory_id`, `severity`, `published_at`, `content_hash`, `linkset_digest` | Concelier |
| `vex_statement` | `tenant`, `vex_source`, `statement_id`, `content_hash` | `status`, `statement_id`, `justification`, `issued_at`, `expires_at`, `content_hash` | Excititor |
| `policy_version` | `tenant`, `policy_pack_digest`, `effective_from` | `policy_pack_digest`, `policy_name`, `effective_from`, `expires_at`, `explain_hash` | Policy Engine |
| `runtime_context` | `tenant`, `runtime_fingerprint`, `collector`, `observed_at` | `runtime_fingerprint`, `collector`, `observed_at`, `cluster`, `namespace`, `workload_kind`, `runtime_state` | Signals, Zastava |
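
As a rough illustration of how the "Required attributes" column could be enforced, the sketch below transcribes the table into a lookup and reports missing keys. The names `REQUIRED_NODE_ATTRIBUTES` and `missing_attributes` are hypothetical, and `labels[]` is written as `labels` on the assumption that the brackets only denote an array.

```python
from typing import Any, Dict, List, Mapping

# Transcribed from the node taxonomy table above.
REQUIRED_NODE_ATTRIBUTES: Dict[str, List[str]] = {
    "artifact": ["display_name", "artifact_digest", "sbom_digest", "environment",
                 "labels", "origin_registry", "supply_chain_stage"],
    "component": ["purl", "version", "ecosystem", "scope", "license_spdx", "usage"],
    "file": ["normalized_path", "content_sha256", "language_hint", "size_bytes", "scope"],
    "license": ["license_spdx", "name", "classification", "notice_uri"],
    "advisory": ["advisory_source", "advisory_id", "severity", "published_at",
                 "content_hash", "linkset_digest"],
    "vex_statement": ["status", "statement_id", "justification", "issued_at",
                      "expires_at", "content_hash"],
    "policy_version": ["policy_pack_digest", "policy_name", "effective_from",
                       "expires_at", "explain_hash"],
    "runtime_context": ["runtime_fingerprint", "collector", "observed_at", "cluster",
                        "namespace", "workload_kind", "runtime_state"],
}


def missing_attributes(kind: str, attributes: Mapping[str, Any]) -> List[str]:
    """List required attributes absent from a node, in table order.

    Raises KeyError for a node kind that is not in the taxonomy.
    """
    return [name for name in REQUIRED_NODE_ATTRIBUTES[kind] if name not in attributes]
```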
## 3. Edge taxonomy
| Edge kind | Source → Target | Identity tuple (ordered) | Required attributes | Default validity |
|-----------|-----------------|--------------------------|---------------------|------------------|
| `CONTAINS` | `artifact` → `component` | `tenant`, `artifact_node_id`, `component_node_id`, `sbom_digest` | `detected_by`, `layer_digest`, `scope`, `evidence_digest` | `valid_from = sbom_collected_at`, `valid_to = null` |
| `DEPENDS_ON` | `component` → `component` | `tenant`, `component_node_id`, `dependency_purl`, `sbom_digest` | `dependency_purl`, `dependency_version`, `relationship`, `evidence_digest` | Derived from SBOM dependency graph |
| `DECLARED_IN` | `component` → `file` | `tenant`, `component_node_id`, `file_node_id`, `sbom_digest` | `detected_by`, `scope`, `evidence_digest` | Mirrors SBOM declaration |
| `BUILT_FROM` | `artifact` → `artifact` | `tenant`, `parent_artifact_node_id`, `child_artifact_digest` | `build_type`, `builder_id`, `attestation_digest` | Derived from provenance attestations |
| `AFFECTED_BY` | `component` → `advisory` | `tenant`, `component_node_id`, `advisory_node_id`, `linkset_digest` | `evidence_digest`, `matched_versions`, `cvss`, `confidence` | Concelier overlays |
| `VEX_EXEMPTS` | `component` → `vex_statement` | `tenant`, `component_node_id`, `vex_node_id`, `statement_hash` | `status`, `justification`, `impact_statement`, `evidence_digest` | Excititor overlays |
| `GOVERNS_WITH` | `policy_version` → `component` | `tenant`, `policy_node_id`, `component_node_id`, `finding_explain_hash` | `verdict`, `explain_hash`, `policy_rule_id`, `evaluation_timestamp` | Policy Engine overlays |
| `OBSERVED_RUNTIME` | `runtime_context` → `component` | `tenant`, `runtime_node_id`, `component_node_id`, `runtime_fingerprint` | `process_name`, `entrypoint_kind`, `runtime_evidence_digest`, `confidence` | Signals/Zastava ingestion |
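
The source and target kinds in the edge table lend themselves to a simple endpoint check. The following is a minimal sketch with hypothetical names (`EDGE_ENDPOINTS`, `edge_is_well_typed`); it only validates node kinds and says nothing about how the indexer itself performs this check.

```python
from typing import Dict, Tuple

# Edge kind -> (source node kind, target node kind), per the edge taxonomy table.
EDGE_ENDPOINTS: Dict[str, Tuple[str, str]] = {
    "CONTAINS": ("artifact", "component"),
    "DEPENDS_ON": ("component", "component"),
    "DECLARED_IN": ("component", "file"),
    "BUILT_FROM": ("artifact", "artifact"),
    "AFFECTED_BY": ("component", "advisory"),
    "VEX_EXEMPTS": ("component", "vex_statement"),
    "GOVERNS_WITH": ("policy_version", "component"),
    "OBSERVED_RUNTIME": ("runtime_context", "component"),
}


def edge_is_well_typed(kind: str, source_kind: str, target_kind: str) -> bool:
    """True when the edge kind is known and connects the expected node kinds."""
    return EDGE_ENDPOINTS.get(kind) == (source_kind, target_kind)
```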
## 4. Attribute dictionary
| Attribute | Type | Applies to | Description |
|-----------|------|------------|-------------|
| `tenant` | `string` | nodes, edges | Tenant identifier (enforced on storage and query). |
| `kind` | `string` | nodes, edges | One of the values listed in the taxonomy tables. |
| `canonical_key` | `object` | nodes | Ordered tuple persisted as a JSON object matching the identity tuple components. |
| `id` | `string` | nodes, edges | Deterministic identifier (`gn:` or `ge:` prefix + Base32-encoded SHA-256). |
| `hash` | `string` | nodes, edges | SHA-256 of the canonical JSON representation (normalized by sorted keys). |
| `attributes` | `object` | nodes, edges | Domain-specific attributes (all dictionary keys kebab-case). |
| `provenance` | `object` | nodes, edges | Includes `source`, `collected_at`, `sbom_digest`, `attestation_digest`, `event_offset`. |
| `valid_from` | `string (ISO-8601)` | nodes, edges | Inclusive timestamp describing when the record became effective. |
| `valid_to` | `string (ISO-8601 or null)` | nodes, edges | Exclusive timestamp; `null` means open-ended. |
| `scope` | `string` | nodes, edges | Scope label (e.g., `runtime`, `build`, `dev-dependency`). |
| `labels` | `array[string]` | nodes | Free-form but deterministic ordering (ASCII sort). |
| `confidence` | `number` | edges | Confidence score between 0 and 1 for overlay-derived edges. |
| `evidence_digest` | `string` | edges | SHA-256 digest referencing the immutable evidence payload. |
| `linkset_digest` | `string` | nodes, edges | SHA-256 digest referencing Concelier linkset documents. |
| `explain_hash` | `string` | nodes, edges | Hash of Policy Engine explain trace payload. |
| `runtime_state` | `string` | `runtime_context` nodes | Aggregated runtime state (e.g., `Running`, `Terminated`). |
## 5. Identity rules
1. **Node IDs (`gn:` prefix).**
`id = "gn:" + tenant + ":" + kind + ":" + base32(sha256(identity_tuple))`\
`identity_tuple` concatenates tuple components with `|` (no escaping) and lower-cases both keys and values unless the component is a hash or digest.
2. **Edge IDs (`ge:` prefix).**
`id = "ge:" + tenant + ":" + kind + ":" + base32(sha256(identity_tuple))`\
Edge tuples must include the resolved node IDs rather than only the canonical keys to ensure immutability under re-key events.
3. **Hashes.**
`hash` is computed by serializing the canonical document with:
- UTF-8 JSON
- Object keys sorted lexicographically
- Arrays sorted where semantics allow (e.g., `labels`, `matched_versions`)
- Timestamps normalized to UTC ISO-8601 (`YYYY-MM-DDTHH:MM:SSZ`)
4. **Deterministic provenance.**
`provenance.source` is a dotted string (`scanner.sbom.v1`, `concelier.linkset.v1`) and `provenance.event_offset` is a monotonic integer for replay.
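
The identity and hash rules above can be sketched in a few lines of Python. Several details are assumptions because the rules leave them open: only tuple values (not keys) are joined with `|`; a component counts as a hash or digest when its name ends in `_digest`, `_hash`, or `_sha256`; Base32 uses the RFC 4648 alphabet with padding stripped; the canonical JSON is compact; and `hash` is hex-encoded. All function names are hypothetical.

```python
import base64
import hashlib
import json
from datetime import datetime, timezone
from typing import Any, Dict, Sequence, Tuple

DIGEST_SUFFIXES = ("_digest", "_hash", "_sha256")  # assumption, see lead-in


def identity_tuple(components: Sequence[Tuple[str, str]]) -> str:
    """Join ordered (name, value) components with '|', lower-casing non-digests."""
    parts = []
    for name, value in components:
        is_digest = name.lower().endswith(DIGEST_SUFFIXES)
        parts.append(value if is_digest else value.lower())
    return "|".join(parts)


def make_id(prefix: str, tenant: str, kind: str,
            components: Sequence[Tuple[str, str]]) -> str:
    """Build 'gn:' or 'ge:' style identifiers from an ordered identity tuple."""
    digest = hashlib.sha256(identity_tuple(components).encode("utf-8")).digest()
    encoded = base64.b32encode(digest).decode("ascii").rstrip("=")
    return f"{prefix}:{tenant}:{kind}:{encoded}"


def normalize_timestamp(value: str) -> str:
    """Normalize an ISO-8601 timestamp to UTC 'YYYY-MM-DDTHH:MM:SSZ'."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def canonical_hash(document: Dict[str, Any]) -> str:
    """SHA-256 over UTF-8 JSON with lexicographically sorted object keys.

    Callers are expected to sort arrays and normalize timestamps beforehand.
    """
    payload = json.dumps(document, sort_keys=True, separators=(",", ":"),
                         ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Under these assumptions, two components whose `purl` differs only in letter case map to the same node ID, while digests keep their original casing.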
## 6. Validity window semantics
- `valid_from` equals the upstream event timestamp at ingestion time (SBOM collected timestamp, advisory published timestamp, policy evaluation timestamp, runtime observation timestamp).
- `valid_to` stays `null` until a newer version supersedes the record. Superseding records carry a `supersedes` reference in `attributes`.
- Snapshots freeze the set of nodes/edges with `valid_from <= snapshot_at < coalesce(valid_to, +∞)`.
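
The snapshot predicate can be illustrated with a short filter. This sketch assumes timestamps are already in the normalized UTC form `YYYY-MM-DDTHH:MM:SSZ`, which makes plain string comparison chronologically correct; the names `in_snapshot` and `freeze` are hypothetical.

```python
from typing import Any, Iterable, List, Mapping, Optional


def in_snapshot(valid_from: str, valid_to: Optional[str], snapshot_at: str) -> bool:
    """valid_from <= snapshot_at < valid_to, where a null valid_to is open-ended."""
    return valid_from <= snapshot_at and (valid_to is None or snapshot_at < valid_to)


def freeze(records: Iterable[Mapping[str, Any]], snapshot_at: str) -> List[Mapping[str, Any]]:
    """Select the nodes/edges that are effective at the snapshot instant."""
    return [
        record for record in records
        if in_snapshot(record["valid_from"], record.get("valid_to"), snapshot_at)
    ]
```

Because `valid_to` is exclusive, a record superseded exactly at `snapshot_at` is left out of that snapshot.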
## 7. Fixtures & verification
- Seed fixtures live under `tests/Graph/StellaOps.Graph.Indexer.Tests/Fixtures/v1/`.
- Fixture files:
- `nodes.json` — canonical node samples (per node kind).
- `edges.json` — canonical edge samples including overlay references.
- `schema-matrix.json` — lists attribute coverage per node/edge kind for regression tests.
- Unit tests assert:
- Identifier determinism (`GraphIdentityTests.NodeIds_are_stable`).
- Hash determinism under property ordering variations.
- Attribute coverage against `schema-matrix.json`.
- Fixtures follow the attribute dictionary above; new attributes require dictionary updates and fixture refresh.
## 8. Change control
- Increment schema version in fixture folder (`v1`, `v2`, …) when making breaking changes.
- Update this document and the JSON fixtures together; do not ship mismatched versions.
- Notify SBOM Service, Concelier, Excititor, Policy, Signals, and Zastava owners before promoting changes to DOING/DONE state.
## 9. References
- `docs/modules/graph/architecture.md` — high-level architecture.
- `docs/modules/platform/architecture-overview.md` — platform context.
- `src/Graph/StellaOps.Graph.Indexer/TASKS.md` — task tracking.
- `seed-data/` — additional sample payloads for offline kit packaging (future work).