Add unit tests for SBOM ingestion and transformation
Some checks failed
Docs CI / lint-and-preview (push) Has been cancelled
- Implement `SbomIngestServiceCollectionExtensionsTests` to verify the SBOM ingestion pipeline exports snapshots correctly.
- Create `SbomIngestTransformerTests` to ensure the transformation produces expected nodes and edges, including deduplication of license nodes and normalization of timestamps.
- Add `SbomSnapshotExporterTests` to test the export functionality for manifest, adjacency, nodes, and edges.
- Introduce `VexOverlayTransformerTests` to validate the transformation of VEX nodes and edges.
- Set up project file for the test project with necessary dependencies and configurations.
- Include JSON fixture files for testing purposes.
@@ -2,11 +2,12 @@
|
||||
|
||||
The Graph module (upcoming) will power graph-indexed queries for SBOM relationships, lineage, and blast-radius analysis.
|
||||
|
||||
## Responsibilities
|
||||
- Model SBOM and advisory entities as a navigable graph.
- Provide APIs for dependency impact, provenance chains, and reachability analysis.
- Integrate with Scheduler/Policy for graph-driven re-evaluation.
- Expose tooling for offline explorers.
|
||||
- Maintain [Graph Index Canonical Schema](schema.md) with deterministic identities, fixtures, and attribute dictionary.
|
||||
|
||||
### Domain highlights (Epic 5)
|
||||
- **Nodes:** artifacts/images, SBOM components, packages/versions, files/paths, licences, advisories, VEX statements, provenance attestations, policy versions.
|
||||
|
||||
@@ -38,8 +38,9 @@
|
||||
|
||||
## 5) Offline & export
|
||||
|
||||
- Each snapshot packages `nodes.jsonl`, `edges.jsonl`, `overlays/` plus manifest with hash, counts, and provenance. Export Center consumes these artefacts for graph-specific bundles.
- Saved queries and overlays include deterministic IDs so Offline Kit consumers can import and replay results.
|
||||
- Runtime hosts register the SBOM ingest pipeline via `services.AddSbomIngestPipeline(...)`. Snapshot exports default to `./artifacts/graph-snapshots` but can be redirected with `STELLAOPS_GRAPH_SNAPSHOT_DIR` or the `SbomIngestOptions.SnapshotRootDirectory` callback.
|
||||
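The directory selection described in the last bullet can be illustrated with a small, language-neutral sketch (written in Python rather than the module's own C#). The precedence shown here (explicit option first, then `STELLAOPS_GRAPH_SNAPSHOT_DIR`, then the documented default) is an assumption, and `resolve_snapshot_dir` is a hypothetical helper, not part of the pipeline's API.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_SNAPSHOT_DIR = "./artifacts/graph-snapshots"  # default named in the docs
SNAPSHOT_DIR_ENV = "STELLAOPS_GRAPH_SNAPSHOT_DIR"     # override named in the docs


def resolve_snapshot_dir(
    configured: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Path:
    """Pick the snapshot root directory.

    Assumed precedence (not confirmed by the docs): explicit option,
    then the environment variable, then the default.
    """
    env = os.environ if env is None else env
    for candidate in (configured, env.get(SNAPSHOT_DIR_ENV)):
        # Blank or whitespace-only values are treated as "not set".
        if candidate and candidate.strip():
            return Path(candidate.strip())
    return Path(DEFAULT_SNAPSHOT_DIR)
```

Passing `env` explicitly keeps the helper easy to exercise in tests without mutating the process environment.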
|
||||
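As a rough illustration of a snapshot manifest carrying hashes, counts, and provenance, the sketch below summarises the two JSONL payloads. The manifest layout (`files`, `sha256`, `count`) and the function names are hypothetical; the real manifest format is not specified in this section.

```python
import hashlib
import json
from typing import Dict, Mapping


def describe_jsonl(data: bytes) -> Dict[str, object]:
    """Summarise one JSONL payload: SHA-256 of the raw bytes plus record count."""
    records = [line for line in data.splitlines() if line.strip()]
    for line in records:
        json.loads(line)  # raises ValueError if a record is not valid JSON
    return {"sha256": hashlib.sha256(data).hexdigest(), "count": len(records)}


def build_manifest(nodes: bytes, edges: bytes, provenance: Mapping[str, str]) -> dict:
    """Assemble a manifest dictionary (field names are assumptions)."""
    return {
        "files": {
            "nodes.jsonl": describe_jsonl(nodes),
            "edges.jsonl": describe_jsonl(edges),
        },
        "provenance": dict(provenance),
    }
```

Hashing the raw bytes rather than the parsed records means any byte-level change to an exported file shows up in the manifest.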
## 6) Observability
|
||||
|
||||
@@ -47,10 +48,14 @@
|
||||
- Logs: structured events for ETL stages and query execution (with trace IDs).
- Traces: ETL pipeline spans, query engine spans.
|
||||
|
||||
## 7) Rollout notes
|
||||
|
||||
- Phase 1: ingest SBOM + advisories, deliver impact queries.
- Phase 2: add VEX overlays, policy overlays, diff tooling.
- Phase 3: expose runtime/Zastava edges and AI-assisted recommendations (future).
|
||||
|
||||
### Local testing note
|
||||
|
||||
Set `STELLAOPS_TEST_MONGO_URI` to a reachable MongoDB instance before running `tests/Graph/StellaOps.Graph.Indexer.Tests`. The test harness falls back to `mongodb://127.0.0.1:27017`, then Mongo2Go, but the CI workflow requires the environment variable to be present to ensure upsert coverage runs against a managed database. Use `STELLAOPS_GRAPH_SNAPSHOT_DIR` (or the `AddSbomIngestPipeline` options callback) to control where graph snapshot artefacts land during local runs.
|
||||
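The fallback order in the note above can be sketched as follows. This is a minimal illustration, not the harness itself: `pick_mongo_target`, the `is_reachable` probe, and the `in_ci` flag are hypothetical names, and the returned labels are invented for the example.

```python
from typing import Callable, Mapping, Optional, Tuple

MONGO_URI_ENV = "STELLAOPS_TEST_MONGO_URI"
LOCAL_MONGO_URI = "mongodb://127.0.0.1:27017"


def pick_mongo_target(
    env: Mapping[str, str],
    is_reachable: Callable[[str], bool],
    in_ci: bool = False,
) -> Tuple[str, Optional[str]]:
    """Return (strategy, uri) following the documented fallback order."""
    uri = env.get(MONGO_URI_ENV)
    if uri:
        return ("env", uri)
    if in_ci:
        # CI must run upsert coverage against a managed database.
        raise RuntimeError(f"{MONGO_URI_ENV} must be set in CI")
    if is_reachable(LOCAL_MONGO_URI):
        return ("local", LOCAL_MONGO_URI)
    # No URI to return: an embedded Mongo2Go instance would be started instead.
    return ("mongo2go", None)
```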
|
||||
Refer to the module README and implementation plan for immediate context, and update this document once component boundaries and data flows are finalised.
|
||||
|
||||
98
docs/modules/graph/schema.md
Normal file
@@ -0,0 +1,98 @@
|
||||
# Graph Index Canonical Schema
|
||||
|
||||
> Ownership: Graph Indexer Guild • Version 2025-11-03 (Sprint 140)\
|
||||
> Scope: Canonical node and edge schemas, attribute dictionary, identity rules, and fixture references for the Graph Indexer foundations (GRAPH-INDEX-28-001).
|
||||
|
||||
## 1. Purpose
|
||||
- Provide a deterministic schema contract for graph indexing pipelines.
- Document the attribute dictionary consumed by SBOM, Advisory, VEX, Policy, and Runtime signal feeds.
- Define the identity rules that guarantee stable node and edge identifiers across rebuilds.
- Point implementers and QA to the seed fixtures used in unit/integration tests.
|
||||
|
||||
## 2. Node taxonomy
|
||||
| Node kind | Identity tuple (ordered) | Required attributes | Primary sources |
|-----------|--------------------------|---------------------|-----------------|
| `artifact` | `tenant`, `artifact_digest`, `sbom_digest` | `display_name`, `artifact_digest`, `sbom_digest`, `environment`, `labels[]`, `origin_registry`, `supply_chain_stage` | Scanner WebService, SBOM Service |
| `component` | `tenant`, `purl`, `source_type` | `purl`, `version`, `ecosystem`, `scope`, `license_spdx`, `usage` | SBOM Service analyzers |
| `file` | `tenant`, `artifact_digest`, `normalized_path`, `content_sha256` | `normalized_path`, `content_sha256`, `language_hint`, `size_bytes`, `scope` | SBOM layer analyzers |
| `license` | `tenant`, `license_spdx`, `source_digest` | `license_spdx`, `name`, `classification`, `notice_uri` | SBOM Service, Concelier |
| `advisory` | `tenant`, `advisory_source`, `advisory_id`, `content_hash` | `advisory_source`, `advisory_id`, `severity`, `published_at`, `content_hash`, `linkset_digest` | Concelier |
| `vex_statement` | `tenant`, `vex_source`, `statement_id`, `content_hash` | `status`, `statement_id`, `justification`, `issued_at`, `expires_at`, `content_hash` | Excititor |
| `policy_version` | `tenant`, `policy_pack_digest`, `effective_from` | `policy_pack_digest`, `policy_name`, `effective_from`, `expires_at`, `explain_hash` | Policy Engine |
| `runtime_context` | `tenant`, `runtime_fingerprint`, `collector`, `observed_at` | `runtime_fingerprint`, `collector`, `observed_at`, `cluster`, `namespace`, `workload_kind`, `runtime_state` | Signals, Zastava |
|
||||
|
||||
## 3. Edge taxonomy
|
||||
| Edge kind | Source → Target | Identity tuple (ordered) | Required attributes | Default validity |
|-----------|-----------------|--------------------------|---------------------|------------------|
| `CONTAINS` | `artifact` → `component` | `tenant`, `artifact_node_id`, `component_node_id`, `sbom_digest` | `detected_by`, `layer_digest`, `scope`, `evidence_digest` | `valid_from = sbom_collected_at`, `valid_to = null` |
| `DEPENDS_ON` | `component` → `component` | `tenant`, `component_node_id`, `dependency_purl`, `sbom_digest` | `dependency_purl`, `dependency_version`, `relationship`, `evidence_digest` | Derived from SBOM dependency graph |
| `DECLARED_IN` | `component` → `file` | `tenant`, `component_node_id`, `file_node_id`, `sbom_digest` | `detected_by`, `scope`, `evidence_digest` | Mirrors SBOM declaration |
| `BUILT_FROM` | `artifact` → `artifact` | `tenant`, `parent_artifact_node_id`, `child_artifact_digest` | `build_type`, `builder_id`, `attestation_digest` | Derived from provenance attestations |
| `AFFECTED_BY` | `component` → `advisory` | `tenant`, `component_node_id`, `advisory_node_id`, `linkset_digest` | `evidence_digest`, `matched_versions`, `cvss`, `confidence` | Concelier overlays |
| `VEX_EXEMPTS` | `component` → `vex_statement` | `tenant`, `component_node_id`, `vex_node_id`, `statement_hash` | `status`, `justification`, `impact_statement`, `evidence_digest` | Excititor overlays |
| `GOVERNS_WITH` | `policy_version` → `component` | `tenant`, `policy_node_id`, `component_node_id`, `finding_explain_hash` | `verdict`, `explain_hash`, `policy_rule_id`, `evaluation_timestamp` | Policy Engine overlays |
| `OBSERVED_RUNTIME` | `runtime_context` → `component` | `tenant`, `runtime_node_id`, `component_node_id`, `runtime_fingerprint` | `process_name`, `entrypoint_kind`, `runtime_evidence_digest`, `confidence` | Signals/Zastava ingestion |
|
||||
|
||||
## 4. Attribute dictionary
|
||||
| Attribute | Type | Applies to | Description |
|-----------|------|------------|-------------|
| `tenant` | `string` | nodes, edges | Tenant identifier (enforced on storage and query). |
| `kind` | `string` | nodes, edges | One of the values listed in the taxonomy tables. |
| `canonical_key` | `object` | nodes | Ordered tuple persisted as a JSON object matching the identity tuple components. |
| `id` | `string` | nodes, edges | Deterministic identifier (`gn:` or `ge:` prefix + Base32-encoded SHA-256). |
| `hash` | `string` | nodes, edges | SHA-256 of the canonical JSON representation (normalized by sorted keys). |
| `attributes` | `object` | nodes, edges | Domain-specific attributes (all dictionary keys kebab-case). |
| `provenance` | `object` | nodes, edges | Includes `source`, `collected_at`, `sbom_digest`, `attestation_digest`, `event_offset`. |
| `valid_from` | `string (ISO-8601)` | nodes, edges | Inclusive timestamp describing when the record became effective. |
| `valid_to` | `string (ISO-8601 or null)` | nodes, edges | Exclusive timestamp; `null` means open-ended. |
| `scope` | `string` | nodes, edges | Scope label (e.g., `runtime`, `build`, `dev-dependency`). |
| `labels` | `array[string]` | nodes | Free-form but deterministic ordering (ASCII sort). |
| `confidence` | `number` | edges | 0-1 numeric confidence score for overlay-derived edges. |
| `evidence_digest` | `string` | edges | SHA-256 digest referencing the immutable evidence payload. |
| `linkset_digest` | `string` | nodes, edges | SHA-256 digest to Concelier linkset documents. |
| `explain_hash` | `string` | nodes, edges | Hash of Policy Engine explain trace payload. |
| `runtime_state` | `string` | `runtime_context` nodes | Aggregated runtime state (e.g., `Running`, `Terminated`). |
|
||||
|
||||
## 5. Identity rules
|
||||
1. **Node IDs (`gn:` prefix).**
   `id = "gn:" + tenant + ":" + kind + ":" + base32(sha256(identity_tuple))`\
   `identity_tuple` concatenates tuple components with `|` (no escaping) and lower-cases both keys and values unless the component is a hash or digest.
2. **Edge IDs (`ge:` prefix).**
   `id = "ge:" + tenant + ":" + kind + ":" + base32(sha256(identity_tuple))`\
   Edge tuples must include the resolved node IDs rather than only the canonical keys to ensure immutability under re-key events.
3. **Hashes.**
   `hash` is computed by serializing the canonical document with:
   - UTF-8 JSON
   - Object keys sorted lexicographically
   - Arrays sorted where semantics allow (e.g., `labels`, `matched_versions`)
   - Timestamps normalized to UTC ISO-8601 (`YYYY-MM-DDTHH:MM:SSZ`)
4. **Deterministic provenance.**
   `provenance.source` is a dotted string (`scanner.sbom.v1`, `concelier.linkset.v1`) and `provenance.event_offset` is a monotonic integer for replay.
|
||||
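The node and edge ID rules above can be sketched as below. Several details are assumptions rather than part of the schema: tuple components are rendered as `key=value` pairs, a component counts as a hash or digest when its key ends in `digest`, `hash`, or `sha256`, and Base32 is RFC 4648 with padding stripped. The function names are hypothetical.

```python
import base64
import hashlib
from typing import Sequence, Tuple

# Assumption: key suffixes that mark a component as a hash or digest.
CASE_SENSITIVE_SUFFIXES = ("digest", "hash", "sha256")


def identity_tuple(components: Sequence[Tuple[str, str]]) -> str:
    """Join ordered (key, value) pairs with '|' and no escaping.

    Keys are always lower-cased; values are lower-cased unless the
    component looks like a hash or digest.
    """
    parts = []
    for key, value in components:
        key_l = key.lower()
        keep_case = key_l.endswith(CASE_SENSITIVE_SUFFIXES)
        parts.append(f"{key_l}={value if keep_case else value.lower()}")
    return "|".join(parts)


def _graph_id(prefix: str, tenant: str, kind: str,
              components: Sequence[Tuple[str, str]]) -> str:
    digest = hashlib.sha256(identity_tuple(components).encode("utf-8")).digest()
    # Assumption: RFC 4648 Base32 alphabet, '=' padding removed.
    encoded = base64.b32encode(digest).decode("ascii").rstrip("=")
    return f"{prefix}:{tenant}:{kind}:{encoded}"


def node_id(tenant: str, kind: str, components: Sequence[Tuple[str, str]]) -> str:
    return _graph_id("gn", tenant, kind, components)


def edge_id(tenant: str, kind: str, components: Sequence[Tuple[str, str]]) -> str:
    # Callers pass resolved node IDs (e.g. `artifact_node_id`) in the tuple.
    return _graph_id("ge", tenant, kind, components)
```

Because edge tuples embed the resolved node IDs, an edge ID changes whenever either endpoint's identity changes, which is the immutability property rule 2 describes.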
|
||||
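The canonical `hash` rule (rule 3 above) can be approximated with the sketch below. It assumes timestamps arrive as `datetime` objects (naive values are treated as UTC), that only `labels` and `matched_versions` are sortable arrays, that the document passed in does not already contain its own `hash` field, and that the result is hex-encoded. None of these choices is confirmed by the schema.

```python
import hashlib
import json
from datetime import datetime, timezone

# Arrays the schema names as safe to sort; anything else keeps its order.
SORTABLE_ARRAYS = {"labels", "matched_versions"}


def _canonicalize(value, key=None):
    if isinstance(value, dict):
        return {k: _canonicalize(v, k) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        items = [_canonicalize(v) for v in value]
        return sorted(items) if key in SORTABLE_ARRAYS else items
    if isinstance(value, datetime):
        if value.tzinfo is None:
            value = value.replace(tzinfo=timezone.utc)  # assumption: naive == UTC
        return value.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return value


def canonical_hash(document: dict) -> str:
    """SHA-256 (hex) over UTF-8 JSON with lexicographically sorted keys."""
    payload = json.dumps(
        _canonicalize(document),
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two documents that differ only in key order, label order, or timestamp representation then produce the same hash, which is what the determinism tests in the fixtures section check for.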
## 6. Validity window semantics
|
||||
- `valid_from` equals the upstream event timestamp at ingestion time (SBOM collected timestamp, advisory published timestamp, policy evaluation timestamp, runtime observation timestamp).
- `valid_to` stays `null` until a newer version supersedes the record. Superseding records carry a `supersedes` reference in `attributes`.
- Snapshots freeze the set of nodes/edges with `valid_from <= snapshot_at < coalesce(valid_to, +∞)`.
|
||||
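The snapshot predicate `valid_from <= snapshot_at < coalesce(valid_to, +∞)` can be sketched as a simple filter. The record shape (plain dictionaries with `valid_from` and `valid_to` strings in the normalized `YYYY-MM-DDTHH:MM:SSZ` form) and the function names are assumptions for illustration.

```python
from datetime import datetime, timezone
from typing import Iterable, List


def _parse(ts: str) -> datetime:
    # Expects the normalized UTC form, e.g. 2025-11-03T00:00:00Z.
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def in_snapshot(record: dict, snapshot_at: str) -> bool:
    """Inclusive lower bound, exclusive upper bound; null valid_to is open-ended."""
    at = _parse(snapshot_at)
    valid_to = record.get("valid_to")
    return _parse(record["valid_from"]) <= at and (
        valid_to is None or at < _parse(valid_to)
    )


def freeze(records: Iterable[dict], snapshot_at: str) -> List[dict]:
    """Return the records that are effective at `snapshot_at`."""
    return [r for r in records if in_snapshot(r, snapshot_at)]
```

A record whose `valid_to` equals `snapshot_at` is excluded, since the upper bound is exclusive; its superseding record, whose `valid_from` carries the same timestamp, is included instead.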
|
||||
## 7. Fixtures & verification
|
||||
- Seed fixtures live under `tests/Graph/StellaOps.Graph.Indexer.Tests/Fixtures/v1/`.
- Fixture files:
  - `nodes.json`: canonical node samples (per node kind).
  - `edges.json`: canonical edge samples including overlay references.
  - `schema-matrix.json`: lists attribute coverage per node/edge kind for regression tests.
- Unit tests assert:
  - Identifier determinism (`GraphIdentityTests.NodeIds_are_stable`).
  - Hash determinism under property ordering variations.
  - Attribute coverage against `schema-matrix.json`.
- Fixtures follow the attribute dictionary above; new attributes require dictionary updates and fixture refresh.
|
||||
|
||||
## 8. Change control
|
||||
- Increment schema version in fixture folder (`v1`, `v2`, …) when making breaking changes.
- Update this document and the JSON fixtures together; do not ship mismatched versions.
- Notify SBOM Service, Concelier, Excititor, Policy, Signals, and Zastava owners before promoting changes to DOING/DONE state.
|
||||
|
||||
## 9. References
|
||||
- `docs/modules/graph/architecture.md`: high-level architecture.
- `docs/modules/platform/architecture-overview.md`: platform context.
- `src/Graph/StellaOps.Graph.Indexer/TASKS.md`: task tracking.
- `seed-data/`: additional sample payloads for offline kit packaging (future work).
|
||||