git.stella-ops.org/docs/modules/graph/README.md
2025-12-25 18:50:33 +02:00

# StellaOps Graph
Graph Indexer + Graph API build the tenant-scoped knowledge graph that powers blast-radius analysis, provenance timelines, and saved-query automation across StellaOps. Cartographer has been retired as of 2025-10-30 (see `docs/updates/2025-10-30-devops-governance.md`); this module now owns ingestion, storage, overlays, and query surfaces for graph data.
## Scope & responsibilities
- Ingest SBOM snapshots, advisory/VEX events, policy overlays, and runtime signals to maintain a first-party graph representation with deterministic node/edge identities.
- Serve APIs and saved-query tooling for impact analysis, dependency traversal, diffing, and policy/VEX overlays with explainable provenance.
- Supply Graph Explorer UI/CLI experiences, plus Offline Kit exports (`nodes.jsonl`, `edges.jsonl`, `overlays/`) with DSSE manifests for air-gapped replay. Analytics overlays are emitted as NDJSON (`overlays/clusters.ndjson`, `overlays/centrality.ndjson`) with deterministic ordering; PostgreSQL-backed providers support production wiring.
- Maintain the [Graph Index Canonical Schema](schema.md) and coordinate query/overlay lifecycle with Scheduler, Policy Engine, Vulnerability Explorer, and Export Center.
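
One way to picture "deterministic node/edge identities" and "deterministic ordering" is the sketch below. It is illustrative only and not the Graph Indexer's actual scheme: the `node_id` helper, the `gn:` prefix, and the choice of tenant/kind/key as identity inputs are assumptions made for this example.

```python
import hashlib
import json


def node_id(tenant: str, kind: str, key: str) -> str:
    """Derive a stable ID from the fields assumed to define a node."""
    canonical = json.dumps(
        {"kind": kind, "key": key, "tenant": tenant},
        sort_keys=True,
        separators=(",", ":"),
    )
    # "gn:" is a hypothetical prefix, not a documented StellaOps convention.
    return "gn:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def to_jsonl(records: list[dict]) -> str:
    """Serialise records as JSON Lines, sorted by ID, with sorted keys."""
    ordered = sorted(records, key=lambda r: r["id"])
    return "".join(
        json.dumps(r, sort_keys=True, separators=(",", ":")) + "\n" for r in ordered
    )


nodes = [
    {"id": node_id("acme", "component", "pkg:npm/lodash@4.17.21"), "kind": "component"},
    {"id": node_id("acme", "artifact", "sha256:abc"), "kind": "artifact"},
]

# Input order does not change the exported bytes.
assert to_jsonl(nodes) == to_jsonl(list(reversed(nodes)))
```

Hashing a canonical serialisation means the same inputs always yield the same ID, and sorting before writing keeps `nodes.jsonl`-style exports byte-for-byte reproducible across runs.
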
## Architecture snapshot (Sprint 30 groundwork)
- **Graph Indexer service** — consumes SBOM (`sbom_snapshot`), advisory, and VEX events; normalises identifiers; persists into `graph_nodes`, `graph_edges`, `graph_snapshots`, and overlay caches with tenant partitions.
- **Graph API service** — exposes `GET /graph/nodes`, `/graph/impact/{advisory}`, `/graph/query/saved`, `/graph/diff`, and overlay endpoints with RBAC scopes defined in Authority (`docs/updates/2025-10-26-authority-graph-scopes.md`).
- **Overlay & diff workers** — materialise impact lists, saved-query caches, and signed diff manifests; feed Scheduler `GraphBuildJob`/`GraphOverlayJob` contracts (`docs/updates/2025-10-26-scheduler-graph-jobs.md`).
- **Console & CLI integrations** — planned modules deliver WebGL explorer, timeline viz, and CLI `stella sbom graph ...` commands aligned with implementation plan phases.
- **Storage abstraction** — supports document + adjacency (PostgreSQL) or pluggable graph engine; both paths enforce deterministic ordering and export manifests.
## Current workstreams (Q4 2025)
- `GRAPH-SVC-30-00x` (see `src/Graph/StellaOps.Graph.Indexer/TASKS.md`) — stand up Graph Indexer pipeline, identity registry, snapshot exports.
- Active sprint: `docs/implplan/SPRINT_0141_0001_0001_graph_indexer.md` (Runtime & Signals 140.A) — clustering/centrality jobs, incremental/backfill pipeline, determinism tests, packaging.
- `GRAPH-API-30-00x` — draft API planner/cost guard, streaming responses, and Authority scope integration.
- `DOCS-GRAPH-24-003` & related backlog — author overview/API/query language docs; update this README again once those deliverables land.
- Deployment/DevOps follow-ups (`DEVOPS-VEX-30-001`, `DEPLOY-VEX-30-001`) coordinate dashboards, load tests, and Helm/Compose overlays for the graph stack.
## Integrations & dependencies
- **SBOM Service** (Scanner WebService + Worker) produces `sbom_snapshot` events consumed by Graph Indexer.
- **Concelier/Excititor** contribute advisory + VEX edges; VEX Lens consensus overlays attach to graph nodes as attributes.
- **Policy Engine & Scheduler** trigger recompute jobs and consume overlays for risk/impact automation.
- **Vulnerability Explorer & Console** surface graph queries, saved views, and diff visualisations.
- **Authority** defines scopes (`graph.viewer`, `graph.operator`) and client registrations; secrets managed via existing platform patterns.
## Data, observability & offline
- Collections/tables: `graph_nodes`, `graph_edges`, `graph_snapshots`, `graph_saved_queries`, `graph_overlays_cache`, append-only change logs for replay.
- Metrics: `graph_ingest_lag_seconds`, `graph_nodes_total`, `graph_query_latency_seconds{queryId}`, overlay/diff duration counters.
- Logs/traces: structured ETL logs, query planner traces, WebGL interaction telemetry (once UI lands).
- Offline bundles: deterministic `nodes.jsonl`, `edges.jsonl`, overlay manifests + DSSE signatures, consumable by Export Center and CLI mirroring.
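
As a rough illustration of what a signed offline bundle involves, the sketch below builds a digest manifest for the export files and computes the DSSE pre-authentication encoding (PAE) that a signer would sign. The PAE layout follows the public DSSE specification; the manifest structure, the `build_manifest` helper, and the payload type string are assumptions for this example, not the Export Center format.

```python
import hashlib
import json


def build_manifest(files: dict[str, bytes]) -> bytes:
    """Build a canonical JSON manifest of file digests (hypothetical layout)."""
    entries = [
        {"path": path, "sha256": hashlib.sha256(data).hexdigest(), "size": len(data)}
        for path, data in sorted(files.items())
    ]
    return json.dumps(
        {"files": entries}, sort_keys=True, separators=(",", ":")
    ).encode("utf-8")


def dsse_pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE PAE: "DSSEv1" SP LEN(type) SP type SP LEN(body) SP body."""
    type_bytes = payload_type.encode("utf-8")
    return b"DSSEv1 %d %b %d %b" % (len(type_bytes), type_bytes, len(payload), payload)


bundle = {
    "nodes.jsonl": b'{"id":"n1"}\n',
    "edges.jsonl": b'{"source":"n1","target":"n2"}\n',
}
manifest = build_manifest(bundle)

# The payload type below is a placeholder, not a registered media type.
to_sign = dsse_pae("application/vnd.example.graph-manifest+json", manifest)
```

The signature itself is omitted here; in DSSE, the bytes returned by `dsse_pae` are what the signing key covers, and the envelope carries the payload, payload type, and signatures.
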
## Operations & runbook (Sprint 030)
- Dashboards: import `Observability/graph-api-grafana.json` (panels for latency, budget denials, overlay cache ratio, export latency). Apply tenant filter in every panel.
- Health checks: `/healthz` should return HTTP 200; the search, query, paths, diff, and export endpoints require the `X-Stella-Tenant` and `Authorization` headers plus the relevant scopes (`graph:read`, `graph:query`, `graph:export`).
- Key metrics (new):
  - `graph_tile_latency_seconds` histogram (label `route`); alert when p95 > 1.5s for 5m.
  - `graph_query_budget_denied_total` counter (label `reason`); investigate spikes (>50 in 5m).
  - `graph_overlay_cache_hits_total` / `graph_overlay_cache_misses_total`; watch miss ratio > 0.4 for 10m.
  - `graph_export_latency_seconds` histogram (label `format`); alert when p95 > 2s for ndjson/graphml.
- Triage playbook:
  - Budget denials: lower default edges/nodes budget or guide callers to request smaller scopes; verify overlay includes are truly required.
  - Overlay cache misses: ensure cache TTL is ≥5m; check overlay service connectivity to Policy Engine; warm cache by replaying recent hot nodes.
  - Export slowness: reduce export `Limit`, offload PNG/SVG to worker, and confirm disk I/O headroom.
- If alerts fire, capture the tenant, route, cursor/budget values, and recent deploy SHA in the incident note.
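
The cache and budget thresholds in the runbook reduce to simple arithmetic over counter values. The sketch below shows that arithmetic under stated assumptions: the function names are hypothetical, the inputs are counter increases over the window, and neither counter resets nor the "sustained for 10m" condition is modelled (an alerting system would normally handle both).

```python
def miss_ratio(hits: int, misses: int) -> float:
    """Fraction of overlay cache lookups that missed (0.0 when there is no traffic)."""
    total = hits + misses
    return misses / total if total else 0.0


def cache_alert(hits: int, misses: int, threshold: float = 0.4) -> bool:
    """True when the miss ratio exceeds the watch level from the runbook."""
    return miss_ratio(hits, misses) > threshold


def denial_spike(count_now: int, count_5m_ago: int, threshold: int = 50) -> bool:
    """True when the denial counter grew by more than `threshold` in the window."""
    return (count_now - count_5m_ago) > threshold


# 460 misses out of 1,000 lookups is a 0.46 miss ratio, above the 0.4 watch level.
print(cache_alert(hits=540, misses=460))  # True
# The counter grew by 75 in the window, above the 50-denial spike level.
print(denial_spike(1_275, 1_200))  # True
```
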
## Key docs & updates
- [`architecture.md`](architecture.md) — inputs, pipelines, APIs, storage choices, observability, offline handling.
- [`implementation_plan.md`](implementation_plan.md) — phased delivery roadmap, work breakdown, risks, test strategy.
- [`schema.md`](schema.md) — canonical node/edge schema and attribute dictionary (keep in sync with indexer code).
- API surface: `docs/api/graph-gateway-spec-draft.yaml` (NDJSON tiles for `/graph/search|query|paths|diff|export`, budgets, overlays).
- Updates: `docs/updates/2025-10-26-scheduler-graph-jobs.md`, `docs/updates/2025-10-26-authority-graph-scopes.md`, `docs/updates/2025-10-30-devops-governance.md` for the latest decisions/dependencies.
- Index: see `architecture-index.md` for data model, ingestion pipeline, overlays/caches, events, and API/observability pointers.
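
Because the gateway spec describes NDJSON tiles, a client can process a response one line at a time instead of buffering the whole body. The sketch below shows only that parsing pattern; the `type`, `id`, `source`, and `target` fields in the sample are hypothetical stand-ins, and the real tile shape is whatever `docs/api/graph-gateway-spec-draft.yaml` defines.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_tiles(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical sample stream; the real tile fields are defined in the spec draft.
sample = """\
{"type": "node", "id": "n1"}
{"type": "node", "id": "n2"}

{"type": "edge", "source": "n1", "target": "n2"}
"""

counts = Counter(tile["type"] for tile in iter_tiles(sample.splitlines()))
print(dict(counts))  # {'node': 2, 'edge': 1}
```

The same generator works over any iterable of text lines, so it can wrap a streamed HTTP response body as readily as an in-memory string.
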
## Epic alignment
- **Epic 5 (SBOM Graph Explorer):** Graph Indexer, Graph API, saved queries, overlays, Console/CLI experiences, Offline Kit parity.
- Cross-epic ties: Policy reasoning (explain overlays), Scheduler recompute, Notify/Task Runner integration for graph incidents.
## Implementation Status
### Delivery Phases
- **Phase 1 (Graph Indexer foundations):** Stand up Graph Indexer service, node/edge schemas, ingestion from SBOM/Concelier/Excititor events, identity stability, snapshot materialisation
- **Phase 2 (Graph API service):** Expose search, query, path, impact, diff, and overlay endpoints with RBAC, cost controls, streaming responses
- **Phase 3 (Console & CLI experiences):** Ship Graph Explorer UI (WebGL canvas, filters, diff mode, overlays) and CLI for automation pipelines
- **Phase 4 (Advanced analytics):** Implement clustering, centrality, saved queries, overlay caching, Policy Engine explain integration
- **Phase 5 (Exports & offline):** Deliver GraphML/CSV/NDJSON exports, Offline Kit bundles with deterministic manifests
- **Phase 6 (Observability & hardening):** Complete dashboards, alerts, runbooks, load/perf testing, a11y review
### Acceptance Criteria
- Graph Indexer ingests SBOM/advisory/VEX events deterministically with tenant isolation and append-only provenance
- Graph API serves endpoints within budgeted latency and enforces cost limits + RBAC
- Console explorer visualises topology, overlays, diffs; CLI commands mirror functionality for automation
- Exports and Offline Kit bundles reproduce snapshots and overlays with signed manifests
- Observability dashboards/alerts detect ingest lag, query failures, cache churn, memory pressure; runbooks guide remediation
- Policy/VEX overlays align with Policy Engine explain traces and VEX suppressions
### Key Risks & Mitigations
- **Graph scale/complexity:** Adopt adjacency compression, cached overlays, streaming pagination, enforced query budgets
- **Tenant bleed:** Strict tenant filters, fuzz tests, data masking, compliance reviews
- **Runaway queries/visualisation:** Cost planner, query timeout, UI hints, safe mode renders
- **Cache poisoning:** Input validation, schema versioning, eviction policies
- **Offline parity gaps:** Deterministic export pipeline, integration tests for Offline Kit import
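
To make the "enforced query budgets" mitigation concrete, the sketch below shows a breadth-first traversal that aborts once a node or edge budget is exhausted. It is a minimal illustration, not the Graph API's cost planner: `Budget`, `BudgetExceeded`, and `traverse` are hypothetical names, and a real planner would likely estimate cost before execution as well.

```python
from collections import deque
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a traversal would exceed its budget; `reason` names the limit."""

    def __init__(self, reason: str) -> None:
        super().__init__(reason)
        self.reason = reason


@dataclass(frozen=True)
class Budget:
    max_nodes: int
    max_edges: int


def traverse(adjacency: dict[str, list[str]], start: str, budget: Budget) -> list[str]:
    """Breadth-first traversal that stops as soon as the budget is exhausted."""
    seen = {start}
    order = [start]
    queue = deque([start])
    edges_walked = 0
    while queue:
        node = queue.popleft()
        # Sorting neighbours keeps the visit order deterministic.
        for neighbour in sorted(adjacency.get(node, [])):
            edges_walked += 1
            if edges_walked > budget.max_edges:
                raise BudgetExceeded("edges")
            if neighbour not in seen:
                if len(seen) >= budget.max_nodes:
                    raise BudgetExceeded("nodes")
                seen.add(neighbour)
                order.append(neighbour)
                queue.append(neighbour)
    return order


graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
print(traverse(graph, "a", Budget(max_nodes=10, max_edges=10)))  # ['a', 'b', 'c', 'd']
```

The `reason` attribute mirrors the idea of labelling denials by cause, as the `graph_query_budget_denied_total` counter does with its `reason` label.
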
### Current Active Sprint
- Runtime & Signals 140.A: Clustering/centrality jobs, incremental/backfill pipeline, determinism tests, packaging