git.stella-ops.org/docs/modules/orchestrator/implementation_plan.md
2025-10-30 00:09:39 +02:00


Implementation plan — Source & Job Orchestrator

Delivery phases

  • Phase 1: Core service & job ledger
    Implement source registry, run/job tables, queue abstraction, lease management, token-bucket rate limiting, watchdogs, and API primitives (/sources, /runs, /jobs).
  • Phase 2: Worker SDK & artifact registry
    Embed worker SDK in Conseiller, Excititor, SBOM, Policy Engine; capture artifact metadata + hashes, enforce idempotency, publish progress/metrics.
  • Phase 3: Observability & dashboard
    Ship metrics, traces, incident logging, SSE/WebSocket feeds, and Console dashboard (DAG/timeline, heatmaps, error clustering, SLO burn rate).
  • Phase 4: Controls & resilience
    Deliver pause/resume/throttle/retry/backfill tooling, dead-letter review, circuit breakers, blackouts, backpressure handling, and automation hooks.
  • Phase 5: Offline & compliance
    Generate deterministic audit bundles (jobs.jsonl, history.jsonl, throttles.jsonl), provenance manifests, and offline replay scripts.
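
The deterministic bundle requirement in Phase 5 could be approached along the lines of the sketch below. It assumes that "deterministic" means byte-identical output for the same set of records, regardless of the order in which they were read. The record fields (`job_id`, `status`), the sort key, and the function names are hypothetical; only the file name `jobs.jsonl` comes from the plan.

```python
import hashlib
import json


def render_jsonl(records, sort_key="job_id"):
    """Serialize records to JSONL bytes that do not depend on input order."""
    # Sort rows by a stable key and sort object keys, so that the same
    # logical content always produces the same bytes.
    ordered = sorted(records, key=lambda record: record[sort_key])
    lines = [
        json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
        for record in ordered
    ]
    return ("\n".join(lines) + "\n").encode("utf-8")


def build_manifest(files):
    """Map each bundle file name to the SHA-256 hex digest of its bytes."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in sorted(files.items())}


# Hypothetical job records, deliberately out of order and with shuffled keys.
jobs = [
    {"job_id": "j-2", "status": "succeeded"},
    {"status": "failed", "job_id": "j-1"},
]
bundle = {"jobs.jsonl": render_jsonl(jobs)}
manifest = build_manifest(bundle)
```

A signature over the manifest (not shown) would then cover every file in the bundle through its digest.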

Work breakdown

  • Service & persistence
    • Postgres schema (sources, runs, jobs, artifacts, dag_edges, quotas, schedules, incidents).
    • Lease manager with heartbeats, retries, and dead-letter queues.
    • Token-bucket rate limiter per {tenant, source.host}; adaptive refill on upstream throttles.
    • Watermark/backfill orchestration for event-time windows.
  • Worker SDK
    • Claim/heartbeat/report contract, deterministic artifact hashing, idempotency enforcement.
    • Library release for .NET workers plus language bindings for Rust/Go ingestion agents.
  • Control plane APIs
    • CRUD for sources, runs, jobs, quotas, schedules; control actions (retry, cancel, prioritize, pause/resume, backfill).
    • SSE/WebSocket stream for Console updates.
  • Observability
    • Metrics: queue depth, job latency, failure classes, rate-limit hits, burn rate.
    • Error clustering (HTTP 429/5xx, schema mismatch, parse errors), incident logging with reason codes.
    • Gantt timeline and DAG JSON for Console visualisation.
  • Console & CLI
    • Console app sections: overview, sources, runs, job detail, incidents, throttles.
    • CLI commands: stella orchestrator sources|runs|jobs|throttle|backfill.
  • Compliance & offline
    • Immutable audit bundles with signatures; exports via Export Center; Offline Kit instructions.
    • Tenant isolation validation and secret redaction for logs/UI.
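
As an illustration of the per-{tenant, source.host} token bucket with adaptive refill listed under Service & persistence, here is a minimal in-memory sketch. The class names, the halving factor, and the recovery step are assumptions chosen for the example, not values taken from the plan; a real service would also need persistence and locking.

```python
import time


class TokenBucket:
    """Token bucket whose refill rate can be lowered on upstream throttles."""

    def __init__(self, capacity, refill_per_sec, clock=time.monotonic):
        self.capacity = capacity
        self.base_refill = refill_per_sec   # rate to recover towards
        self.refill_per_sec = refill_per_sec
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def _refill(self):
        now = self._clock()
        elapsed = max(0.0, now - self._last)
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_sec)
        self._last = now

    def try_acquire(self, cost=1.0):
        """Take `cost` tokens if available; never blocks."""
        self._refill()
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False

    def on_upstream_throttle(self, factor=0.5, floor=0.1):
        """Slow the refill rate, e.g. after an HTTP 429 (assumed policy)."""
        self._refill()  # settle tokens at the old rate first
        self.refill_per_sec = max(floor, self.refill_per_sec * factor)

    def on_success(self, step=0.1):
        """Recover the refill rate gradually, up to the configured base."""
        self._refill()
        self.refill_per_sec = min(self.base_refill, self.refill_per_sec + step)


class RateLimiter:
    """Keeps one bucket per (tenant, host) pair."""

    def __init__(self, capacity, refill_per_sec, clock=time.monotonic):
        self._capacity = capacity
        self._refill_per_sec = refill_per_sec
        self._clock = clock
        self._buckets = {}

    def bucket(self, tenant, host):
        key = (tenant, host)
        if key not in self._buckets:
            self._buckets[key] = TokenBucket(self._capacity, self._refill_per_sec, self._clock)
        return self._buckets[key]
```

Passing the clock in as a parameter keeps the limiter testable without real sleeps, which fits the unit-test focus on token-bucket behaviour later in the plan.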

Acceptance criteria

  • Orchestrator schedules all advisory/VEX/SBOM/policy jobs with quotas, rate limits, and idempotency; retries and replay preserve provenance.
  • Console dashboard reflects real-time DAG status, queue depth, SLO burn rate, and allows pause/resume/throttle/backfill with audit trail.
  • Worker SDK integrated across producer services, emitting progress and artifact metadata.
  • Observability stack exposes metrics, logs, traces, incidents, and alerts for stuck jobs, throttling, and failure spikes.
  • Offline audit bundles reproduce job history deterministically and verify signatures.
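
One way to read the idempotency and artifact-metadata criteria above is that re-running a job with the same inputs must not create a second artifact record. The sketch below assumes an idempotency key derived from tenant, job type, and a canonical form of the payload; the key layout, the `ArtifactRegistry` class, and its fields are hypothetical.

```python
import hashlib
import json


def idempotency_key(tenant, job_type, payload):
    """Derive a stable key from the job's identity and canonicalized payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    material = "\n".join([tenant, job_type, canonical])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


class ArtifactRegistry:
    """In-memory stand-in for an artifacts table keyed by idempotency key."""

    def __init__(self):
        self._by_key = {}

    def record(self, tenant, job_type, payload, artifact_bytes):
        """Return (entry, created). A retry returns the existing entry."""
        key = idempotency_key(tenant, job_type, payload)
        digest = hashlib.sha256(artifact_bytes).hexdigest()
        existing = self._by_key.get(key)
        if existing is not None:
            if existing["sha256"] != digest:
                # Same inputs but different output: surface it instead of
                # silently overwriting the earlier record.
                raise ValueError("same job produced a different artifact")
            return existing, False
        entry = {"key": key, "tenant": tenant, "job_type": job_type, "sha256": digest}
        self._by_key[key] = entry
        return entry, True
```

Rejecting a mismatched digest on retry is a design choice of this sketch; it treats non-deterministic output as an incident and keeps the first record intact.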

Risks & mitigations

  • Backpressure/queue overload: adaptive token buckets, circuit breakers, dynamic concurrency; degrade gracefully.
  • Upstream vendor throttles: throttle management with user-visible state, automatic retry with jitter.
  • Tenant leakage: enforce tenant filters at the API, queue, and storage layers; add fuzz tests and redaction.
  • Complex DAG errors: built-in diagnostics, error clustering, partial replay tooling.
  • Operator error: confirmation prompts, RBAC, runbook guidance, reason codes logged.
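
The retry-with-jitter mitigation for upstream throttles might look like the following sketch, which uses capped exponential backoff with full jitter. The `UpstreamThrottled` exception, the default base and cap, and the `flaky_fetch` demo are assumptions made for illustration.

```python
import random
import time


class UpstreamThrottled(Exception):
    """Hypothetical error raised by a fetcher on HTTP 429 or 5xx."""


def backoff_ceiling(attempt, base=1.0, cap=60.0):
    """Upper bound of the wait before retry number `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))


def call_with_retry(operation, max_attempts=5, base=1.0, cap=60.0,
                    sleep=time.sleep, rng=random.random):
    """Run `operation`, retrying throttled calls with full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except UpstreamThrottled:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted; let the caller dead-letter the job
            # Full jitter: wait a random fraction of the exponential ceiling
            # so that many workers do not retry in lockstep.
            sleep(rng() * backoff_ceiling(attempt, base, cap))


# Demo with injected sleep and randomness so it runs instantly.
calls = []
slept = []


def flaky_fetch():
    calls.append(1)
    if len(calls) < 3:
        raise UpstreamThrottled()
    return "ok"


result = call_with_retry(flaky_fetch, sleep=slept.append, rng=lambda: 0.5)
```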

Test strategy

  • Unit: scheduling, quota enforcement, lease renewals, token bucket, watermark arithmetic.
  • Integration: worker SDK with Conseiller/Excititor/SBOM pipelines, pause/resume/backfill flows, failure recovery.
  • Performance: high-volume job workloads, queue backpressure, concurrency caps, dashboard SSE load tests.
  • Chaos: simulate upstream outages, stuck workers, clock skew, Postgres failover.
  • Compliance: audit bundle generation, signature verification, offline replay.
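
To make the lease-renewal and stuck-worker cases above concrete, here is a small sketch of a lease manager driven by a fake clock, so that expiry can be tested without waiting. The `LeaseManager` and `FakeClock` classes and their method names are hypothetical and are not the orchestrator's actual contract.

```python
class FakeClock:
    """Manually advanced clock for deterministic tests."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class LeaseManager:
    """Tracks which worker holds each job and until when."""

    def __init__(self, ttl, clock):
        self._ttl = ttl
        self._clock = clock
        self._leases = {}  # job_id -> (worker_id, expires_at)

    def claim(self, job_id, worker_id):
        """Grant the lease unless another unexpired lease exists."""
        now = self._clock()
        holder = self._leases.get(job_id)
        if holder is not None and holder[1] > now:
            return False
        self._leases[job_id] = (worker_id, now + self._ttl)
        return True

    def heartbeat(self, job_id, worker_id):
        """Extend the lease; fails if it expired or belongs to someone else."""
        now = self._clock()
        holder = self._leases.get(job_id)
        if holder is None or holder[0] != worker_id or holder[1] <= now:
            return False
        self._leases[job_id] = (worker_id, now + self._ttl)
        return True

    def expired(self):
        """Job ids whose lease has lapsed (candidates for retry)."""
        now = self._clock()
        return sorted(job for job, (_, exp) in self._leases.items() if exp <= now)


clock = FakeClock()
leases = LeaseManager(ttl=30.0, clock=clock)
claimed = leases.claim("job-1", "worker-a")
clock.advance(20)                                # t=20, lease valid until 30
renewed = leases.heartbeat("job-1", "worker-a")  # now valid until 50
clock.advance(35)                                # t=55, worker stopped heartbeating
stuck = leases.expired()
```

In this scenario a watchdog could hand `job-1` to another worker once it appears in `stuck`.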

Definition of done

  • All phases delivered with telemetry, dashboards, and runbooks published.
  • Console + CLI parity validated; Offline Kit instructions complete.
  • ./TASKS.md and ../../TASKS.md updated with status; documentation (README/architecture/this plan) reflects latest behaviour.