Files
git.stella-ops.org/docs/modules/orchestrator/architecture.md
StellaOps Bot 17d45a6d30
Some checks failed
Airgap Sealed CI Smoke / sealed-smoke (push) Has been cancelled
Docs CI / lint-and-preview (push) Has been cancelled
Export Center CI / export-ci (push) Has been cancelled
feat: Implement Filesystem and MongoDB provenance writers for PackRun execution context
- Added `FilesystemPackRunProvenanceWriter` to write provenance manifests to the filesystem.
- Introduced `MongoPackRunArtifactReader` to read artifacts from MongoDB.
- Created `MongoPackRunProvenanceWriter` to store provenance manifests in MongoDB.
- Developed unit tests for filesystem and MongoDB provenance writers.
- Established `ITimelineEventStore` and `ITimelineIngestionService` interfaces for timeline event handling.
- Implemented `TimelineIngestionService` to validate and persist timeline events with hashing.
- Created PostgreSQL schema and migration scripts for timeline indexing.
- Added dependency injection support for timeline indexer services.
- Developed tests for timeline ingestion and schema validation.
2025-11-30 15:38:14 +02:00

5.3 KiB
Raw Blame History

Source & Job Orchestrator architecture

Based on Epic9 Source & Job Orchestrator Dashboard; this section outlines components, job lifecycle, rate-limit governance, and observability.

1) Topology

  • Orchestrator API (StellaOps.Orchestrator). Minimal API providing job state, throttling controls, replay endpoints, and dashboard data. Authenticated via Authority scopes (orchestrator:*).
  • Job ledger (Mongo). Collections jobs, job_history, sources, quotas, throttles, incidents. Append-only history ensures auditability.
  • Queue abstraction. Supports Mongo queue, Redis Streams, or NATS JetStream (pluggable). Each job carries lease metadata and retry policy.
  • Dashboard feeds. SSE/GraphQL endpoints supply Console UI with job timelines, throughput, error distributions, and rate-limit status.

2) Job lifecycle

  1. Enqueue. Producer services (Concelier, Excititor, Scheduler, Export Center, Policy Engine) submit JobRequest records containing jobType, tenant, priority, payloadDigest, dependencies.
  2. Scheduling. Orchestrator applies quotas and rate limits per {tenant, jobType}. Jobs exceeding limits are staged in pending queue with next eligible timestamp.
  3. Leasing (Task Runner bridge). Workers poll LeaseJob endpoint; Orchestrator returns job with leaseId, leaseUntil, idempotencyKey, and instrumentation tokens. Lease renewal required for long-running tasks; leases carry retry hints and provenance (tenant, project, correlationId, taskRunnerId).
  4. Completion. Worker reports status (succeeded, failed, canceled, timed_out). On success the job is archived; on failure Orchestrator applies retry policy (exponential backoff, max attempts). Incidents escalate to Ops if thresholds exceeded.
  5. Replay. Operators trigger POST /jobs/{id}/replay which clones job payload, sets replayOf pointer, and requeues with high priority while preserving determinism metadata.

Pack-run lifecycle (phase III)

  • Register pack-run job type with task runner hints (artifacts, log channel, heartbeat cadence).
  • Logs/Artifacts: SSE/WS stream keyed by packRunId + tenant/project; artifacts published with content digests and URI metadata.
  • Events: notifier payloads include envelope provenance (tenant, project, correlationId, idempotencyKey) pending ORCH-SVC-37-101 final spec.

3) Rate-limit & quota governance

  • Quotas defined per tenant/profile (maxActive, maxPerHour, burst). Stored in quotas and enforced before leasing.
  • Dynamic throttles allow ops to pause specific sources (pauseSource, resumeSource) or reduce concurrency.
  • Circuit breakers automatically pause job types when failure rate > configured threshold; incidents generated via Notify and Observability stack.
  • Control plane quota updates require Authority scope orch:quota (issued via Orch.Admin role). Historical rebuilds/backfills additionally require orch:backfill and must supply backfill_reason and backfill_ticket alongside the operator metadata. Authority persists all four fields (quota_reason, quota_ticket, backfill_reason, backfill_ticket) for audit replay.

4) APIs

  • GET /api/jobs?status= — list jobs with filters (tenant, jobType, status, time window).
  • GET /api/jobs/{id} — job detail (payload digest, attempts, worker, lease history, metrics).
  • POST /api/jobs/{id}/cancel — cancel running/pending job with audit reason.
  • POST /api/jobs/{id}/replay — schedule replay.
  • POST /api/limits/throttle — apply throttle (requires elevated scope).
  • GET /api/dashboard/metrics — aggregated metrics for Console dashboards.
  • Event envelope draft (docs/modules/orchestrator/event-envelope.md) defines notifier/webhook/SSE payloads with idempotency keys, provenance, and task runner metadata for job/pack-run events.
  • OpenAPI discovery: /.well-known/openapi exposes /openapi/orchestrator.json (OAS 3.1) with pagination/idempotency/error-envelope examples; legacy job detail/summary endpoints now ship Deprecation + Link headers that point to their replacements.

All responses include deterministic timestamps, job digests, and DSSE signature fields for offline reconciliation.

5) Observability

  • Metrics: job_queue_depth{jobType,tenant}, job_latency_seconds, job_failures_total, job_retry_total, lease_extensions_total.
  • Task Runner bridge adds pack_run_logs_stream_lag_seconds, pack_run_heartbeats_total, pack_run_artifacts_total.
  • Logs: structured with jobId, jobType, tenant, workerId, leaseId, status. Incident logs flagged for Ops.
  • Traces: spans covering enqueue, schedule, lease, worker_execute, complete. Trace IDs propagate to worker spans for end-to-end correlation.

6) Offline support

  • Orchestrator exports audit bundles: jobs.jsonl, history.jsonl, throttles.jsonl, manifest.json, signatures/. Used for offline investigations and compliance.
  • Replay manifests contain job digests and success/failure notes for deterministic proof.

7) Operational considerations

  • HA deployment with multiple API instances; queue storage determines redundancy strategy.
  • Support for maintenance mode halting leases while allowing status inspection.
  • Runbook includes procedures for expanding quotas, blacklisting misbehaving tenants, and recovering stuck jobs (clearing leases, applying pause/resume).