Source & Job Orchestrator architecture
Based on Epic 9 – Source & Job Orchestrator Dashboard; this section outlines components, job lifecycle, rate-limit governance, and observability.
1) Topology
- Orchestrator API (StellaOps.Orchestrator). Minimal API providing job state, throttling controls, replay endpoints, and dashboard data. Authenticated via Authority scopes (orchestrator:*).
- Job ledger (Mongo). Collections: jobs, job_history, sources, quotas, throttles, incidents. Append-only history ensures auditability.
- Queue abstraction. Supports Mongo queue, Redis Streams, or NATS JetStream (pluggable). Each job carries lease metadata and retry policy.
- Dashboard feeds. SSE/GraphQL endpoints supply Console UI with job timelines, throughput, error distributions, and rate-limit status.
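The pluggable queue described above can be pictured as a small contract that each backend satisfies. The following is a minimal sketch, not the Orchestrator's actual interface: the names JobQueue, QueuedJob, RetryPolicy, InMemoryQueue, and drain are hypothetical, and the in-memory backend exists only to exercise the contract.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Optional, Protocol


@dataclass(frozen=True)
class RetryPolicy:
    # Hypothetical defaults; real values would come from configuration.
    max_attempts: int = 3
    base_delay_seconds: float = 5.0


@dataclass
class QueuedJob:
    job_id: str
    payload_digest: str
    retry: RetryPolicy
    lease_id: Optional[str] = None       # set when a worker leases the job
    lease_until: Optional[float] = None  # epoch seconds


class JobQueue(Protocol):
    """Contract a Mongo, Redis Streams, or NATS JetStream backend would satisfy."""

    def enqueue(self, job: QueuedJob) -> None: ...
    def dequeue(self) -> Optional[QueuedJob]: ...
    def depth(self) -> int: ...


class InMemoryQueue:
    """FIFO stand-in used here only to exercise the contract."""

    def __init__(self) -> None:
        self._items: Deque[QueuedJob] = deque()

    def enqueue(self, job: QueuedJob) -> None:
        self._items.append(job)

    def dequeue(self) -> Optional[QueuedJob]:
        return self._items.popleft() if self._items else None

    def depth(self) -> int:
        return len(self._items)


def drain(queue: JobQueue) -> List[str]:
    """Pull every job id off any backend that satisfies JobQueue."""
    ids: List[str] = []
    while (job := queue.dequeue()) is not None:
        ids.append(job.job_id)
    return ids
```

Because callers depend only on the contract, swapping the backend would not change scheduling or leasing code.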
2) Job lifecycle
- Enqueue. Producer services (Concelier, Excititor, Scheduler, Export Center, Policy Engine) submit JobRequest records containing jobType, tenant, priority, payloadDigest, and dependencies.
- Scheduling. Orchestrator applies quotas and rate limits per {tenant, jobType}. Jobs that exceed a limit are staged in a pending queue with a next-eligible timestamp.
- Leasing. Workers poll the LeaseJob endpoint; Orchestrator returns a job with leaseId, leaseUntil, and instrumentation tokens. Lease renewal is required for long-running tasks.
- Completion. The worker reports a status (succeeded, failed, canceled, timed_out). On success the job is archived; on failure Orchestrator applies the retry policy (exponential backoff, max attempts). Incidents escalate to Ops if thresholds are exceeded.
- Replay. Operators trigger POST /jobs/{id}/replay, which clones the job payload, sets a replayOf pointer, and requeues the job with high priority while preserving determinism metadata.
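The lease and completion steps can be sketched as a small state machine. This is an illustration under stated assumptions, not the Orchestrator's implementation: the Job fields, the "leased" and "archived" state names, the lease id format, and the backoff constants (5 s base, 300 s cap) are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Job:
    job_id: str
    job_type: str
    tenant: str
    status: str = "pending"
    attempts: int = 0
    lease_id: Optional[str] = None
    lease_until: Optional[datetime] = None
    next_eligible: Optional[datetime] = None


def backoff_delay(attempt: int, base: float = 5.0, cap: float = 300.0) -> float:
    """Exponential backoff in seconds: base, 2*base, 4*base, ... capped at cap."""
    return min(cap, base * 2 ** (attempt - 1))


def lease(job: Job, now: datetime, ttl: timedelta) -> Job:
    """Hand a pending, eligible job to a worker with a fresh lease."""
    if job.status != "pending":
        raise ValueError(f"cannot lease a job in state {job.status!r}")
    if job.next_eligible is not None and now < job.next_eligible:
        raise ValueError("job is not eligible yet")
    job.attempts += 1
    job.status = "leased"
    # Assumed id format: derived from the job id so repeated runs are reproducible.
    job.lease_id = f"{job.job_id}:attempt-{job.attempts}"
    job.lease_until = now + ttl
    return job


def complete(job: Job, status: str, now: datetime, max_attempts: int = 3) -> Job:
    """Apply the worker-reported status; failures are retried with backoff."""
    if job.status != "leased":
        raise ValueError("only leased jobs can be completed")
    job.lease_id = None
    job.lease_until = None
    if status == "succeeded":
        job.status = "archived"
    elif status == "canceled":
        job.status = "canceled"
    elif status in ("failed", "timed_out"):
        if job.attempts < max_attempts:
            job.status = "pending"
            job.next_eligible = now + timedelta(seconds=backoff_delay(job.attempts))
        else:
            job.status = "failed"  # retries exhausted; an incident would be raised here
    else:
        raise ValueError(f"unknown status {status!r}")
    return job
```

A failed first attempt returns the job to pending with next_eligible set 5 seconds later; a third failure leaves it in the failed state.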
3) Rate-limit & quota governance
- Quotas are defined per tenant/profile (maxActive, maxPerHour, burst), stored in the quotas collection, and enforced before leasing.
- Dynamic throttles allow ops to pause specific sources (pauseSource, resumeSource) or reduce concurrency.
- Circuit breakers automatically pause job types when the failure rate exceeds a configured threshold; incidents are generated via Notify and the Observability stack.
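A quota check before leasing and a circuit-breaker trip test might look like the sketch below. The source does not define the exact semantics of burst, so this example assumes it caps leases within a short sliding window (60 seconds here); the function names, the min_samples guard, and the use of epoch-second floats are likewise assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Quota:
    max_active: int    # concurrent leased jobs allowed
    max_per_hour: int  # leases allowed in any rolling hour (must be >= 1)
    burst: int         # assumed: leases allowed in any burst window (must be >= 1)


def admit(
    quota: Quota,
    active: int,
    lease_times: List[float],
    now: float,
    burst_window: float = 60.0,
) -> Tuple[bool, Optional[float]]:
    """Return (allowed, next_eligible). next_eligible is None when it depends
    on a running job finishing rather than on the clock."""
    if active >= quota.max_active:
        return False, None
    last_hour = sorted(t for t in lease_times if now - t < 3600.0)
    if len(last_hour) >= quota.max_per_hour:
        # Eligible again once enough old leases fall out of the hourly window.
        return False, last_hour[-quota.max_per_hour] + 3600.0
    recent = [t for t in last_hour if now - t < burst_window]
    if len(recent) >= quota.burst:
        return False, recent[-quota.burst] + burst_window
    return True, now


def should_trip(failures: int, total: int, threshold: float, min_samples: int = 10) -> bool:
    """Trip the breaker when the failure rate exceeds the threshold,
    ignoring windows with too few samples to be meaningful."""
    if total < min_samples:
        return False
    return failures / total > threshold
```

Returning the next eligible time alongside the decision lets the scheduler stage a rejected job in the pending queue without re-evaluating it on every poll.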
4) APIs
- GET /api/jobs?status= - list jobs with filters (tenant, jobType, status, time window).
- GET /api/jobs/{id} - job detail (payload digest, attempts, worker, lease history, metrics).
- POST /api/jobs/{id}/cancel - cancel a running or pending job with an audit reason.
- POST /api/jobs/{id}/replay - schedule a replay.
- POST /api/limits/throttle - apply a throttle (requires elevated scope).
- GET /api/dashboard/metrics - aggregated metrics for Console dashboards.
All responses include deterministic timestamps, job digests, and DSSE signature fields for offline reconciliation.
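As an illustration of the list endpoint's filters, the helper below builds a query URL. Only the /api/jobs path and the status parameter appear in the list above; the other parameter names (tenant, jobType, from, to), the helper name, and the example host are assumptions made for this sketch.

```python
from typing import Optional
from urllib.parse import urlencode


def jobs_list_url(
    base_url: str,
    *,
    status: Optional[str] = None,
    tenant: Optional[str] = None,
    job_type: Optional[str] = None,
    since: Optional[str] = None,
    until: Optional[str] = None,
) -> str:
    """Build a GET /api/jobs URL, skipping filters that were not supplied."""
    params = [
        ("status", status),
        ("tenant", tenant),      # assumed parameter name
        ("jobType", job_type),   # assumed parameter name
        ("from", since),         # assumed parameter name for the time window start
        ("to", until),           # assumed parameter name for the time window end
    ]
    query = urlencode([(key, value) for key, value in params if value is not None])
    url = f"{base_url.rstrip('/')}/api/jobs"
    return f"{url}?{query}" if query else url


if __name__ == "__main__":
    print(jobs_list_url("https://orchestrator.example.internal", status="failed", tenant="tenant-a"))
```

Keeping the parameters in a fixed order makes the resulting URL stable across calls, which helps when requests are logged or cached.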
5) Observability
- Metrics: job_queue_depth{jobType,tenant}, job_latency_seconds, job_failures_total, job_retry_total, lease_extensions_total.
- Logs: structured with jobId, jobType, tenant, workerId, leaseId, status. Incident logs are flagged for Ops.
- Traces: spans covering enqueue, schedule, lease, worker_execute, complete. Trace IDs propagate to worker spans for end-to-end correlation.
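A structured log record carrying the fields listed above could be emitted as a single JSON line. This is a sketch only: the function name and the optional traceId field are assumptions, and a real service would route the line through its logging framework rather than return a string.

```python
import json
from typing import Mapping, Optional

# Field names taken from the Logs bullet above.
LOG_FIELDS = ("jobId", "jobType", "tenant", "workerId", "leaseId", "status")


def job_log_line(event: Mapping[str, str], trace_id: Optional[str] = None) -> str:
    """Serialize a job event as one JSON line with a stable key order."""
    missing = [name for name in LOG_FIELDS if name not in event]
    if missing:
        raise ValueError(f"missing log fields: {', '.join(missing)}")
    record = {name: event[name] for name in LOG_FIELDS}
    if trace_id is not None:
        record["traceId"] = trace_id  # assumed field name for trace correlation
    # sort_keys and compact separators keep the output byte-for-byte reproducible.
    return json.dumps(record, sort_keys=True, separators=(",", ":"))
```

Rejecting events with missing fields at the call site keeps every log line queryable by the same keys.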
6) Offline support
- Orchestrator exports audit bundles: jobs.jsonl, history.jsonl, throttles.jsonl, manifest.json, signatures/. These are used for offline investigations and compliance.
- Replay manifests contain job digests and success/failure notes for deterministic proof.
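One way to make a bundle verifiable offline is to record a digest per file in manifest.json. The source does not specify the manifest schema, so the layout below (a files array with path, sha256, and bytes) is an assumption, and signature handling under signatures/ is out of scope for this sketch.

```python
import hashlib
import json
from typing import Mapping


def build_manifest(files: Mapping[str, bytes]) -> str:
    """Return manifest JSON listing each bundle file with its SHA-256 digest."""
    entries = [
        {
            "path": name,
            "sha256": hashlib.sha256(files[name]).hexdigest(),
            "bytes": len(files[name]),
        }
        for name in sorted(files)  # sorted so output does not depend on insertion order
    ]
    return json.dumps({"files": entries}, sort_keys=True, separators=(",", ":"))


def verify(manifest_json: str, files: Mapping[str, bytes]) -> bool:
    """Check that every listed file is present and matches its recorded digest."""
    for entry in json.loads(manifest_json)["files"]:
        data = files.get(entry["path"])
        if data is None or hashlib.sha256(data).hexdigest() != entry["sha256"]:
            return False
    return True
```

Sorting the entries and fixing the JSON separators means the same bundle contents always produce the same manifest bytes, which is what allows a detached signature over the manifest to be checked later.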
7) Operational considerations
- HA deployment with multiple API instances; queue storage determines redundancy strategy.
- Support for a maintenance mode that halts leasing while still allowing status inspection.
- Runbook includes procedures for expanding quotas, blacklisting misbehaving tenants, and recovering stuck jobs (clearing leases, applying pause/resume).
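The stuck-job recovery step in the runbook (clearing leases) could be approximated as below. This is a hypothetical sketch over plain dictionaries: the "leased" and "pending" status values and the function name are assumptions, and a real procedure would run against the job ledger and append to job_history rather than mutate records in memory.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List


def release_expired_leases(jobs: List[Dict[str, Any]], now: datetime) -> List[str]:
    """Return jobs whose lease has expired to the pending state.

    Mutates the matching records in place and returns their job ids.
    """
    released: List[str] = []
    for job in jobs:
        lease_until = job.get("leaseUntil")
        if job.get("status") == "leased" and lease_until is not None and lease_until <= now:
            job["status"] = "pending"
            job["leaseId"] = None
            job["leaseUntil"] = None
            released.append(job["jobId"])
    return released


if __name__ == "__main__":
    now = datetime(2025, 1, 1, tzinfo=timezone.utc)
    ledger = [
        {"jobId": "job-1", "status": "leased", "leaseId": "l-1", "leaseUntil": now - timedelta(minutes=1)},
        {"jobId": "job-2", "status": "leased", "leaseId": "l-2", "leaseUntil": now + timedelta(minutes=1)},
    ]
    print(release_expired_leases(ledger, now))
```

Jobs with a still-valid lease are left untouched, so the sweep is safe to run while workers are active.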