semi implemented and features implemented save checkpoint

Branch: master
2026-02-08 18:00:49 +02:00
parent 04360dff63
commit 1bf6bbf395
20895 changed files with 716795 additions and 64 deletions

@@ -0,0 +1,27 @@
# Agent Cluster Manager with HA Topologies
## Module
ReleaseOrchestrator
## Status
IMPLEMENTED
## Description
Agent clustering with support for multiple HA topologies (ActivePassive, ActiveActive, Sharded), leader election, health monitoring, and automatic failover for release orchestrator agents.
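
The three topologies differ mainly in which healthy nodes may accept a task. The following is a minimal Python sketch of that distinction, not the `AgentClusterManager` implementation: the function `eligible_nodes`, its parameters, and the hash-based shard assignment are all assumptions made for illustration.

```python
import hashlib
from enum import Enum


class Topology(Enum):
    ACTIVE_PASSIVE = "ActivePassive"
    ACTIVE_ACTIVE = "ActiveActive"
    SHARDED = "Sharded"


def eligible_nodes(topology, healthy_nodes, leader=None, task_key=""):
    """Return the nodes allowed to accept a task (hypothetical helper)."""
    nodes = sorted(healthy_nodes)
    if not nodes:
        return []
    if topology is Topology.ACTIVE_PASSIVE:
        # Only the current leader works; the other nodes stand by.
        return [leader] if leader in nodes else []
    if topology is Topology.ACTIVE_ACTIVE:
        # Every healthy node may accept the task.
        return nodes
    # Sharded (assumed scheme): a stable hash of the task key picks one owner.
    digest = hashlib.sha256(task_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return [nodes[index]]
```

In this sketch an ActivePassive cluster whose leader is missing from the healthy set returns no eligible node, which is the window that failover is meant to close.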
## Implementation Details
- **Modules**: `src/ReleaseOrchestrator/__Agents/StellaOps.Agent.Core/Resilience/`
- **Key Classes**:
  - `AgentClusterManager` (`src/ReleaseOrchestrator/__Agents/StellaOps.Agent.Core/Resilience/AgentClusterManager.cs`) - manages agent clusters with configurable HA topologies
  - `LeaderElection` (`src/ReleaseOrchestrator/__Agents/StellaOps.Agent.Core/Resilience/LeaderElection.cs`) - leader election for the ActivePassive topology
  - `FailoverManager` (`src/ReleaseOrchestrator/__Agents/StellaOps.Agent.Core/Resilience/FailoverManager.cs`) - automatic failover when the leader becomes unhealthy
  - `HealthMonitor` (`src/ReleaseOrchestrator/__Agents/StellaOps.Agent.Core/Resilience/HealthMonitor.cs`) - monitors cluster member health
  - `StateSync` (`src/ReleaseOrchestrator/__Agents/StellaOps.Agent.Core/Resilience/StateSync.cs`) - state synchronization between cluster members
- **Source**: SPRINT_20260117_034
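
How health monitoring, leader election, and failover interact can be sketched in a few lines. The Python below is only an illustration under stated assumptions: the class `ClusterSketch`, the heartbeat timeout, and the "lowest node id wins" election rule are hypothetical and are not taken from the classes listed above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ClusterSketch:
    """Hypothetical model: heartbeats decide health, health decides the leader."""

    heartbeat_timeout: float = 5.0  # assumed value, in seconds
    last_heartbeat: Dict[str, float] = field(default_factory=dict)
    leader: Optional[str] = None

    def heartbeat(self, node_id: str, now: float) -> None:
        # Record the time a node last reported in.
        self.last_heartbeat[node_id] = now

    def healthy_nodes(self, now: float) -> List[str]:
        # A node is healthy if its last heartbeat is within the timeout.
        return sorted(
            node
            for node, seen in self.last_heartbeat.items()
            if now - seen <= self.heartbeat_timeout
        )

    def tick(self, now: float) -> Optional[str]:
        # Fail over only when the current leader is missing or unhealthy,
        # so a recovered node does not displace a working leader.
        healthy = self.healthy_nodes(now)
        if self.leader not in healthy:
            self.leader = healthy[0] if healthy else None
        return self.leader
```

Passing the clock in as `now` keeps the sketch deterministic, which makes the failover timing easy to exercise in a test without sleeping.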
## E2E Test Plan
- [ ] Configure a 3-node ActivePassive cluster and verify leader election produces a single leader
- [ ] Verify failover: stop the leader node and confirm a new leader is elected within the timeout
- [ ] Verify ActiveActive topology: configure two active nodes and confirm both accept tasks
- [ ] Verify health monitoring: unhealthy node is detected and removed from the active set
- [ ] Verify state synchronization: cluster state converges after a node rejoins
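
For the state synchronization check, "converges" can be made concrete as: merging two members' states in either order yields the same result. The Python sketch below assumes a versioned key-value state with a highest-version-wins rule; the `State` shape and `merge_state` function are hypothetical and do not describe how `StateSync` actually reconciles state.

```python
from typing import Dict, Tuple

# Hypothetical state shape: key -> (version, value).
State = Dict[str, Tuple[int, str]]


def merge_state(local: State, remote: State) -> State:
    """Merge two states, keeping the higher (version, value) entry per key."""
    merged = dict(local)
    for key, entry in remote.items():
        # Tuple comparison orders by version first, then by value, so ties
        # on version are broken deterministically on both sides.
        if key not in merged or entry > merged[key]:
            merged[key] = entry
    return merged
```

A rejoining node that missed updates would, under this assumption, end up with the same state as its peers after one exchange in each direction:

```python
stale = {"release-1": (2, "deploying"), "release-2": (1, "queued")}
fresh = {"release-1": (3, "done")}
assert merge_state(stale, fresh) == merge_state(fresh, stale)
```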