For this step, you’re layering **liveness** and **basic routing intelligence** on top of the minimal handshake/dispatch you already designed.

Target outcome:

* Microservices send **heartbeats** over the existing connection.
* The router tracks **LastHeartbeatUtc**, **health status**, and **AveragePingMs** per connection.
* The router’s `IRoutingPlugin` uses **region + health + latency** to pick an instance.

No need to handle cancellation or streaming yet; just make routing decisions *not* naive.

---

## 0. Preconditions

Before starting, confirm:

* `StellaOps.Router.Common` already has:
  * `InstanceHealthStatus` enum.
  * `ConnectionState` with at least `Instance`, `Status`, `LastHeartbeatUtc`, `AveragePingMs`, `TransportType`.
* Minimal handshake is working:
  * Microservice sends HELLO (instance + endpoints).
  * Router creates `ConnectionState` & populates global routing view.
  * Router can send REQUEST and receive RESPONSE via InMemory transport.

If any of that is incomplete, shore it up first.

---

## 1. Extend Common with heartbeat payloads

**Project:** `StellaOps.Router.Common`
**Owner:** Common dev

Add DTOs for heartbeat frames.

### 1.1 Heartbeat payload

```csharp
public sealed class HeartbeatPayload
{
    public string InstanceId { get; init; } = string.Empty;
    public InstanceHealthStatus Status { get; init; } = InstanceHealthStatus.Healthy;

    // Optional basic metrics
    public int InFlightRequests { get; init; }
    public double ErrorRate { get; init; } // 0–1 range, optional
}
```

* This is application-level health; `Status` lets the microservice say “Degraded” / “Draining”.
* In-flight + error rate can be used later for smarter routing; initially, you can ignore them.

### 1.2 Wire into frame model

Ensure:

* `FrameType` includes `Heartbeat`:

  ```csharp
  public enum FrameType : byte
  {
      Hello = 1,
      Heartbeat = 2,
      EndpointsUpdate = 3,
      Request = 4,
      RequestStreamData = 5,
      Response = 6,
      ResponseStreamData = 7,
      Cancel = 8
  }
  ```

* No behavior in Common; only DTOs and enums.

---

## 2. Microservice SDK: send heartbeats on the same connection

**Project:** `StellaOps.Microservice`
**Owner:** SDK dev

You already have `MicroserviceConnectionHostedService` doing HELLO and request dispatch. Now add heartbeat sending.

### 2.1 Introduce heartbeat options

Extend `StellaMicroserviceOptions` with simple settings:

```csharp
public sealed class StellaMicroserviceOptions
{
    // existing fields...
    public TimeSpan HeartbeatInterval { get; set; } = TimeSpan.FromSeconds(10);
    public TimeSpan HeartbeatTimeout { get; set; } = TimeSpan.FromSeconds(30); // used by router, not here
}
```

### 2.2 Internal heartbeat sender

Create an internal interface and implementation:

```csharp
internal interface IHeartbeatSource
{
    InstanceHealthStatus GetCurrentStatus();
    int GetInFlightRequests();
    double GetErrorRate();
}
```

For now you can implement a trivial `DefaultHeartbeatSource`:

* `GetCurrentStatus()` → `Healthy`.
* `GetInFlightRequests()` → 0.
* `GetErrorRate()` → 0.

Wire this in DI:

```csharp
services.AddSingleton<IHeartbeatSource, DefaultHeartbeatSource>();
```

### 2.3 Add heartbeat loop to MicroserviceConnectionHostedService

In `StartAsync` of `MicroserviceConnectionHostedService`:

* After sending HELLO and subscribing to requests, start a background heartbeat loop.

Pseudo-plan:

```csharp
private Task? _heartbeatLoop;
private CancellationTokenSource? _heartbeatCts;

public async Task StartAsync(CancellationToken ct)
{
    // existing HELLO logic...
    await _connection.SendHelloAsync(payload, ct);
    _connection.OnRequest(frame => HandleRequestAsync(frame, ct));

    // The StartAsync token only covers startup, so the loop gets its own token source.
    // StopAsync should cancel _heartbeatCts and await _heartbeatLoop.
    _heartbeatCts = new CancellationTokenSource();
    _heartbeatLoop = Task.Run(() => HeartbeatLoopAsync(_heartbeatCts.Token));
}

private async Task HeartbeatLoopAsync(CancellationToken outerCt)
{
    var opt = _options.Value;
    var interval = opt.HeartbeatInterval;
    var instanceId = opt.InstanceId;

    while (!outerCt.IsCancellationRequested)
    {
        var payload = new HeartbeatPayload
        {
            InstanceId = instanceId,
            Status = _heartbeatSource.GetCurrentStatus(),
            InFlightRequests = _heartbeatSource.GetInFlightRequests(),
            ErrorRate = _heartbeatSource.GetErrorRate()
        };

        var frame = new Frame
        {
            Type = FrameType.Heartbeat,
            CorrelationId = Guid.Empty, // or a reserved value
            Payload = SerializeHeartbeatPayload(payload)
        };

        try
        {
            await _connection.SendHeartbeatAsync(frame, outerCt);
            await Task.Delay(interval, outerCt);
        }
        catch (OperationCanceledException)
        {
            break; // shutdown requested during send or delay
        }
    }
}
```

You’ll need to extend `IMicroserviceConnection` with:

```csharp
Task SendHeartbeatAsync(Frame frame, CancellationToken ct);
```

In this step the behavior is simple: every N seconds, push a heartbeat.

---

## 3. Router: accept heartbeats and update connection health

**Project:** `StellaOps.Gateway.WebService`
**Owner:** Gateway dev

You already have an InMemory router or similar structure that:

* Handles HELLO frames, creates `ConnectionState`.
* Maintains a global `IGlobalRoutingState`.

Now you need to:

* Handle HEARTBEAT frames.
* Update `ConnectionState.Status` and `LastHeartbeatUtc`.

### 3.1 Frame dispatch on router side

In your router’s InMemory server loop (or equivalent), add a case for `FrameType.Heartbeat`:

* Deserialize `HeartbeatPayload` from `frame.Payload`.
* Find the corresponding `ConnectionState` by `InstanceId` (and/or connection ID).
* Update:
  * `LastHeartbeatUtc` = `DateTime.UtcNow`.
  * `Status` = `payload.Status`.
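The deserialize step above can be sketched as follows. This is only a sketch under stated assumptions: the document does not fix a wire format, so JSON via `System.Text.Json` is an assumption, `HeartbeatCodec` is a hypothetical helper name, and the enum and payload types are trimmed stand-ins for the ones in `StellaOps.Router.Common` (the enum members are inferred from the statuses mentioned in this document).

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

// Stand-ins for the Common types; the real definitions live in StellaOps.Router.Common.
public enum InstanceHealthStatus { Healthy, Degraded, Draining, Unhealthy }

public sealed class HeartbeatPayload
{
    public string InstanceId { get; init; } = string.Empty;
    public InstanceHealthStatus Status { get; init; } = InstanceHealthStatus.Healthy;
    public int InFlightRequests { get; init; }
    public double ErrorRate { get; init; }
}

// Hypothetical helper. JSON is an assumption; any compact encoding would do.
public static class HeartbeatCodec
{
    private static readonly JsonSerializerOptions Options = new()
    {
        // Write the status as "Degraded" rather than a bare number.
        Converters = { new JsonStringEnumConverter() }
    };

    public static byte[] Serialize(HeartbeatPayload payload) =>
        JsonSerializer.SerializeToUtf8Bytes(payload, Options);

    // Returns false instead of throwing, so one malformed frame
    // cannot take down the router's server loop.
    public static bool TryDeserialize(ReadOnlySpan<byte> bytes, out HeartbeatPayload? payload)
    {
        try
        {
            payload = JsonSerializer.Deserialize<HeartbeatPayload>(bytes, Options);
            return payload is not null && !string.IsNullOrEmpty(payload.InstanceId);
        }
        catch (JsonException)
        {
            payload = null;
            return false;
        }
    }
}
```

In the server loop, the `FrameType.Heartbeat` case would call `TryDeserialize` on the frame payload and, on success, hand the result to the routing state together with the connection ID that delivered the frame. Frames that fail to parse can simply be logged and dropped.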
You can add a method in your routing-state implementation:

```csharp
public void UpdateHeartbeat(string connectionId, HeartbeatPayload payload)
{
    if (!_connections.TryGetValue(connectionId, out var conn))
        return;

    conn.LastHeartbeatUtc = DateTime.UtcNow;
    conn.Status = payload.Status;
}
```

The router’s transport server should know which `connectionId` delivered the frame; pass that along.

### 3.2 Detect stale connections (health degradation)

Add a background “health monitor” in the gateway:

* Reads `HeartbeatTimeout` from configuration (can reuse the same default as the microservice or have separate router-side config).
* Periodically scans all `ConnectionState` entries:
  * If `Now - LastHeartbeatUtc > HeartbeatTimeout`, mark `Status = Unhealthy` (or remove the connection entirely).
  * If the connection drops (transport disconnect), also mark `Unhealthy` or remove.

This can be a simple `IHostedService`:

```csharp
internal sealed class ConnectionHealthMonitor : IHostedService
{
    private readonly IGlobalRoutingState _state;
    private readonly TimeSpan _heartbeatTimeout;
    private Task? _loop;
    private CancellationTokenSource? _cts;

    // heartbeatTimeout comes from router-side configuration.
    public ConnectionHealthMonitor(IGlobalRoutingState state, TimeSpan heartbeatTimeout)
    {
        _state = state;
        _heartbeatTimeout = heartbeatTimeout;
    }

    public Task StartAsync(CancellationToken cancellationToken)
    {
        _cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        _loop = Task.Run(() => MonitorLoopAsync(_cts.Token));
        return Task.CompletedTask;
    }

    public async Task StopAsync(CancellationToken cancellationToken)
    {
        _cts?.Cancel();
        if (_loop is not null)
            await _loop;
    }

    private async Task MonitorLoopAsync(CancellationToken ct)
    {
        try
        {
            while (!ct.IsCancellationRequested)
            {
                _state.MarkStaleConnectionsUnhealthy(_heartbeatTimeout, DateTime.UtcNow);
                await Task.Delay(TimeSpan.FromSeconds(5), ct);
            }
        }
        catch (OperationCanceledException)
        {
            // Expected on shutdown; lets StopAsync await the loop without throwing.
        }
    }
}
```

You’ll add a method like `MarkStaleConnectionsUnhealthy` on your `IGlobalRoutingState` implementation.

---

## 4. Track basic latency (AveragePingMs)

**Project:** Gateway + Common
**Owner:** Gateway dev

You want `AveragePingMs` per connection to inform routing decisions.
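As a compact preview of the bookkeeping this section describes, the sketch below pairs a pending-request map keyed by correlation ID with an exponential moving average. `LatencyTracker` is a hypothetical helper, not an existing type, and timestamps are passed in explicitly (an assumption made so the logic can be tested without a real clock). The default smoothing weight matches the 0.8 / 0.2 split used in the rest of this section.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical helper, not part of the existing codebase.
public sealed class LatencyTracker
{
    private readonly ConcurrentDictionary<Guid, DateTime> _sentAtUtc = new();

    // Call when a Request frame is written to the connection.
    public void OnRequestSent(Guid correlationId, DateTime nowUtc)
    {
        _sentAtUtc[correlationId] = nowUtc;
    }

    // Call when the matching Response frame arrives.
    // Returns null when the correlation ID is unknown (already completed or never sent).
    public double? OnResponseReceived(Guid correlationId, DateTime nowUtc)
    {
        if (!_sentAtUtc.TryRemove(correlationId, out var sentAt))
            return null;

        return (nowUtc - sentAt).TotalMilliseconds;
    }

    // Exponential moving average; a non-positive current value means "no sample yet".
    public static double Smooth(double currentAverageMs, double sampleMs, double alpha = 0.2)
    {
        return currentAverageMs <= 0
            ? sampleMs
            : currentAverageMs * (1 - alpha) + sampleMs * alpha;
    }
}
```

Entries for requests that never receive a response are not evicted in this sketch; a fuller implementation would drop them when the request times out so the map cannot grow without bound.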
### 4.1 Decide where to measure

Simplest: measure “request → response” round-trip time in the gateway:

* When you send a `Request` frame to a specific connection, record:
  * `SentAtUtc[CorrelationId] = DateTime.UtcNow`.
* When you receive a `Response` frame with that correlation:
  * Compute `latencyMs = (UtcNow - SentAtUtc[CorrelationId]).TotalMilliseconds`.
  * Discard the map entry.

Then update `ConnectionState.AveragePingMs`, e.g. with an exponential moving average:

```csharp
conn.AveragePingMs = conn.AveragePingMs <= 0
    ? latencyMs
    : conn.AveragePingMs * 0.8 + latencyMs * 0.2;
```

### 4.2 Where to hook this

* In the **gateway-side transport client** (InMemory implementation for now):
  * When sending a `Request` frame:
    * Register `SentAtUtc` per correlation ID.
  * When receiving a `Response` frame:
    * Compute latency.
    * Call `IGlobalRoutingState.UpdateLatency(connectionId, latencyMs)`.

Add a method to the routing state:

```csharp
public void UpdateLatency(string connectionId, double latencyMs)
{
    if (_connections.TryGetValue(connectionId, out var conn))
    {
        if (conn.AveragePingMs <= 0)
            conn.AveragePingMs = latencyMs;
        else
            conn.AveragePingMs = conn.AveragePingMs * 0.8 + latencyMs * 0.2;
    }
}
```

You can keep it simple; sophistication can come later.

---

## 5. Basic routing plugin implementation

**Project:** `StellaOps.Gateway.WebService`
**Owner:** Gateway dev

You already have `IRoutingPlugin` defined. Now implement a concrete `BasicRoutingPlugin` that respects:

* Region (gateway region first, then neighbor tiers).
* Health (`Healthy` / `Degraded` only).
* Latency preference (`AveragePingMs`).

### 5.1 Inputs & data

`RoutingContext` should carry:

* `EndpointDescriptor` (with ServiceName, Version, Method, Path).
* `GatewayRegion` (from `GatewayNodeConfig.Region`).
* The `HttpContext` if you need headers (not needed for routing at this stage).
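For orientation, the inputs listed above could take roughly the following shape. This is a sketch only: the property names are inferred from how the routing plugin uses them later in this document, the record syntax and the `DefaultTimeout` placeholder value are assumptions, and `HttpContext` is left out to keep the example free of ASP.NET Core dependencies.

```csharp
using System;

// Assumed shape; the real EndpointDescriptor is expected to exist already.
public sealed record EndpointDescriptor(
    string ServiceName,
    string Version,
    string Method,
    string Path)
{
    // Placeholder default (an assumption); the plugin reads this as the effective timeout.
    public TimeSpan DefaultTimeout { get; init; } = TimeSpan.FromSeconds(30);
}

// Assumed shape; carries only what the basic routing rules need.
public sealed record RoutingContext(EndpointDescriptor Endpoint, string GatewayRegion);
```

Records give value equality for free, which is convenient when tests compare the endpoint a decision was made for against the one that was requested.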
`IGlobalRoutingState` should provide:

* `GetConnectionsFor(serviceName, version, method, path)` returning all `ConnectionState`s that support that endpoint.

### 5.2 Basic algorithm

Algorithm outline:

```csharp
public sealed class BasicRoutingPlugin : IRoutingPlugin
{
    private readonly IGlobalRoutingState _state;
    private readonly string[] _neighborRegions; // configured, can be empty

    public BasicRoutingPlugin(IGlobalRoutingState state, string[] neighborRegions)
    {
        _state = state;
        _neighborRegions = neighborRegions;
    }

    public async Task<RoutingDecision?> ChooseInstanceAsync(
        RoutingContext context,
        CancellationToken cancellationToken)
    {
        var endpoint = context.Endpoint;
        var candidates = _state.GetConnectionsFor(
            endpoint.ServiceName, endpoint.Version, endpoint.Method, endpoint.Path);

        if (candidates.Count == 0)
            return null;

        // 1. Filter by health (only Healthy or Degraded)
        var healthy = candidates
            .Where(c => c.Status == InstanceHealthStatus.Healthy
                     || c.Status == InstanceHealthStatus.Degraded)
            .ToList();

        if (healthy.Count == 0)
            return null;

        // 2. Partition by region tier
        var gatewayRegion = context.GatewayRegion;
        List<ConnectionState> tier1 = healthy.Where(c => c.Instance.Region == gatewayRegion).ToList();
        List<ConnectionState> tier2 = healthy.Where(c => _neighborRegions.Contains(c.Instance.Region)).ToList();
        List<ConnectionState> tier3 = healthy.Except(tier1).Except(tier2).ToList();

        var chosenTier = tier1.Count > 0 ? tier1
                       : tier2.Count > 0 ? tier2
                       : tier3;

        if (chosenTier.Count == 0)
            return null;

        // 3. Sort by latency (unmeasured connections last), then heartbeat freshness
        var ordered = chosenTier
            .OrderBy(c => c.AveragePingMs <= 0 ? double.MaxValue : c.AveragePingMs)
            .ThenByDescending(c => c.LastHeartbeatUtc)
            .ToList();

        var winner = ordered[0];

        // 4. Build decision
        return new RoutingDecision
        {
            Endpoint = endpoint,
            Connection = winner,
            TransportType = winner.TransportType,
            EffectiveTimeout = endpoint.DefaultTimeout // or compose with config later
        };
    }
}
```

Wire it into DI:

```csharp
services.AddSingleton<IRoutingPlugin, BasicRoutingPlugin>();
```

And ensure `RoutingDecisionMiddleware` calls it.

---

## 6. Integrate health-aware routing into the HTTP pipeline

**Project:** `StellaOps.Gateway.WebService`
**Owner:** Gateway dev

Update your `RoutingDecisionMiddleware` to:

* Use the final `IRoutingPlugin` instead of picking a random connection.
* Handle a null decision appropriately:
  * If `ChooseInstanceAsync` returns `null`, respond with `503 Service Unavailable` or `502 Bad Gateway` and a generic error body, and log the incident.

Check that:

* Gateway’s region is injected (via `GatewayNodeConfig.Region`) into `RoutingContext.GatewayRegion`.
* Endpoint descriptor is resolved before you call the plugin.

---

## 7. Testing plan

**Project:** `StellaOps.Gateway.WebService.Tests`, `StellaOps.Microservice.Tests`
**Owner:** test agent

Write basic tests to lock in behavior.

### 7.1 Microservice heartbeat tests

In `StellaOps.Microservice.Tests`:

* Use a fake `IMicroserviceConnection` that records frames sent.
* Configure `HeartbeatInterval` to a small number (e.g. 100 ms).
* Start a Host with `AddStellaMicroservice`.
* Wait some time, then assert:
  * At least one HELLO frame was sent.
  * At least N HEARTBEAT frames were sent.
  * HEARTBEAT payload has correct `InstanceId` and `Status`.

### 7.2 Router health update tests

In `StellaOps.Gateway.WebService.Tests` (or a separate routing-state test project):

* Create an instance of your `IGlobalRoutingState` implementation.
* Add a connection via HELLO simulation.
* Call `UpdateHeartbeat` with a `HeartbeatPayload`.
* Assert:
  * `LastHeartbeatUtc` updated.
  * `Status` set to `Healthy` (or whatever the payload said).
* Advance time (simulate via injecting a clock or mocking DateTime) and call `MarkStaleConnectionsUnhealthy`:
  * Assert that `Status` changed to `Unhealthy`.

### 7.3 Routing plugin tests

Write tests for `BasicRoutingPlugin`:

* Case 1: multiple connections, some unhealthy:
  * Only Healthy/Degraded are considered.
* Case 2: multiple regions:
  * Instances in gateway region win over others.
* Case 3: same region, different `AveragePingMs`:
  * Lower latency chosen.
* Case 4: same latency, different `LastHeartbeatUtc`:
  * More recent heartbeat chosen.

These tests will give you confidence that the routing logic behaves as requested and is stable as you add complexity later (streaming, cancellation, etc.).

---

## 8. Done criteria for “Add heartbeat, health, basic routing rules”

You can declare this step complete when:

* Microservices:
  * Periodically send HEARTBEAT frames on the same connection they use for requests.
* Gateway/router:
  * Updates `LastHeartbeatUtc` and `Status` on receipt of HEARTBEAT.
  * Marks stale or disconnected connections as `Unhealthy` (or removes them).
  * Tracks `AveragePingMs` per connection based on request/response round trips.
* Routing:
  * `IRoutingPlugin` chooses instances based on:
    * Strict `ServiceName` + `Version` + endpoint match.
    * Health (`Healthy`/`Degraded` only).
    * Region preference (gateway region > neighbors > others).
    * Latency (`AveragePingMs`) then heartbeat recency.
* Tests:
  * Validate heartbeats are sent and processed.
  * Validate stale connections are marked unhealthy.
  * Validate routing plugin picks the expected instance in simple scenarios.

Once this is in place, you have a live, health-aware routing fabric. The next logical step after this is to add **cancellation** and then **streaming + payload limits** on top of the same structures.
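As a closing illustration of the stale-connection criterion, here is a minimal sketch of the sweep that `MarkStaleConnectionsUnhealthy` could perform. The `ConnectionState` and `InstanceHealthStatus` definitions are trimmed stand-ins, `StaleSweep` is a hypothetical name, and returning the number of changed entries is an assumption added for easier testing. Passing `nowUtc` in explicitly keeps the time-based test from section 7.2 deterministic.

```csharp
using System;
using System.Collections.Generic;

// Trimmed stand-ins for the Common types.
public enum InstanceHealthStatus { Healthy, Degraded, Draining, Unhealthy }

public sealed class ConnectionState
{
    public string ConnectionId { get; init; } = string.Empty;
    public InstanceHealthStatus Status { get; set; } = InstanceHealthStatus.Healthy;
    public DateTime LastHeartbeatUtc { get; set; }
}

// Hypothetical helper; in the gateway this logic would live on the routing state.
public static class StaleSweep
{
    // Marks every connection whose last heartbeat is older than `timeout`
    // as Unhealthy and returns how many entries changed.
    public static int MarkStaleConnectionsUnhealthy(
        IEnumerable<ConnectionState> connections,
        TimeSpan timeout,
        DateTime nowUtc)
    {
        var changed = 0;
        foreach (var conn in connections)
        {
            var isStale = nowUtc - conn.LastHeartbeatUtc > timeout;
            if (isStale && conn.Status != InstanceHealthStatus.Unhealthy)
            {
                conn.Status = InstanceHealthStatus.Unhealthy;
                changed++;
            }
        }
        return changed;
    }
}
```

Because `UpdateHeartbeat` overwrites `Status` with whatever the payload reports, a connection marked `Unhealthy` by this sweep recovers on its next heartbeat without any extra logic.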