For this step, you’re layering liveness and basic routing intelligence on top of the minimal handshake/dispatch you already designed.
Target outcome:
- Microservices send heartbeats over the existing connection.
- The router tracks LastHeartbeatUtc, health status, and AveragePingMs per connection.
- The router’s IRoutingPlugin uses region + health + latency to pick an instance.
No need to handle cancellation or streaming yet; the goal is only to make routing decisions health- and latency-aware instead of naive.
0. Preconditions
Before starting, confirm:
- StellaOps.Router.Common already has:
  - InstanceHealthStatus enum.
  - ConnectionState with at least Instance, Status, LastHeartbeatUtc, AveragePingMs, TransportType.
- Minimal handshake is working:
  - Microservice sends HELLO (instance + endpoints).
  - Router creates ConnectionState & populates global routing view.
  - Router can send REQUEST and receive RESPONSE via InMemory transport.
If any of that is incomplete, shore it up first.
1. Extend Common with heartbeat payloads
Project: StellaOps.Router.Common
Owner: Common dev
Add DTOs for heartbeat frames.
1.1 Heartbeat payload
    public sealed class HeartbeatPayload
    {
        public string InstanceId { get; init; } = string.Empty;
        public InstanceHealthStatus Status { get; init; } = InstanceHealthStatus.Healthy;

        // Optional basic metrics
        public int InFlightRequests { get; init; }
        public double ErrorRate { get; init; } // 0–1 range, optional
    }
- This is application-level health; Status lets the microservice say “Degraded” / “Draining”.
- In-flight + error rate can be used later for smarter routing; initially, you can ignore them.
1.2 Wire into frame model
Ensure:
- FrameType includes Heartbeat:

      public enum FrameType : byte
      {
          Hello = 1,
          Heartbeat = 2,
          EndpointsUpdate = 3,
          Request = 4,
          RequestStreamData = 5,
          Response = 6,
          ResponseStreamData = 7,
          Cancel = 8
      }

- No behavior in Common; only DTOs and enums.
2. Microservice SDK: send heartbeats on the same connection
Project: StellaOps.Microservice
Owner: SDK dev
You already have MicroserviceConnectionHostedService doing HELLO and request dispatch. Now add heartbeat sending.
2.1 Introduce heartbeat options
Extend StellaMicroserviceOptions with simple settings:
    public sealed class StellaMicroserviceOptions
    {
        // existing fields...
        public TimeSpan HeartbeatInterval { get; set; } = TimeSpan.FromSeconds(10);
        public TimeSpan HeartbeatTimeout { get; set; } = TimeSpan.FromSeconds(30); // used by router, not here
    }
2.2 Internal heartbeat sender
Create an internal interface and implementation:
    internal interface IHeartbeatSource
    {
        InstanceHealthStatus GetCurrentStatus();
        int GetInFlightRequests();
        double GetErrorRate();
    }
For now you can implement a trivial DefaultHeartbeatSource:
- GetCurrentStatus() → Healthy.
- GetInFlightRequests() → 0.
- GetErrorRate() → 0.
Wire this in DI:
    services.AddSingleton<IHeartbeatSource, DefaultHeartbeatSource>();
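As a minimal sketch of the trivial source described above, plus one possible next step: the CountingHeartbeatSource class and its RequestStarted/RequestCompleted methods are hypothetical names, not part of the plan, and the InstanceHealthStatus members are stand-ins for whatever the Common enum actually defines.

```csharp
using System.Threading;

// Stand-in for the enum in StellaOps.Router.Common; member list is assumed.
public enum InstanceHealthStatus { Unknown, Healthy, Degraded, Draining, Unhealthy }

internal interface IHeartbeatSource
{
    InstanceHealthStatus GetCurrentStatus();
    int GetInFlightRequests();
    double GetErrorRate();
}

// The trivial implementation from the text: always healthy, no metrics.
internal sealed class DefaultHeartbeatSource : IHeartbeatSource
{
    public InstanceHealthStatus GetCurrentStatus() => InstanceHealthStatus.Healthy;
    public int GetInFlightRequests() => 0;
    public double GetErrorRate() => 0;
}

// Hypothetical variant that tracks real numbers with lock-free counters.
internal sealed class CountingHeartbeatSource : IHeartbeatSource
{
    private int _inFlight;
    private long _total;
    private long _failed;

    // Call when the dispatcher starts handling a request.
    public void RequestStarted() => Interlocked.Increment(ref _inFlight);

    // Call when handling finishes, successfully or not.
    public void RequestCompleted(bool succeeded)
    {
        Interlocked.Decrement(ref _inFlight);
        Interlocked.Increment(ref _total);
        if (!succeeded)
            Interlocked.Increment(ref _failed);
    }

    public InstanceHealthStatus GetCurrentStatus() => InstanceHealthStatus.Healthy;

    public int GetInFlightRequests() => Volatile.Read(ref _inFlight);

    // Lifetime failure ratio in the 0..1 range; 0 until a request completes.
    public double GetErrorRate()
    {
        long total = Interlocked.Read(ref _total);
        return total == 0 ? 0 : (double)Interlocked.Read(ref _failed) / total;
    }
}
```

The counting variant reports a lifetime ratio, which reacts slowly once many requests have completed; a sliding window would be the likely refinement when error rate starts to influence routing.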
2.3 Add heartbeat loop to MicroserviceConnectionHostedService
In StartAsync of MicroserviceConnectionHostedService:
- After sending HELLO and subscribing to requests, start a background heartbeat loop.
Pseudo-plan:
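    private Task? _heartbeatLoop;
    private CancellationTokenSource? _heartbeatCts;

    public async Task StartAsync(CancellationToken ct)
    {
        // existing HELLO logic...
        await _connection.SendHelloAsync(payload, ct);
        _connection.OnRequest(frame => HandleRequestAsync(frame, ct));

        // The token passed to StartAsync only signals an aborted startup,
        // so the loop gets its own source that StopAsync cancels.
        _heartbeatCts = new CancellationTokenSource();
        var loopCt = _heartbeatCts.Token;
        _heartbeatLoop = Task.Run(() => HeartbeatLoopAsync(loopCt));
    }

    public async Task StopAsync(CancellationToken ct)
    {
        // Merge this into your existing StopAsync.
        _heartbeatCts?.Cancel();
        if (_heartbeatLoop is not null)
            await _heartbeatLoop;
    }

    private async Task HeartbeatLoopAsync(CancellationToken ct)
    {
        var opt = _options.Value;
        var interval = opt.HeartbeatInterval;
        var instanceId = opt.InstanceId;

        try
        {
            while (!ct.IsCancellationRequested)
            {
                var payload = new HeartbeatPayload
                {
                    InstanceId = instanceId,
                    Status = _heartbeatSource.GetCurrentStatus(),
                    InFlightRequests = _heartbeatSource.GetInFlightRequests(),
                    ErrorRate = _heartbeatSource.GetErrorRate()
                };

                var frame = new Frame
                {
                    Type = FrameType.Heartbeat,
                    CorrelationId = Guid.Empty, // or a reserved value
                    Payload = SerializeHeartbeatPayload(payload)
                };

                await _connection.SendHeartbeatAsync(frame, ct);
                await Task.Delay(interval, ct);
            }
        }
        catch (OperationCanceledException) when (ct.IsCancellationRequested)
        {
            // Normal shutdown: cancellation can surface from the send or the delay.
        }
    }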
You’ll need to extend IMicroserviceConnection with:
    Task SendHeartbeatAsync(Frame frame, CancellationToken ct);
In this step, the logic is simple: every HeartbeatInterval, push one heartbeat frame.
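The loop calls SerializeHeartbeatPayload without defining it. One possible shape, assuming Frame.Payload carries raw bytes and that JSON is acceptable on the wire for now, is sketched below. HeartbeatSerializer is a hypothetical helper name, the enum members are stand-ins for the Common enum, and a compact binary encoding may well replace this once a real transport exists.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

// Stand-in for the enum in StellaOps.Router.Common; member list is assumed.
public enum InstanceHealthStatus { Unknown, Healthy, Degraded, Draining, Unhealthy }

public sealed class HeartbeatPayload
{
    public string InstanceId { get; init; } = string.Empty;
    public InstanceHealthStatus Status { get; init; } = InstanceHealthStatus.Healthy;
    public int InFlightRequests { get; init; }
    public double ErrorRate { get; init; }
}

// Hypothetical helper: UTF-8 JSON is an assumption, not a decided wire format.
public static class HeartbeatSerializer
{
    // Enums are written as names so the payload stays readable in logs
    // and survives reordering of enum members.
    private static readonly JsonSerializerOptions Options = new()
    {
        Converters = { new JsonStringEnumConverter() }
    };

    // SDK side: payload -> bytes for Frame.Payload.
    public static byte[] Serialize(HeartbeatPayload payload) =>
        JsonSerializer.SerializeToUtf8Bytes(payload, Options);

    // Router side: bytes from Frame.Payload -> payload.
    public static HeartbeatPayload Deserialize(ReadOnlySpan<byte> utf8Json) =>
        JsonSerializer.Deserialize<HeartbeatPayload>(utf8Json, Options)
        ?? throw new InvalidOperationException("Heartbeat payload was JSON null.");
}
```

Keeping both directions in one helper means the SDK and the router cannot drift apart on the format, although the helper has to live outside Common to respect the "only DTOs and enums" rule.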
3. Router: accept heartbeats and update connection health
Project: StellaOps.Gateway.WebService
Owner: Gateway dev
You already have an InMemory router or similar structure that:
- Handles HELLO frames, creates ConnectionState.
- Maintains a global IGlobalRoutingState.
Now you need to:
- Handle HEARTBEAT frames.
- Update ConnectionState.Status and LastHeartbeatUtc.
3.1 Frame dispatch on router side
In your router’s InMemory server loop (or equivalent), add a case for FrameType.Heartbeat:
- Deserialize HeartbeatPayload from frame.Payload.
- Find the corresponding ConnectionState by InstanceId (and/or connection ID).
- Update:
  - LastHeartbeatUtc = DateTime.UtcNow.
  - Status = payload.Status.
You can add a method in your routing-state implementation:
    public void UpdateHeartbeat(string connectionId, HeartbeatPayload payload)
    {
        if (!_connections.TryGetValue(connectionId, out var conn))
            return;

        conn.LastHeartbeatUtc = DateTime.UtcNow;
        conn.Status = payload.Status;
    }
The router’s transport server should know which connectionId delivered the frame; pass that along.
3.2 Detect stale connections (health degradation)
Add a background “health monitor” in the gateway:
- Reads HeartbeatTimeout from configuration (can reuse the same default as microservice or have separate router-side config).
- Periodically scans all ConnectionState entries:
  - If Now - LastHeartbeatUtc > HeartbeatTimeout, mark Status = Unhealthy (or remove connection entirely).
  - If connection drops (transport disconnect), also mark Unhealthy or remove.
This can be a simple IHostedService:
    internal sealed class ConnectionHealthMonitor : IHostedService
    {
        // Constructor omitted: inject the routing state and resolve the
        // timeout from configuration.
        private readonly IGlobalRoutingState _state;
        private readonly TimeSpan _heartbeatTimeout;
        private Task? _loop;
        private CancellationTokenSource? _cts;

        public Task StartAsync(CancellationToken cancellationToken)
        {
            _cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
            var ct = _cts.Token;
            _loop = Task.Run(() => MonitorLoopAsync(ct));
            return Task.CompletedTask;
        }

        public async Task StopAsync(CancellationToken cancellationToken)
        {
            _cts?.Cancel();
            if (_loop is not null)
                await _loop;
        }

        private async Task MonitorLoopAsync(CancellationToken ct)
        {
            try
            {
                while (!ct.IsCancellationRequested)
                {
                    _state.MarkStaleConnectionsUnhealthy(_heartbeatTimeout, DateTime.UtcNow);
                    await Task.Delay(TimeSpan.FromSeconds(5), ct);
                }
            }
            catch (OperationCanceledException) when (ct.IsCancellationRequested)
            {
                // Normal shutdown; without this, awaiting _loop in StopAsync would throw.
            }
        }
    }
You’ll add a method like MarkStaleConnectionsUnhealthy on your IGlobalRoutingState implementation.
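A minimal sketch of what MarkStaleConnectionsUnhealthy could look like follows. InMemoryRoutingState, AddConnection and the trimmed-down ConnectionState are hypothetical stand-ins for your real types, and returning the number of newly marked connections is an added convenience for logging, not something the plan requires.

```csharp
using System;
using System.Collections.Concurrent;

// Stand-in for the enum in StellaOps.Router.Common; member list is assumed.
public enum InstanceHealthStatus { Unknown, Healthy, Degraded, Draining, Unhealthy }

// Reduced to the fields this sketch needs.
public sealed class ConnectionState
{
    public string ConnectionId { get; init; } = string.Empty;
    public InstanceHealthStatus Status { get; set; } = InstanceHealthStatus.Unknown;
    public DateTime LastHeartbeatUtc { get; set; }
}

// Hypothetical routing-state implementation holding connections by ID.
public sealed class InMemoryRoutingState
{
    private readonly ConcurrentDictionary<string, ConnectionState> _connections = new();

    public void AddConnection(ConnectionState conn) =>
        _connections[conn.ConnectionId] = conn;

    // Marks every connection whose last heartbeat is older than the timeout.
    // The current time is a parameter so tests can "advance time" freely.
    public int MarkStaleConnectionsUnhealthy(TimeSpan heartbeatTimeout, DateTime nowUtc)
    {
        int newlyMarked = 0;
        foreach (var conn in _connections.Values)
        {
            if (conn.Status == InstanceHealthStatus.Unhealthy)
                continue; // already marked on an earlier scan

            if (nowUtc - conn.LastHeartbeatUtc > heartbeatTimeout)
            {
                conn.Status = InstanceHealthStatus.Unhealthy;
                newlyMarked++;
            }
        }
        return newlyMarked;
    }
}
```

Because the monitor passes DateTime.UtcNow in, the method itself stays deterministic. Note that this sketch mutates ConnectionState from the monitor thread while the heartbeat handler may write the same fields; if that race matters for you, guard updates with a lock or swap in immutable snapshots.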
4. Track basic latency (AveragePingMs)
Project: Gateway + Common
Owner: Gateway dev
You want AveragePingMs per connection to inform routing decisions.
4.1 Decide where to measure
Simplest: measure “request → response” round-trip time in the gateway:
- When you send a Request frame to a specific connection, record:
  - SentAtUtc[CorrelationId] = DateTime.UtcNow.
- When you receive a Response frame with that correlation:
  - Compute latencyMs = (UtcNow - SentAtUtc[CorrelationId]).TotalMilliseconds.
  - Discard map entry.
Then update ConnectionState.AveragePingMs, e.g. with an exponential moving average:
    conn.AveragePingMs = conn.AveragePingMs <= 0
        ? latencyMs
        : conn.AveragePingMs * 0.8 + latencyMs * 0.2;
4.2 Where to hook this
- In the gateway-side transport client (InMemory implementation for now):
  - When sending Request frame:
    - Register SentAtUtc per correlation ID.
  - When receiving Response frame:
    - Compute latency.
    - Call IGlobalRoutingState.UpdateLatency(connectionId, latencyMs).
- Add a method to the routing state:
    public void UpdateLatency(string connectionId, double latencyMs)
    {
        if (_connections.TryGetValue(connectionId, out var conn))
        {
            if (conn.AveragePingMs <= 0)
                conn.AveragePingMs = latencyMs;
            else
                conn.AveragePingMs = conn.AveragePingMs * 0.8 + latencyMs * 0.2;
        }
    }
You can keep it simple; sophistication can come later.
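The send/receive bookkeeping from 4.1 and 4.2 can be sketched as a small tracker. LatencyTracker and its method names are hypothetical. The sketch also swaps DateTime.UtcNow for Stopwatch.GetTimestamp, because a monotonic clock is not affected by wall-clock adjustments; that is a substitution, not something the plan prescribes.

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;

// Hypothetical helper owned by the gateway-side transport client.
public sealed class LatencyTracker
{
    // Correlation ID -> Stopwatch timestamp taken when the Request was sent.
    private readonly ConcurrentDictionary<Guid, long> _sentAt = new();

    // Call right before the Request frame goes out.
    public void OnRequestSent(Guid correlationId) =>
        _sentAt[correlationId] = Stopwatch.GetTimestamp();

    // Call when the Response arrives. Returns the round trip in milliseconds,
    // or null when the correlation ID is unknown (duplicate or late response).
    public double? OnResponseReceived(Guid correlationId)
    {
        if (!_sentAt.TryRemove(correlationId, out long start))
            return null;

        long elapsedTicks = Stopwatch.GetTimestamp() - start;
        return elapsedTicks * 1000.0 / Stopwatch.Frequency;
    }

    // Same moving average as in the text: the first sample seeds the average,
    // later samples contribute alpha (20% by default) of their value.
    public static double Ema(double currentAverageMs, double sampleMs, double alpha = 0.2) =>
        currentAverageMs <= 0
            ? sampleMs
            : currentAverageMs * (1 - alpha) + sampleMs * alpha;
}
```

As a worked example of the average: with AveragePingMs at 100 and a new 50 ms sample, the update gives 100 * 0.8 + 50 * 0.2 = 90 ms, so a single fast or slow response shifts the value only a fifth of the way. Entries for requests that never receive a response stay in the map, so a real implementation would also remove them on timeout or disconnect.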
5. Basic routing plugin implementation
Project: StellaOps.Gateway.WebService
Owner: Gateway dev
You already have IRoutingPlugin defined. Now implement a concrete BasicRoutingPlugin that respects:
- Region (gateway region first, then neighbor tiers).
- Health (Healthy/Degraded only).
- Latency preference (AveragePingMs).
5.1 Inputs & data
RoutingContext should carry:
- EndpointDescriptor (with ServiceName, Version, Method, Path).
- GatewayRegion (from GatewayNodeConfig.Region).
- The HttpContext if you need headers (not needed for routing at this stage).
IGlobalRoutingState should provide:
- GetConnectionsFor(serviceName, version, method, path) returning all ConnectionStates that support that endpoint.
5.2 Basic algorithm
Algorithm outline:
    public sealed class BasicRoutingPlugin : IRoutingPlugin
    {
        // Constructor omitted: inject the routing state and the configured regions.
        private readonly IGlobalRoutingState _state;
        private readonly string[] _neighborRegions; // configured, can be empty

        // Nothing is awaited yet, so results are wrapped with Task.FromResult
        // instead of marking the method async.
        public Task<RoutingDecision?> ChooseInstanceAsync(
            RoutingContext context,
            CancellationToken cancellationToken)
        {
            var endpoint = context.Endpoint;
            var candidates = _state.GetConnectionsFor(
                endpoint.ServiceName,
                endpoint.Version,
                endpoint.Method,
                endpoint.Path);

            if (candidates.Count == 0)
                return Task.FromResult<RoutingDecision?>(null);

            // 1. Filter by health (only Healthy or Degraded)
            var healthy = candidates
                .Where(c => c.Status == InstanceHealthStatus.Healthy || c.Status == InstanceHealthStatus.Degraded)
                .ToList();

            if (healthy.Count == 0)
                return Task.FromResult<RoutingDecision?>(null);

            // 2. Partition by region tier
            var gatewayRegion = context.GatewayRegion;
            List<ConnectionState> tier1 = healthy.Where(c => c.Instance.Region == gatewayRegion).ToList();
            List<ConnectionState> tier2 = healthy.Where(c => _neighborRegions.Contains(c.Instance.Region)).ToList();
            List<ConnectionState> tier3 = healthy.Except(tier1).Except(tier2).ToList();

            var chosenTier = tier1.Count > 0 ? tier1 : tier2.Count > 0 ? tier2 : tier3;
            if (chosenTier.Count == 0)
                return Task.FromResult<RoutingDecision?>(null);

            // 3. Sort by latency (unmeasured connections last), then heartbeat freshness
            var winner = chosenTier
                .OrderBy(c => c.AveragePingMs <= 0 ? double.MaxValue : c.AveragePingMs)
                .ThenByDescending(c => c.LastHeartbeatUtc)
                .First();

            // 4. Build decision
            return Task.FromResult<RoutingDecision?>(new RoutingDecision
            {
                Endpoint = endpoint,
                Connection = winner,
                TransportType = winner.TransportType,
                EffectiveTimeout = endpoint.DefaultTimeout // or compose with config later
            });
        }
    }
Wire it into DI:
    services.AddSingleton<IRoutingPlugin, BasicRoutingPlugin>();
And ensure RoutingDecisionMiddleware calls it.
6. Integrate health-aware routing into the HTTP pipeline
Project: StellaOps.Gateway.WebService
Owner: Gateway dev
Update your RoutingDecisionMiddleware to:
- Use the final IRoutingPlugin instead of picking a random connection.
- Handle null decision appropriately:
  - If ChooseInstanceAsync returns null, respond with 503 Service Unavailable or 502 Bad Gateway and a generic error body, log the incident.
Check that:
- Gateway’s region is injected (via GatewayNodeConfig.Region) into RoutingContext.GatewayRegion.
- Endpoint descriptor is resolved before you call the plugin.
7. Testing plan
Project: StellaOps.Gateway.WebService.Tests, StellaOps.Microservice.Tests
Owner: test agent
Write basic tests to lock in behavior.
7.1 Microservice heartbeat tests
In StellaOps.Microservice.Tests:
- Use a fake IMicroserviceConnection that records frames sent.
- Configure HeartbeatInterval to a small number (e.g. 100 ms).
- Start a Host with AddStellaMicroservice.
- Wait some time, assert:
  - At least one HELLO frame was sent.
  - At least N HEARTBEAT frames were sent.
  - HEARTBEAT payload has correct InstanceId and Status.
7.2 Router health update tests
In StellaOps.Gateway.WebService.Tests (or a separate routing-state test project):
- Create an instance of your IGlobalRoutingState implementation.
- Add a connection via HELLO simulation.
- Call UpdateHeartbeat with a HeartbeatPayload.
- Assert:
  - LastHeartbeatUtc updated.
  - Status set to Healthy (or whatever payload said).
- Advance time (simulate via injecting a clock or mocking DateTime) and call MarkStaleConnectionsUnhealthy:
  - Assert that Status changed to Unhealthy.
7.3 Routing plugin tests
Write tests for BasicRoutingPlugin:
- Case 1: multiple connections, some unhealthy:
  - Only Healthy/Degraded are considered.
- Case 2: multiple regions:
  - Instances in gateway region win over others.
- Case 3: same region, different AveragePingMs:
  - Lower latency chosen.
- Case 4: same latency, different LastHeartbeatUtc:
  - More recent heartbeat chosen.
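One way to make these four cases cheap to test is to keep the selection rules in a pure function that needs no routing state or DI. The sketch below assumes that split: Candidate and Selector are hypothetical stand-ins for ConnectionState and the body of BasicRoutingPlugin, and the enum members are assumed.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for the enum in StellaOps.Router.Common; member list is assumed.
public enum InstanceHealthStatus { Unknown, Healthy, Degraded, Draining, Unhealthy }

// Flattened stand-in for ConnectionState, holding only what selection reads.
public sealed record Candidate(
    string Id,
    string Region,
    InstanceHealthStatus Status,
    double AveragePingMs,
    DateTime LastHeartbeatUtc);

// Hypothetical pure version of the rules: health filter, region tier,
// then latency and heartbeat recency.
public static class Selector
{
    public static Candidate? Choose(
        IEnumerable<Candidate> candidates,
        string gatewayRegion,
        IReadOnlyCollection<string> neighborRegions)
    {
        // Case 1: only Healthy or Degraded instances are eligible.
        var healthy = candidates
            .Where(c => c.Status is InstanceHealthStatus.Healthy or InstanceHealthStatus.Degraded)
            .ToList();

        // Case 2: gateway region first, then neighbors, then everything else.
        var tier1 = healthy.Where(c => c.Region == gatewayRegion).ToList();
        var tier2 = healthy.Where(c => neighborRegions.Contains(c.Region)).ToList();
        var tier3 = healthy.Except(tier1).Except(tier2).ToList();
        var tier = tier1.Count > 0 ? tier1 : tier2.Count > 0 ? tier2 : tier3;

        // Cases 3 and 4: lowest latency wins, ties go to the freshest heartbeat.
        // Connections without a measurement (<= 0) sort last.
        return tier
            .OrderBy(c => c.AveragePingMs <= 0 ? double.MaxValue : c.AveragePingMs)
            .ThenByDescending(c => c.LastHeartbeatUtc)
            .FirstOrDefault();
    }
}
```

Each case then becomes a few lines: build three or four Candidate values, call Selector.Choose with a gateway region, and assert on the returned Id. The plugin test suite can stay small and focus on how ConnectionState is mapped into the selection.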
These tests will give you confidence that the routing logic behaves as requested and is stable as you add complexity later (streaming, cancellation, etc.).
8. Done criteria for “Add heartbeat, health, basic routing rules”
You can declare this step complete when:
- Microservices:
  - Periodically send HEARTBEAT frames on the same connection they use for requests.
- Gateway/router:
  - Updates LastHeartbeatUtc and Status on receipt of HEARTBEAT.
  - Marks stale or disconnected connections as Unhealthy (or removes them).
  - Tracks AveragePingMs per connection based on request/response round trips.
- Routing:
  - IRoutingPlugin chooses instances based on:
    - Strict ServiceName + Version + endpoint match.
    - Health (Healthy/Degraded only).
    - Region preference (gateway region > neighbors > others).
    - Latency (AveragePingMs) then heartbeat recency.
- Tests:
  - Validate heartbeats are sent and processed.
  - Validate stale connections are marked unhealthy.
  - Validate routing plugin picks the expected instance in simple scenarios.
Once this is in place, you have a live, health-aware routing fabric. The next logical step after this is to add cancellation and then streaming + payload limits on top of the same structures.