For this step, you’re layering **liveness** and **basic routing intelligence** on top of the minimal handshake/dispatch you already designed.

Target outcome:

* Microservices send **heartbeats** over the existing connection.
* The router tracks **LastHeartbeatUtc**, **health status**, and **AveragePingMs** per connection.
* The router’s `IRoutingPlugin` uses **region + health + latency** to pick an instance.

No need to handle cancellation or streaming yet; just make routing decisions *not* naive.

---

## 0. Preconditions

Before starting, confirm:

* `StellaOps.Router.Common` already has:

  * `InstanceHealthStatus` enum.
  * `ConnectionState` with at least `Instance`, `Status`, `LastHeartbeatUtc`, `AveragePingMs`, `TransportType`.
* Minimal handshake is working:

  * Microservice sends HELLO (instance + endpoints).
  * Router creates `ConnectionState` & populates global routing view.
  * Router can send REQUEST and receive RESPONSE via InMemory transport.

If any of that is incomplete, shore it up first.

---

## 1. Extend Common with heartbeat payloads

**Project:** `StellaOps.Router.Common`
**Owner:** Common dev

Add DTOs for heartbeat frames.

### 1.1 Heartbeat payload

```csharp
public sealed class HeartbeatPayload
{
    public string InstanceId { get; init; } = string.Empty;
    public InstanceHealthStatus Status { get; init; } = InstanceHealthStatus.Healthy;

    // Optional basic metrics
    public int InFlightRequests { get; init; }
    public double ErrorRate { get; init; } // 0–1 range, optional
}
```

* This is application-level health; `Status` lets the microservice say “Degraded” / “Draining”.
* In-flight + error rate can be used later for smarter routing; initially, you can ignore them.
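
The heartbeat loop later in this plan calls a `SerializeHeartbeatPayload` helper that is not defined here. The following is a minimal sketch of one way to do it, assuming JSON via `System.Text.Json` as the wire format and a `byte[]` frame payload; both are assumptions, as are the `HeartbeatSerializer` name and the exact `InstanceHealthStatus` members.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

// Assumed members; this plan mentions Healthy, Degraded, Draining, and Unhealthy.
public enum InstanceHealthStatus { Healthy, Degraded, Draining, Unhealthy }

public sealed class HeartbeatPayload
{
    public string InstanceId { get; init; } = string.Empty;
    public InstanceHealthStatus Status { get; init; } = InstanceHealthStatus.Healthy;
    public int InFlightRequests { get; init; }
    public double ErrorRate { get; init; }
}

// Hypothetical helper; the JSON wire format is an assumption.
public static class HeartbeatSerializer
{
    private static readonly JsonSerializerOptions Options = new()
    {
        // Write enum values as strings so payloads stay readable in logs.
        Converters = { new JsonStringEnumConverter() }
    };

    public static byte[] Serialize(HeartbeatPayload payload) =>
        JsonSerializer.SerializeToUtf8Bytes(payload, Options);

    public static HeartbeatPayload Deserialize(ReadOnlySpan<byte> bytes) =>
        JsonSerializer.Deserialize<HeartbeatPayload>(bytes, Options)
        ?? throw new InvalidOperationException("Heartbeat payload was null.");
}
```

If the other frame payloads already use a different serializer, reuse that one instead so all frames share a single format.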

### 1.2 Wire into frame model

Ensure:

* `FrameType` includes `Heartbeat`:

  ```csharp
  public enum FrameType : byte
  {
      Hello = 1,
      Heartbeat = 2,
      EndpointsUpdate = 3,
      Request = 4,
      RequestStreamData = 5,
      Response = 6,
      ResponseStreamData = 7,
      Cancel = 8
  }
  ```

* No behavior in Common; only DTOs and enums.

---

## 2. Microservice SDK: send heartbeats on the same connection

**Project:** `StellaOps.Microservice`
**Owner:** SDK dev

You already have `MicroserviceConnectionHostedService` doing HELLO and request dispatch. Now add heartbeat sending.

### 2.1 Introduce heartbeat options

Extend `StellaMicroserviceOptions` with simple settings:

```csharp
public sealed class StellaMicroserviceOptions
{
    // existing fields...
    public TimeSpan HeartbeatInterval { get; set; } = TimeSpan.FromSeconds(10);
    public TimeSpan HeartbeatTimeout { get; set; } = TimeSpan.FromSeconds(30); // used by router, not here
}
```

### 2.2 Internal heartbeat sender

Create an internal interface and implementation:

```csharp
internal interface IHeartbeatSource
{
    InstanceHealthStatus GetCurrentStatus();
    int GetInFlightRequests();
    double GetErrorRate();
}
```

For now you can implement a trivial `DefaultHeartbeatSource`:

* `GetCurrentStatus()` → `Healthy`.
* `GetInFlightRequests()` → 0.
* `GetErrorRate()` → 0.
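
The three bullets above translate almost directly into code. A minimal sketch, with the enum and interface repeated so the block compiles on its own (the enum members are assumed from the statuses this plan mentions):

```csharp
// Assumed members, mirroring the statuses named in this plan.
public enum InstanceHealthStatus { Healthy, Degraded, Draining, Unhealthy }

internal interface IHeartbeatSource
{
    InstanceHealthStatus GetCurrentStatus();
    int GetInFlightRequests();
    double GetErrorRate();
}

// Trivial source: always healthy, reports no load and no errors.
internal sealed class DefaultHeartbeatSource : IHeartbeatSource
{
    public InstanceHealthStatus GetCurrentStatus() => InstanceHealthStatus.Healthy;
    public int GetInFlightRequests() => 0;
    public double GetErrorRate() => 0.0;
}
```

Keeping the interface internal leaves room to swap in a source backed by real counters later without changing the public SDK surface.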

Wire this in DI:

```csharp
services.AddSingleton<IHeartbeatSource, DefaultHeartbeatSource>();
```

### 2.3 Add heartbeat loop to MicroserviceConnectionHostedService

In `StartAsync` of `MicroserviceConnectionHostedService`:

* After sending HELLO and subscribing to requests, start a background heartbeat loop.

Pseudo-plan:

```csharp
private Task? _heartbeatLoop;
private CancellationTokenSource? _heartbeatCts;

public async Task StartAsync(CancellationToken ct)
{
    // existing HELLO logic...
    await _connection.SendHelloAsync(payload, ct);

    // `ct` only signals an aborted startup, so the long-running work gets
    // its own token; cancel _heartbeatCts in StopAsync.
    _heartbeatCts = new CancellationTokenSource();
    var runCt = _heartbeatCts.Token;

    _connection.OnRequest(frame => HandleRequestAsync(frame, runCt));

    _heartbeatLoop = Task.Run(() => HeartbeatLoopAsync(runCt), runCt);
}

private async Task HeartbeatLoopAsync(CancellationToken outerCt)
{
    var opt = _options.Value;
    var interval = opt.HeartbeatInterval;
    var instanceId = opt.InstanceId;

    try
    {
        while (!outerCt.IsCancellationRequested)
        {
            var payload = new HeartbeatPayload
            {
                InstanceId = instanceId,
                Status = _heartbeatSource.GetCurrentStatus(),
                InFlightRequests = _heartbeatSource.GetInFlightRequests(),
                ErrorRate = _heartbeatSource.GetErrorRate()
            };

            var frame = new Frame
            {
                Type = FrameType.Heartbeat,
                CorrelationId = Guid.Empty, // or a reserved value
                Payload = SerializeHeartbeatPayload(payload)
            };

            await _connection.SendHeartbeatAsync(frame, outerCt);
            await Task.Delay(interval, outerCt);
        }
    }
    catch (OperationCanceledException)
    {
        // Shutdown requested (also covers TaskCanceledException from Task.Delay).
    }
}
```

You’ll need to extend `IMicroserviceConnection` with:

```csharp
Task SendHeartbeatAsync(Frame frame, CancellationToken ct);
```

In this step, the behavior is simple: every N seconds, push a heartbeat.
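
The plan above covers starting the loop but not stopping it. As a sketch of the start/stop lifecycle in isolation, the hypothetical `PeriodicLoop` helper below (not part of the SDK) owns its own `CancellationTokenSource`, runs a callback every interval, and treats cancellation as a normal exit so that stopping does not throw.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper illustrating the start/stop pattern for a periodic loop.
public sealed class PeriodicLoop
{
    private readonly TimeSpan _interval;
    private readonly Func<CancellationToken, Task> _tick;
    private CancellationTokenSource? _cts;
    private Task? _loop;

    public PeriodicLoop(TimeSpan interval, Func<CancellationToken, Task> tick)
    {
        _interval = interval;
        _tick = tick;
    }

    public void Start()
    {
        _cts = new CancellationTokenSource();
        var ct = _cts.Token;
        _loop = Task.Run(() => RunAsync(ct));
    }

    public async Task StopAsync()
    {
        _cts?.Cancel();
        if (_loop is not null)
            await _loop; // RunAsync swallows the cancellation, so this completes normally
        _cts?.Dispose();
    }

    private async Task RunAsync(CancellationToken ct)
    {
        try
        {
            while (!ct.IsCancellationRequested)
            {
                await _tick(ct);               // e.g. build and send one heartbeat
                await Task.Delay(_interval, ct);
            }
        }
        catch (OperationCanceledException)
        {
            // Expected on shutdown.
        }
    }
}
```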

---

## 3. Router: accept heartbeats and update connection health

**Project:** `StellaOps.Gateway.WebService`
**Owner:** Gateway dev

You already have an InMemory router or similar structure that:

* Handles HELLO frames, creates `ConnectionState`.
* Maintains a global `IGlobalRoutingState`.

Now you need to:

* Handle HEARTBEAT frames.
* Update `ConnectionState.Status` and `LastHeartbeatUtc`.

### 3.1 Frame dispatch on router side

In your router’s InMemory server loop (or equivalent), add a case for `FrameType.Heartbeat`:

* Deserialize `HeartbeatPayload` from `frame.Payload`.
* Find the corresponding `ConnectionState` by `InstanceId` (and/or connection ID).
* Update:

  * `LastHeartbeatUtc` = `DateTime.UtcNow`.
  * `Status` = `payload.Status`.

You can add a method in your routing-state implementation:

```csharp
public void UpdateHeartbeat(string connectionId, HeartbeatPayload payload)
{
    if (!_connections.TryGetValue(connectionId, out var conn))
        return;

    conn.LastHeartbeatUtc = DateTime.UtcNow;
    conn.Status = payload.Status;
}
```

The router’s transport server should know which `connectionId` delivered the frame; pass that along.
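
To show how the dispatch case and the state update fit together, here is a self-contained sketch. The `Frame` and `ConnectionState` shapes are trimmed assumptions, `RouterFrameDispatcher` is a hypothetical name, and JSON is assumed as the payload encoding; the real types in Common carry more fields.

```csharp
using System;
using System.Collections.Concurrent;
using System.Text.Json;

public enum InstanceHealthStatus { Healthy, Degraded, Draining, Unhealthy } // assumed members
public enum FrameType : byte { Hello = 1, Heartbeat = 2 }                   // trimmed for the sketch

public sealed class HeartbeatPayload
{
    public string InstanceId { get; init; } = string.Empty;
    public InstanceHealthStatus Status { get; init; } = InstanceHealthStatus.Healthy;
}

// Assumed minimal shapes.
public sealed class Frame
{
    public FrameType Type { get; init; }
    public byte[] Payload { get; init; } = Array.Empty<byte>();
}

public sealed class ConnectionState
{
    public InstanceHealthStatus Status { get; set; }
    public DateTime LastHeartbeatUtc { get; set; }
}

public sealed class RouterFrameDispatcher
{
    private readonly ConcurrentDictionary<string, ConnectionState> _connections = new();

    public void Register(string connectionId) =>
        _connections[connectionId] = new ConnectionState { LastHeartbeatUtc = DateTime.UtcNow };

    public ConnectionState? Find(string connectionId) =>
        _connections.TryGetValue(connectionId, out var conn) ? conn : null;

    // The transport passes the ID of the connection that delivered the frame.
    public void Dispatch(string connectionId, Frame frame)
    {
        switch (frame.Type)
        {
            case FrameType.Heartbeat:
                var payload = JsonSerializer.Deserialize<HeartbeatPayload>(frame.Payload);
                if (payload is null || !_connections.TryGetValue(connectionId, out var conn))
                    return; // unknown connection or empty payload: ignore
                conn.LastHeartbeatUtc = DateTime.UtcNow;
                conn.Status = payload.Status;
                break;
            // other frame types elided
        }
    }
}
```

Looking the connection up by the transport-supplied `connectionId` rather than by `payload.InstanceId` means a microservice cannot update another connection's health by sending someone else's instance ID.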

### 3.2 Detect stale connections (health degradation)

Add a background “health monitor” in the gateway:

* Reads `HeartbeatTimeout` from configuration (can reuse the same default as microservice or have separate router-side config).
* Periodically scans all `ConnectionState` entries:

  * If `Now - LastHeartbeatUtc > HeartbeatTimeout`, mark `Status = Unhealthy` (or remove connection entirely).
  * If connection drops (transport disconnect), also mark `Unhealthy` or remove.

This can be a simple `IHostedService`:

```csharp
internal sealed class ConnectionHealthMonitor : IHostedService
{
    private readonly IGlobalRoutingState _state;
    private readonly TimeSpan _heartbeatTimeout;
    private Task? _loop;
    private CancellationTokenSource? _cts;

    // Register through a factory or an options wrapper so DI can supply the timeout.
    public ConnectionHealthMonitor(IGlobalRoutingState state, TimeSpan heartbeatTimeout)
    {
        _state = state;
        _heartbeatTimeout = heartbeatTimeout;
    }

    public Task StartAsync(CancellationToken cancellationToken)
    {
        _cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        _loop = Task.Run(() => MonitorLoopAsync(_cts.Token), _cts.Token);
        return Task.CompletedTask;
    }

    public async Task StopAsync(CancellationToken cancellationToken)
    {
        _cts?.Cancel();
        if (_loop is null)
            return;

        try
        {
            await _loop;
        }
        catch (OperationCanceledException)
        {
            // Expected: Task.Delay in the loop observes the cancelled token.
        }
    }

    private async Task MonitorLoopAsync(CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            _state.MarkStaleConnectionsUnhealthy(_heartbeatTimeout, DateTime.UtcNow);
            await Task.Delay(TimeSpan.FromSeconds(5), ct);
        }
    }
}
```

You’ll add a method like `MarkStaleConnectionsUnhealthy` on your `IGlobalRoutingState` implementation.
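
One possible shape for `MarkStaleConnectionsUnhealthy`, as a sketch: `InMemoryRoutingState`, the `Add` helper, and the trimmed `ConnectionState` are assumptions made to keep the block self-contained. Because the current time is passed in as a parameter (as in the monitor's call), tests can drive staleness without a real clock. This variant marks connections `Unhealthy` rather than removing them.

```csharp
using System;
using System.Collections.Concurrent;

public enum InstanceHealthStatus { Healthy, Degraded, Draining, Unhealthy } // assumed members

public sealed class ConnectionState // assumed minimal shape
{
    public InstanceHealthStatus Status { get; set; }
    public DateTime LastHeartbeatUtc { get; set; }
}

public sealed class InMemoryRoutingState // hypothetical implementation name
{
    private readonly ConcurrentDictionary<string, ConnectionState> _connections = new();

    public void Add(string connectionId, ConnectionState state) =>
        _connections[connectionId] = state;

    // Returns how many connections were newly marked Unhealthy.
    public int MarkStaleConnectionsUnhealthy(TimeSpan heartbeatTimeout, DateTime nowUtc)
    {
        var marked = 0;
        foreach (var conn in _connections.Values)
        {
            var stale = nowUtc - conn.LastHeartbeatUtc > heartbeatTimeout;
            if (stale && conn.Status != InstanceHealthStatus.Unhealthy)
            {
                conn.Status = InstanceHealthStatus.Unhealthy;
                marked++;
            }
        }
        return marked;
    }
}
```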

---

## 4. Track basic latency (AveragePingMs)

**Project:** Gateway + Common
**Owner:** Gateway dev

You want `AveragePingMs` per connection to inform routing decisions.

### 4.1 Decide where to measure

Simplest: measure “request → response” round-trip time in the gateway:

* When you send a `Request` frame to a specific connection, record:

  * `SentAtUtc[CorrelationId] = DateTime.UtcNow`.
* When you receive a `Response` frame with that correlation:

  * Compute `latencyMs = (UtcNow - SentAtUtc[CorrelationId]).TotalMilliseconds`.
  * Discard map entry.

Then update `ConnectionState.AveragePingMs`, e.g. with an exponential moving average:

```csharp
conn.AveragePingMs = conn.AveragePingMs <= 0
    ? latencyMs
    : conn.AveragePingMs * 0.8 + latencyMs * 0.2;
```
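
To make the weighting concrete, the sketch below applies the same rule to three samples (100 ms, 50 ms, 50 ms). `LatencyAverage` and `LatencyAverageDemo` are hypothetical names used only for illustration.

```csharp
using System;

public static class LatencyAverage
{
    // Same rule as above: the first sample seeds the average,
    // later samples are blended in with weight 0.2.
    public static double Update(double currentAvgMs, double latencyMs) =>
        currentAvgMs <= 0 ? latencyMs : currentAvgMs * 0.8 + latencyMs * 0.2;
}

public static class LatencyAverageDemo
{
    public static double[] Run()
    {
        var samples = new[] { 100.0, 50.0, 50.0 };
        var history = new double[samples.Length];
        var avg = 0.0;
        for (var i = 0; i < samples.Length; i++)
        {
            avg = LatencyAverage.Update(avg, samples[i]);
            history[i] = avg; // approximately 100, then 90, then 82
        }
        return history;
    }
}
```

Each new sample moves the average 20% of the way toward itself, so a single slow response nudges the value without dominating it.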

### 4.2 Where to hook this

* In the **gateway-side transport client** (InMemory implementation for now):

  * When sending `Request` frame:

    * Register `SentAtUtc` per correlation ID.
  * When receiving `Response` frame:

    * Compute latency.
    * Call `IGlobalRoutingState.UpdateLatency(connectionId, latencyMs)`.

Add a method to the routing state:

```csharp
public void UpdateLatency(string connectionId, double latencyMs)
{
    if (_connections.TryGetValue(connectionId, out var conn))
    {
        if (conn.AveragePingMs <= 0)
            conn.AveragePingMs = latencyMs;
        else
            conn.AveragePingMs = conn.AveragePingMs * 0.8 + latencyMs * 0.2;
    }
}
```

You can keep it simple; sophistication can come later.
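
The send/receive bookkeeping can live in a small helper inside the transport client. The sketch below is one option: `RequestLatencyTracker` is a hypothetical name, and it deliberately swaps `DateTime.UtcNow` for `Stopwatch.GetTimestamp()`, which is monotonic and unaffected by system clock adjustments. The resulting `latencyMs` would then be passed to `UpdateLatency`.

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;

// Hypothetical helper: records send times per correlation ID and yields round-trip latency.
public sealed class RequestLatencyTracker
{
    private readonly ConcurrentDictionary<Guid, long> _sentAt = new();

    // Call when a Request frame is sent.
    public void OnRequestSent(Guid correlationId) =>
        _sentAt[correlationId] = Stopwatch.GetTimestamp();

    // Call when the matching Response frame arrives.
    // Returns false for unknown or already-completed correlation IDs.
    public bool TryCompleteRequest(Guid correlationId, out double latencyMs)
    {
        if (_sentAt.TryRemove(correlationId, out var startTicks))
        {
            var elapsedTicks = Stopwatch.GetTimestamp() - startTicks;
            latencyMs = elapsedTicks * 1000.0 / Stopwatch.Frequency;
            return true;
        }

        latencyMs = 0;
        return false;
    }
}
```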

---

## 5. Basic routing plugin implementation

**Project:** `StellaOps.Gateway.WebService`
**Owner:** Gateway dev

You already have `IRoutingPlugin` defined. Now implement a concrete `BasicRoutingPlugin` that respects:

* Region (gateway region first, then neighbor tiers).
* Health (`Healthy` / `Degraded` only).
* Latency preference (`AveragePingMs`).

### 5.1 Inputs & data

`RoutingContext` should carry:

* `EndpointDescriptor` (with ServiceName, Version, Method, Path).
* `GatewayRegion` (from `GatewayNodeConfig.Region`).
* The `HttpContext` if you need headers (not needed for routing at this stage).

`IGlobalRoutingState` should provide:

* `GetConnectionsFor(serviceName, version, method, path)` returning all `ConnectionState`s that support that endpoint.
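
Collecting the routing-state methods this plan mentions in one place gives a rough picture of the interface by the end of this step. The method names come from this document; the parameter and return types are assumptions, and the two empty classes are placeholders for the real types in Common.

```csharp
using System;
using System.Collections.Generic;

public sealed class ConnectionState { }   // placeholder for the real type in Common
public sealed class HeartbeatPayload { }  // placeholder for the real DTO

// Assumed shape: names from this plan, exact signatures are a guess.
public interface IGlobalRoutingState
{
    // All connections whose instance supports the given endpoint.
    IReadOnlyList<ConnectionState> GetConnectionsFor(
        string serviceName, string version, string method, string path);

    // Called by the router when a HEARTBEAT frame arrives.
    void UpdateHeartbeat(string connectionId, HeartbeatPayload payload);

    // Called by the transport client after a request/response round trip.
    void UpdateLatency(string connectionId, double latencyMs);

    // Called periodically by the health monitor.
    void MarkStaleConnectionsUnhealthy(TimeSpan heartbeatTimeout, DateTime nowUtc);
}
```

Returning `IReadOnlyList<ConnectionState>` keeps the `.Count` checks in the plugin cheap and signals that callers should not mutate the collection.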

### 5.2 Basic algorithm

Algorithm outline:

```csharp
public sealed class BasicRoutingPlugin : IRoutingPlugin
{
    private readonly IGlobalRoutingState _state;
    private readonly string[] _neighborRegions; // configured, can be empty

    public BasicRoutingPlugin(IGlobalRoutingState state, string[] neighborRegions)
    {
        _state = state;
        _neighborRegions = neighborRegions;
    }

    // Nothing is awaited yet, so return completed tasks instead of marking the method async.
    public Task<RoutingDecision?> ChooseInstanceAsync(
        RoutingContext context,
        CancellationToken cancellationToken)
    {
        var endpoint = context.Endpoint;
        var candidates = _state.GetConnectionsFor(
            endpoint.ServiceName,
            endpoint.Version,
            endpoint.Method,
            endpoint.Path);

        if (candidates.Count == 0)
            return Task.FromResult<RoutingDecision?>(null);

        // 1. Filter by health (only Healthy or Degraded)
        var healthy = candidates
            .Where(c => c.Status == InstanceHealthStatus.Healthy || c.Status == InstanceHealthStatus.Degraded)
            .ToList();

        if (healthy.Count == 0)
            return Task.FromResult<RoutingDecision?>(null);

        // 2. Partition by region tier
        var gatewayRegion = context.GatewayRegion;

        List<ConnectionState> tier1 = healthy.Where(c => c.Instance.Region == gatewayRegion).ToList();
        List<ConnectionState> tier2 = healthy.Where(c => _neighborRegions.Contains(c.Instance.Region)).ToList();
        List<ConnectionState> tier3 = healthy.Except(tier1).Except(tier2).ToList();

        var chosenTier = tier1.Count > 0 ? tier1 : tier2.Count > 0 ? tier2 : tier3;
        if (chosenTier.Count == 0)
            return Task.FromResult<RoutingDecision?>(null);

        // 3. Sort by latency (unmeasured connections last), then heartbeat freshness
        var ordered = chosenTier
            .OrderBy(c => c.AveragePingMs <= 0 ? double.MaxValue : c.AveragePingMs)
            .ThenByDescending(c => c.LastHeartbeatUtc)
            .ToList();

        var winner = ordered[0];

        // 4. Build decision
        return Task.FromResult<RoutingDecision?>(new RoutingDecision
        {
            Endpoint = endpoint,
            Connection = winner,
            TransportType = winner.TransportType,
            EffectiveTimeout = endpoint.DefaultTimeout // or compose with config later
        });
    }
}
```

Wire it into DI:

```csharp
services.AddSingleton<IRoutingPlugin, BasicRoutingPlugin>();
```

And ensure `RoutingDecisionMiddleware` calls it.
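
The region-tier step is the part most worth testing in isolation. As a sketch, it can be pulled into a small generic helper; `RegionTiers.PickTier` is a hypothetical name, and `regionOf` stands in for `c => c.Instance.Region`. It returns the first non-empty tier in the order gateway region, neighbor regions, everything else.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RegionTiers
{
    // Returns the first non-empty tier: gateway region, then neighbors, then all remaining.
    public static List<T> PickTier<T>(
        IReadOnlyCollection<T> healthy,
        Func<T, string> regionOf,
        string gatewayRegion,
        IReadOnlyCollection<string> neighborRegions)
    {
        var local = healthy.Where(c => regionOf(c) == gatewayRegion).ToList();
        if (local.Count > 0)
            return local;

        var neighbors = healthy.Where(c => neighborRegions.Contains(regionOf(c))).ToList();
        if (neighbors.Count > 0)
            return neighbors;

        // Neither local nor neighbor instances exist, so every candidate is in the last tier.
        return healthy.ToList();
    }
}
```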

---

## 6. Integrate health-aware routing into the HTTP pipeline

**Project:** `StellaOps.Gateway.WebService`
**Owner:** Gateway dev

Update your `RoutingDecisionMiddleware` to:

* Use the final `IRoutingPlugin` instead of picking a random connection.
* Handle null decision appropriately:

  * If `ChooseInstanceAsync` returns `null`, respond with `503 Service Unavailable` or `502 Bad Gateway` and a generic error body, and log the incident.

Check that:

* Gateway’s region is injected (via `GatewayNodeConfig.Region`) into `RoutingContext.GatewayRegion`.
* Endpoint descriptor is resolved before you call the plugin.
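
A sketch of the null-decision path in ASP.NET Core middleware follows. To stay self-contained it replaces the real `IRoutingPlugin` call with a hypothetical `ChooseInstance` delegate and stores the decision under a hypothetical `Items` key; logging is omitted. It assumes a project that references the ASP.NET Core shared framework.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;

// Hypothetical stand-in for the real plugin call.
public delegate Task<object?> ChooseInstance(HttpContext context, CancellationToken ct);

public sealed class RoutingDecisionMiddlewareSketch
{
    private readonly RequestDelegate _next;
    private readonly ChooseInstance _choose;

    public RoutingDecisionMiddlewareSketch(RequestDelegate next, ChooseInstance choose)
    {
        _next = next;
        _choose = choose;
    }

    public async Task InvokeAsync(HttpContext context)
    {
        var decision = await _choose(context, context.RequestAborted);
        if (decision is null)
        {
            // No healthy instance available: fail fast with a generic body.
            context.Response.StatusCode = StatusCodes.Status503ServiceUnavailable;
            await context.Response.WriteAsync("Service unavailable");
            return;
        }

        // Hypothetical key; later middleware reads the decision from Items.
        context.Items["RoutingDecision"] = decision;
        await _next(context);
    }
}
```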

---

## 7. Testing plan

**Project:** `StellaOps.Gateway.WebService.Tests`, `StellaOps.Microservice.Tests`
**Owner:** test agent

Write basic tests to lock in behavior.

### 7.1 Microservice heartbeat tests

In `StellaOps.Microservice.Tests`:

* Use a fake `IMicroserviceConnection` that records frames sent.
* Configure `HeartbeatInterval` to a small number (e.g. 100 ms).
* Start a Host with `AddStellaMicroservice`.
* Wait some time, assert:

  * At least one HELLO frame was sent.
  * At least N HEARTBEAT frames were sent.
  * HEARTBEAT payload has correct `InstanceId` and `Status`.
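
A recording fake for these tests can be very small. In the sketch below, `RecordingConnection` is a hypothetical name, and the interface is an assumed slice of the real `IMicroserviceConnection` (in particular, `SendHelloAsync` is shown taking a `Frame`, which may differ from the actual signature).

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public enum FrameType : byte { Hello = 1, Heartbeat = 2 } // trimmed for the sketch

public sealed class Frame // assumed minimal shape
{
    public FrameType Type { get; init; }
    public byte[] Payload { get; init; } = Array.Empty<byte>();
}

// Assumed slice of the real interface; only the send methods matter here.
public interface IMicroserviceConnection
{
    Task SendHelloAsync(Frame frame, CancellationToken ct);
    Task SendHeartbeatAsync(Frame frame, CancellationToken ct);
}

// Test double that records every frame instead of writing to a transport.
public sealed class RecordingConnection : IMicroserviceConnection
{
    private readonly ConcurrentQueue<Frame> _sent = new();

    public Task SendHelloAsync(Frame frame, CancellationToken ct) => Record(frame);
    public Task SendHeartbeatAsync(Frame frame, CancellationToken ct) => Record(frame);

    // Number of recorded frames of the given type.
    public int Count(FrameType type) => _sent.Count(f => f.Type == type);

    private Task Record(Frame frame)
    {
        _sent.Enqueue(frame);
        return Task.CompletedTask;
    }
}
```

A test would register this fake in place of the real connection, let the host run for a few intervals, and then assert on `Count(FrameType.Hello)` and `Count(FrameType.Heartbeat)`.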

### 7.2 Router health update tests

In `StellaOps.Gateway.WebService.Tests` (or a separate routing-state test project):

* Create an instance of your `IGlobalRoutingState` implementation.

* Add a connection via HELLO simulation.

* Call `UpdateHeartbeat` with a `HeartbeatPayload`.

* Assert:

  * `LastHeartbeatUtc` updated.
  * `Status` set to `Healthy` (or whatever payload said).

* Advance time (simulate via injecting a clock or mocking DateTime) and call `MarkStaleConnectionsUnhealthy`:

  * Assert that `Status` changed to `Unhealthy`.

### 7.3 Routing plugin tests

Write tests for `BasicRoutingPlugin`:

* Case 1: multiple connections, some unhealthy:

  * Only Healthy/Degraded are considered.
* Case 2: multiple regions:

  * Instances in gateway region win over others.
* Case 3: same region, different `AveragePingMs`:

  * Lower latency chosen.
* Case 4: same latency, different `LastHeartbeatUtc`:

  * More recent heartbeat chosen.

These tests will give you confidence that the routing logic behaves as requested and is stable as you add complexity later (streaming, cancellation, etc.).

---

## 8. Done criteria for “Add heartbeat, health, basic routing rules”

You can declare this step complete when:

* Microservices:

  * Periodically send HEARTBEAT frames on the same connection they use for requests.
* Gateway/router:

  * Updates `LastHeartbeatUtc` and `Status` on receipt of HEARTBEAT.
  * Marks stale or disconnected connections as `Unhealthy` (or removes them).
  * Tracks `AveragePingMs` per connection based on request/response round trips.
* Routing:

  * `IRoutingPlugin` chooses instances based on:

    * Strict `ServiceName` + `Version` + endpoint match.
    * Health (`Healthy`/`Degraded` only).
    * Region preference (gateway region > neighbors > others).
    * Latency (`AveragePingMs`) then heartbeat recency.
* Tests:

  * Validate heartbeats are sent and processed.
  * Validate stale connections are marked unhealthy.
  * Validate routing plugin picks the expected instance in simple scenarios.

Once this is in place, you have a live, health-aware routing fabric. The next logical step after this is to add **cancellation** and then **streaming + payload limits** on top of the same structures.