git.stella-ops.org/docs/router/06-Step.md
2025-12-02 18:38:32 +02:00

For this step, you're layering **liveness** and **basic routing intelligence** on top of the minimal handshake/dispatch you already designed.

Target outcome:

* Microservices send **heartbeats** over the existing connection.
* The router tracks **LastHeartbeatUtc**, **health status**, and **AveragePingMs** per connection.
* The router's `IRoutingPlugin` uses **region + health + latency** to pick an instance.

No need to handle cancellation or streaming yet; just make routing decisions *not* naive.

---
## 0. Preconditions
Before starting, confirm:

* `StellaOps.Router.Common` already has:
  * `InstanceHealthStatus` enum.
  * `ConnectionState` with at least `Instance`, `Status`, `LastHeartbeatUtc`, `AveragePingMs`, `TransportType`.
* Minimal handshake is working:
  * Microservice sends HELLO (instance + endpoints).
  * Router creates `ConnectionState` & populates global routing view.
  * Router can send REQUEST and receive RESPONSE via InMemory transport.

If any of that is incomplete, shore it up first.

---
## 1. Extend Common with heartbeat payloads
**Project:** `StellaOps.Router.Common`
**Owner:** Common dev
Add DTOs for heartbeat frames.
### 1.1 Heartbeat payload
```csharp
public sealed class HeartbeatPayload
{
    public string InstanceId { get; init; } = string.Empty;
    public InstanceHealthStatus Status { get; init; } = InstanceHealthStatus.Healthy;

    // Optional basic metrics
    public int InFlightRequests { get; init; }
    public double ErrorRate { get; init; } // 0.0-1.0 range, optional
}
```
* This is application-level health; `Status` lets the microservice say “Degraded” / “Draining”.
* In-flight + error rate can be used later for smarter routing; initially, you can ignore them.
### 1.2 Wire into frame model
Ensure:
* `FrameType` includes `Heartbeat`:
```csharp
public enum FrameType : byte
{
    Hello = 1,
    Heartbeat = 2,
    EndpointsUpdate = 3,
    Request = 4,
    RequestStreamData = 5,
    Response = 6,
    ResponseStreamData = 7,
    Cancel = 8
}
```
* No behavior in Common; only DTOs and enums.
---
## 2. Microservice SDK: send heartbeats on the same connection
**Project:** `StellaOps.Microservice`
**Owner:** SDK dev
You already have `MicroserviceConnectionHostedService` doing HELLO and request dispatch. Now add heartbeat sending.
### 2.1 Introduce heartbeat options
Extend `StellaMicroserviceOptions` with simple settings:
```csharp
public sealed class StellaMicroserviceOptions
{
    // existing fields...
    public TimeSpan HeartbeatInterval { get; set; } = TimeSpan.FromSeconds(10);
    public TimeSpan HeartbeatTimeout { get; set; } = TimeSpan.FromSeconds(30); // used by router, not here
}
```
### 2.2 Internal heartbeat sender
Create an internal interface and implementation:
```csharp
internal interface IHeartbeatSource
{
    InstanceHealthStatus GetCurrentStatus();
    int GetInFlightRequests();
    double GetErrorRate();
}
```
For now you can implement a trivial `DefaultHeartbeatSource`:
* `GetCurrentStatus()` → `Healthy`.
* `GetInFlightRequests()` → 0.
* `GetErrorRate()` → 0.
Wire this in DI:
```csharp
services.AddSingleton<IHeartbeatSource, DefaultHeartbeatSource>();
```
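A minimal sketch of the trivial `DefaultHeartbeatSource` described above. The enum and interface are repeated here only so the snippet compiles on its own; the member list of `InstanceHealthStatus` is an assumption based on the statuses this document mentions, and the types are made `public` purely for illustration (the real interface is `internal`).

```csharp
// Stand-ins so the sketch is self-contained; the real definitions live in
// StellaOps.Router.Common and StellaOps.Microservice.
public enum InstanceHealthStatus { Healthy, Degraded, Draining, Unhealthy }

public interface IHeartbeatSource
{
    InstanceHealthStatus GetCurrentStatus();
    int GetInFlightRequests();
    double GetErrorRate();
}

// Trivial source: always reports Healthy and no load metrics.
public sealed class DefaultHeartbeatSource : IHeartbeatSource
{
    public InstanceHealthStatus GetCurrentStatus() => InstanceHealthStatus.Healthy;
    public int GetInFlightRequests() => 0;
    public double GetErrorRate() => 0.0;
}
```

A later implementation could replace the constants with real counters without touching the heartbeat loop, since the loop only depends on `IHeartbeatSource`.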
### 2.3 Add heartbeat loop to MicroserviceConnectionHostedService
In `StartAsync` of `MicroserviceConnectionHostedService`:
* After sending HELLO and subscribing to requests, start a background heartbeat loop.
Pseudo-plan:
```csharp
private Task? _heartbeatLoop;
private CancellationTokenSource? _heartbeatCts;

public async Task StartAsync(CancellationToken ct)
{
    // existing HELLO logic...
    await _connection.SendHelloAsync(payload, ct);
    _connection.OnRequest(frame => HandleRequestAsync(frame, ct));

    // The StartAsync token only covers startup, so the loop gets its own
    // token source; cancel it in StopAsync and await _heartbeatLoop there.
    _heartbeatCts = new CancellationTokenSource();
    _heartbeatLoop = Task.Run(() => HeartbeatLoopAsync(_heartbeatCts.Token));
}

private async Task HeartbeatLoopAsync(CancellationToken outerCt)
{
    var opt = _options.Value;
    var interval = opt.HeartbeatInterval;
    var instanceId = opt.InstanceId;

    while (!outerCt.IsCancellationRequested)
    {
        var payload = new HeartbeatPayload
        {
            InstanceId = instanceId,
            Status = _heartbeatSource.GetCurrentStatus(),
            InFlightRequests = _heartbeatSource.GetInFlightRequests(),
            ErrorRate = _heartbeatSource.GetErrorRate()
        };
        var frame = new Frame
        {
            Type = FrameType.Heartbeat,
            CorrelationId = Guid.Empty, // or a reserved value
            Payload = SerializeHeartbeatPayload(payload)
        };

        try
        {
            await _connection.SendHeartbeatAsync(frame, outerCt);
            await Task.Delay(interval, outerCt);
        }
        catch (OperationCanceledException)
        {
            break; // shutdown requested
        }
    }
}
```
You'll need to extend `IMicroserviceConnection` with:
```csharp
Task SendHeartbeatAsync(Frame frame, CancellationToken ct);
```
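The loop above calls `SerializeHeartbeatPayload`, which this step does not define. One possible sketch uses `System.Text.Json`, under the assumption that `Frame.Payload` is a `byte[]` carrying UTF-8 JSON; the `HeartbeatCodec` name is hypothetical and the wire format is a design choice the real frame model may already have settled differently.

```csharp
using System;
using System.Text.Json;

// Stand-ins so the sketch is self-contained (enum members are assumed).
public enum InstanceHealthStatus { Healthy, Degraded, Draining, Unhealthy }

public sealed class HeartbeatPayload
{
    public string InstanceId { get; init; } = string.Empty;
    public InstanceHealthStatus Status { get; init; } = InstanceHealthStatus.Healthy;
    public int InFlightRequests { get; init; }
    public double ErrorRate { get; init; }
}

// Hypothetical helper: encodes and decodes heartbeat payloads as UTF-8 JSON.
public static class HeartbeatCodec
{
    public static byte[] Serialize(HeartbeatPayload payload) =>
        JsonSerializer.SerializeToUtf8Bytes(payload);

    public static HeartbeatPayload Deserialize(byte[] bytes) =>
        JsonSerializer.Deserialize<HeartbeatPayload>(new ReadOnlySpan<byte>(bytes))
        ?? throw new InvalidOperationException("Heartbeat payload was null.");
}
```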
In this step, the implementation is simple: every N seconds, push a heartbeat.

---
## 3. Router: accept heartbeats and update connection health
**Project:** `StellaOps.Gateway.WebService`
**Owner:** Gateway dev
You already have an InMemory router or similar structure that:
* Handles HELLO frames, creates `ConnectionState`.
* Maintains a global `IGlobalRoutingState`.
Now you need to:
* Handle HEARTBEAT frames.
* Update `ConnectionState.Status` and `LastHeartbeatUtc`.
### 3.1 Frame dispatch on router side
In your router's InMemory server loop (or equivalent), add a case for `FrameType.Heartbeat`:

* Deserialize `HeartbeatPayload` from `frame.Payload`.
* Find the corresponding `ConnectionState` by `InstanceId` (and/or connection ID).
* Update:
  * `LastHeartbeatUtc` = `DateTime.UtcNow`.
  * `Status` = `payload.Status`.

You can add a method in your routing-state implementation:
```csharp
public void UpdateHeartbeat(string connectionId, HeartbeatPayload payload)
{
    if (!_connections.TryGetValue(connectionId, out var conn))
        return;

    conn.LastHeartbeatUtc = DateTime.UtcNow;
    conn.Status = payload.Status;
}
```
The router's transport server should know which `connectionId` delivered the frame; pass that along.
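
The dispatch step could look roughly like the sketch below. `Frame`, `InMemoryRoutingState`, and `RouterFrameDispatcher` are simplified, hypothetical stand-ins (the real types carry more fields), and the payload is assumed to be UTF-8 JSON in a `byte[]`.

```csharp
using System;
using System.Collections.Concurrent;
using System.Text.Json;

// Simplified stand-ins; only the members needed for heartbeat handling.
public enum FrameType : byte { Hello = 1, Heartbeat = 2, Request = 4, Response = 6 }
public enum InstanceHealthStatus { Healthy, Degraded, Draining, Unhealthy }

public sealed class Frame
{
    public FrameType Type { get; init; }
    public byte[] Payload { get; init; } = Array.Empty<byte>();
}

public sealed class HeartbeatPayload
{
    public string InstanceId { get; init; } = string.Empty;
    public InstanceHealthStatus Status { get; init; } = InstanceHealthStatus.Healthy;
}

public sealed class ConnectionState
{
    public InstanceHealthStatus Status { get; set; }
    public DateTime LastHeartbeatUtc { get; set; }
}

public sealed class InMemoryRoutingState
{
    private readonly ConcurrentDictionary<string, ConnectionState> _connections = new();

    public void AddConnection(string connectionId, ConnectionState state) =>
        _connections[connectionId] = state;

    public ConnectionState? Get(string connectionId) =>
        _connections.TryGetValue(connectionId, out var conn) ? conn : null;

    public void UpdateHeartbeat(string connectionId, HeartbeatPayload payload)
    {
        if (!_connections.TryGetValue(connectionId, out var conn))
            return; // unknown connection: ignore the heartbeat

        conn.LastHeartbeatUtc = DateTime.UtcNow;
        conn.Status = payload.Status;
    }
}

public static class RouterFrameDispatcher
{
    // connectionId is supplied by the transport server that received the frame.
    public static void Dispatch(InMemoryRoutingState state, string connectionId, Frame frame)
    {
        switch (frame.Type)
        {
            case FrameType.Heartbeat:
                var payload = JsonSerializer.Deserialize<HeartbeatPayload>(
                    new ReadOnlySpan<byte>(frame.Payload));
                if (payload is not null)
                    state.UpdateHeartbeat(connectionId, payload);
                break;
            // Hello, Response, etc. are handled by the existing code paths.
        }
    }
}
```

Keying the update on the transport's `connectionId` rather than on the `InstanceId` inside the payload means a heartbeat can only refresh the connection it actually arrived on.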
### 3.2 Detect stale connections (health degradation)
Add a background “health monitor” in the gateway:
* Reads `HeartbeatTimeout` from configuration (can reuse the same default as microservice or have separate router-side config).
* Periodically scans all `ConnectionState` entries:
  * If `Now - LastHeartbeatUtc > HeartbeatTimeout`, mark `Status = Unhealthy` (or remove connection entirely).
  * If connection drops (transport disconnect), also mark `Unhealthy` or remove.

This can be a simple `IHostedService`:
```csharp
internal sealed class ConnectionHealthMonitor : IHostedService
{
    // Constructor injecting _state and _heartbeatTimeout omitted for brevity.
    private readonly IGlobalRoutingState _state;
    private readonly TimeSpan _heartbeatTimeout;
    private Task? _loop;
    private CancellationTokenSource? _cts;

    public Task StartAsync(CancellationToken cancellationToken)
    {
        // The StartAsync token only covers startup, so the loop owns its token.
        _cts = new CancellationTokenSource();
        _loop = Task.Run(() => MonitorLoopAsync(_cts.Token));
        return Task.CompletedTask;
    }

    public async Task StopAsync(CancellationToken cancellationToken)
    {
        _cts?.Cancel();
        if (_loop is not null)
            await _loop;
    }

    private async Task MonitorLoopAsync(CancellationToken ct)
    {
        try
        {
            while (!ct.IsCancellationRequested)
            {
                _state.MarkStaleConnectionsUnhealthy(_heartbeatTimeout, DateTime.UtcNow);
                await Task.Delay(TimeSpan.FromSeconds(5), ct);
            }
        }
        catch (OperationCanceledException)
        {
            // Expected on shutdown; lets StopAsync complete without throwing.
        }
    }
}
```
You'll add a method like `MarkStaleConnectionsUnhealthy` on your `IGlobalRoutingState` implementation.
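
A minimal sketch of such a method, assuming the routing state keeps connections in a dictionary keyed by connection ID. `InMemoryRoutingState` and the trimmed `ConnectionState` are hypothetical stand-ins, and returning the number of newly marked connections is an illustrative choice, not a requirement. Taking `nowUtc` as a parameter (as the monitor already does) keeps the method testable without mocking the clock.

```csharp
using System;
using System.Collections.Concurrent;

public enum InstanceHealthStatus { Healthy, Degraded, Draining, Unhealthy }

// Trimmed stand-in: only the members the stale check needs.
public sealed class ConnectionState
{
    public InstanceHealthStatus Status { get; set; }
    public DateTime LastHeartbeatUtc { get; set; }
}

public sealed class InMemoryRoutingState
{
    private readonly ConcurrentDictionary<string, ConnectionState> _connections = new();

    public void AddConnection(string connectionId, ConnectionState state) =>
        _connections[connectionId] = state;

    // Marks every connection whose last heartbeat is older than the timeout.
    // Returns how many connections were newly marked Unhealthy.
    public int MarkStaleConnectionsUnhealthy(TimeSpan heartbeatTimeout, DateTime nowUtc)
    {
        var marked = 0;
        foreach (var conn in _connections.Values)
        {
            var stale = nowUtc - conn.LastHeartbeatUtc > heartbeatTimeout;
            if (stale && conn.Status != InstanceHealthStatus.Unhealthy)
            {
                conn.Status = InstanceHealthStatus.Unhealthy;
                marked++;
            }
        }
        return marked;
    }
}
```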
---
## 4. Track basic latency (AveragePingMs)
**Project:** Gateway + Common
**Owner:** Gateway dev
You want `AveragePingMs` per connection to inform routing decisions.
### 4.1 Decide where to measure
Simplest: measure “request → response” round-trip time in the gateway:
* When you send a `Request` frame to a specific connection, record:
  * `SentAtUtc[CorrelationId] = DateTime.UtcNow`.
* When you receive a `Response` frame with that correlation:
  * Compute `latencyMs = (UtcNow - SentAtUtc[CorrelationId]).TotalMilliseconds`.
  * Discard map entry.

Then update `ConnectionState.AveragePingMs`, e.g. with an exponential moving average:
```csharp
conn.AveragePingMs = conn.AveragePingMs <= 0
    ? latencyMs
    : conn.AveragePingMs * 0.8 + latencyMs * 0.2;
```
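To see how the moving average behaves, the same rule can be pulled into a pure function; `LatencyAverage` is a hypothetical helper name used only for this illustration.

```csharp
public static class LatencyAverage
{
    // Same update rule as above: the first sample seeds the average,
    // later samples are blended in with weight 0.2.
    public static double Update(double currentAverageMs, double latencyMs) =>
        currentAverageMs <= 0
            ? latencyMs
            : currentAverageMs * 0.8 + latencyMs * 0.2;
}
```

Starting from an average of 0, samples of 100 ms, 50 ms, and 50 ms yield averages of 100, then 90, then 82: a single slow request moves the average only gradually, and recovery is equally gradual.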
### 4.2 Where to hook this
* In the **gateway-side transport client** (InMemory implementation for now):
  * When sending `Request` frame:
    * Register `SentAtUtc` per correlation ID.
  * When receiving `Response` frame:
    * Compute latency.
    * Call `IGlobalRoutingState.UpdateLatency(connectionId, latencyMs)`.

Add a method to the routing state:
```csharp
public void UpdateLatency(string connectionId, double latencyMs)
{
    if (_connections.TryGetValue(connectionId, out var conn))
    {
        if (conn.AveragePingMs <= 0)
            conn.AveragePingMs = latencyMs;
        else
            conn.AveragePingMs = conn.AveragePingMs * 0.8 + latencyMs * 0.2;
    }
}
```
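The send/receive bookkeeping that feeds `UpdateLatency` could be sketched as below. `RequestLatencyTracker` is a hypothetical helper, and it swaps the `DateTime.UtcNow` timestamps described above for `Stopwatch.GetTimestamp()`, which is monotonic and therefore unaffected by wall-clock adjustments; either approach fits the plan.

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;

public sealed class RequestLatencyTracker
{
    // CorrelationId -> Stopwatch timestamp captured when the Request was sent.
    private readonly ConcurrentDictionary<Guid, long> _sentAt = new();

    public int PendingCount => _sentAt.Count;

    // Call right before the Request frame is written to the connection.
    public void OnRequestSent(Guid correlationId) =>
        _sentAt[correlationId] = Stopwatch.GetTimestamp();

    // Call when the matching Response frame arrives. Returns the round-trip
    // time in milliseconds, or null when the correlation ID is unknown.
    public double? OnResponseReceived(Guid correlationId)
    {
        if (!_sentAt.TryRemove(correlationId, out var startTimestamp))
            return null;

        var elapsedTicks = Stopwatch.GetTimestamp() - startTimestamp;
        return elapsedTicks * 1000.0 / Stopwatch.Frequency;
    }
}
```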
You can keep it simple; sophistication can come later.

---
## 5. Basic routing plugin implementation
**Project:** `StellaOps.Gateway.WebService`
**Owner:** Gateway dev
You already have `IRoutingPlugin` defined. Now implement a concrete `BasicRoutingPlugin` that respects:
* Region (gateway region first, then neighbor tiers).
* Health (`Healthy` / `Degraded` only).
* Latency preference (`AveragePingMs`).
### 5.1 Inputs & data
`RoutingContext` should carry:
* `EndpointDescriptor` (with ServiceName, Version, Method, Path).
* `GatewayRegion` (from `GatewayNodeConfig.Region`).
* The `HttpContext` if you need headers (not needed for routing at this stage).
`IGlobalRoutingState` should provide:
* `GetConnectionsFor(serviceName, version, method, path)` returning all `ConnectionState`s that support that endpoint.
### 5.2 Basic algorithm
Algorithm outline:
```csharp
public sealed class BasicRoutingPlugin : IRoutingPlugin
{
    private readonly IGlobalRoutingState _state;
    private readonly string[] _neighborRegions; // configured, can be empty

    // Selection is synchronous, so return a completed task rather than
    // marking the method async without an await.
    public Task<RoutingDecision?> ChooseInstanceAsync(
        RoutingContext context,
        CancellationToken cancellationToken)
        => Task.FromResult(Choose(context));

    private RoutingDecision? Choose(RoutingContext context)
    {
        var endpoint = context.Endpoint;
        var candidates = _state.GetConnectionsFor(
            endpoint.ServiceName,
            endpoint.Version,
            endpoint.Method,
            endpoint.Path);
        if (candidates.Count == 0)
            return null;

        // 1. Filter by health (only Healthy or Degraded)
        var healthy = candidates
            .Where(c => c.Status == InstanceHealthStatus.Healthy || c.Status == InstanceHealthStatus.Degraded)
            .ToList();
        if (healthy.Count == 0)
            return null;

        // 2. Partition by region tier
        var gatewayRegion = context.GatewayRegion;
        List<ConnectionState> tier1 = healthy.Where(c => c.Instance.Region == gatewayRegion).ToList();
        List<ConnectionState> tier2 = healthy.Where(c => _neighborRegions.Contains(c.Instance.Region)).ToList();
        List<ConnectionState> tier3 = healthy.Except(tier1).Except(tier2).ToList();
        var chosenTier = tier1.Count > 0 ? tier1 : tier2.Count > 0 ? tier2 : tier3;
        if (chosenTier.Count == 0)
            return null;

        // 3. Sort by latency (unmeasured connections last), then heartbeat freshness
        var ordered = chosenTier
            .OrderBy(c => c.AveragePingMs <= 0 ? double.MaxValue : c.AveragePingMs)
            .ThenByDescending(c => c.LastHeartbeatUtc)
            .ToList();
        var winner = ordered[0];

        // 4. Build decision
        return new RoutingDecision
        {
            Endpoint = endpoint,
            Connection = winner,
            TransportType = winner.TransportType,
            EffectiveTimeout = endpoint.DefaultTimeout // or compose with config later
        };
    }
}
```
Wire it into DI:
```csharp
services.AddSingleton<IRoutingPlugin, BasicRoutingPlugin>();
```
And ensure `RoutingDecisionMiddleware` calls it.

---
## 6. Integrate health-aware routing into the HTTP pipeline
**Project:** `StellaOps.Gateway.WebService`
**Owner:** Gateway dev
Update your `RoutingDecisionMiddleware` to:
* Use the registered `IRoutingPlugin` instead of picking a random connection.
* Handle a null decision appropriately:
  * If `ChooseInstanceAsync` returns `null`, respond with `503 Service Unavailable` or `502 Bad Gateway` and a generic error body, and log the incident.

Check that:

* The gateway's region is injected (via `GatewayNodeConfig.Region`) into `RoutingContext.GatewayRegion`.
* The endpoint descriptor is resolved before you call the plugin.

---
## 7. Testing plan
**Project:** `StellaOps.Gateway.WebService.Tests`, `StellaOps.Microservice.Tests`
**Owner:** test agent
Write basic tests to lock in behavior.
### 7.1 Microservice heartbeat tests
In `StellaOps.Microservice.Tests`:
* Use a fake `IMicroserviceConnection` that records frames sent.
* Configure `HeartbeatInterval` to a small number (e.g. 100 ms).
* Start a Host with `AddStellaMicroservice`.
* Wait some time, assert:
  * At least one HELLO frame was sent.
  * At least N HEARTBEAT frames were sent.
  * HEARTBEAT payload has correct `InstanceId` and `Status`.
### 7.2 Router health update tests
In `StellaOps.Gateway.WebService.Tests` (or a separate routing-state test project):
* Create an instance of your `IGlobalRoutingState` implementation.
* Add a connection via HELLO simulation.
* Call `UpdateHeartbeat` with a HeartbeatPayload.
* Assert:
  * `LastHeartbeatUtc` updated.
  * `Status` set to `Healthy` (or whatever payload said).
* Advance time (simulate via injecting a clock or mocking DateTime) and call `MarkStaleConnectionsUnhealthy`:
  * Assert that `Status` changed to `Unhealthy`.
### 7.3 Routing plugin tests
Write tests for `BasicRoutingPlugin`:
* Case 1: multiple connections, some unhealthy:
  * Only Healthy/Degraded are considered.
* Case 2: multiple regions:
  * Instances in gateway region win over others.
* Case 3: same region, different `AveragePingMs`:
  * Lower latency chosen.
* Case 4: same latency, different `LastHeartbeatUtc`:
  * More recent heartbeat chosen.
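
One way to make these cases cheap to test is to keep the selection rule in a pure function that takes plain data. The sketch below is an illustration under assumptions: `Candidate` is a flattened, hypothetical stand-in for `ConnectionState`, and `BasicSelection.Choose` mirrors the algorithm from section 5.2 rather than calling the real plugin.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum InstanceHealthStatus { Healthy, Degraded, Draining, Unhealthy }

// Flattened stand-in for ConnectionState (the real type nests Region under Instance).
public sealed record Candidate(
    string Id,
    string Region,
    InstanceHealthStatus Status,
    double AveragePingMs,
    DateTime LastHeartbeatUtc);

public static class BasicSelection
{
    // Health filter -> region tier -> lowest latency -> freshest heartbeat.
    public static Candidate? Choose(
        IReadOnlyList<Candidate> candidates,
        string gatewayRegion,
        string[] neighborRegions)
    {
        var healthy = candidates
            .Where(c => c.Status is InstanceHealthStatus.Healthy or InstanceHealthStatus.Degraded)
            .ToList();

        var tier1 = healthy.Where(c => c.Region == gatewayRegion).ToList();
        var tier2 = healthy.Where(c => neighborRegions.Contains(c.Region)).ToList();
        var tier3 = healthy.Except(tier1).Except(tier2).ToList();
        var tier = tier1.Count > 0 ? tier1 : tier2.Count > 0 ? tier2 : tier3;

        // Connections without a latency sample (<= 0) sort last.
        return tier
            .OrderBy(c => c.AveragePingMs <= 0 ? double.MaxValue : c.AveragePingMs)
            .ThenByDescending(c => c.LastHeartbeatUtc)
            .FirstOrDefault();
    }
}
```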
These tests will give you confidence that the routing logic behaves as requested and is stable as you add complexity later (streaming, cancellation, etc.).

---
## 8. Done criteria for “Add heartbeat, health, basic routing rules”
You can declare this step complete when:
* Microservices:
  * Periodically send HEARTBEAT frames on the same connection they use for requests.
* Gateway/router:
  * Updates `LastHeartbeatUtc` and `Status` on receipt of HEARTBEAT.
  * Marks stale or disconnected connections as `Unhealthy` (or removes them).
  * Tracks `AveragePingMs` per connection based on request/response round trips.
* Routing:
  * `IRoutingPlugin` chooses instances based on:
    * Strict `ServiceName` + `Version` + endpoint match.
    * Health (`Healthy`/`Degraded` only).
    * Region preference (gateway region > neighbors > others).
    * Latency (`AveragePingMs`) then heartbeat recency.
* Tests:
  * Validate heartbeats are sent and processed.
  * Validate stale connections are marked unhealthy.
  * Validate routing plugin picks the expected instance in simple scenarios.

Once this is in place, you have a live, health-aware routing fabric. The next logical step after this is to add **cancellation** and then **streaming + payload limits** on top of the same structures.