git.stella-ops.org/docs/router/06-Step.md
2025-12-02 18:38:32 +02:00


For this step, you're layering liveness and basic routing intelligence on top of the minimal handshake/dispatch you already designed.

Target outcome:

  • Microservices send heartbeats over the existing connection.
  • The router tracks LastHeartbeatUtc, health status, and AveragePingMs per connection.
  • The router's IRoutingPlugin uses region + health + latency to pick an instance.

No need to handle cancellation or streaming yet; the goal is only to replace naive instance selection with health- and latency-aware routing decisions.


0. Preconditions

Before starting, confirm:

  • StellaOps.Router.Common already has:

    • InstanceHealthStatus enum.
    • ConnectionState with at least Instance, Status, LastHeartbeatUtc, AveragePingMs, TransportType.
  • Minimal handshake is working:

    • Microservice sends HELLO (instance + endpoints).
    • Router creates ConnectionState & populates global routing view.
    • Router can send REQUEST and receive RESPONSE via InMemory transport.

If any of that is incomplete, shore it up first.


1. Extend Common with heartbeat payloads

Project: StellaOps.Router.Common Owner: Common dev

Add DTOs for heartbeat frames.

1.1 Heartbeat payload

public sealed class HeartbeatPayload
{
    public string InstanceId { get; init; } = string.Empty;
    public InstanceHealthStatus Status { get; init; } = InstanceHealthStatus.Healthy;

    // Optional basic metrics
    public int InFlightRequests { get; init; }
    public double ErrorRate { get; init; }  // 0 to 1 range, optional
}

  • This is application-level health; Status lets the microservice say “Degraded” / “Draining”.
  • In-flight + error rate can be used later for smarter routing; initially, you can ignore them.

1.2 Wire into frame model

Ensure:

  • FrameType includes Heartbeat:

    public enum FrameType : byte
    {
        Hello = 1,
        Heartbeat = 2,
        EndpointsUpdate = 3,
        Request = 4,
        RequestStreamData = 5,
        Response = 6,
        ResponseStreamData = 7,
        Cancel = 8
    }
    
  • No behavior in Common; only DTOs and enums.


2. Microservice SDK: send heartbeats on the same connection

Project: StellaOps.Microservice Owner: SDK dev

You already have MicroserviceConnectionHostedService doing HELLO and request dispatch. Now add heartbeat sending.

2.1 Introduce heartbeat options

Extend StellaMicroserviceOptions with simple settings:

public sealed class StellaMicroserviceOptions
{
    // existing fields...
    public TimeSpan HeartbeatInterval { get; set; } = TimeSpan.FromSeconds(10);
    public TimeSpan HeartbeatTimeout  { get; set; } = TimeSpan.FromSeconds(30); // used by router, not here
}

2.2 Internal heartbeat sender

Create an internal interface and implementation:

internal interface IHeartbeatSource
{
    InstanceHealthStatus GetCurrentStatus();
    int GetInFlightRequests();
    double GetErrorRate();
}

For now you can implement a trivial DefaultHeartbeatSource:

  • GetCurrentStatus() → Healthy.
  • GetInFlightRequests() → 0.
  • GetErrorRate() → 0.

Wire this in DI:

services.AddSingleton<IHeartbeatSource, DefaultHeartbeatSource>();
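A minimal sketch of the trivial source described above. The enum is a stand-in for the one in StellaOps.Router.Common (its exact member list is an assumption), and the interface is repeated only so the snippet compiles on its own.

```csharp
// Stand-in for the enum in StellaOps.Router.Common; the member list is an assumption.
public enum InstanceHealthStatus { Unknown, Healthy, Degraded, Draining, Unhealthy }

internal interface IHeartbeatSource
{
    InstanceHealthStatus GetCurrentStatus();
    int GetInFlightRequests();
    double GetErrorRate();
}

// Always reports a healthy, idle instance. Swap in real metrics later
// without touching the heartbeat loop, which only sees the interface.
internal sealed class DefaultHeartbeatSource : IHeartbeatSource
{
    public InstanceHealthStatus GetCurrentStatus() => InstanceHealthStatus.Healthy;
    public int GetInFlightRequests() => 0;
    public double GetErrorRate() => 0.0;
}
```

Keeping the source behind an interface means a later implementation (for example one that reports Draining during shutdown) can be registered in DI in place of this one.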

2.3 Add heartbeat loop to MicroserviceConnectionHostedService

In StartAsync of MicroserviceConnectionHostedService:

  • After sending HELLO and subscribing to requests, start a background heartbeat loop.

Pseudo-plan:

private Task? _heartbeatLoop;
private CancellationTokenSource? _heartbeatCts;

public async Task StartAsync(CancellationToken ct)
{
    // existing HELLO logic...
    await _connection.SendHelloAsync(payload, ct);

    // ct only signals that startup was aborted, so the long-running work
    // gets its own token. StopAsync should cancel _heartbeatCts and then
    // await _heartbeatLoop.
    _heartbeatCts = new CancellationTokenSource();
    var loopCt = _heartbeatCts.Token;

    _connection.OnRequest(frame => HandleRequestAsync(frame, loopCt));

    _heartbeatLoop = Task.Run(() => HeartbeatLoopAsync(loopCt), loopCt);
}

private async Task HeartbeatLoopAsync(CancellationToken ct)
{
    var opt = _options.Value;
    var interval = opt.HeartbeatInterval;
    var instanceId = opt.InstanceId;

    try
    {
        while (!ct.IsCancellationRequested)
        {
            // Sample the current health on every tick so status changes propagate.
            var payload = new HeartbeatPayload
            {
                InstanceId = instanceId,
                Status = _heartbeatSource.GetCurrentStatus(),
                InFlightRequests = _heartbeatSource.GetInFlightRequests(),
                ErrorRate = _heartbeatSource.GetErrorRate()
            };

            var frame = new Frame
            {
                Type = FrameType.Heartbeat,
                CorrelationId = Guid.Empty, // or a reserved value
                Payload = SerializeHeartbeatPayload(payload)
            };

            await _connection.SendHeartbeatAsync(frame, ct);
            await Task.Delay(interval, ct);
        }
    }
    catch (OperationCanceledException)
    {
        // Normal shutdown: either the send or the delay observed cancellation.
    }
}

You'll need to extend IMicroserviceConnection with:

Task SendHeartbeatAsync(Frame frame, CancellationToken ct);

In this step, the logic is simple: every N seconds (HeartbeatInterval), push a heartbeat.
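The loop calls SerializeHeartbeatPayload, which is not defined in this step. The sketch below is one possible shape, assuming the frame payload is a byte[] and that JSON via System.Text.Json is acceptable on the wire; the document does not fix a wire format, so treat both as assumptions. HeartbeatSerializer is a hypothetical helper name, and the enum is a stand-in for the one in Common.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

// Stand-in for the enum in StellaOps.Router.Common; the member list is an assumption.
public enum InstanceHealthStatus { Unknown, Healthy, Degraded, Draining, Unhealthy }

public sealed class HeartbeatPayload
{
    public string InstanceId { get; init; } = string.Empty;
    public InstanceHealthStatus Status { get; init; } = InstanceHealthStatus.Healthy;
    public int InFlightRequests { get; init; }
    public double ErrorRate { get; init; }
}

// Hypothetical helper: JSON is an assumed wire format, not one the design mandates.
public static class HeartbeatSerializer
{
    // Enum values are written as names so the payload stays readable in logs
    // and survives reordering of enum members.
    private static readonly JsonSerializerOptions Options = new()
    {
        Converters = { new JsonStringEnumConverter() }
    };

    public static byte[] Serialize(HeartbeatPayload payload) =>
        JsonSerializer.SerializeToUtf8Bytes(payload, Options);

    public static HeartbeatPayload? Deserialize(ReadOnlySpan<byte> utf8Json) =>
        JsonSerializer.Deserialize<HeartbeatPayload>(utf8Json, Options);
}
```

The router side would use the matching Deserialize call when it handles a Heartbeat frame, so both ends share one definition of the format.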


3. Router: accept heartbeats and update connection health

Project: StellaOps.Gateway.WebService Owner: Gateway dev

You already have an InMemory router or similar structure that:

  • Handles HELLO frames, creates ConnectionState.
  • Maintains a global IGlobalRoutingState.

Now you need to:

  • Handle HEARTBEAT frames.
  • Update ConnectionState.Status and LastHeartbeatUtc.

3.1 Frame dispatch on router side

In your router's InMemory server loop (or equivalent), add a case for FrameType.Heartbeat:

  • Deserialize HeartbeatPayload from frame.Payload.

  • Find the corresponding ConnectionState by InstanceId (and/or connection ID).

  • Update:

    • LastHeartbeatUtc = DateTime.UtcNow.
    • Status = payload.Status.

You can add a method in your routing-state implementation:

public void UpdateHeartbeat(string connectionId, HeartbeatPayload payload)
{
    if (!_connections.TryGetValue(connectionId, out var conn))
        return;

    conn.LastHeartbeatUtc = DateTime.UtcNow;
    conn.Status = payload.Status;
}

The router's transport server should know which connectionId delivered the frame; pass that along.

3.2 Detect stale connections (health degradation)

Add a background “health monitor” in the gateway:

  • Reads HeartbeatTimeout from configuration (can reuse the same default as microservice or have separate router-side config).

  • Periodically scans all ConnectionState entries:

    • If Now - LastHeartbeatUtc > HeartbeatTimeout, mark Status = Unhealthy (or remove connection entirely).
    • If connection drops (transport disconnect), also mark Unhealthy or remove.

This can be a simple IHostedService:

internal sealed class ConnectionHealthMonitor : IHostedService
{
    private readonly IGlobalRoutingState _state;
    private readonly TimeSpan _heartbeatTimeout;
    private Task? _loop;
    private CancellationTokenSource? _cts;

    public Task StartAsync(CancellationToken cancellationToken)
    {
        _cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        _loop = Task.Run(() => MonitorLoopAsync(_cts.Token), _cts.Token);
        return Task.CompletedTask;
    }

    public async Task StopAsync(CancellationToken cancellationToken)
    {
        _cts?.Cancel();
        if (_loop is not null)
            await _loop;
    }

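    private async Task MonitorLoopAsync(CancellationToken ct)
    {
        try
        {
            while (!ct.IsCancellationRequested)
            {
                _state.MarkStaleConnectionsUnhealthy(_heartbeatTimeout, DateTime.UtcNow);
                await Task.Delay(TimeSpan.FromSeconds(5), ct);
            }
        }
        catch (OperationCanceledException)
        {
            // Expected on shutdown; lets StopAsync await the loop without throwing.
        }
    }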
}

You'll add a method like MarkStaleConnectionsUnhealthy on your IGlobalRoutingState implementation.
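A minimal sketch of what MarkStaleConnectionsUnhealthy could look like, assuming connections are kept in a ConcurrentDictionary keyed by connection ID. InMemoryRoutingState and AddConnection are hypothetical names, and ConnectionState is trimmed to the two fields this method touches.

```csharp
using System;
using System.Collections.Concurrent;

// Stand-in for the enum in StellaOps.Router.Common; the member list is an assumption.
public enum InstanceHealthStatus { Unknown, Healthy, Degraded, Draining, Unhealthy }

// Trimmed stand-in for ConnectionState: only the fields this sketch reads or writes.
public sealed class ConnectionState
{
    public InstanceHealthStatus Status { get; set; }
    public DateTime LastHeartbeatUtc { get; set; }
}

// Hypothetical routing-state implementation.
public sealed class InMemoryRoutingState
{
    private readonly ConcurrentDictionary<string, ConnectionState> _connections = new();

    public void AddConnection(string connectionId, ConnectionState state) =>
        _connections[connectionId] = state;

    // Flags every connection whose last heartbeat is older than the timeout.
    // The current time is a parameter so tests can pass any timestamp they like.
    public void MarkStaleConnectionsUnhealthy(TimeSpan heartbeatTimeout, DateTime nowUtc)
    {
        foreach (var conn in _connections.Values)
        {
            if (nowUtc - conn.LastHeartbeatUtc > heartbeatTimeout)
                conn.Status = InstanceHealthStatus.Unhealthy;
        }
    }
}
```

This variant marks connections rather than removing them, so a late heartbeat can bring one back through UpdateHeartbeat. The property writes are not synchronized with the heartbeat handler; whether that matters depends on how ConnectionState is shared in the real implementation.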


4. Track basic latency (AveragePingMs)

Project: Gateway + Common Owner: Gateway dev

You want AveragePingMs per connection to inform routing decisions.

4.1 Decide where to measure

Simplest: measure “request → response” round-trip time in the gateway:

  • When you send a Request frame to a specific connection, record:

    • SentAtUtc[CorrelationId] = DateTime.UtcNow.
  • When you receive a Response frame with that correlation:

    • Compute latencyMs = (UtcNow - SentAtUtc[CorrelationId]).TotalMilliseconds.
    • Discard map entry.

Then update ConnectionState.AveragePingMs, e.g. with an exponential moving average:

conn.AveragePingMs = conn.AveragePingMs <= 0
    ? latencyMs
    : conn.AveragePingMs * 0.8 + latencyMs * 0.2;

4.2 Where to hook this

  • In the gateway-side transport client (InMemory implementation for now):

    • When sending Request frame:

      • Register SentAtUtc per correlation ID.
    • When receiving Response frame:

      • Compute latency.
      • Call IGlobalRoutingState.UpdateLatency(connectionId, latencyMs).

Add a method to the routing state:

public void UpdateLatency(string connectionId, double latencyMs)
{
    if (_connections.TryGetValue(connectionId, out var conn))
    {
        if (conn.AveragePingMs <= 0)
            conn.AveragePingMs = latencyMs;
        else
            conn.AveragePingMs = conn.AveragePingMs * 0.8 + latencyMs * 0.2;
    }
}

You can keep it simple; sophistication can come later.
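The bookkeeping from 4.1 can be sketched as a small tracker. This is an illustration under assumptions: RequestLatencyTracker is a hypothetical name, correlation IDs are assumed to be Guid values, and it swaps DateTime.UtcNow for Stopwatch timestamps because the wall clock can jump while a request is in flight.

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;

// Hypothetical helper for the gateway-side transport client.
public sealed class RequestLatencyTracker
{
    // Stopwatch timestamps keyed by correlation ID; an entry exists only
    // while its request is in flight.
    private readonly ConcurrentDictionary<Guid, long> _sentAt = new();

    public void OnRequestSent(Guid correlationId) =>
        _sentAt[correlationId] = Stopwatch.GetTimestamp();

    // Returns the round-trip time in milliseconds and discards the entry,
    // or null when the correlation ID is unknown (duplicate or late response).
    public double? OnResponseReceived(Guid correlationId)
    {
        if (!_sentAt.TryRemove(correlationId, out var start))
            return null;

        var elapsedTicks = Stopwatch.GetTimestamp() - start;
        return elapsedTicks * 1000.0 / Stopwatch.Frequency;
    }

    // Same exponential moving average as UpdateLatency: the first sample
    // seeds the average, later samples contribute 20%.
    public static double UpdateAverage(double currentAverageMs, double latencyMs) =>
        currentAverageMs <= 0
            ? latencyMs
            : currentAverageMs * 0.8 + latencyMs * 0.2;
}
```

As a worked example, a connection averaging 100 ms that then answers in 50 ms moves to 100 × 0.8 + 50 × 0.2 = 90 ms. Requests that never receive a response leave their entry in the map, so a real implementation would also need to evict entries on timeout or disconnect.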


5. Basic routing plugin implementation

Project: StellaOps.Gateway.WebService Owner: Gateway dev

You already have IRoutingPlugin defined. Now implement a concrete BasicRoutingPlugin that respects:

  • Region (gateway region first, then neighbor tiers).
  • Health (Healthy / Degraded only).
  • Latency preference (AveragePingMs).

5.1 Inputs & data

RoutingContext should carry:

  • EndpointDescriptor (with ServiceName, Version, Method, Path).
  • GatewayRegion (from GatewayNodeConfig.Region).
  • The HttpContext if you need headers (not needed for routing at this stage).

IGlobalRoutingState should provide:

  • GetConnectionsFor(serviceName, version, method, path) returning all ConnectionStates that support that endpoint.

5.2 Basic algorithm

Algorithm outline:

public sealed class BasicRoutingPlugin : IRoutingPlugin
{
    private readonly IGlobalRoutingState _state;
    private readonly string[] _neighborRegions; // configured, can be empty

    public async Task<RoutingDecision?> ChooseInstanceAsync(
        RoutingContext context,
        CancellationToken cancellationToken)
    {
        var endpoint = context.Endpoint;
        var candidates = _state.GetConnectionsFor(
            endpoint.ServiceName,
            endpoint.Version,
            endpoint.Method,
            endpoint.Path);

        if (candidates.Count == 0)
            return null;

        // 1. Filter by health (only Healthy or Degraded)
        var healthy = candidates
            .Where(c => c.Status == InstanceHealthStatus.Healthy || c.Status == InstanceHealthStatus.Degraded)
            .ToList();

        if (healthy.Count == 0)
            return null;

        // 2. Partition by region tier
        var gatewayRegion = context.GatewayRegion;

        List<ConnectionState> tier1 = healthy.Where(c => c.Instance.Region == gatewayRegion).ToList();
        List<ConnectionState> tier2 = healthy.Where(c => _neighborRegions.Contains(c.Instance.Region)).ToList();
        List<ConnectionState> tier3 = healthy.Except(tier1).Except(tier2).ToList();

        var chosenTier = tier1.Count > 0 ? tier1 : tier2.Count > 0 ? tier2 : tier3;
        if (chosenTier.Count == 0)
            return null;

        // 3. Sort by latency, then heartbeat freshness
        var ordered = chosenTier
            .OrderBy(c => c.AveragePingMs <= 0 ? double.MaxValue : c.AveragePingMs)
            .ThenByDescending(c => c.LastHeartbeatUtc)
            .ToList();

        var winner = ordered[0];

        // 4. Build decision
        return new RoutingDecision
        {
            Endpoint = endpoint,
            Connection = winner,
            TransportType = winner.TransportType,
            EffectiveTimeout = endpoint.DefaultTimeout  // or compose with config later
        };
    }
}

Wire it into DI:

services.AddSingleton<IRoutingPlugin, BasicRoutingPlugin>();

And ensure RoutingDecisionMiddleware calls it.


6. Integrate health-aware routing into the HTTP pipeline

Project: StellaOps.Gateway.WebService Owner: Gateway dev

Update your RoutingDecisionMiddleware to:

  • Use the registered IRoutingPlugin instead of picking a random connection.

  • Handle a null decision appropriately:

    • If ChooseInstanceAsync returns null, respond with 503 Service Unavailable (or 502 Bad Gateway) and a generic error body, and log the incident.

Check that:

  • The gateway's region is injected (via GatewayNodeConfig.Region) into RoutingContext.GatewayRegion.
  • Endpoint descriptor is resolved before you call the plugin.
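The null-decision handling might look roughly like this in ASP.NET Core. It is a sketch under assumptions: the routing types are empty stand-ins, and it assumes an earlier middleware has stored a RoutingContext in HttpContext.Items under typeof(RoutingContext), which is an illustrative convention rather than something this design specifies.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Logging;

// Empty stand-ins for the real routing types; only their identity matters here.
public sealed class RoutingContext { }
public sealed class RoutingDecision { }

public interface IRoutingPlugin
{
    Task<RoutingDecision?> ChooseInstanceAsync(
        RoutingContext context, CancellationToken cancellationToken);
}

public sealed class RoutingDecisionMiddleware
{
    private readonly RequestDelegate _next;
    private readonly IRoutingPlugin _plugin;
    private readonly ILogger<RoutingDecisionMiddleware> _logger;

    public RoutingDecisionMiddleware(
        RequestDelegate next,
        IRoutingPlugin plugin,
        ILogger<RoutingDecisionMiddleware> logger)
    {
        _next = next;
        _plugin = plugin;
        _logger = logger;
    }

    public async Task InvokeAsync(HttpContext http)
    {
        // Assumption: endpoint resolution already ran and stored the context here.
        var routingContext = (RoutingContext)http.Items[typeof(RoutingContext)]!;

        var decision = await _plugin.ChooseInstanceAsync(routingContext, http.RequestAborted);
        if (decision is null)
        {
            // No healthy instance: log it and return a generic error body.
            _logger.LogWarning("No routable instance for {Path}", http.Request.Path);
            http.Response.StatusCode = StatusCodes.Status503ServiceUnavailable;
            await http.Response.WriteAsync("Service unavailable.", http.RequestAborted);
            return;
        }

        // Hand the decision to the dispatch step further down the pipeline.
        http.Items[typeof(RoutingDecision)] = decision;
        await _next(http);
    }
}
```

The response short-circuits the pipeline, so the dispatch middleware never runs without a decision.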

7. Testing plan

Project: StellaOps.Gateway.WebService.Tests, StellaOps.Microservice.Tests Owner: test agent

Write basic tests to lock in behavior.

7.1 Microservice heartbeat tests

In StellaOps.Microservice.Tests:

  • Use a fake IMicroserviceConnection that records frames sent.

  • Configure HeartbeatInterval to a small number (e.g. 100 ms).

  • Start a Host with AddStellaMicroservice.

  • Wait some time, assert:

    • At least one HELLO frame was sent.
    • At least N HEARTBEAT frames were sent.
    • HEARTBEAT payload has correct InstanceId and Status.

7.2 Router health update tests

In StellaOps.Gateway.WebService.Tests (or a separate routing-state test project):

  • Create an instance of your IGlobalRoutingState implementation.

  • Add a connection via HELLO simulation.

  • Call UpdateHeartbeat with a HeartbeatPayload.

  • Assert:

    • LastHeartbeatUtc updated.
    • Status set to Healthy (or whatever payload said).
  • Advance time (inject a clock, or pass a later timestamp, since MarkStaleConnectionsUnhealthy takes the current time as an argument) and call MarkStaleConnectionsUnhealthy:

    • Assert that Status changed to Unhealthy.

7.3 Routing plugin tests

Write tests for BasicRoutingPlugin:

  • Case 1: multiple connections, some unhealthy:

    • Only Healthy/Degraded are considered.
  • Case 2: multiple regions:

    • Instances in gateway region win over others.
  • Case 3: same region, different AveragePingMs:

    • Lower latency chosen.
  • Case 4: same latency, different LastHeartbeatUtc:

    • More recent heartbeat chosen.

These tests will give you confidence that the routing logic behaves as requested and is stable as you add complexity later (streaming, cancellation, etc.).
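Cases 3 and 4 can be pinned down without any test framework by isolating the ordering step from 5.2. In this sketch, Candidate is a hypothetical trimmed view of ConnectionState and TierOrdering.PickWinner repeats the plugin's sort expression; a real test would go through BasicRoutingPlugin itself with whatever test framework the project uses.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical trimmed view of ConnectionState: only the fields the ordering reads.
public sealed record Candidate(string Id, double AveragePingMs, DateTime LastHeartbeatUtc);

public static class TierOrdering
{
    // Lowest measured latency first, unmeasured (<= 0) connections last,
    // ties broken by the most recent heartbeat.
    public static Candidate PickWinner(IEnumerable<Candidate> tier) =>
        tier.OrderBy(c => c.AveragePingMs <= 0 ? double.MaxValue : c.AveragePingMs)
            .ThenByDescending(c => c.LastHeartbeatUtc)
            .First();
}

public static class TierOrderingChecks
{
    public static void Run()
    {
        var t0 = new DateTime(2025, 1, 1, 0, 0, 0, DateTimeKind.Utc);

        // Case 3: same region, different latency -> lower latency wins.
        var slow = new Candidate("slow", 50, t0);
        var fast = new Candidate("fast", 5, t0);
        Expect(TierOrdering.PickWinner(new[] { slow, fast }).Id == "fast", "case 3");

        // Case 4: same latency, different heartbeat -> fresher heartbeat wins.
        var older = new Candidate("older", 20, t0);
        var newer = new Candidate("newer", 20, t0.AddSeconds(5));
        Expect(TierOrdering.PickWinner(new[] { older, newer }).Id == "newer", "case 4");

        // A connection with no latency sample sorts after any measured one.
        var unmeasured = new Candidate("unmeasured", 0, t0.AddSeconds(10));
        Expect(TierOrdering.PickWinner(new[] { unmeasured, slow }).Id == "slow", "unmeasured last");
    }

    private static void Expect(bool condition, string name)
    {
        if (!condition) throw new InvalidOperationException($"Check failed: {name}");
    }
}
```

The last check makes one consequence of the sort visible: an instance with no latency sample never wins against a measured instance in the same tier, so it may not receive the traffic that would produce its first sample. It may be worth covering that with a test once the intended behavior is decided.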


8. Done criteria for “Add heartbeat, health, basic routing rules”

You can declare this step complete when:

  • Microservices:

    • Periodically send HEARTBEAT frames on the same connection they use for requests.
  • Gateway/router:

    • Updates LastHeartbeatUtc and Status on receipt of HEARTBEAT.
    • Marks stale or disconnected connections as Unhealthy (or removes them).
    • Tracks AveragePingMs per connection based on request/response round trips.
  • Routing:

    • IRoutingPlugin chooses instances based on:

      • Strict ServiceName + Version + endpoint match.
      • Health (Healthy/Degraded only).
      • Region preference (gateway region > neighbors > others).
      • Latency (AveragePingMs) then heartbeat recency.
  • Tests:

    • Validate heartbeats are sent and processed.
    • Validate stale connections are marked unhealthy.
    • Validate routing plugin picks the expected instance in simple scenarios.

Once this is in place, you have a live, health-aware routing fabric. The next logical step after this is to add cancellation and then streaming + payload limits on top of the same structures.