Router Heartbeat and Health Monitoring
Module
Gateway
Status
VERIFIED
Description
Heartbeat protocol with configurable intervals, HealthMonitorService for stale instance detection, Draining health status for graceful shutdown, and automatic instance removal on missed heartbeats. ConnectionState.AveragePingMs property exists for future ping latency tracking but EMA computation is not yet implemented (PingHistorySize config is reserved).
Implementation Details
- Health monitor service: src/Gateway/StellaOps.Gateway.WebService/Services/GatewayHealthMonitorService.cs -- BackgroundService with periodic CheckStaleConnections (107 lines)
- Health check middleware: src/Gateway/StellaOps.Gateway.WebService/Middleware/HealthCheckMiddleware.cs -- /health, /health/live, /health/ready, /health/startup endpoints (91 lines)
- Gateway hosted service: src/Gateway/StellaOps.Gateway.WebService/Services/GatewayHostedService.cs -- HandleHeartbeatAsync updates LastHeartbeatUtc and Status (533 lines total)
- Health options: src/Router/__Libraries/StellaOps.Router.Gateway/Configuration/HealthOptions.cs -- StaleThreshold=30s, DegradedThreshold=15s, CheckInterval=5s (37 lines)
- Connection state: src/Router/__Libraries/StellaOps.Router.Common/Models/ConnectionState.cs -- Status, LastHeartbeatUtc, AveragePingMs properties
- Source: batch_51/file_23.md
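To make the relationship between the options, the connection state, and the heartbeat handler concrete, here is a minimal language-neutral sketch in Python. It is not the C# implementation. The default values come from the list above. The Python names, the shape of `handle_heartbeat`, the idea that a heartbeat carries a reported status, and the `0.0` default for the ping average are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum


class InstanceHealthStatus(Enum):
    HEALTHY = "Healthy"
    DEGRADED = "Degraded"
    UNHEALTHY = "Unhealthy"
    DRAINING = "Draining"


@dataclass(frozen=True)
class HealthOptions:
    # Defaults mirror the values listed for HealthOptions.cs.
    stale_threshold: timedelta = timedelta(seconds=30)
    degraded_threshold: timedelta = timedelta(seconds=15)
    check_interval: timedelta = timedelta(seconds=5)
    ping_history_size: int = 10  # reserved; nothing consumes it yet


@dataclass
class ConnectionState:
    instance_id: str
    status: InstanceHealthStatus
    last_heartbeat_utc: datetime
    average_ping_ms: float = 0.0  # assumed default; scaffolding only, never computed


def handle_heartbeat(conn: ConnectionState,
                     reported_status: InstanceHealthStatus,
                     now: datetime) -> None:
    """Hypothetical stand-in for HandleHeartbeatAsync.

    Refreshes the heartbeat timestamp and adopts the status carried by the
    heartbeat (assumption: the frame reports a status).
    """
    conn.last_heartbeat_utc = now
    conn.status = reported_status


if __name__ == "__main__":
    t0 = datetime(2026, 2, 9, tzinfo=timezone.utc)
    conn = ConnectionState("instance-1", InstanceHealthStatus.HEALTHY, t0)
    handle_heartbeat(conn, InstanceHealthStatus.DRAINING, t0 + timedelta(seconds=7))
    print(conn.status.value, conn.last_heartbeat_utc.isoformat())
```

In this sketch a heartbeat never touches `average_ping_ms`, matching the note that the EMA computation is not yet implemented.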
E2E Test Plan
- Verify heartbeat protocol detects stale instances (Healthy -> Unhealthy at 30s)
- Test configurable heartbeat intervals (custom thresholds work)
- Verify Draining status for graceful shutdown (skipped during stale checks)
- Test health status transitions (Healthy -> Degraded at 15s, -> Unhealthy at 30s)
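When timing these E2E checks, note that stale detection is polled rather than event-driven, so a transition is observed at the first check after the threshold is crossed, not at the threshold itself. The sketch below estimates that first observation time. It assumes checks land at exact multiples of CheckInterval after the last heartbeat and that the comparison is strictly greater-than, as the code review describes. The function name is hypothetical.

```python
import math


def first_detection_s(threshold_s: float, check_interval_s: float) -> float:
    """Earliest poll time, in seconds after the last heartbeat, with age > threshold.

    Assumes polls at 0, interval, 2*interval, ... relative to the last heartbeat.
    Because the comparison is strict, a poll landing exactly on the threshold
    does not trigger a transition; the next poll does.
    """
    ticks = math.floor(threshold_s / check_interval_s) + 1
    return ticks * check_interval_s


if __name__ == "__main__":
    print(first_detection_s(15, 5))  # Degraded first observed at 20 s
    print(first_detection_s(30, 5))  # Unhealthy first observed at 35 s
```

Under these assumptions, an E2E assertion should allow up to one CheckInterval of slack beyond the configured threshold. Real poll alignment will vary.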
Verification
- Run ID: run-003
- Date: 2026-02-09
- Method: Tier 1 code review + Tier 2d unit tests (written to fill gap)
- Build: PASS (0 errors, 0 warnings)
- Tests: PASS (253/253 gateway tests pass)
- Code Review:
- GatewayHealthMonitorService (107 lines): BackgroundService that loops with CheckInterval delay. CheckStaleConnections iterates all connections from IGlobalRoutingState. Skips Draining instances. For each connection: age > StaleThreshold && not already Unhealthy → marks Unhealthy. Age > DegradedThreshold && currently Healthy → marks Degraded. Logs warnings with InstanceId/ServiceName/Version/age.
- HealthCheckMiddleware (91 lines): Handles /health (summary), /health/live (liveness), /health/ready (readiness), /health/startup (startup probe). Returns JSON with status and connection counts.
- HealthOptions (37 lines): StaleThreshold=30s (connection removed), DegradedThreshold=15s (intermediate warning state), CheckInterval=5s, PingHistorySize=10 (reserved, not yet used).
- ConnectionState: Status (InstanceHealthStatus enum), LastHeartbeatUtc (updated by heartbeat frames), AveragePingMs (field exists, not computed).
- EMA Ping Latency: The feature originally described "ping latency tracking with exponential moving average." The config field PingHistorySize=10 and property ConnectionState.AveragePingMs exist as scaffolding, but no EMA computation is implemented. The core heartbeat/stale detection functionality works correctly without it. Feature description updated to reflect actual state.
- Tests Written (10 new tests):
- GatewayHealthMonitorServiceTests (10 tests): Healthy→Unhealthy when heartbeat age > staleThreshold, Healthy→Degraded when age > degradedThreshold, Draining connections skipped (no UpdateConnection called), recent heartbeat stays Healthy, already-Unhealthy not updated again, Degraded→Unhealthy at stale threshold, Degraded stays Degraded when not Healthy (Degraded→Degraded transition guard), mixed connections with correct per-instance transitions, custom thresholds are respected.
- Verdict: PASS
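The transition rules summarized in the code review can be sketched as a single sweep over the connections. This is an illustrative Python rendering, not the C# source. The `Conn` type, the string status constants, and the returned update list are assumptions, and the `elif` is a simplification that behaves the same as two independent checks, because a connection just marked Unhealthy is no longer Healthy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

HEALTHY, DEGRADED, UNHEALTHY, DRAINING = "Healthy", "Degraded", "Unhealthy", "Draining"


@dataclass
class Conn:
    instance_id: str
    status: str
    last_heartbeat_utc: datetime


def check_stale_connections(conns, now,
                            stale=timedelta(seconds=30),
                            degraded=timedelta(seconds=15)):
    """Run one sweep and return the (instance_id, new_status) updates applied."""
    updates = []
    for conn in conns:
        if conn.status == DRAINING:
            continue  # draining instances are skipped entirely
        age = now - conn.last_heartbeat_utc
        if age > stale and conn.status != UNHEALTHY:
            conn.status = UNHEALTHY  # Healthy or Degraded -> Unhealthy
            updates.append((conn.instance_id, UNHEALTHY))
        elif age > degraded and conn.status == HEALTHY:
            conn.status = DEGRADED  # only Healthy is downgraded to Degraded
            updates.append((conn.instance_id, DEGRADED))
    return updates


if __name__ == "__main__":
    now = datetime(2026, 2, 9, 12, 0, tzinfo=timezone.utc)

    def ago(seconds):
        return now - timedelta(seconds=seconds)

    conns = [
        Conn("fresh", HEALTHY, ago(5)),
        Conn("slow", HEALTHY, ago(20)),
        Conn("gone", HEALTHY, ago(40)),
        Conn("draining", DRAINING, ago(60)),
    ]
    for instance_id, status in check_stale_connections(conns, now):
        print(instance_id, "->", status)
```

Returning the list of updates makes it easy to assert that no update is issued for Draining or already-Unhealthy connections, which is what the unit tests listed above check.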
Tier 2 Recheck (2026-02-10)
- Run ID: run-004
- Result: PASS
- What was rechecked: Live /health* API behavior plus expanded heartbeat lifecycle regression tests for HELLO/heartbeat/disconnect paths.
- Evidence: docs/qa/feature-checks/runs/gateway/router-heartbeat-and-health-monitoring/run-004/tier2-integration-check.json
Recheck (run-005)
- Date: 2026-02-10
- Result: PASS
- Verification: Heartbeat health monitoring and health-surface behavior remain stable.
- Tests: Gateway.WebService.Tests 259/259, Router.Gateway.WebService.Tests 160/160, Router.Gateway.Tests 13/13 (432 total).
- Evidence: docs/qa/feature-checks/runs/gateway/router-heartbeat-and-health-monitoring/run-005/tier2-integration-check.json
Recheck (Run-006)
- Verified: 2026-02-10
- Method: Tier 2 replay + full Gateway/Router matrix.
- Tests: PASS (src/Gateway/__Tests/StellaOps.Gateway.WebService.Tests: 259/259; src/Router/__Tests/StellaOps.Gateway.WebService.Tests: 160/160; src/Router/__Tests/StellaOps.Router.Gateway.Tests: 13/13).
- Tier 2 Evidence: docs/qa/feature-checks/runs/gateway/router-heartbeat-and-health-monitoring/run-006/tier2-integration-check.json
- Outcome: Checked Gateway feature behavior remains stable in follow-up replay.
Recheck (Run-007)
- Verified: 2026-02-10
- Method: Tier 2 integration replay.
- Tests: PASS (src/Gateway/__Tests/StellaOps.Gateway.WebService.Tests: 259/259; src/Router/__Tests/StellaOps.Gateway.WebService.Tests: 160/160; src/Router/__Tests/StellaOps.Router.Gateway.Tests: 13/13).
- Tier 2 Evidence: docs/qa/feature-checks/runs/gateway/router-heartbeat-and-health-monitoring/run-007/tier2-integration-check.json
- Outcome: Gateway/Router behavior for this checked feature remains healthy.
Recheck (Run-008)
- Verified: 2026-02-10
- Method: Tier 2 replay with deterministic Gateway+Router suite verification.
- Tests: PASS (src/Gateway/__Tests/StellaOps.Gateway.WebService.Tests: 259/259; src/Router/__Tests/StellaOps.Gateway.WebService.Tests: 160/160; src/Router/__Tests/StellaOps.Router.Gateway.Tests: 13/13).
- Tier 2 Evidence: docs/qa/feature-checks/runs/gateway/router-heartbeat-and-health-monitoring/run-008/tier2-integration-check.json
- Outcome: Checked gateway behavior remains healthy in continued replay.
Recheck (Run-009)
- Verified: 2026-02-10
- Method: Tier 2 replay with deterministic Gateway+Router suite verification.
- Tests: PASS (src/Gateway/__Tests/StellaOps.Gateway.WebService.Tests: 259/259; src/Router/__Tests/StellaOps.Gateway.WebService.Tests: 160/160; src/Router/__Tests/StellaOps.Router.Gateway.Tests: 13/13).
- Tier 2 Evidence: docs/qa/feature-checks/runs/gateway/router-heartbeat-and-health-monitoring/run-009/tier2-integration-check.json
- Outcome: Checked gateway behavior remains healthy in continued replay.
Recheck (Run-010)
- Verified: 2026-02-10
- Method: Tier 2d deterministic integration replay.
- Tests: PASS (Gateway.WebService.Tests 259/259, Router.Gateway.WebService.Tests 160/160, Router.Gateway.Tests 13/13).
- Tier 2 Evidence: docs/qa/feature-checks/runs/gateway/router-heartbeat-and-health-monitoring/run-010/tier2-integration-check.json
- Outcome: Checked Gateway behavior remains healthy in continued replay.
Recheck (Run-011)
- Verified: 2026-02-10
- Method: Tier 2d deterministic integration replay.
- Tests: PASS (Gateway.WebService 259/259, Router.Gateway.WebService 160/160, Router.Gateway 13/13; total 432/432).
- Tier 2 Evidence: docs/qa/feature-checks/runs/gateway/router-heartbeat-and-health-monitoring/run-011/tier2-integration-check.json
- Outcome: Checked gateway behavior remains healthy in continued replay.
Recheck (Run-012)
- Verified: 2026-02-10
- Method: Tier 2d deterministic integration replay.
- Tests: PASS (Gateway.WebService 259/259, Router.Gateway.WebService 160/160, Router.Gateway 13/13; total 432/432).
- Tier 2 Evidence: docs/qa/feature-checks/runs/gateway/router-heartbeat-and-health-monitoring/run-012/tier2-integration-check.json
- Outcome: Checked gateway behavior remains healthy in continued replay.