# Router Heartbeat and Health Monitoring

## Module
Gateway

## Status
VERIFIED

## Description
Heartbeat protocol with configurable intervals, `HealthMonitorService` for stale instance detection, Draining health status for graceful shutdown, and automatic instance removal on missed heartbeats. The `ConnectionState.AveragePingMs` property exists for future ping latency tracking, but the EMA computation is not yet implemented (the `PingHistorySize` config is reserved).

## Implementation Details
- **Health monitor service**: `src/Gateway/StellaOps.Gateway.WebService/Services/GatewayHealthMonitorService.cs` -- BackgroundService with periodic CheckStaleConnections (107 lines)
- **Health check middleware**: `src/Gateway/StellaOps.Gateway.WebService/Middleware/HealthCheckMiddleware.cs` -- /health, /health/live, /health/ready, /health/startup endpoints (91 lines)
- **Gateway hosted service**: `src/Gateway/StellaOps.Gateway.WebService/Services/GatewayHostedService.cs` -- HandleHeartbeatAsync updates LastHeartbeatUtc and Status (533 lines total)
- **Health options**: `src/Router/__Libraries/StellaOps.Router.Gateway/Configuration/HealthOptions.cs` -- StaleThreshold=30s, DegradedThreshold=15s, CheckInterval=5s (37 lines)
- **Connection state**: `src/Router/__Libraries/StellaOps.Router.Common/Models/ConnectionState.cs` -- Status, LastHeartbeatUtc, AveragePingMs properties
- **Source**: batch_51/file_23.md

## E2E Test Plan
- [x] Verify heartbeat protocol detects stale instances (Healthy -> Unhealthy at 30s)
- [x] Test configurable heartbeat intervals (custom thresholds work)
- [x] Verify Draining status for graceful shutdown (skipped during stale checks)
- [x] Test health status transitions (Healthy -> Degraded at 15s, -> Unhealthy at 30s)

## Verification
- **Run ID**: run-003
- **Date**: 2026-02-09
- **Method**: Tier 1 code review + Tier 2d unit tests (written to fill gap)
- **Build**: PASS (0 errors, 0 warnings)
- **Tests**: PASS (253/253 gateway tests pass)
- **Code Review**:
  - GatewayHealthMonitorService (107 lines): BackgroundService that loops with a CheckInterval delay. CheckStaleConnections iterates all connections from IGlobalRoutingState and skips Draining instances. For each remaining connection, a heartbeat age above StaleThreshold marks it Unhealthy unless it is already Unhealthy; an age above DegradedThreshold marks it Degraded only if it is currently Healthy. Logs warnings with InstanceId/ServiceName/Version/age.
  - HealthCheckMiddleware (91 lines): Handles /health (summary), /health/live (liveness), /health/ready (readiness), /health/startup (startup probe). Returns JSON with status and connection counts.
  - HealthOptions (37 lines): StaleThreshold=30s (connection removed), DegradedThreshold=15s (intermediate warning state), CheckInterval=5s, PingHistorySize=10 (reserved, not yet used).
  - ConnectionState: Status (InstanceHealthStatus enum), LastHeartbeatUtc (updated by heartbeat frames), AveragePingMs (property exists, not computed).
- **EMA Ping Latency**: The feature originally described "ping latency tracking with exponential moving average." The config field `PingHistorySize=10` and the property `ConnectionState.AveragePingMs` exist as scaffolding, but no EMA computation is implemented. The core heartbeat/stale detection functionality works correctly without it. Feature description updated to reflect actual state.
- **Tests Written** (10 new tests):
  - GatewayHealthMonitorServiceTests (10 tests):
    - Healthy→Unhealthy when heartbeat age > staleThreshold
    - Healthy→Degraded when age > degradedThreshold
    - Draining connections skipped (no UpdateConnection called)
    - Recent heartbeat stays Healthy
    - Already-Unhealthy not updated again
    - Degraded→Unhealthy at stale threshold
    - Degraded stays Degraded, because the Degraded transition applies only to Healthy connections (Degraded→Degraded transition guard)
    - Mixed connections with correct per-instance transitions
    - Custom thresholds are respected
- **Verdict**: PASS

## Tier 2 Recheck (2026-02-10)
- **Run ID**: run-004
- **Result**: PASS
- **What was rechecked**: Live `/health*` API behavior plus expanded heartbeat lifecycle regression tests for HELLO/heartbeat/disconnect paths.
- **Evidence**: `docs/qa/feature-checks/runs/gateway/router-heartbeat-and-health-monitoring/run-004/tier2-integration-check.json`

## Recheck (Run-005)
- **Date**: 2026-02-10
- **Result**: PASS
- **Verification**: Heartbeat health monitoring and health-surface behavior remain stable.
- **Tests**: Gateway.WebService.Tests 259/259, Router.Gateway.WebService.Tests 160/160, Router.Gateway.Tests 13/13 (432 total).
- **Evidence**: `docs/qa/feature-checks/runs/gateway/router-heartbeat-and-health-monitoring/run-005/tier2-integration-check.json`

## Recheck (Run-006)
- **Verified**: 2026-02-10
- **Method**: Tier 2 replay + full Gateway/Router matrix.
- **Tests**: PASS (`src/Gateway/__Tests/StellaOps.Gateway.WebService.Tests`: 259/259; `src/Router/__Tests/StellaOps.Gateway.WebService.Tests`: 160/160; `src/Router/__Tests/StellaOps.Router.Gateway.Tests`: 13/13).
- **Tier 2 Evidence**: `docs/qa/feature-checks/runs/gateway/router-heartbeat-and-health-monitoring/run-006/tier2-integration-check.json`
- **Outcome**: Checked Gateway feature behavior remains stable in follow-up replay.

## Recheck (Run-007)
- **Verified**: 2026-02-10
- **Method**: Tier 2 integration replay.
- **Tests**: PASS (`src/Gateway/__Tests/StellaOps.Gateway.WebService.Tests`: 259/259; `src/Router/__Tests/StellaOps.Gateway.WebService.Tests`: 160/160; `src/Router/__Tests/StellaOps.Router.Gateway.Tests`: 13/13).
- **Tier 2 Evidence**: `docs/qa/feature-checks/runs/gateway/router-heartbeat-and-health-monitoring/run-007/tier2-integration-check.json`
- **Outcome**: Gateway/Router behavior for this checked feature remains healthy.

## Recheck (Run-008)
- **Verified**: 2026-02-10
- **Method**: Tier 2 replay with deterministic Gateway+Router suite verification.
- **Tests**: PASS (`src/Gateway/__Tests/StellaOps.Gateway.WebService.Tests`: 259/259; `src/Router/__Tests/StellaOps.Gateway.WebService.Tests`: 160/160; `src/Router/__Tests/StellaOps.Router.Gateway.Tests`: 13/13).
- **Tier 2 Evidence**: `docs/qa/feature-checks/runs/gateway/router-heartbeat-and-health-monitoring/run-008/tier2-integration-check.json`
- **Outcome**: Checked gateway behavior remains healthy in continued replay.

## Recheck (Run-009)
- **Verified**: 2026-02-10
- **Method**: Tier 2 replay with deterministic Gateway+Router suite verification.
- **Tests**: PASS (`src/Gateway/__Tests/StellaOps.Gateway.WebService.Tests`: 259/259; `src/Router/__Tests/StellaOps.Gateway.WebService.Tests`: 160/160; `src/Router/__Tests/StellaOps.Router.Gateway.Tests`: 13/13).
- **Tier 2 Evidence**: `docs/qa/feature-checks/runs/gateway/router-heartbeat-and-health-monitoring/run-009/tier2-integration-check.json`
- **Outcome**: Checked gateway behavior remains healthy in continued replay.

## Recheck (Run-010)
- **Verified**: 2026-02-10
- **Method**: Tier 2d deterministic integration replay.
- **Tests**: PASS (Gateway.WebService.Tests 259/259, Router.Gateway.WebService.Tests 160/160, Router.Gateway.Tests 13/13).
- **Tier 2 Evidence**: `docs/qa/feature-checks/runs/gateway/router-heartbeat-and-health-monitoring/run-010/tier2-integration-check.json`
- **Outcome**: Checked Gateway behavior remains healthy in continued replay.

## Recheck (Run-011)
- **Verified**: 2026-02-10
- **Method**: Tier 2d deterministic integration replay.
- **Tests**: PASS (Gateway.WebService 259/259, Router.Gateway.WebService 160/160, Router.Gateway 13/13; total 432/432).
- **Tier 2 Evidence**: `docs/qa/feature-checks/runs/gateway/router-heartbeat-and-health-monitoring/run-011/tier2-integration-check.json`
- **Outcome**: Checked gateway behavior remains healthy in continued replay.

## Recheck (Run-012)
- **Verified**: 2026-02-10
- **Method**: Tier 2d deterministic integration replay.
- **Tests**: PASS (Gateway.WebService 259/259, Router.Gateway.WebService 160/160, Router.Gateway 13/13; total 432/432).
- **Tier 2 Evidence**: `docs/qa/feature-checks/runs/gateway/router-heartbeat-and-health-monitoring/run-012/tier2-integration-check.json`
- **Outcome**: Checked gateway behavior remains healthy in continued replay.
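
## Illustrative Sketch: Stale-Check Transitions

The transition rules recorded in the run-003 code review (Draining skipped, Unhealthy above the stale threshold, Degraded above the degraded threshold only from Healthy) can be sketched as a small decision function. This is a minimal illustration in Python, not the project's C# code: the names `Status`, `Connection`, `next_status`, and `check_stale_connections` are hypothetical, and only the threshold defaults (30s and 15s) come from the HealthOptions values recorded above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum


class Status(Enum):
    HEALTHY = "Healthy"
    DEGRADED = "Degraded"
    UNHEALTHY = "Unhealthy"
    DRAINING = "Draining"


@dataclass(frozen=True)
class HealthOptions:
    # Defaults mirror the StaleThreshold/DegradedThreshold values in this document.
    stale_threshold: timedelta = timedelta(seconds=30)
    degraded_threshold: timedelta = timedelta(seconds=15)


@dataclass
class Connection:
    # Hypothetical stand-in for ConnectionState; only the fields needed here.
    instance_id: str
    status: Status
    last_heartbeat_utc: datetime


def next_status(conn, now, opts):
    """Return the status to transition to, or None when no update is needed."""
    if conn.status is Status.DRAINING:
        return None  # Draining instances are skipped entirely.
    age = now - conn.last_heartbeat_utc
    if age > opts.stale_threshold and conn.status is not Status.UNHEALTHY:
        return Status.UNHEALTHY
    if age > opts.degraded_threshold and conn.status is Status.HEALTHY:
        return Status.DEGRADED
    return None


def check_stale_connections(connections, now, opts=HealthOptions()):
    """Apply one check pass in place; return the IDs of updated instances."""
    changed = []
    for conn in connections:
        new_status = next_status(conn, now, opts)
        if new_status is not None:
            conn.status = new_status
            changed.append(conn.instance_id)
    return changed


# Example pass over a mix of connections at a fixed "now".
now = datetime(2026, 2, 9, 12, 0, 0, tzinfo=timezone.utc)
conns = [
    Connection("a", Status.HEALTHY, now - timedelta(seconds=5)),
    Connection("b", Status.HEALTHY, now - timedelta(seconds=20)),
    Connection("c", Status.DEGRADED, now - timedelta(seconds=40)),
    Connection("d", Status.DRAINING, now - timedelta(seconds=60)),
]
print(check_stale_connections(conns, now))  # ['b', 'c']
print([c.status.value for c in conns])  # ['Healthy', 'Degraded', 'Unhealthy', 'Draining']
```

Returning `None` for "no change" corresponds to the test cases in which no update call is expected: Draining connections, already-Unhealthy connections, and Degraded connections that have not yet crossed the stale threshold.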