# Step 23: Metrics & Health Checks

**Phase 6: Observability & Resilience**

**Estimated Complexity:** Medium

**Dependencies:** Step 22 (Logging & Tracing)

---

## Overview

Metrics and health checks provide operational visibility into the router and microservices. Prometheus-compatible metrics expose request rates, latencies, error rates, and connection pool status. Health checks enable load balancers and orchestrators to route traffic appropriately.

---

## Goals

1. Expose Prometheus-compatible metrics
2. Track request/response metrics per endpoint
3. Monitor transport layer health
4. Provide liveness and readiness probes
5. Support custom health check integrations

---

## Metrics Configuration

```csharp
namespace StellaOps.Router.Common;

public class MetricsConfig
{
    /// <summary>Whether to enable metrics collection.</summary>
    public bool Enabled { get; set; } = true;

    /// <summary>Path for metrics endpoint.</summary>
    public string Path { get; set; } = "/metrics";

    /// <summary>Histogram buckets for request duration, in seconds.</summary>
    public double[] DurationBuckets { get; set; } = new[]
    {
        0.001, 0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 10.0
    };

    /// <summary>Labels to include in metrics.</summary>
    public HashSet<string> IncludeLabels { get; set; } = new()
    {
        "method", "path", "status_code", "service"
    };

    /// <summary>Whether to include path in labels (may cause high cardinality).</summary>
    public bool IncludePathLabel { get; set; } = false;

    /// <summary>Maximum unique path labels before aggregating.</summary>
    public int MaxPathCardinality { get; set; } = 100;
}
```

---

## Core Metrics

```csharp
using System.Diagnostics.Metrics;

namespace StellaOps.Router.Common;

/// <summary>
/// Central metrics registry for Stella Router.
/// </summary>
public sealed class StellaMetrics
{
    // Declared first: static field initializers run in textual order, so the
    // Meter must exist before the instruments below are created from it.
    private static readonly Meter Meter = new("StellaOps.Router", "1.0.0");

    // Request metrics
    public static readonly Counter<long> RequestsTotal = Meter.CreateCounter<long>(
        "stella_requests_total",
        description: "Total number of requests processed");

    public static readonly Histogram<double> RequestDuration = Meter.CreateHistogram<double>(
        "stella_request_duration_seconds",
        unit: "s",
        description: "Request processing duration in seconds");

    public static readonly Counter<long> RequestErrors = Meter.CreateCounter<long>(
        "stella_request_errors_total",
        description: "Total number of request errors");

    // Transport metrics
    public static readonly UpDownCounter<int> ActiveConnections = Meter.CreateUpDownCounter<int>(
        "stella_active_connections",
        description: "Number of active transport connections");

    public static readonly Counter<long> ConnectionsTotal = Meter.CreateCounter<long>(
        "stella_connections_total",
        description: "Total number of transport connections");

    public static readonly Counter<long> FramesSent = Meter.CreateCounter<long>(
        "stella_frames_sent_total",
        description: "Total number of frames sent");

    public static readonly Counter<long> FramesReceived = Meter.CreateCounter<long>(
        "stella_frames_received_total",
        description: "Total number of frames received");

    public static readonly Counter<long> BytesSent = Meter.CreateCounter<long>(
        "stella_bytes_sent_total",
        unit: "By",
        description: "Total bytes sent");

    public static readonly Counter<long> BytesReceived = Meter.CreateCounter<long>(
        "stella_bytes_received_total",
        unit: "By",
        description: "Total bytes received");

    // Rate limiting metrics
    public static readonly Counter<long> RateLimitHits = Meter.CreateCounter<long>(
        "stella_rate_limit_hits_total",
        description: "Number of requests that hit rate limits");

    // Gauge<T> requires .NET 9 or later.
    public static readonly Gauge<int> RateLimitBuckets = Meter.CreateGauge<int>(
        "stella_rate_limit_buckets",
        description: "Number of active rate limit buckets");

    // Auth metrics
    public static readonly Counter<long> AuthSuccesses = Meter.CreateCounter<long>(
        "stella_auth_success_total",
        description: "Number of successful authentications");

    public static readonly Counter<long> AuthFailures = Meter.CreateCounter<long>(
        "stella_auth_failures_total",
        description: "Number of failed authentications");

    // Circuit breaker metrics
    public static readonly Gauge<int> CircuitBreakerState = Meter.CreateGauge<int>(
        "stella_circuit_breaker_state",
        description: "Circuit breaker state (0=closed, 1=half-open, 2=open)");
}
```

---

## Request Metrics Middleware

```csharp
using System.Collections.Concurrent;
using System.Diagnostics;
using Microsoft.Extensions.Options;
using StellaOps.Router.Common;

namespace StellaOps.Router.Gateway;

/// <summary>
/// Middleware to collect request metrics.
/// </summary>
public sealed class MetricsMiddleware
{
    private readonly RequestDelegate _next;
    private readonly MetricsConfig _config;
    private readonly PathNormalizer _pathNormalizer;

    public MetricsMiddleware(
        RequestDelegate next,
        IOptions<MetricsConfig> config)
    {
        _next = next;
        _config = config.Value;
        _pathNormalizer = new PathNormalizer(_config.MaxPathCardinality);
    }

    public async Task InvokeAsync(HttpContext context)
    {
        if (!_config.Enabled)
        {
            await _next(context);
            return;
        }

        var sw = Stopwatch.StartNew();
        var method = context.Request.Method;
        var path = _config.IncludePathLabel
            ? _pathNormalizer.Normalize(context.Request.Path)
            : "aggregated";

        try
        {
            await _next(context);
        }
        finally
        {
            // Record in finally so failed requests are measured as well.
            sw.Stop();

            var tags = new TagList
            {
                { "method", method },
                { "status_code", context.Response.StatusCode.ToString() }
            };

            if (_config.IncludePathLabel)
            {
                tags.Add("path", path);
            }

            StellaMetrics.RequestsTotal.Add(1, tags);
            StellaMetrics.RequestDuration.Record(sw.Elapsed.TotalSeconds, tags);

            if (context.Response.StatusCode >= 400)
            {
                StellaMetrics.RequestErrors.Add(1, tags);
            }
        }
    }
}

/// <summary>
/// Normalizes paths to prevent high cardinality.
/// </summary>
internal sealed class PathNormalizer
{
    private readonly int _maxCardinality;
    // Distinct normalized paths admitted as label values so far.
    private readonly ConcurrentDictionary<string, byte> _knownPaths = new();
    private int _uniquePaths;

    public PathNormalizer(int maxCardinality)
    {
        _maxCardinality = maxCardinality;
    }

    public string Normalize(string path)
    {
        // Replace path parameters with placeholders
        var segments = path.Split('/');
        for (int i = 0; i < segments.Length; i++)
        {
            if (Guid.TryParse(segments[i], out _) ||
                int.TryParse(segments[i], out _) ||
                segments[i].Length > 20)
            {
                segments[i] = "{id}";
            }
        }

        var normalized = string.Join("/", segments);

        // Count distinct normalized paths, not raw paths: /users/1 and /users/2
        // share one label value and should consume only one slot.
        if (_knownPaths.ContainsKey(normalized))
            return normalized;

        // Limit reached: collapse every further path into a single bucket.
        // Under concurrency the limit may be overshot by a few entries.
        if (Volatile.Read(ref _uniquePaths) >= _maxCardinality)
            return "other";

        if (_knownPaths.TryAdd(normalized, 0))
            Interlocked.Increment(ref _uniquePaths);

        return normalized;
    }
}
```

---

## Transport Metrics

```csharp
using System.Diagnostics;
using StellaOps.Router.Common;

namespace StellaOps.Router.Transport;

/// <summary>
/// Collects metrics for transport layer operations.
/// </summary>
public sealed class TransportMetricsCollector
{
    public void RecordConnectionOpened(string transport, string serviceName)
    {
        var tags = new TagList
        {
            { "transport", transport },
            { "service", serviceName }
        };

        StellaMetrics.ConnectionsTotal.Add(1, tags);
        StellaMetrics.ActiveConnections.Add(1, tags);
    }

    public void RecordConnectionClosed(string transport, string serviceName)
    {
        var tags = new TagList
        {
            { "transport", transport },
            { "service", serviceName }
        };

        StellaMetrics.ActiveConnections.Add(-1, tags);
    }

    public void RecordFrameSent(string transport, FrameType type, int bytes)
    {
        var tags = new TagList
        {
            { "transport", transport },
            { "frame_type", type.ToString() }
        };

        StellaMetrics.FramesSent.Add(1, tags);
        // Byte counters are tagged by transport only, not by frame type.
        StellaMetrics.BytesSent.Add(bytes, new TagList { { "transport", transport } });
    }

    public void RecordFrameReceived(string transport, FrameType type, int bytes)
    {
        var tags = new TagList
        {
            { "transport", transport },
            { "frame_type", type.ToString() }
        };

        StellaMetrics.FramesReceived.Add(1, tags);
        StellaMetrics.BytesReceived.Add(bytes, new TagList { { "transport", transport } });
    }
}
```

---

## Health Check System

```csharp
using System.Diagnostics;
using Microsoft.Extensions.Logging;

namespace StellaOps.Router.Common;

/// <summary>
/// Health check result.
/// </summary>
// Declared as a record because HealthCheckService copies it with a `with` expression.
public sealed record HealthCheckResult
{
    public HealthStatus Status { get; init; }
    public string? Description { get; init; }
    public TimeSpan Duration { get; init; }
    public IReadOnlyDictionary<string, object>? Data { get; init; }
    public Exception? Exception { get; init; }
}

// Ordered by severity: the overall status is the worst status of any check.
public enum HealthStatus
{
    Healthy,
    Degraded,
    Unhealthy
}

/// <summary>
/// Health check interface.
/// </summary>
public interface IHealthCheck
{
    string Name { get; }
    Task<HealthCheckResult> CheckAsync(CancellationToken cancellationToken);
}

/// <summary>
/// Aggregates multiple health checks.
/// </summary>
public sealed class HealthCheckService
{
    private readonly IEnumerable<IHealthCheck> _checks;
    private readonly ILogger<HealthCheckService> _logger;

    public HealthCheckService(
        IEnumerable<IHealthCheck> checks,
        ILogger<HealthCheckService> logger)
    {
        _checks = checks;
        _logger = logger;
    }

    public async Task<HealthReport> CheckHealthAsync(CancellationToken cancellationToken)
    {
        var results = new Dictionary<string, HealthCheckResult>();
        var overallStatus = HealthStatus.Healthy;

        foreach (var check in _checks)
        {
            var sw = Stopwatch.StartNew();
            try
            {
                var result = await check.CheckAsync(cancellationToken);
                result = result with { Duration = sw.Elapsed };
                results[check.Name] = result;

                if (result.Status > overallStatus)
                {
                    overallStatus = result.Status;
                }
            }
            catch (Exception ex)
            {
                // A check that throws is reported as Unhealthy rather than failing the report.
                _logger.LogWarning(ex, "Health check {Name} failed", check.Name);
                results[check.Name] = new HealthCheckResult
                {
                    Status = HealthStatus.Unhealthy,
                    Description = ex.Message,
                    Duration = sw.Elapsed,
                    Exception = ex
                };
                overallStatus = HealthStatus.Unhealthy;
            }
        }

        return new HealthReport
        {
            Status = overallStatus,
            Checks = results,
            TotalDuration = results.Values.Sum(r => r.Duration.TotalMilliseconds)
        };
    }
}

public sealed class HealthReport
{
    public HealthStatus Status { get; init; }
    public IReadOnlyDictionary<string, HealthCheckResult> Checks { get; init; } =
        new Dictionary<string, HealthCheckResult>();

    /// <summary>Sum of the individual check durations, in milliseconds.</summary>
    public double TotalDuration { get; init; }
}
```

---

## Built-in Health Checks

```csharp
using Microsoft.Extensions.Options;
using StellaOps.Router.Common;

namespace StellaOps.Router.Gateway;

/// <summary>
/// Checks
/// that at least one transport connection is active.
/// </summary>
public sealed class TransportHealthCheck : IHealthCheck
{
    private readonly IGlobalRoutingState _routingState;

    public string Name => "transport";

    public TransportHealthCheck(IGlobalRoutingState routingState)
    {
        _routingState = routingState;
    }

    public Task<HealthCheckResult> CheckAsync(CancellationToken cancellationToken)
    {
        var connections = _routingState.GetAllConnections();
        var activeCount = connections.Count(c => c.State == ConnectionState.Connected);

        if (activeCount == 0)
        {
            return Task.FromResult(new HealthCheckResult
            {
                Status = HealthStatus.Unhealthy,
                Description = "No active transport connections",
                Data = new Dictionary<string, object> { ["connections"] = 0 }
            });
        }

        return Task.FromResult(new HealthCheckResult
        {
            Status = HealthStatus.Healthy,
            Description = $"{activeCount} active connections",
            Data = new Dictionary<string, object> { ["connections"] = activeCount }
        });
    }
}

/// <summary>
/// Checks Authority service connectivity.
/// </summary>
public sealed class AuthorityHealthCheck : IHealthCheck
{
    private readonly IAuthorityClient _authority;
    private readonly TimeSpan _timeout;

    public string Name => "authority";

    // AuthorityConfig is an assumed name for the options type that exposes
    // HealthCheckTimeout; substitute the actual Authority options class.
    public AuthorityHealthCheck(
        IAuthorityClient authority,
        IOptions<AuthorityConfig> config)
    {
        _authority = authority;
        _timeout = config.Value.HealthCheckTimeout;
    }

    public async Task<HealthCheckResult> CheckAsync(CancellationToken cancellationToken)
    {
        try
        {
            // Bound the probe by the configured timeout as well as the caller's token.
            using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
            cts.CancelAfter(_timeout);

            var isHealthy = await _authority.CheckHealthAsync(cts.Token);

            return new HealthCheckResult
            {
                Status = isHealthy ? HealthStatus.Healthy : HealthStatus.Degraded,
                Description = isHealthy ? "Authority is responsive" : "Authority returned unhealthy"
            };
        }
        catch (Exception ex)
        {
            return new HealthCheckResult
            {
                Status = HealthStatus.Degraded, // Degraded, not unhealthy - gateway can still work
                Description = $"Authority unreachable: {ex.Message}",
                Exception = ex
            };
        }
    }
}

/// <summary>
/// Checks rate limiter backend connectivity.
/// </summary>
public sealed class RateLimiterHealthCheck : IHealthCheck
{
    private readonly IRateLimiter _rateLimiter;

    public string Name => "rate_limiter";

    public RateLimiterHealthCheck(IRateLimiter rateLimiter)
    {
        _rateLimiter = rateLimiter;
    }

    public async Task<HealthCheckResult> CheckAsync(CancellationToken cancellationToken)
    {
        try
        {
            // Try a simple operation against a dedicated health-check key
            await _rateLimiter.CheckLimitAsync(
                new RateLimitContext { Key = "__health_check__", Tier = RateLimitTier.Free },
                cancellationToken);

            return new HealthCheckResult
            {
                Status = HealthStatus.Healthy,
                Description = "Rate limiter is responsive"
            };
        }
        catch (Exception ex)
        {
            return new HealthCheckResult
            {
                Status = HealthStatus.Degraded,
                Description = $"Rate limiter error: {ex.Message}",
                Exception = ex
            };
        }
    }
}
```

---

## Health Endpoints

```csharp
using StellaOps.Router.Common;

namespace StellaOps.Router.Gateway;

/// <summary>
/// Health check endpoints.
/// </summary>
public static class HealthEndpoints
{
    public static IEndpointRouteBuilder MapHealthEndpoints(
        this IEndpointRouteBuilder endpoints,
        string basePath = "/health")
    {
        endpoints.MapGet(basePath + "/live", LivenessCheck);
        endpoints.MapGet(basePath + "/ready", ReadinessCheck);
        endpoints.MapGet(basePath, DetailedHealthCheck);

        return endpoints;
    }

    /// <summary>
    /// Liveness probe - is the process running?
    /// </summary>
    private static IResult LivenessCheck()
    {
        return Results.Ok(new { status = "alive" });
    }

    /// <summary>
    /// Readiness probe - can the service accept traffic?
    /// </summary>
    private static async Task<IResult> ReadinessCheck(
        HealthCheckService healthService,
        CancellationToken cancellationToken)
    {
        var report = await healthService.CheckHealthAsync(cancellationToken);

        // Only Unhealthy takes the instance out of rotation; Degraded still reports ready.
        return report.Status == HealthStatus.Unhealthy
            ? Results.Json(new
            {
                status = "not_ready",
                checks = report.Checks.ToDictionary(c => c.Key, c => c.Value.Status.ToString())
            }, statusCode: 503)
            : Results.Ok(new { status = "ready" });
    }

    /// <summary>
    /// Detailed health report.
    /// </summary>
    private static async Task<IResult> DetailedHealthCheck(
        HealthCheckService healthService,
        CancellationToken cancellationToken)
    {
        var report = await healthService.CheckHealthAsync(cancellationToken);

        var response = new
        {
            status = report.Status.ToString().ToLowerInvariant(),
            totalDuration = $"{report.TotalDuration:F2}ms",
            checks = report.Checks.ToDictionary(c => c.Key, c => new
            {
                status = c.Value.Status.ToString().ToLowerInvariant(),
                description = c.Value.Description,
                duration = $"{c.Value.Duration.TotalMilliseconds:F2}ms",
                data = c.Value.Data
            })
        };

        var statusCode = report.Status switch
        {
            HealthStatus.Healthy => 200,
            HealthStatus.Degraded => 200, // Still return 200 for degraded
            HealthStatus.Unhealthy => 503,
            _ => 200
        };

        return Results.Json(response, statusCode: statusCode);
    }
}
```

---

## Prometheus Metrics Endpoint

```csharp
using System.Text;
using OpenTelemetry.Metrics;

namespace StellaOps.Router.Gateway;

/// <summary>
/// Exposes metrics in Prometheus format.
/// </summary>
public sealed class PrometheusMetricsEndpoint
{
    public static void Map(IEndpointRouteBuilder endpoints, string path = "/metrics")
    {
        endpoints.MapGet(path, async (HttpContext context) =>
        {
            var exporter = context.RequestServices.GetRequiredService<PrometheusExporter>();
            var metrics = await exporter.ExportAsync();

            context.Response.ContentType = "text/plain; version=0.0.4";
            await context.Response.WriteAsync(metrics);
        });
    }
}

public sealed class PrometheusExporter
{
    private readonly MeterProvider _meterProvider;

    public PrometheusExporter(MeterProvider meterProvider)
    {
        _meterProvider = meterProvider;
    }

    public Task<string> ExportAsync()
    {
        // Use OpenTelemetry's Prometheus exporter
        // This is a simplified example that returns an empty payload
        var sb = new StringBuilder();

        // Export would iterate over all registered metrics
        // Real implementation uses OpenTelemetry.Exporter.Prometheus

        return Task.FromResult(sb.ToString());
    }
}
```

---

## Service Registration

```csharp
using OpenTelemetry;
using OpenTelemetry.Metrics;
using StellaOps.Router.Common;

namespace StellaOps.Router.Gateway;

public static class MetricsExtensions
{
    public static IServiceCollection AddStellaMetrics(
        this IServiceCollection services,
        IConfiguration configuration)
    {
        services.Configure<MetricsConfig>(configuration.GetSection("Metrics"));

        services.AddOpenTelemetry()
            .WithMetrics(builder =>
            {
                builder
                    .AddMeter("StellaOps.Router") // must match the Meter name in StellaMetrics
                    .AddAspNetCoreInstrumentation()
                    .AddPrometheusExporter();
            });

        return services;
    }

    public static IServiceCollection AddStellaHealthChecks(
        this IServiceCollection services)
    {
        // Each check is registered as IHealthCheck so that HealthCheckService
        // receives all of them through IEnumerable<IHealthCheck>.
        services.AddSingleton<IHealthCheck, TransportHealthCheck>();
        services.AddSingleton<IHealthCheck, AuthorityHealthCheck>();
        services.AddSingleton<IHealthCheck, RateLimiterHealthCheck>();
        services.AddSingleton<HealthCheckService>();

        return services;
    }
}
```

---

## YAML Configuration

```yaml
Metrics:
  Enabled: true
  Path: "/metrics"
  IncludePathLabel: false
  MaxPathCardinality: 100
  DurationBuckets:
    - 0.005
    - 0.01
    - 0.025
    - 0.05
    - 0.1
    - 0.25
    - 0.5
    - 1
    - 2.5
    - 5
    - 10

HealthChecks:
  Enabled: true
  Path: "/health"
  CacheDuration: "00:00:05"
```

---

## Deliverables

1. `StellaOps.Router.Common/StellaMetrics.cs`
2. `StellaOps.Router.Gateway/MetricsMiddleware.cs`
3. `StellaOps.Router.Transport/TransportMetricsCollector.cs`
4. `StellaOps.Router.Common/HealthCheckService.cs`
5. `StellaOps.Router.Gateway/TransportHealthCheck.cs`
6. `StellaOps.Router.Gateway/AuthorityHealthCheck.cs`
7. `StellaOps.Router.Gateway/HealthEndpoints.cs`
8. `StellaOps.Router.Gateway/PrometheusMetricsEndpoint.cs`
9. Metrics collection tests
10. Health check tests

---

## Next Step

Proceed to [Step 24: Circuit Breaker & Retry Policies](24-Step.md) to implement resilience patterns.
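
---

## Example: Orchestrator Probes

The liveness and readiness endpoints described in this step are meant to be polled by an orchestrator. As a rough sketch, a Kubernetes container spec could point its probes at them as shown below. The container name, image, port, and timing values are placeholders chosen for illustration and are not part of this step; only the `/health/live` and `/health/ready` paths come from `MapHealthEndpoints` with its default base path.

```yaml
# Hypothetical container spec fragment: name, image, port, and timings are assumptions.
containers:
  - name: stella-router
    image: example/stella-router:latest
    ports:
      - containerPort: 8080
    # Restart the container if the process stops answering.
    livenessProbe:
      httpGet:
        path: /health/live
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
    # Stop routing traffic to the pod while /health/ready returns 503.
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 8080
      periodSeconds: 5
      failureThreshold: 2
```

Because `/health/ready` returns 503 only when the overall status is `Unhealthy`, a `Degraded` Authority or rate limiter keeps the pod in rotation, while losing every transport connection takes it out.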