# Step 23: Metrics & Health Checks
**Phase 6: Observability & Resilience**
**Estimated Complexity:** Medium
**Dependencies:** Step 22 (Logging & Tracing)
---
## Overview
Metrics and health checks provide operational visibility into the router and microservices. Prometheus-compatible metrics expose request rates, latencies, error rates, and connection pool status. Health checks enable load balancers and orchestrators to route traffic appropriately.
---
## Goals
1. Expose Prometheus-compatible metrics
2. Track request/response metrics per endpoint
3. Monitor transport layer health
4. Provide liveness and readiness probes
5. Support custom health check integrations
---
## Metrics Configuration
```csharp
namespace StellaOps.Router.Common;
public class MetricsConfig
{
/// Whether to enable metrics collection.
public bool Enabled { get; set; } = true;
/// Path for metrics endpoint.
public string Path { get; set; } = "/metrics";
/// Histogram buckets for request duration.
public double[] DurationBuckets { get; set; } = new[]
{
0.001, 0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 10.0
};
/// Labels to include in metrics.
public HashSet<string> IncludeLabels { get; set; } = new()
{
"method", "path", "status_code", "service"
};
/// Whether to include path in labels (may cause high cardinality).
public bool IncludePathLabel { get; set; } = false;
/// Maximum unique path labels before aggregating.
public int MaxPathCardinality { get; set; } = 100;
}
```
---
## Core Metrics
```csharp
using System.Diagnostics.Metrics;

namespace StellaOps.Router.Common;

/// <summary>
/// Central metrics registry for Stella Router.
/// </summary>
public static class StellaMetrics
{
    // Declared first: static field initializers run in textual order, so the
    // Meter must exist before any instrument is created from it.
    private static readonly Meter Meter = new("StellaOps.Router", "1.0.0");

    // The type arguments below (long, double, int) are assumptions; adjust them
    // to match the values actually recorded.

    // Request metrics
    public static readonly Counter<long> RequestsTotal = Meter.CreateCounter<long>(
        "stella_requests_total",
        description: "Total number of requests processed");
    public static readonly Histogram<double> RequestDuration = Meter.CreateHistogram<double>(
        "stella_request_duration_seconds",
        unit: "s",
        description: "Request processing duration in seconds");
    public static readonly Counter<long> RequestErrors = Meter.CreateCounter<long>(
        "stella_request_errors_total",
        description: "Total number of request errors");

    // Transport metrics
    public static readonly UpDownCounter<long> ActiveConnections = Meter.CreateUpDownCounter<long>(
        "stella_active_connections",
        description: "Number of active transport connections");
    public static readonly Counter<long> ConnectionsTotal = Meter.CreateCounter<long>(
        "stella_connections_total",
        description: "Total number of transport connections");
    public static readonly Counter<long> FramesSent = Meter.CreateCounter<long>(
        "stella_frames_sent_total",
        description: "Total number of frames sent");
    public static readonly Counter<long> FramesReceived = Meter.CreateCounter<long>(
        "stella_frames_received_total",
        description: "Total number of frames received");
    public static readonly Counter<long> BytesSent = Meter.CreateCounter<long>(
        "stella_bytes_sent_total",
        unit: "By",
        description: "Total bytes sent");
    public static readonly Counter<long> BytesReceived = Meter.CreateCounter<long>(
        "stella_bytes_received_total",
        unit: "By",
        description: "Total bytes received");

    // Rate limiting metrics
    public static readonly Counter<long> RateLimitHits = Meter.CreateCounter<long>(
        "stella_rate_limit_hits_total",
        description: "Number of requests that hit rate limits");
    // Gauge<T> requires .NET 9 or later; on earlier targets use CreateObservableGauge.
    public static readonly Gauge<int> RateLimitBuckets = Meter.CreateGauge<int>(
        "stella_rate_limit_buckets",
        description: "Number of active rate limit buckets");

    // Auth metrics
    public static readonly Counter<long> AuthSuccesses = Meter.CreateCounter<long>(
        "stella_auth_success_total",
        description: "Number of successful authentications");
    public static readonly Counter<long> AuthFailures = Meter.CreateCounter<long>(
        "stella_auth_failures_total",
        description: "Number of failed authentications");

    // Circuit breaker metrics
    public static readonly Gauge<int> CircuitBreakerState = Meter.CreateGauge<int>(
        "stella_circuit_breaker_state",
        description: "Circuit breaker state (0=closed, 1=half-open, 2=open)");
}
```
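The instruments above can be observed in-process, which is one way the "Metrics collection tests" deliverable could be approached. The following is a minimal sketch using `MeterListener` from `System.Diagnostics.Metrics`; the meter name `Demo.Router`, the instrument names, and the `MeterListenerSketch` class are hypothetical and exist only so the example stands alone.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public static class MeterListenerSketch
{
    // Runs `act` while listening to long-valued instruments of the named meter
    // and returns the sum of all recorded measurements per instrument name.
    public static Dictionary<string, long> Capture(string meterName, Action act)
    {
        var totals = new Dictionary<string, long>();
        using var listener = new MeterListener();

        // Subscribe only to instruments that belong to the requested meter.
        listener.InstrumentPublished = (instrument, l) =>
        {
            if (instrument.Meter.Name == meterName)
                l.EnableMeasurementEvents(instrument);
        };

        // Called synchronously on every Add/Record of a long-valued instrument.
        listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
        {
            lock (totals)
            {
                totals.TryGetValue(instrument.Name, out var current);
                totals[instrument.Name] = current + value;
            }
        });

        listener.Start();
        act();
        return totals;
    }

    // Hypothetical demo meter; the real registry would be StellaMetrics.
    public static Dictionary<string, long> Demo()
    {
        using var meter = new Meter("Demo.Router", "1.0.0");
        var requests = meter.CreateCounter<long>("demo_requests_total");
        var active = meter.CreateUpDownCounter<long>("demo_active_connections");

        return Capture("Demo.Router", () =>
        {
            requests.Add(1);
            requests.Add(2);
            active.Add(1);
            active.Add(1);
            active.Add(-1);
        });
    }
}
```

Calling `MeterListenerSketch.Demo()` yields a total of 3 for `demo_requests_total` and 1 for `demo_active_connections`. A test against the real registry would pass `"StellaOps.Router"` as the meter name and exercise the middleware or collector inside the callback.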
---
## Request Metrics Middleware
```csharp
namespace StellaOps.Router.Gateway;
/// <summary>
/// Middleware to collect request metrics.
/// </summary>
public sealed class MetricsMiddleware
{
private readonly RequestDelegate _next;
private readonly MetricsConfig _config;
private readonly PathNormalizer _pathNormalizer;
public MetricsMiddleware(
RequestDelegate next,
IOptions<MetricsConfig> config)
{
_next = next;
_config = config.Value;
_pathNormalizer = new PathNormalizer(_config.MaxPathCardinality);
}
public async Task InvokeAsync(HttpContext context)
{
if (!_config.Enabled)
{
await _next(context);
return;
}
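var sw = Stopwatch.StartNew();
var method = context.Request.Method;
var path = _config.IncludePathLabel
? _pathNormalizer.Normalize(context.Request.Path)
: "aggregated";
var failed = false;
try
{
await _next(context);
}
catch
{
// Remember the failure so the finally block can label it; the exception still propagates.
failed = true;
throw;
}
finally
{
sw.Stop();
// When an exception escapes, StatusCode may still hold its default (200), so report 500 instead.
var statusCode = failed && !context.Response.HasStarted
? StatusCodes.Status500InternalServerError
: context.Response.StatusCode;
var tags = new TagList
{
{ "method", method },
{ "status_code", statusCode.ToString() }
};
if (_config.IncludePathLabel)
{
tags.Add("path", path);
}
StellaMetrics.RequestsTotal.Add(1, tags);
StellaMetrics.RequestDuration.Record(sw.Elapsed.TotalSeconds, tags);
if (statusCode >= 400)
{
StellaMetrics.RequestErrors.Add(1, tags);
}
}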
}
}
/// <summary>
/// Normalizes paths to prevent high cardinality.
/// </summary>
internal sealed class PathNormalizer
{
private readonly int _maxCardinality;
// Distinct normalized templates emitted so far; the byte value is unused.
private readonly ConcurrentDictionary<string, byte> _knownTemplates = new();
private int _templateCount;
public PathNormalizer(int maxCardinality)
{
_maxCardinality = maxCardinality;
}
public string Normalize(string path)
{
// Replace path parameters with placeholders
var segments = path.Split('/');
for (int i = 0; i < segments.Length; i++)
{
if (Guid.TryParse(segments[i], out _) ||
long.TryParse(segments[i], out _) ||
segments[i].Length > 20)
{
segments[i] = "{id}";
}
}
var normalized = string.Join("/", segments);
if (_knownTemplates.ContainsKey(normalized))
return normalized;
// Count distinct templates rather than raw paths, so /users/1 and /users/2
// consume a single slot. Concurrent first sightings may overshoot the cap slightly.
if (Volatile.Read(ref _templateCount) >= _maxCardinality)
return "other";
if (_knownTemplates.TryAdd(normalized, 0))
Interlocked.Increment(ref _templateCount);
return normalized;
}
}
```
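To make the segment-replacement rule concrete, here is a small standalone sketch of just that step, without the cardinality cap. `PathTemplateSketch` and `ToTemplate` are hypothetical names introduced for illustration; the rule itself (GUID, integer, or longer than 20 characters) is taken from `PathNormalizer` above.

```csharp
using System;

public static class PathTemplateSketch
{
    // Replaces segments that look like identifiers with the "{id}" placeholder.
    // A segment qualifies if it parses as a GUID, parses as an integer,
    // or is longer than 20 characters.
    public static string ToTemplate(string path)
    {
        var segments = path.Split('/');
        for (var i = 0; i < segments.Length; i++)
        {
            if (Guid.TryParse(segments[i], out _) ||
                int.TryParse(segments[i], out _) ||
                segments[i].Length > 20)
            {
                segments[i] = "{id}";
            }
        }
        return string.Join("/", segments);
    }
}
```

Under these rules `/users/42/orders` becomes `/users/{id}/orders`, `/items/3f2504e0-4f89-11d3-9a0c-0305e82c3301` becomes `/items/{id}`, and `/health/ready` is left unchanged. The length heuristic also collapses long static segments: `/reports/quarterly-revenue-summary` becomes `/reports/{id}`, which may or may not be desirable depending on the route table.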
---
## Transport Metrics
```csharp
namespace StellaOps.Router.Transport;
/// <summary>
/// Collects metrics for transport layer operations.
/// </summary>
public sealed class TransportMetricsCollector
{
public void RecordConnectionOpened(string transport, string serviceName)
{
var tags = new TagList
{
{ "transport", transport },
{ "service", serviceName }
};
StellaMetrics.ConnectionsTotal.Add(1, tags);
StellaMetrics.ActiveConnections.Add(1, tags);
}
public void RecordConnectionClosed(string transport, string serviceName)
{
var tags = new TagList
{
{ "transport", transport },
{ "service", serviceName }
};
StellaMetrics.ActiveConnections.Add(-1, tags);
}
public void RecordFrameSent(string transport, FrameType type, int bytes)
{
var tags = new TagList
{
{ "transport", transport },
{ "frame_type", type.ToString() }
};
StellaMetrics.FramesSent.Add(1, tags);
StellaMetrics.BytesSent.Add(bytes, new TagList { { "transport", transport } });
}
public void RecordFrameReceived(string transport, FrameType type, int bytes)
{
var tags = new TagList
{
{ "transport", transport },
{ "frame_type", type.ToString() }
};
StellaMetrics.FramesReceived.Add(1, tags);
StellaMetrics.BytesReceived.Add(bytes, new TagList { { "transport", transport } });
}
}
```
---
## Health Check System
```csharp
namespace StellaOps.Router.Common;
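// Declared as a record so that HealthCheckService can copy it using a `with` expression.
/// <summary>
/// Health check result.
/// </summary>
public sealed record HealthCheckResult
{
public HealthStatus Status { get; init; }
public string? Description { get; init; }
public TimeSpan Duration { get; init; }
public IReadOnlyDictionary<string, object>? Data { get; init; }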
public Exception? Exception { get; init; }
}
public enum HealthStatus
{
Healthy,
Degraded,
Unhealthy
}
/// <summary>
/// Health check interface.
/// </summary>
public interface IHealthCheck
{
string Name { get; }
Task<HealthCheckResult> CheckAsync(CancellationToken cancellationToken);
}
/// <summary>
/// Aggregates multiple health checks.
/// </summary>
public sealed class HealthCheckService
{
private readonly IEnumerable<IHealthCheck> _checks;
private readonly ILogger<HealthCheckService> _logger;
public HealthCheckService(
IEnumerable<IHealthCheck> checks,
ILogger<HealthCheckService> logger)
{
_checks = checks;
_logger = logger;
}
public async Task<HealthReport> CheckHealthAsync(CancellationToken cancellationToken)
{
var results = new Dictionary<string, HealthCheckResult>();
var overallStatus = HealthStatus.Healthy;
foreach (var check in _checks)
{
var sw = Stopwatch.StartNew();
try
{
var result = await check.CheckAsync(cancellationToken);
result = result with { Duration = sw.Elapsed };
results[check.Name] = result;
if (result.Status > overallStatus)
{
overallStatus = result.Status;
}
}
catch (Exception ex)
{
_logger.LogWarning(ex, "Health check {Name} failed", check.Name);
results[check.Name] = new HealthCheckResult
{
Status = HealthStatus.Unhealthy,
Description = ex.Message,
Duration = sw.Elapsed,
Exception = ex
};
overallStatus = HealthStatus.Unhealthy;
}
}
return new HealthReport
{
Status = overallStatus,
Checks = results,
TotalDuration = results.Values.Sum(r => r.Duration.TotalMilliseconds)
};
}
}
public sealed class HealthReport
{
public HealthStatus Status { get; init; }
public IReadOnlyDictionary<string, HealthCheckResult> Checks { get; init; } = new Dictionary<string, HealthCheckResult>();
public double TotalDuration { get; init; }
}
```
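The aggregation in `CheckHealthAsync` relies on the declaration order of `HealthStatus` (`Healthy` < `Degraded` < `Unhealthy`), so the overall status is simply the worst individual status. A minimal sketch of that rule in isolation follows; `ProbeStatus` is a hypothetical stand-in that mirrors `HealthStatus` so the example compiles on its own.

```csharp
using System.Collections.Generic;
using System.Linq;

// Stand-in for HealthStatus; the member order defines severity.
public enum ProbeStatus
{
    Healthy,
    Degraded,
    Unhealthy
}

public static class WorstStatusSketch
{
    // Returns the most severe status, or Healthy when no checks are registered.
    public static ProbeStatus Aggregate(IEnumerable<ProbeStatus> statuses) =>
        statuses.DefaultIfEmpty(ProbeStatus.Healthy).Max();
}
```

With this rule, one `Degraded` check among otherwise healthy ones yields `Degraded`, and a single `Unhealthy` check dominates everything else. Reordering the enum members would silently change the outcome, so the order is part of the contract.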
---
## Built-in Health Checks
```csharp
namespace StellaOps.Router.Gateway;
/// <summary>
/// Checks that at least one transport connection is active.
/// </summary>
public sealed class TransportHealthCheck : IHealthCheck
{
private readonly IGlobalRoutingState _routingState;
public string Name => "transport";
public TransportHealthCheck(IGlobalRoutingState routingState)
{
_routingState = routingState;
}
public Task<HealthCheckResult> CheckAsync(CancellationToken cancellationToken)
{
var connections = _routingState.GetAllConnections();
var activeCount = connections.Count(c => c.State == ConnectionState.Connected);
if (activeCount == 0)
{
return Task.FromResult(new HealthCheckResult
{
Status = HealthStatus.Unhealthy,
Description = "No active transport connections",
Data = new Dictionary<string, object> { ["connections"] = 0 }
});
}
return Task.FromResult(new HealthCheckResult
{
Status = HealthStatus.Healthy,
Description = $"{activeCount} active connections",
Data = new Dictionary<string, object> { ["connections"] = activeCount }
});
}
}
/// <summary>
/// Checks Authority service connectivity.
/// </summary>
public sealed class AuthorityHealthCheck : IHealthCheck
{
private readonly IAuthorityClient _authority;
private readonly TimeSpan _timeout;
public string Name => "authority";
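public AuthorityHealthCheck(
IAuthorityClient authority,
// AuthorityConfig is an assumed name for the options type that exposes HealthCheckTimeout.
IOptions<AuthorityConfig> config)
{
_authority = authority;
_timeout = config.Value.HealthCheckTimeout;
}
public async Task<HealthCheckResult> CheckAsync(CancellationToken cancellationToken)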
{
try
{
using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
cts.CancelAfter(_timeout);
var isHealthy = await _authority.CheckHealthAsync(cts.Token);
return new HealthCheckResult
{
Status = isHealthy ? HealthStatus.Healthy : HealthStatus.Degraded,
Description = isHealthy ? "Authority is responsive" : "Authority returned unhealthy"
};
}
catch (Exception ex)
{
return new HealthCheckResult
{
Status = HealthStatus.Degraded, // Degraded, not unhealthy - gateway can still work
Description = $"Authority unreachable: {ex.Message}",
Exception = ex
};
}
}
}
/// <summary>
/// Checks rate limiter backend connectivity.
/// </summary>
public sealed class RateLimiterHealthCheck : IHealthCheck
{
private readonly IRateLimiter _rateLimiter;
public string Name => "rate_limiter";
public RateLimiterHealthCheck(IRateLimiter rateLimiter)
{
_rateLimiter = rateLimiter;
}
public async Task<HealthCheckResult> CheckAsync(CancellationToken cancellationToken)
{
try
{
// Try a simple operation
await _rateLimiter.CheckLimitAsync(
new RateLimitContext { Key = "__health_check__", Tier = RateLimitTier.Free },
cancellationToken);
return new HealthCheckResult
{
Status = HealthStatus.Healthy,
Description = "Rate limiter is responsive"
};
}
catch (Exception ex)
{
return new HealthCheckResult
{
Status = HealthStatus.Degraded,
Description = $"Rate limiter error: {ex.Message}",
Exception = ex
};
}
}
}
```
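Custom integrations (goal 5) follow the same shape as the built-in checks: implement `IHealthCheck` and register it. As a hedged illustration, the sketch below wraps an arbitrary probe delegate. `DelegateHealthCheck` is a hypothetical helper, and the enum, record, and interface are reduced stand-ins for the types from the Health Check System section, redeclared in a separate namespace so the example compiles on its own.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

namespace CustomHealthCheckSketch
{
    // Reduced stand-ins for the types defined in the Health Check System section.
    public enum HealthStatus { Healthy, Degraded, Unhealthy }

    public sealed record HealthCheckResult
    {
        public HealthStatus Status { get; init; }
        public string Description { get; init; } = "";
    }

    public interface IHealthCheck
    {
        string Name { get; }
        Task<HealthCheckResult> CheckAsync(CancellationToken cancellationToken);
    }

    // Adapts a boolean probe into a health check. The caller chooses whether a
    // failure marks the gateway Degraded or Unhealthy.
    public sealed class DelegateHealthCheck : IHealthCheck
    {
        private readonly Func<CancellationToken, Task<bool>> _probe;
        private readonly HealthStatus _failureStatus;

        public string Name { get; }

        public DelegateHealthCheck(
            string name,
            Func<CancellationToken, Task<bool>> probe,
            HealthStatus failureStatus = HealthStatus.Unhealthy)
        {
            Name = name;
            _probe = probe;
            _failureStatus = failureStatus;
        }

        public async Task<HealthCheckResult> CheckAsync(CancellationToken cancellationToken)
        {
            try
            {
                var ok = await _probe(cancellationToken);
                return new HealthCheckResult
                {
                    Status = ok ? HealthStatus.Healthy : _failureStatus,
                    Description = ok ? $"{Name} is responsive" : $"{Name} reported a failure"
                };
            }
            catch (Exception ex) when (ex is not OperationCanceledException)
            {
                // Probe errors are reported as a result; cancellation propagates to the caller.
                return new HealthCheckResult
                {
                    Status = _failureStatus,
                    Description = $"{Name} probe threw: {ex.Message}"
                };
            }
        }
    }
}
```

In the gateway, such a check would be registered as another `IHealthCheck` singleton so that `HealthCheckService` picks it up through its `IEnumerable<IHealthCheck>` dependency. Choosing `Degraded` as the failure status keeps the readiness probe passing, matching how the Authority and rate limiter checks behave.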
---
## Health Endpoints
```csharp
namespace StellaOps.Router.Gateway;
///
/// Health check endpoints.
///
public static class HealthEndpoints
{
public static IEndpointRouteBuilder MapHealthEndpoints(
this IEndpointRouteBuilder endpoints,
string basePath = "/health")
{
endpoints.MapGet(basePath + "/live", LivenessCheck);
endpoints.MapGet(basePath + "/ready", ReadinessCheck);
endpoints.MapGet(basePath, DetailedHealthCheck);
return endpoints;
}
/// <summary>
/// Liveness probe - is the process running?
/// </summary>
private static IResult LivenessCheck()
{
return Results.Ok(new { status = "alive" });
}
/// <summary>
/// Readiness probe - can the service accept traffic?
/// </summary>
private static async Task<IResult> ReadinessCheck(
HealthCheckService healthService,
CancellationToken cancellationToken)
{
var report = await healthService.CheckHealthAsync(cancellationToken);
return report.Status == HealthStatus.Unhealthy
? Results.Json(new
{
status = "not_ready",
checks = report.Checks.ToDictionary(c => c.Key, c => c.Value.Status.ToString())
}, statusCode: 503)
: Results.Ok(new { status = "ready" });
}
/// <summary>
/// Detailed health report.
/// </summary>
private static async Task<IResult> DetailedHealthCheck(
HealthCheckService healthService,
CancellationToken cancellationToken)
{
var report = await healthService.CheckHealthAsync(cancellationToken);
var response = new
{
status = report.Status.ToString().ToLower(),
totalDuration = $"{report.TotalDuration:F2}ms",
checks = report.Checks.ToDictionary(c => c.Key, c => new
{
status = c.Value.Status.ToString().ToLower(),
description = c.Value.Description,
duration = $"{c.Value.Duration.TotalMilliseconds:F2}ms",
data = c.Value.Data
})
};
var statusCode = report.Status switch
{
HealthStatus.Healthy => 200,
HealthStatus.Degraded => 200, // Still return 200 for degraded
HealthStatus.Unhealthy => 503,
_ => 200
};
return Results.Json(response, statusCode: statusCode);
}
}
```
---
## Prometheus Metrics Endpoint
```csharp
namespace StellaOps.Router.Gateway;
/// <summary>
/// Exposes metrics in Prometheus format.
/// </summary>
public sealed class PrometheusMetricsEndpoint
{
public static void Map(IEndpointRouteBuilder endpoints, string path = "/metrics")
{
endpoints.MapGet(path, async (HttpContext context) =>
{
var exporter = context.RequestServices.GetRequiredService<PrometheusExporter>();
var metrics = await exporter.ExportAsync();
context.Response.ContentType = "text/plain; version=0.0.4";
await context.Response.WriteAsync(metrics);
});
}
}
public sealed class PrometheusExporter
{
private readonly MeterProvider _meterProvider;
public PrometheusExporter(MeterProvider meterProvider)
{
_meterProvider = meterProvider;
}
public Task<string> ExportAsync()
{
// Use OpenTelemetry's Prometheus exporter
// This is a simplified example
var sb = new StringBuilder();
// Export would iterate over all registered metrics
// Real implementation uses OpenTelemetry.Exporter.Prometheus
return Task.FromResult(sb.ToString());
}
}
```
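Since `ExportAsync` above is only a stub, it may help to see what the response body is expected to look like. The sketch below renders one counter in the Prometheus text exposition format (`# HELP`, `# TYPE`, then one line per label set). `PrometheusTextSketch` and `RenderCounter` are hypothetical names; a real deployment would normally rely on the OpenTelemetry Prometheus exporter rather than hand-written formatting.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text;

public static class PrometheusTextSketch
{
    // Label values escape backslash, double quote, and line feed.
    private static string EscapeLabel(string value) =>
        value.Replace("\\", "\\\\").Replace("\"", "\\\"").Replace("\n", "\\n");

    // HELP text escapes backslash and line feed only.
    private static string EscapeHelp(string value) =>
        value.Replace("\\", "\\\\").Replace("\n", "\\n");

    // Renders a single counter family with its samples, one line per label set.
    public static string RenderCounter(
        string name,
        string help,
        IEnumerable<(IReadOnlyList<KeyValuePair<string, string>> Labels, double Value)> samples)
    {
        var sb = new StringBuilder();
        sb.Append("# HELP ").Append(name).Append(' ').Append(EscapeHelp(help)).Append('\n');
        sb.Append("# TYPE ").Append(name).Append(" counter\n");

        foreach (var (labels, value) in samples)
        {
            sb.Append(name);
            if (labels.Count > 0)
            {
                var rendered = labels.Select(l => $"{l.Key}=\"{EscapeLabel(l.Value)}\"");
                sb.Append('{').Append(string.Join(",", rendered)).Append('}');
            }
            // Invariant culture keeps the decimal separator a dot on every host.
            sb.Append(' ').Append(value.ToString(CultureInfo.InvariantCulture)).Append('\n');
        }
        return sb.ToString();
    }
}
```

For `stella_requests_total` with labels `method="GET"` and `status_code="200"` and a value of 3, the sample line reads `stella_requests_total{method="GET",status_code="200"} 3`, preceded by the `# HELP` and `# TYPE` lines.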
---
## Service Registration
```csharp
namespace StellaOps.Router.Gateway;
public static class MetricsExtensions
{
public static IServiceCollection AddStellaMetrics(
this IServiceCollection services,
IConfiguration configuration)
{
services.Configure<MetricsConfig>(configuration.GetSection("Metrics"));
services.AddOpenTelemetry()
.WithMetrics(builder =>
{
builder
.AddMeter("StellaOps.Router")
.AddAspNetCoreInstrumentation()
.AddPrometheusExporter();
});
return services;
}
public static IServiceCollection AddStellaHealthChecks(
this IServiceCollection services)
{
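services.AddSingleton<HealthCheckService>();
services.AddSingleton<IHealthCheck, TransportHealthCheck>();
services.AddSingleton<IHealthCheck, AuthorityHealthCheck>();
services.AddSingleton<IHealthCheck, RateLimiterHealthCheck>();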
return services;
}
}
```
---
## YAML Configuration
```yaml
Metrics:
  Enabled: true
  Path: "/metrics"
  IncludePathLabel: false
  MaxPathCardinality: 100
  DurationBuckets:
    - 0.005
    - 0.01
    - 0.025
    - 0.05
    - 0.1
    - 0.25
    - 0.5
    - 1
    - 2.5
    - 5
    - 10
HealthChecks:
  Enabled: true
  Path: "/health"
  CacheDuration: "00:00:05"
```
---
## Deliverables
1. `StellaOps.Router.Common/StellaMetrics.cs`
2. `StellaOps.Router.Gateway/MetricsMiddleware.cs`
3. `StellaOps.Router.Transport/TransportMetricsCollector.cs`
4. `StellaOps.Router.Common/HealthCheckService.cs`
5. `StellaOps.Router.Gateway/TransportHealthCheck.cs`
6. `StellaOps.Router.Gateway/AuthorityHealthCheck.cs`
7. `StellaOps.Router.Gateway/HealthEndpoints.cs`
8. `StellaOps.Router.Gateway/PrometheusMetricsEndpoint.cs`
9. Metrics collection tests
10. Health check tests
---
## Next Step
Proceed to [Step 24: Circuit Breaker & Retry Policies](24-Step.md) to implement resilience patterns.