# component_architecture_gateway.md - **Stella Ops Gateway** (Sprint 3600)

> Derived from Reference Architecture Advisory and Router Architecture Specification

> **Scope.** The Gateway WebService is the single HTTP ingress point for all external traffic. It authenticates requests via Authority (DPoP/mTLS), routes to microservices via the Router binary protocol, aggregates OpenAPI specifications, and enforces tenant isolation.

> **Ownership:** Platform Guild

---

## 0) Mission & Boundaries

### What Gateway Does

- **HTTP Ingress**: Single entry point for all external HTTP/HTTPS traffic
- **Authentication**: DPoP and mTLS token validation via Authority integration
- **Routing**: Routes HTTP requests to microservices via binary protocol (TCP/TLS)
- **OpenAPI Aggregation**: Combines endpoint specs from all registered microservices
- **Health Aggregation**: Provides unified health status from downstream services
- **Rate Limiting**: Per-tenant and per-identity request throttling
- **Tenant Propagation**: Extracts tenant context and propagates to microservices

### What Gateway Does NOT Do

- **Business Logic**: No domain logic; pure routing and auth
- **Data Storage**: Stateless; no persistent state beyond connection cache
- **Direct Database Access**: Never connects to PostgreSQL directly
- **SBOM/VEX Processing**: Delegates to Scanner, Excititor, etc.

---

## 1) Solution & Project Layout

```
src/Gateway/
├── StellaOps.Gateway.WebService/
│   ├── StellaOps.Gateway.WebService.csproj
│   ├── Program.cs                            # DI bootstrap, transport init
│   ├── Dockerfile
│   ├── appsettings.json
│   ├── appsettings.Development.json
│   ├── Configuration/
│   │   ├── GatewayOptions.cs                 # All configuration options
│   │   └── TransportOptions.cs               # TCP/TLS transport config
│   ├── Middleware/
│   │   ├── TenantMiddleware.cs               # Tenant context extraction
│   │   ├── RequestRoutingMiddleware.cs       # HTTP → binary routing
│   │   ├── SenderConstraintMiddleware.cs     # DPoP/mTLS validation
│   │   ├── IdentityHeaderPolicyMiddleware.cs # Identity header sanitization
│   │   ├── CorrelationIdMiddleware.cs        # Request correlation
│   │   └── HealthCheckMiddleware.cs          # Health probe handling
│   ├── Services/
│   │   ├── GatewayHostedService.cs           # Transport lifecycle
│   │   ├── OpenApiAggregationService.cs      # Spec aggregation
│   │   └── HealthAggregationService.cs       # Downstream health
│   └── Endpoints/
│       ├── HealthEndpoints.cs                # /health/*, /metrics
│       └── OpenApiEndpoints.cs               # /openapi.json, /openapi.yaml
```

### Dependencies

```xml
```

---

## 2) External Dependencies

| Dependency | Purpose | Required |
|------------|---------|----------|
| **Authority** | OpTok validation, DPoP/mTLS | Yes |
| **Router.Gateway** | Routing state, endpoint discovery | Yes |
| **Router.Transport.Tcp** | Binary transport (dev) | Yes |
| **Router.Transport.Tls** | Binary transport (prod) | Yes |
| **Valkey/Redis** | Rate limiting state | Optional |

---

## 3) Contracts & Data Model

### Request Flow

```
┌──────────────┐     HTTPS      ┌─────────────────┐     Binary     ┌─────────────────┐
│   Client     │ ─────────────► │    Gateway      │ ─────────────► │  Microservice   │
│   (CLI/UI)   │                │   WebService    │     Frame      │  (Scanner,      │
│              │ ◄───────────── │                 │ ◄───────────── │   Policy, etc)  │
└──────────────┘     HTTPS      └─────────────────┘     Binary     └─────────────────┘
```

### Binary Frame Protocol

Gateway uses the Router binary protocol for internal communication:

| Frame Type | Purpose |
|------------|---------|
| HELLO | Microservice registration with endpoints |
| HEARTBEAT | Health check and latency measurement |
| REQUEST | HTTP request serialized to binary |
| RESPONSE | HTTP response serialized from binary |
| STREAM_DATA | Streaming response chunks |
| CANCEL | Request cancellation propagation |

### Endpoint Descriptor

```csharp
public sealed class EndpointDescriptor
{
    public required string Method { get; init; }       // GET, POST, etc.
    public required string Path { get; init; }         // /api/v1/scans/{id}
    public required string ServiceName { get; init; }  // scanner
    public required string Version { get; init; }      // 1.0.0
    public TimeSpan DefaultTimeout { get; init; }      // 30s
    public bool SupportsStreaming { get; init; }       // true for large responses

    // The element type was lost in the source; string is assumed here.
    public IReadOnlyList<string> RequiringClaims { get; init; }
}
```

### Routing State

```csharp
public interface IRoutingStateManager
{
    ValueTask RegisterEndpointsAsync(ConnectionState conn, HelloPayload hello);

    // The return type argument was lost in the source; ConnectionState? is assumed here.
    ValueTask<ConnectionState?> SelectInstanceAsync(string method, string path);

    ValueTask UpdateHealthAsync(ConnectionState conn, HeartbeatPayload heartbeat);
    ValueTask DrainConnectionAsync(string connectionId);
}
```

---

## 4) REST API

Gateway exposes minimal management endpoints; all business APIs are routed to microservices.
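
Routing a business API call depends on matching the incoming request path against the path templates carried by registered endpoint descriptors, for example `/api/v1/scans/{id}`. The following is a minimal sketch of such matching, not the Router's actual implementation. `RouteTemplate` and `TryMatch` are hypothetical names, and the sketch assumes segment-wise comparison, case-insensitive literal segments, no catch-all segments, and a request path with the query string already removed.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of the Router library.
public static class RouteTemplate
{
    // Returns true when requestPath fits the template.
    // Values captured by {name} placeholders are written to parameters.
    public static bool TryMatch(
        string template,
        string requestPath,
        out Dictionary<string, string> parameters)
    {
        parameters = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

        var templateSegments = template.Split('/', StringSplitOptions.RemoveEmptyEntries);
        var pathSegments = requestPath.Split('/', StringSplitOptions.RemoveEmptyEntries);

        // Assumption: no catch-all segments, so the segment counts must agree.
        if (templateSegments.Length != pathSegments.Length)
        {
            return false;
        }

        for (var i = 0; i < templateSegments.Length; i++)
        {
            var segment = templateSegments[i];
            var isPlaceholder = segment.Length > 2 && segment[0] == '{' && segment[^1] == '}';

            if (isPlaceholder)
            {
                // Strip the braces and record the captured value.
                parameters[segment[1..^1]] = pathSegments[i];
            }
            else if (!string.Equals(segment, pathSegments[i], StringComparison.OrdinalIgnoreCase))
            {
                // A literal segment differs, so discard partial captures and reject.
                parameters.Clear();
                return false;
            }
        }

        return true;
    }
}
```

With this sketch, `RouteTemplate.TryMatch("/api/v1/scans/{id}", "/api/v1/scans/xyz", out var p)` succeeds and captures `xyz` under the key `id`, while `/api/v1/scans` is rejected because the segment counts differ.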

### Health Endpoints

| Endpoint | Auth | Description |
|----------|------|-------------|
| `GET /health/live` | None | Liveness probe |
| `GET /health/ready` | None | Readiness probe |
| `GET /health/startup` | None | Startup probe |
| `GET /metrics` | None | Prometheus metrics |

### OpenAPI Endpoints

| Endpoint | Auth | Description |
|----------|------|-------------|
| `GET /openapi.json` | None | Aggregated OpenAPI 3.1.0 spec |
| `GET /openapi.yaml` | None | YAML format spec |

---

## 5) Execution Flow

### Request Routing

```mermaid
sequenceDiagram
    participant C as Client
    participant G as Gateway
    participant A as Authority
    participant M as Microservice

    C->>G: HTTPS Request + DPoP Token
    G->>A: Validate Token
    A-->>G: Claims (sub, tid, scope)
    G->>G: Select Instance (Method, Path)
    G->>M: Binary REQUEST Frame
    M-->>G: Binary RESPONSE Frame
    G-->>C: HTTPS Response
```

### Microservice Registration

```mermaid
sequenceDiagram
    participant M as Microservice
    participant G as Gateway

    M->>G: TCP/TLS Connect
    M->>G: HELLO (ServiceName, Version, Endpoints)
    G->>G: Register Endpoints
    G-->>M: HELLO ACK

    loop Every 10s
        G->>M: HEARTBEAT
        M-->>G: HEARTBEAT (latency, health)
        G->>G: Update Health State
    end
```

---

## 6) Instance Selection Algorithm

```csharp
// The return type argument was lost in the source; ConnectionState? is assumed here.
public ValueTask<ConnectionState?> SelectInstanceAsync(string method, string path)
{
    // 1. Find all endpoints matching (method, path)
    var candidates = _endpoints
        .Where(e => e.Method == method && MatchPath(e.Path, path))
        .ToList();

    // 2. Filter by health
    candidates = candidates
        .Where(c => c.Health is InstanceHealthStatus.Healthy or InstanceHealthStatus.Degraded)
        .ToList();

    // 3. Region preference: local first, then neighbors, then everything else
    var localRegion = candidates.Where(c => c.Region == _config.Region).ToList();
    var neighborRegions = candidates.Where(c => _config.NeighborRegions.Contains(c.Region)).ToList();
    var otherRegions = candidates.Except(localRegion).Except(neighborRegions).ToList();

    var preferred = localRegion.Any() ? localRegion
        : neighborRegions.Any() ? neighborRegions
        : otherRegions;

    // 4. Within tier: prefer lower latency, then most recent heartbeat
    var selected = preferred
        .OrderBy(c => c.AveragePingMs)
        .ThenByDescending(c => c.LastHeartbeatUtc)
        .FirstOrDefault();

    // The selection is synchronous, so wrap the result (null when nothing matches).
    return new ValueTask<ConnectionState?>(selected);
}
```

---

## 7) Configuration

```yaml
gateway:
  node:
    region: "eu1"
    nodeId: "gw-eu1-01"
    environment: "prod"

  transports:
    tcp:
      enabled: true
      port: 9100
      maxConnections: 1000
      receiveBufferSize: 65536
      sendBufferSize: 65536
    tls:
      enabled: true
      port: 9443
      certificatePath: "/certs/gateway.pfx"
      certificatePassword: "${GATEWAY_CERT_PASSWORD}"
      clientCertificateMode: "RequireCertificate"
      allowedClientCertificateThumbprints: []

  routing:
    defaultTimeout: "30s"
    maxRequestBodySize: "100MB"
    streamingEnabled: true
    streamingBufferSize: 16384
    neighborRegions: ["eu2", "us1"]

  auth:
    dpopEnabled: true
    dpopMaxClockSkew: "60s"
    mtlsEnabled: true

  rateLimiting:
    enabled: true
    requestsPerMinute: 1000
    burstSize: 100
    redisConnectionString: "${REDIS_URL}" # Valkey (Redis-compatible)

  openapi:
    enabled: true
    cacheTtlSeconds: 300
    title: "Stella Ops API"
    version: "1.0.0"

  health:
    heartbeatIntervalSeconds: 10
    heartbeatTimeoutSeconds: 30
    unhealthyThreshold: 3
```

---

## 8) Scale & Performance

| Metric | Target | Notes |
|--------|--------|-------|
| Routing latency (P50) | <2ms | Overhead only; excludes downstream |
| Routing latency (P99) | <5ms | Under normal load |
| Concurrent connections | 10,000 | Per gateway instance |
| Requests/second | 50,000 | Per gateway instance |
| Memory footprint | <512MB | Base; scales with connections |

### Scaling Strategy

- Horizontal scaling behind load balancer
- Sticky sessions NOT required (stateless)
- Regional deployment for latency optimization
- Rate limiting via distributed Valkey/Redis

---

## 9) Security Posture

### Authentication

| Method | Description |
|--------|-------------|
| DPoP | Proof-of-possession tokens from Authority |
| mTLS | Certificate-bound tokens for machine clients |

### Authorization

- Claims-based authorization per endpoint
- Required claims defined in endpoint descriptors
- Tenant isolation via `tid` claim

### Transport Security

| Component | Encryption |
|-----------|------------|
| Client → Gateway | TLS 1.3 (HTTPS) |
| Gateway → Microservices | TLS (prod), TCP (dev only) |

### Rate Limiting

Gateway uses the Router's dual-window rate limiting middleware with circuit breaker:

- **Instance-level** (in-memory): Per-router-instance limits using sliding window counters
  - High-precision sub-second buckets for fair rate distribution
  - No external dependencies; always available
- **Environment-level** (Valkey-backed): Cross-instance limits for distributed deployments
  - Atomic Lua scripts for consistent counting across instances
  - Circuit breaker pattern for fail-open behavior when Valkey is unavailable
- **Activation gate**: Environment-level checks only activate above a configurable traffic threshold
- **Response headers**: `X-RateLimit-Limit`, `X-RateLimit-Remaining`, `X-RateLimit-Reset`, `Retry-After`

Configuration via `appsettings.yaml`:

```yaml
rate_limiting:
  process_back_pressure_when_more_than_per_5min: 5000
  for_instance:
    rules:
      - max_requests: 100
        per_seconds: 1
      - max_requests: 1000
        per_seconds: 60
  for_environment:
    valkey_connection: "localhost:6379"
    rules:
      - max_requests: 10000
        per_seconds: 60
    circuit_breaker:
      failure_threshold: 3
      timeout_seconds: 30
      half_open_timeout: 10
```

---

## 10) Observability & Audit

### Metrics (Prometheus)

```
gateway_requests_total{service,method,path,status}
gateway_request_duration_seconds{service,method,path,quantile}
gateway_active_connections{service}
gateway_transport_frames_total{type}
gateway_auth_failures_total{reason}
gateway_rate_limit_exceeded_total{tenant}
```

### Traces (OpenTelemetry)

- Span per request: `gateway.route`
- Child span: `gateway.auth.validate`
- Child span: `gateway.transport.send`

### Logs (Structured)

```json
{
  "timestamp": "2025-12-21T10:00:00Z",
  "level": "info",
  "message": "Request routed",
  "correlationId": "abc123",
  "tenantId": "tenant-1",
  "method": "GET",
  "path": "/api/v1/scans/xyz",
  "service": "scanner",
  "durationMs": 45,
  "status": 200
}
```

---

## 11) Testing Matrix

| Test Type | Scope | Coverage Target |
|-----------|-------|-----------------|
| Unit | Routing algorithm, auth validation | 90% |
| Integration | Transport + routing flow | 80% |
| E2E | Full request path with mock services | Key flows |
| Performance | Latency, throughput, connection limits | SLO targets |
| Chaos | Connection failures, microservice crashes | Resilience |

### Test Fixtures

- `StellaOps.Router.Transport.InMemory` for transport mocking
- Mock Authority for auth testing
- `WebApplicationFactory` for integration tests

---

## 12) DevOps & Operations

### Deployment

```yaml
# Kubernetes deployment excerpt
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gateway
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: gateway
          image: stellaops/gateway:1.0.0
          ports:
            - containerPort: 8080 # HTTPS
            - containerPort: 9443 # TLS (microservices)
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "1000m"
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
```

### SLOs

| SLO | Target | Measurement |
|-----|--------|-------------|
| Availability | 99.9% | Uptime over 30 days |
| Latency P99 | <50ms | Includes downstream |
| Error rate | <0.1% | 5xx responses |

---

## 13) Roadmap

| Feature | Sprint | Status |
|---------|--------|--------|
| Core implementation | 3600.0001.0001 | TODO |
| Performance Testing Pipeline | 038 | DONE |
| WebSocket support | Future | Planned |
| gRPC passthrough | Future | Planned |
| GraphQL aggregation | Future | Exploration |

---

## 14) Performance Testing Pipeline (k6 + Prometheus + Correlation IDs)

### Overview

The Gateway includes a performance testing pipeline built from k6 load tests, Prometheus metric instrumentation, and Grafana dashboards for performance curve modelling.
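
Performance curve modelling comes down to folding per-request observations into fixed time windows and computing throughput, error rate, and latency percentiles for each window. The sketch below shows one way to do that, under stated assumptions: `Observation`, `CurvePoint`, and `CurveAggregator` are hypothetical, simplified stand-ins rather than the Gateway's real models, only 5xx responses count as errors (as in the SLO table), and percentiles use the nearest-rank method.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified shapes for illustration only.
public sealed record Observation(DateTimeOffset Timestamp, double DurationMs, int StatusCode);

public sealed record CurvePoint(
    DateTimeOffset WindowStart, double Rps, double ErrorRate, double P50Ms, double P99Ms);

public static class CurveAggregator
{
    // Groups observations into fixed windows and emits one curve point per non-empty window.
    public static IReadOnlyList<CurvePoint> Aggregate(
        IEnumerable<Observation> observations, TimeSpan window)
    {
        if (window <= TimeSpan.Zero)
        {
            throw new ArgumentOutOfRangeException(nameof(window));
        }

        return observations
            .GroupBy(o => o.Timestamp.UtcTicks / window.Ticks) // window index
            .OrderBy(g => g.Key)
            .Select(g =>
            {
                var durations = g.Select(o => o.DurationMs).OrderBy(d => d).ToArray();

                // Assumption: only 5xx responses count as errors.
                var errors = g.Count(o => o.StatusCode >= 500);

                return new CurvePoint(
                    new DateTimeOffset(g.Key * window.Ticks, TimeSpan.Zero),
                    durations.Length / window.TotalSeconds,
                    (double)errors / durations.Length,
                    Percentile(durations, 0.50),
                    Percentile(durations, 0.99));
            })
            .ToList();
    }

    // Nearest-rank percentile over an ascending, non-empty array.
    private static double Percentile(double[] sorted, double p)
    {
        var rank = (int)Math.Ceiling(p * sorted.Length);
        return sorted[Math.Clamp(rank - 1, 0, sorted.Length - 1)];
    }
}
```

Plotting the resulting points with RPS on one axis and P99 latency or error rate on the other gives a load curve; the window length trades smoothing against time resolution.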

### k6 Scenarios (A–G)

| Scenario | Purpose | VUs | Duration | Key Metric |
|----------|---------|-----|----------|------------|
| A - Health Baseline | Sub-ms health probe overhead | 10 | 1 min | P95 < 10 ms |
| B - OpenAPI Aggregation | Spec cache under concurrent readers | 50 | 75 s | P95 < 200 ms |
| C - Routing Throughput | Mixed-method routing at target RPS | 200 | 2 min | P50 < 2 ms, P99 < 5 ms |
| D - Correlation ID | Propagation overhead measurement | 20 | 1 min | P95 < 5 ms overhead |
| E - Rate Limit Boundary | Enforcement correctness at boundary | 100 | 1 min | Retry-After header |
| F - Connection Ramp | Transport saturation (ramp to 1000 VUs) | 1000 | 2 min | No 503 responses |
| G - Steady-State Soak | Memory leak / resource exhaustion | 50 | 10 min | Stable memory |

Run all scenarios:

```bash
k6 run --env BASE_URL=https://gateway.stella-ops.local src/Gateway/__Tests/load/gateway_performance.k6.js
```

Run a single scenario:

```bash
k6 run --env BASE_URL=https://gateway.stella-ops.local --env SCENARIO=scenario_c_routing_throughput src/Gateway/__Tests/load/gateway_performance.k6.js
```

### Performance Metrics (GatewayPerformanceMetrics)

Meter: `StellaOps.Gateway.Performance`

| Instrument | Type | Unit | Description |
|------------|------|------|-------------|
| `gateway.requests.total` | Counter | - | Total requests processed |
| `gateway.errors.total` | Counter | - | Errors (4xx/5xx) |
| `gateway.ratelimit.total` | Counter | - | Rate-limited requests (429) |
| `gateway.request.duration` | Histogram | ms | Full request duration |
| `gateway.auth.duration` | Histogram | ms | Auth middleware duration |
| `gateway.transport.duration` | Histogram | ms | TCP/TLS transport duration |
| `gateway.routing.duration` | Histogram | ms | Instance selection duration |

### Grafana Dashboard

Dashboard: `devops/telemetry/dashboards/stella-ops-gateway-performance.json`

UID: `stella-ops-gateway-performance`

Panels:

1. **Overview row**: P50/P99 gauges, error rate, RPS
2. **Latency Distribution**: Percentile time series (overall + per-service)
3. **Throughput & Rate Limiting**: RPS by service, rate-limited requests by route
4. **Pipeline Breakdown**: Auth/Routing/Transport P95 breakdown, errors by status
5. **Connections & Resources**: Active connections, endpoints, memory usage

### C# Models

| Type | Purpose |
|------|---------|
| `GatewayPerformanceObservation` | Single request observation (all pipeline phases) |
| `PerformanceScenarioConfig` | Scenario definition with SLO thresholds |
| `PerformanceCurvePoint` | Aggregated window data with computed RPS/error rate |
| `PerformanceTestSummary` | Complete test run result with threshold violations |
| `GatewayPerformanceMetrics` | OTel service emitting Prometheus-compatible metrics |

---

## 15) References

- Router Architecture: `docs/modules/router/architecture.md`
- Gateway Identity Header Policy: `docs/modules/gateway/identity-header-policy.md`
- OpenAPI Aggregation: `docs/modules/gateway/openapi.md`
- Router ASP.NET Endpoint Bridge: `docs/modules/router/aspnet-endpoint-bridge.md`
- Router Messaging (Valkey) Transport: `docs/modules/router/messaging-valkey-transport.md`
- Authority Integration: `docs/modules/authority/architecture.md`
- Reference Architecture: `docs/product/advisories/archived/2025-12-21-reference-architecture/`

---

**Last Updated**: 2025-12-21 (Sprint 3600)