component_architecture_gateway.md — Stella Ops Gateway (Sprint 3600)

Derived from Reference Architecture Advisory and Router Architecture Specification

Dual-location clarification (updated 2026-02-22). Both src/Gateway/ and src/Router/ contain a project named StellaOps.Gateway.WebService. They are different implementations serving complementary roles:

  • src/Gateway/ (this module) — the simplified HTTP ingress gateway focused on authentication, routing to microservices via binary protocol, and OpenAPI aggregation.
  • src/Router/ — the evolved "Front Door" gateway with advanced features: configurable route tables (GatewayRouteCatalog), reverse proxy, SPA hosting, WebSocket support, Valkey messaging transport, and extended Authority integration.

The Router version (src/Router/) appears to be the current canonical deployment target. This Gateway version may represent a simplified or legacy configuration. Operators should verify which is deployed in their environment. See also Router Architecture.

Scope. The Gateway WebService is the single HTTP ingress point for all external traffic. It authenticates requests via Authority (DPoP/mTLS), routes to microservices via the Router binary protocol, aggregates OpenAPI specifications, and enforces tenant isolation.

Ownership: Platform Guild


0) Mission & Boundaries

What Gateway Does

  • HTTP Ingress: Single entry point for all external HTTP/HTTPS traffic
  • Authentication: DPoP and mTLS token validation via Authority integration
  • Routing: Routes HTTP requests to microservices via binary protocol (TCP/TLS)
  • OpenAPI Aggregation: Combines endpoint specs from all registered microservices
  • Health Aggregation: Provides unified health status from downstream services
  • Rate Limiting: Per-tenant and per-identity request throttling
  • Tenant Propagation: Extracts tenant context and propagates to microservices

What Gateway Does NOT Do

  • Business Logic: No domain logic; pure routing and auth
  • Data Storage: Stateless; no persistent state beyond connection cache
  • Direct Database Access: Never connects to PostgreSQL directly
  • SBOM/VEX Processing: Delegates to Scanner, Excititor, etc.

1) Solution & Project Layout

src/Gateway/
├── StellaOps.Gateway.WebService/
│   ├── StellaOps.Gateway.WebService.csproj
│   ├── Program.cs                          # DI bootstrap, transport init
│   ├── Dockerfile
│   ├── appsettings.json
│   ├── appsettings.Development.json
│   ├── Configuration/
│   │   ├── GatewayOptions.cs               # All configuration options
│   │   └── TransportOptions.cs             # TCP/TLS transport config
│   ├── Middleware/
│   │   ├── TenantMiddleware.cs             # Tenant context extraction
│   │   ├── RequestRoutingMiddleware.cs     # HTTP → binary routing
│   │   ├── SenderConstraintMiddleware.cs   # DPoP/mTLS validation
│   │   ├── IdentityHeaderPolicyMiddleware.cs # Identity header sanitization
│   │   ├── CorrelationIdMiddleware.cs      # Request correlation
│   │   └── HealthCheckMiddleware.cs        # Health probe handling
│   ├── Services/
│   │   ├── GatewayHostedService.cs         # Transport lifecycle
│   │   ├── OpenApiAggregationService.cs    # Spec aggregation
│   │   └── HealthAggregationService.cs     # Downstream health
│   └── Endpoints/
│       ├── HealthEndpoints.cs              # /health/*, /metrics
│       └── OpenApiEndpoints.cs             # /openapi.json, /openapi.yaml

Dependencies

<ItemGroup>
  <ProjectReference Include="..\..\__Libraries\StellaOps.Router.Gateway\..." />
  <ProjectReference Include="..\..\__Libraries\StellaOps.Router.Transport.Tcp\..." />
  <ProjectReference Include="..\..\__Libraries\StellaOps.Router.Transport.Tls\..." />
  <ProjectReference Include="..\..\Auth\StellaOps.Auth.ServerIntegration\..." />
</ItemGroup>

2) External Dependencies

| Dependency | Purpose | Required |
|---|---|---|
| Authority | OpTok validation, DPoP/mTLS | Yes |
| Router.Gateway | Routing state, endpoint discovery | Yes |
| Router.Transport.Tcp | Binary transport (dev) | Yes |
| Router.Transport.Tls | Binary transport (prod) | Yes |
| Valkey/Redis | Rate limiting state | Optional |

3) Contracts & Data Model

Request Flow

┌──────────────┐     HTTPS      ┌─────────────────┐    Binary     ┌─────────────────┐
│   Client     │ ─────────────► │    Gateway      │ ────────────► │  Microservice   │
│  (CLI/UI)    │                │   WebService    │    Frame      │   (Scanner,     │
│              │ ◄───────────── │                 │ ◄──────────── │    Policy, etc) │
└──────────────┘     HTTPS      └─────────────────┘    Binary     └─────────────────┘

Binary Frame Protocol

Gateway uses the Router binary protocol for internal communication:

| Frame Type | Purpose |
|---|---|
| HELLO | Microservice registration with endpoints |
| HEARTBEAT | Health check and latency measurement |
| REQUEST | HTTP request serialized to binary |
| RESPONSE | HTTP response serialized from binary |
| STREAM_DATA | Streaming response chunks |
| CANCEL | Request cancellation propagation |
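
To make the frame types concrete, here is a minimal sketch of a length-prefixed frame codec. The six frame names come from the table above; the numeric values, the 5-byte header layout, and the `FrameCodec` helper are assumptions for illustration and are not the actual Router wire format.

```csharp
using System;
using System.Buffers.Binary;

// Frame types named in the table above; the numeric values are assumptions.
public enum FrameType : byte
{
    Hello = 1,
    Heartbeat = 2,
    Request = 3,
    Response = 4,
    StreamData = 5,
    Cancel = 6,
}

public static class FrameCodec
{
    // Hypothetical layout: [1 byte type][4 bytes big-endian payload length][payload].
    public const int HeaderSize = 5;

    public static byte[] Encode(FrameType type, ReadOnlySpan<byte> payload)
    {
        var buffer = new byte[HeaderSize + payload.Length];
        buffer[0] = (byte)type;
        BinaryPrimitives.WriteInt32BigEndian(buffer.AsSpan(1, 4), payload.Length);
        payload.CopyTo(buffer.AsSpan(HeaderSize));
        return buffer;
    }

    // Returns false when the buffer does not yet hold one complete frame.
    public static bool TryDecode(ReadOnlySpan<byte> buffer, out FrameType type, out byte[] payload)
    {
        type = default;
        payload = Array.Empty<byte>();
        if (buffer.Length < HeaderSize) return false;

        int length = BinaryPrimitives.ReadInt32BigEndian(buffer.Slice(1, 4));
        if (length < 0 || buffer.Length - HeaderSize < length) return false;

        type = (FrameType)buffer[0];
        payload = buffer.Slice(HeaderSize, length).ToArray();
        return true;
    }
}
```

A length prefix lets the receiver detect partial reads on a TCP stream: `TryDecode` simply returns false until enough bytes have arrived.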

Endpoint Descriptor

public sealed class EndpointDescriptor
{
    public required string Method { get; init; }      // GET, POST, etc.
    public required string Path { get; init; }        // /api/v1/scans/{id}
    public required string ServiceName { get; init; } // scanner
    public required string Version { get; init; }     // 1.0.0
    public TimeSpan DefaultTimeout { get; init; }     // 30s
    public bool SupportsStreaming { get; init; }      // true for large responses
    public IReadOnlyList<ClaimRequirement> RequiringClaims { get; init; }
}

Routing State

public interface IRoutingStateManager
{
    ValueTask RegisterEndpointsAsync(ConnectionState conn, HelloPayload hello);
    ValueTask<InstanceSelection?> SelectInstanceAsync(string method, string path);
    ValueTask UpdateHealthAsync(ConnectionState conn, HeartbeatPayload heartbeat);
    ValueTask DrainConnectionAsync(string connectionId);
}

4) REST API

Gateway exposes minimal management endpoints; all business APIs are routed to microservices.

Health Endpoints

| Endpoint | Auth | Description |
|---|---|---|
| GET /health/live | None | Liveness probe |
| GET /health/ready | None | Readiness probe |
| GET /health/startup | None | Startup probe |
| GET /metrics | None | Prometheus metrics |

OpenAPI Endpoints

| Endpoint | Auth | Description |
|---|---|---|
| GET /openapi.json | None | Aggregated OpenAPI 3.1.0 spec |
| GET /openapi.yaml | None | YAML format spec |

5) Execution Flow

Request Routing

sequenceDiagram
    participant C as Client
    participant G as Gateway
    participant A as Authority
    participant M as Microservice

    C->>G: HTTPS Request + DPoP Token
    G->>A: Validate Token
    A-->>G: Claims (sub, tid, scope)
    G->>G: Select Instance (Method, Path)
    G->>M: Binary REQUEST Frame
    M-->>G: Binary RESPONSE Frame
    G-->>C: HTTPS Response

Microservice Registration

sequenceDiagram
    participant M as Microservice
    participant G as Gateway

    M->>G: TCP/TLS Connect
    M->>G: HELLO (ServiceName, Version, Endpoints)
    G->>G: Register Endpoints
    G-->>M: HELLO ACK

    loop Every 10s
        G->>M: HEARTBEAT
        M-->>G: HEARTBEAT (latency, health)
        G->>G: Update Health State
    end

6) Instance Selection Algorithm

public ValueTask<InstanceSelection?> SelectInstanceAsync(string method, string path)
{
    // 1. Find all endpoints matching (method, path)
    var candidates = _endpoints
        .Where(e => e.Method == method && MatchPath(e.Path, path))
        .ToList();

    // 2. Filter by health
    candidates = candidates
        .Where(c => c.Health is InstanceHealthStatus.Healthy or InstanceHealthStatus.Degraded)
        .ToList();

    // 3. Region preference
    var localRegion = candidates.Where(c => c.Region == _config.Region).ToList();
    var neighborRegions = candidates.Where(c => _config.NeighborRegions.Contains(c.Region)).ToList();
    var otherRegions = candidates.Except(localRegion).Except(neighborRegions).ToList();

    var preferred = localRegion.Any() ? localRegion
                  : neighborRegions.Any() ? neighborRegions
                  : otherRegions;

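    // 4. Within tier: prefer lower latency, then most recent heartbeat
    var selected = preferred
        .OrderBy(c => c.AveragePingMs)
        .ThenByDescending(c => c.LastHeartbeatUtc)
        .FirstOrDefault();

    // Selection is synchronous, so wrap the result; null means no healthy instance matched.
    // Assumes the entries in _endpoints are InstanceSelection values.
    return ValueTask.FromResult<InstanceSelection?>(selected);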
}
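
The selection code calls `MatchPath`, which this document does not define. The sketch below shows one plausible shape for it, assuming templates such as `/api/v1/scans/{id}` where a `{name}` segment matches any single path segment. The `PathTemplate` class name and the case-insensitive comparison are assumptions, not the Router's actual matching rules.

```csharp
using System;

public static class PathTemplate
{
    // Hypothetical helper: returns true when `path` fits `template`.
    // A template segment written as {name} matches any single non-empty segment.
    // The caller is expected to pass the path without its query string.
    public static bool MatchPath(string template, string path)
    {
        var templateSegments = template.Split('/', StringSplitOptions.RemoveEmptyEntries);
        var pathSegments = path.Split('/', StringSplitOptions.RemoveEmptyEntries);
        if (templateSegments.Length != pathSegments.Length) return false;

        for (var i = 0; i < templateSegments.Length; i++)
        {
            var segment = templateSegments[i];
            var isParameter = segment.Length > 2 && segment[0] == '{' && segment[^1] == '}';
            if (isParameter) continue;

            // Assumption: literal segments compare case-insensitively.
            if (!string.Equals(segment, pathSegments[i], StringComparison.OrdinalIgnoreCase))
                return false;
        }

        return true;
    }
}
```

For example, `PathTemplate.MatchPath("/api/v1/scans/{id}", "/api/v1/scans/xyz")` returns true, while the same template does not match `/api/v1/scans` because the segment counts differ.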

7) Configuration

gateway:
  node:
    region: "eu1"
    nodeId: "gw-eu1-01"
    environment: "prod"

  transports:
    tcp:
      enabled: true
      port: 9100
      maxConnections: 1000
      receiveBufferSize: 65536
      sendBufferSize: 65536
    tls:
      enabled: true
      port: 9443
      certificatePath: "/certs/gateway.pfx"
      certificatePassword: "${GATEWAY_CERT_PASSWORD}"
      clientCertificateMode: "RequireCertificate"
      allowedClientCertificateThumbprints: []

  routing:
    defaultTimeout: "30s"
    maxRequestBodySize: "100MB"
    streamingEnabled: true
    streamingBufferSize: 16384
    neighborRegions: ["eu2", "us1"]

  auth:
    dpopEnabled: true
    dpopMaxClockSkew: "60s"
    mtlsEnabled: true
    rateLimiting:
      enabled: true
      requestsPerMinute: 1000
      burstSize: 100
      redisConnectionString: "${REDIS_URL}"  # Valkey (Redis-compatible)

  openapi:
    enabled: true
    cacheTtlSeconds: 300
    title: "Stella Ops API"
    version: "1.0.0"

  health:
    heartbeatIntervalSeconds: 10
    heartbeatTimeoutSeconds: 30
    unhealthyThreshold: 3
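
The `health` settings above suggest how missed heartbeats turn into a health state. The following is a sketch under one possible reading: an instance becomes Degraded after one missed interval and Unhealthy after `unhealthyThreshold` missed intervals (3 x 10 s, which lines up with the 30 s timeout). The `HeartbeatTracker` type and the `Unhealthy` enum member are assumptions; only `Healthy` and `Degraded` appear in this document.

```csharp
using System;

// Healthy and Degraded are named in the selection algorithm; Unhealthy is assumed.
public enum InstanceHealthStatus { Healthy, Degraded, Unhealthy }

public sealed class HeartbeatTracker
{
    private readonly TimeSpan _interval;
    private readonly int _unhealthyThreshold;
    private DateTimeOffset _lastHeartbeatUtc;

    public HeartbeatTracker(TimeSpan interval, int unhealthyThreshold, DateTimeOffset connectedAtUtc)
    {
        if (interval <= TimeSpan.Zero) throw new ArgumentOutOfRangeException(nameof(interval));
        _interval = interval;
        _unhealthyThreshold = unhealthyThreshold;
        _lastHeartbeatUtc = connectedAtUtc;
    }

    public void RecordHeartbeat(DateTimeOffset nowUtc) => _lastHeartbeatUtc = nowUtc;

    public InstanceHealthStatus Evaluate(DateTimeOffset nowUtc)
    {
        // Number of whole heartbeat intervals that elapsed without a reply.
        long missed = (nowUtc - _lastHeartbeatUtc).Ticks / _interval.Ticks;
        if (missed >= _unhealthyThreshold) return InstanceHealthStatus.Unhealthy;
        return missed >= 1 ? InstanceHealthStatus.Degraded : InstanceHealthStatus.Healthy;
    }
}
```

Passing the clock in as a parameter keeps the tracker deterministic under test; a production version would also need to be thread-safe.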

8) Scale & Performance

| Metric | Target | Notes |
|---|---|---|
| Routing latency (P50) | <2ms | Overhead only; excludes downstream |
| Routing latency (P99) | <5ms | Under normal load |
| Concurrent connections | 10,000 | Per gateway instance |
| Requests/second | 50,000 | Per gateway instance |
| Memory footprint | <512MB | Base; scales with connections |

Scaling Strategy

  • Horizontal scaling behind load balancer
  • Sticky sessions NOT required (stateless)
  • Regional deployment for latency optimization
  • Rate limiting via distributed Valkey/Redis

9) Security Posture

Authentication

| Method | Description |
|---|---|
| DPoP | Proof-of-possession tokens from Authority |
| mTLS | Certificate-bound tokens for machine clients |

Authorization

  • Claims-based authorization per endpoint
  • Required claims defined in endpoint descriptors
  • Tenant isolation via tid claim
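
The claims-based checks above could look roughly like the sketch below, built on `System.Security.Claims`. The document names `ClaimRequirement` but does not define it, so the record shape here (a claim type plus an optional value) and the `EndpointAuthorization` helper are assumptions.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Security.Claims;

// Hypothetical shape: a null Value means "any value of this claim type is enough".
public sealed record ClaimRequirement(string Type, string? Value = null);

public static class EndpointAuthorization
{
    // True only when the principal satisfies every requirement of the endpoint.
    public static bool IsAuthorized(ClaimsPrincipal user, IReadOnlyList<ClaimRequirement> required) =>
        required.All(r => r.Value is null
            ? user.FindFirst(r.Type) is not null
            : user.HasClaim(r.Type, r.Value));

    // Tenant isolation keys off the tid claim; null means no tenant context.
    public static string? GetTenantId(ClaimsPrincipal user) => user.FindFirst("tid")?.Value;
}
```

This sketch treats each claim value as atomic. If `scope` arrives as a single space-delimited string, it would need to be split before matching.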

Transport Security

| Component | Encryption |
|---|---|
| Client → Gateway | TLS 1.3 (HTTPS) |
| Gateway → Microservices | TLS (prod), TCP (dev only) |

Rate Limiting

Gateway uses the Router's dual-window rate limiting middleware with circuit breaker:

  • Instance-level (in-memory): Per-router-instance limits using sliding window counters
    • High-precision sub-second buckets for fair rate distribution
    • No external dependencies; always available
  • Environment-level (Valkey-backed): Cross-instance limits for distributed deployments
    • Atomic Lua scripts for consistent counting across instances
    • Circuit breaker pattern for fail-open behavior when Valkey is unavailable
  • Activation gate: Environment-level checks only activate above a configurable traffic threshold
  • Response headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, Retry-After
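
As an illustration of the instance-level limit, the sketch below uses a sliding-window log (a queue of timestamps), which is a simpler stand-in for the bucketed sliding-window counters described above rather than the Router's implementation. The `SlidingWindowLimiter` name and its millisecond clock parameter are assumptions.

```csharp
using System.Collections.Generic;

public sealed class SlidingWindowLimiter
{
    private readonly int _maxRequests;
    private readonly long _windowMs;
    private readonly Queue<long> _timestamps = new();
    private readonly object _gate = new();

    public SlidingWindowLimiter(int maxRequests, long windowMs)
    {
        _maxRequests = maxRequests;
        _windowMs = windowMs;
    }

    // Returns true when the request at nowMs fits within the window limit.
    public bool TryAcquire(long nowMs)
    {
        lock (_gate)
        {
            // Drop timestamps that have slid out of the window.
            while (_timestamps.Count > 0 && nowMs - _timestamps.Peek() >= _windowMs)
                _timestamps.Dequeue();

            if (_timestamps.Count >= _maxRequests) return false;

            _timestamps.Enqueue(nowMs);
            return true;
        }
    }
}
```

A rule such as 100 requests per second would map to `new SlidingWindowLimiter(100, 1000)`. The log keeps one entry per admitted request, so memory grows with the limit; bucketed counters trade some precision for constant memory.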

Configuration via appsettings.yaml:

rate_limiting:
  process_back_pressure_when_more_than_per_5min: 5000
  for_instance:
    rules:
      - max_requests: 100
        per_seconds: 1
      - max_requests: 1000
        per_seconds: 60
  for_environment:
    valkey_connection: "localhost:6379"
    rules:
      - max_requests: 10000
        per_seconds: 60
    circuit_breaker:
      failure_threshold: 3
      timeout_seconds: 30
      half_open_timeout: 10
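
The `circuit_breaker` settings can be read as a conventional closed/open/half-open state machine. The sketch below models only `failure_threshold` and `timeout_seconds`; the meaning of `half_open_timeout` is not stated in this document, so it is left out. `CircuitBreaker` and `CircuitState` are hypothetical names.

```csharp
using System;

public enum CircuitState { Closed, Open, HalfOpen }

public sealed class CircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _openTimeout;
    private int _consecutiveFailures;
    private DateTimeOffset _openedAt;

    public CircuitBreaker(int failureThreshold, TimeSpan openTimeout)
    {
        _failureThreshold = failureThreshold;
        _openTimeout = openTimeout;
    }

    public CircuitState State { get; private set; } = CircuitState.Closed;

    // False means: skip the Valkey check and fail open (allow the request).
    public bool ShouldAttempt(DateTimeOffset now)
    {
        if (State == CircuitState.Open && now - _openedAt >= _openTimeout)
            State = CircuitState.HalfOpen; // timeout elapsed, allow probe calls

        return State != CircuitState.Open;
    }

    public void RecordSuccess()
    {
        _consecutiveFailures = 0;
        State = CircuitState.Closed;
    }

    public void RecordFailure(DateTimeOffset now)
    {
        _consecutiveFailures++;
        // A failed probe reopens immediately; otherwise open at the threshold.
        if (State == CircuitState.HalfOpen || _consecutiveFailures >= _failureThreshold)
        {
            State = CircuitState.Open;
            _openedAt = now;
        }
    }
}
```

With the values above this would be `new CircuitBreaker(3, TimeSpan.FromSeconds(30))`. The sketch is not thread-safe, and it admits every caller while half-open rather than a single probe.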

10) Observability & Audit

Metrics (Prometheus)

gateway_requests_total{service,method,path,status}
gateway_request_duration_seconds{service,method,path,quantile}
gateway_active_connections{service}
gateway_transport_frames_total{type}
gateway_auth_failures_total{reason}
gateway_rate_limit_exceeded_total{tenant}

Traces (OpenTelemetry)

  • Span per request: gateway.route
  • Child span: gateway.auth.validate
  • Child span: gateway.transport.send

Logs (Structured)

{
  "timestamp": "2025-12-21T10:00:00Z",
  "level": "info",
  "message": "Request routed",
  "correlationId": "abc123",
  "tenantId": "tenant-1",
  "method": "GET",
  "path": "/api/v1/scans/xyz",
  "service": "scanner",
  "durationMs": 45,
  "status": 200
}

11) Testing Matrix

| Test Type | Scope | Coverage Target |
|---|---|---|
| Unit | Routing algorithm, auth validation | 90% |
| Integration | Transport + routing flow | 80% |
| E2E | Full request path with mock services | Key flows |
| Performance | Latency, throughput, connection limits | SLO targets |
| Chaos | Connection failures, microservice crashes | Resilience |

Test Fixtures

  • StellaOps.Router.Transport.InMemory for transport mocking
  • Mock Authority for auth testing
  • WebApplicationFactory for integration tests

12) DevOps & Operations

Deployment

# Kubernetes deployment excerpt
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gateway
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: gateway
        image: stellaops/gateway:1.0.0
        ports:
        - containerPort: 8080   # HTTPS
        - containerPort: 9443   # TLS (microservices)
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "1000m"
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080

SLOs

| SLO | Target | Measurement |
|---|---|---|
| Availability | 99.9% | Uptime over 30 days |
| Latency P99 | <50ms | Includes downstream |
| Error rate | <0.1% | 5xx responses |
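
To put the availability target in concrete terms, the short sketch below converts a percentage and a window into an allowed-downtime budget. `ErrorBudget` is a hypothetical helper, not part of the Gateway code base.

```csharp
using System;

public static class ErrorBudget
{
    // Allowed downtime for an availability target over a measurement window.
    public static TimeSpan AllowedDowntime(decimal availabilityPercent, TimeSpan window)
    {
        var unavailableFraction = (100m - availabilityPercent) / 100m;
        var seconds = (decimal)window.TotalSeconds * unavailableFraction;
        return TimeSpan.FromSeconds((double)seconds);
    }
}
```

For 99.9% over 30 days, the window is 2,592,000 seconds and the unavailable fraction is 0.001, so `ErrorBudget.AllowedDowntime(99.9m, TimeSpan.FromDays(30))` yields 2,592 seconds, or 43 minutes 12 seconds.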

13) Roadmap

| Feature | Sprint | Status |
|---|---|---|
| Core implementation | 3600.0001.0001 | TODO |
| Performance Testing Pipeline | 038 | DONE |
| WebSocket support | Future | Planned |
| gRPC passthrough | Future | Planned |
| GraphQL aggregation | Future | Exploration |

14) Performance Testing Pipeline (k6 + Prometheus + Correlation IDs)

Overview

The Gateway includes a comprehensive performance testing pipeline with k6 load tests, Prometheus metric instrumentation, and Grafana dashboards for performance curve modelling.

k6 Scenarios (A-G)

| Scenario | Purpose | VUs | Duration | Key Metric |
|---|---|---|---|---|
| A: Health Baseline | Sub-ms health probe overhead | 10 | 1 min | P95 < 10 ms |
| B: OpenAPI Aggregation | Spec cache under concurrent readers | 50 | 75 s | P95 < 200 ms |
| C: Routing Throughput | Mixed-method routing at target RPS | 200 | 2 min | P50 < 2 ms, P99 < 5 ms |
| D: Correlation ID | Propagation overhead measurement | 20 | 1 min | P95 < 5 ms overhead |
| E: Rate Limit Boundary | Enforcement correctness at boundary | 100 | 1 min | Retry-After header |
| F: Connection Ramp | Transport saturation (ramp to 1000 VUs) | 1000 | 2 min | No 503 responses |
| G: Steady-State Soak | Memory leak / resource exhaustion | 50 | 10 min | Stable memory |

Run all scenarios:

k6 run --env BASE_URL=https://gateway.stella-ops.local src/Gateway/__Tests/load/gateway_performance.k6.js

Run a single scenario:

k6 run --env BASE_URL=https://gateway.stella-ops.local --env SCENARIO=scenario_c_routing_throughput src/Gateway/__Tests/load/gateway_performance.k6.js

Performance Metrics (GatewayPerformanceMetrics)

Meter: StellaOps.Gateway.Performance

| Instrument | Type | Unit | Description |
|---|---|---|---|
| gateway.requests.total | Counter | | Total requests processed |
| gateway.errors.total | Counter | | Errors (4xx/5xx) |
| gateway.ratelimit.total | Counter | | Rate-limited requests (429) |
| gateway.request.duration | Histogram | ms | Full request duration |
| gateway.auth.duration | Histogram | ms | Auth middleware duration |
| gateway.transport.duration | Histogram | ms | TCP/TLS transport duration |
| gateway.routing.duration | Histogram | ms | Instance selection duration |
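
A few of these instruments could be declared with `System.Diagnostics.Metrics` roughly as follows. This is a sketch, not the actual `GatewayPerformanceMetrics` class: the meter and instrument names come from the table above, while the class name, the `RecordRequest` method, and the `service` tag are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public sealed class GatewayPerformanceMetricsSketch : IDisposable
{
    private readonly Meter _meter = new("StellaOps.Gateway.Performance");
    private readonly Counter<long> _requests;
    private readonly Counter<long> _errors;
    private readonly Histogram<double> _requestDuration;

    public GatewayPerformanceMetricsSketch()
    {
        _requests = _meter.CreateCounter<long>(
            "gateway.requests.total", description: "Total requests processed");
        _errors = _meter.CreateCounter<long>(
            "gateway.errors.total", description: "Errors (4xx/5xx)");
        _requestDuration = _meter.CreateHistogram<double>(
            "gateway.request.duration", unit: "ms", description: "Full request duration");
    }

    // Records one completed request; the "service" tag name is an assumption.
    public void RecordRequest(string service, int statusCode, double durationMs)
    {
        var tag = new KeyValuePair<string, object?>("service", service);
        _requests.Add(1, tag);
        if (statusCode >= 400) _errors.Add(1, tag);
        _requestDuration.Record(durationMs, tag);
    }

    public void Dispose() => _meter.Dispose();
}
```

An OpenTelemetry exporter subscribed to the `StellaOps.Gateway.Performance` meter would then be able to expose these values to Prometheus.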

Grafana Dashboard

Dashboard: devops/telemetry/dashboards/stella-ops-gateway-performance.json
UID: stella-ops-gateway-performance

Panels:

  1. Overview row — P50/P99 gauges, error rate, RPS
  2. Latency Distribution — Percentile time series (overall + per-service)
  3. Throughput & Rate Limiting — RPS by service, rate-limited requests by route
  4. Pipeline Breakdown — Auth/Routing/Transport P95 breakdown, errors by status
  5. Connections & Resources — Active connections, endpoints, memory usage

C# Models

| Type | Purpose |
|---|---|
| GatewayPerformanceObservation | Single request observation (all pipeline phases) |
| PerformanceScenarioConfig | Scenario definition with SLO thresholds |
| PerformanceCurvePoint | Aggregated window data with computed RPS/error rate |
| PerformanceTestSummary | Complete test run result with threshold violations |
| GatewayPerformanceMetrics | OTel service emitting Prometheus-compatible metrics |

15) References

  • Router Architecture: docs/modules/router/architecture.md
  • Gateway Identity Header Policy: docs/modules/gateway/identity-header-policy.md
  • OpenAPI Aggregation: docs/modules/gateway/openapi.md
  • Router ASP.NET Endpoint Bridge: docs/modules/router/aspnet-endpoint-bridge.md
  • Router Messaging (Valkey) Transport: docs/modules/router/messaging-valkey-transport.md
  • Authority Integration: docs/modules/authority/architecture.md
  • Reference Architecture: docs/product/advisories/archived/2025-12-21-reference-architecture/

Last Updated: 2025-12-21 (Sprint 3600)