component_architecture_gateway.md — Stella Ops Gateway (Sprint 3600)
Derived from Reference Architecture Advisory and Router Architecture Specification
Scope. The Gateway WebService is the single HTTP ingress point for all external traffic. It authenticates requests via Authority (DPoP/mTLS), routes to microservices via the Router binary protocol, aggregates OpenAPI specifications, and enforces tenant isolation.
Ownership: Platform Guild
0) Mission & Boundaries
What Gateway Does
- HTTP Ingress: Single entry point for all external HTTP/HTTPS traffic
- Authentication: DPoP and mTLS token validation via Authority integration
- Routing: Routes HTTP requests to microservices via binary protocol (TCP/TLS)
- OpenAPI Aggregation: Combines endpoint specs from all registered microservices
- Health Aggregation: Provides unified health status from downstream services
- Rate Limiting: Per-tenant and per-identity request throttling
- Tenant Propagation: Extracts tenant context and propagates to microservices
What Gateway Does NOT Do
- Business Logic: No domain logic; pure routing and auth
- Data Storage: Stateless; no persistent state beyond connection cache
- Direct Database Access: Never connects to PostgreSQL directly
- SBOM/VEX Processing: Delegates to Scanner, Excititor, etc.
1) Solution & Project Layout
src/Gateway/
├── StellaOps.Gateway.WebService/
│   ├── StellaOps.Gateway.WebService.csproj
│   ├── Program.cs                              # DI bootstrap, transport init
│   ├── Dockerfile
│   ├── appsettings.json
│   ├── appsettings.Development.json
│   ├── Configuration/
│   │   ├── GatewayOptions.cs                   # All configuration options
│   │   └── TransportOptions.cs                 # TCP/TLS transport config
│   ├── Middleware/
│   │   ├── TenantMiddleware.cs                 # Tenant context extraction
│   │   ├── RequestRoutingMiddleware.cs         # HTTP → binary routing
│   │   ├── SenderConstraintMiddleware.cs       # DPoP/mTLS validation
│   │   ├── IdentityHeaderPolicyMiddleware.cs   # Identity header sanitization
│   │   ├── CorrelationIdMiddleware.cs          # Request correlation
│   │   └── HealthCheckMiddleware.cs            # Health probe handling
│   ├── Services/
│   │   ├── GatewayHostedService.cs             # Transport lifecycle
│   │   ├── OpenApiAggregationService.cs        # Spec aggregation
│   │   └── HealthAggregationService.cs         # Downstream health
│   └── Endpoints/
│       ├── HealthEndpoints.cs                  # /health/*, /metrics
│       └── OpenApiEndpoints.cs                 # /openapi.json, /openapi.yaml
Dependencies
<ItemGroup>
<ProjectReference Include="..\..\__Libraries\StellaOps.Router.Gateway\..." />
<ProjectReference Include="..\..\__Libraries\StellaOps.Router.Transport.Tcp\..." />
<ProjectReference Include="..\..\__Libraries\StellaOps.Router.Transport.Tls\..." />
<ProjectReference Include="..\..\Auth\StellaOps.Auth.ServerIntegration\..." />
</ItemGroup>
2) External Dependencies
| Dependency | Purpose | Required |
|---|---|---|
| Authority | OpTok validation, DPoP/mTLS | Yes |
| Router.Gateway | Routing state, endpoint discovery | Yes |
| Router.Transport.Tcp | Binary transport (dev) | Yes |
| Router.Transport.Tls | Binary transport (prod) | Yes |
| Valkey/Redis | Rate limiting state | Optional |
3) Contracts & Data Model
Request Flow
┌──────────────┐ HTTPS ┌─────────────────┐ Binary ┌─────────────────┐
│ Client │ ─────────────► │ Gateway │ ────────────► │ Microservice │
│ (CLI/UI) │ │ WebService │ Frame │ (Scanner, │
│ │ ◄───────────── │ │ ◄──────────── │ Policy, etc) │
└──────────────┘ HTTPS └─────────────────┘ Binary └─────────────────┘
Binary Frame Protocol
Gateway uses the Router binary protocol for internal communication:
| Frame Type | Purpose |
|---|---|
| HELLO | Microservice registration with endpoints |
| HEARTBEAT | Health check and latency measurement |
| REQUEST | HTTP request serialized to binary |
| RESPONSE | HTTP response serialized from binary |
| STREAM_DATA | Streaming response chunks |
| CANCEL | Request cancellation propagation |
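The table above names the frame types but not the wire layout. As an illustration only, the sketch below models the six frame types as an enum and wraps a payload in a length-prefixed envelope. The numeric values, the 5-byte header, and the `FrameCodec` name are all assumptions made for this example; the actual Router protocol is defined in the Router Architecture document and may look quite different.

```csharp
using System;
using System.Buffers.Binary;

// Frame kinds named in the table above; the numeric values are illustrative only.
public enum FrameType : byte
{
    Hello = 1,
    Heartbeat = 2,
    Request = 3,
    Response = 4,
    StreamData = 5,
    Cancel = 6,
}

// Hypothetical envelope: [1 byte type][4 bytes big-endian payload length][payload].
public static class FrameCodec
{
    public const int HeaderSize = 5;

    public static byte[] Encode(FrameType type, ReadOnlySpan<byte> payload)
    {
        var buffer = new byte[HeaderSize + payload.Length];
        buffer[0] = (byte)type;
        BinaryPrimitives.WriteInt32BigEndian(buffer.AsSpan(1, 4), payload.Length);
        payload.CopyTo(buffer.AsSpan(HeaderSize));
        return buffer;
    }

    // Returns false when the buffer does not yet hold one complete frame.
    public static bool TryDecode(ReadOnlySpan<byte> buffer, out FrameType type, out byte[] payload)
    {
        type = default;
        payload = Array.Empty<byte>();
        if (buffer.Length < HeaderSize) return false;

        int length = BinaryPrimitives.ReadInt32BigEndian(buffer.Slice(1, 4));
        if (length < 0 || buffer.Length - HeaderSize < length) return false;

        type = (FrameType)buffer[0];
        payload = buffer.Slice(HeaderSize, length).ToArray();
        return true;
    }
}
```

A length prefix lets the receiver tell a partial read from a complete frame, which matters on a stream transport such as TCP or TLS, where one read may return less than a whole frame.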
Endpoint Descriptor
public sealed class EndpointDescriptor
{
    public required string Method { get; init; }        // GET, POST, etc.
    public required string Path { get; init; }          // /api/v1/scans/{id}
    public required string ServiceName { get; init; }   // scanner
    public required string Version { get; init; }       // 1.0.0
    public TimeSpan DefaultTimeout { get; init; }       // 30s
    public bool SupportsStreaming { get; init; }        // true for large responses
    public IReadOnlyList<ClaimRequirement> RequiringClaims { get; init; }
}
Routing State
public interface IRoutingStateManager
{
    ValueTask RegisterEndpointsAsync(ConnectionState conn, HelloPayload hello);
    ValueTask<InstanceSelection?> SelectInstanceAsync(string method, string path);
    ValueTask UpdateHealthAsync(ConnectionState conn, HeartbeatPayload heartbeat);
    ValueTask DrainConnectionAsync(string connectionId);
}
4) REST API
Gateway exposes minimal management endpoints; all business APIs are routed to microservices.
Health Endpoints
| Endpoint | Auth | Description |
|---|---|---|
| GET /health/live | None | Liveness probe |
| GET /health/ready | None | Readiness probe |
| GET /health/startup | None | Startup probe |
| GET /metrics | None | Prometheus metrics |
OpenAPI Endpoints
| Endpoint | Auth | Description |
|---|---|---|
| GET /openapi.json | None | Aggregated OpenAPI 3.1.0 spec |
| GET /openapi.yaml | None | YAML format spec |
5) Execution Flow
Request Routing
sequenceDiagram
    participant C as Client
    participant G as Gateway
    participant A as Authority
    participant M as Microservice
    C->>G: HTTPS Request + DPoP Token
    G->>A: Validate Token
    A-->>G: Claims (sub, tid, scope)
    G->>G: Select Instance (Method, Path)
    G->>M: Binary REQUEST Frame
    M-->>G: Binary RESPONSE Frame
    G-->>C: HTTPS Response
Microservice Registration
sequenceDiagram
    participant M as Microservice
    participant G as Gateway
    M->>G: TCP/TLS Connect
    M->>G: HELLO (ServiceName, Version, Endpoints)
    G->>G: Register Endpoints
    G-->>M: HELLO ACK
    loop Every 10s
        G->>M: HEARTBEAT
        M-->>G: HEARTBEAT (latency, health)
        G->>G: Update Health State
    end
6) Instance Selection Algorithm
public ValueTask<InstanceSelection?> SelectInstanceAsync(string method, string path)
{
    // 1. Find all endpoints matching (method, path)
    var candidates = _endpoints
        .Where(e => e.Method == method && MatchPath(e.Path, path))
        .ToList();

    // 2. Filter by health: keep Healthy and Degraded instances only
    candidates = candidates
        .Where(c => c.Health is InstanceHealthStatus.Healthy or InstanceHealthStatus.Degraded)
        .ToList();

    // 3. Region preference: local region first, then neighbors, then everything else
    var localRegion = candidates.Where(c => c.Region == _config.Region).ToList();
    var neighborRegions = candidates.Where(c => _config.NeighborRegions.Contains(c.Region)).ToList();
    var otherRegions = candidates.Except(localRegion).Except(neighborRegions).ToList();
    var preferred = localRegion.Any() ? localRegion
        : neighborRegions.Any() ? neighborRegions
        : otherRegions;

    // 4. Within tier: prefer lower latency, then most recent heartbeat
    var selected = preferred
        .OrderBy(c => c.AveragePingMs)
        .ThenByDescending(c => c.LastHeartbeatUtc)
        .FirstOrDefault();

    // Selection is synchronous, so wrap the result to satisfy the ValueTask return type
    return new ValueTask<InstanceSelection?>(selected);
}
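The selection code calls `MatchPath` without showing it. The following is a minimal sketch of what such a matcher could look like, assuming templates use `{name}` placeholders as in `/api/v1/scans/{id}` and that a placeholder matches exactly one path segment. `RouteTemplate` is a hypothetical name, and the Router's real matcher may differ, for example in case sensitivity or support for catch-all segments.

```csharp
using System;

public static class RouteTemplate
{
    // Hypothetical stand-in for MatchPath: compares a template such as
    // "/api/v1/scans/{id}" with a concrete request path, segment by segment.
    public static bool MatchPath(string template, string path)
    {
        var templateSegments = template.Split('/', StringSplitOptions.RemoveEmptyEntries);
        var pathSegments = path.Split('/', StringSplitOptions.RemoveEmptyEntries);

        // A placeholder matches exactly one segment, so the counts must agree.
        if (templateSegments.Length != pathSegments.Length) return false;

        for (var i = 0; i < templateSegments.Length; i++)
        {
            var segment = templateSegments[i];
            var isPlaceholder = segment.Length > 2 && segment[0] == '{' && segment[^1] == '}';
            if (isPlaceholder) continue; // "{id}" accepts any single segment

            // Literal segments are compared case-insensitively (an assumption of this sketch).
            if (!string.Equals(segment, pathSegments[i], StringComparison.OrdinalIgnoreCase))
            {
                return false;
            }
        }

        return true;
    }
}
```

With this matcher, `/api/v1/scans/{id}` matches `/api/v1/scans/xyz` but not `/api/v1/scans` or `/api/v1/policies/xyz`.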
7) Configuration
gateway:
  node:
    region: "eu1"
    nodeId: "gw-eu1-01"
    environment: "prod"
  transports:
    tcp:
      enabled: true
      port: 9100
      maxConnections: 1000
      receiveBufferSize: 65536
      sendBufferSize: 65536
    tls:
      enabled: true
      port: 9443
      certificatePath: "/certs/gateway.pfx"
      certificatePassword: "${GATEWAY_CERT_PASSWORD}"
      clientCertificateMode: "RequireCertificate"
      allowedClientCertificateThumbprints: []
  routing:
    defaultTimeout: "30s"
    maxRequestBodySize: "100MB"
    streamingEnabled: true
    streamingBufferSize: 16384
    neighborRegions: ["eu2", "us1"]
  auth:
    dpopEnabled: true
    dpopMaxClockSkew: "60s"
    mtlsEnabled: true
  rateLimiting:
    enabled: true
    requestsPerMinute: 1000
    burstSize: 100
    redisConnectionString: "${REDIS_URL}"  # Valkey (Redis-compatible)
  openapi:
    enabled: true
    cacheTtlSeconds: 300
    title: "Stella Ops API"
    version: "1.0.0"
  health:
    heartbeatIntervalSeconds: 10
    heartbeatTimeoutSeconds: 30
    unhealthyThreshold: 3
8) Scale & Performance
| Metric | Target | Notes |
|---|---|---|
| Routing latency (P50) | <2ms | Overhead only; excludes downstream |
| Routing latency (P99) | <5ms | Under normal load |
| Concurrent connections | 10,000 | Per gateway instance |
| Requests/second | 50,000 | Per gateway instance |
| Memory footprint | <512MB | Base; scales with connections |
Scaling Strategy
- Horizontal scaling behind load balancer
- Sticky sessions NOT required (stateless)
- Regional deployment for latency optimization
- Rate limiting via distributed Valkey/Redis
9) Security Posture
Authentication
| Method | Description |
|---|---|
| DPoP | Proof-of-possession tokens from Authority |
| mTLS | Certificate-bound tokens for machine clients |
Authorization
- Claims-based authorization per endpoint
- Required claims defined in endpoint descriptors
- Tenant isolation via tid claim
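The authorization rules above can be sketched as a per-endpoint claims check plus a tenant lookup. The source names `ClaimRequirement` but does not define it, so the record shape below is an assumption, as is the `EndpointAuthorization` helper. The sketch also treats each claim value as a single string, whereas a real `scope` claim may be a space-delimited list that needs splitting first.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Security.Claims;

// Hypothetical shape; the source names ClaimRequirement but does not define it.
public sealed record ClaimRequirement(string Type, string? Value = null);

public static class EndpointAuthorization
{
    // True when the principal carries every required claim.
    // A requirement with a null Value only demands that the claim type is present.
    public static bool Satisfies(ClaimsPrincipal user, IReadOnlyList<ClaimRequirement> required) =>
        required.All(r => user.Claims.Any(c =>
            c.Type == r.Type && (r.Value is null || c.Value == r.Value)));

    // Reads the tenant identifier from the "tid" claim; null means no tenant context.
    public static string? TenantOf(ClaimsPrincipal user) => user.FindFirst("tid")?.Value;
}
```

A request whose principal lacks a `tid` claim has no tenant context, so a gateway following this model would reject it before routing rather than forward it without a tenant.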
Transport Security
| Component | Encryption |
|---|---|
| Client → Gateway | TLS 1.3 (HTTPS) |
| Gateway → Microservices | TLS (prod), TCP (dev only) |
Rate Limiting
Gateway uses the Router's dual-window rate limiting middleware with circuit breaker:
- Instance-level (in-memory): Per-router-instance limits using sliding window counters
  - High-precision sub-second buckets for fair rate distribution
  - No external dependencies; always available
- Environment-level (Valkey-backed): Cross-instance limits for distributed deployments
  - Atomic Lua scripts for consistent counting across instances
  - Circuit breaker pattern for fail-open behavior when Valkey is unavailable
- Activation gate: Environment-level checks only activate above traffic threshold (configurable)
- Response headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, Retry-After
Configuration via appsettings.yaml:
rate_limiting:
  process_back_pressure_when_more_than_per_5min: 5000
  for_instance:
    rules:
      - max_requests: 100
        per_seconds: 1
      - max_requests: 1000
        per_seconds: 60
  for_environment:
    valkey_connection: "localhost:6379"
    rules:
      - max_requests: 10000
        per_seconds: 60
    circuit_breaker:
      failure_threshold: 3
      timeout_seconds: 30
      half_open_timeout: 10
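To make the instance-level limiter described above more concrete, here is a minimal sketch of a sliding-window counter that splits the window into sub-second buckets. It is not the Router's middleware: `SlidingWindowCounter`, its constructor parameters, and the bucket count are assumptions for illustration. It enforces a single rule and is not thread-safe.

```csharp
using System;

// Single-rule sliding-window counter. Not thread-safe: a production limiter
// would need locking or atomic operations around TryAcquire.
public sealed class SlidingWindowCounter
{
    private readonly int _maxRequests;
    private readonly long _bucketTicks;
    private readonly int[] _counts;
    private readonly long[] _slots; // time slot currently stored in each bucket

    public SlidingWindowCounter(int maxRequests, TimeSpan window, int bucketCount = 10)
    {
        _maxRequests = maxRequests;
        _bucketTicks = Math.Max(1, window.Ticks / bucketCount);
        _counts = new int[bucketCount];
        _slots = new long[bucketCount];
        Array.Fill(_slots, -1L); // -1 marks a bucket that has never been used
    }

    // Returns true and records the request when it fits within the window.
    public bool TryAcquire(DateTimeOffset now)
    {
        long slot = now.UtcTicks / _bucketTicks;
        int index = (int)(slot % _counts.Length);

        // Reuse the bucket if it still holds an older time slot.
        if (_slots[index] != slot)
        {
            _slots[index] = slot;
            _counts[index] = 0;
        }

        // Sum only the buckets whose slot lies inside the current window.
        int total = 0;
        for (int i = 0; i < _counts.Length; i++)
        {
            if (_slots[i] >= 0 && slot - _slots[i] < _counts.Length)
            {
                total += _counts[i];
            }
        }

        if (total >= _maxRequests) return false;
        _counts[index]++;
        return true;
    }
}
```

The two `for_instance` rules in the configuration (100 requests per second and 1000 per 60 seconds) could be modelled as two such counters that must both admit a request. Smaller buckets track the window edge more precisely at the cost of more memory per key.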
10) Observability & Audit
Metrics (Prometheus)
gateway_requests_total{service,method,path,status}
gateway_request_duration_seconds{service,method,path,quantile}
gateway_active_connections{service}
gateway_transport_frames_total{type}
gateway_auth_failures_total{reason}
gateway_rate_limit_exceeded_total{tenant}
Traces (OpenTelemetry)
- Span per request: gateway.route
- Child span: gateway.auth.validate
- Child span: gateway.transport.send
Logs (Structured)
{
  "timestamp": "2025-12-21T10:00:00Z",
  "level": "info",
  "message": "Request routed",
  "correlationId": "abc123",
  "tenantId": "tenant-1",
  "method": "GET",
  "path": "/api/v1/scans/xyz",
  "service": "scanner",
  "durationMs": 45,
  "status": 200
}
11) Testing Matrix
| Test Type | Scope | Coverage Target |
|---|---|---|
| Unit | Routing algorithm, auth validation | 90% |
| Integration | Transport + routing flow | 80% |
| E2E | Full request path with mock services | Key flows |
| Performance | Latency, throughput, connection limits | SLO targets |
| Chaos | Connection failures, microservice crashes | Resilience |
Test Fixtures
- StellaOps.Router.Transport.InMemory for transport mocking
- Mock Authority for auth testing
- WebApplicationFactory for integration tests
12) DevOps & Operations
Deployment
# Kubernetes deployment excerpt
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gateway
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: gateway
          image: stellaops/gateway:1.0.0
          ports:
            - containerPort: 8080  # HTTPS
            - containerPort: 9443  # TLS (microservices)
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "1000m"
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
SLOs
| SLO | Target | Measurement |
|---|---|---|
| Availability | 99.9% | Uptime over 30 days |
| Latency P99 | <50ms | Includes downstream |
| Error rate | <0.1% | 5xx responses |
13) Roadmap
| Feature | Sprint | Status |
|---|---|---|
| Core implementation | 3600.0001.0001 | TODO |
| Performance Testing Pipeline | 038 | DONE |
| WebSocket support | Future | Planned |
| gRPC passthrough | Future | Planned |
| GraphQL aggregation | Future | Exploration |
14) Performance Testing Pipeline (k6 + Prometheus + Correlation IDs)
Overview
The Gateway includes a comprehensive performance testing pipeline with k6 load tests, Prometheus metric instrumentation, and Grafana dashboards for performance curve modelling.
k6 Scenarios (A–G)
| Scenario | Purpose | VUs | Duration | Key Metric |
|---|---|---|---|---|
| A — Health Baseline | Sub-ms health probe overhead | 10 | 1 min | P95 < 10 ms |
| B — OpenAPI Aggregation | Spec cache under concurrent readers | 50 | 75 s | P95 < 200 ms |
| C — Routing Throughput | Mixed-method routing at target RPS | 200 | 2 min | P50 < 2 ms, P99 < 5 ms |
| D — Correlation ID | Propagation overhead measurement | 20 | 1 min | P95 < 5 ms overhead |
| E — Rate Limit Boundary | Enforcement correctness at boundary | 100 | 1 min | Retry-After header |
| F — Connection Ramp | Transport saturation (ramp to 1000 VUs) | 1000 | 2 min | No 503 responses |
| G — Steady-State Soak | Memory leak / resource exhaustion | 50 | 10 min | Stable memory |
Run all scenarios:
k6 run --env BASE_URL=https://gateway.stella-ops.local src/Gateway/__Tests/load/gateway_performance.k6.js
Run a single scenario:
k6 run --env BASE_URL=https://gateway.stella-ops.local --env SCENARIO=scenario_c_routing_throughput src/Gateway/__Tests/load/gateway_performance.k6.js
Performance Metrics (GatewayPerformanceMetrics)
Meter: StellaOps.Gateway.Performance
| Instrument | Type | Unit | Description |
|---|---|---|---|
| gateway.requests.total | Counter | - | Total requests processed |
| gateway.errors.total | Counter | - | Errors (4xx/5xx) |
| gateway.ratelimit.total | Counter | - | Rate-limited requests (429) |
| gateway.request.duration | Histogram | ms | Full request duration |
| gateway.auth.duration | Histogram | ms | Auth middleware duration |
| gateway.transport.duration | Histogram | ms | TCP/TLS transport duration |
| gateway.routing.duration | Histogram | ms | Instance selection duration |
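As a rough illustration of how instruments like those listed above can be created with `System.Diagnostics.Metrics`, the sketch below registers a subset of them on the `StellaOps.Gateway.Performance` meter. The class name `GatewayPerformanceMetricsSketch`, the `RecordRequest` method, and the `service` tag are assumptions for this example and do not describe the actual `GatewayPerformanceMetrics` service.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Illustrative holder for a few of the instruments in the table above.
public sealed class GatewayPerformanceMetricsSketch : IDisposable
{
    private readonly Meter _meter = new("StellaOps.Gateway.Performance");
    private readonly Counter<long> _requests;
    private readonly Counter<long> _errors;
    private readonly Counter<long> _rateLimited;
    private readonly Histogram<double> _requestDuration;

    public GatewayPerformanceMetricsSketch()
    {
        _requests = _meter.CreateCounter<long>("gateway.requests.total");
        _errors = _meter.CreateCounter<long>("gateway.errors.total");
        _rateLimited = _meter.CreateCounter<long>("gateway.ratelimit.total");
        _requestDuration = _meter.CreateHistogram<double>("gateway.request.duration", unit: "ms");
    }

    // Records one completed request; the "service" tag is an assumed dimension.
    public void RecordRequest(string service, int statusCode, double durationMs)
    {
        var tag = new KeyValuePair<string, object?>("service", service);

        _requests.Add(1, tag);
        if (statusCode >= 400) _errors.Add(1, tag);       // 4xx/5xx count as errors
        if (statusCode == 429) _rateLimited.Add(1, tag);  // rate-limited responses
        _requestDuration.Record(durationMs, tag);
    }

    public void Dispose() => _meter.Dispose();
}
```

An OpenTelemetry or Prometheus exporter subscribed to the meter name would then pick these instruments up. The remaining histograms (auth, transport, routing) would follow the same pattern.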
Grafana Dashboard
Dashboard: devops/telemetry/dashboards/stella-ops-gateway-performance.json
UID: stella-ops-gateway-performance
Panels:
- Overview row — P50/P99 gauges, error rate, RPS
- Latency Distribution — Percentile time series (overall + per-service)
- Throughput & Rate Limiting — RPS by service, rate-limited requests by route
- Pipeline Breakdown — Auth/Routing/Transport P95 breakdown, errors by status
- Connections & Resources — Active connections, endpoints, memory usage
C# Models
| Type | Purpose |
|---|---|
| GatewayPerformanceObservation | Single request observation (all pipeline phases) |
| PerformanceScenarioConfig | Scenario definition with SLO thresholds |
| PerformanceCurvePoint | Aggregated window data with computed RPS/error rate |
| PerformanceTestSummary | Complete test run result with threshold violations |
| GatewayPerformanceMetrics | OTel service emitting Prometheus-compatible metrics |
15) References
- Router Architecture: docs/modules/router/architecture.md
- Gateway Identity Header Policy: docs/modules/gateway/identity-header-policy.md
- OpenAPI Aggregation: docs/modules/gateway/openapi.md
- Router ASP.NET Endpoint Bridge: docs/modules/router/aspnet-endpoint-bridge.md
- Router Messaging (Valkey) Transport: docs/modules/router/messaging-valkey-transport.md
- Authority Integration: docs/modules/authority/architecture.md
- Reference Architecture: docs/product/advisories/archived/2025-12-21-reference-architecture/
Last Updated: 2025-12-21 (Sprint 3600)