# component_architecture_gateway.md — **Stella Ops Gateway** (Sprint 3600)

> Derived from Reference Architecture Advisory and Router Architecture Specification

> **Scope.** The Gateway WebService is the single HTTP ingress point for all external traffic. It authenticates requests via Authority (DPoP/mTLS), routes to microservices via the Router binary protocol, aggregates OpenAPI specifications, and enforces tenant isolation.
> **Ownership:** Platform Guild

---
## 0) Mission & Boundaries

### What Gateway Does

- **HTTP Ingress**: Single entry point for all external HTTP/HTTPS traffic
- **Authentication**: DPoP and mTLS token validation via Authority integration
- **Routing**: Routes HTTP requests to microservices via binary protocol (TCP/TLS)
- **OpenAPI Aggregation**: Combines endpoint specs from all registered microservices
- **Health Aggregation**: Provides unified health status from downstream services
- **Rate Limiting**: Per-tenant and per-identity request throttling
- **Tenant Propagation**: Extracts tenant context and propagates to microservices

### What Gateway Does NOT Do

- **Business Logic**: No domain logic; pure routing and auth
- **Data Storage**: Stateless; no persistent state beyond connection cache
- **Direct Database Access**: Never connects to PostgreSQL directly
- **SBOM/VEX Processing**: Delegates to Scanner, Excititor, etc.

---
## 1) Solution & Project Layout

```
src/Gateway/
├── StellaOps.Gateway.WebService/
│   ├── StellaOps.Gateway.WebService.csproj
│   ├── Program.cs                            # DI bootstrap, transport init
│   ├── Dockerfile
│   ├── appsettings.json
│   ├── appsettings.Development.json
│   ├── Configuration/
│   │   ├── GatewayOptions.cs                 # All configuration options
│   │   └── TransportOptions.cs               # TCP/TLS transport config
│   ├── Middleware/
│   │   ├── TenantMiddleware.cs               # Tenant context extraction
│   │   ├── RequestRoutingMiddleware.cs       # HTTP → binary routing
│   │   ├── SenderConstraintMiddleware.cs     # DPoP/mTLS validation
│   │   ├── IdentityHeaderPolicyMiddleware.cs # Identity header sanitization
│   │   ├── CorrelationIdMiddleware.cs        # Request correlation
│   │   └── HealthCheckMiddleware.cs          # Health probe handling
│   ├── Services/
│   │   ├── GatewayHostedService.cs           # Transport lifecycle
│   │   ├── OpenApiAggregationService.cs      # Spec aggregation
│   │   └── HealthAggregationService.cs       # Downstream health
│   └── Endpoints/
│       ├── HealthEndpoints.cs                # /health/*, /metrics
│       └── OpenApiEndpoints.cs               # /openapi.json, /openapi.yaml
```

### Dependencies

```xml
<ItemGroup>
  <ProjectReference Include="..\..\__Libraries\StellaOps.Router.Gateway\..." />
  <ProjectReference Include="..\..\__Libraries\StellaOps.Router.Transport.Tcp\..." />
  <ProjectReference Include="..\..\__Libraries\StellaOps.Router.Transport.Tls\..." />
  <ProjectReference Include="..\..\Auth\StellaOps.Auth.ServerIntegration\..." />
</ItemGroup>
```

---
## 2) External Dependencies

| Dependency | Purpose | Required |
|------------|---------|----------|
| **Authority** | OpTok validation, DPoP/mTLS | Yes |
| **Router.Gateway** | Routing state, endpoint discovery | Yes |
| **Router.Transport.Tcp** | Binary transport (dev) | Yes |
| **Router.Transport.Tls** | Binary transport (prod) | Yes |
| **Valkey/Redis** | Rate limiting state | Optional |

---
## 3) Contracts & Data Model

### Request Flow

```
┌──────────────┐     HTTPS      ┌─────────────────┐    Binary     ┌─────────────────┐
│   Client     │ ─────────────► │    Gateway      │ ────────────► │  Microservice   │
│  (CLI/UI)    │                │   WebService    │    Frame      │  (Scanner,      │
│              │ ◄───────────── │                 │ ◄──────────── │   Policy, etc)  │
└──────────────┘     HTTPS      └─────────────────┘    Binary     └─────────────────┘
```

### Binary Frame Protocol

Gateway uses the Router binary protocol for internal communication:

| Frame Type | Purpose |
|------------|---------|
| HELLO | Microservice registration with endpoints |
| HEARTBEAT | Health check and latency measurement |
| REQUEST | HTTP request serialized to binary |
| RESPONSE | HTTP response serialized from binary |
| STREAM_DATA | Streaming response chunks |
| CANCEL | Request cancellation propagation |

### Endpoint Descriptor

```csharp
public sealed class EndpointDescriptor
{
    public required string Method { get; init; }        // GET, POST, etc.
    public required string Path { get; init; }          // /api/v1/scans/{id}
    public required string ServiceName { get; init; }   // scanner
    public required string Version { get; init; }       // 1.0.0
    public TimeSpan DefaultTimeout { get; init; }       // 30s
    public bool SupportsStreaming { get; init; }        // true for large responses
    public IReadOnlyList<ClaimRequirement> RequiringClaims { get; init; }
}
```
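
How `RequiringClaims` is evaluated is not shown in this document. The following is a minimal sketch, assuming a hypothetical `ClaimRequirement` shape (a claim type plus an optional required value) and assuming that every requirement must be satisfied. The real type is defined in the Router libraries and may differ; `ClaimRequirementEvaluator` is an illustrative name only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Claims;

// Hypothetical shape: the real ClaimRequirement is defined elsewhere and may differ.
public sealed record ClaimRequirement(string Type, string? Value = null);

public static class ClaimRequirementEvaluator
{
    // Returns true when the principal satisfies every requirement.
    // A requirement with a null Value is satisfied by any claim of that type.
    // An empty requirement list is trivially satisfied.
    public static bool Satisfies(
        ClaimsPrincipal principal,
        IReadOnlyList<ClaimRequirement> requirements)
    {
        return requirements.All(req =>
            principal.Claims.Any(c =>
                string.Equals(c.Type, req.Type, StringComparison.Ordinal) &&
                (req.Value is null ||
                 string.Equals(c.Value, req.Value, StringComparison.Ordinal))));
    }
}
```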
### Routing State

```csharp
public interface IRoutingStateManager
{
    ValueTask RegisterEndpointsAsync(ConnectionState conn, HelloPayload hello);
    ValueTask<InstanceSelection?> SelectInstanceAsync(string method, string path);
    ValueTask UpdateHealthAsync(ConnectionState conn, HeartbeatPayload heartbeat);
    ValueTask DrainConnectionAsync(string connectionId);
}
```

---
## 4) REST API

Gateway exposes minimal management endpoints; all business APIs are routed to microservices.

### Health Endpoints

| Endpoint | Auth | Description |
|----------|------|-------------|
| `GET /health/live` | None | Liveness probe |
| `GET /health/ready` | None | Readiness probe |
| `GET /health/startup` | None | Startup probe |
| `GET /metrics` | None | Prometheus metrics |

### OpenAPI Endpoints

| Endpoint | Auth | Description |
|----------|------|-------------|
| `GET /openapi.json` | None | Aggregated OpenAPI 3.1.0 spec |
| `GET /openapi.yaml` | None | YAML format spec |

---
## 5) Execution Flow

### Request Routing

```mermaid
sequenceDiagram
    participant C as Client
    participant G as Gateway
    participant A as Authority
    participant M as Microservice

    C->>G: HTTPS Request + DPoP Token
    G->>A: Validate Token
    A-->>G: Claims (sub, tid, scope)
    G->>G: Select Instance (Method, Path)
    G->>M: Binary REQUEST Frame
    M-->>G: Binary RESPONSE Frame
    G-->>C: HTTPS Response
```

### Microservice Registration

```mermaid
sequenceDiagram
    participant M as Microservice
    participant G as Gateway

    M->>G: TCP/TLS Connect
    M->>G: HELLO (ServiceName, Version, Endpoints)
    G->>G: Register Endpoints
    G-->>M: HELLO ACK

    loop Every 10s
        G->>M: HEARTBEAT
        M-->>G: HEARTBEAT (latency, health)
        G->>G: Update Health State
    end
```
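
The "Update Health State" step is not specified further here. One plausible reading, sketched below under stated assumptions, classifies an instance by how many heartbeat intervals have elapsed since its last heartbeat. The `Unhealthy` state, the `HeartbeatHealth` class, and the exact thresholds are assumptions for illustration; only `Healthy` and `Degraded` appear elsewhere in this document. A production version would likely add a tolerance for late heartbeats.

```csharp
using System;

// Hypothetical mirror of the health states; only Healthy and Degraded
// are named in this document.
public enum InstanceHealthStatus { Healthy, Degraded, Unhealthy }

public static class HeartbeatHealth
{
    // Classifies an instance by the number of whole heartbeat intervals
    // that have elapsed since the last heartbeat was received.
    public static InstanceHealthStatus Classify(
        DateTimeOffset lastHeartbeatUtc,
        DateTimeOffset nowUtc,
        TimeSpan heartbeatInterval,
        int unhealthyThreshold)
    {
        if (heartbeatInterval <= TimeSpan.Zero)
        {
            throw new ArgumentOutOfRangeException(nameof(heartbeatInterval));
        }

        var elapsed = nowUtc - lastHeartbeatUtc;
        if (elapsed < TimeSpan.Zero)
        {
            elapsed = TimeSpan.Zero; // tolerate small clock differences
        }

        long missed = elapsed.Ticks / heartbeatInterval.Ticks;

        if (missed >= unhealthyThreshold) return InstanceHealthStatus.Unhealthy;
        if (missed >= 1) return InstanceHealthStatus.Degraded;
        return InstanceHealthStatus.Healthy;
    }
}
```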
---

## 6) Instance Selection Algorithm

```csharp
public ValueTask<InstanceSelection?> SelectInstanceAsync(string method, string path)
{
    // 1. Find all endpoints matching (method, path)
    var candidates = _endpoints
        .Where(e => e.Method == method && MatchPath(e.Path, path))
        .ToList();

    // 2. Filter by health
    candidates = candidates
        .Where(c => c.Health is InstanceHealthStatus.Healthy or InstanceHealthStatus.Degraded)
        .ToList();

    // 3. Region preference
    var localRegion = candidates.Where(c => c.Region == _config.Region).ToList();
    var neighborRegions = candidates.Where(c => _config.NeighborRegions.Contains(c.Region)).ToList();
    var otherRegions = candidates.Except(localRegion).Except(neighborRegions).ToList();

    var preferred = localRegion.Any() ? localRegion
                  : neighborRegions.Any() ? neighborRegions
                  : otherRegions;

    // 4. Within tier: prefer lower latency, then most recent heartbeat
    var selected = preferred
        .OrderBy(c => c.AveragePingMs)
        .ThenByDescending(c => c.LastHeartbeatUtc)
        .FirstOrDefault();

    // Selection is synchronous, so wrap the result to match the ValueTask signature.
    return ValueTask.FromResult<InstanceSelection?>(selected);
}
```
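
The `MatchPath` helper used in step 1 is not defined in this document. A minimal sketch of template matching for paths such as `/api/v1/scans/{id}` follows. It assumes `{name}` segments match exactly one non-empty path segment, that literal segments compare case-insensitively, and that empty segments (such as a trailing slash) are ignored. All three are assumptions, the query string is not handled, and the `RouteTemplate` class name is illustrative.

```csharp
using System;

public static class RouteTemplate
{
    // Illustrative stand-in for the MatchPath helper referenced above.
    // Segments wrapped in {braces} match any single non-empty segment;
    // every other segment must match literally (case-insensitive here).
    public static bool MatchPath(string template, string path)
    {
        var templateSegments = template.Split('/', StringSplitOptions.RemoveEmptyEntries);
        var pathSegments = path.Split('/', StringSplitOptions.RemoveEmptyEntries);

        if (templateSegments.Length != pathSegments.Length)
        {
            return false;
        }

        for (var i = 0; i < templateSegments.Length; i++)
        {
            var segment = templateSegments[i];
            var isParameter = segment.Length > 2 && segment[0] == '{' && segment[^1] == '}';
            if (isParameter)
            {
                continue; // any non-empty segment satisfies a parameter
            }

            if (!string.Equals(segment, pathSegments[i], StringComparison.OrdinalIgnoreCase))
            {
                return false;
            }
        }

        return true;
    }
}
```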
---

## 7) Configuration

```yaml
gateway:
  node:
    region: "eu1"
    nodeId: "gw-eu1-01"
    environment: "prod"

  transports:
    tcp:
      enabled: true
      port: 9100
      maxConnections: 1000
      receiveBufferSize: 65536
      sendBufferSize: 65536
    tls:
      enabled: true
      port: 9443
      certificatePath: "/certs/gateway.pfx"
      certificatePassword: "${GATEWAY_CERT_PASSWORD}"
      clientCertificateMode: "RequireCertificate"
      allowedClientCertificateThumbprints: []

  routing:
    defaultTimeout: "30s"
    maxRequestBodySize: "100MB"
    streamingEnabled: true
    streamingBufferSize: 16384
    neighborRegions: ["eu2", "us1"]

  auth:
    dpopEnabled: true
    dpopMaxClockSkew: "60s"
    mtlsEnabled: true
    rateLimiting:
      enabled: true
      requestsPerMinute: 1000
      burstSize: 100
      redisConnectionString: "${REDIS_URL}"  # Valkey (Redis-compatible)

  openapi:
    enabled: true
    cacheTtlSeconds: 300
    title: "Stella Ops API"
    version: "1.0.0"

  health:
    heartbeatIntervalSeconds: 10
    heartbeatTimeoutSeconds: 30
    unhealthyThreshold: 3
```

---
## 8) Scale & Performance

| Metric | Target | Notes |
|--------|--------|-------|
| Routing latency (P50) | <2ms | Overhead only; excludes downstream |
| Routing latency (P99) | <5ms | Under normal load |
| Concurrent connections | 10,000 | Per gateway instance |
| Requests/second | 50,000 | Per gateway instance |
| Memory footprint | <512MB | Base; scales with connections |

### Scaling Strategy

- Horizontal scaling behind load balancer
- Sticky sessions NOT required (stateless)
- Regional deployment for latency optimization
- Rate limiting via distributed Valkey/Redis

---
## 9) Security Posture

### Authentication

| Method | Description |
|--------|-------------|
| DPoP | Proof-of-possession tokens from Authority |
| mTLS | Certificate-bound tokens for machine clients |

### Authorization

- Claims-based authorization per endpoint
- Required claims defined in endpoint descriptors
- Tenant isolation via `tid` claim

### Transport Security

| Component | Encryption |
|-----------|------------|
| Client → Gateway | TLS 1.3 (HTTPS) |
| Gateway → Microservices | TLS (prod), TCP (dev only) |

### Rate Limiting

Gateway uses the Router's dual-window rate limiting middleware with circuit breaker:

- **Instance-level** (in-memory): Per-router-instance limits using sliding window counters
  - High-precision sub-second buckets for fair rate distribution
  - No external dependencies; always available
- **Environment-level** (Valkey-backed): Cross-instance limits for distributed deployments
  - Atomic Lua scripts for consistent counting across instances
  - Circuit breaker pattern for fail-open behavior when Valkey is unavailable
- **Activation gate**: Environment-level checks only activate above traffic threshold (configurable)
- **Response headers**: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, Retry-After

Configuration via `appsettings.yaml`:
```yaml
rate_limiting:
  process_back_pressure_when_more_than_per_5min: 5000
  for_instance:
    rules:
      - max_requests: 100
        per_seconds: 1
      - max_requests: 1000
        per_seconds: 60
  for_environment:
    valkey_connection: "localhost:6379"
    rules:
      - max_requests: 10000
        per_seconds: 60
    circuit_breaker:
      failure_threshold: 3
      timeout_seconds: 30
      half_open_timeout: 10
```
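
To make the two layers concrete, here is a minimal sketch of a bucketed sliding-window counter for the instance-level limit, combined with a fail-open circuit breaker around an environment-level check. It is not the Router's implementation: `SlidingWindowCounter`, `DualWindowLimiter`, and the `environmentCheck` delegate (standing in for the Valkey-backed counter) are hypothetical names. The activation gate and `half_open_timeout` are omitted, the window is accurate only to one bucket width, and the code is not thread-safe.

```csharp
using System;

// Illustrative sketch only; names are hypothetical, not the Router's API.
public sealed class SlidingWindowCounter
{
    private readonly int _maxRequests;
    private readonly long _bucketTicks;
    private readonly long[] _stamps; // which time bucket each slot currently holds
    private readonly int[] _counts;

    public SlidingWindowCounter(int maxRequests, TimeSpan window, int buckets)
    {
        if (buckets <= 0) throw new ArgumentOutOfRangeException(nameof(buckets));
        _maxRequests = maxRequests;
        _bucketTicks = Math.Max(1, window.Ticks / buckets);
        _stamps = new long[buckets];
        _counts = new int[buckets];
    }

    // Records the request and returns true if the window still has capacity.
    public bool TryAcquire(DateTimeOffset now)
    {
        var current = now.UtcTicks / _bucketTicks;
        var total = 0;
        for (var i = 0; i < _counts.Length; i++)
        {
            var age = current - _stamps[i];
            if (age >= 0 && age < _counts.Length) total += _counts[i]; // skip stale slots
        }
        if (total >= _maxRequests) return false;

        var slot = (int)(current % _counts.Length);
        if (_stamps[slot] != current) { _stamps[slot] = current; _counts[slot] = 0; }
        _counts[slot]++;
        return true;
    }
}

public sealed class DualWindowLimiter
{
    private readonly SlidingWindowCounter _instance;
    private readonly Func<bool> _environmentCheck; // may throw when the backend is down
    private readonly int _failureThreshold;
    private readonly TimeSpan _openTimeout;
    private int _failures;
    private DateTimeOffset? _openedAt;

    public DualWindowLimiter(SlidingWindowCounter instance, Func<bool> environmentCheck,
        int failureThreshold, TimeSpan openTimeout)
    {
        _instance = instance;
        _environmentCheck = environmentCheck;
        _failureThreshold = failureThreshold;
        _openTimeout = openTimeout;
    }

    public bool Allow(DateTimeOffset now)
    {
        // The in-memory limit always applies.
        if (!_instance.TryAcquire(now)) return false;

        // Circuit open: skip the backend entirely and fail open.
        if (_openedAt is { } opened && now - opened < _openTimeout) return true;

        try
        {
            var allowed = _environmentCheck(); // after the timeout this acts as the probe
            _failures = 0;
            _openedAt = null;
            return allowed;
        }
        catch (Exception)
        {
            if (++_failures >= _failureThreshold) _openedAt = now;
            return true; // fail open on backend errors
        }
    }
}
```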
---

## 10) Observability & Audit

### Metrics (Prometheus)

```
gateway_requests_total{service,method,path,status}
gateway_request_duration_seconds{service,method,path,quantile}
gateway_active_connections{service}
gateway_transport_frames_total{type}
gateway_auth_failures_total{reason}
gateway_rate_limit_exceeded_total{tenant}
```

### Traces (OpenTelemetry)

- Span per request: `gateway.route`
- Child span: `gateway.auth.validate`
- Child span: `gateway.transport.send`

### Logs (Structured)

```json
{
  "timestamp": "2025-12-21T10:00:00Z",
  "level": "info",
  "message": "Request routed",
  "correlationId": "abc123",
  "tenantId": "tenant-1",
  "method": "GET",
  "path": "/api/v1/scans/xyz",
  "service": "scanner",
  "durationMs": 45,
  "status": 200
}
```

---
## 11) Testing Matrix

| Test Type | Scope | Coverage Target |
|-----------|-------|-----------------|
| Unit | Routing algorithm, auth validation | 90% |
| Integration | Transport + routing flow | 80% |
| E2E | Full request path with mock services | Key flows |
| Performance | Latency, throughput, connection limits | SLO targets |
| Chaos | Connection failures, microservice crashes | Resilience |

### Test Fixtures

- `StellaOps.Router.Transport.InMemory` for transport mocking
- Mock Authority for auth testing
- `WebApplicationFactory` for integration tests

---
## 12) DevOps & Operations

### Deployment

```yaml
# Kubernetes deployment excerpt
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gateway
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: gateway
          image: stellaops/gateway:1.0.0
          ports:
            - containerPort: 8080  # HTTPS
            - containerPort: 9443  # TLS (microservices)
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "1000m"
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
```

### SLOs

| SLO | Target | Measurement |
|-----|--------|-------------|
| Availability | 99.9% | Uptime over 30 days |
| Latency P99 | <50ms | Includes downstream |
| Error rate | <0.1% | 5xx responses |

---
## 13) Roadmap

| Feature | Sprint | Status |
|---------|--------|--------|
| Core implementation | 3600.0001.0001 | TODO |
| Performance Testing Pipeline | 038 | DONE |
| WebSocket support | Future | Planned |
| gRPC passthrough | Future | Planned |
| GraphQL aggregation | Future | Exploration |

---
## 14) Performance Testing Pipeline (k6 + Prometheus + Correlation IDs)

### Overview

The Gateway includes a comprehensive performance testing pipeline with k6 load tests,
Prometheus metric instrumentation, and Grafana dashboards for performance curve modelling.

### k6 Scenarios (A–G)

| Scenario | Purpose | VUs | Duration | Key Metric |
|----------|---------|-----|----------|------------|
| A — Health Baseline | Sub-ms health probe overhead | 10 | 1 min | P95 < 10 ms |
| B — OpenAPI Aggregation | Spec cache under concurrent readers | 50 | 75 s | P95 < 200 ms |
| C — Routing Throughput | Mixed-method routing at target RPS | 200 | 2 min | P50 < 2 ms, P99 < 5 ms |
| D — Correlation ID | Propagation overhead measurement | 20 | 1 min | P95 < 5 ms overhead |
| E — Rate Limit Boundary | Enforcement correctness at boundary | 100 | 1 min | Retry-After header |
| F — Connection Ramp | Transport saturation (ramp to 1000 VUs) | 1000 | 2 min | No 503 responses |
| G — Steady-State Soak | Memory leak / resource exhaustion | 50 | 10 min | Stable memory |

Run all scenarios:
```bash
k6 run --env BASE_URL=https://gateway.stella-ops.local src/Gateway/__Tests/load/gateway_performance.k6.js
```

Run a single scenario:
```bash
k6 run --env BASE_URL=https://gateway.stella-ops.local --env SCENARIO=scenario_c_routing_throughput src/Gateway/__Tests/load/gateway_performance.k6.js
```

### Performance Metrics (GatewayPerformanceMetrics)

Meter: `StellaOps.Gateway.Performance`

| Instrument | Type | Unit | Description |
|------------|------|------|-------------|
| `gateway.requests.total` | Counter | — | Total requests processed |
| `gateway.errors.total` | Counter | — | Errors (4xx/5xx) |
| `gateway.ratelimit.total` | Counter | — | Rate-limited requests (429) |
| `gateway.request.duration` | Histogram | ms | Full request duration |
| `gateway.auth.duration` | Histogram | ms | Auth middleware duration |
| `gateway.transport.duration` | Histogram | ms | TCP/TLS transport duration |
| `gateway.routing.duration` | Histogram | ms | Instance selection duration |
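
As a rough illustration of how instruments like these can be declared with `System.Diagnostics.Metrics`, the sketch below registers three of them on the `StellaOps.Gateway.Performance` meter. The class name `GatewayPerformanceMetricsSketch`, the `RecordRequest` method, and the `service` and `status` tag names are assumptions; the actual `GatewayPerformanceMetrics` class may be shaped differently.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Illustrative sketch; not the actual GatewayPerformanceMetrics class.
public sealed class GatewayPerformanceMetricsSketch : IDisposable
{
    public const string MeterName = "StellaOps.Gateway.Performance";

    private readonly Meter _meter = new(MeterName);
    private readonly Counter<long> _requests;
    private readonly Counter<long> _errors;
    private readonly Histogram<double> _requestDuration;

    public GatewayPerformanceMetricsSketch()
    {
        _requests = _meter.CreateCounter<long>(
            "gateway.requests.total", description: "Total requests processed");
        _errors = _meter.CreateCounter<long>(
            "gateway.errors.total", description: "Errors (4xx/5xx)");
        _requestDuration = _meter.CreateHistogram<double>(
            "gateway.request.duration", unit: "ms", description: "Full request duration");
    }

    // Records one completed request. Tag names are assumptions.
    public void RecordRequest(string service, int statusCode, double durationMs)
    {
        var serviceTag = new KeyValuePair<string, object?>("service", service);

        _requests.Add(1, serviceTag);
        if (statusCode >= 400)
        {
            _errors.Add(1, serviceTag, new KeyValuePair<string, object?>("status", statusCode));
        }
        _requestDuration.Record(durationMs, serviceTag);
    }

    public void Dispose() => _meter.Dispose();
}
```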
### Grafana Dashboard

Dashboard: `devops/telemetry/dashboards/stella-ops-gateway-performance.json`
UID: `stella-ops-gateway-performance`

Panels:
1. **Overview row** — P50/P99 gauges, error rate, RPS
2. **Latency Distribution** — Percentile time series (overall + per-service)
3. **Throughput & Rate Limiting** — RPS by service, rate-limited requests by route
4. **Pipeline Breakdown** — Auth/Routing/Transport P95 breakdown, errors by status
5. **Connections & Resources** — Active connections, endpoints, memory usage

### C# Models

| Type | Purpose |
|------|---------|
| `GatewayPerformanceObservation` | Single request observation (all pipeline phases) |
| `PerformanceScenarioConfig` | Scenario definition with SLO thresholds |
| `PerformanceCurvePoint` | Aggregated window data with computed RPS/error rate |
| `PerformanceTestSummary` | Complete test run result with threshold violations |
| `GatewayPerformanceMetrics` | OTel service emitting Prometheus-compatible metrics |

---
## 15) References

- Router Architecture: `docs/modules/router/architecture.md`
- Gateway Identity Header Policy: `docs/modules/gateway/identity-header-policy.md`
- OpenAPI Aggregation: `docs/modules/gateway/openapi.md`
- Router ASP.NET Endpoint Bridge: `docs/modules/router/aspnet-endpoint-bridge.md`
- Router Messaging (Valkey) Transport: `docs/modules/router/messaging-valkey-transport.md`
- Authority Integration: `docs/modules/authority/architecture.md`
- Reference Architecture: `docs/product/advisories/archived/2025-12-21-reference-architecture/`

---

**Last Updated**: 2025-12-21 (Sprint 3600)