component_architecture_gateway.md — Stella Ops Gateway (Sprint 3600)
Derived from Reference Architecture Advisory and Router Architecture Specification
Scope. The Gateway WebService is the single HTTP ingress point for all external traffic. It authenticates requests via Authority (DPoP/mTLS), routes to microservices via the Router binary protocol, aggregates OpenAPI specifications, and enforces tenant isolation.
Ownership: Platform Guild
0) Mission & Boundaries
What Gateway Does
- HTTP Ingress: Single entry point for all external HTTP/HTTPS traffic
- Authentication: DPoP and mTLS token validation via Authority integration
- Routing: Routes HTTP requests to microservices via binary protocol (TCP/TLS)
- OpenAPI Aggregation: Combines endpoint specs from all registered microservices
- Health Aggregation: Provides unified health status from downstream services
- Rate Limiting: Per-tenant and per-identity request throttling
- Tenant Propagation: Extracts tenant context and propagates to microservices
What Gateway Does NOT Do
- Business Logic: No domain logic; pure routing and auth
- Data Storage: Stateless; no persistent state beyond connection cache
- Direct Database Access: Never connects to PostgreSQL directly
- SBOM/VEX Processing: Delegates to Scanner, Excititor, etc.
1) Solution & Project Layout
src/Gateway/
├── StellaOps.Gateway.WebService/
│   ├── StellaOps.Gateway.WebService.csproj
│   ├── Program.cs                          # DI bootstrap, transport init
│   ├── Dockerfile
│   ├── appsettings.json
│   ├── appsettings.Development.json
│   ├── Configuration/
│   │   ├── GatewayOptions.cs               # All configuration options
│   │   └── TransportOptions.cs             # TCP/TLS transport config
│   ├── Middleware/
│   │   ├── TenantMiddleware.cs             # Tenant context extraction
│   │   ├── RequestRoutingMiddleware.cs     # HTTP → binary routing
│   │   ├── AuthenticationMiddleware.cs     # DPoP/mTLS validation
│   │   └── RateLimitingMiddleware.cs       # Per-tenant throttling
│   ├── Services/
│   │   ├── GatewayHostedService.cs         # Transport lifecycle
│   │   ├── OpenApiAggregationService.cs    # Spec aggregation
│   │   └── HealthAggregationService.cs     # Downstream health
│   └── Endpoints/
│       ├── HealthEndpoints.cs              # /health/*, /metrics
│       └── OpenApiEndpoints.cs             # /openapi.json, /openapi.yaml
Dependencies
<ItemGroup>
<ProjectReference Include="..\..\__Libraries\StellaOps.Router.Gateway\..." />
<ProjectReference Include="..\..\__Libraries\StellaOps.Router.Transport.Tcp\..." />
<ProjectReference Include="..\..\__Libraries\StellaOps.Router.Transport.Tls\..." />
<ProjectReference Include="..\..\Auth\StellaOps.Auth.ServerIntegration\..." />
</ItemGroup>
2) External Dependencies
| Dependency | Purpose | Required |
|---|---|---|
| Authority | OpTok validation, DPoP/mTLS | Yes |
| Router.Gateway | Routing state, endpoint discovery | Yes |
| Router.Transport.Tcp | Binary transport (dev) | Yes |
| Router.Transport.Tls | Binary transport (prod) | Yes |
| Valkey/Redis | Rate limiting state | Optional |
3) Contracts & Data Model
Request Flow
┌──────────────┐     HTTPS      ┌─────────────────┐    Binary     ┌─────────────────┐
│    Client    │ ─────────────► │     Gateway     │ ────────────► │  Microservice   │
│   (CLI/UI)   │                │   WebService    │     Frame     │   (Scanner,     │
│              │ ◄───────────── │                 │ ◄──────────── │   Policy, etc)  │
└──────────────┘     HTTPS      └─────────────────┘    Binary     └─────────────────┘
Binary Frame Protocol
Gateway uses the Router binary protocol for internal communication:
| Frame Type | Purpose |
|---|---|
| HELLO | Microservice registration with endpoints |
| HEARTBEAT | Health check and latency measurement |
| REQUEST | HTTP request serialized to binary |
| RESPONSE | HTTP response serialized from binary |
| STREAM_DATA | Streaming response chunks |
| CANCEL | Request cancellation propagation |
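The frame types above can be pictured as typed, length-prefixed messages. The following is a minimal Python sketch for illustration only, not the Router wire format: the numeric type codes, the 5-byte header layout, and the helper names `encode_frame` and `decode_frame` are all hypothetical, and the Router Architecture Specification defines the real encoding.

```python
import struct
from enum import IntEnum


class FrameType(IntEnum):
    # Names come from the frame table; the numeric codes are hypothetical.
    HELLO = 1
    HEARTBEAT = 2
    REQUEST = 3
    RESPONSE = 4
    STREAM_DATA = 5
    CANCEL = 6


# Hypothetical header: 1-byte frame type + 4-byte big-endian payload length.
_HEADER = struct.Struct(">BI")


def encode_frame(frame_type: FrameType, payload: bytes) -> bytes:
    """Prefix the payload with its type and length."""
    return _HEADER.pack(frame_type, len(payload)) + payload


def decode_frame(data: bytes):
    """Return (frame type, payload, remaining bytes).

    Raises ValueError when the buffer does not yet hold a full frame.
    """
    if len(data) < _HEADER.size:
        raise ValueError("incomplete header")
    raw_type, length = _HEADER.unpack_from(data)
    end = _HEADER.size + length
    if len(data) < end:
        raise ValueError("incomplete payload")
    return FrameType(raw_type), data[_HEADER.size:end], data[end:]
```

Returning the unread remainder lets a caller decode several frames from one receive buffer, which is the usual reason for a length prefix on a stream transport.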
Endpoint Descriptor
public sealed class EndpointDescriptor
{
    public required string Method { get; init; }        // GET, POST, etc.
    public required string Path { get; init; }          // /api/v1/scans/{id}
    public required string ServiceName { get; init; }   // scanner
    public required string Version { get; init; }       // 1.0.0
    public TimeSpan DefaultTimeout { get; init; }        // 30s
    public bool SupportsStreaming { get; init; }         // true for large responses
    public IReadOnlyList<ClaimRequirement> RequiringClaims { get; init; }
}
Routing State
public interface IRoutingStateManager
{
    ValueTask RegisterEndpointsAsync(ConnectionState conn, HelloPayload hello);
    ValueTask<InstanceSelection?> SelectInstanceAsync(string method, string path);
    ValueTask UpdateHealthAsync(ConnectionState conn, HeartbeatPayload heartbeat);
    ValueTask DrainConnectionAsync(string connectionId);
}
4) REST API
Gateway exposes minimal management endpoints; all business APIs are routed to microservices.
Health Endpoints
| Endpoint | Auth | Description |
|---|---|---|
| `GET /health/live` | None | Liveness probe |
| `GET /health/ready` | None | Readiness probe |
| `GET /health/startup` | None | Startup probe |
| `GET /metrics` | None | Prometheus metrics |
OpenAPI Endpoints
| Endpoint | Auth | Description |
|---|---|---|
| `GET /openapi.json` | None | Aggregated OpenAPI 3.1.0 spec |
| `GET /openapi.yaml` | None | YAML format spec |
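Aggregation can be thought of as merging the `paths` object of each registered service into one document. The sketch below is an assumption about how that merge might look, written in Python for illustration; the function name `aggregate_openapi` and the fail-on-conflict policy are hypothetical and are not taken from `OpenApiAggregationService`.

```python
def aggregate_openapi(title, version, service_specs):
    """Merge the `paths` of several parsed service specs into one document.

    service_specs maps a service name to its OpenAPI dict.
    Raises ValueError when two services claim the same method and path.
    """
    merged = {
        "openapi": "3.1.0",
        "info": {"title": title, "version": version},
        "paths": {},
    }
    owners = {}  # (path, method) -> owning service, used for conflict reports
    for service, spec in sorted(service_specs.items()):
        for path, operations in spec.get("paths", {}).items():
            target = merged["paths"].setdefault(path, {})
            for method, operation in operations.items():
                key = (path, method.lower())
                if key in owners:
                    raise ValueError(
                        f"{method.upper()} {path} is registered by both "
                        f"{owners[key]} and {service}"
                    )
                owners[key] = service
                target[method.lower()] = operation
    return merged
```

Iterating services in sorted order keeps the merged output stable between runs, which matters if the result is cached for `cacheTtlSeconds` and compared across gateway instances.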
5) Execution Flow
Request Routing
sequenceDiagram
    participant C as Client
    participant G as Gateway
    participant A as Authority
    participant M as Microservice
    C->>G: HTTPS Request + DPoP Token
    G->>A: Validate Token
    A-->>G: Claims (sub, tid, scope)
    G->>G: Select Instance (Method, Path)
    G->>M: Binary REQUEST Frame
    M-->>G: Binary RESPONSE Frame
    G-->>C: HTTPS Response
Microservice Registration
sequenceDiagram
    participant M as Microservice
    participant G as Gateway
    M->>G: TCP/TLS Connect
    M->>G: HELLO (ServiceName, Version, Endpoints)
    G->>G: Register Endpoints
    G-->>M: HELLO ACK
    loop Every 10s
        G->>M: HEARTBEAT
        M-->>G: HEARTBEAT (latency, health)
        G->>G: Update Health State
    end
6) Instance Selection Algorithm
public ValueTask<InstanceSelection?> SelectInstanceAsync(string method, string path)
{
    // 1. Find all endpoints matching (method, path)
    var candidates = _endpoints
        .Where(e => e.Method == method && MatchPath(e.Path, path))
        .ToList();

    // 2. Filter by health: Degraded instances stay routable, Unhealthy ones do not
    candidates = candidates
        .Where(c => c.Health is InstanceHealthStatus.Healthy or InstanceHealthStatus.Degraded)
        .ToList();

    // 3. Region preference: local first, then neighbor regions, then everything else
    var localRegion = candidates.Where(c => c.Region == _config.Region).ToList();
    var neighborRegions = candidates.Where(c => _config.NeighborRegions.Contains(c.Region)).ToList();
    var otherRegions = candidates.Except(localRegion).Except(neighborRegions).ToList();

    var preferred = localRegion.Any() ? localRegion
        : neighborRegions.Any() ? neighborRegions
        : otherRegions;

    // 4. Within tier: prefer lower latency, then most recent heartbeat
    var selected = preferred
        .OrderBy(c => c.AveragePingMs)
        .ThenByDescending(c => c.LastHeartbeatUtc)
        .FirstOrDefault();

    // The selection is synchronous, so wrap the result (null when nothing is routable)
    // rather than returning the bare instance from a ValueTask-returning method.
    // Assumes the elements of _endpoints are InstanceSelection values.
    return ValueTask.FromResult<InstanceSelection?>(selected);
}
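The selection steps, including the `MatchPath` helper that the listing references but does not define, can be exercised in isolation. The Python sketch below is illustrative only: the `Instance` record, the string health values, and the rule that each `{name}` placeholder matches exactly one non-empty path segment are assumptions, not the Router's documented behavior.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Instance:
    method: str
    path: str                 # route template, e.g. /api/v1/scans/{id}
    region: str
    health: str               # "Healthy", "Degraded", or "Unhealthy"
    average_ping_ms: float
    last_heartbeat_utc: datetime


def match_path(template: str, path: str) -> bool:
    """Match a route template; each {name} stands for one non-empty segment."""
    literal_parts = re.split(r"\{[^/{}]+\}", template)
    pattern = "[^/]+".join(re.escape(part) for part in literal_parts)
    return re.fullmatch(pattern, path) is not None


def select_instance(instances, method, path, region, neighbor_regions):
    # Steps 1 and 2: match the route, then drop instances that are not routable.
    candidates = [
        i for i in instances
        if i.method == method
        and match_path(i.path, path)
        and i.health in ("Healthy", "Degraded")
    ]
    # Step 3: pick the first non-empty region tier.
    local = [i for i in candidates if i.region == region]
    neighbors = [i for i in candidates
                 if i.region in neighbor_regions and i.region != region]
    others = [i for i in candidates if i not in local and i not in neighbors]
    tier = local or neighbors or others
    if not tier:
        return None
    # Step 4: lowest ping wins; ties go to the most recent heartbeat.
    return min(
        tier,
        key=lambda i: (i.average_ping_ms, -i.last_heartbeat_utc.timestamp()),
    )
```

Note that region tier takes priority over latency: a Degraded instance in a neighbor region is chosen ahead of a faster Healthy instance in a non-neighbor region, because step 3 runs before step 4.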
7) Configuration
gateway:
  node:
    region: "eu1"
    nodeId: "gw-eu1-01"
    environment: "prod"
  transports:
    tcp:
      enabled: true
      port: 9100
      maxConnections: 1000
      receiveBufferSize: 65536
      sendBufferSize: 65536
    tls:
      enabled: true
      port: 9443
      certificatePath: "/certs/gateway.pfx"
      certificatePassword: "${GATEWAY_CERT_PASSWORD}"
      clientCertificateMode: "RequireCertificate"
      allowedClientCertificateThumbprints: []
  routing:
    defaultTimeout: "30s"
    maxRequestBodySize: "100MB"
    streamingEnabled: true
    streamingBufferSize: 16384
    neighborRegions: ["eu2", "us1"]
  auth:
    dpopEnabled: true
    dpopMaxClockSkew: "60s"
    mtlsEnabled: true
  rateLimiting:
    enabled: true
    requestsPerMinute: 1000
    burstSize: 100
    redisConnectionString: "${REDIS_URL}"
  openapi:
    enabled: true
    cacheTtlSeconds: 300
    title: "Stella Ops API"
    version: "1.0.0"
  health:
    heartbeatIntervalSeconds: 10
    heartbeatTimeoutSeconds: 30
    unhealthyThreshold: 3
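The `health` settings suggest a miss-counting scheme, although the document does not spell out the exact semantics. The Python sketch below assumes one plausible reading: a heartbeat that gets no reply within `heartbeatTimeoutSeconds` counts as a miss, any miss marks the instance Degraded, and `unhealthyThreshold` consecutive misses mark it Unhealthy. The `HealthTracker` class and its method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    # Mirrors health.unhealthyThreshold from the configuration above.
    unhealthy_threshold: int = 3
    missed: int = 0  # consecutive heartbeats without a timely reply

    def record_reply(self) -> None:
        """A timely heartbeat reply clears the miss counter."""
        self.missed = 0

    def record_miss(self) -> None:
        """No reply arrived within the heartbeat timeout."""
        self.missed += 1

    @property
    def status(self) -> str:
        if self.missed == 0:
            return "Healthy"
        if self.missed < self.unhealthy_threshold:
            return "Degraded"
        return "Unhealthy"
```

Under this reading and the sample values, an instance that stops answering is removed from routing only after the third missed heartbeat, and a single successful reply restores it to Healthy.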
8) Scale & Performance
| Metric | Target | Notes |
|---|---|---|
| Routing latency (P50) | <2ms | Overhead only; excludes downstream |
| Routing latency (P99) | <5ms | Under normal load |
| Concurrent connections | 10,000 | Per gateway instance |
| Requests/second | 50,000 | Per gateway instance |
| Memory footprint | <512MB | Base; scales with connections |
Scaling Strategy
- Horizontal scaling behind load balancer
- Sticky sessions NOT required (stateless)
- Regional deployment for latency optimization
- Rate limiting via distributed Valkey/Redis
9) Security Posture
Authentication
| Method | Description |
|---|---|
| DPoP | Proof-of-possession tokens from Authority |
| mTLS | Certificate-bound tokens for machine clients |
Authorization
- Claims-based authorization per endpoint
- Required claims defined in endpoint descriptors
- Tenant isolation via `tid` claim
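A claims check along these lines can be sketched as follows. This is Python for illustration and not the Gateway's code: the shape of a claim requirement as a `(type, value)` pair, the treatment of `scope` as a space-delimited list (the usual OAuth 2.0 convention), and the scope names in the usage note are all assumptions.

```python
def is_authorized(token_claims, required_claims):
    """Check that every required (type, value) pair is satisfied.

    A required value of None means the claim only has to be present.
    """
    for claim_type, required_value in required_claims:
        actual = token_claims.get(claim_type)
        if actual is None:
            return False
        # OAuth scopes travel as one space-delimited string.
        values = actual.split() if claim_type == "scope" else [actual]
        if required_value is not None and required_value not in values:
            return False
    return True


def same_tenant(token_claims, resource_tenant_id):
    """Tenant isolation: the token's tid claim must match the resource tenant."""
    return token_claims.get("tid") == resource_tenant_id
```

For example, a token with `scope` set to `"scans:read scans:write"` (hypothetical scope names) satisfies a requirement of `("scope", "scans:read")` but not `("scope", "policy:write")`.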
Transport Security
| Component | Encryption |
|---|---|
| Client → Gateway | TLS 1.3 (HTTPS) |
| Gateway → Microservices | TLS (prod), TCP (dev only) |
Rate Limiting
- Per-tenant: Configurable requests/minute
- Per-identity: Burst protection
- Global: Circuit breaker for overload
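The document does not name the throttling algorithm. A token bucket is one common way to combine a sustained rate (`requestsPerMinute`) with a burst allowance (`burstSize`), and the Python sketch below assumes it. The class names are hypothetical, and the state is kept in process memory, whereas a multi-instance deployment would keep it in Valkey/Redis.

```python
import time


class TokenBucket:
    def __init__(self, requests_per_minute, burst_size, clock=time.monotonic):
        self.rate = requests_per_minute / 60.0  # tokens refilled per second
        self.capacity = float(burst_size)       # maximum burst
        self.tokens = self.capacity             # start with a full bucket
        self.clock = clock
        self.updated = clock()

    def try_acquire(self) -> bool:
        """Refill for the elapsed time, then spend one token if available."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class TenantRateLimiter:
    """One bucket per tenant, created on first use."""

    def __init__(self, requests_per_minute, burst_size, clock=time.monotonic):
        self._args = (requests_per_minute, burst_size, clock)
        self._buckets = {}

    def allow(self, tenant_id: str) -> bool:
        bucket = self._buckets.get(tenant_id)
        if bucket is None:
            bucket = self._buckets[tenant_id] = TokenBucket(*self._args)
        return bucket.try_acquire()
```

Injecting the clock keeps the limiter deterministic under test. With the sample configuration, a tenant could send 100 requests back to back and would then be held to roughly 16.7 requests per second (1000 / 60).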
10) Observability & Audit
Metrics (Prometheus)
gateway_requests_total{service,method,path,status}
gateway_request_duration_seconds{service,method,path,quantile}
gateway_active_connections{service}
gateway_transport_frames_total{type}
gateway_auth_failures_total{reason}
gateway_rate_limit_exceeded_total{tenant}
Traces (OpenTelemetry)
- Span per request: `gateway.route`
- Child span: `gateway.auth.validate`
- Child span: `gateway.transport.send`
Logs (Structured)
{
  "timestamp": "2025-12-21T10:00:00Z",
  "level": "info",
  "message": "Request routed",
  "correlationId": "abc123",
  "tenantId": "tenant-1",
  "method": "GET",
  "path": "/api/v1/scans/xyz",
  "service": "scanner",
  "durationMs": 45,
  "status": 200
}
11) Testing Matrix
| Test Type | Scope | Coverage Target |
|---|---|---|
| Unit | Routing algorithm, auth validation | 90% |
| Integration | Transport + routing flow | 80% |
| E2E | Full request path with mock services | Key flows |
| Performance | Latency, throughput, connection limits | SLO targets |
| Chaos | Connection failures, microservice crashes | Resilience |
Test Fixtures
- `StellaOps.Router.Transport.InMemory` for transport mocking
- Mock Authority for auth testing
- `WebApplicationFactory` for integration tests
12) DevOps & Operations
Deployment
# Kubernetes deployment excerpt
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gateway
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: gateway
          image: stellaops/gateway:1.0.0
          ports:
            - containerPort: 8080   # HTTPS
            - containerPort: 9443   # TLS (microservices)
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "1000m"
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
SLOs
| SLO | Target | Measurement |
|---|---|---|
| Availability | 99.9% | Uptime over 30 days |
| Latency P99 | <50ms | Includes downstream |
| Error rate | <0.1% | 5xx responses |
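The availability target translates directly into a downtime budget: 0.1% of a 30-day window is 0.001 × 30 × 24 × 60 = 43.2 minutes. The small Python helper below, whose name is hypothetical, performs the same arithmetic for other targets or windows.

```python
def error_budget_minutes(availability: float, window_days: int) -> float:
    """Minutes of allowed downtime for an availability target over a window."""
    return (1.0 - availability) * window_days * 24 * 60
```

Calling `error_budget_minutes(0.999, 30)` gives approximately 43.2, so a single outage of about three quarters of an hour would consume the whole monthly budget.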
13) Roadmap
| Feature | Sprint | Status |
|---|---|---|
| Core implementation | 3600.0001.0001 | TODO |
| WebSocket support | Future | Planned |
| gRPC passthrough | Future | Planned |
| GraphQL aggregation | Future | Exploration |
14) References
- Router Architecture: `docs/modules/router/architecture.md`
- OpenAPI Aggregation: `docs/modules/gateway/openapi.md`
- Authority Integration: `docs/modules/authority/architecture.md`
- Reference Architecture: `docs/product-advisories/archived/2025-12-21-reference-architecture/`
Last Updated: 2025-12-21 (Sprint 3600)