# component_architecture_gateway.md — **Stella Ops Gateway** (Sprint 3600)
> Derived from Reference Architecture Advisory and Router Architecture Specification
>
> **Scope.** The Gateway WebService is the single HTTP ingress point for all external traffic. It authenticates requests via Authority (DPoP/mTLS), routes to microservices via the Router binary protocol, aggregates OpenAPI specifications, and enforces tenant isolation.
>
> **Ownership:** Platform Guild
---
## 0) Mission & Boundaries
### What Gateway Does
- **HTTP Ingress**: Single entry point for all external HTTP/HTTPS traffic
- **Authentication**: DPoP and mTLS token validation via Authority integration
- **Routing**: Routes HTTP requests to microservices via binary protocol (TCP/TLS)
- **OpenAPI Aggregation**: Combines endpoint specs from all registered microservices
- **Health Aggregation**: Provides unified health status from downstream services
- **Rate Limiting**: Per-tenant and per-identity request throttling
- **Tenant Propagation**: Extracts tenant context and propagates to microservices
### What Gateway Does NOT Do
- **Business Logic**: No domain logic; pure routing and auth
- **Data Storage**: Stateless; no persistent state beyond connection cache
- **Direct Database Access**: Never connects to PostgreSQL directly
- **SBOM/VEX Processing**: Delegates to Scanner, Excititor, etc.
---
## 1) Solution & Project Layout
```
src/Gateway/
├── StellaOps.Gateway.WebService/
│   ├── StellaOps.Gateway.WebService.csproj
│   ├── Program.cs                        # DI bootstrap, transport init
│   ├── Dockerfile
│   ├── appsettings.json
│   ├── appsettings.Development.json
│   ├── Configuration/
│   │   ├── GatewayOptions.cs             # All configuration options
│   │   └── TransportOptions.cs           # TCP/TLS transport config
│   ├── Middleware/
│   │   ├── TenantMiddleware.cs           # Tenant context extraction
│   │   ├── RequestRoutingMiddleware.cs   # HTTP → binary routing
│   │   ├── AuthenticationMiddleware.cs   # DPoP/mTLS validation
│   │   └── RateLimitingMiddleware.cs     # Per-tenant throttling
│   ├── Services/
│   │   ├── GatewayHostedService.cs       # Transport lifecycle
│   │   ├── OpenApiAggregationService.cs  # Spec aggregation
│   │   └── HealthAggregationService.cs   # Downstream health
│   └── Endpoints/
│       ├── HealthEndpoints.cs            # /health/*, /metrics
│       └── OpenApiEndpoints.cs           # /openapi.json, /openapi.yaml
```
### Dependencies
```xml
<ItemGroup>
<ProjectReference Include="..\..\__Libraries\StellaOps.Router.Gateway\..." />
<ProjectReference Include="..\..\__Libraries\StellaOps.Router.Transport.Tcp\..." />
<ProjectReference Include="..\..\__Libraries\StellaOps.Router.Transport.Tls\..." />
<ProjectReference Include="..\..\Auth\StellaOps.Auth.ServerIntegration\..." />
</ItemGroup>
```
---
## 2) External Dependencies
| Dependency | Purpose | Required |
|------------|---------|----------|
| **Authority** | OpTok validation, DPoP/mTLS | Yes |
| **Router.Gateway** | Routing state, endpoint discovery | Yes |
| **Router.Transport.Tcp** | Binary transport (dev) | Yes |
| **Router.Transport.Tls** | Binary transport (prod) | Yes |
| **Valkey/Redis** | Rate limiting state | Optional |
---
## 3) Contracts & Data Model
### Request Flow
```
┌──────────────┐     HTTPS      ┌─────────────────┐    Binary     ┌─────────────────┐
│   Client     │ ─────────────► │    Gateway      │ ────────────► │  Microservice   │
│   (CLI/UI)   │                │   WebService    │    Frame      │  (Scanner,      │
│              │ ◄───────────── │                 │ ◄──────────── │  Policy, etc)   │
└──────────────┘     HTTPS      └─────────────────┘    Binary     └─────────────────┘
```
### Binary Frame Protocol
Gateway uses the Router binary protocol for internal communication:
| Frame Type | Purpose |
|------------|---------|
| HELLO | Microservice registration with endpoints |
| HEARTBEAT | Health check and latency measurement |
| REQUEST | HTTP request serialized to binary |
| RESPONSE | HTTP response serialized to binary |
| STREAM_DATA | Streaming response chunks |
| CANCEL | Request cancellation propagation |
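
The table names the frame types but not their wire layout, which belongs to the Router specification. As a rough illustration of how such frames could be delimited on a TCP/TLS stream, here is a minimal sketch. The layout (one type byte, a 4-byte big-endian payload length, then the payload), the numeric enum values, and the `FrameCodec` name are all assumptions made for this example, not the actual Router protocol.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical numeric values; the Router specification defines the real ones.
public enum FrameType : byte
{
    Hello = 1,
    Heartbeat = 2,
    Request = 3,
    Response = 4,
    StreamData = 5,
    Cancel = 6,
}

public static class FrameCodec
{
    // Assumed layout: [type: 1 byte][payload length: 4 bytes, big-endian][payload].
    public const int HeaderSize = 5;

    public static byte[] Encode(FrameType type, ReadOnlySpan<byte> payload)
    {
        var buffer = new byte[HeaderSize + payload.Length];
        buffer[0] = (byte)type;
        BinaryPrimitives.WriteInt32BigEndian(buffer.AsSpan(1, 4), payload.Length);
        payload.CopyTo(buffer.AsSpan(HeaderSize));
        return buffer;
    }

    // Returns false when the buffer does not yet hold one complete frame,
    // so the caller can keep reading from the socket and retry.
    public static bool TryDecode(
        ReadOnlySpan<byte> buffer, out FrameType type, out byte[] payload, out int consumed)
    {
        type = default;
        payload = Array.Empty<byte>();
        consumed = 0;
        if (buffer.Length < HeaderSize) return false;

        int length = BinaryPrimitives.ReadInt32BigEndian(buffer.Slice(1, 4));
        if (length < 0) throw new FormatException("Negative payload length.");
        if (buffer.Length - HeaderSize < length) return false;

        type = (FrameType)buffer[0];
        payload = buffer.Slice(HeaderSize, length).ToArray();
        consumed = HeaderSize + length;
        return true;
    }
}
```

A length prefix lets the reader detect partial frames without scanning the payload, which is why `TryDecode` reports how many bytes it consumed.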
### Endpoint Descriptor
```csharp
public sealed class EndpointDescriptor
{
    public required string Method { get; init; }       // e.g. GET, POST
    public required string Path { get; init; }         // e.g. /api/v1/scans/{id}
    public required string ServiceName { get; init; }  // e.g. scanner
    public required string Version { get; init; }      // e.g. 1.0.0
    public TimeSpan DefaultTimeout { get; init; }      // e.g. 30s
    public bool SupportsStreaming { get; init; }       // true for large responses
    public IReadOnlyList<ClaimRequirement> RequiringClaims { get; init; } = []; // empty when no claims are required
}
```
### Routing State
```csharp
public interface IRoutingStateManager
{
    ValueTask RegisterEndpointsAsync(ConnectionState conn, HelloPayload hello);
    ValueTask<InstanceSelection?> SelectInstanceAsync(string method, string path);
    ValueTask UpdateHealthAsync(ConnectionState conn, HeartbeatPayload heartbeat);
    ValueTask DrainConnectionAsync(string connectionId);
}
```
---
## 4) REST API
Gateway exposes minimal management endpoints; all business APIs are routed to microservices.
### Health Endpoints
| Endpoint | Auth | Description |
|----------|------|-------------|
| `GET /health/live` | None | Liveness probe |
| `GET /health/ready` | None | Readiness probe |
| `GET /health/startup` | None | Startup probe |
| `GET /metrics` | None | Prometheus metrics |
### OpenAPI Endpoints
| Endpoint | Auth | Description |
|----------|------|-------------|
| `GET /openapi.json` | None | Aggregated OpenAPI 3.1.0 spec |
| `GET /openapi.yaml` | None | YAML format spec |
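
How the aggregated document is assembled is described in `docs/modules/gateway/openapi.md`, not here. Purely to illustrate the idea of combining per-service specs, the sketch below merges the `paths` objects of several documents into one. The `OpenApiMerge` helper, the "first registration wins" conflict rule, and the omission of `components` are assumptions of this example and may differ from the real aggregation service. A caller would pass the configured title and version plus one `JsonObject` per registered service.

```csharp
using System.Collections.Generic;
using System.Text.Json.Nodes;

public static class OpenApiMerge
{
    // Merges the "paths" section of each service spec into a single document.
    public static JsonObject Merge(string title, string version, IEnumerable<JsonObject> serviceSpecs)
    {
        var paths = new JsonObject();
        foreach (var spec in serviceSpecs)
        {
            if (spec["paths"] is not JsonObject servicePaths) continue;

            foreach (var (path, item) in servicePaths)
            {
                // Assumed conflict rule: the first service to register a path keeps it.
                if (paths.ContainsKey(path)) continue;

                // A JsonNode can have only one parent, so copy it before re-attaching.
                paths[path] = item is null ? null : JsonNode.Parse(item.ToJsonString());
            }
        }

        return new JsonObject
        {
            ["openapi"] = "3.1.0",
            ["info"] = new JsonObject { ["title"] = title, ["version"] = version },
            ["paths"] = paths,
        };
    }
}
```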
---
## 5) Execution Flow
### Request Routing
```mermaid
sequenceDiagram
participant C as Client
participant G as Gateway
participant A as Authority
participant M as Microservice
C->>G: HTTPS Request + DPoP Token
G->>A: Validate Token
A-->>G: Claims (sub, tid, scope)
G->>G: Select Instance (Method, Path)
G->>M: Binary REQUEST Frame
M-->>G: Binary RESPONSE Frame
G-->>C: HTTPS Response
```
### Microservice Registration
```mermaid
sequenceDiagram
participant M as Microservice
participant G as Gateway
M->>G: TCP/TLS Connect
M->>G: HELLO (ServiceName, Version, Endpoints)
G->>G: Register Endpoints
G-->>M: HELLO ACK
loop Every 10s
G->>M: HEARTBEAT
M-->>G: HEARTBEAT (latency, health)
G->>G: Update Health State
end
```
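
The heartbeat loop above feeds the health state that routing later filters on. The document does not define how missed heartbeats map to a status, so the following is only a sketch under assumed semantics: an instance stays healthy until it has been silent for two intervals, is degraded after that, and becomes unhealthy once `unhealthyThreshold` intervals have passed. With the sample settings from section 7 (`heartbeatIntervalSeconds: 10`, `unhealthyThreshold: 3`) that would mean unhealthy after 30 seconds of silence. The `Unhealthy` member and the `HeartbeatHealth` class are hypothetical names.

```csharp
using System;

public enum InstanceHealthStatus { Healthy, Degraded, Unhealthy }

public static class HeartbeatHealth
{
    // Classifies an instance from the time elapsed since its last heartbeat.
    public static InstanceHealthStatus Classify(
        DateTimeOffset lastHeartbeatUtc, DateTimeOffset nowUtc, TimeSpan interval, int unhealthyThreshold)
    {
        if (interval <= TimeSpan.Zero) throw new ArgumentOutOfRangeException(nameof(interval));

        var elapsed = nowUtc - lastHeartbeatUtc;
        if (elapsed < TimeSpan.Zero) elapsed = TimeSpan.Zero; // tolerate small clock differences

        // Assumed thresholds: one interval of grace, then degraded, then unhealthy.
        var degradedAfter = TimeSpan.FromTicks(interval.Ticks * 2);
        var unhealthyAfter = TimeSpan.FromTicks(interval.Ticks * unhealthyThreshold);

        if (elapsed >= unhealthyAfter) return InstanceHealthStatus.Unhealthy;
        return elapsed >= degradedAfter ? InstanceHealthStatus.Degraded : InstanceHealthStatus.Healthy;
    }
}
```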
---
## 6) Instance Selection Algorithm
```csharp
public ValueTask<InstanceSelection?> SelectInstanceAsync(string method, string path)
{
    // 1. Find all endpoints matching (method, path)
    var candidates = _endpoints
        .Where(e => e.Method == method && MatchPath(e.Path, path))
        .ToList();

    // 2. Filter by health: Healthy and Degraded instances stay eligible
    candidates = candidates
        .Where(c => c.Health is InstanceHealthStatus.Healthy or InstanceHealthStatus.Degraded)
        .ToList();

    // 3. Region preference: local region, then neighbor regions, then everything else
    var localRegion = candidates.Where(c => c.Region == _config.Region).ToList();
    var neighborRegions = candidates.Where(c => _config.NeighborRegions.Contains(c.Region)).ToList();
    var otherRegions = candidates.Except(localRegion).Except(neighborRegions).ToList();
    var preferred = localRegion.Count > 0 ? localRegion
        : neighborRegions.Count > 0 ? neighborRegions
        : otherRegions;

    // 4. Within tier: prefer lower latency, then most recent heartbeat
    //    (assumes the candidate entries are InstanceSelection values)
    InstanceSelection? selected = preferred
        .OrderBy(c => c.AveragePingMs)
        .ThenByDescending(c => c.LastHeartbeatUtc)
        .FirstOrDefault();

    // Selection is synchronous, so wrap the result; null means no eligible instance
    return ValueTask.FromResult(selected);
}
```
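
The selection code calls a `MatchPath` helper that is not shown. A minimal sketch of what such a helper might look like follows, assuming templates use `{name}` placeholders that each match exactly one path segment, that literal segments compare case-insensitively, and that the incoming path carries no query string. These rules and the `PathTemplate` class name are assumptions for illustration. Under them, `/api/v1/scans/{id}` matches `/api/v1/scans/xyz` but not `/api/v1/scans`.

```csharp
using System;

public static class PathTemplate
{
    // Returns true when every segment of the path matches the template,
    // treating "{name}" segments as single-segment wildcards.
    public static bool MatchPath(string template, string path)
    {
        var templateSegments = template.Split('/', StringSplitOptions.RemoveEmptyEntries);
        var pathSegments = path.Split('/', StringSplitOptions.RemoveEmptyEntries);
        if (templateSegments.Length != pathSegments.Length) return false;

        for (int i = 0; i < templateSegments.Length; i++)
        {
            var segment = templateSegments[i];
            bool isParameter = segment.Length > 2 && segment[0] == '{' && segment[^1] == '}';
            if (isParameter) continue; // any non-empty value is accepted

            if (!string.Equals(segment, pathSegments[i], StringComparison.OrdinalIgnoreCase))
                return false;
        }

        return true;
    }
}
```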
---
## 7) Configuration
```yaml
gateway:
  node:
    region: "eu1"
    nodeId: "gw-eu1-01"
    environment: "prod"
  transports:
    tcp:
      enabled: true
      port: 9100
      maxConnections: 1000
      receiveBufferSize: 65536
      sendBufferSize: 65536
    tls:
      enabled: true
      port: 9443
      certificatePath: "/certs/gateway.pfx"
      certificatePassword: "${GATEWAY_CERT_PASSWORD}"
      clientCertificateMode: "RequireCertificate"
      allowedClientCertificateThumbprints: []
  routing:
    defaultTimeout: "30s"
    maxRequestBodySize: "100MB"
    streamingEnabled: true
    streamingBufferSize: 16384
    neighborRegions: ["eu2", "us1"]
  auth:
    dpopEnabled: true
    dpopMaxClockSkew: "60s"
    mtlsEnabled: true
  rateLimiting:
    enabled: true
    requestsPerMinute: 1000
    burstSize: 100
    redisConnectionString: "${REDIS_URL}"
  openapi:
    enabled: true
    cacheTtlSeconds: 300
    title: "Stella Ops API"
    version: "1.0.0"
  health:
    heartbeatIntervalSeconds: 10
    heartbeatTimeoutSeconds: 30
    unhealthyThreshold: 3
---
## 8) Scale & Performance
| Metric | Target | Notes |
|--------|--------|-------|
| Routing latency (P50) | <2ms | Overhead only; excludes downstream |
| Routing latency (P99) | <5ms | Under normal load |
| Concurrent connections | 10,000 | Per gateway instance |
| Requests/second | 50,000 | Per gateway instance |
| Memory footprint | <512MB | Base; scales with connections |
### Scaling Strategy
- Horizontal scaling behind load balancer
- Sticky sessions NOT required (stateless)
- Regional deployment for latency optimization
- Rate limiting via distributed Valkey/Redis
---
## 9) Security Posture
### Authentication
| Method | Description |
|--------|-------------|
| DPoP | Proof-of-possession tokens from Authority |
| mTLS | Certificate-bound tokens for machine clients |
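
One piece of DPoP validation that the configuration hints at is freshness: a DPoP proof carries an `iat` (issued-at) claim, and section 7 sets `dpopMaxClockSkew: "60s"`. The sketch below shows one plausible reading of that setting, accepting a proof only when its `iat` lies within the allowed skew of the current time in either direction. How the Gateway or Authority actually applies the setting is not stated here, and `DpopFreshness` is a hypothetical name.

```csharp
using System;

public static class DpopFreshness
{
    // Accepts the proof when |now - iat| is within the configured skew.
    public static bool IsIssuedAtAcceptable(long iatUnixSeconds, DateTimeOffset nowUtc, TimeSpan maxClockSkew)
    {
        var issuedAt = DateTimeOffset.FromUnixTimeSeconds(iatUnixSeconds);
        return (nowUtc - issuedAt).Duration() <= maxClockSkew;
    }
}
```

Freshness is only one check. A complete DPoP validation also verifies the proof signature, the `htm` and `htu` claims against the request, `jti` uniqueness, and the binding between the proof key and the access token.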
### Authorization
- Claims-based authorization per endpoint
- Required claims defined in endpoint descriptors
- Tenant isolation via `tid` claim
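
The three rules above can be sketched as a single check that runs after authentication. The shape of `ClaimRequirement` is not given in this document, so the record below (a claim type plus an optional exact value) is hypothetical, as is the `EndpointAuthorization` helper. Exact-value matching is also an assumption; scope claims, for example, are often space-delimited and would need splitting first.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Security.Claims;

// Hypothetical shape: a claim type, optionally constrained to one value.
public sealed record ClaimRequirement(string Type, string? Value = null);

public static class EndpointAuthorization
{
    public static bool IsAuthorized(ClaimsPrincipal user, IReadOnlyList<ClaimRequirement> requiringClaims)
    {
        // Tenant isolation: reject requests without a tenant (tid) claim.
        var tenant = user.FindFirst("tid")?.Value;
        if (string.IsNullOrWhiteSpace(tenant)) return false;

        // Every requirement on the endpoint descriptor must be satisfied.
        return requiringClaims.All(required =>
            user.Claims.Any(claim =>
                claim.Type == required.Type &&
                (required.Value is null || claim.Value == required.Value)));
    }
}
```

Keeping the requirements on the endpoint descriptor means the Gateway can reject a request before any REQUEST frame is sent to the microservice.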
### Transport Security
| Component | Encryption |
|-----------|------------|
| Client ↔ Gateway | TLS 1.3 (HTTPS) |
| Gateway ↔ Microservices | TLS (prod), TCP (dev only) |
### Rate Limiting
- Per-tenant: Configurable requests/minute
- Per-identity: Burst protection
- Global: Circuit breaker for overload
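
The algorithm behind these limits is not specified. A token bucket is one common way to combine a sustained rate with a burst allowance, so the sketch below uses it as an illustration, mapping `requestsPerMinute` to the refill rate and `burstSize` to the bucket capacity. That mapping is an assumption, the `TokenBucket` class is hypothetical, and the state here is in-process only, whereas a multi-instance deployment would keep it in Valkey/Redis. One bucket would be held per tenant or per identity, and `TryAcquire` called once per request.

```csharp
using System;

public sealed class TokenBucket
{
    private readonly object _gate = new();
    private readonly double _capacity;
    private readonly double _refillPerSecond;
    private double _tokens;
    private DateTimeOffset _lastRefill;

    public TokenBucket(int requestsPerMinute, int burstSize, DateTimeOffset now)
    {
        _capacity = burstSize;                        // assumed: burstSize caps the bucket
        _refillPerSecond = requestsPerMinute / 60.0;  // assumed: sustained rate refills it
        _tokens = burstSize;                          // start full so an initial burst is allowed
        _lastRefill = now;
    }

    // Takes one token if available; the caller passes the current time,
    // which keeps the class deterministic and easy to test.
    public bool TryAcquire(DateTimeOffset now)
    {
        lock (_gate)
        {
            var elapsedSeconds = (now - _lastRefill).TotalSeconds;
            if (elapsedSeconds > 0)
            {
                _tokens = Math.Min(_capacity, _tokens + elapsedSeconds * _refillPerSecond);
                _lastRefill = now;
            }

            if (_tokens < 1) return false;
            _tokens -= 1;
            return true;
        }
    }
}
```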
---
## 10) Observability & Audit
### Metrics (Prometheus)
```
gateway_requests_total{service,method,path,status}
gateway_request_duration_seconds{service,method,path,quantile}
gateway_active_connections{service}
gateway_transport_frames_total{type}
gateway_auth_failures_total{reason}
gateway_rate_limit_exceeded_total{tenant}
```
### Traces (OpenTelemetry)
- Span per request: `gateway.route`
- Child span: `gateway.auth.validate`
- Child span: `gateway.transport.send`
### Logs (Structured)
```json
{
"timestamp": "2025-12-21T10:00:00Z",
"level": "info",
"message": "Request routed",
"correlationId": "abc123",
"tenantId": "tenant-1",
"method": "GET",
"path": "/api/v1/scans/xyz",
"service": "scanner",
"durationMs": 45,
"status": 200
}
```
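
For readers who want to reproduce this shape, the entry above can be modelled as a small record and serialized with `System.Text.Json` using camel-case property names. This is only a sketch: `RequestLogEntry` and `RequestLog` are hypothetical names, and the Gateway may well emit these fields through a structured logging library rather than by serializing a type directly.

```csharp
using System;
using System.Text.Json;

// Mirrors the fields of the sample log entry; property names are camel-cased on output.
public sealed record RequestLogEntry(
    DateTime Timestamp,
    string Level,
    string Message,
    string CorrelationId,
    string TenantId,
    string Method,
    string Path,
    string Service,
    long DurationMs,
    int Status);

public static class RequestLog
{
    private static readonly JsonSerializerOptions Options = new()
    {
        PropertyNamingPolicy = JsonNamingPolicy.CamelCase,
    };

    public static string ToJson(RequestLogEntry entry) => JsonSerializer.Serialize(entry, Options);
}
```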
---
## 11) Testing Matrix
| Test Type | Scope | Coverage Target |
|-----------|-------|-----------------|
| Unit | Routing algorithm, auth validation | 90% |
| Integration | Transport + routing flow | 80% |
| E2E | Full request path with mock services | Key flows |
| Performance | Latency, throughput, connection limits | SLO targets |
| Chaos | Connection failures, microservice crashes | Resilience |
### Test Fixtures
- `StellaOps.Router.Transport.InMemory` for transport mocking
- Mock Authority for auth testing
- `WebApplicationFactory` for integration tests
---
## 12) DevOps & Operations
### Deployment
```yaml
# Kubernetes deployment excerpt
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gateway
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: gateway
          image: stellaops/gateway:1.0.0
          ports:
            - containerPort: 8080 # HTTPS
            - containerPort: 9443 # TLS (microservices)
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "1000m"
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
### SLOs
| SLO | Target | Measurement |
|-----|--------|-------------|
| Availability | 99.9% | Uptime over 30 days |
| Latency P99 | <50ms | Includes downstream |
| Error rate | <0.1% | 5xx responses |
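
To make the availability target concrete, it helps to convert it into a downtime budget: 99.9% over 30 days leaves 0.1% of 43,200 minutes, or roughly 43.2 minutes of allowed downtime. The small helper below performs that arithmetic for any target and window; `ErrorBudget` is a hypothetical name introduced for this example.

```csharp
using System;

public static class ErrorBudget
{
    // Allowed downtime = window * (1 - availability target).
    public static TimeSpan AllowedDowntime(double availabilityTarget, TimeSpan window)
    {
        if (availabilityTarget < 0 || availabilityTarget > 1)
            throw new ArgumentOutOfRangeException(nameof(availabilityTarget));

        return TimeSpan.FromTicks((long)Math.Round(window.Ticks * (1.0 - availabilityTarget)));
    }
}
```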
---
## 13) Roadmap
| Feature | Sprint | Status |
|---------|--------|--------|
| Core implementation | 3600.0001.0001 | TODO |
| WebSocket support | Future | Planned |
| gRPC passthrough | Future | Planned |
| GraphQL aggregation | Future | Exploration |
---
## 14) References
- Router Architecture: `docs/modules/router/architecture.md`
- OpenAPI Aggregation: `docs/modules/gateway/openapi.md`
- Authority Integration: `docs/modules/authority/architecture.md`
- Reference Architecture: `docs/product-advisories/archived/2025-12-21-reference-architecture/`
---
**Last Updated**: 2025-12-21 (Sprint 3600)