Add reference architecture and testing strategy documentation
- Created a new document for the Stella Ops Reference Architecture outlining the system's topology, trust boundaries, artifact association, and interfaces.
- Developed a comprehensive Testing Strategy document detailing the importance of offline readiness, interoperability, determinism, and operational guardrails.
- Introduced a README for the Testing Strategy, summarizing processing details and key concepts implemented.
- Added guidance for AI agents and developers in the tests directory, including directory structure, test categories, key patterns, and rules for test development.
461
docs/modules/gateway/architecture.md
Normal file
@@ -0,0 +1,461 @@
# component_architecture_gateway.md — **Stella Ops Gateway** (Sprint 3600)

> Derived from Reference Architecture Advisory and Router Architecture Specification

> **Scope.** The Gateway WebService is the single HTTP ingress point for all external traffic. It authenticates requests via Authority (DPoP/mTLS), routes to microservices via the Router binary protocol, aggregates OpenAPI specifications, and enforces tenant isolation.
> **Ownership:** Platform Guild

---

## 0) Mission & Boundaries

### What Gateway Does

- **HTTP Ingress**: Single entry point for all external HTTP/HTTPS traffic
- **Authentication**: DPoP and mTLS token validation via Authority integration
- **Routing**: Routes HTTP requests to microservices via binary protocol (TCP/TLS)
- **OpenAPI Aggregation**: Combines endpoint specs from all registered microservices
- **Health Aggregation**: Provides unified health status from downstream services
- **Rate Limiting**: Per-tenant and per-identity request throttling
- **Tenant Propagation**: Extracts tenant context and propagates to microservices

### What Gateway Does NOT Do

- **Business Logic**: No domain logic; pure routing and auth
- **Data Storage**: Stateless; no persistent state beyond connection cache
- **Direct Database Access**: Never connects to PostgreSQL directly
- **SBOM/VEX Processing**: Delegates to Scanner, Excititor, etc.

---

## 1) Solution & Project Layout

```
src/Gateway/
├── StellaOps.Gateway.WebService/
│   ├── StellaOps.Gateway.WebService.csproj
│   ├── Program.cs                        # DI bootstrap, transport init
│   ├── Dockerfile
│   ├── appsettings.json
│   ├── appsettings.Development.json
│   ├── Configuration/
│   │   ├── GatewayOptions.cs             # All configuration options
│   │   └── TransportOptions.cs           # TCP/TLS transport config
│   ├── Middleware/
│   │   ├── TenantMiddleware.cs           # Tenant context extraction
│   │   ├── RequestRoutingMiddleware.cs   # HTTP → binary routing
│   │   ├── AuthenticationMiddleware.cs   # DPoP/mTLS validation
│   │   └── RateLimitingMiddleware.cs     # Per-tenant throttling
│   ├── Services/
│   │   ├── GatewayHostedService.cs       # Transport lifecycle
│   │   ├── OpenApiAggregationService.cs  # Spec aggregation
│   │   └── HealthAggregationService.cs   # Downstream health
│   └── Endpoints/
│       ├── HealthEndpoints.cs            # /health/*, /metrics
│       └── OpenApiEndpoints.cs           # /openapi.json, /openapi.yaml
```

### Dependencies

```xml
<ItemGroup>
  <ProjectReference Include="..\..\__Libraries\StellaOps.Router.Gateway\..." />
  <ProjectReference Include="..\..\__Libraries\StellaOps.Router.Transport.Tcp\..." />
  <ProjectReference Include="..\..\__Libraries\StellaOps.Router.Transport.Tls\..." />
  <ProjectReference Include="..\..\Auth\StellaOps.Auth.ServerIntegration\..." />
</ItemGroup>
```

---

## 2) External Dependencies

| Dependency | Purpose | Required |
|------------|---------|----------|
| **Authority** | OpTok validation, DPoP/mTLS | Yes |
| **Router.Gateway** | Routing state, endpoint discovery | Yes |
| **Router.Transport.Tcp** | Binary transport (dev) | Yes |
| **Router.Transport.Tls** | Binary transport (prod) | Yes |
| **Valkey/Redis** | Rate limiting state | Optional |

---

## 3) Contracts & Data Model

### Request Flow

```
┌──────────────┐     HTTPS      ┌─────────────────┐    Binary     ┌─────────────────┐
│   Client     │ ─────────────► │    Gateway      │ ────────────► │  Microservice   │
│   (CLI/UI)   │                │   WebService    │    Frame      │  (Scanner,      │
│              │ ◄───────────── │                 │ ◄──────────── │   Policy, etc)  │
└──────────────┘     HTTPS      └─────────────────┘    Binary     └─────────────────┘
```

### Binary Frame Protocol

Gateway uses the Router binary protocol for internal communication:

| Frame Type | Purpose |
|------------|---------|
| HELLO | Microservice registration with endpoints |
| HEARTBEAT | Health check and latency measurement |
| REQUEST | HTTP request serialized to binary |
| RESPONSE | HTTP response serialized from binary |
| STREAM_DATA | Streaming response chunks |
| CANCEL | Request cancellation propagation |

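
To make the frame table concrete, here is a minimal sketch of how frames of these types could be framed on the wire. The layout is an assumption for illustration only: this document does not specify the Router wire format, so the numeric `FrameType` values, the 5-byte header (1 type byte plus a 4-byte big-endian payload length), and the `FrameCodec` helper are all hypothetical.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical numeric values; the real Router protocol may assign different ones.
public enum FrameType : byte
{
    Hello = 1,
    Heartbeat = 2,
    Request = 3,
    Response = 4,
    StreamData = 5,
    Cancel = 6,
}

// Hypothetical codec, not part of StellaOps.Router.
public static class FrameCodec
{
    // Assumed header: 1 byte frame type + 4 bytes big-endian payload length.
    public const int HeaderSize = 5;

    public static byte[] Encode(FrameType type, ReadOnlySpan<byte> payload)
    {
        var frame = new byte[HeaderSize + payload.Length];
        frame[0] = (byte)type;
        BinaryPrimitives.WriteInt32BigEndian(frame.AsSpan(1, 4), payload.Length);
        payload.CopyTo(frame.AsSpan(HeaderSize));
        return frame;
    }

    // Returns false when the buffer does not yet hold one complete frame.
    public static bool TryDecode(ReadOnlySpan<byte> buffer, out FrameType type, out byte[] payload)
    {
        type = default;
        payload = Array.Empty<byte>();
        if (buffer.Length < HeaderSize) return false;

        int length = BinaryPrimitives.ReadInt32BigEndian(buffer.Slice(1, 4));
        if (length < 0 || buffer.Length - HeaderSize < length) return false;

        type = (FrameType)buffer[0];
        payload = buffer.Slice(HeaderSize, length).ToArray();
        return true;
    }
}
```

A length prefix lets the receiver detect partial reads on a TCP/TLS stream: `TryDecode` simply reports `false` until the whole frame has arrived.
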
### Endpoint Descriptor

```csharp
public sealed class EndpointDescriptor
{
    public required string Method { get; init; }        // GET, POST, etc.
    public required string Path { get; init; }          // /api/v1/scans/{id}
    public required string ServiceName { get; init; }   // scanner
    public required string Version { get; init; }       // 1.0.0
    public TimeSpan DefaultTimeout { get; init; }       // 30s
    public bool SupportsStreaming { get; init; }        // true for large responses

    // Defaults to an empty list so callers never see null when no claims are required.
    public IReadOnlyList<ClaimRequirement> RequiringClaims { get; init; } = Array.Empty<ClaimRequirement>();
}
```

### Routing State

```csharp
public interface IRoutingStateManager
{
    ValueTask RegisterEndpointsAsync(ConnectionState conn, HelloPayload hello);
    ValueTask<InstanceSelection?> SelectInstanceAsync(string method, string path);
    ValueTask UpdateHealthAsync(ConnectionState conn, HeartbeatPayload heartbeat);
    ValueTask DrainConnectionAsync(string connectionId);
}
```

---

## 4) REST API

Gateway exposes minimal management endpoints; all business APIs are routed to microservices.

### Health Endpoints

| Endpoint | Auth | Description |
|----------|------|-------------|
| `GET /health/live` | None | Liveness probe |
| `GET /health/ready` | None | Readiness probe |
| `GET /health/startup` | None | Startup probe |
| `GET /metrics` | None | Prometheus metrics |

### OpenAPI Endpoints

| Endpoint | Auth | Description |
|----------|------|-------------|
| `GET /openapi.json` | None | Aggregated OpenAPI 3.1.0 spec |
| `GET /openapi.yaml` | None | YAML format spec |

---

## 5) Execution Flow

### Request Routing

```mermaid
sequenceDiagram
    participant C as Client
    participant G as Gateway
    participant A as Authority
    participant M as Microservice

    C->>G: HTTPS Request + DPoP Token
    G->>A: Validate Token
    A-->>G: Claims (sub, tid, scope)
    G->>G: Select Instance (Method, Path)
    G->>M: Binary REQUEST Frame
    M-->>G: Binary RESPONSE Frame
    G-->>C: HTTPS Response
```

### Microservice Registration

```mermaid
sequenceDiagram
    participant M as Microservice
    participant G as Gateway

    M->>G: TCP/TLS Connect
    M->>G: HELLO (ServiceName, Version, Endpoints)
    G->>G: Register Endpoints
    G-->>M: HELLO ACK

    loop Every 10s
        G->>M: HEARTBEAT
        M-->>G: HEARTBEAT (latency, health)
        G->>G: Update Health State
    end
```

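
The "Update Health State" step in the heartbeat loop can be sketched as a pure function of the time since the last heartbeat reply. This is only an illustration under stated assumptions: the document does not define how missed heartbeats map to health states, so the rule below (one missed interval means `Degraded`; reaching the timeout or the missed-heartbeat threshold means `Unhealthy`) is a guess. The `Unhealthy` member, `HealthSettings`, and `HeartbeatEvaluator` are hypothetical names; the three settings mirror the `health` configuration keys.

```csharp
using System;

// Healthy and Degraded appear in the selection code; Unhealthy is assumed here.
public enum InstanceHealthStatus { Healthy, Degraded, Unhealthy }

// Hypothetical settings holder mirroring the `health` configuration keys.
public sealed record HealthSettings(
    TimeSpan HeartbeatInterval,   // health.heartbeatIntervalSeconds
    TimeSpan HeartbeatTimeout,    // health.heartbeatTimeoutSeconds
    int UnhealthyThreshold);      // health.unhealthyThreshold

public static class HeartbeatEvaluator
{
    public static InstanceHealthStatus Evaluate(
        DateTimeOffset lastHeartbeatUtc, DateTimeOffset nowUtc, HealthSettings settings)
    {
        if (settings.HeartbeatInterval <= TimeSpan.Zero)
            throw new ArgumentOutOfRangeException(nameof(settings), "Heartbeat interval must be positive.");

        TimeSpan elapsed = nowUtc - lastHeartbeatUtc;

        // Whole heartbeat intervals that have passed without a reply.
        long missed = elapsed.Ticks / settings.HeartbeatInterval.Ticks;

        if (elapsed >= settings.HeartbeatTimeout || missed >= settings.UnhealthyThreshold)
            return InstanceHealthStatus.Unhealthy;

        return missed >= 1 ? InstanceHealthStatus.Degraded : InstanceHealthStatus.Healthy;
    }
}
```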
---

## 6) Instance Selection Algorithm

```csharp
public ValueTask<InstanceSelection?> SelectInstanceAsync(string method, string path)
{
    // 1. Find all endpoints matching (method, path)
    var candidates = _endpoints
        .Where(e => e.Method == method && MatchPath(e.Path, path))
        .ToList();

    // 2. Filter by health
    candidates = candidates
        .Where(c => c.Health is InstanceHealthStatus.Healthy or InstanceHealthStatus.Degraded)
        .ToList();

    // 3. Region preference: local region first, then neighbors, then everything else
    var localRegion = candidates.Where(c => c.Region == _config.Region).ToList();
    var neighborRegions = candidates.Where(c => _config.NeighborRegions.Contains(c.Region)).ToList();
    var otherRegions = candidates.Except(localRegion).Except(neighborRegions).ToList();

    var preferred = localRegion.Any() ? localRegion
        : neighborRegions.Any() ? neighborRegions
        : otherRegions;

    // 4. Within tier: prefer lower latency, then most recent heartbeat
    var selected = preferred
        .OrderBy(c => c.AveragePingMs)
        .ThenByDescending(c => c.LastHeartbeatUtc)
        .FirstOrDefault();

    // The method is not async, so wrap the synchronous result in a ValueTask.
    // Assumes the candidate element type is InstanceSelection (null when nothing matched).
    return ValueTask.FromResult<InstanceSelection?>(selected);
}
```

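
The selection code calls `MatchPath` without showing it. A minimal sketch of such a helper follows, assuming templates use `{name}` placeholders as in `/api/v1/scans/{id}`, that one placeholder matches exactly one non-empty segment, and that literal segments compare case-insensitively. The `PathMatcher` class name and these matching rules are assumptions, not the Router's documented behavior.

```csharp
using System;

// Hypothetical helper; the real MatchPath implementation is not shown in this document.
public static class PathMatcher
{
    // Matches a template such as "/api/v1/scans/{id}" against a concrete request path.
    public static bool MatchPath(string template, string path)
    {
        string[] templateSegments = Split(template);
        string[] pathSegments = Split(path);
        if (templateSegments.Length != pathSegments.Length) return false;

        for (int i = 0; i < templateSegments.Length; i++)
        {
            string segment = templateSegments[i];
            bool isParameter = segment.Length > 2 && segment[0] == '{' && segment[^1] == '}';

            // A parameter accepts any segment; empty segments were already dropped by Split.
            if (isParameter) continue;

            if (!string.Equals(segment, pathSegments[i], StringComparison.OrdinalIgnoreCase))
                return false;
        }

        return true;
    }

    private static string[] Split(string value)
    {
        // Ignore the query string, if present, before splitting into segments.
        int query = value.IndexOf('?');
        if (query >= 0) value = value[..query];
        return value.Split('/', StringSplitOptions.RemoveEmptyEntries);
    }
}
```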
---

## 7) Configuration

```yaml
gateway:
  node:
    region: "eu1"
    nodeId: "gw-eu1-01"
    environment: "prod"

  transports:
    tcp:
      enabled: true
      port: 9100
      maxConnections: 1000
      receiveBufferSize: 65536
      sendBufferSize: 65536
    tls:
      enabled: true
      port: 9443
      certificatePath: "/certs/gateway.pfx"
      certificatePassword: "${GATEWAY_CERT_PASSWORD}"
      clientCertificateMode: "RequireCertificate"
      allowedClientCertificateThumbprints: []

  routing:
    defaultTimeout: "30s"
    maxRequestBodySize: "100MB"
    streamingEnabled: true
    streamingBufferSize: 16384
    neighborRegions: ["eu2", "us1"]

  auth:
    dpopEnabled: true
    dpopMaxClockSkew: "60s"
    mtlsEnabled: true
    rateLimiting:
      enabled: true
      requestsPerMinute: 1000
      burstSize: 100
      redisConnectionString: "${REDIS_URL}"

  openapi:
    enabled: true
    cacheTtlSeconds: 300
    title: "Stella Ops API"
    version: "1.0.0"

  health:
    heartbeatIntervalSeconds: 10
    heartbeatTimeoutSeconds: 30
    unhealthyThreshold: 3
```

---

## 8) Scale & Performance

| Metric | Target | Notes |
|--------|--------|-------|
| Routing latency (P50) | <2ms | Overhead only; excludes downstream |
| Routing latency (P99) | <5ms | Under normal load |
| Concurrent connections | 10,000 | Per gateway instance |
| Requests/second | 50,000 | Per gateway instance |
| Memory footprint | <512MB | Base; scales with connections |

### Scaling Strategy

- Horizontal scaling behind load balancer
- Sticky sessions NOT required (stateless)
- Regional deployment for latency optimization
- Rate limiting via distributed Valkey/Redis

---

## 9) Security Posture

### Authentication

| Method | Description |
|--------|-------------|
| DPoP | Proof-of-possession tokens from Authority |
| mTLS | Certificate-bound tokens for machine clients |

### Authorization

- Claims-based authorization per endpoint
- Required claims defined in endpoint descriptors
- Tenant isolation via `tid` claim

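
The authorization rules above could be checked roughly as follows. This is a sketch under assumptions: the real `ClaimRequirement` type is not shown in this document, so the `(Type, Value)` record below is a stand-in, and `EndpointAuthorizer` is a hypothetical helper. Values are compared by exact match, so a space-delimited `scope` claim would need to be split before comparison.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Claims;

// Assumed shape; the real ClaimRequirement type is not shown in this document.
public sealed record ClaimRequirement(string Type, string? Value = null);

// Hypothetical helper illustrating per-endpoint claim checks.
public static class EndpointAuthorizer
{
    // True when the principal carries every required claim.
    // A requirement with a null Value only checks that the claim type is present.
    public static bool IsAuthorized(ClaimsPrincipal principal, IReadOnlyList<ClaimRequirement> required)
    {
        return required.All(r => principal.Claims.Any(c =>
            c.Type == r.Type && (r.Value is null || c.Value == r.Value)));
    }

    // Tenant context comes from the `tid` claim; null means no tenant was asserted.
    public static string? ResolveTenant(ClaimsPrincipal principal) =>
        principal.FindFirst("tid")?.Value;
}
```

A caller would typically reject the request when `ResolveTenant` returns `null`, before consulting the endpoint's required claims.
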
### Transport Security

| Component | Encryption |
|-----------|------------|
| Client → Gateway | TLS 1.3 (HTTPS) |
| Gateway → Microservices | TLS (prod), TCP (dev only) |

### Rate Limiting

- Per-tenant: Configurable requests/minute
- Per-identity: Burst protection
- Global: Circuit breaker for overload

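
One common way to combine a sustained requests-per-minute limit with a burst allowance is a token bucket. The document does not say which algorithm the Gateway uses, so the sketch below is an illustrative assumption: `TokenBucket` is a hypothetical in-memory, single-process class, with `requestsPerMinute` and `burstSize` borrowed from the `rateLimiting` configuration keys. A Valkey/Redis-backed limiter would keep the same state in a shared store instead.

```csharp
using System;

// Hypothetical in-memory token bucket for one tenant or identity (not distributed).
public sealed class TokenBucket
{
    private readonly double _capacity;         // burstSize
    private readonly double _refillPerSecond;  // requestsPerMinute / 60
    private readonly object _gate = new();
    private double _tokens;
    private DateTimeOffset _lastRefill;

    public TokenBucket(int requestsPerMinute, int burstSize, DateTimeOffset now)
    {
        _capacity = burstSize;
        _refillPerSecond = requestsPerMinute / 60.0;
        _tokens = burstSize;   // start full so an initial burst is allowed
        _lastRefill = now;
    }

    // Returns true if the request may proceed; false means it should be rejected.
    public bool TryAcquire(DateTimeOffset now)
    {
        lock (_gate)
        {
            if (now > _lastRefill)
            {
                // Add tokens for the elapsed time, capped at the burst capacity.
                double elapsedSeconds = (now - _lastRefill).TotalSeconds;
                _tokens = Math.Min(_capacity, _tokens + elapsedSeconds * _refillPerSecond);
                _lastRefill = now;
            }

            if (_tokens < 1) return false;
            _tokens -= 1;
            return true;
        }
    }
}
```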
---

## 10) Observability & Audit

### Metrics (Prometheus)

```
gateway_requests_total{service,method,path,status}
gateway_request_duration_seconds{service,method,path,quantile}
gateway_active_connections{service}
gateway_transport_frames_total{type}
gateway_auth_failures_total{reason}
gateway_rate_limit_exceeded_total{tenant}
```

### Traces (OpenTelemetry)

- Span per request: `gateway.route`
- Child span: `gateway.auth.validate`
- Child span: `gateway.transport.send`

### Logs (Structured)

```json
{
  "timestamp": "2025-12-21T10:00:00Z",
  "level": "info",
  "message": "Request routed",
  "correlationId": "abc123",
  "tenantId": "tenant-1",
  "method": "GET",
  "path": "/api/v1/scans/xyz",
  "service": "scanner",
  "durationMs": 45,
  "status": 200
}
```

---

## 11) Testing Matrix

| Test Type | Scope | Coverage Target |
|-----------|-------|-----------------|
| Unit | Routing algorithm, auth validation | 90% |
| Integration | Transport + routing flow | 80% |
| E2E | Full request path with mock services | Key flows |
| Performance | Latency, throughput, connection limits | SLO targets |
| Chaos | Connection failures, microservice crashes | Resilience |

### Test Fixtures

- `StellaOps.Router.Transport.InMemory` for transport mocking
- Mock Authority for auth testing
- `WebApplicationFactory` for integration tests

---

## 12) DevOps & Operations

### Deployment

```yaml
# Kubernetes deployment excerpt
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gateway
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: gateway
          image: stellaops/gateway:1.0.0
          ports:
            - containerPort: 8080  # HTTPS
            - containerPort: 9443  # TLS (microservices)
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "1000m"
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
```

### SLOs

| SLO | Target | Measurement |
|-----|--------|-------------|
| Availability | 99.9% | Uptime over 30 days |
| Latency P99 | <50ms | Includes downstream |
| Error rate | <0.1% | 5xx responses |

---

## 13) Roadmap

| Feature | Sprint | Status |
|---------|--------|--------|
| Core implementation | 3600.0001.0001 | TODO |
| WebSocket support | Future | Planned |
| gRPC passthrough | Future | Planned |
| GraphQL aggregation | Future | Exploration |

---

## 14) References

- Router Architecture: `docs/modules/router/architecture.md`
- OpenAPI Aggregation: `docs/modules/gateway/openapi.md`
- Authority Integration: `docs/modules/authority/architecture.md`
- Reference Architecture: `docs/product-advisories/archived/2025-12-21-reference-architecture/`

---

**Last Updated**: 2025-12-21 (Sprint 3600)