component_architecture_gateway.md — Stella Ops Gateway (Sprint 3600)

Derived from Reference Architecture Advisory and Router Architecture Specification

Scope. The Gateway WebService is the single HTTP ingress point for all external traffic. It authenticates requests via Authority (DPoP/mTLS), routes to microservices via the Router binary protocol, aggregates OpenAPI specifications, and enforces tenant isolation.

Ownership: Platform Guild


0) Mission & Boundaries

What Gateway Does

  • HTTP Ingress: Single entry point for all external HTTP/HTTPS traffic
  • Authentication: DPoP and mTLS token validation via Authority integration
  • Routing: Routes HTTP requests to microservices via binary protocol (TCP/TLS)
  • OpenAPI Aggregation: Combines endpoint specs from all registered microservices
  • Health Aggregation: Provides unified health status from downstream services
  • Rate Limiting: Per-tenant and per-identity request throttling
  • Tenant Propagation: Extracts tenant context and propagates to microservices

What Gateway Does NOT Do

  • Business Logic: No domain logic; pure routing and auth
  • Data Storage: Stateless; no persistent state beyond connection cache
  • Direct Database Access: Never connects to PostgreSQL directly
  • SBOM/VEX Processing: Delegates to Scanner, Excititor, etc.

1) Solution & Project Layout

src/Gateway/
├── StellaOps.Gateway.WebService/
│   ├── StellaOps.Gateway.WebService.csproj
│   ├── Program.cs                          # DI bootstrap, transport init
│   ├── Dockerfile
│   ├── appsettings.json
│   ├── appsettings.Development.json
│   ├── Configuration/
│   │   ├── GatewayOptions.cs               # All configuration options
│   │   └── TransportOptions.cs             # TCP/TLS transport config
│   ├── Middleware/
│   │   ├── TenantMiddleware.cs             # Tenant context extraction
│   │   ├── RequestRoutingMiddleware.cs     # HTTP → binary routing
│   │   ├── AuthenticationMiddleware.cs     # DPoP/mTLS validation
│   │   └── RateLimitingMiddleware.cs       # Per-tenant throttling
│   ├── Services/
│   │   ├── GatewayHostedService.cs         # Transport lifecycle
│   │   ├── OpenApiAggregationService.cs    # Spec aggregation
│   │   └── HealthAggregationService.cs     # Downstream health
│   └── Endpoints/
│       ├── HealthEndpoints.cs              # /health/*, /metrics
│       └── OpenApiEndpoints.cs             # /openapi.json, /openapi.yaml

Dependencies

<ItemGroup>
  <ProjectReference Include="..\..\__Libraries\StellaOps.Router.Gateway\..." />
  <ProjectReference Include="..\..\__Libraries\StellaOps.Router.Transport.Tcp\..." />
  <ProjectReference Include="..\..\__Libraries\StellaOps.Router.Transport.Tls\..." />
  <ProjectReference Include="..\..\Auth\StellaOps.Auth.ServerIntegration\..." />
</ItemGroup>

2) External Dependencies

| Dependency | Purpose | Required |
|---|---|---|
| Authority | OpTok validation, DPoP/mTLS | Yes |
| Router.Gateway | Routing state, endpoint discovery | Yes |
| Router.Transport.Tcp | Binary transport (dev) | Yes |
| Router.Transport.Tls | Binary transport (prod) | Yes |
| Valkey/Redis | Rate limiting state | Optional |

3) Contracts & Data Model

Request Flow

┌──────────────┐     HTTPS      ┌─────────────────┐    Binary     ┌─────────────────┐
│   Client     │ ─────────────► │    Gateway      │ ────────────► │  Microservice   │
│  (CLI/UI)    │                │   WebService    │    Frame      │   (Scanner,     │
│              │ ◄───────────── │                 │ ◄──────────── │    Policy, etc) │
└──────────────┘     HTTPS      └─────────────────┘    Binary     └─────────────────┘

Binary Frame Protocol

Gateway uses the Router binary protocol for internal communication:

| Frame Type | Purpose |
|---|---|
| HELLO | Microservice registration with endpoints |
| HEARTBEAT | Health check and latency measurement |
| REQUEST | HTTP request serialized to binary |
| RESPONSE | HTTP response serialized from binary |
| STREAM_DATA | Streaming response chunks |
| CANCEL | Request cancellation propagation |
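
The wire layout of these frames is defined in the Router specification, not in this document. As an illustration only, the sketch below assumes a hypothetical layout of a 1-byte frame type followed by a 4-byte big-endian payload length and the payload. The numeric enum values, `FrameCodec`, and `HeaderSize` are assumptions made for the example.

```csharp
using System;
using System.Buffers.Binary;

// Frame types from the table above; the numeric values are assumptions.
public enum FrameType : byte
{
    Hello = 1,
    Heartbeat = 2,
    Request = 3,
    Response = 4,
    StreamData = 5,
    Cancel = 6,
}

public static class FrameCodec
{
    // Hypothetical layout: [type:1][payload length:4, big-endian][payload].
    public const int HeaderSize = 5;

    public static byte[] Encode(FrameType type, ReadOnlySpan<byte> payload)
    {
        var buffer = new byte[HeaderSize + payload.Length];
        buffer[0] = (byte)type;
        BinaryPrimitives.WriteInt32BigEndian(buffer.AsSpan(1, 4), payload.Length);
        payload.CopyTo(buffer.AsSpan(HeaderSize));
        return buffer;
    }

    // Returns false when the buffer does not yet hold one complete frame,
    // which lets a reader keep accumulating bytes from the socket.
    public static bool TryDecode(ReadOnlySpan<byte> buffer, out FrameType type, out byte[] payload)
    {
        type = default;
        payload = Array.Empty<byte>();
        if (buffer.Length < HeaderSize) return false;

        int length = BinaryPrimitives.ReadInt32BigEndian(buffer.Slice(1, 4));
        if (length < 0 || buffer.Length - HeaderSize < length) return false;

        type = (FrameType)buffer[0];
        payload = buffer.Slice(HeaderSize, length).ToArray();
        return true;
    }
}
```

A length prefix is a common way to delimit messages on a TCP stream. The real protocol may carry additional header fields, such as a correlation ID for matching RESPONSE frames to REQUEST frames.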

Endpoint Descriptor

public sealed class EndpointDescriptor
{
    public required string Method { get; init; }      // GET, POST, etc.
    public required string Path { get; init; }        // /api/v1/scans/{id}
    public required string ServiceName { get; init; } // scanner
    public required string Version { get; init; }     // 1.0.0
    public TimeSpan DefaultTimeout { get; init; }     // 30s
    public bool SupportsStreaming { get; init; }      // true for large responses
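    // Defaults to an empty list so that endpoints without claim requirements never expose null.
    public IReadOnlyList<ClaimRequirement> RequiringClaims { get; init; } = Array.Empty<ClaimRequirement>();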
}

Routing State

public interface IRoutingStateManager
{
    ValueTask RegisterEndpointsAsync(ConnectionState conn, HelloPayload hello);
    ValueTask<InstanceSelection?> SelectInstanceAsync(string method, string path);
    ValueTask UpdateHealthAsync(ConnectionState conn, HeartbeatPayload heartbeat);
    ValueTask DrainConnectionAsync(string connectionId);
}

4) REST API

Gateway exposes minimal management endpoints; all business APIs are routed to microservices.

Health Endpoints

| Endpoint | Auth | Description |
|---|---|---|
| GET /health/live | None | Liveness probe |
| GET /health/ready | None | Readiness probe |
| GET /health/startup | None | Startup probe |
| GET /metrics | None | Prometheus metrics |

OpenAPI Endpoints

| Endpoint | Auth | Description |
|---|---|---|
| GET /openapi.json | None | Aggregated OpenAPI 3.1.0 spec |
| GET /openapi.yaml | None | YAML format spec |
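
How the aggregated document is assembled is covered in the OpenAPI aggregation reference, not here. As a rough sketch only, the example below merges the `paths` objects of several per-service specs into one OpenAPI 3.1.0 document using `System.Text.Json.Nodes`. The `OpenApiAggregator` class and its "first registration wins" conflict rule are assumptions, and components and schemas are ignored for brevity.

```csharp
using System.Collections.Generic;
using System.Text.Json.Nodes;

public static class OpenApiAggregator
{
    // Merges the "paths" of every service spec into one document.
    // Assumption: when two services declare the same path, the first one wins.
    public static JsonObject Aggregate(string title, string version, IEnumerable<JsonObject> serviceSpecs)
    {
        var paths = new JsonObject();
        foreach (var spec in serviceSpecs)
        {
            if (spec["paths"] is not JsonObject servicePaths) continue;

            foreach (var (path, item) in servicePaths)
            {
                if (paths.ContainsKey(path)) continue;

                // A JsonNode can only have one parent, so copy it by re-parsing.
                paths[path] = item is null ? null : JsonNode.Parse(item.ToJsonString());
            }
        }

        return new JsonObject
        {
            ["openapi"] = "3.1.0",
            ["info"] = new JsonObject { ["title"] = title, ["version"] = version },
            ["paths"] = paths,
        };
    }
}
```

The title and version arguments correspond to the `openapi.title` and `openapi.version` settings in the configuration section. Caching the result for `cacheTtlSeconds` is left out of the sketch.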

5) Execution Flow

Request Routing

sequenceDiagram
    participant C as Client
    participant G as Gateway
    participant A as Authority
    participant M as Microservice

    C->>G: HTTPS Request + DPoP Token
    G->>A: Validate Token
    A-->>G: Claims (sub, tid, scope)
    G->>G: Select Instance (Method, Path)
    G->>M: Binary REQUEST Frame
    M-->>G: Binary RESPONSE Frame
    G-->>C: HTTPS Response

Microservice Registration

sequenceDiagram
    participant M as Microservice
    participant G as Gateway

    M->>G: TCP/TLS Connect
    M->>G: HELLO (ServiceName, Version, Endpoints)
    G->>G: Register Endpoints
    G-->>M: HELLO ACK

    loop Every 10s
        G->>M: HEARTBEAT
        M-->>G: HEARTBEAT (latency, health)
        G->>G: Update Health State
    end

6) Instance Selection Algorithm

public ValueTask<InstanceSelection?> SelectInstanceAsync(string method, string path)
{
    // 1. Find all endpoints matching (method, path)
    var candidates = _endpoints
        .Where(e => e.Method == method && MatchPath(e.Path, path))
        .ToList();

    // 2. Filter by health
    candidates = candidates
        .Where(c => c.Health is InstanceHealthStatus.Healthy or InstanceHealthStatus.Degraded)
        .ToList();

    // 3. Region preference
    var localRegion = candidates.Where(c => c.Region == _config.Region).ToList();
    var neighborRegions = candidates.Where(c => _config.NeighborRegions.Contains(c.Region)).ToList();
    var otherRegions = candidates.Except(localRegion).Except(neighborRegions).ToList();

    var preferred = localRegion.Any() ? localRegion
                  : neighborRegions.Any() ? neighborRegions
                  : otherRegions;

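    // 4. Within tier: prefer lower latency, then most recent heartbeat
    var selected = preferred
        .OrderBy(c => c.AveragePingMs)
        .ThenByDescending(c => c.LastHeartbeatUtc)
        .FirstOrDefault();

    // Selection is synchronous, so wrap the result in a completed ValueTask.
    return ValueTask.FromResult<InstanceSelection?>(selected);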
}
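
The `MatchPath` helper used in step 1 is not shown in this document. A minimal sketch of one possible implementation follows, assuming that templates use `{name}` placeholders for single path segments (as in `/api/v1/scans/{id}`) and that literal segments compare case-insensitively. Both assumptions, and the `PathTemplate` class name, are illustrative.

```csharp
using System;

public static class PathTemplate
{
    // Matches a concrete request path against a template such as
    // "/api/v1/scans/{id}". A "{name}" segment matches any single
    // non-empty segment; every other segment must match literally.
    public static bool MatchPath(string template, string path)
    {
        var templateSegments = template.Split('/', StringSplitOptions.RemoveEmptyEntries);
        var pathSegments = path.Split('/', StringSplitOptions.RemoveEmptyEntries);
        if (templateSegments.Length != pathSegments.Length) return false;

        for (var i = 0; i < templateSegments.Length; i++)
        {
            var segment = templateSegments[i];
            var isParameter = segment.Length > 2 && segment[0] == '{' && segment[^1] == '}';
            if (isParameter) continue;

            if (!string.Equals(segment, pathSegments[i], StringComparison.OrdinalIgnoreCase))
                return false;
        }

        return true;
    }
}
```

The sketch expects the path without its query string and does not support catch-all segments or parameter constraints.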

7) Configuration

gateway:
  node:
    region: "eu1"
    nodeId: "gw-eu1-01"
    environment: "prod"

  transports:
    tcp:
      enabled: true
      port: 9100
      maxConnections: 1000
      receiveBufferSize: 65536
      sendBufferSize: 65536
    tls:
      enabled: true
      port: 9443
      certificatePath: "/certs/gateway.pfx"
      certificatePassword: "${GATEWAY_CERT_PASSWORD}"
      clientCertificateMode: "RequireCertificate"
      allowedClientCertificateThumbprints: []

  routing:
    defaultTimeout: "30s"
    maxRequestBodySize: "100MB"
    streamingEnabled: true
    streamingBufferSize: 16384
    neighborRegions: ["eu2", "us1"]

  auth:
    dpopEnabled: true
    dpopMaxClockSkew: "60s"
    mtlsEnabled: true
    rateLimiting:
      enabled: true
      requestsPerMinute: 1000
      burstSize: 100
      redisConnectionString: "${REDIS_URL}"

  openapi:
    enabled: true
    cacheTtlSeconds: 300
    title: "Stella Ops API"
    version: "1.0.0"

  health:
    heartbeatIntervalSeconds: 10
    heartbeatTimeoutSeconds: 30
    unhealthyThreshold: 3
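
This document does not spell out how the three `health` settings interact. One plausible reading, sketched below, is that an instance becomes Degraded once a full heartbeat interval passes without a heartbeat and Unhealthy once `unhealthyThreshold` intervals are missed or `heartbeatTimeoutSeconds` elapses. The `HealthOptions` record, the `HeartbeatHealth` class, and the `Unhealthy` enum member are assumptions for illustration.

```csharp
using System;

// Healthy and Degraded appear in the selection algorithm; Unhealthy is assumed.
public enum InstanceHealthStatus { Healthy, Degraded, Unhealthy }

// Hypothetical options type mirroring the "health" configuration block.
public sealed record HealthOptions(
    TimeSpan HeartbeatInterval,
    TimeSpan HeartbeatTimeout,
    int UnhealthyThreshold);

public static class HeartbeatHealth
{
    public static InstanceHealthStatus Evaluate(
        DateTimeOffset lastHeartbeatUtc, DateTimeOffset nowUtc, HealthOptions options)
    {
        if (options.HeartbeatInterval <= TimeSpan.Zero)
            throw new ArgumentOutOfRangeException(nameof(options), "HeartbeatInterval must be positive.");

        var elapsed = nowUtc - lastHeartbeatUtc;
        if (elapsed <= TimeSpan.Zero) return InstanceHealthStatus.Healthy;

        // Whole heartbeat intervals that have passed without a heartbeat.
        long missed = elapsed.Ticks / options.HeartbeatInterval.Ticks;

        if (elapsed >= options.HeartbeatTimeout || missed >= options.UnhealthyThreshold)
            return InstanceHealthStatus.Unhealthy;

        return missed >= 1 ? InstanceHealthStatus.Degraded : InstanceHealthStatus.Healthy;
    }
}
```

With the values above (10 s interval, 30 s timeout, threshold 3), an instance last heard from 15 s ago would still be eligible for selection as Degraded, while one silent for 30 s would be filtered out in step 2 of the selection algorithm.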

8) Scale & Performance

| Metric | Target | Notes |
|---|---|---|
| Routing latency (P50) | <2ms | Overhead only; excludes downstream |
| Routing latency (P99) | <5ms | Under normal load |
| Concurrent connections | 10,000 | Per gateway instance |
| Requests/second | 50,000 | Per gateway instance |
| Memory footprint | <512MB | Base; scales with connections |

Scaling Strategy

  • Horizontal scaling behind load balancer
  • Sticky sessions NOT required (stateless)
  • Regional deployment for latency optimization
  • Rate limiting via distributed Valkey/Redis

9) Security Posture

Authentication

| Method | Description |
|---|---|
| DPoP | Proof-of-possession tokens from Authority |
| mTLS | Certificate-bound tokens for machine clients |

Authorization

  • Claims-based authorization per endpoint
  • Required claims defined in endpoint descriptors
  • Tenant isolation via tid claim

Transport Security

| Component | Encryption |
|---|---|
| Client → Gateway | TLS 1.3 (HTTPS) |
| Gateway → Microservices | TLS (prod), TCP (dev only) |

Rate Limiting

  • Per-tenant: Configurable requests/minute
  • Per-identity: Burst protection
  • Global: Circuit breaker for overload
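
The rate limiting algorithm itself is not named in this document. A token bucket is one common fit for a `requestsPerMinute` plus `burstSize` configuration, and the sketch below shows an in-memory version keyed by tenant or identity. The `TokenBucketLimiter` class is hypothetical, and a production gateway would keep this state in Valkey/Redis so that all instances share it.

```csharp
using System;
using System.Collections.Generic;

public sealed class TokenBucketLimiter
{
    private readonly double _refillPerSecond;
    private readonly double _capacity;
    private readonly Dictionary<string, (double Tokens, DateTimeOffset LastRefill)> _buckets = new();
    private readonly object _gate = new();

    public TokenBucketLimiter(int requestsPerMinute, int burstSize)
    {
        if (requestsPerMinute <= 0) throw new ArgumentOutOfRangeException(nameof(requestsPerMinute));
        if (burstSize <= 0) throw new ArgumentOutOfRangeException(nameof(burstSize));
        _refillPerSecond = requestsPerMinute / 60.0;
        _capacity = burstSize;
    }

    // Returns true when the request is allowed. The caller passes the current
    // time explicitly, which keeps the limiter deterministic under test.
    public bool TryAcquire(string key, DateTimeOffset nowUtc)
    {
        lock (_gate)
        {
            // A new key starts with a full bucket.
            if (!_buckets.TryGetValue(key, out var bucket))
                bucket = (_capacity, nowUtc);

            // Refill in proportion to elapsed time, capped at the burst size.
            var elapsedSeconds = Math.Max(0, (nowUtc - bucket.LastRefill).TotalSeconds);
            var tokens = Math.Min(_capacity, bucket.Tokens + elapsedSeconds * _refillPerSecond);

            var allowed = tokens >= 1;
            if (allowed) tokens -= 1;

            _buckets[key] = (tokens, nowUtc);
            return allowed;
        }
    }
}
```

With the configured values (1000 requests per minute, burst of 100), a tenant could send 100 requests at once and would then be admitted at roughly 16.7 requests per second.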

10) Observability & Audit

Metrics (Prometheus)

gateway_requests_total{service,method,path,status}
gateway_request_duration_seconds{service,method,path,quantile}
gateway_active_connections{service}
gateway_transport_frames_total{type}
gateway_auth_failures_total{reason}
gateway_rate_limit_exceeded_total{tenant}

Traces (OpenTelemetry)

  • Span per request: gateway.route
  • Child span: gateway.auth.validate
  • Child span: gateway.transport.send

Logs (Structured)

{
  "timestamp": "2025-12-21T10:00:00Z",
  "level": "info",
  "message": "Request routed",
  "correlationId": "abc123",
  "tenantId": "tenant-1",
  "method": "GET",
  "path": "/api/v1/scans/xyz",
  "service": "scanner",
  "durationMs": 45,
  "status": 200
}

11) Testing Matrix

| Test Type | Scope | Coverage Target |
|---|---|---|
| Unit | Routing algorithm, auth validation | 90% |
| Integration | Transport + routing flow | 80% |
| E2E | Full request path with mock services | Key flows |
| Performance | Latency, throughput, connection limits | SLO targets |
| Chaos | Connection failures, microservice crashes | Resilience |

Test Fixtures

  • StellaOps.Router.Transport.InMemory for transport mocking
  • Mock Authority for auth testing
  • WebApplicationFactory for integration tests

12) DevOps & Operations

Deployment

# Kubernetes deployment excerpt
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gateway
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: gateway
        image: stellaops/gateway:1.0.0
        ports:
        - containerPort: 8080   # HTTPS
        - containerPort: 9443   # TLS (microservices)
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "1000m"
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080

SLOs

| SLO | Target | Measurement |
|---|---|---|
| Availability | 99.9% | Uptime over 30 days |
| Latency P99 | <50ms | Includes downstream |
| Error rate | <0.1% | 5xx responses |

13) Roadmap

| Feature | Sprint | Status |
|---|---|---|
| Core implementation | 3600.0001.0001 | TODO |
| WebSocket support | Future | Planned |
| gRPC passthrough | Future | Planned |
| GraphQL aggregation | Future | Exploration |

14) References

  • Router Architecture: docs/modules/router/architecture.md
  • OpenAPI Aggregation: docs/modules/gateway/openapi.md
  • Authority Integration: docs/modules/authority/architecture.md
  • Reference Architecture: docs/product-advisories/archived/2025-12-21-reference-architecture/

Last Updated: 2025-12-21 (Sprint 3600)