router planning

This commit is contained in:
master
2025-12-02 18:38:32 +02:00
parent 790801f329
commit 0c9e8d5d18
15 changed files with 6439 additions and 0 deletions

docs/router/specs.md Normal file

@@ -0,0 +1,494 @@
I'll group everything into requirement buckets, but keep it all as requirement statements (no rationale). This is the union of what you asked for or confirmed across the whole thread.
---
## 1. Architectural / scope requirements
* There SHALL be a single HTTP ingress service named `StellaOps.Gateway.WebService`.
* Microservices SHALL NOT expose HTTP to the router; all microservice-to-router traffic (control + data) MUST use in-house transports (UDP, TCP, certificate/TLS, RabbitMQ).
* There SHALL NOT be a separate control-plane service or protocol; each transport connection between a microservice and the router MUST carry:
* Initial registration (HELLO) and endpoint configuration.
* Ongoing heartbeats.
* Endpoint updates (if any).
* Request/response and streaming data.
* The router SHALL maintain per-connection endpoint mappings and derive its global routing state from the union of all live connections.
* The router SHALL treat request and response bodies as opaque (raw bytes / streams); all deserialization and schema handling SHALL be the microservice's responsibility.
* The system SHALL support both buffered and streaming request/response flows end-to-end.
* The design MUST reuse only the generic parts of `__SerdicaTemplate` (dynamic endpoint metadata, attribute-based endpoint discovery, request routing patterns, correlation, connection management) and MUST drop Serdica-specific stack (Oracle schema, domain logic, etc.).
* The solution MUST be a simpler, generic replacement for the existing Serdica HTTP→RabbitMQ→microservice design.
---
## 2. Service identity, region, versioning
* Each microservice instance SHALL be identified by `(ServiceName, Version, Region, InstanceId)`.
* `Version` MUST follow strict semantic versioning (`major.minor.patch`).
* Routing MUST be strict on version:
* The router MUST only route a request to instances whose `Version` equals the selected version.
* When a version is not explicitly specified by the client, a default version MUST be used (from config or metadata).
* Each gateway node SHALL have a static configuration object `GatewayNodeConfig` containing at least:
* `Region` (e.g. `"eu1"`).
* `NodeId` (e.g. `"gw-eu1-01"`).
* `Environment` (e.g. `"prod"`).
* Routing decisions MUST use `GatewayNodeConfig.Region` as the node's region; the router MUST NOT derive region from HTTP headers or URL host names.
* DNS/host naming conventions SHOULD express region in the domain (e.g. `eu1.global.stella-ops.org`, `mainoffice.contoso.stella-ops.org`), but routing logic MUST be driven by `GatewayNodeConfig.Region` rather than by host parsing.
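
The identity and strict-version rules above could be sketched roughly as follows. Only the property names come from the requirements; the record shapes, the `VersionRules` helper, and the reading of "strict" as "no pre-release or build suffix" are assumptions made for illustration.

```csharp
using System;

// Hypothetical shapes: only the member names are taken from the requirements.
public sealed record GatewayNodeConfig(string Region, string NodeId, string Environment);

public sealed record InstanceDescriptor(string ServiceName, string Version, string Region, string InstanceId);

public static class VersionRules
{
    // Accepts exactly "major.minor.patch" with decimal digits only.
    // Rejecting suffixes and leading zeros is an assumption about "strict".
    public static bool IsStrictSemVer(string version)
    {
        var parts = version.Split('.');
        if (parts.Length != 3) return false;
        foreach (var part in parts)
        {
            if (part.Length == 0) return false;
            if (part.Length > 1 && part[0] == '0') return false;
            foreach (var c in part)
            {
                if (c < '0' || c > '9') return false;
            }
        }
        return true;
    }

    // Strict routing: an instance qualifies only on exact service and version equality.
    public static bool Matches(InstanceDescriptor instance, string serviceName, string selectedVersion) =>
        string.Equals(instance.ServiceName, serviceName, StringComparison.Ordinal) &&
        string.Equals(instance.Version, selectedVersion, StringComparison.Ordinal);
}
```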
---
## 3. Endpoint identity and metadata
* Endpoint identity in the router and microservices MUST be `HTTP Method + Path`, for example:
* `Method`: one of `GET`, `POST`, `PUT`, `PATCH`, `DELETE`.
* `Path`: e.g. `/section/get/{id}`.
* The router and microservices MUST use the same path template syntax and matching rules (e.g. ASP.NET-style route templates), including decisions on:
* Case sensitivity.
* Trailing slash handling.
* Parameter segments (e.g. `{id}`).
* The router MUST resolve an incoming HTTP `(Method, Path)` to a logical endpoint descriptor that includes:
* ServiceName.
* Version.
* Method.
* Path.
* DefaultTimeout.
* `RequiringClaims`: a list of claim requirements.
* A flag indicating whether the endpoint supports streaming.
* Every place that previously spoke about `AllowedRoles` MUST be replaced with `RequiringClaims`:
* Each requirement MUST at minimum contain a `Type` and MAY contain a `Value`.
* Endpoints MUST support being configured with default `RequiringClaims` in microservices, with the possibility of external override (see Authority section).
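
To make the endpoint descriptor and the shared path-template rule more concrete, here is a minimal sketch. The record members follow the list above; the `PathTemplate` helper is hypothetical, and its choices (case-insensitive literals, ignored trailing slash, `{name}` matching exactly one non-empty segment) are assumptions standing in for the decisions the requirements leave open.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

public sealed record ClaimRequirement(string Type, string? Value = null);

public sealed record EndpointDescriptor(
    string ServiceName,
    string Version,
    string Method,
    string Path,
    TimeSpan DefaultTimeout,
    IReadOnlyList<ClaimRequirement> RequiringClaims,
    bool SupportsStreaming);

public static class PathTemplate
{
    // Matches a request path against a template such as "/section/get/{id}".
    // Assumed rules: literals compare case-insensitively, a trailing slash is
    // ignored, and a "{name}" segment captures exactly one non-empty segment.
    public static bool TryMatch(string template, string path, out Dictionary<string, string> values)
    {
        values = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        var templateSegments = template.Trim('/').Split('/');
        var pathSegments = path.Trim('/').Split('/');
        if (templateSegments.Length != pathSegments.Length) return false;

        for (var i = 0; i < templateSegments.Length; i++)
        {
            var segment = templateSegments[i];
            var isParameter = segment.Length > 2 && segment[0] == '{' && segment[^1] == '}';
            if (isParameter)
            {
                if (pathSegments[i].Length == 0) return false;
                values[segment[1..^1]] = pathSegments[i];
            }
            else if (!string.Equals(segment, pathSegments[i], StringComparison.OrdinalIgnoreCase))
            {
                return false;
            }
        }
        return true;
    }
}
```

Whatever rules are finally chosen, the same matcher would need to be shared by the gateway and the SDK so both sides resolve `(Method, Path)` identically.
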
---
## 4. Routing algorithm / instance selection
* Given a resolved endpoint `(ServiceName, Version, Method, Path)`, the router MUST:
* Filter candidate instances by:
* Matching `ServiceName`.
* Matching `Version` (strict semver equality).
* Health in an acceptable set (e.g. `Healthy` or `Degraded`).
* Instances MUST have health metadata:
* `Status` ∈ {`Unknown`, `Healthy`, `Degraded`, `Draining`, `Unhealthy`}.
* `LastHeartbeatUtc`.
* `AveragePingMs`.
* The router's instance selection MUST obey these rules:
* Region:
* Prefer instances whose `Region == GatewayNodeConfig.Region`.
* If none, fall back to configured neighbor regions.
* If none, fall back to all other regions.
* Within a chosen region tier:
* Prefer lower `AveragePingMs`.
* If several are tied, prefer more recent `LastHeartbeatUtc`.
* If still tied, use a balancing strategy (e.g. random or round-robin).
* The router MUST support a strict fallback order as requested:
* Prefer “closest by region and heartbeat and ping.”
* If the router has to choose between worse candidates, it MUST fall back in this order:
* Greater ping (latency).
* Greater heartbeat age.
* Less preferred region tier.
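
The selection rules above can be read as a cascade of filters. The sketch below is one possible interpretation, not a prescribed implementation: the `Candidate` record and `InstanceSelector` are hypothetical, candidates are assumed to be pre-filtered by `ServiceName` and `Version`, and ping ties use exact equality where a real router would more likely compare within a tolerance.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

public enum InstanceHealthStatus { Unknown, Healthy, Degraded, Draining, Unhealthy }

// Hypothetical flattened view of one routable instance.
public sealed record Candidate(
    string InstanceId,
    string Region,
    InstanceHealthStatus Status,
    DateTime LastHeartbeatUtc,
    double AveragePingMs);

public static class InstanceSelector
{
    public static Candidate? Select(
        IEnumerable<Candidate> candidates,
        string localRegion,
        IReadOnlyCollection<string> neighborRegions,
        Random? random = null)
    {
        random ??= Random.Shared;

        // Health filter: only Healthy or Degraded instances are routable here.
        var eligible = candidates
            .Where(c => c.Status is InstanceHealthStatus.Healthy or InstanceHealthStatus.Degraded)
            .ToList();
        if (eligible.Count == 0) return null;

        // Region tiers: 0 = local region, 1 = configured neighbor, 2 = anything else.
        int Tier(Candidate c) =>
            c.Region == localRegion ? 0 : neighborRegions.Contains(c.Region) ? 1 : 2;

        var bestTier = eligible.Min(c => Tier(c));
        var inTier = eligible.Where(c => Tier(c) == bestTier).ToList();

        // Within the tier: lowest ping first, then most recent heartbeat.
        var bestPing = inTier.Min(c => c.AveragePingMs);
        var byPing = inTier.Where(c => c.AveragePingMs == bestPing).ToList();
        var newestHeartbeat = byPing.Max(c => c.LastHeartbeatUtc);
        var tied = byPing.Where(c => c.LastHeartbeatUtc == newestHeartbeat).ToList();

        // Remaining ties are broken randomly (round-robin would also satisfy the rule).
        return tied[random.Next(tied.Count)];
    }
}
```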
---
## 5. Transport plugin requirements
* There MUST be a transport plugin abstraction representing how the router and microservices communicate.
* The default transport type MUST be UDP.
* Additional supported transport types MUST include:
* TCP.
* Certificate-based TCP (TLS / mTLS).
* RabbitMQ.
* There MUST NOT be an HTTP transport plugin; HTTP MUST NOT be used for microservice-to-router communications (control or data).
* Each transport plugin MUST support:
* Establishing logical connections between microservices and the router.
* Sending/receiving HELLO (registration), HEARTBEAT, optional ENDPOINTS_UPDATE.
* Sending/receiving REQUEST/RESPONSE frames.
* Supporting streaming via REQUEST_STREAM_DATA / RESPONSE_STREAM_DATA frames where the transport allows it.
* Sending/receiving CANCEL frames to abort specific in-flight requests.
* UDP transport:
* MUST be used only for small/bounded payloads (no unbounded streaming).
* MUST respect configured `MaxRequestBytesPerCall`.
* TCP and Certificate transports:
* MUST implement a length-prefixed framing protocol capable of multiplexing frames for multiple correlation IDs.
* Certificate transport MUST enforce TLS and support optional mutual TLS (verifiable peer identity).
* RabbitMQ:
* MUST implement queue/exchange naming and routing keys sufficient to represent logical connections and correlation IDs.
* MUST use message properties (e.g. `CorrelationId`) for request/response matching.
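
As an illustration of the length-prefixed framing required for the TCP and Certificate transports, the sketch below writes and reads frames on a `Stream`. The wire layout (4-byte big-endian length, 1-byte frame type, 16-byte correlation ID, then payload), the `Guid` correlation ID, and the numeric `FrameType` values are all assumptions; the requirements only name the frame kinds.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;

// Frame kinds named in the requirements; the numeric values are arbitrary.
public enum FrameType : byte
{
    Hello = 1, Heartbeat = 2, EndpointsUpdate = 3, Request = 4, Response = 5,
    RequestStreamData = 6, ResponseStreamData = 7, Cancel = 8
}

public readonly record struct Frame(FrameType Type, Guid CorrelationId, byte[] Payload);

public static class FrameCodec
{
    private const int HeaderSize = 1 + 16; // frame type + correlation ID

    public static void Write(Stream stream, Frame frame)
    {
        Span<byte> header = stackalloc byte[4 + HeaderSize];
        // The length prefix counts everything after the prefix itself.
        BinaryPrimitives.WriteInt32BigEndian(header, HeaderSize + frame.Payload.Length);
        header[4] = (byte)frame.Type;
        frame.CorrelationId.TryWriteBytes(header.Slice(5));
        stream.Write(header);
        stream.Write(frame.Payload);
    }

    // Returns null on a clean end of stream (no bytes before the next prefix).
    public static Frame? Read(Stream stream, int maxFrameBytes)
    {
        Span<byte> prefix = stackalloc byte[4];
        if (!TryReadExactly(stream, prefix)) return null;

        var length = BinaryPrimitives.ReadInt32BigEndian(prefix);
        if (length < HeaderSize || length > maxFrameBytes)
            throw new InvalidDataException($"Invalid frame length {length}.");

        var body = new byte[length];
        if (!TryReadExactly(stream, body)) throw new EndOfStreamException();

        var type = (FrameType)body[0];
        var correlationId = new Guid(body.AsSpan(1, 16));
        return new Frame(type, correlationId, body.AsSpan(HeaderSize).ToArray());
    }

    private static bool TryReadExactly(Stream stream, Span<byte> buffer)
    {
        var read = 0;
        while (read < buffer.Length)
        {
            var n = stream.Read(buffer.Slice(read));
            if (n == 0)
            {
                if (read == 0) return false;       // nothing read: clean EOF
                throw new EndOfStreamException();  // EOF in the middle of a frame
            }
            read += n;
        }
        return true;
    }
}
```

Because every frame carries its own correlation ID, frames for different in-flight requests can be interleaved on one connection and demultiplexed by the reader.
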
---
## 6. Gateway (`StellaOps.Gateway.WebService`) requirements
### 6.1 HTTP ingress pipeline
* The gateway MUST host an ASP.NET Core HTTP server.
* The HTTP middleware pipeline MUST include at least:
* Forwarded headers handling (when behind reverse proxy).
* Request logging (e.g. via Serilog) including correlation ID, service, endpoint, region, instance.
* Global error-handling middleware.
* Authentication middleware.
* `EndpointResolutionMiddleware` to resolve `(Method, Path)` → endpoint.
* Authorization middleware that enforces `RequiringClaims`.
* `RoutingDecisionMiddleware` to choose connection/instance/transport.
* `TransportDispatchMiddleware` to carry out buffered or streaming dispatch.
* The gateway MUST read `Method` and `Path` from the HTTP request and use them to resolve endpoints.
### 6.2 Per-connection state and routing view
* The gateway MUST maintain a `ConnectionState` per logical connection that includes:
* ConnectionId.
* `InstanceDescriptor` (`InstanceId`, `ServiceName`, `Version`, `Region`).
* `Status`, `LastHeartbeatUtc`, `AveragePingMs`.
* The set of endpoints that this connection serves (`(Method, Path)` → `EndpointDescriptor`).
* The transport type for that connection.
* The gateway MUST maintain a global routing state (`IGlobalRoutingState`) that:
* Resolves `(Method, Path)` to an `EndpointDescriptor` (service, version, metadata).
* Provides the set of `ConnectionState` objects that can handle a given `(ServiceName, Version, Method, Path)`.
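
A routing view derived from the union of live connections might look like the sketch below. `GlobalRoutingState` here is a hypothetical stand-in for an `IGlobalRoutingState` implementation, `ConnectionState` is flattened to the fields the lookup needs, and endpoints are keyed by exact `(Method, Path)` strings rather than by matched templates.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

public sealed record EndpointKey(string Method, string Path);

// Simplified: the requirements also list health fields and the transport type.
public sealed record ConnectionState(
    string ConnectionId,
    string ServiceName,
    string Version,
    string Region,
    IReadOnlySet<EndpointKey> Endpoints);

public sealed class GlobalRoutingState
{
    private readonly ConcurrentDictionary<string, ConnectionState> _live = new();

    // Called when a HELLO (or an endpoint update) arrives on a connection.
    public void Register(ConnectionState connection) => _live[connection.ConnectionId] = connection;

    // Called when a connection closes; its endpoints leave the global view.
    public void Remove(string connectionId) => _live.TryRemove(connectionId, out _);

    // The global view is simply the union of what live connections advertise.
    public IReadOnlyList<ConnectionState> ConnectionsFor(string serviceName, string version, EndpointKey endpoint) =>
        _live.Values
            .Where(c => c.ServiceName == serviceName && c.Version == version && c.Endpoints.Contains(endpoint))
            .ToList();
}
```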
### 6.3 Buffered vs streaming dispatch
* The gateway MUST support:
* **Buffered mode** for small to medium payloads:
* Read the entire HTTP body into memory (or temp file when above a threshold).
* Send as a single REQUEST payload.
* **Streaming mode** for large or unknown content:
* Streaming from HTTP body to microservice via a sequence of REQUEST_STREAM_DATA frames.
* Streaming from microservice back to HTTP via RESPONSE_STREAM_DATA frames.
* For each endpoint, the gateway MUST know whether it can use streaming or must use buffered mode (`SupportsStreaming` flag).
### 6.4 Opaque body handling
* The gateway MUST treat request and response bodies as opaque byte sequences and MUST NOT attempt to deserialize or interpret payload contents.
* The gateway MUST forward headers and body bytes as given and leave any schema, JSON, or other decoding to the microservice.
### 6.5 Payload and memory protection
* The gateway MUST enforce configured payload limits:
* `MaxRequestBytesPerCall`.
* `MaxRequestBytesPerConnection`.
* `MaxAggregateInflightBytes`.
* If `Content-Length` is known and exceeds `MaxRequestBytesPerCall`, the gateway MUST reject the request early (e.g. HTTP 413 Payload Too Large).
* During streaming, the gateway MUST maintain counters of:
* Bytes read for this request.
* Bytes for this connection.
* Total in-flight bytes across all requests.
* If any limit is exceeded mid-stream, the gateway MUST:
* Stop reading the HTTP body.
* Send a CANCEL frame for that correlation ID.
* Abort the stream to the microservice.
* Return an appropriate error to the client (e.g. 413 or 503) and log the incident.
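
The three counters and the mid-stream check could be tracked along these lines. This is a sketch under assumptions: `PayloadTracker` and `ByteCounter` are hypothetical names, and `MaxRequestBytesPerConnection` is interpreted as a cap on bytes currently in flight on one connection.

```csharp
using System.Threading;

public sealed record PayloadLimits(
    long MaxRequestBytesPerCall,
    long MaxRequestBytesPerConnection,
    long MaxAggregateInflightBytes);

// A mutable counter; one instance per request and one per connection.
public sealed class ByteCounter
{
    public long Value;
}

public sealed class PayloadTracker
{
    private readonly PayloadLimits _limits;
    private long _aggregateInflight;

    public PayloadTracker(PayloadLimits limits) => _limits = limits;

    public long AggregateInflightBytes => Interlocked.Read(ref _aggregateInflight);

    // Early rejection when Content-Length is known up front.
    public bool ExceedsPerCallLimit(long? contentLength) =>
        contentLength is long length && length > _limits.MaxRequestBytesPerCall;

    // Accounts for one chunk read from the HTTP body. Returns false, with
    // nothing reserved, if any limit would be exceeded; the caller would then
    // stop reading, send CANCEL, and fail the request.
    public bool TryReserve(ByteCounter request, ByteCounter connection, long chunkBytes)
    {
        var perCall = Interlocked.Add(ref request.Value, chunkBytes);
        var perConnection = Interlocked.Add(ref connection.Value, chunkBytes);
        var aggregate = Interlocked.Add(ref _aggregateInflight, chunkBytes);

        if (perCall <= _limits.MaxRequestBytesPerCall &&
            perConnection <= _limits.MaxRequestBytesPerConnection &&
            aggregate <= _limits.MaxAggregateInflightBytes)
        {
            return true;
        }

        // Roll back the reservation that pushed a counter over its limit.
        Interlocked.Add(ref request.Value, -chunkBytes);
        Interlocked.Add(ref connection.Value, -chunkBytes);
        Interlocked.Add(ref _aggregateInflight, -chunkBytes);
        return false;
    }

    // Called when the request completes or is cancelled.
    public void Release(ByteCounter request, ByteCounter connection)
    {
        var bytes = Interlocked.Exchange(ref request.Value, 0);
        Interlocked.Add(ref connection.Value, -bytes);
        Interlocked.Add(ref _aggregateInflight, -bytes);
    }
}
```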
---
## 7. Microservice SDK (`__Libraries/StellaOps.Microservice`) requirements
### 7.1 Identity & router connections
* `StellaMicroserviceOptions` MUST let microservices configure:
* `ServiceName`.
* `Version`.
* `Region`.
* `InstanceId`.
* A list of router endpoints (`Routers` / router pool) including host, port, and transport type for each.
* Optional path to a YAML config file for endpoint-level overrides.
* Providing the router pool (`Routers`, i.e. the pool of gateway nodes) MUST be mandatory; a microservice cannot start without at least one configured router endpoint.
* The router pool SHOULD be configurable via code and MAY optionally be configured via YAML with hot-reload (causing reconnections if changed).
### 7.2 Endpoint definition & discovery
* Microservice endpoints MUST be declared using attributes that specify `(Method, Path)`:
```csharp
[StellaEndpoint("POST", "/billing/invoices")]
public sealed class CreateInvoiceEndpoint : ...
```
* The SDK MUST support two handler shapes:
* Raw handler:
* `IRawStellaEndpoint` taking a `RawRequestContext` and returning a `RawResponse`, where:
* `RawRequestContext.Body` is a stream (may be buffered or streaming).
* Body contents are raw bytes.
* Typed handlers:
* `IStellaEndpoint<TRequest, TResponse>` which takes a typed request and returns a typed response.
* `IStellaEndpoint<TResponse>` which has no request payload and returns a typed response.
* The SDK MUST adapt typed endpoints to the raw model internally (microservice-side only), leaving the router unaware of types.
* Endpoint discovery MUST work by:
* Runtime reflection: scanning assemblies for `[StellaEndpoint]` and handler interfaces.
* Build-time reflection via source generation:
* A Roslyn source generator MUST generate a descriptor list at build time.
* At runtime, the SDK MUST prefer source-generated metadata and only fall back to reflection if generation is not available.
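
The handler shapes and the typed-to-raw adaptation might be sketched as below. All member signatures are assumptions, since the requirements fix only the type names. System.Text.Json is used purely for illustration because the serialization format is still an open decision, and `CreateInvoiceRequest` / `CreateInvoiceResponse` are hypothetical types. Only the runtime-reflection fallback is shown, not the source generator.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.IO;
using System.Reflection;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;

[AttributeUsage(AttributeTargets.Class)]
public sealed class StellaEndpointAttribute : Attribute
{
    public StellaEndpointAttribute(string method, string path) { Method = method; Path = path; }
    public string Method { get; }
    public string Path { get; }
}

// Assumed minimal shapes for the raw model.
public sealed record RawRequestContext(string Method, string Path, Stream Body, CancellationToken CancellationToken);
public sealed record RawResponse(int StatusCode, Stream Body);

public interface IRawStellaEndpoint
{
    Task<RawResponse> HandleAsync(RawRequestContext context);
}

public interface IStellaEndpoint<TRequest, TResponse>
{
    Task<TResponse> HandleAsync(TRequest request, CancellationToken cancellationToken);
}

// Adapts a typed handler to the raw model so the router never sees types.
public sealed class TypedEndpointAdapter<TRequest, TResponse> : IRawStellaEndpoint
{
    private readonly IStellaEndpoint<TRequest, TResponse> _inner;

    public TypedEndpointAdapter(IStellaEndpoint<TRequest, TResponse> inner) => _inner = inner;

    public async Task<RawResponse> HandleAsync(RawRequestContext context)
    {
        var request = await JsonSerializer.DeserializeAsync<TRequest>(
            context.Body, cancellationToken: context.CancellationToken);
        var response = await _inner.HandleAsync(request!, context.CancellationToken);
        return new RawResponse(200, new MemoryStream(JsonSerializer.SerializeToUtf8Bytes(response)));
    }
}

public static class EndpointDiscovery
{
    // Reflection fallback: find every class carrying [StellaEndpoint].
    public static IEnumerable<(string Method, string Path, Type Handler)> Scan(Assembly assembly)
    {
        foreach (var type in assembly.GetTypes())
        {
            var attribute = type.GetCustomAttribute<StellaEndpointAttribute>();
            if (attribute is not null) yield return (attribute.Method, attribute.Path, type);
        }
    }
}

// Hypothetical example endpoint.
public sealed record CreateInvoiceRequest(string Customer, int Amount);
public sealed record CreateInvoiceResponse(string InvoiceId);

[StellaEndpoint("POST", "/billing/invoices")]
public sealed class CreateInvoiceEndpoint : IStellaEndpoint<CreateInvoiceRequest, CreateInvoiceResponse>
{
    public Task<CreateInvoiceResponse> HandleAsync(CreateInvoiceRequest request, CancellationToken cancellationToken) =>
        Task.FromResult(new CreateInvoiceResponse($"{request.Customer}-{request.Amount}"));
}
```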
### 7.3 Endpoint metadata defaults & overrides
* Microservices MUST be able to provide default endpoint metadata:
* `SupportsStreaming` flag.
* Default timeout.
* Default `RequiringClaims`.
* Microservice-local YAML MUST be allowed to override or refine these defaults per endpoint, keyed by `(Method, Path)`.
* Precedence rules MUST be clearly defined and honored:
* Service identity & router pool: from `StellaMicroserviceOptions` (not YAML).
* Endpoint set: from code (attributes/source gen); YAML MAY override endpoint properties but SHOULD NOT create endpoints that are not present in code (policy decision to be documented).
* `RequiringClaims` and timeouts: YAML overrides defaults from code, unless overridden by central Authority.
### 7.4 Connection behavior
* On establishing a connection to a router endpoint, the SDK MUST:
* Immediately send a HELLO frame containing:
* `ServiceName`, `Version`, `Region`, `InstanceId`.
* The list of endpoints (Method, Path) with their metadata (SupportsStreaming, default timeouts, default RequiringClaims).
* At regular intervals, the SDK MUST send HEARTBEAT frames on each connection indicating:
* Instance health status.
* Optional metrics (e.g. in-flight request count, error rate).
* The SDK SHOULD support optional ENDPOINTS_UPDATE (or a re-HELLO) to update endpoint metadata at runtime if needed.
### 7.5 Request handling & streaming
* For each incoming REQUEST frame:
* The SDK MUST create a `RawRequestContext` with:
* Method.
* Path.
* Headers.
* A `Body` stream that either:
* Wraps a buffered byte array.
* Or exposes streaming reads from subsequent REQUEST_STREAM_DATA frames.
* A `CancellationToken` that will be cancelled when the router sends a CANCEL frame or the connection fails.
* The SDK MUST resolve the correct endpoint handler by `(Method, Path)` using the same path template rules as the router.
* For streaming endpoints, handlers MUST be able to read from `RawRequestContext.Body` incrementally and obey the `CancellationToken`.
### 7.6 Cancellation handling (microservice side)
* The SDK MUST maintain a map of in-flight requests by correlation ID, each containing:
* A `CancellationTokenSource`.
* The task executing the handler.
* Upon receiving a CANCEL frame for a given correlation ID, the SDK MUST:
* Look up the corresponding entry and call `CancellationTokenSource.Cancel()`.
* Handlers (both raw and typed) MUST receive a `CancellationToken`:
* They MUST observe the token and be coded to cancel promptly where needed.
* They MUST pass the token to downstream I/O operations (DB calls, file I/O, network).
* If the transport connection is closed, the SDK MUST treat it as a cancellation trigger for all outstanding requests on that connection and cancel their tokens.
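
One way to hold the in-flight map described above is sketched here. `InflightRequests` is a hypothetical class, and using `Guid` as the correlation ID type is an assumption.

```csharp
#nullable enable
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public sealed class InflightRequests
{
    private sealed class Entry
    {
        public Entry(CancellationTokenSource cts) => Cts = cts;
        public CancellationTokenSource Cts { get; }
        public Task? Handler { get; set; }
    }

    private readonly ConcurrentDictionary<Guid, Entry> _entries = new();

    public int Count => _entries.Count;

    // Registers the request before the handler starts so that a CANCEL frame
    // arriving immediately afterwards can always find it.
    public Task Start(Guid correlationId, Func<CancellationToken, Task> handler)
    {
        var entry = new Entry(new CancellationTokenSource());
        if (!_entries.TryAdd(correlationId, entry))
        {
            entry.Cts.Dispose();
            throw new InvalidOperationException($"Duplicate correlation ID {correlationId}.");
        }
        entry.Handler = RunAsync(correlationId, entry, handler);
        return entry.Handler;
    }

    // Called when a CANCEL frame arrives. Returns false if the request is unknown
    // or has already completed.
    public bool Cancel(Guid correlationId)
    {
        if (!_entries.TryGetValue(correlationId, out var entry)) return false;
        return TryCancel(entry);
    }

    // Called when the transport connection closes.
    public void CancelAll()
    {
        foreach (var entry in _entries.Values) TryCancel(entry);
    }

    private async Task RunAsync(Guid correlationId, Entry entry, Func<CancellationToken, Task> handler)
    {
        try
        {
            await handler(entry.Cts.Token);
        }
        finally
        {
            _entries.TryRemove(correlationId, out _);
            entry.Cts.Dispose();
        }
    }

    private static bool TryCancel(Entry entry)
    {
        try
        {
            entry.Cts.Cancel();
            return true;
        }
        catch (ObjectDisposedException)
        {
            return false; // the handler finished just before the cancel arrived
        }
    }
}
```

A handler passed to `Start` receives the token and is expected to forward it to its downstream I/O, for example `await Task.Delay(delay, ct)` or a database call that accepts a `CancellationToken`.
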
---
## 8. Control / health / ping requirements
* Heartbeats MUST be sent over the same connection as requests (no separate control channel).
* The router MUST:
* Track `LastHeartbeatUtc` for each connection.
* Derive `InstanceHealthStatus` based on heartbeat recency and optionally metrics.
* Drop or mark as Unhealthy any instances whose heartbeats are stale past configured thresholds.
* The router SHOULD measure network latency (ping) by:
* Timing request-response round trips, or
* Using explicit ping frames, and updating `AveragePingMs` for each connection.
* The router MUST use heartbeat and ping metrics in its routing decision as described above.
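
Deriving health from heartbeat recency and maintaining `AveragePingMs` could look like the sketch below. The two thresholds, the demotion of a stale `Healthy` instance to `Degraded`, and the use of an exponential moving average are assumptions; the requirements only ask for a status derived from recency and for an average ping.

```csharp
using System;

public enum InstanceHealthStatus { Unknown, Healthy, Degraded, Draining, Unhealthy }

public static class HealthRules
{
    // Combines the status reported in the last heartbeat with its age.
    public static InstanceHealthStatus Derive(
        InstanceHealthStatus reported,
        DateTime lastHeartbeatUtc,
        DateTime nowUtc,
        TimeSpan degradedAfter,
        TimeSpan unhealthyAfter)
    {
        var age = nowUtc - lastHeartbeatUtc;
        if (age >= unhealthyAfter) return InstanceHealthStatus.Unhealthy;
        if (age >= degradedAfter && reported == InstanceHealthStatus.Healthy)
            return InstanceHealthStatus.Degraded;
        return reported;
    }

    // Exponential moving average; a non-positive current value means "no samples yet".
    public static double UpdateAveragePing(double currentAverageMs, double sampleMs, double alpha = 0.2) =>
        currentAverageMs <= 0
            ? sampleMs
            : currentAverageMs + alpha * (sampleMs - currentAverageMs);
}
```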
---
## 9. Authorization / requiringClaims / Authority requirements
* `RequiringClaims` MUST be the only authorization metadata field; `AllowedRoles` MUST NOT be used.
* Every endpoint MUST be able to specify:
* An empty `RequiringClaims` list (no additional claims required beyond authenticated).
* Or one or more `ClaimRequirement` objects (Type + optional Value).
* The gateway MUST enforce `RequiringClaims` per request:
* Authorization MUST check that the request's user principal has all required claims for the endpoint.
* Microservices MUST provide default `RequiringClaims` as part of their HELLO metadata.
* There MUST be a mechanism for an external Authority service to override `RequiringClaims` centrally:
* Defaults MUST come from microservices.
* Authority MUST be able to push or supply overrides that the gateway applies at startup and/or at runtime.
* The gateway MUST proactively request such overrides on startup (e.g. via a special message or mechanism) before handling traffic, or as early as practical.
* Final, effective `RequiringClaims` enforced at the gateway MUST be derived from microservice defaults plus Authority overrides, with Authority taking precedence where applicable.
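
The enforcement and precedence rules in this section might be expressed as follows. `ClaimAuthorization` is a hypothetical helper, and treating an Authority override as a wholesale replacement of the microservice defaults (rather than a per-claim merge) is an assumption that the final policy would need to confirm.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Claims;

public sealed record ClaimRequirement(string Type, string? Value = null);

public static class ClaimAuthorization
{
    // Authority wins when it supplies an override; otherwise microservice defaults apply.
    public static IReadOnlyList<ClaimRequirement> Effective(
        IReadOnlyList<ClaimRequirement> microserviceDefaults,
        IReadOnlyList<ClaimRequirement>? authorityOverride) =>
        authorityOverride ?? microserviceDefaults;

    // The principal must be authenticated and satisfy every requirement.
    // An empty list therefore means "authenticated is enough".
    public static bool IsAuthorized(ClaimsPrincipal user, IReadOnlyList<ClaimRequirement> requirements)
    {
        if (user.Identity?.IsAuthenticated != true) return false;

        return requirements.All(requirement =>
            requirement.Value is null
                ? user.HasClaim(c => string.Equals(c.Type, requirement.Type, StringComparison.OrdinalIgnoreCase))
                : user.HasClaim(requirement.Type, requirement.Value));
    }
}
```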
---
## 10. Cancellation requirements (router side)
* The protocol MUST define a `FrameType.Cancel` with:
* A `CorrelationId` indicating which request to cancel.
* An optional payload containing a reason code (e.g. `"ClientDisconnected"`, `"Timeout"`, `"PayloadLimitExceeded"`).
* The router MUST send CANCEL frames when:
* The HTTP client disconnects (ASP.NET `HttpContext.RequestAborted` fires) while the request is in progress.
* The router's effective timeout for the request elapses, and no response has been received.
* The router detects payload/memory limit breaches and has to abort the request.
* The router is shutting down and explicitly aborts in-flight requests (if implemented).
* The router MUST:
* Stop forwarding any additional REQUEST_STREAM_DATA to the microservice once a CANCEL is sent.
* Stop reading any remaining response frames for that correlation and either:
* Discard them.
* Or treat them as late, log them, and ignore them.
* For streaming responses, if the HTTP client disconnects or router cancels:
* The router MUST stop writing to the HTTP response and treat any subsequent frames as ignored.
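
On the router side, the client-disconnect and timeout triggers can be combined with a linked `CancellationTokenSource`. The sketch below is a hypothetical helper: `sendCancel` stands in for whatever writes the CANCEL frame to the transport, and `requestAborted` would typically be `HttpContext.RequestAborted`.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

public static class RouterCancellation
{
    // Waits for the microservice response. If the client disconnects or the
    // timeout elapses first, sends CANCEL with a reason code and gives up.
    public static async Task<(bool Completed, string? CancelReason)> AwaitResponseAsync(
        Task response,
        TimeSpan timeout,
        CancellationToken requestAborted,
        Func<string, Task> sendCancel)
    {
        using var timeoutCts = new CancellationTokenSource(timeout);
        using var linked = CancellationTokenSource.CreateLinkedTokenSource(requestAborted, timeoutCts.Token);

        // Completes (as cancelled) when either trigger fires.
        var cancelled = Task.Delay(Timeout.Infinite, linked.Token);
        var first = await Task.WhenAny(response, cancelled);

        if (first == response)
        {
            await response; // surface any transport failure
            return (true, null);
        }

        var reason = requestAborted.IsCancellationRequested ? "ClientDisconnected" : "Timeout";
        await sendCancel(reason);
        return (false, reason);
    }
}
```

After this returns with a cancel reason, the caller would stop forwarding stream data for that correlation ID and ignore any late response frames.
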
---
## 11. Configuration and YAML requirements
* `__Libraries/StellaOps.Router.Config` MUST handle:
* Binding router config from JSON/appsettings + YAML + environment variables.
* Static service definitions:
* ServiceName.
* DefaultVersion.
* DefaultTransport.
* Endpoint list (Method, Path) with default timeouts, requiringClaims, streaming flags.
* Static instance definitions (optional):
* ServiceName, Version, Region, supported transports, plugin-specific settings.
* Global payload limits (`PayloadLimits`).
* Router YAML config MUST support hot-reload:
* Changes SHOULD be picked up at runtime without restarting the gateway.
* Hot-reload MUST cause in-memory routing state to be updated, including:
* New or removed services/endpoints.
* New or removed instances (static).
* Updated payload limits.
* Microservice YAML config MUST be optional and used for endpoint-level overrides only, not for identity or router pool configuration.
* The router pool for microservices MUST be configured via code and MAY be backed by YAML (with hot-plug / reconnection behavior) if desired.
---
## 12. Library naming / repo structure requirements
* The router configuration library MUST be named `__Libraries/StellaOps.Router.Config`.
* The microservice SDK library MUST be named `__Libraries/StellaOps.Microservice`.
* The gateway webservice MUST be named `StellaOps.Gateway.WebService`.
* There MUST be a “common” library for shared types and abstractions (e.g. `__Libraries/StellaOps.Router.Common`).
* Documentation files MUST include at least:
* `Stella Ops Router.md` (what it is, why, high-level architecture).
* `Stella Ops Router - Webserver.md` (how the webservice works).
* `Stella Ops Router - Microservice.md` (how the microservice SDK works and is implemented).
* `Stella Ops Router - Common.md` (common components and how they are implemented).
* `Migration of Webservices to Microservices.md`.
* `Stella Ops Router Documentation.md` (doc structure & guidance).
---
## 13. Documentation & developer-experience requirements
* The docs MUST be detailed; “do not spare details” implies:
* High-fidelity, concrete examples and not hand-wavy descriptions.
* For average C# developers, documentation MUST cover:
* Exact .NET / ASP.NET Core target version and runtime baseline.
* Required NuGet packages (logging, serialization, YAML parsing, RabbitMQ client, etc.).
* Exact serialization formats for frames and payloads (JSON vs MessagePack vs others).
* Exact framing rules for each transport (length-prefix for TCP/TLS, datagrams for UDP, exchanges/queues for Rabbit).
* Concrete sample `Program.cs` for:
* A gateway node.
* A microservice.
* Example endpoint implementations:
* Typed (with and without request).
* Raw streaming endpoints for large payloads.
* Example router YAML and microservice YAML with realistic values.
* Error and HTTP status mapping policy:
* E.g. “version not found → 404 or 400; no instance available → 503; timeout → 504; payload too large → 413.”
* Guidelines on:
* When to use UDP vs TCP vs RabbitMQ.
* How to configure and validate certificates for the certificate transport.
* How to write cancellation-friendly handlers (proper use of `CancellationToken`).
* Testing strategies: local dev setups, integration test harnesses, how to run router + microservice together for tests.
* Clear explanation of config precedence:
* Code options vs YAML vs microservice defaults vs Authority for claims.
* Documentation MUST answer for each major concept:
* What it is.
* Why it exists.
* How it works.
* How to use it (with examples).
* What happens when it is misused and how to debug issues.
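
As a starting point for the error and HTTP status mapping policy listed above, the mapping could be centralized in one place. The `RoutingFailure` enum is hypothetical, and choosing 404 over 400 for "version not found" is an assumption, since the requirements leave that choice open.

```csharp
using System;

public enum RoutingFailure
{
    VersionNotFound,
    NoInstanceAvailable,
    Timeout,
    PayloadTooLarge
}

public static class HttpStatusMapping
{
    public static int ToHttpStatus(RoutingFailure failure) => failure switch
    {
        RoutingFailure.VersionNotFound => 404,     // assumption: 404 rather than 400
        RoutingFailure.NoInstanceAvailable => 503, // Service Unavailable
        RoutingFailure.Timeout => 504,             // Gateway Timeout
        RoutingFailure.PayloadTooLarge => 413,     // Payload Too Large
        _ => throw new ArgumentOutOfRangeException(nameof(failure), failure, "Unmapped routing failure.")
    };
}
```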
---
## 14. Migration requirements
* There MUST be a defined migration path from `StellaOps.*.WebServices` to `StellaOps.*.Microservices`.
* Migration documentation MUST cover:
* Inventorying existing HTTP routes (Method + Path).
* Strategy A (in-place adaptation):
* Adding microservice SDK into WebService.
* Declaring endpoints with `[StellaEndpoint]`.
* Wrapping existing controller logic in handlers.
* Connecting to the router and validating registration.
* Gradually shifting traffic from direct WebService HTTP ingress to gateway routing.
* Strategy B (split):
* Extracting domain logic into shared libraries.
* Creating a dedicated microservice project using the SDK.
* Mapping routes and handlers.
* Phasing out or repurposing the original WebService.
* Ensuring cancellation tokens are wired throughout migrated code.
* Handling streaming endpoints (large uploads/downloads) via `IRawStellaEndpoint` and streaming support instead of naive buffered HTTP controllers.
---
If you want, I can next turn this requirement set into a machine-readable checklist (e.g. JSON or YAML) or derive a first-pass implementation roadmap directly from these requirements.