Update docs, sprint plans, and compose configuration

Add 12 new sprint files (Integrations, Graph, JobEngine, FE, Router, AdvisoryAI), archive completed scheduler UI sprint, update module architecture docs (router, graph, jobengine, web, integrations), and add Gitea entrypoint script for local dev.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -24,7 +24,7 @@ Rollout policy: `docs/operations/multi-tenant-rollout-and-compatibility.md`

 Each transport connection carries:

-- Initial registration (HELLO) and endpoint configuration
+- Initial identity (HELLO) and, when needed, endpoint metadata replay
 - Ongoing heartbeats
 - Request/response data frames
 - Streaming data frames
@@ -34,9 +34,11 @@ Each transport connection carries:
 ┌─────────────────┐ ┌─────────────────┐
 │ Microservice │ │ Gateway │
 │ │ HELLO │ │
-│ Endpoints: │ ─────────────────────────►│ Routing │
+│ Identity │ ─────────────────────────►│ Routing │
 │ - POST /items │ HEARTBEAT │ State │
 │ - GET /items │ ◄────────────────────────►│ │
+│ Metadata │ RESYNC / ENDPOINTS │ Connections[] │
+│ replay │ ◄────────────────────────►│ │
 │ │ │ Connections[] │
 │ │ REQUEST / RESPONSE │ │
 │ │ ◄────────────────────────►│ │
@@ -280,7 +282,9 @@ public enum FrameType : byte
     Response = 4,
     RequestStreamData = 5,
     ResponseStreamData = 6,
-    Cancel = 7
+    Cancel = 7,
+    ResyncRequest = 8,
+    EndpointsUpdate = 9
 }
 ```

@@ -415,9 +419,10 @@ Two mechanisms:
 ### Connection Behavior

 On connection:
-1. Send HELLO with instance info and endpoints
-2. Start heartbeat timer
-3. Listen for REQUEST frames
+1. Send HELLO with instance identity.
+2. Start heartbeat timer.
+3. For messaging transport, replay endpoint/schema/OpenAPI metadata only when the router explicitly asks for it.
+4. Listen for REQUEST frames.

 HELLO payload:

@@ -431,6 +436,11 @@ public sealed class HelloPayload
 }
 ```

+For messaging transport the steady-state contract is intentionally slimmer than the generic shape above:
+- startup `HELLO` carries identity and may leave `Endpoints` empty
+- the gateway sends `ResyncRequest` on service startup, administrative replay, or gateway-state miss
+- the microservice answers with `EndpointsUpdate` containing endpoints, schemas, and OpenAPI metadata
+
 ---

 ## Authorization
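To make the slim messaging contract in this hunk concrete, here is a minimal C# sketch of the gateway-side bookkeeping. It assumes hypothetical types (`InstanceKey`, `GatewayReply`, `MessagingRegistrationSketch`) that are not the router's real API. It shows a slim `HELLO` triggering `ResyncRequest`, `EndpointsUpdate` filling in endpoints, a repeated `HELLO` refreshing the same identity instead of adding a second connection, and a heartbeat from an unknown identity seeding minimal state and requesting replay, as the router docs describe.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical identity key: the router docs deduplicate on ServiceName, Version, InstanceId and transport.
public readonly record struct InstanceKey(string ServiceName, string Version, string InstanceId, string Transport);

// Hypothetical reply type: what the gateway sends back on the control path.
public enum GatewayReply { None, ResyncRequest }

// Illustrative gateway-side state for the messaging HELLO / ResyncRequest / EndpointsUpdate contract.
public sealed class MessagingRegistrationSketch
{
    // One logical connection per identity, holding that instance's replayed endpoint list.
    private readonly Dictionary<InstanceKey, IReadOnlyList<string>> _endpoints = new();

    public int ConnectionCount => _endpoints.Count;

    public bool HasEndpoints(InstanceKey key) =>
        _endpoints.TryGetValue(key, out var list) && list.Count > 0;

    // Slim HELLO: record the identity (refreshing, never duplicating) and ask for a metadata replay.
    public GatewayReply OnHello(InstanceKey key)
    {
        _endpoints.TryAdd(key, Array.Empty<string>());
        return GatewayReply.ResyncRequest;
    }

    // Heartbeat: a known identity needs nothing; an unknown one means gateway state was lost,
    // so seed minimal state from the heartbeat identity and request a replay.
    public GatewayReply OnHeartbeat(InstanceKey key) =>
        _endpoints.TryAdd(key, Array.Empty<string>()) ? GatewayReply.ResyncRequest : GatewayReply.None;

    // EndpointsUpdate: replace the endpoint set for that identity (schemas and OpenAPI omitted here).
    public void OnEndpointsUpdate(InstanceKey key, IReadOnlyList<string> endpoints) =>
        _endpoints[key] = endpoints;
}
```

In this sketch a repeated `HELLO` keeps the previously replayed endpoints until the next `EndpointsUpdate` overwrites them; whether the real gateway clears or keeps them in the meantime is not stated in the diff.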
@@ -449,7 +459,7 @@ public sealed class ClaimRequirement

 ### Precedence

-1. Microservice provides defaults in HELLO
+1. Microservice provides defaults in registration metadata
 2. Authority can override centrally
 3. Gateway enforces final effective claims

@@ -533,9 +543,12 @@ Sent at regular intervals over the same connection as requests:
 ```csharp
 public sealed class HeartbeatPayload
 {
+    public InstanceDescriptor? Instance { get; init; }
     public string InstanceId { get; init; }
     public required InstanceHealthStatus Status { get; init; }
-    public int InflightRequests { get; init; }
+    public int InFlightRequestCount { get; init; }
     public double ErrorRate { get; init; }
     public DateTime TimestampUtc { get; init; }
 }
 ```

@@ -546,13 +559,14 @@ Gateway tracks:
 - Derives status from heartbeat recency
 - Marks stale instances as Unhealthy
 - Uses health in routing decisions
 - Messaging heartbeats include instance identity so the gateway can rebuild minimal state after a gateway restart or local routing-state loss without waiting for a full reconnect.
 - Messaging transports stay push-first even when backed by notifiable queues; the missed-notification safety-net timeout is derived from the configured heartbeat interval and clamped to a short bounded window instead of falling back to a fixed long poll.
 - Gateway degraded and stale transitions are normalized against the messaging heartbeat contract. A gateway may not mark an instance `Degraded` earlier than `2x` the heartbeat interval or `Unhealthy` earlier than `3x` the heartbeat interval, even when looser defaults were configured.
 - `/health/ready` is stricter than "process started": it remains `503` until the configured required first-party microservices have live healthy or degraded registrations in router state. Local scratch compose uses this to hold the frontdoor unhealthy until the core Stella API surface has replayed HELLO after a rebuild.
 - The required-service list must use canonical router `serviceName` values, not loose product-family aliases. Gateway readiness normalizes host-style suffixes such as `-gateway`, `-web`, `.stella-ops.local`, and ports, but it does not treat sibling services as interchangeable.
 - When a request already matched a configured `Microservice` route but the target service has not registered yet, the gateway returns `503 Service Unavailable`, not `404 Not Found`. `404` remains reserved for genuinely unknown paths or missing endpoints on an otherwise registered service.

 Periodic HELLO re-registration is valid so a microservice can repopulate gateway state after a gateway restart, but it must refresh the existing logical transport connection instead of minting a second one. Gateway routing state also deduplicates by service instance identity (`ServiceName`, `Version`, `InstanceId`, transport) before re-indexing endpoints so repeated HELLO frames cannot accumulate stale route candidates.
+- Messaging resync is explicit instead of periodic: startup, administrative replay, and gateway-state misses trigger `ResyncRequest`, while normal heartbeats stay small.
+- The Valkey transport keeps its timeout fallback plus proactive randomized re-subscribe so silent Pub/Sub failures still recover. That fallback still produces some `XREADGROUP`/`XAUTOCLAIM` traffic, but it is resilience traffic rather than endpoint-catalog churn.

 ---

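The degraded/unhealthy rule in this hunk amounts to raising any configured threshold that is shorter than a heartbeat-based floor. A minimal C# sketch, assuming a hypothetical `HeartbeatThresholdSketch` helper and placeholder status strings rather than the real `InstanceHealthStatus` enum:

```csharp
using System;

// Illustrative only: applies the documented floors (Degraded >= 2x, Unhealthy >= 3x heartbeat interval).
public static class HeartbeatThresholdSketch
{
    public static (TimeSpan Degraded, TimeSpan Unhealthy) Normalize(
        TimeSpan heartbeatInterval, TimeSpan configuredDegraded, TimeSpan configuredUnhealthy)
    {
        // Configured values shorter than the floor are raised; longer values are kept.
        TimeSpan degraded = Max(configuredDegraded, heartbeatInterval * 2);
        TimeSpan unhealthy = Max(configuredUnhealthy, heartbeatInterval * 3);
        // Assumption: never let Unhealthy fire before Degraded.
        return (degraded, Max(unhealthy, degraded));
    }

    // Derives a status from heartbeat recency; the string labels are placeholders.
    public static string Classify(TimeSpan sinceLastHeartbeat, TimeSpan degraded, TimeSpan unhealthy) =>
        sinceLastHeartbeat >= unhealthy ? "Unhealthy"
        : sinceLastHeartbeat >= degraded ? "Degraded"
        : "Healthy";

    private static TimeSpan Max(TimeSpan a, TimeSpan b) => a >= b ? a : b;
}
```

With a 10 s heartbeat, configured thresholds of 5 s / 15 s are raised to 20 s / 30 s, while 60 s / 90 s are left unchanged.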
@@ -3,7 +3,7 @@
 ## Status
 - **Implemented** in Sprint 8100.0011.0003.
 - Core components: Gateway DI wiring, GatewayHostedService integration, GatewayTransportClient dispatch.
-- Last updated: 2025-12-24 (UTC).
+- Last updated: 2026-04-05 (UTC).

 ## Purpose
 Enable Gateway ↔ microservice Router traffic over an offline-friendly, Redis-compatible transport (Valkey) by using the existing **Messaging** transport layer:
@@ -27,20 +27,28 @@ This supports environments where direct TCP/TLS microservice connections are und

 ## High-Level Flow
 1) Microservice connects via messaging transport:
-   - publishes a HELLO message to the gateway request queue
+   - publishes a slim `HELLO` message with instance identity to the gateway control queue
 2) Gateway processes HELLO:
-   - registers instance + endpoints into routing state
-3) Gateway routes an HTTP request to a microservice:
+   - registers the connection identity and requests endpoint metadata replay when needed
+3) Microservice answers the replay request:
+   - publishes an `EndpointsUpdate` frame with endpoints, schemas, and OpenAPI metadata
+4) Gateway applies the metadata replay:
+   - updates routing state, effective claims, and aggregated OpenAPI
+5) Gateway routes an HTTP request to a microservice:
    - publishes a REQUEST message to the service request queue
-4) Microservice handles request:
+6) Microservice handles request:
    - executes handler (or ASP.NET bridge) and publishes a RESPONSE message
-5) Gateway returns response to the client.
+7) Gateway returns response to the client.

+Messaging-specific recovery behavior:
+- Startup resync: the gateway sends `ResyncRequest` immediately after a slim `HELLO`.
+- Administrative resync: `POST /api/v1/gateway/administration/router/resync` can request replay for one connection or the whole messaging fleet.
+- Gateway-state miss: if a heartbeat arrives for an unknown messaging connection, the gateway seeds minimal state from the heartbeat identity and requests replay instead of waiting for a reconnect.
+
 ## Queue Topology (Conceptual)
 The Messaging transport uses a small set of queues (names are configurable):
-- **Gateway request queue**: receives HELLO / HEARTBEAT / REQUEST frames from services
-- **Gateway response queue**: receives RESPONSE frames from services
-- **Per-service request queues**: gateway publishes REQUEST frames targeted to a service
+- **Gateway control queue**: receives service-to-gateway HELLO / HEARTBEAT / ENDPOINTS_UPDATE / RESPONSE frames
+- **Per-service incoming queues**: gateway publishes REQUEST / CANCEL / RESYNC_REQUEST frames targeted to a service
 - **Dead letter queues** (optional): for messages that exceed retries/leases

 ## Configuration
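The revised queue topology in this hunk splits traffic by direction rather than by frame type: everything a service sends goes to one gateway control queue, and everything the gateway sends goes to that service's incoming queue. A minimal C# sketch of that routing decision, assuming a hypothetical `MessagingFrame` enum and placeholder queue names (the real names are configurable and are not shown in this diff); dead letter queues are left out:

```csharp
using System;

// Hypothetical frame labels taken from the queue topology text; not the FrameType byte values.
public enum MessagingFrame { Hello, Heartbeat, EndpointsUpdate, Response, Request, Cancel, ResyncRequest }

public static class QueueTopologySketch
{
    // Real queue names are configurable; these formats are placeholders for illustration.
    public const string GatewayControlQueue = "gateway:control";

    public static string ServiceIncomingQueue(string serviceName) => $"service:{serviceName}:incoming";

    public static string TargetQueue(MessagingFrame frame, string serviceName) => frame switch
    {
        // Service-to-gateway frames all land on the single gateway control queue.
        MessagingFrame.Hello or MessagingFrame.Heartbeat
            or MessagingFrame.EndpointsUpdate or MessagingFrame.Response => GatewayControlQueue,
        // Gateway-to-service frames go to the target service's incoming queue.
        MessagingFrame.Request or MessagingFrame.Cancel
            or MessagingFrame.ResyncRequest => ServiceIncomingQueue(serviceName),
        _ => throw new ArgumentOutOfRangeException(nameof(frame), frame, "Unknown frame kind."),
    };
}
```

Routing `ResyncRequest` through the same per-service queue as `REQUEST` means a service needs only one consumer for everything the gateway sends it.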
@@ -87,6 +95,8 @@ if (bootstrapOptions.Transports.Messaging.Enabled)
 - **At-least-once** delivery: message queues and leases imply retries are possible; handlers should be idempotent where feasible.
 - **Lease timeouts**: must be tuned to max handler execution time; long-running tasks should respond with 202 + job id rather than blocking.
 - **Determinism**: message ordering may vary; Router must not depend on arrival order for correctness (only for freshness/telemetry).
+- **Push-first with recovery fallback**: Valkey Pub/Sub notifications wake consumers immediately when possible. If notifications silently stop, the queue layer still wakes via timeout fallback, connection-restored hooks, and randomized proactive re-subscription so requests and resync control frames do not wedge forever.
+- **Queue fallback cost**: every wake can perform `XAUTOCLAIM` plus `XREADGROUP` checks before sleeping again. That traffic is expected resilience overhead, but it is materially smaller than replaying the full endpoint catalog on every heartbeat interval.

 ## Security Notes
 - Messaging transport is internal. External identity must still be enforced at the Gateway.
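The push-first behaviour with a timeout fallback can be sketched with a `SemaphoreSlim`: the notification handler releases it, and the consumer waits with a bounded timeout before checking the queue anyway. This C# sketch assumes a hypothetical `PushFirstWaitSketch` type, and its timeout formula (half the heartbeat interval, clamped to 1 to 5 seconds) is a placeholder, since the diff only says the timeout is derived from the heartbeat interval and clamped to a short window. It does not call any Valkey or Redis API.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Illustrative push-first wait: a notification wakes the consumer at once, and a short
// safety-net timeout bounds how long a silently lost notification can stall work.
public sealed class PushFirstWaitSketch
{
    private readonly SemaphoreSlim _wake = new(0);

    public PushFirstWaitSketch(TimeSpan heartbeatInterval)
    {
        // Hypothetical derivation: half the heartbeat interval, clamped to [1 s, 5 s].
        TimeSpan derived = heartbeatInterval / 2;
        TimeSpan min = TimeSpan.FromSeconds(1), max = TimeSpan.FromSeconds(5);
        SafetyNetTimeout = derived < min ? min : derived > max ? max : derived;
    }

    public TimeSpan SafetyNetTimeout { get; }

    // Called from the Pub/Sub notification handler (hypothetical hook).
    public void Notify() => _wake.Release();

    // True when a notification arrived, false when the safety-net timeout fired.
    // Callers check the queue in both cases, so a lost notification only delays work.
    public Task<bool> WaitForWorkAsync(CancellationToken cancellationToken) =>
        _wake.WaitAsync(SafetyNetTimeout, cancellationToken);
}
```

Each wake, whether pushed or timed out, is where the queue checks mentioned above would run; keeping the timeout short bounds recovery time without returning to a fixed long poll.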
@@ -97,11 +107,12 @@ if (bootstrapOptions.Transports.Messaging.Enabled)

 ### Completed (Sprint 8100.0011.0003)
 1. ✅ Wire Messaging transport into Gateway:
-   - start/stop `MessagingTransportServer` in `GatewayHostedService`
-   - subscribe to `OnHelloReceived`, `OnHeartbeatReceived`, `OnResponseReceived`, `OnConnectionClosed` events
-   - reuse routing state updates and claims store updates
+   - start/stop `MessagingTransportServer` in `GatewayHostedService`
+   - subscribe to `OnHelloReceived`, `OnHeartbeatReceived`, `OnEndpointsUpdated`, `OnResponseReceived`, `OnConnectionClosed` events
+   - reuse routing state updates and claims store updates
 2. ✅ Extend Gateway transport client to support `TransportType.Messaging` for dispatch.
 3. ✅ Add config options (`GatewayMessagingTransportOptions`) and DI mappings.
+4. ✅ Switch messaging registration from periodic full HELLO replay to explicit `ResyncRequest` / `EndpointsUpdate` control frames.

 ### Remaining Work
 1. Add deployment examples (compose/helm) for Valkey transport.