Fix router frontdoor readiness and route contracts
@@ -0,0 +1,95 @@
# Sprint 20260310-001 - Router Frontdoor Required-Service Readiness

## Topic & Scope
- Replace the gateway's shallow "listener started" readiness contract with a required-service registration gate so scratch rebuilds do not expose first-party Stella routes before their router HELLO registrations exist.
- Return truthful `503` responses for matched microservice routes whose target service is not yet registered instead of misleading `404` errors that make reverse proxy look safer than router transport.
- Keep reverse proxy limited to external/bootstrap surfaces and document the rule explicitly for the local compose frontdoor.
- Working directory: `src/Router`.
- Allowed coordination edits: `devops/compose/docker-compose.stella-ops.yml`, `devops/compose/router-gateway-local.json`, `devops/compose/README.md`, `devops/compose/env/stellaops.env.example`, `docs/modules/router/architecture.md`, `docs/implplan/SPRINT_20260310_001_Router_frontdoor_required_service_readiness.md`.
- Expected evidence: focused router tests, live gateway readiness probes before/after restart, and a rerun of the affected Playwright/live route checks after redeploy.

## Dependencies & Concurrency
- Follows `SPRINT_20260309_008_Router_live_messaging_heartbeat_contract_repair.md`, which already narrowed the remaining post-redeploy failures to startup/readiness convergence.
- Safe parallelism: stay inside the router slice and the listed compose/docs files; do not touch unrelated search, reachability, or general frontend feature work.

## Documentation Prerequisites
- `AGENTS.md`
- `src/Router/AGENTS.md`
- `src/Router/StellaOps.Gateway.WebService/AGENTS.md`
- `src/Router/__Tests/StellaOps.Gateway.WebService.Tests/AGENTS.md`
- `docs/modules/router/architecture.md`
- `docs/modules/router/webservices-valkey-rollout-matrix.md`
- `docs/qa/feature-checks/FLOW.md`

## Delivery Tracker

### ROUTER-READY-001 - Add required-service readiness evaluation
Status: DONE
Dependency: none
Owners: Developer, QA
Task description:
- Introduce a source-level readiness evaluator that keeps `/health/ready` false until the configured required first-party microservices have live healthy/degraded router registrations.
- Preserve environment ownership of the required-service list so the local scratch compose stack can demand a stricter frontdoor than lighter dev configurations.

Completion criteria:
- [x] Gateway health options support a required microservice list.
- [x] `/health/ready` returns `503` with missing-service details until all configured required services are registered.
- [x] Focused router tests cover both missing and satisfied readiness states.
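
The readiness rule described in this task can be sketched in a few lines. This is an illustrative Python model, not the gateway's actual implementation: the `Registration` type, its field names, the exact-match comparison, and the service names used in the usage note are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Registration states that count as "live" for readiness, per the task description.
READY_STATUSES = {"Healthy", "Degraded"}


@dataclass(frozen=True)
class Registration:
    """Hypothetical view of one router registration."""
    service_name: str  # assumed to hold the canonical router serviceName
    status: str        # e.g. "Healthy", "Degraded", "Unhealthy"


def evaluate_readiness(
    required: Iterable[str], registrations: Iterable[Registration]
) -> Tuple[int, List[str]]:
    """Return (HTTP status, missing services) for a /health/ready probe."""
    live = {r.service_name for r in registrations if r.status in READY_STATUSES}
    missing = sorted(name for name in required if name not in live)
    # An empty required list is always ready, which matches lighter dev configurations.
    return (503 if missing else 200), missing
```

With a hypothetical required list of `["scanner", "policy"]` and only `scanner` registered as `Healthy`, the function returns `503` with `["policy"]` as the missing-service detail. Once `policy` registers as `Healthy` or `Degraded`, it returns `200` with an empty list.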

### ROUTER-READY-002 - Return truthful warm-up failures for missing target registrations
Status: DONE
Dependency: ROUTER-READY-001
Owners: Developer, QA
Task description:
- When a route is already classified as `Microservice` but the target service has not registered, return a service-unavailable contract instead of `404`.
- Keep `404` only for genuinely unknown paths or endpoints that do not exist on a registered target service.

Completion criteria:
- [x] Targeted microservice-route misses return `503`.
- [x] Registered target service with a missing endpoint still returns `404`.
- [x] Focused middleware tests prove the distinction.
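
The `503` versus `404` distinction above reduces to a three-way decision. The sketch below models it in Python under stated assumptions: `dispatch_status`, the `registered` mapping of service name to endpoint paths, and the sample paths are hypothetical and do not mirror the real middleware types.

```python
from typing import Dict, Optional, Set


def dispatch_status(
    matched_service: Optional[str],
    registered: Dict[str, Set[str]],
    path: str,
) -> int:
    """Pick the response status for a request hitting the route table.

    matched_service is the target of the matched `Microservice` route,
    or None when no configured route matched the path at all.
    """
    if matched_service is None:
        return 404  # genuinely unknown path
    endpoints = registered.get(matched_service)
    if endpoints is None:
        return 503  # route is known, but the target has not registered yet
    if path not in endpoints:
        return 404  # target is registered, endpoint does not exist on it
    return 200      # stand-in for "forward to the microservice"
```

The middle branch is the warm-up case: the route table already knows the path, so the honest answer is "not available yet" rather than "does not exist".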

### ROUTER-READY-003 - Make scratch compose wait for the real frontdoor
Status: DONE
Dependency: ROUTER-READY-002
Owners: Developer, QA
Task description:
- Update the mounted local router config with the required-service list for the client-ready scratch stack and make the router-gateway container healthcheck probe `/health/ready` instead of only testing for an open TCP port.
- Document the reverse-proxy exception rule: external/bootstrap only, first-party Stella APIs through router transport.

Completion criteria:
- [x] `router-gateway-local.json` declares the required first-party services for the local scratch stack.
- [x] `docker-compose.stella-ops.yml` checks router readiness instead of raw port openness.
- [x] Router architecture docs describe the readiness gate and the reverse-proxy exception rule.

### ROUTER-READY-004 - Bound microservice HELLO recovery after gateway restart
Status: DONE
Dependency: ROUTER-READY-003
Owners: Developer, QA
Task description:
- Remove the hidden fixed 30-heartbeat HELLO replay heuristic from the microservice SDK and replace it with an explicit registration refresh interval that repopulates gateway state within seconds after a gateway restart.
- Flow the setting through the shared ASP.NET router integration so services can keep the default bounded contract or override it intentionally.

Completion criteria:
- [x] Stella microservice options expose a positive registration refresh interval.
- [x] Router connection manager replays HELLO on the configured interval without waiting for dozens of heartbeats.
- [x] Focused SDK and integration-helper tests cover the new contract.
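
A time-based replay rule of this kind can be modelled as below. This is a minimal Python sketch, not the SDK's code: `HelloRefreshSchedule` and its method names are hypothetical, and the 10-second default is taken from the Decisions & Risks section of this sprint.

```python
from datetime import timedelta
from typing import Optional

# Default taken from this sprint's decision log.
DEFAULT_REFRESH = timedelta(seconds=10)


class HelloRefreshSchedule:
    """Decides when to replay HELLO, based on elapsed time only."""

    def __init__(self, interval: timedelta = DEFAULT_REFRESH) -> None:
        if interval <= timedelta(0):
            raise ValueError("registration refresh interval must be positive")
        self.interval = interval
        self._last_hello: Optional[float] = None  # monotonic seconds

    def should_send_hello(self, now: float) -> bool:
        """Return True (and record the send) when a HELLO replay is due."""
        due = (
            self._last_hello is None
            or now - self._last_hello >= self.interval.total_seconds()
        )
        if due:
            self._last_hello = now
        return due
```

Because the decision depends only on a clock value, the worst-case recovery time after a gateway restart is one refresh interval, regardless of how the heartbeat interval is configured. In real code the caller would pass something like `time.monotonic()` as `now`.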

## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2026-03-10 | Sprint created after live evidence showed the gateway was returning `404 TargetService=(none)` during post-redeploy convergence even though the mounted route table and aggregated OpenAPI already knew the affected first-party paths. | Developer |
| 2026-03-10 | Live restart evidence showed the deeper recovery gap: services only replayed HELLO every 30 heartbeats, leaving the gateway honestly unready for minutes after restart. Added a bounded HELLO refresh task under the same sprint. | Developer |
| 2026-03-10 | Audited the frontdoor refactor end to end: focused router tests passed, fresh-stack redeploy converged on `/health/ready`, restart probes now return `503` for missing target registrations before flipping to endpoint-level `404`, and the Playwright canonical route sweep rerun isolated the remaining failures to unrelated frontend routes under `/ops/policy`, `/ops/operations/*`, and trust-signing. | Developer |

## Decisions & Risks
- Decision: readiness is environment-owned. The gateway source exposes the contract, while the local compose stack opts into a concrete required-service list for scratch QA.
- Decision: reverse proxy remains valid for external/bootstrap surfaces such as Rekor, OIDC/browser flows, and SPA/static assets; it is not the preferred path for first-party Stella APIs.
- Decision: HELLO recovery is now time-based and explicit rather than a hidden multiple of heartbeat count. The default registration refresh interval is 10 seconds so a gateway restart cannot strand first-party routes behind stale state for minutes.
- Decision: the dedicated `router-gateway-local.reverseproxy.json` fallback mode is removed from active compose guidance. The supported scratch stack uses the microservice-first table with narrowly-scoped reverse proxy exceptions inside the same config.
- Risk: if the required-service list is too broad for the current compose footprint, `/health/ready` could remain false. Mitigation: use the actual mounted local stack as the authority and verify registrations live after redeploy.

## Next Checkpoints
- 2026-03-10: land readiness evaluation and route-level `503` contract.
- 2026-03-10: rebuild router-gateway, redeploy, and verify restart behavior with live probes.
- 2026-03-10: rerun the targeted Playwright/router checks on the warmed stack.
@@ -85,6 +85,8 @@ Route types:
| `NotFoundPage` | HTML file served on 404 (after all other middleware) |
| `ServerErrorPage` | HTML file served on 5xx (after all other middleware) |

Reverse proxy is reserved for external/bootstrap surfaces such as OIDC browser flows, Rekor, and frontdoor static assets. First-party Stella API surfaces are expected to use `Microservice` routing so the gateway remains the single routing authority instead of silently bypassing router registration state.

### Pipeline Order

System paths (`/health`, `/metrics`, `/openapi.*`) bypass the route table entirely. The dispatch middleware runs before the microservice pipeline:
@@ -540,6 +542,9 @@ Gateway tracks:
- Uses health in routing decisions
- Messaging transports stay push-first even when backed by notifiable queues; the missed-notification safety-net timeout is derived from the configured heartbeat interval and clamped to a short bounded window instead of falling back to a fixed long poll.
- Gateway degraded and stale transitions are normalized against the messaging heartbeat contract. A gateway may not mark an instance `Degraded` earlier than `2x` the heartbeat interval or `Unhealthy` earlier than `3x` the heartbeat interval, even when looser defaults were configured.
- `/health/ready` is stricter than "process started": it remains `503` until the configured required first-party microservices have live healthy or degraded registrations in router state. Local scratch compose uses this to hold the frontdoor unhealthy until the core Stella API surface has replayed HELLO after a rebuild.
- The required-service list must use canonical router `serviceName` values, not loose product-family aliases. Gateway readiness normalizes host-style suffixes such as `-gateway`, `-web`, `.stella-ops.local`, and ports, but it does not treat sibling services as interchangeable.
- When a request already matched a configured `Microservice` route but the target service has not registered yet, the gateway returns `503 Service Unavailable`, not `404 Not Found`. `404` remains reserved for genuinely unknown paths or missing endpoints on an otherwise registered service.
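
Two of the rules in this list lend themselves to a small worked example: the `2x` / `3x` heartbeat floors and the host-style name normalization. The Python sketch below is illustrative only. It reads the floor rule as "raise any shorter configured threshold to the floor", and the lower-casing, the stripping order, and the sample names are assumptions rather than documented gateway behavior.

```python
import re
from datetime import timedelta
from typing import Tuple

_DOMAIN = ".stella-ops.local"
_SUFFIXES = ("-gateway", "-web")


def effective_thresholds(
    heartbeat: timedelta, degraded: timedelta, unhealthy: timedelta
) -> Tuple[timedelta, timedelta]:
    """Raise configured thresholds to the 2x / 3x heartbeat floors."""
    return max(degraded, 2 * heartbeat), max(unhealthy, 3 * heartbeat)


def normalize_service_name(value: str) -> str:
    """Reduce a host-style value to a bare service name (assumed rule order)."""
    name = value.strip().lower()
    name = re.sub(r":\d+$", "", name)      # drop a trailing port
    if name.endswith(_DOMAIN):
        name = name[: -len(_DOMAIN)]       # drop the local domain
    for suffix in _SUFFIXES:
        if name.endswith(suffix):
            name = name[: -len(suffix)]    # drop one host-style suffix
            break
    return name
```

With a 10-second heartbeat, configured thresholds of 5 and 15 seconds become 20 and 30 seconds, while 45 and 90 seconds pass through unchanged. A hypothetical value such as `scanner-web.stella-ops.local:8080` normalizes to `scanner`, but `policy-engine` stays distinct from `policy`, which matches the rule that sibling services are not interchangeable.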

Periodic HELLO re-registration is valid so a microservice can repopulate gateway state after a gateway restart, but it must refresh the existing logical transport connection instead of minting a second one. Gateway routing state also deduplicates by service instance identity (`ServiceName`, `Version`, `InstanceId`, transport) before re-indexing endpoints so repeated HELLO frames cannot accumulate stale route candidates.
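
The deduplication rule can be illustrated with a keyed map in which a repeated HELLO replaces the previous entry instead of appending to it. This is a Python sketch under assumptions: `InstanceKey` mirrors the identity tuple named above, while `RoutingState`, its methods, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List


@dataclass(frozen=True)
class InstanceKey:
    """Service instance identity used for deduplication."""
    service_name: str
    version: str
    instance_id: str
    transport: str


class RoutingState:
    """Hypothetical routing index keyed by instance identity."""

    def __init__(self) -> None:
        self._endpoints: Dict[InstanceKey, FrozenSet[str]] = {}

    def apply_hello(self, key: InstanceKey, endpoints: Iterable[str]) -> None:
        # Replace, never append: a repeated HELLO overwrites the previous
        # endpoint set for the same identity, so stale entries cannot pile up.
        self._endpoints[key] = frozenset(endpoints)

    def candidates(self, path: str) -> List[str]:
        """Instance ids that currently serve the given path."""
        return sorted(
            key.instance_id
            for key, endpoints in self._endpoints.items()
            if path in endpoints
        )
```

Replaying the same HELLO any number of times leaves exactly one route candidate per instance, and a HELLO that drops an endpoint removes that endpoint from the index on the next replay.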