Fix router frontdoor readiness and route contracts
@@ -0,0 +1,95 @@
# Sprint 20260310-001 - Router Frontdoor Required-Service Readiness

## Topic & Scope
- Replace the gateway's shallow "listener started" readiness contract with a required-service registration gate so scratch rebuilds do not expose first-party Stella routes before their router HELLO registrations exist.
- Return truthful `503` responses for matched microservice routes whose target service is not yet registered instead of misleading `404` errors that make reverse proxy look safer than router transport.
- Keep reverse proxy limited to external/bootstrap surfaces and document the rule explicitly for the local compose frontdoor.
- Working directory: `src/Router`.
- Allowed coordination edits: `devops/compose/docker-compose.stella-ops.yml`, `devops/compose/router-gateway-local.json`, `devops/compose/README.md`, `devops/compose/env/stellaops.env.example`, `docs/modules/router/architecture.md`, `docs/implplan/SPRINT_20260310_001_Router_frontdoor_required_service_readiness.md`.
- Expected evidence: focused router tests, live gateway readiness probes before/after restart, and a rerun of the affected Playwright/live route checks after redeploy.

## Dependencies & Concurrency
- Follows `SPRINT_20260309_008_Router_live_messaging_heartbeat_contract_repair.md`, which already narrowed the remaining post-redeploy failures to startup/readiness convergence.
- Safe parallelism: stay inside the router slice and the listed compose/docs files; do not touch unrelated search, reachability, or general frontend feature work.

## Documentation Prerequisites
- `AGENTS.md`
- `src/Router/AGENTS.md`
- `src/Router/StellaOps.Gateway.WebService/AGENTS.md`
- `src/Router/__Tests/StellaOps.Gateway.WebService.Tests/AGENTS.md`
- `docs/modules/router/architecture.md`
- `docs/modules/router/webservices-valkey-rollout-matrix.md`
- `docs/qa/feature-checks/FLOW.md`

## Delivery Tracker

### ROUTER-READY-001 - Add required-service readiness evaluation
Status: DONE
Dependency: none
Owners: Developer, QA
Task description:
- Introduce a source-level readiness evaluator that keeps `/health/ready` false until the configured required first-party microservices have live healthy/degraded router registrations.
- Preserve environment ownership of the required-service list so the local scratch compose stack can demand a stricter frontdoor than lighter dev configurations.

Completion criteria:
- [x] Gateway health options support a required microservice list.
- [x] `/health/ready` returns `503` with missing-service details until all configured required services are registered.
- [x] Focused router tests cover both missing and satisfied readiness states.
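
The readiness gate described in this task can be illustrated with a minimal, language-neutral sketch in Python. It is not the gateway's actual implementation: `evaluate_readiness`, `ReadinessResult`, the state strings, and the service names in the usage note are hypothetical, chosen only to show the rule that every required service must have a healthy or degraded registration before readiness flips to true.

```python
from dataclasses import dataclass

# Registration states that count as "live" for readiness (per the task description).
# The exact state strings are an assumption for this sketch.
READY_STATES = frozenset({"healthy", "degraded"})


@dataclass(frozen=True)
class ReadinessResult:
    status_code: int         # 200 when ready, 503 otherwise
    missing_services: tuple  # required services without a live registration


def evaluate_readiness(required_services, registrations):
    """Return 200 only when every required service has a live registration.

    required_services: iterable of service names (the environment-owned list).
    registrations: mapping of service name -> registration state string.
    """
    missing = tuple(
        sorted(
            name
            for name in required_services
            if registrations.get(name) not in READY_STATES
        )
    )
    return ReadinessResult(200 if not missing else 503, missing)
```

With a hypothetical required list of `["scanner", "policy"]` and only `scanner` registered, the sketch yields `503` and reports `policy` as missing. An empty required list is always ready, which mirrors the lighter dev configurations that do not opt into the gate.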

### ROUTER-READY-002 - Return truthful warm-up failures for missing target registrations
Status: DONE
Dependency: ROUTER-READY-001
Owners: Developer, QA
Task description:
- When a route is already classified as `Microservice` but the target service has not registered, return a `503` service-unavailable contract instead of `404`.
- Keep `404` only for genuinely unknown paths or endpoints that do not exist on a registered target service.

Completion criteria:
- [x] Targeted microservice-route misses return `503`.
- [x] Registered target service with a missing endpoint still returns `404`.
- [x] Focused middleware tests prove the distinction.
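
The `503` versus `404` distinction in this task can be sketched as a small decision function. This is an illustrative Python sketch, not the gateway middleware: `resolve_status`, the prefix-based route table, and the example paths are assumptions, and `200` simply stands in for "dispatch to the target service".

```python
def resolve_status(path, route_table, registered_endpoints):
    """Pick the status code for a request path.

    route_table: mapping of path prefix -> target service name
        (routes classified as Microservice).
    registered_endpoints: mapping of service name -> set of endpoint paths
        announced by that service's registration.
    """
    # Longest-prefix match against the configured microservice routes.
    target = None
    for prefix in sorted(route_table, key=len, reverse=True):
        if path.startswith(prefix):
            target = route_table[prefix]
            break

    if target is None:
        return 404  # genuinely unknown path

    endpoints = registered_endpoints.get(target)
    if endpoints is None:
        return 503  # route matched, but the target has not registered yet

    if path not in endpoints:
        return 404  # target is registered, endpoint does not exist on it

    return 200  # stand-in for dispatching the request to the target
```

The ordering matters: the registration check runs before the endpoint check, so a warming-up service produces `503` first and only switches to endpoint-level `404` once it has registered.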

### ROUTER-READY-003 - Make scratch compose wait for the real frontdoor
Status: DONE
Dependency: ROUTER-READY-002
Owners: Developer, QA
Task description:
- Update the mounted local router config with the required-service list for the client-ready scratch stack and make the router-gateway container healthcheck probe `/health/ready` instead of only testing for an open TCP port.
- Document the reverse-proxy exception rule: external/bootstrap only, first-party Stella APIs through router transport.

Completion criteria:
- [x] `router-gateway-local.json` declares the required first-party services for the local scratch stack.
- [x] `docker-compose.stella-ops.yml` checks router readiness instead of raw port openness.
- [x] Router architecture docs describe the readiness gate and the reverse-proxy exception rule.
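
The difference between probing `/health/ready` and testing for an open port can be shown with a small probe sketch in Python. It is not the healthcheck command used in `docker-compose.stella-ops.yml`; the function names and the `http://gateway:8080` base URL in the usage note are hypothetical. The point is that an open listener answering `503` still counts as unhealthy.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=3.0):
    """Return the HTTP status code for url, or None if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # e.g. 503 while required services are missing
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timeout, DNS failure


def healthcheck_exit_code(base_url, fetch=fetch_status):
    """Exit code for a container healthcheck: 0 only when /health/ready answers 200."""
    status = fetch(base_url.rstrip("/") + "/health/ready")
    return 0 if status == 200 else 1
```

A wrapper script could call `sys.exit(healthcheck_exit_code("http://gateway:8080"))`. The `fetch` parameter is injectable so the decision logic can be exercised without a running gateway.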

### ROUTER-READY-004 - Bound microservice HELLO recovery after gateway restart
Status: DONE
Dependency: ROUTER-READY-003
Owners: Developer, QA
Task description:
- Remove the hidden fixed 30-heartbeat HELLO replay heuristic from the microservice SDK and replace it with an explicit registration refresh interval that repopulates gateway state within seconds after a gateway restart.
- Flow the setting through the shared ASP.NET router integration so services can keep the default bounded contract or override it intentionally.

Completion criteria:
- [x] Stella microservice options expose a positive registration refresh interval.
- [x] Router connection manager replays HELLO on the configured interval without waiting for dozens of heartbeats.
- [x] Focused SDK and integration-helper tests cover the new contract.
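
The shift from a heartbeat-count heuristic to a time-based refresh can be sketched as follows. This Python sketch is an illustration only, not the SDK's connection manager: `HelloRefreshSchedule` and its method are hypothetical names, and the 10-second default is taken from the decision recorded in this sprint.

```python
class HelloRefreshSchedule:
    """Decides when to replay HELLO based on elapsed time, not heartbeat count."""

    def __init__(self, refresh_interval_seconds=10.0):
        # The interval must be positive, matching the completion criterion.
        if refresh_interval_seconds <= 0:
            raise ValueError("registration refresh interval must be positive")
        self.refresh_interval_seconds = refresh_interval_seconds
        self._last_hello_at = None  # no HELLO sent yet

    def should_send_hello(self, now):
        """Return True when a HELLO is due and record the send time.

        now: a monotonic clock reading in seconds.
        """
        due = (
            self._last_hello_at is None
            or now - self._last_hello_at >= self.refresh_interval_seconds
        )
        if due:
            self._last_hello_at = now
        return due
```

A caller would pass `time.monotonic()` on each loop tick. Because the decision depends only on elapsed time, the worst-case recovery delay after a gateway restart is bounded by the interval regardless of how often heartbeats are sent.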

## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2026-03-10 | Sprint created after live evidence showed the gateway was returning `404 TargetService=(none)` during post-redeploy convergence even though the mounted route table and aggregated OpenAPI already knew the affected first-party paths. | Developer |
| 2026-03-10 | Live restart evidence showed the deeper recovery gap: services only replayed HELLO every 30 heartbeats, leaving the gateway honestly unready for minutes after restart. Added a bounded HELLO refresh task under the same sprint. | Developer |
| 2026-03-10 | Audited the frontdoor refactor end to end: focused router tests passed, fresh-stack redeploy converged on `/health/ready`, restart probes now return `503` for missing target registrations before flipping to endpoint-level `404`, and the Playwright canonical route sweep rerun isolated the remaining failures to unrelated frontend routes under `/ops/policy`, `/ops/operations/*`, and trust-signing. | Developer |

## Decisions & Risks
- Decision: readiness is environment-owned. The gateway source exposes the contract, while the local compose stack opts into a concrete required-service list for scratch QA.
- Decision: reverse proxy remains valid for external/bootstrap surfaces such as Rekor, OIDC/browser flows, and SPA/static assets; it is not the preferred path for first-party Stella APIs.
- Decision: HELLO recovery is now time-based and explicit rather than a hidden multiple of heartbeat count. The default registration refresh interval is 10 seconds so a gateway restart cannot strand first-party routes behind stale state for minutes.
- Decision: the dedicated `router-gateway-local.reverseproxy.json` fallback mode is removed from active compose guidance. The supported scratch stack uses the microservice-first table with narrowly-scoped reverse proxy exceptions inside the same config.
- Risk: if the required-service list is too broad for the current compose footprint, `/health/ready` could remain false. Mitigation: use the actual mounted local stack as the authority and verify registrations live after redeploy.

## Next Checkpoints
- 2026-03-10: land readiness evaluation and route-level `503` contract.
- 2026-03-10: rebuild router-gateway, redeploy, and verify restart behavior with live probes.
- 2026-03-10: rerun the targeted Playwright/router checks on the warmed stack.