Repair router frontdoor convergence and live route contracts

This commit is contained in:
master
2026-03-09 19:09:19 +02:00
parent 49d1c57597
commit bf937c9395
25 changed files with 740 additions and 61 deletions

@@ -0,0 +1,78 @@
# Sprint 20260309-008 - Router Live Messaging Heartbeat Contract Repair
## Topic & Scope
- Repair the live frontdoor `503` cluster triggered after the full scratch rebuild, where healthy services are marked degraded or unhealthy because gateway heartbeat thresholds undercut the messaging transport's missed-notification fallback.
- Preserve the Valkey push-first CPU fix while ensuring a missed wake-up cannot stall queue consumption long enough to trip false gateway health failures.
- Rebuild and redeploy the affected router slice, then rerun the authenticated live Playwright sweep to confirm the shared `503` backlog collapses before moving on to page-specific defects.
- Working directory: `src/Router`.
- Allowed coordination edits: `docs/modules/router/architecture.md`, `docs/implplan/SPRINT_20260309_008_Router_live_messaging_heartbeat_contract_repair.md`.
- Expected evidence: focused router unit tests, rebuilt router image, redeployed gateway, refreshed live Playwright sweep artifact.
## Dependencies & Concurrency
- Depends on `SPRINT_20260309_001_Platform_scratch_setup_bootstrap_restore.md` for the rebuilt baseline and `SPRINT_20260309_003_Router_live_frontdoor_contract_repair.md` for the already-restored frontdoor bindings.
- Safe parallelism: avoid the unrelated search and component-revival slices already landed by other agents; this sprint is limited to router messaging wake-up behavior, gateway health threshold policy, and live verification artifacts.
## Documentation Prerequisites
- `AGENTS.md`
- `src/Router/AGENTS.md`
- `src/Router/StellaOps.Gateway.WebService/AGENTS.md`
- `docs/code-of-conduct/CODE_OF_CONDUCT.md`
- `docs/qa/feature-checks/FLOW.md`
- `docs/modules/router/architecture.md`
- `docs/modules/platform/architecture-overview.md`
## Delivery Tracker
### ROUTER-LIVE-008-001 - Bound the messaging wake-up fallback to heartbeat cadence
Status: DONE
Dependency: none
Owners: Developer, QA
Task description:
- Replace the fixed 30-second notifiable-queue fallback with a heartbeat-aware safety-net timeout so a missed Valkey pub/sub wake-up does not leave the gateway or microservices asleep long enough to look dead.
- Keep the transport push-first and low-CPU: the fallback exists only for missed notifications, not as a return to aggressive polling.
Completion criteria:
- [x] Messaging queue waits derive their safety-net timeout from the configured heartbeat interval instead of a fixed 30-second constant.
- [x] Focused router tests cover the timeout calculation contract.
- [x] The transport remains push-first for notifiable queues.
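The timeout contract above can be sketched as follows. This is a minimal illustration, not the shipped Router code: the function name and both bounds are hypothetical, and using the heartbeat interval itself as the wait is an assumption. The sprint only states that the fallback is derived from the heartbeat interval and kept short.

```python
from datetime import timedelta

# Hypothetical bounds for illustration only; the real clamp values are not
# stated in this sprint.
MIN_FALLBACK = timedelta(seconds=1)
MAX_FALLBACK = timedelta(seconds=10)


def safety_net_timeout(heartbeat_interval: timedelta) -> timedelta:
    """Return how long a queue wait may sleep before re-checking on its own.

    The wait normally ends early when a push notification arrives; this value
    only matters when that wake-up is missed.
    """
    if heartbeat_interval <= timedelta(0):
        raise ValueError("heartbeat interval must be positive")
    # Clamp the heartbeat-derived value into the bounded window.
    return max(MIN_FALLBACK, min(heartbeat_interval, MAX_FALLBACK))
```

With these assumed bounds, a 5-second heartbeat yields a 5-second fallback, while a 30-second heartbeat is capped at 10 seconds rather than reproducing the old fixed 30-second wait.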
### ROUTER-LIVE-008-002 - Harden gateway health thresholds against heartbeat jitter
Status: DONE
Dependency: ROUTER-LIVE-008-001
Owners: Developer, QA
Task description:
- Normalize gateway degraded/stale thresholds against the configured messaging heartbeat interval so the live gateway cannot mark healthy instances degraded or unhealthy earlier than the transport contract allows.
- Prefer a durable source-level policy over a compose-only tweak so the next scratch rebuild preserves the fix.
Completion criteria:
- [x] Gateway health options are normalized so the degraded threshold is at least 2x, and the stale threshold at least 3x, the configured messaging heartbeat interval.
- [x] Focused router tests lock the health-threshold normalization behavior.
- [x] The router architecture dossier documents the heartbeat-to-health contract.
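A minimal sketch of the normalization rule, assuming the thresholds are plain durations. The type and function names are hypothetical; only the 2x/3x floors come from this sprint.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class HealthThresholds:
    """Hypothetical stand-in for the gateway health options."""
    degraded_after: timedelta
    stale_after: timedelta


def normalize_thresholds(
    configured: HealthThresholds, heartbeat_interval: timedelta
) -> HealthThresholds:
    """Raise thresholds that undercut the heartbeat contract; keep larger ones."""
    return HealthThresholds(
        # Degraded no earlier than 2x the heartbeat interval.
        degraded_after=max(configured.degraded_after, 2 * heartbeat_interval),
        # Stale (unhealthy) no earlier than 3x the heartbeat interval.
        stale_after=max(configured.stale_after, 3 * heartbeat_interval),
    )
```

For example, with a 10-second heartbeat, configured thresholds of 15s/30s become 20s/30s, while 60s/90s pass through unchanged.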
### ROUTER-LIVE-008-003 - Rebuild, redeploy, and verify the live frontdoor
Status: DONE
Dependency: ROUTER-LIVE-008-002
Owners: QA
Task description:
- Rebuild and redeploy the router slice, rerun the authenticated live sweep, and record whether the shared `503` cluster is removed or narrowed for the next iteration.
Completion criteria:
- [x] Router artifacts are rebuilt and redeployed on the live compose stack.
- [x] The authenticated live Playwright sweep is rerun from the rebuilt stack.
- [x] Remaining failures are recorded with current evidence if any survive.
## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2026-03-09 | Sprint created after the rebuilt live stack showed shared gateway `503` failures caused by heartbeat health flapping rather than page-local defects. | Developer |
| 2026-03-09 | Updated messaging wait fallback to use heartbeat-derived safety-net timeouts, normalized gateway degraded/stale thresholds against messaging heartbeat cadence, and added focused router tests for both contracts. | Developer |
| 2026-03-09 | Rebuilt the full image set, redeployed the live compose stack, then reran authenticated Playwright sweeps. The first post-redeploy sweep showed transient cross-service `404` convergence misses; the second consecutive sweep completed `111/111`, recorded in `src/Web/StellaOps.Web/output/playwright/live-frontdoor-canonical-route-sweep.json`. | QA |
## Decisions & Risks
- Decision: fix the router transport/gateway heartbeat contract in source instead of only loosening compose thresholds, because scratch rebuilds must preserve the runtime behavior.
- Decision: treat the transient post-redeploy `404` cluster as the same convergence class as earlier health flapping until proven otherwise; verify with consecutive authenticated Playwright sweeps before opening page-local code work.
- Risk: route convergence is improved but still needs continued scratch-rebuild observation in later iterations; if repeated `404` windows persist after the heartbeat contract change, the next fix belongs in startup/readiness gating rather than page clients.
## Next Checkpoints
- 2026-03-09: completed messaging wait fallback repair, gateway threshold normalization, and live rebuild verification.
- Next iteration: expand from route availability into deeper Playwright action sweeps on the rebuilt stack.

@@ -0,0 +1,75 @@
# Sprint 20260309-012 - Router Live Quota Scope And Notify Dispatch Repairs
## Topic & Scope
- Repair the two remaining authenticated frontdoor regressions left after the full rebuild and redeploy: quota violations authorization and notify channel health dispatch.
- Keep the fixes in the Router layer because both failures occur before or inside Router-mediated delivery, not in the Platform or Notify business logic itself.
- Preserve existing live contracts while removing the actual transport/auth defects instead of adding route-local UI fallbacks.
- Working directory: `src/Router/`.
- Allowed coordination edits: `docs/modules/router/architecture.md`, `docs/modules/notify/architecture.md`, `docs/implplan/SPRINT_20260309_012_Router_live_quota_scope_and_notify_dispatch_repairs.md`, `src/Web/StellaOps.Web/output/playwright/**`.
- Expected evidence: targeted router test runs against individual `.csproj` files, rebuilt `router-gateway` image, redeployed compose stack, refreshed authenticated Playwright artifacts.
## Dependencies & Concurrency
- Depends on `SPRINT_20260309_001_Platform_scratch_setup_bootstrap_restore.md` for the scratch rebuild baseline and `SPRINT_20260309_011_Platform_live_remaining_route_contract_repair.md` for the narrowed live failure inventory.
- Safe parallelism: do not touch unrelated search or component-revival work outside `src/Router/**`; leave unrelated dirty files untouched.
## Documentation Prerequisites
- `AGENTS.md`
- `docs/code-of-conduct/CODE_OF_CONDUCT.md`
- `docs/qa/feature-checks/FLOW.md`
- `docs/modules/router/architecture.md`
- `docs/modules/notify/architecture.md`
## Delivery Tracker
### LIVE-ROUTER-012-001 - Restore gateway scope compatibility for quota reads
Status: DONE
Dependency: none
Owners: Developer, Test Automation
Task description:
- Fix the gateway authorization path so live quota endpoints can honor the resolved scope set produced by identity scope expansion. The frontdoor currently rejects quota reads even though the authenticated session carries `orch:quota` and the gateway already computes expanded scopes in request context.
Completion criteria:
- [x] `/api/v1/gateway/rate-limits/violations` succeeds through the live frontdoor for the authenticated operator session.
- [x] Router gateway unit tests cover coarse-scope expansion and authorization checks against the resolved scope set.
- [x] Router docs describe that scope-based authorization uses the resolved scope context, not only raw claim payloads.
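The authorization change can be illustrated with a small sketch. The expansion table and the fine-grained scope name `quota:read` are hypothetical placeholders; the sprint confirms only that `orch:quota` is a coarse scope and that checks run against the resolved scope set rather than the raw claims.

```python
# Hypothetical coarse-to-fine mapping; the fine-grained name is illustrative.
COARSE_SCOPE_EXPANSIONS = {
    "orch:quota": frozenset({"quota:read"}),
}


def resolve_scopes(raw_scopes):
    """Expand coarse compatibility scopes into the resolved per-request set."""
    resolved = set(raw_scopes)
    for scope in raw_scopes:
        resolved |= COARSE_SCOPE_EXPANSIONS.get(scope, frozenset())
    return frozenset(resolved)


def is_authorized(required_scopes, resolved_scopes):
    """Authorize only when every required scope is in the resolved set."""
    return set(required_scopes) <= set(resolved_scopes)
```

Checking the raw claim set `{"orch:quota"}` against a `quota:read` requirement fails, which mirrors the rejected quota reads; checking the resolved set succeeds without changing token issuance.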
### LIVE-ROUTER-012-002 - Fix ASP.NET bridge route matching for notify health paths
Status: DONE
Dependency: none
Owners: Developer, Test Automation
Task description:
- Fix the messaging-transport ASP.NET bridge so terminal route parameters are not treated as implicit catch-alls. The notify channel-health route currently dispatches through messaging, and the bridge incorrectly matches the shorter channel-detail route when extra segments are present.
Completion criteria:
- [x] `/api/v1/notify/channels/{channelId}/health` resolves to the correct endpoint over Router messaging transport.
- [x] Router ASP.NET bridge tests reproduce the old terminal-parameter bug and prove explicit catch-all routes still work.
- [x] The fix is implemented in Router transport/bridge code, not in Notify page-local workarounds.
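A simplified matcher that shows the intended behavior, assuming segment-by-segment comparison and a catch-all that appears only as the last template segment. It is a sketch of the rule, not the bridge's actual implementation, and the function name is hypothetical.

```python
def match_route(template: str, path: str):
    """Return captured route values, or None when the path does not match."""
    t_segs = [s for s in template.split("/") if s]
    p_segs = [s for s in path.split("/") if s]
    values = {}
    for i, seg in enumerate(t_segs):
        is_param = seg.startswith("{") and seg.endswith("}")
        if is_param and seg.startswith("{*"):
            # Explicit catch-all ({*path} or {**path}) consumes the remainder.
            values[seg[1:-1].lstrip("*")] = "/".join(p_segs[i:])
            return values
        if i >= len(p_segs):
            return None  # path is shorter than the template
        if is_param:
            values[seg[1:-1]] = p_segs[i]  # ordinary parameter: one segment
        elif seg.lower() != p_segs[i].lower():
            return None  # literal mismatch (compared case-insensitively)
    # An ordinary terminal parameter must not swallow extra segments.
    if len(p_segs) != len(t_segs):
        return None
    return values
```

With this rule, `/api/v1/notify/channels/{channelId}` no longer matches a path ending in `/health`, so dispatch falls through to `/api/v1/notify/channels/{channelId}/health`.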
### LIVE-ROUTER-012-003 - Rebuild, redeploy, and reverify the live frontdoor
Status: DONE
Dependency: LIVE-ROUTER-012-001, LIVE-ROUTER-012-002
Owners: Developer, QA
Task description:
- Rebuild the touched router image, redeploy the live stack, and rerun authenticated Playwright verification for the two repaired pages before committing.
Completion criteria:
- [x] The changed Router image is rebuilt from current source and redeployed.
- [x] Authenticated Playwright rechecks pass for `/ops/operations/quotas` and `/ops/operations/notifications`.
- [x] The canonical route sweep artifact reflects the updated live failure inventory.
## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2026-03-09 | Sprint created after the full rebuild/redeploy cleared scanner-backed route failures and left only two live Router-layer defects: quota scope enforcement and notify channel-health dispatch over messaging transport. | Developer |
| 2026-03-09 | Added coarse-to-fine quota scope compatibility in gateway authorization, fixed ASP.NET bridge terminal-parameter matching, rebuilt `router-gateway` and `notify-web`, and verified live `/ops/operations/quotas` plus `/ops/operations/notifications` behavior with authenticated Playwright. | Developer |
| 2026-03-09 | Re-ran the authenticated canonical live sweep after the rebuild cycle; the latest artifact reached `111/111` at `src/Web/StellaOps.Web/output/playwright/live-frontdoor-canonical-route-sweep.json`. | QA |
## Decisions & Risks
- Decision: keep quota compatibility in Router by authorizing against the resolved scope context already produced by gateway identity expansion; do not broaden Platform policies or change token issuance.
- Decision: fix notify health in the ASP.NET bridge matcher so only explicit catch-all parameters consume extra path segments; this preserves direct HTTP and messaging parity.
- Decision: keep the live verification artifact in the sprint because the repaired quota and notify defects were validated in the same rebuilt stack that now serves the full canonical route set cleanly.
- Risk: Router is a shared ingress surface. All changes must be covered by deterministic tests before redeploy to avoid collateral regressions in other routed pages.
## Next Checkpoints
- 2026-03-09: completed router gateway and ASP.NET bridge repairs with focused tests plus live rebuild verification.
- Next iteration: continue beyond route presence into deeper per-page action sweeps on the rebuilt stack.

@@ -18,6 +18,7 @@ Rollout policy: `docs/operations/multi-tenant-rollout-and-compatibility.md`
- HTTP is not used for internal microservice-to-gateway traffic
- Request/response bodies are opaque to the router (raw bytes/streams)
- Forwarded HTTP headers remain case-insensitive across Router frame transport and ASP.NET bridge dispatch; lowercase HTTP/2 names such as `content-type` must be preserved for JSON-bound endpoints, and the ASP.NET bridge must mark POST/PUT/PATCH requests as body-capable so minimal-API JSON binding survives frame dispatch
- Gateway scope authorization evaluates against the resolved per-request scope set from identity expansion (`GatewayContextKeys.Scopes`), so coarse compatibility scopes such as `orch:quota` can satisfy their fine-grained frontdoor equivalents without changing downstream policy names
### Transport Architecture
@@ -106,6 +107,8 @@ Browser → Router Gateway (port 80) → [microservices via binary transport]
The Angular SPA dist is provided by a `console-builder` init container that copies the built files to a shared `console-dist` volume mounted at `/app/wwwroot`.
When the gateway runs in-container, listener binding must honor explicit `ASPNETCORE_URLS` / `ASPNETCORE_HTTP_PORTS` / `ASPNETCORE_HTTPS_PORTS` values from compose. Wildcard hosts (`+`, `*`) are normalized to `0.0.0.0` before Kestrel listeners are created so the declared HTTP frontdoor contract actually comes up.
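The wildcard normalization can be sketched as below. The helper names are hypothetical and the parsing is deliberately simple; it assumes the semicolon-separated URL list format that `ASPNETCORE_URLS` uses and is not the gateway's actual binding code.

```python
WILDCARD_HOSTS = {"+", "*"}


def normalize_listener_url(url: str) -> str:
    """Rewrite a wildcard host to 0.0.0.0; leave other hosts untouched."""
    scheme, sep, rest = url.partition("://")
    if not sep:
        raise ValueError(f"not an absolute URL: {url!r}")
    authority = rest.split("/", 1)[0]
    if authority.endswith("]") or ":" not in authority:
        host, port = authority, ""  # no port (or a bare bracketed IPv6 host)
    else:
        host, _, port = authority.rpartition(":")
    if host in WILDCARD_HOSTS:
        host = "0.0.0.0"
    return f"{scheme}://{host}:{port}" if port else f"{scheme}://{host}"


def normalize_listener_urls(raw: str) -> list:
    """Split a semicolon-separated URL list and normalize each entry."""
    return [normalize_listener_url(u.strip()) for u in raw.split(";") if u.strip()]
```

For instance, `http://+:8080;https://*:8443` normalizes to `http://0.0.0.0:8080` and `https://0.0.0.0:8443`.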
---
## Service Identity
@@ -177,6 +180,7 @@ public sealed class EndpointDescriptor
- ASP.NET-style route templates
- Parameter segments: `{id}`, `{userId}`
- Extra path segments are consumed only by explicit catch-all parameters (`{**path}`); ordinary terminal parameters must not behave like implicit catch-alls during messaging transport dispatch
- Case sensitivity and trailing slash handling follow ASP.NET conventions
---
@@ -533,6 +537,8 @@ Gateway tracks:
- Derives status from heartbeat recency
- Marks stale instances as Unhealthy
- Uses health in routing decisions
- Messaging transports stay push-first even when backed by notifiable queues; the missed-notification safety-net timeout is derived from the configured heartbeat interval and clamped to a short bounded window instead of falling back to a fixed long poll.
- Gateway degraded and stale transitions are normalized against the messaging heartbeat contract. A gateway may not mark an instance `Degraded` earlier than `2x` the heartbeat interval or `Unhealthy` earlier than `3x` the heartbeat interval, even when tighter (shorter) thresholds were configured.
Periodic HELLO re-registration is valid so a microservice can repopulate gateway state after a gateway restart, but it must refresh the existing logical transport connection instead of minting a second one. Gateway routing state also deduplicates by service instance identity (`ServiceName`, `Version`, `InstanceId`, transport) before re-indexing endpoints so repeated HELLO frames cannot accumulate stale route candidates.
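The deduplication rule can be sketched with a keyed registry. All class and method names here are hypothetical; only the identity tuple (`ServiceName`, `Version`, `InstanceId`, transport) comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceKey:
    """Identity used to deduplicate repeated HELLO frames."""
    service_name: str
    version: str
    instance_id: str
    transport: str


class RoutingState:
    def __init__(self):
        self._endpoints_by_instance = {}

    def register_hello(self, key: InstanceKey, endpoints) -> None:
        # Replace the entry for a known instance instead of appending, so a
        # repeated HELLO refreshes state rather than adding a second candidate.
        self._endpoints_by_instance[key] = tuple(endpoints)

    def candidates(self, endpoint: str) -> list:
        """Return the instances currently serving the given endpoint."""
        return [
            key
            for key, endpoints in self._endpoints_by_instance.items()
            if endpoint in endpoints
        ]
```

Registering the same instance twice leaves a single route candidate, and endpoints dropped in the later HELLO stop being routable.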