Repair router frontdoor convergence and live route contracts
@@ -0,0 +1,78 @@
# Sprint 20260309-008 - Router Live Messaging Heartbeat Contract Repair

## Topic & Scope
- Repair the live frontdoor `503` cluster triggered after the full scratch rebuild, where healthy services are marked degraded or unhealthy because gateway heartbeat thresholds undercut the messaging transport's missed-notification fallback.
- Preserve the Valkey push-first CPU fix while ensuring a missed wake-up cannot stall queue consumption long enough to trip false gateway health failures.
- Rebuild and redeploy the affected router slice, then rerun the authenticated live Playwright sweep to confirm the shared `503` backlog collapses before moving on to page-specific defects.
- Working directory: `src/Router`.
- Allowed coordination edits: `docs/modules/router/architecture.md`, `docs/implplan/SPRINT_20260309_008_Router_live_messaging_heartbeat_contract_repair.md`.
- Expected evidence: focused router unit tests, rebuilt router image, redeployed gateway, refreshed live Playwright sweep artifact.

## Dependencies & Concurrency
- Depends on `SPRINT_20260309_001_Platform_scratch_setup_bootstrap_restore.md` for the rebuilt baseline and `SPRINT_20260309_003_Router_live_frontdoor_contract_repair.md` for the already-restored frontdoor bindings.
- Safe parallelism: avoid the unrelated search and component-revival slices already landed by other agents; this sprint is limited to router messaging wake-up behavior, gateway health threshold policy, and live verification artifacts.

## Documentation Prerequisites
- `AGENTS.md`
- `src/Router/AGENTS.md`
- `src/Router/StellaOps.Gateway.WebService/AGENTS.md`
- `docs/code-of-conduct/CODE_OF_CONDUCT.md`
- `docs/qa/feature-checks/FLOW.md`
- `docs/modules/router/architecture.md`
- `docs/modules/platform/architecture-overview.md`

## Delivery Tracker

### ROUTER-LIVE-008-001 - Bound the messaging wake-up fallback to heartbeat cadence
Status: DONE
Dependency: none
Owners: Developer, QA
Task description:
- Replace the fixed 30-second notifiable-queue fallback with a heartbeat-aware safety-net timeout so a missed Valkey pub/sub wake-up does not leave the gateway or microservices asleep long enough to look dead.
- Keep the transport push-first and low-CPU: the fallback exists only for missed notifications, not as a return to aggressive polling.

Completion criteria:
- [x] Messaging queue waits derive their safety-net timeout from the configured heartbeat interval instead of a fixed 30-second constant.
- [x] Focused router tests cover the timeout calculation contract.
- [x] The transport remains push-first for notifiable queues.

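
The timeout derivation described in this task can be sketched as follows. This is a minimal Python illustration, not the Router's actual code: the function name `safety_net_timeout`, the half-interval ratio, and the 1-second floor are assumptions made for the example. Only the heartbeat-derived input and the former 30-second constant come from the sprint text.

```python
from datetime import timedelta

# Hypothetical bounds; the sprint does not state the real values.
MIN_FALLBACK = timedelta(seconds=1)   # floor so the fallback never turns into a busy poll
MAX_FALLBACK = timedelta(seconds=30)  # the previous fixed constant, reused here as a ceiling


def safety_net_timeout(heartbeat_interval: timedelta) -> timedelta:
    """Return how long a queue wait may sleep before re-checking without a push."""
    if heartbeat_interval <= timedelta(0):
        # No usable heartbeat configured: keep the old fixed ceiling.
        return MAX_FALLBACK
    # Assumed ratio: wake at least twice per heartbeat period.
    candidate = heartbeat_interval / 2
    # Clamp between the floor and the ceiling.
    return max(MIN_FALLBACK, min(candidate, MAX_FALLBACK))
```

Under these assumptions, a 10-second heartbeat wakes a silent queue wait after at most 5 seconds, so a single missed pub/sub notification costs less than one heartbeat period while the wait still sleeps between pushes.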

### ROUTER-LIVE-008-002 - Harden gateway health thresholds against heartbeat jitter
Status: DONE
Dependency: ROUTER-LIVE-008-001
Owners: Developer, QA
Task description:
- Normalize gateway degraded/stale thresholds against the configured messaging heartbeat interval so the live gateway cannot mark healthy instances degraded or unhealthy earlier than the transport contract allows.
- Prefer a durable source-level policy over a compose-only tweak so the next scratch rebuild preserves the fix.

Completion criteria:
- [x] Gateway health options are normalized so the degraded threshold is at least 2x, and the stale threshold at least 3x, the configured messaging heartbeat interval.
- [x] Focused router tests lock the health-threshold normalization behavior.
- [x] The router architecture dossier documents the heartbeat-to-health contract.

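
The 2x/3x normalization rule above can be illustrated with a short Python sketch. The type and function names (`HealthThresholds`, `normalize_thresholds`) are hypothetical and chosen for the example; the real gateway options class is not shown in this sprint.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class HealthThresholds:
    """Hypothetical stand-in for the gateway health options."""
    degraded_after: timedelta
    stale_after: timedelta


def normalize_thresholds(configured: HealthThresholds,
                         heartbeat_interval: timedelta) -> HealthThresholds:
    """Raise each threshold to its heartbeat-derived minimum, never lower it."""
    degraded = max(configured.degraded_after, 2 * heartbeat_interval)
    stale = max(configured.stale_after, 3 * heartbeat_interval)
    return HealthThresholds(degraded_after=degraded, stale_after=stale)
```

For a 10-second heartbeat, a configured 15-second degraded threshold is raised to 20 seconds, while a configured 30-second stale threshold already meets the 3x floor and is left unchanged. Values above the floor pass through untouched, so the rule only guards against thresholds that undercut the transport contract.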

### ROUTER-LIVE-008-003 - Rebuild, redeploy, and verify the live frontdoor
Status: DONE
Dependency: ROUTER-LIVE-008-002
Owners: QA
Task description:
- Rebuild and redeploy the router slice, rerun the authenticated live sweep, and record whether the shared `503` cluster is removed or narrowed for the next iteration.

Completion criteria:
- [x] Router artifacts are rebuilt and redeployed on the live compose stack.
- [x] The authenticated live Playwright sweep is rerun from the rebuilt stack.
- [x] Remaining failures are recorded with current evidence if any survive.

## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2026-03-09 | Sprint created after the rebuilt live stack showed shared gateway `503` failures caused by heartbeat health flapping rather than page-local defects. | Developer |
| 2026-03-09 | Updated messaging wait fallback to use heartbeat-derived safety-net timeouts, normalized gateway degraded/stale thresholds against messaging heartbeat cadence, and added focused router tests for both contracts. | Developer |
| 2026-03-09 | Rebuilt the full image set, redeployed the live compose stack, then reran authenticated Playwright sweeps. The first post-redeploy sweep showed transient cross-service `404` convergence misses; the second consecutive sweep completed `111/111` against `src/Web/StellaOps.Web/output/playwright/live-frontdoor-canonical-route-sweep.json`. | QA |

## Decisions & Risks
- Decision: fix the router transport/gateway heartbeat contract in source instead of only loosening compose thresholds, because scratch rebuilds must preserve the runtime behavior.
- Decision: treat the transient post-redeploy `404` cluster as the same convergence class as earlier health flapping until proven otherwise; verify with consecutive authenticated Playwright sweeps before opening page-local code work.
- Risk: route convergence is improved but still needs continued scratch-rebuild observation in later iterations; if repeated `404` windows persist after the heartbeat contract change, the next fix belongs in startup/readiness gating rather than page clients.

## Next Checkpoints
- 2026-03-09: completed messaging wait fallback repair, gateway threshold normalization, and live rebuild verification.
- Next iteration: expand from route availability into deeper Playwright action sweeps on the rebuilt stack.

@@ -0,0 +1,75 @@
# Sprint 20260309-012 - Router Live Quota Scope And Notify Dispatch Repairs

## Topic & Scope
- Repair the two remaining authenticated frontdoor regressions left after the full rebuild and redeploy: authorization for quota violation reads and notify channel-health dispatch.
- Keep the fixes in the Router layer because both failures occur before or inside Router-mediated delivery, not in the Platform or Notify business logic itself.
- Preserve existing live contracts while removing the actual transport/auth defects instead of adding route-local UI fallbacks.
- Working directory: `src/Router/`.
- Allowed coordination edits: `docs/modules/router/architecture.md`, `docs/modules/notify/architecture.md`, `docs/implplan/SPRINT_20260309_012_Router_live_quota_scope_and_notify_dispatch_repairs.md`, `src/Web/StellaOps.Web/output/playwright/**`.
- Expected evidence: targeted router test runs against individual `.csproj` files, rebuilt `router-gateway` image, redeployed compose stack, refreshed authenticated Playwright artifacts.

## Dependencies & Concurrency
- Depends on `SPRINT_20260309_001_Platform_scratch_setup_bootstrap_restore.md` for the scratch rebuild baseline and `SPRINT_20260309_011_Platform_live_remaining_route_contract_repair.md` for the narrowed live failure inventory.
- Safe parallelism: do not touch unrelated search or component-revival work outside `src/Router/**`; leave unrelated dirty files untouched.

## Documentation Prerequisites
- `AGENTS.md`
- `docs/code-of-conduct/CODE_OF_CONDUCT.md`
- `docs/qa/feature-checks/FLOW.md`
- `docs/modules/router/architecture.md`
- `docs/modules/notify/architecture.md`

## Delivery Tracker

### LIVE-ROUTER-012-001 - Restore gateway scope compatibility for quota reads
Status: DONE
Dependency: none
Owners: Developer, Test Automation
Task description:
- Fix the gateway authorization path so live quota endpoints can honor the resolved scope set produced by identity scope expansion. The frontdoor currently rejects quota reads even though the authenticated session carries `orch:quota` and the gateway already computes expanded scopes in request context.

Completion criteria:
- [x] `/api/v1/gateway/rate-limits/violations` succeeds through the live frontdoor for the authenticated operator session.
- [x] Router gateway unit tests cover coarse-scope expansion and authorization checks against the resolved scope set.
- [x] Router docs describe that scope-based authorization uses the resolved scope context, not only raw claim payloads.

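
The difference between checking raw claims and checking the resolved scope context can be sketched in Python. Everything except the `orch:quota` scope is hypothetical: the fine-grained scope names, the expansion table, and the `RequestContext` shape are placeholders for illustration, not the gateway's real identifiers.

```python
from dataclasses import dataclass

# Hypothetical coarse-to-fine expansion table. Only `orch:quota` appears in
# the sprint text; the fine-grained names are invented for this example.
COARSE_SCOPE_EXPANSIONS = {
    "orch:quota": frozenset({"quota.read", "quota.violations.read"}),
}


@dataclass(frozen=True)
class RequestContext:
    raw_scopes: frozenset       # scopes exactly as carried by the token claims
    resolved_scopes: frozenset  # raw scopes plus everything they expand to


def build_context(raw_scopes) -> RequestContext:
    """Expand coarse scopes once and store both views on the request context."""
    raw = frozenset(raw_scopes)
    resolved = set(raw)
    for scope in raw:
        resolved |= COARSE_SCOPE_EXPANSIONS.get(scope, frozenset())
    return RequestContext(raw_scopes=raw, resolved_scopes=frozenset(resolved))


def is_authorized(ctx: RequestContext, required_scopes) -> bool:
    """Authorize against the resolved set, not only the raw claim payload."""
    return frozenset(required_scopes) <= ctx.resolved_scopes
```

In this sketch a session carrying only `orch:quota` fails a raw-claim check for the placeholder scope `quota.violations.read` but passes once the check reads `resolved_scopes`, which mirrors the rejection pattern the task describes.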

### LIVE-ROUTER-012-002 - Fix ASP.NET bridge route matching for notify health paths
Status: DONE
Dependency: none
Owners: Developer, Test Automation
Task description:
- Fix the messaging-transport ASP.NET bridge so terminal route parameters are not treated as implicit catch-alls. The notify channel-health route currently dispatches through messaging, and the bridge incorrectly matches the shorter channel-detail route when extra segments are present.

Completion criteria:
- [x] `/api/v1/notify/channels/{channelId}/health` resolves to the correct endpoint over Router messaging transport.
- [x] Router ASP.NET bridge tests reproduce the old terminal-parameter bug and prove explicit catch-all routes still work.
- [x] The fix is implemented in Router transport/bridge code, not in Notify page-local workarounds.

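
The matching rule behind this fix can be shown with a simplified Python matcher. It is a sketch under stated assumptions, not the bridge's implementation: `match_route` is a hypothetical helper, it ignores route constraints, optional parameters, and defaults, and the `/files/{*rest}` template in the usage note is invented to show an explicit catch-all.

```python
from typing import Optional


def match_route(template: str, path: str) -> Optional[dict]:
    """Match a path against a route template; return route values or None."""
    t_segs = [s for s in template.strip("/").split("/") if s]
    p_segs = [s for s in path.strip("/").split("/") if s]
    values = {}
    for i, seg in enumerate(t_segs):
        if seg.startswith("{*") and seg.endswith("}"):
            # Explicit catch-all ({*name} or {**name}): only valid as the
            # last template segment, and it consumes all remaining segments.
            if i != len(t_segs) - 1:
                return None
            values[seg[1:-1].lstrip("*")] = "/".join(p_segs[i:])
            return values
        if i >= len(p_segs):
            return None  # path is shorter than the template
        if seg.startswith("{") and seg.endswith("}"):
            # Ordinary parameter: binds exactly one segment, even when terminal.
            values[seg[1:-1]] = p_segs[i]
        elif seg.lower() != p_segs[i].lower():
            return None  # literal segment mismatch (compared case-insensitively)
    # Without a catch-all, leftover path segments mean no match.
    if len(p_segs) != len(t_segs):
        return None
    return values
```

With this rule, `/api/v1/notify/channels/abc/health` no longer matches the shorter `/api/v1/notify/channels/{channelId}` template and instead resolves through `/api/v1/notify/channels/{channelId}/health`, while a template such as `/files/{*rest}` still captures every trailing segment.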

### LIVE-ROUTER-012-003 - Rebuild, redeploy, and reverify the live frontdoor
Status: DONE
Dependency: LIVE-ROUTER-012-001, LIVE-ROUTER-012-002
Owners: Developer, QA
Task description:
- Rebuild the touched router image, redeploy the live stack, and rerun authenticated Playwright verification for the two repaired pages before committing.

Completion criteria:
- [x] The changed Router image is rebuilt from current source and redeployed.
- [x] Authenticated Playwright rechecks pass for `/ops/operations/quotas` and `/ops/operations/notifications`.
- [x] The canonical route sweep artifact reflects the updated live failure inventory.

## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2026-03-09 | Sprint created after the full rebuild/redeploy cleared scanner-backed route failures and left only two live Router-layer defects: quota scope enforcement and notify channel-health dispatch over messaging transport. | Developer |
| 2026-03-09 | Added coarse-to-fine quota scope compatibility in gateway authorization, fixed ASP.NET bridge terminal-parameter matching, rebuilt `router-gateway` and `notify-web`, and verified live `/ops/operations/quotas` plus `/ops/operations/notifications` behavior with authenticated Playwright. | Developer |
| 2026-03-09 | Re-ran the authenticated canonical live sweep after the rebuild cycle; the latest artifact reached `111/111` at `src/Web/StellaOps.Web/output/playwright/live-frontdoor-canonical-route-sweep.json`. | QA |

## Decisions & Risks
- Decision: keep quota compatibility in Router by authorizing against the resolved scope context already produced by gateway identity expansion; do not broaden Platform policies or change token issuance.
- Decision: fix notify health in the ASP.NET bridge matcher so only explicit catch-all parameters consume extra path segments; this preserves direct HTTP and messaging parity.
- Risk: Router is a shared ingress surface. All changes must be covered by deterministic tests before redeploy to avoid collateral regressions in other routed pages.
- Decision: keep the live verification artifact in the sprint because the repaired quota and notify defects were validated in the same rebuilt stack that now serves the full canonical route set cleanly.

## Next Checkpoints
- 2026-03-09: completed router gateway and ASP.NET bridge repairs with focused tests plus live rebuild verification.
- Next iteration: continue beyond route presence into deeper per-page action sweeps on the rebuilt stack.