Fix router messaging re-registration stability

This commit is contained in:
master
2026-03-07 03:48:46 +02:00
parent 28932d4a85
commit 2ff0e1f86b
9 changed files with 466 additions and 116 deletions

@@ -0,0 +1,91 @@
# Sprint 20260307-008 - Router Integrations Messaging Re-registration Stability
## Topic & Scope
- Eliminate duplicate Router messaging registrations that destabilize authenticated `https://stella-ops.local/api/v1/integrations*` traffic after repeated HELLO re-registration.
- Fix the defect at the shared Router layer so re-registration refreshes an existing service connection instead of accumulating stale gateway routing entries.
- Validate the repaired behavior with focused Router tests and live Playwright verification of Setup Integrations routes on `https://stella-ops.local`.
- Working directory: `src/Router`.
- Expected evidence: targeted Router unit tests, scoped service rebuild/restart, live Playwright route/action verification, and sprint execution log updates.
## Dependencies & Concurrency
- Upstream repro/evidence is tracked in `docs/implplan/SPRINT_20260306_003_FE_playwright_setup_reset_iteration_loop.md`.
- Safe parallelism: stay inside `src/Router` plus sprint/task-board updates; do not edit unrelated Web search files or Integrations persistence files owned by other active work.
- Runtime dependency: `stellaops-router-gateway` and `stellaops-integrations-web` must be rebuildable independently from the rest of the compose stack.
## Documentation Prerequisites
- `docs/07_HIGH_LEVEL_ARCHITECTURE.md`
- `docs/modules/platform/architecture-overview.md`
- `docs/modules/router/README.md`
- `docs/modules/router/architecture.md`
- `docs/modules/router/openapi-aggregation.md`
- `docs/modules/router/schema-validation.md`
- `src/Router/AGENTS.md`
- `src/Router/__Libraries/StellaOps.Router.Gateway/AGENTS.md`
- `src/Router/StellaOps.Gateway.WebService/AGENTS.md`
## Delivery Tracker
### RTR-MSG-001 - Reproduce and explain duplicate messaging registrations
Status: DONE
Dependency: none
Owners: QA, Developer
Task description:
- Use live authenticated Playwright plus Router/gateway logs to explain why Setup Integrations requests can oscillate between working and stalling.
- Capture the repeated registration pattern and map it back to the Router client/server code path responsible for HELLO re-registration.
Completion criteria:
- [x] Live evidence shows the affected `stella-ops.local` integrations route path and the corresponding Router/gateway behavior.
- [x] The repeated registration pattern is tied to specific Router source files and not left as a generic timing issue.
- [x] The scope boundary with the Web sprint is documented.
### RTR-MSG-002 - Stop duplicate Router registrations at the source and gateway state
Status: DONE
Dependency: RTR-MSG-001
Owners: Developer
Task description:
- Fix the shared Router messaging transport so `ConnectAsync(...)` re-registration does not spawn duplicate queues/receive loops or a fresh logical connection when the transport is already healthy.
- Harden gateway routing state so a reconnecting instance replaces stale registrations for the same service instance instead of accumulating them.
Completion criteria:
- [x] Re-registration reuses or replaces the logical connection deterministically instead of accumulating duplicates.
- [x] Gateway routing state no longer retains stale connections for the same service instance after re-registration.
- [x] The fix stays offline-safe and deterministic.
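
A minimal sketch of the client-side half of this task, assuming a simplified transport object. The class, its fields, and `mark_unhealthy` are hypothetical illustrations, not the real `MessagingTransportClient` API. It shows the intended shape only: a repeated connect on a healthy transport resends HELLO on the existing logical connection instead of creating a new connection id or a second receive loop.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class TransportClient:
    """Hypothetical stand-in for a messaging transport client (not the Router API)."""

    connection_id: Optional[str] = None
    healthy: bool = False
    receive_loops: int = 0
    hello_count: int = 0

    def connect(self) -> str:
        """Register with the gateway; safe to call on every periodic HELLO."""
        if self.connection_id is not None and self.healthy:
            # Healthy transport: refresh registration on the existing connection.
            self.hello_count += 1
            return self.connection_id

        # First connect, or the transport was torn down: build state exactly once.
        self.connection_id = uuid.uuid4().hex
        self.receive_loops = 1  # replaces any previous loop, never adds to it
        self.healthy = True
        self.hello_count += 1
        return self.connection_id

    def mark_unhealthy(self) -> None:
        """Simulate a broken transport so the next connect rebuilds state."""
        self.healthy = False


client = TransportClient()
first_id = client.connect()
repeat_id = client.connect()  # periodic HELLO re-registration
print(first_id == repeat_id, client.receive_loops)
```

In this sketch, only an unhealthy transport produces a new connection id, which keeps re-registration deterministic.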
### RTR-MSG-003 - Rebuild targeted services and replay live integrations QA
Status: DONE
Dependency: RTR-MSG-002
Owners: QA, Developer
Task description:
- Rebuild only the Router gateway and the live Integrations service components affected by the shared messaging transport fix.
- Replay the live Setup Integrations pages and actions with authenticated Playwright, including repeated requests, list/detail rendering, and onboarding navigation.
Completion criteria:
- [x] Targeted Router tests pass against focused test runners for the affected classes.
- [x] `stellaops-router-gateway` and `stellaops-integrations-web` are restarted with the patched Router code.
- [x] Live Playwright confirms the integrations routes/actions stay functional on their own, without relying on the fallback timeout UI to keep the page usable.
## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2026-03-07 | Sprint created after live Playwright on `https://stella-ops.local/setup/integrations/*` and Router log review exposed a stateful messaging defect outside Web scope. Direct Integrations HTTP calls were healthy, but Router logs showed the same `integrations` service instance repeatedly registering new messaging connection IDs over time. | QA |
| 2026-03-07 | Root cause confirmed in `RouterConnectionManager` periodic HELLO re-registration plus `MessagingTransportClient.ConnectAsync(...)`, which recreated queues, receive loops, and logical connection IDs on every refresh. | Developer |
| 2026-03-07 | Patched the shared Router transport so healthy HELLO re-registration reuses the existing logical connection, and hardened `InMemoryRoutingState` to replace stale same-instance registrations before rebuilding endpoint indexes. | Developer |
| 2026-03-07 | Added targeted xUnit v3 regression tests for messaging re-registration reuse and gateway routing-state dedupe, then ran those classes directly through the test assembly runners because Microsoft Testing Platform ignored `dotnet test --filter` for these projects. | QA |
| 2026-03-07 | Rebuilt `stellaops/router-gateway:dev` and `stellaops/integrations-web:dev`, then force-recreated only `router-gateway` and `integrations-web` inside the live compose stack to replay the defect on patched runtime images. | Developer |
| 2026-03-07 | Live authenticated Playwright on `/setup/integrations`, `/setup/integrations/secrets`, `/setup/integrations/int-1`, and `/setup/integrations/onboarding/host` confirmed three things: the SPA's own `/api/v1/integrations*` calls returned `200` for list views; the missing-detail route rendered the intended explicit unavailable state with working back-navigation; and the host-provider selection advanced into the authentication step without stalling. | QA |
| 2026-03-07 | Timestamped Router logs showed the rebuilt `integrations-0eabab6a4e63421c9aa943f` instance sending a repeat HELLO at `2026-03-07T01:41:33Z` on the same logical connection id `a4627760b78c48228e62007d925df22a` that it first registered at `2026-03-07T01:36:41Z`, confirming the duplicate-registration defect is fixed for rebuilt clients. | QA |
## Decisions & Risks
- Decision: this sprint stays inside `src/Router` plus required sprint/task-board updates only.
- Decision: the permanent fix must cover both sides of the behavior: the microservice transport must stop creating duplicate logical connections, and gateway routing state must fail safe when an older client re-registers anyway.
- Decision: targeted Router evidence uses the xUnit v3 test assembly executables (`*.Tests.exe -class ...`) because these projects run on Microsoft Testing Platform and ignore `dotnet test --filter`; a filtered `dotnet test` run would otherwise bury the new regression tests inside a full-suite pass count.
- Decision: raw unauthenticated probe calls to `/api/v1/integrations*` are not accepted as UI evidence for this sprint because the SPA attaches authenticated context differently than a naked fetch; live validation is based on Playwright-observed browser traffic plus Router logs.
- Risk: `docs/modules/gateway/architecture.md` and `docs/modules/gateway/openapi.md` are referenced by module charters but do not exist at the expected paths in this repo snapshot.
- Mitigation: follow the available canonical Router dossiers (`docs/modules/router/**`) and record the missing gateway doc paths here instead of inventing replacements during the bug-fix iteration.
- Risk: other long-running services in the current compose stack may also be using the same older Router transport behavior.
- Mitigation: harden gateway-side dedupe so the live stack benefits immediately after the targeted gateway rebuild even before every service image is refreshed.
- Risk: the shared client-side transport fix is live only in the rebuilt images from this sprint, so other services that still run older images can continue to mint fresh connection ids until later rollout iterations rebuild them.
## Next Checkpoints
- 2026-03-07: rebuild the next scoped batch of messaging-client services so the shared transport fix rolls beyond the integrations path without attempting a full compose rebuild.
- 2026-03-07: continue Playwright-first page/action sweeps to surface the next live defect once the router registration churn for rebuilt services is confirmed clean.

@@ -533,6 +533,8 @@ Gateway tracks:
- Marks stale instances as Unhealthy
- Uses health in routing decisions
Periodic HELLO re-registration is valid so a microservice can repopulate gateway state after a gateway restart, but it must refresh the existing logical transport connection instead of minting a second one. Gateway routing state also deduplicates by service instance identity (`ServiceName`, `Version`, `InstanceId`, transport) before re-indexing endpoints so repeated HELLO frames cannot accumulate stale route candidates.
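
A minimal sketch of the gateway-side dedupe described above, assuming a simplified in-memory state. The class and method names, the `"1.0.0"` version, and the connection ids are hypothetical and do not reflect the actual `InMemoryRoutingState` implementation. Registration drops any older connection with the same (`ServiceName`, `Version`, `InstanceId`, transport) identity before the endpoint index is rebuilt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceKey:
    """Service instance identity used for deduplication."""

    service_name: str
    version: str
    instance_id: str
    transport: str


class RoutingState:
    """Hypothetical in-memory routing state; illustrative only."""

    def __init__(self):
        self.connections = {}     # connection id -> (InstanceKey, endpoints)
        self.endpoint_index = {}  # endpoint path -> list of connection ids

    def register(self, connection_id, key, endpoints):
        # Remove stale connections that belong to the same service instance.
        stale = [
            cid
            for cid, (existing_key, _) in self.connections.items()
            if existing_key == key and cid != connection_id
        ]
        for cid in stale:
            del self.connections[cid]

        self.connections[connection_id] = (key, list(endpoints))
        self._rebuild_index()

    def _rebuild_index(self):
        # Re-index only after dedupe so no stale route candidates survive.
        index = {}
        for cid, (_, endpoints) in self.connections.items():
            for endpoint in endpoints:
                index.setdefault(endpoint, []).append(cid)
        self.endpoint_index = index


state = RoutingState()
key = InstanceKey("integrations", "1.0.0", "integrations-a", "messaging")
state.register("conn-1", key, ["/api/v1/integrations"])
state.register("conn-2", key, ["/api/v1/integrations"])  # older client re-registers
print(state.endpoint_index)
```

With this shape, an older client that still mints a fresh connection id on each HELLO leaves only one route candidate per instance.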
---
## Configuration