Harden canonical route sweep rechecks

2026-03-11 18:44:38 +02:00
parent f0b2ef3319
commit 6afd8f951e
3 changed files with 108 additions and 4 deletions
--- a/docs/implplan/SPRINT_20260311_007_FE_canonical_route_sweep_transient_recheck.md
+++ b/docs/implplan/SPRINT_20260311_007_FE_canonical_route_sweep_transient_recheck.md
@@ -0,0 +1,72 @@
+# Sprint 20260311_007 - FE Canonical Route Sweep Transient Recheck
+
+## Topic & Scope
+- Revalidate the broad live canonical route sweep after recent web changes and distinguish real route defects from transient Playwright/runtime noise.
+- Root-cause the reported `/ops/operations/health-slo` failure instead of fixing healthy product code based on a flaky harness signal.
+- Harden the canonical route sweep so failed routes are rechecked in a fresh authenticated browser context before they are counted as broken.
+- Update QA flow guidance so UI Tier 2 verification explicitly requires a fresh-context recheck for transient-only failures.
+- Working directory: `src/Web/StellaOps.Web`.
+- Expected evidence: isolated Playwright repros for `health-slo`, patched canonical sweep script, refreshed `live-frontdoor-canonical-route-sweep.json` with `111/111`, updated QA flow doc, and a scoped local commit.
+
+## Dependencies & Concurrency
+- Depends on the live compose stack at `https://stella-ops.local` being healthy and reachable.
+- Safe parallelism: implementation stays in `src/Web/StellaOps.Web`; the only allowed doc touch outside the working directory is `docs/qa/feature-checks/FLOW.md` plus this sprint file because the fix changes QA execution rules.
+
+## Documentation Prerequisites
+- `AGENTS.md`
+- `docs/qa/feature-checks/FLOW.md`
+- `docs/code-of-conduct/TESTING_PRACTICES.md`
+
+## Delivery Tracker
+
+### FE-ROUTE-SWEEP-001 - Prove whether `health-slo` is a product defect or a harness defect
+Status: DONE
+Dependency: none
+Owners: QA, 3rd line support
+Task description:
+- Reproduce the `/ops/operations/health-slo` failure from the canonical route sweep, then probe the route in isolation with authenticated Playwright, endpoint capture, and repeated route loads to determine whether the fault lies in the page or in the sweep.
+
+Completion criteria:
+- [x] Isolated authenticated browser probes capture the real `/api/v1/platform/health/*` statuses during page load.
+- [x] Repeated isolated `health-slo` loads confirm whether the route itself is stable.
+- [x] Root cause is identified as product code or harness logic with concrete evidence.
+
+### FE-ROUTE-SWEEP-002 - Harden the canonical sweep against transient false positives
+Status: DONE
+Dependency: FE-ROUTE-SWEEP-001
+Owners: Product Manager, Architect, Developer
+Task description:
+- Update the broad route sweep so an initial failed route is rechecked in a fresh authenticated browser context before it is marked failed. Preserve first-failure evidence while using the recheck result as the final verdict.
+
+Completion criteria:
+- [x] Failed routes are retried in a fresh authenticated context.
+- [x] Recheck metadata preserves the initial failure evidence.
+- [x] Healthy routes are no longer misclassified as failed because of transient runtime noise.
+
+### FE-ROUTE-SWEEP-003 - Reverify the full canonical route matrix
+Status: DONE
+Dependency: FE-ROUTE-SWEEP-002
+Owners: QA
+Task description:
+- Rerun the full canonical route sweep on the live stack and confirm the final result reflects real route health after the harness hardening.
+
+Completion criteria:
+- [x] Full live sweep reruns on `https://stella-ops.local`.
+- [x] Final result records `111/111` passed routes and `0` failed routes.
+- [x] QA flow documentation records the fresh-context recheck rule for transient UI failures.
+
+## Execution Log
+| Date (UTC) | Update | Owner |
+| --- | --- | --- |
+| 2026-03-11 | Sprint created after the broad canonical route sweep reported a single live failure on `/ops/operations/health-slo`. | QA |
+| 2026-03-11 | Isolated authenticated Playwright probes showed the route and its backing `summary`, `dependencies`, and `incidents` endpoints returning `200`, while repeated direct route loads stayed clean. Root cause was reclassified from product defect to sweep false positive. | QA / 3rd line support |
+| 2026-03-11 | Hardened `live-frontdoor-canonical-route-sweep.mjs` so failed routes are rechecked in a fresh authenticated context before final classification. The first failure evidence is preserved in the route record. | Product Manager / Architect / Developer |
+| 2026-03-11 | Full canonical route sweep reran clean on the live stack and recorded `111/111` passed routes with no failed routes. | QA |
+
+## Decisions & Risks
+- Decision: do not patch the `health-slo` product route because isolated live verification proved it healthy. Fixing product code against a false positive would lower signal quality and increase regression risk.
+- Decision: broad route sweeps now treat a first-pass failure as provisional until a fresh-context recheck runs. This is the smallest clean change that preserves aggressive QA while reducing flaky route classifications.
+- Risk: transient failures are still evidence. The harness preserves initial failure details so recurring instability can still be investigated instead of silently disappearing.
+
+## Next Checkpoints
+- Commit the canonical sweep hardening locally, clear transient Playwright output again, then move to the next unswept deep action family with the corrected route baseline.
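
Annotation (not part of the commit): the recheck rule described in FE-ROUTE-SWEEP-002 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the contents of `live-frontdoor-canonical-route-sweep.mjs`. The function name `sweepWithRecheck`, the `openContext` factory, and the record fields (`rechecked`, `initialFailure`, `recheck`) are all hypothetical. The browser is abstracted behind `openContext` so the logic stays independent of any particular Playwright setup.

```javascript
// Hypothetical sketch of "recheck failed routes in a fresh context".
// All names and the result shape are illustrative assumptions.

/**
 * @param {string[]} routes Routes to sweep.
 * @param {() => Promise<{
 *   check: (route: string) => Promise<{ ok: boolean, detail?: string }>,
 *   close: () => Promise<void>
 * }>} openContext Factory that returns a new authenticated context on each call.
 */
async function sweepWithRecheck(routes, openContext) {
  // One shared context is reused for the first pass over every route.
  const shared = await openContext();
  const results = [];
  try {
    for (const route of routes) {
      const first = await shared.check(route);
      if (first.ok) {
        results.push({ route, ok: true, rechecked: false });
        continue;
      }

      // A first-pass failure is provisional: retry in a fresh context.
      const fresh = await openContext();
      let second;
      try {
        second = await fresh.check(route);
      } finally {
        await fresh.close();
      }

      // The recheck decides the verdict; the first failure is kept as evidence.
      results.push({
        route,
        ok: second.ok,
        rechecked: true,
        initialFailure: first,
        recheck: second,
      });
    }
  } finally {
    await shared.close();
  }
  return results;
}
```

With this shape, a route that fails once and then passes is reported as passed but still carries `initialFailure`, so recurring transient failures remain visible in the sweep output rather than disappearing.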
--- a/docs/qa/feature-checks/FLOW.md
+++ b/docs/qa/feature-checks/FLOW.md
@@ -322,8 +322,9 @@ echo $?  # Verify exit code 0
 1. Ensure the Angular app is running (`ng serve` or docker)
 2. Use Playwright CLI or MCP to navigate to the feature's UI route
 3. Follow E2E Test Plan steps: verify elements render, interactions work, data displays
-4. Capture screenshots as evidence
-5. Test accessibility (keyboard navigation, ARIA labels) if listed in E2E plan
+4. If the feature fails only because of transient network/runtime noise, rerun the failing UI transaction in a fresh page or fresh authenticated browser context before declaring the feature failed. Preserve both the first failure evidence and the recheck outcome.
+5. Capture screenshots as evidence
+6. Test accessibility (keyboard navigation, ARIA labels) if listed in E2E plan

 **Example for `pipeline-run-centric-view`**:
 ```bash