Restore policy frontdoor compatibility and live QA

2026-03-10 06:18:30 +02:00
parent 6578c82602
commit ff4cd7e999
7 changed files with 2413 additions and 50 deletions
--- a/docs/implplan/SPRINT_20260309_002_FE_live_frontdoor_canonical_route_sweep.md
+++ b/docs/implplan/SPRINT_20260309_002_FE_live_frontdoor_canonical_route_sweep.md
@@ -5,7 +5,7 @@
 - Use the canonical route inventory already curated in the frontend sweep spec, then record route-level failures, console errors, request failures, and visible operator actions for follow-on deep page/action iterations.
 - Keep this sprint focused on the reusable live sweep harness; route/action fixes discovered by the harness belong to later implementation iterations.
 - Working directory: `src/Web/StellaOps.Web/scripts`.
- Allowed coordination edits: `src/Web/StellaOps.Web/tests/e2e/prealpha-canonical-full-sweep.spec.ts`, `src/Web/StellaOps.Web/scripts/live-frontdoor-auth.mjs`, `src/Web/StellaOps.Web/scripts/live-frontdoor-canonical-route-sweep.mjs`, `src/Web/StellaOps.Web/scripts/live-frontdoor-changed-surfaces.mjs`, `src/Web/StellaOps.Web/scripts/live-releases-deployments-check.mjs`, `docs/implplan/SPRINT_20260309_002_FE_live_frontdoor_canonical_route_sweep.md`.
+- Allowed coordination edits: `src/Web/StellaOps.Web/tests/e2e/prealpha-canonical-full-sweep.spec.ts`, `src/Web/StellaOps.Web/scripts/live-frontdoor-auth.mjs`, `src/Web/StellaOps.Web/scripts/live-frontdoor-canonical-route-sweep.mjs`, `src/Web/StellaOps.Web/scripts/live-frontdoor-changed-surfaces.mjs`, `src/Web/StellaOps.Web/scripts/live-ops-policy-action-sweep.mjs`, `src/Web/StellaOps.Web/scripts/live-releases-deployments-check.mjs`, `docs/implplan/SPRINT_20260309_002_FE_live_frontdoor_canonical_route_sweep.md`.
 - Expected evidence: a runnable live sweep script, authenticated JSON output under `src/Web/StellaOps.Web/output/playwright/`, and a recorded list of failing canonical routes once the rebuilt stack is reachable.

 ## Dependencies & Concurrency
@@ -46,6 +46,19 @@ Completion criteria:
 - [x] The failing route list is captured as iteration evidence.
 - [x] Follow-on implementation work uses the captured failures instead of ad hoc page selection.

+### FE-LIVE-SWEEP-003 - Harden deep action sweeps against silent hangs
+Status: DONE
+Dependency: FE-LIVE-SWEEP-002
+Owners: QA, Developer (FE)
+Task description:
+- The deeper live action sweeps must fail fast and write partial evidence even when a specific page action hangs or a browser interaction wedges.
+- Add per-action watchdogs, progress logging, and non-zero exit semantics for behavioral failures so long-running scratch iterations remain auditable instead of stalling in silence.
+
+Completion criteria:
+- [x] The ops/policy action sweep writes partial JSON progress as it runs.
+- [x] A blocked action is reported as a failed action with step-level context instead of hanging the entire process.
+- [x] The action sweep exits non-zero when any checked action or runtime contract fails.
+
 ## Execution Log
 | Date (UTC) | Update | Owner |
 | --- | --- | --- |
@@ -55,6 +68,8 @@ Completion criteria:
 | 2026-03-09 | Ran the authenticated 106-route sweep against the rebuilt stack. After removing redirect/copy false positives, the real live backlog is 19 failing routes: reachability; feeds-airgap; jobengine; quotas; dead-letter; aoc; signals; packs; ai-runs; notifications; status; sbom-sources; policy simulation; policy trust-weights; policy staleness; policy audit; setup/platform trust-signing; and setup notifications. | Developer |
 | 2026-03-09 | Expanded the canonical live sweep inventory to include the revived release-investigation, evidence-thread, and registry-admin routes so future frontdoor passes cover those pages as first-class surfaces instead of leaving them to ad hoc follow-up scripts. | Developer |
 | 2026-03-09 | After the full image rebuild and the next web-only repair pass, reran the authenticated 111-route sweep. The live backlog moved to 24 failing routes, with the earlier title regressions and feeds-airgap issue cleared while new backend/runtime failures remained concentrated in analytics, JobEngine, integrations, policy governance, notifications, and trust authorization. | Developer |
+| 2026-03-10 | Full rebuild and redeploy completed cleanly, but the deeper live `ops/policy` action sweep stalled after authentication without writing a result file. This iteration is hardening the sweep itself with per-action watchdogs, progress persistence, and explicit failure semantics so the next scratch loops do not burn hours on a silent Playwright hang. | Developer |
+| 2026-03-10 | Completed the hardening pass on `live-ops-policy-action-sweep.mjs`: the script now persists progress while it runs, reports blocked actions with step-level snapshots, and exits non-zero on action/runtime failures. After the policy frontdoor fix, the same sweep completed cleanly on the rebuilt stack with zero runtime issues. | Developer |

 ## Decisions & Risks
 - Decision: keep this sprint focused on broad route-level live verification and action inventory, not on fixing specific route defects before the rebuilt stack is actually exercised.
@@ -62,6 +77,7 @@ Completion criteria:
 - Mitigation: record visible action inventory for each page so the next iterations can systematically deepen coverage instead of rediscovering affordances manually.
 - Decision: treat documented/canonical redirects as valid route outcomes in the live sweep (`/releases`, `/releases/promotion-queue`, `/ops/policy`, `/ops/policy/audit`, `/ops/platform-setup/trust-signing`, `/setup/topology`) because those aliases are intentional product behavior, not regressions.
 - Risk: many remaining failures are real frontdoor contract mismatches rather than simple UI copy/render issues, so the next iterations need backend/frontend contract inspection, not just surface-level error-banner suppression.
+- Decision: the deep live sweeps must be self-diagnosing. A hanging Playwright command is a harness defect because it blocks the problem-first loop from collecting the full issue set.

 ## Next Checkpoints
 - 2026-03-09: land the reusable live canonical route sweep script.
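
The per-action watchdog described in FE-LIVE-SWEEP-003 could look roughly like the sketch below. It is not taken from `live-ops-policy-action-sweep.mjs`: the names `withTimeout`, `runAction`, `exitCodeFor`, and the `onProgress` callback are hypothetical, and the 15-second default is an assumed value. It only illustrates racing each action against a timer, keeping step-level context for a blocked action, and deriving a non-zero exit code from the results.

```javascript
// Illustrative sketch only; every helper name here is hypothetical.

// Race a promise against a timer so a wedged browser interaction cannot hang the run.
function withTimeout(promise, timeoutMs, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
  });
  // Whichever settles first wins; always clear the timer so the process can exit.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Run one action, recording the steps it reached so a failure carries context.
async function runAction(name, action, { timeoutMs = 15000, onProgress = () => {} } = {}) {
  const steps = [];
  const step = (label) => steps.push(label);
  const result = { action: name, ok: true, steps };
  try {
    // Promise.resolve().then(...) also converts synchronous throws into rejections.
    await withTimeout(Promise.resolve().then(() => action(step)), timeoutMs, name);
  } catch (error) {
    result.ok = false;
    result.error = error.message;
    result.lastStep = steps[steps.length - 1] ?? null;
  }
  onProgress(result); // for example, rewrite a partial JSON result file here
  return result;
}

// Non-zero when any checked action failed or any runtime issue was recorded.
function exitCodeFor(results, runtimeIssues = []) {
  const failed = results.filter((r) => !r.ok).length;
  return failed > 0 || runtimeIssues.length > 0 ? 1 : 0;
}
```

A caller would pass each page interaction as the `action` function, call `step('...')` before each click or navigation, and finish with `process.exitCode = exitCodeFor(results, runtimeIssues)`.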
--- a/docs/implplan/SPRINT_20260310_002_Policy_policy_frontdoor_compat_and_live_verification.md
+++ b/docs/implplan/SPRINT_20260310_002_Policy_policy_frontdoor_compat_and_live_verification.md
@@ -0,0 +1,78 @@
+# Sprint 20260310-002 - Policy Frontdoor Compat And Live Verification
+
+## Topic & Scope
+- Restore the first-party `/policy/*` frontdoor contract on the rebuilt `https://stella-ops.local` stack so the policy simulation and governance surfaces no longer 404 through the router.
+- Fill the missing policy gateway compatibility endpoints that the live web shell expects during policy simulation, coverage, audit, effective-policy, exception, conflict, and batch-evaluation flows.
+- Keep the live Playwright policy action sweep meaningful by modeling the real shadow-mode state machine instead of failing on intentionally disabled controls.
+- Working directory: `src/Policy/StellaOps.Policy.Gateway`.
+- Allowed coordination edits: `devops/compose/router-gateway-local.json`, `src/Policy/__Tests/StellaOps.Policy.Gateway.Tests/PolicySimulationEndpointsTests.cs`, `src/Router/__Tests/StellaOps.Gateway.WebService.Tests/Middleware/RouteDispatchMiddlewareMicroserviceTests.cs`, `src/Web/StellaOps.Web/scripts/live-ops-policy-action-sweep.mjs`, `docs/implplan/SPRINT_20260310_002_Policy_policy_frontdoor_compat_and_live_verification.md`.
+- Expected evidence: targeted policy/router test passes and authenticated live Playwright evidence under `src/Web/StellaOps.Web/output/playwright/` showing zero runtime issues for the ops/policy sweep.
+
+## Dependencies & Concurrency
+- Depends on the scratch rebuild being complete enough for router, authority, policy gateway, and the web shell to authenticate at `https://stella-ops.local`.
+- Safe parallelism: do not edit unrelated router readiness/search/component revival files; keep changes scoped to the frontdoor policy compatibility path and its QA harness.
+
+## Documentation Prerequisites
+- `AGENTS.md`
+- `docs/qa/feature-checks/FLOW.md`
+- `docs/modules/router/architecture.md`
+- `docs/modules/platform/architecture-overview.md`
+
+## Delivery Tracker
+
+### POLICY-FRONTDOOR-001 - Restore missing policy gateway compatibility endpoints
+Status: DONE
+Dependency: none
+Owners: Developer, QA
+Task description:
+- Add the compatibility endpoints required by the live policy simulation/governance shell so `/policy/*` requests succeed through the first-party gateway on a fresh stack.
+- Keep the responses deterministic and scratch-friendly so the live browser sweep has meaningful data to work against.
+
+Completion criteria:
+- [x] Policy gateway exposes the missing `/policy/shadow/*`, `/policy/simulations/*`, `/policy/packs/*`, `/policy/effective`, `/policy/audit`, `/policy/exceptions*`, `/policy/conflicts*`, and `/policy/batch-evaluations*` compatibility surfaces required by the live shell.
+- [x] Targeted policy gateway tests cover the new compatibility contracts.
+- [x] The rebuilt live stack no longer emits `/policy/*` 404s from the policy simulation sweep.
+
+### POLICY-FRONTDOOR-002 - Fix router translation for first-party policy paths
+Status: DONE
+Dependency: POLICY-FRONTDOOR-001
+Owners: Developer
+Task description:
+- Diagnose why `/policy/*` still fails through the router even when the policy gateway exposes the expected endpoints.
+- Repair the local frontdoor route so the router preserves the `/policy` service prefix instead of stripping it before microservice dispatch.
+
+Completion criteria:
+- [x] The router local config translates `/policy/*` to the policy gateway with the correct preserved path prefix.
+- [x] A router regression test proves `/policy/shadow/config` no longer loses the `/policy` segment during microservice translation.
+- [x] `stellaops-router-gateway` starts healthy after the config repair.
+
+### POLICY-FRONTDOOR-003 - Make the live policy action sweep reflect real product behavior
+Status: DONE
+Dependency: POLICY-FRONTDOOR-002
+Owners: QA, Developer (FE)
+Task description:
+- Remove the false-negative `View Results` failure from the live policy action sweep by modeling the real shadow-mode workflow.
+- The sweep must enable shadow mode when needed, verify that results/history becomes reachable, and restore the disabled baseline so repeated scratch loops remain deterministic.
+
+Completion criteria:
+- [x] The live action sweep treats intentionally disabled controls as state to navigate, not as blind click failures.
+- [x] The sweep verifies `View Results` reaches simulation history after shadow mode is enabled.
+- [x] The authenticated live policy action sweep finishes with zero action failures and zero runtime issues.
+
+## Execution Log
+| Date (UTC) | Update | Owner |
+| --- | --- | --- |
+| 2026-03-10 | Sprint created for the rebuilt-stack policy frontdoor repair after live Playwright showed first-party `/policy/*` 404s and a false-negative disabled action on the simulation page. | Developer |
+| 2026-03-10 | Added the missing policy gateway compatibility endpoints and deterministic backing state for shadow config, simulation history, coverage, effective policy, audit, exceptions, conflicts, and batch evaluations. Targeted policy gateway tests passed via the direct test assembly runner. | Developer |
+| 2026-03-10 | Diagnosed the real router defect: the canonical `/policy` microservice route existed already, but its translation stripped the `/policy` prefix before dispatch. Updated `router-gateway-local.json` to translate to `http://policy-gateway.stella-ops.local/policy`, added a router regression, and confirmed the gateway restarted healthy. | Developer |
+| 2026-03-10 | Reran the authenticated live ops/policy Playwright sweep and confirmed the runtime 404s disappeared. Then updated the sweep to enable shadow mode before verifying `View Results` and to restore the disabled baseline afterwards, and revalidated the live slice at `failedActionCount=0` and `runtimeIssueCount=0`. | Developer |
+
+## Decisions & Risks
+- Decision: keep `/policy/*` first-party and routed as a router microservice path. Reverse proxy exceptions remain reserved for third-party services, not Stella-owned policy surfaces.
+- Decision: preserve the `/policy` path prefix in the router translation instead of adding more special-case reverse-proxy routes, because the failure was path rewriting, not a missing service mapping.
+- Risk: the live policy action sweep covers only the current ops/policy slice; broader page-by-page live verification is still required in later iterations.
+- Mitigation: keep the sweep deterministic, authenticated, and state-restoring so it can be reused across scratch iterations as broader route/action work continues.
+
+## Next Checkpoints
+- Commit the scoped policy/router/web-script repair without unrelated router readiness or search changes.
+- Fold the next authenticated live slice into the broader canonical route backlog and continue the sweep page by page and action by action.
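
The prefix-stripping defect recorded under POLICY-FRONTDOOR-002 can be illustrated with a small model. This is a sketch under assumptions, not the router's actual code or config schema: the `translate` function and the `path` / `translatesTo` field names are hypothetical, and the `before` target is an assumed earlier value. Only the repaired target `http://policy-gateway.stella-ops.local/policy` comes from the execution log.

```javascript
// Hypothetical model of a route translation that strips the matched prefix.
function translate(requestPath, route) {
  const matches =
    requestPath === route.path || requestPath.startsWith(route.path + '/');
  if (!matches) {
    return null; // this route does not apply to the request
  }
  // The matched prefix is removed and only the remainder is forwarded.
  const remainder = requestPath.slice(route.path.length); // e.g. "/shadow/config"
  return route.translatesTo.replace(/\/+$/, '') + remainder;
}

// Assumed earlier target: the service host without the /policy segment.
const before = { path: '/policy', translatesTo: 'http://policy-gateway.stella-ops.local' };
// Repaired target from the execution log: the prefix is part of the target.
const after = { path: '/policy', translatesTo: 'http://policy-gateway.stella-ops.local/policy' };

console.log(translate('/policy/shadow/config', before));
// http://policy-gateway.stella-ops.local/shadow/config
console.log(translate('/policy/shadow/config', after));
// http://policy-gateway.stella-ops.local/policy/shadow/config
```

Under this model the repair re-adds the prefix in the target rather than changing the matching rule, which matches the decision to fix path rewriting instead of adding special-case reverse-proxy routes.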