Restore scratch setup bootstrap and live frontdoor sweep

2026-03-09 01:42:24 +02:00
parent abda749ffd
commit c9686edf07
13 changed files with 766 additions and 63 deletions
--- a/docs/implplan/SPRINT_20260309_001_Platform_scratch_setup_bootstrap_restore.md
+++ b/docs/implplan/SPRINT_20260309_001_Platform_scratch_setup_bootstrap_restore.md
@@ -0,0 +1,72 @@
+# Sprint 20260309-001 - Platform Scratch Setup Bootstrap Restore
+
+## Topic & Scope
+- Restore the documented Windows scratch-setup path so `scripts/setup.ps1` can rebuild Docker images and start Stella Ops from an empty Docker state.
+- Treat the setup script itself as production surface: a clean repo plus docs must be enough to bootstrap the platform without manual script surgery.
+- Re-run the clean setup path after the fix, then continue into Playwright-backed live verification on the rebuilt stack.
+- Working directory: `devops/docker`.
+- Allowed coordination edits: `scripts/setup.ps1`, `scripts/setup.sh`, `devops/compose/docker-compose.stella-ops.yml`, `docs/quickstart.md`, `docs/INSTALL_GUIDE.md`, `devops/README.md`, `devops/compose/README.md`, `src/Web/StellaOps.Web/scripts/chrome-path.js`, `src/Web/StellaOps.Web/scripts/verify-chromium.js`, `docs/implplan/SPRINT_20260309_001_Platform_scratch_setup_bootstrap_restore.md`.
+- Expected evidence: clean setup invocation output, successful image-builder startup, rebuilt compose stack, and downstream Playwright verification artifacts.
+
+## Dependencies & Concurrency
+- Depends on Docker Desktop, hosts entries, and `devops/compose/.env` already being present; the documented setup preflight checks all three before build/start.
+- Safe parallelism: avoid unrelated frontend search, settings, and revived-component work; keep changes limited to the bootstrap scripts/docs unless a new setup blocker proves otherwise.
+
+## Documentation Prerequisites
+- `AGENTS.md`
+- `docs/quickstart.md`
+- `docs/INSTALL_GUIDE.md`
+- `devops/README.md`
+- `devops/compose/README.md`
+
+## Delivery Tracker
+
+### PLATFORM-SETUP-001 - Repair Windows image-builder bootstrap defaults
+Status: DONE
+Dependency: none
+Owners: Developer, QA
+Task description:
+- Fix the documented Windows image-build entry point used by `scripts/setup.ps1` so it parses and runs in the repo's supported PowerShell setup flow.
+- Keep the fix minimal and compatible with environment-variable overrides because the same script is the canonical Docker image build path for a clean local bootstrap.
+
+Completion criteria:
+- [x] `devops/docker/build-all.ps1` parses without PowerShell errors.
+- [x] `scripts/setup.ps1 -SkipBuild` advances past the image-builder entry point on a clean Docker state.
+- [x] The fix preserves `REGISTRY`, `TAG_SUFFIX`, `SDK_IMAGE`, and `RUNTIME_IMAGE` overrides.
+
+### PLATFORM-SETUP-002 - Re-run clean platform bootstrap and continue QA
+Status: DONE
+Dependency: PLATFORM-SETUP-001
+Owners: QA, Developer
+Task description:
+- Re-run the documented scratch bootstrap from the repo scripts after the parser fix, then proceed into live Playwright verification on the rebuilt frontdoor.
+- Record the next blocker found after the bootstrap repair instead of treating setup completion alone as success.
+
+Completion criteria:
+- [x] The clean setup path is rerun from the repo script after the fix.
+- [x] The stack is reachable through `https://stella-ops.local`.
+- [x] The next live verification findings are captured for follow-on iterations.
+
+## Execution Log
+| Date (UTC) | Update | Owner |
+| --- | --- | --- |
+| 2026-03-09 | Sprint created after a scratch Docker wipe exposed that the documented Windows setup path fails immediately in `devops/docker/build-all.ps1` before any images are built. | Developer |
+| 2026-03-09 | Replaced invalid PowerShell null-coalescing defaults in `devops/docker/build-all.ps1` with compatibility-safe runtime fallback assignment, then re-ran `scripts/setup.ps1 -SkipBuild` and confirmed the clean bootstrap advanced into the 60-image rebuild matrix. | Developer |
+| 2026-03-09 | Found a second setup-to-QA blocker: Playwright Chromium was installed under `%LOCALAPPDATA%\ms-playwright`, but `src/Web/StellaOps.Web/scripts/chrome-path.js` only searched `%HOME%\.cache\ms-playwright` and `chrome-win`. Expanded resolver coverage to standard Windows cache roots and `chrome-win64` layouts. | Developer |
+| 2026-03-09 | Tightened the Chromium resolver to prefer the newest discovered Playwright revision, because the same helper is consumed by the Playwright configs and should not silently bind to an older cached browser when multiple revisions are installed. | Developer |
+| 2026-03-09 | Scratch image build completed successfully (`60/60`), but compose startup failed immediately because `docker-compose.stella-ops.yml` still referenced legacy `stellaops/jobengine*` image names while the canonical build matrix emits `stellaops/orchestrator*`. Updated compose to consume the built image names while preserving the existing `jobengine` service identity and host aliases. | Developer |
+| 2026-03-09 | The next clean-start blocker was the external `FRONTDOOR_NETWORK` contract: a full Docker wipe removed `stellaops_frontdoor`, but neither setup script recreated it before `docker compose -f docker-compose.stella-ops.yml up -d`. Wired network creation into both setup scripts and updated the install docs to document the same manual prerequisite. | Developer |
+| 2026-03-09 | Re-ran `scripts/setup.ps1 -SkipBuild -SkipImages` after the setup fixes and confirmed the stack came up cleanly on `https://stella-ops.local`; live Playwright auth also succeeded, proving the scratch bootstrap now reaches real browser-verifiable UI state. | Developer |
+| 2026-03-09 | Demo seeding still exposed module migration debt (`no migration resources to consolidate` across several modules plus a duplicate `Unknowns` migration name). This was not treated as a setup pass condition because the live frontdoor remained operable, but it remains a follow-on platform quality gap. | Developer |
+
+## Decisions & Risks
+- Decision: repair the documented setup path first instead of working around it with ad hoc manual builds, because scratch bootstrap is part of the product surface for this mission.
+- Risk: additional clean-setup blockers may appear after the parser issue because the stack is being rebuilt from empty Docker state rather than from previously warmed images/volumes.
+- Mitigation: keep rerunning the same documented path and treat each newly exposed blocker as iteration input until the full bootstrap succeeds.
+- Decision: treat browser-binary discovery as part of the scratch-bootstrap contract because a clean rebuild is not complete until Playwright can attach to a browser for live verification.
+- Decision: preserve the `jobengine` compose service name and `jobengine.stella-ops.local` alias for compatibility, but map it to the canonical `orchestrator` image names emitted by the Docker build matrix so scratch setup uses the images it just produced.
+- Decision: the automated setup path now owns creation of the external frontdoor Docker network because that network is part of the documented default compose topology, and a scratch bootstrap should not depend on an undocumented pre-existing Docker artifact.
+
+## Next Checkpoints
+- 2026-03-09: rerun `scripts/setup.ps1 -SkipBuild` after the parser fix.
+- 2026-03-09: continue into frontdoor Playwright verification once the rebuilt stack is reachable.
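
The resolver behavior described in the execution log above (search several cache roots, accept both `chrome-win` and `chrome-win64` layouts, prefer the newest Playwright revision) can be sketched roughly as follows. This is an illustrative sketch only, not the contents of `chrome-path.js`, which this commit view does not show. The function names, the `{ root, dirName }` entry shape, and the injected `exists` callback are all assumptions made for the example. It relies on Playwright naming its cached browser directories `chromium-<revision>`.

```javascript
// Hypothetical helper names; the real chrome-path.js may be structured differently.

// Extract the numeric revision from a Playwright cache directory name such as
// "chromium-1148". Returns null for anything else (for example "firefox-1466").
function parseChromiumRevision(dirName) {
  const match = /^chromium-(\d+)$/.exec(dirName);
  return match ? Number(match[1]) : null;
}

// Given entries of the form { root, dirName }, keep only Chromium revisions
// and order them newest first so an older cached browser is never preferred.
function sortNewestFirst(entries) {
  return entries
    .map((entry) => ({ ...entry, revision: parseChromiumRevision(entry.dirName) }))
    .filter((entry) => entry.revision !== null)
    .sort((a, b) => b.revision - a.revision);
}

// Both Windows layouts named in the execution log are tried for each revision.
const WINDOWS_LAYOUTS = ['chrome-win64', 'chrome-win'];

// Return the first executable path that exists, or null if none does.
// `exists` is injected so the sketch has no filesystem imports; in Node it
// could be fs.existsSync.
function resolveChromium(entries, exists) {
  for (const entry of sortNewestFirst(entries)) {
    for (const layout of WINDOWS_LAYOUTS) {
      const candidate = [entry.root, entry.dirName, layout, 'chrome.exe'].join('\\');
      if (exists(candidate)) return candidate;
    }
  }
  return null;
}
```

Sorting before probing the filesystem means that when several revisions are cached, the newest one that actually contains an executable wins, regardless of which cache root it was found in.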
--- a/docs/implplan/SPRINT_20260309_002_FE_live_frontdoor_canonical_route_sweep.md
+++ b/docs/implplan/SPRINT_20260309_002_FE_live_frontdoor_canonical_route_sweep.md
@@ -0,0 +1,67 @@
+# Sprint 20260309-002 - FE Live Frontdoor Canonical Route Sweep
+
+## Topic & Scope
+- Create a real authenticated Playwright harness for the canonical Stella Ops frontdoor routes so route regressions are detected against `https://stella-ops.local`, not just against stubbed e2e fixtures.
+- Use the canonical route inventory already curated in the frontend sweep spec, then record route-level failures, console errors, request failures, and visible operator actions for follow-on deep page/action iterations.
+- Keep this sprint focused on the reusable live sweep harness; route/action fixes discovered by the harness belong to later implementation iterations.
+- Working directory: `src/Web/StellaOps.Web/scripts`.
+- Allowed coordination edits: `src/Web/StellaOps.Web/tests/e2e/prealpha-canonical-full-sweep.spec.ts`, `src/Web/StellaOps.Web/scripts/live-frontdoor-auth.mjs`, `src/Web/StellaOps.Web/scripts/live-frontdoor-canonical-route-sweep.mjs`, `src/Web/StellaOps.Web/scripts/live-frontdoor-changed-surfaces.mjs`, `src/Web/StellaOps.Web/scripts/live-releases-deployments-check.mjs`, `docs/implplan/SPRINT_20260309_002_FE_live_frontdoor_canonical_route_sweep.md`.
+- Expected evidence: a runnable live sweep script, authenticated JSON output under `src/Web/StellaOps.Web/output/playwright/`, and a recorded list of failing canonical routes once the rebuilt stack is reachable.
+
+## Dependencies & Concurrency
+- Depends on the scratch bootstrap sprint completing enough of the stack for `https://stella-ops.local` and Authority auth to respond.
+- Safe parallelism: keep edits in the web scripts area only; do not touch unrelated frontend feature code while other agents are landing search/component changes.
+
+## Documentation Prerequisites
+- `AGENTS.md`
+- `src/Web/StellaOps.Web/AGENTS.md`
+- `docs/qa/feature-checks/FLOW.md`
+- `docs/modules/platform/architecture-overview.md`
+
+## Delivery Tracker
+
+### FE-LIVE-SWEEP-001 - Add authenticated canonical route sweep harness
+Status: DONE
+Dependency: none
+Owners: QA, Developer (FE)
+Task description:
+- Create a Playwright-backed live route harness that authenticates through the real frontdoor, navigates the canonical page inventory, and records route-level failures, visible problem banners, console/request failures, and visible actions.
+- Reuse the existing live auth/session seeding pattern so the harness can run repeatedly across iterations without hand-driving the browser every time.
+
+Completion criteria:
+- [x] A script exists under `src/Web/StellaOps.Web/scripts/` for authenticated live canonical route sweeps.
+- [x] The script writes structured JSON output to `src/Web/StellaOps.Web/output/playwright/`.
+- [x] The script exits non-zero when canonical routes fail the route-level acceptance checks.
+
+### FE-LIVE-SWEEP-002 - Run the harness on the rebuilt stack
+Status: DONE
+Dependency: FE-LIVE-SWEEP-001
+Owners: QA
+Task description:
+- Execute the live canonical route sweep against the rebuilt `stella-ops.local` stack once the scratch bootstrap finishes.
+- Use its findings as the starting backlog for deeper per-page/per-action iterations.
+
+Completion criteria:
+- [x] The harness has been run against the rebuilt stack.
+- [x] The failing route list is captured as iteration evidence.
+- [x] Follow-on implementation work uses the captured failures instead of ad hoc page selection.
+
+## Execution Log
+| Date (UTC) | Update | Owner |
+| --- | --- | --- |
+| 2026-03-09 | Sprint created during the scratch bootstrap so that a broad authenticated Playwright route harness is ready to run against the live frontdoor the moment the stack becomes reachable. | Developer |
+| 2026-03-09 | Added `scripts/live-frontdoor-canonical-route-sweep.mjs`, reusing live frontdoor auth/session seeding, canonical route inventory, strict route checks for known-sensitive pages, and structured JSON output under `output/playwright/`. Syntax validation passed before the live rerun. | Developer |
+| 2026-03-09 | Fixed a harness defect in the shared auth/session model: the original live sweep restored `sessionStorage` only in the login tab, so every freshly opened route page was unauthenticated and falsely redirected to `/welcome`. Moved session seeding into `createAuthenticatedContext(...)` and reused the helper from the other live scripts. | Developer |
+| 2026-03-09 | Ran the authenticated 106-route sweep against the rebuilt stack. After removing redirect/copy false positives, the real live backlog is 19 failing routes: reachability; feeds-airgap; jobengine; quotas; dead-letter; aoc; signals; packs; ai-runs; notifications; status; sbom-sources; policy simulation; policy trust-weights; policy staleness; policy audit; setup/platform trust-signing; and setup notifications. | Developer |
+
+## Decisions & Risks
+- Decision: keep this sprint focused on broad route-level live verification and action inventory, not on fixing specific route defects before the rebuilt stack is actually exercised.
+- Risk: route-level checks alone do not prove that every page action is correct; they are the breadth-first pass that feeds deeper action-by-action iterations.
+- Mitigation: record visible action inventory for each page so the next iterations can systematically deepen coverage instead of rediscovering affordances manually.
+- Decision: treat documented/canonical redirects as valid route outcomes in the live sweep (`/releases`, `/releases/promotion-queue`, `/ops/policy`, `/ops/policy/audit`, `/ops/platform-setup/trust-signing`, `/setup/topology`) because those aliases are intentional product behavior, not regressions.
+- Risk: many remaining failures are real frontdoor contract mismatches rather than simple UI copy/render issues, so the next iterations need backend/frontend contract inspection, not just surface-level error-banner suppression.
+
+## Next Checkpoints
+- 2026-03-09: land the reusable live canonical route sweep script.
+- 2026-03-09: execute the sweep once the scratch rebuild reaches a live frontdoor.
+- 2026-03-09: start implementation iterations on the highest-leverage live failure clusters from the 19-route backlog.
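
The route-level acceptance logic this sprint describes (allow documented redirects, treat a bounce to `/welcome` as an unauthenticated failure, flag console and request failures, exit non-zero when any route fails) might look roughly like the sketch below. It is not the code of `live-frontdoor-canonical-route-sweep.mjs`. The result shape `{ requested, finalPath, consoleErrors, requestFailures }` and the function names are hypothetical. The sprint text does not say whether the listed redirect paths are sources or targets; the sketch assumes they are the requested paths.

```javascript
// Paths the Decisions section lists as intentional redirects.
// Assumption: these are the *requested* paths that may land somewhere else.
const ALLOWED_REDIRECT_SOURCES = new Set([
  '/releases',
  '/releases/promotion-queue',
  '/ops/policy',
  '/ops/policy/audit',
  '/ops/platform-setup/trust-signing',
  '/setup/topology',
]);

// Classify one visited route. `result` is a hypothetical record collected by
// the harness after navigating to `requested`.
function classifyRoute(result) {
  const problems = [];
  const redirected = result.finalPath !== result.requested;
  if (redirected && result.finalPath === '/welcome') {
    // The execution log notes that unauthenticated pages bounce to /welcome.
    problems.push('unauthenticated-redirect');
  } else if (redirected && !ALLOWED_REDIRECT_SOURCES.has(result.requested)) {
    problems.push('unexpected-redirect');
  }
  if (result.consoleErrors.length > 0) problems.push('console-errors');
  if (result.requestFailures.length > 0) problems.push('request-failures');
  return { route: result.requested, ok: problems.length === 0, problems };
}

// Build the structured summary and the exit code the harness would report.
function summarize(results) {
  const routes = results.map(classifyRoute);
  const failed = routes.filter((route) => !route.ok);
  return {
    total: routes.length,
    failedCount: failed.length,
    failed,
    exitCode: failed.length > 0 ? 1 : 0,
  };
}
```

A harness built this way could serialize the summary with `JSON.stringify` into its output directory and then set `process.exitCode = summary.exitCode`, so a CI step fails whenever a canonical route regresses.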