Restore Doctor search after AdvisoryAI cold-start race

2026-03-11 21:19:42 +02:00
parent 08006100a5
commit 66e67f1a97
5 changed files with 166 additions and 17 deletions
--- a/docs/implplan/SPRINT_20260311_011_AdvisoryAI_knowledge_startup_lock_and_doctor_search_restore.md
+++ b/docs/implplan/SPRINT_20260311_011_AdvisoryAI_knowledge_startup_lock_and_doctor_search_restore.md
@@ -0,0 +1,60 @@
+# Sprint 20260311_011 - AdvisoryAI Knowledge Startup Lock And Doctor Search Restore
+
+## Topic & Scope
+- Restore Doctor unified search on the scratch-built `stella-ops.local` stack after a fresh-stack Playwright run exposed an empty knowledge corpus on `/ops/operations/doctor`.
+- Fix the AdvisoryAI startup race so knowledge corpus rebuild and unified-search refresh can touch the same store during cold start without breaking first-run correctness.
+- Keep the live mission-control sweep evidence truthful by removing the remaining `View all` selector false negative uncovered in the same pass.
+- Working directory: `src/AdvisoryAI`.
+- Expected evidence: focused AdvisoryAI integration coverage, rebuilt `advisory-ai-web` startup proof, and live Playwright artifacts for Doctor unified search plus mission-control actions.
+
+## Dependencies & Concurrency
+- Depends on `docs/implplan/SPRINT_20260311_010_Platform_scratch_setup_revalidation.md`.
+- Allowed cross-module evidence touch: `src/Web/StellaOps.Web/scripts/live-mission-control-action-sweep.mjs`.
+
+## Documentation Prerequisites
+- `AGENTS.md`
+- `docs/modules/advisory-ai/knowledge-search.md`
+- `docs/qa/feature-checks/FLOW.md`
+
+## Delivery Tracker
+
+### TASK-01 - Make knowledge schema bootstrap concurrency-safe
+Status: DONE
+Dependency: none
+Owners: QA, 3rd line support, Architect, Developer
+Task description:
+- Reproduce the Doctor search failure from the live scratch stack and trace it into the AdvisoryAI knowledge startup path.
+- Fix `PostgresKnowledgeSearchStore.EnsureSchemaAsync()` so concurrent hosted services cannot race on schema creation and leave the Doctor/knowledge corpus empty on first boot.
+
+Completion criteria:
+- [x] Concurrent cold-start schema bootstrap no longer fails in the knowledge store.
+- [x] Focused regression coverage exercises concurrent `EnsureSchemaAsync()` calls against PostgreSQL.
+
+### TASK-02 - Rebuild and prove Doctor unified search on the live scratch stack
+Status: DONE
+Dependency: TASK-01
+Owners: QA, Developer
+Task description:
+- Rebuild and redeploy AdvisoryAI, then rerun the live Doctor unified-search matrix and direct starter-query probes.
+- Recheck the mission-control action sweep after tightening the `View all` selector so the QA artifact reflects actual product behavior.
+
+Completion criteria:
+- [x] `advisory-ai-web` startup logs show a successful knowledge rebuild on the live stack.
+- [x] Live Playwright Doctor unified-search evidence is clean on the scratch deployment.
+- [x] Mission-control action sweep passes without the stale `View all` false negative.
+
+## Execution Log
+| Date (UTC) | Update | Owner |
+| --- | --- | --- |
+| 2026-03-11 | Sprint created after the fresh-stack unified-search matrix isolated Doctor failures to an empty knowledge scope and container logs showed the knowledge startup rebuild failing with PostgreSQL `23505` during schema bootstrap. | QA / 3rd line support |
+| 2026-03-11 | Root cause traced to concurrent `EnsureSchemaAsync()` callers from AdvisoryAI hosted services. Applied a PostgreSQL advisory transaction lock to the knowledge store and added a focused concurrent startup regression. | Architect / Developer |
+| 2026-03-11 | Tightened the mission-board Playwright harness so `View all` binds to the real `/releases/runs` anchor instead of a generic text match. | QA / Developer |
+| 2026-03-11 | Rebuilt and redeployed `advisory-ai-web`; live startup logs now show a successful knowledge rebuild (`documents=470`, `chunks=9051`, `doctor_projections=8`). Reran the live unified-search matrix cleanly (`4 routes checked, 0 issues`), directly rechecked Doctor starter queries with grounded results, and confirmed the mission-control action sweep passes with zero failed actions/runtime issues. | QA / Developer |
+
+## Decisions & Risks
+- Decision: keep Doctor mapped to the knowledge scope. The live failure was caused by the knowledge corpus not rebuilding on startup, not by the Doctor route using the wrong search domain.
+- Decision: fix concurrency inside the knowledge store rather than sequencing hosted services manually. Multiple startup callers are valid and the store must stay safe under them.
+- Decision: use a PostgreSQL advisory transaction lock inside the store bootstrap path so the first-run contract remains correct regardless of how many hosted services touch the knowledge store during startup.
+
+## Next Checkpoints
+- Archive on local commit; Doctor search is restored on the live scratch stack.
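
Reviewer note on the sprint file above: the race and the locking decision it records can be sketched with a small in-memory simulation. This is an illustration under stated assumptions, not the store's code. `FakeCatalog`, the `knowledge` schema name, and the helper functions are hypothetical, and an `asyncio.Lock` stands in for the PostgreSQL advisory transaction lock (in PostgreSQL this is typically taken with `pg_advisory_xact_lock(key)`, which is released when the transaction ends). The lock key and SQL that `PostgresKnowledgeSearchStore.EnsureSchemaAsync()` actually uses are not shown in this diff.

```python
import asyncio


class DuplicateKeyError(Exception):
    """Stands in for PostgreSQL unique_violation (SQLSTATE 23505)."""


class FakeCatalog:
    """Hypothetical in-memory stand-in for the database catalog."""

    def __init__(self):
        self.schemas = set()
        # Plays the role of the advisory transaction lock in this sketch.
        self.lock = asyncio.Lock()

    async def create_schema_if_not_exists(self, name):
        # The existence check and the insert are not atomic, which is the
        # window that "IF NOT EXISTS alone" leaves open under concurrency.
        exists = name in self.schemas
        await asyncio.sleep(0)  # yield so another caller can interleave
        if not exists:
            if name in self.schemas:
                raise DuplicateKeyError("23505")
            self.schemas.add(name)


async def ensure_schema_unlocked(catalog):
    """Bootstrap without serialization: concurrent callers can collide."""
    await catalog.create_schema_if_not_exists("knowledge")


async def ensure_schema_locked(catalog):
    """Bootstrap under the lock: callers run one at a time."""
    async with catalog.lock:
        await catalog.create_schema_if_not_exists("knowledge")


async def run_concurrently(ensure, callers=2):
    """Run several bootstrap callers at once and collect their exceptions."""
    catalog = FakeCatalog()
    results = await asyncio.gather(
        *(ensure(catalog) for _ in range(callers)),
        return_exceptions=True,
    )
    return [r for r in results if isinstance(r, Exception)]


def count_failures(ensure, callers=2):
    """Return how many callers failed during a simulated cold start."""
    return len(asyncio.run(run_concurrently(ensure, callers)))
```

Calling `count_failures(ensure_schema_unlocked)` models two hosted services hitting the bootstrap path together, and `count_failures(ensure_schema_locked)` models the same callers once the bootstrap is serialized. The second caller then observes the schema the first one created instead of colliding with it, which matches the decision to keep the store safe under multiple startup callers rather than ordering the callers.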
--- a/docs/modules/advisory-ai/knowledge-search.md
+++ b/docs/modules/advisory-ai/knowledge-search.md
@@ -389,6 +389,7 @@ Notes:
 - `stella advisoryai index rebuild` and `stella search index rebuild` invoke authenticated backend endpoints. For a local source-checkout verification lane without a signed-in CLI session, use `sources prepare` via CLI and the direct HTTP rebuild calls above with explicit `X-StellaOps-*` headers.
 - Compose/runtime requirement: the published AdvisoryAI service image must carry a repo-shaped local corpus under its app content root so `POST /v1/advisory-ai/index/rebuild` can resolve `docs/**`, `devops/compose/openapi_current.json`, and `src/AdvisoryAI/StellaOps.AdvisoryAI/KnowledgeSearch/*.json` even when the source checkout is not mounted into the container. If those assets are absent, live search on `stella-ops.local` degrades to partial unified rows only and documentation/Doctor/API answers disappear.
 - Fresh service startup now auto-runs the knowledge rebuild by default (`AdvisoryAI__KnowledgeSearch__KnowledgeAutoIndexOnStartup=true`). This is the scratch-setup convergence path for `stella-ops.local`: a wiped deployment must populate the documentation/API/Doctor corpus without requiring operators to call `POST /v1/advisory-ai/index/rebuild` manually. Keep the manual endpoint for explicit refreshes and local live-search lanes, but do not depend on it for first-run correctness.
+- Startup schema bootstrap is protected by a PostgreSQL advisory transaction lock. AdvisoryAI cold start can trigger both the knowledge rebuild host and unified-search refresh paths against the same store, so `EnsureSchemaAsync()` must serialize `CREATE SCHEMA` and migration application instead of relying on `IF NOT EXISTS` alone.
 - The published app content root must also carry the full unified snapshot corpus under `src/AdvisoryAI/StellaOps.AdvisoryAI/UnifiedSearch/Snapshots/*.json`; packaging only findings/VEX/policy snapshots leaves graph, OpsMemory, timeline, and scanner answer lanes permanently corpus-unready in the live shell.

 ### CLI setup in a source checkout