Restore Doctor search after AdvisoryAI cold-start race

Contained in branch: master
2026-03-11 21:19:42 +02:00
parent 08006100a5
commit 66e67f1a97
5 changed files with 166 additions and 17 deletions

@@ -0,0 +1,60 @@
# Sprint 20260311_011 - AdvisoryAI Knowledge Startup Lock And Doctor Search Restore
## Topic & Scope
- Restore Doctor unified search on the scratch-built `stella-ops.local` stack after a fresh-stack Playwright run exposed an empty knowledge corpus on `/ops/operations/doctor`.
- Fix the AdvisoryAI startup race so that the knowledge corpus rebuild and the unified-search refresh can both touch the same store during cold start without breaking first-run correctness.
- Keep the live mission-control sweep evidence truthful by removing the remaining `View all` selector false negative uncovered in the same pass.
- Working directory: `src/AdvisoryAI`.
- Expected evidence: focused AdvisoryAI integration coverage, rebuilt `advisory-ai-web` startup proof, and live Playwright artifacts for Doctor unified search plus mission-control actions.
## Dependencies & Concurrency
- Depends on `docs/implplan/SPRINT_20260311_010_Platform_scratch_setup_revalidation.md`.
- Allowed cross-module evidence touch: `src/Web/StellaOps.Web/scripts/live-mission-control-action-sweep.mjs`.
## Documentation Prerequisites
- `AGENTS.md`
- `docs/modules/advisory-ai/knowledge-search.md`
- `docs/qa/feature-checks/FLOW.md`
## Delivery Tracker
### TASK-01 - Make knowledge schema bootstrap concurrency-safe
Status: DONE
Dependency: none
Owners: QA, 3rd line support, Architect, Developer
Task description:
- Reproduce the Doctor search failure from the live scratch stack and trace it into the AdvisoryAI knowledge startup path.
- Fix `PostgresKnowledgeSearchStore.EnsureSchemaAsync()` so concurrent hosted services cannot race on schema creation and leave the Doctor/knowledge corpus empty on first boot.
Completion criteria:
- [x] Concurrent cold-start schema bootstrap no longer fails in the knowledge store.
- [x] Focused regression coverage exercises concurrent `EnsureSchemaAsync()` calls against PostgreSQL.
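
The shape of such a regression can be sketched without a database. The following is a minimal in-memory illustration, not the actual test: `FakeCatalog`, `ensure_schema`, `run_concurrently`, and the `kb_schema` name are all hypothetical, `UniqueViolation` stands in for SQLSTATE `23505`, and a `threading.Lock` stands in for whatever serialization the real store uses. A barrier forces the unlucky interleaving in the unguarded case so the failure is reproducible.

```python
import threading


class UniqueViolation(Exception):
    """Stand-in for PostgreSQL SQLSTATE 23505 (unique_violation)."""


class FakeCatalog:
    """Hypothetical in-memory catalog; creating a name twice raises."""

    def __init__(self):
        self._names = set()
        # Protects the set itself, not the gap between exists() and create().
        self._guard = threading.Lock()

    def exists(self, name):
        with self._guard:
            return name in self._names

    def create(self, name):
        with self._guard:
            if name in self._names:
                raise UniqueViolation(name)
            self._names.add(name)


def ensure_schema(catalog, bootstrap_lock=None, rendezvous=None):
    """Check-then-create bootstrap, optionally serialized by bootstrap_lock."""

    def bootstrap():
        if not catalog.exists("kb_schema"):
            if rendezvous is not None:
                # Hold every caller here until all have passed the check.
                rendezvous.wait()
            catalog.create("kb_schema")

    if bootstrap_lock is None:
        bootstrap()
    else:
        with bootstrap_lock:
            bootstrap()


def run_concurrently(callers, **kwargs):
    """Run ensure_schema from several threads and collect the failures."""
    catalog = FakeCatalog()
    errors = []

    def worker():
        try:
            ensure_schema(catalog, **kwargs)
        except UniqueViolation as exc:
            errors.append(exc)

    threads = [threading.Thread(target=worker) for _ in range(callers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return errors
```

Calling `run_concurrently(2, rendezvous=threading.Barrier(2))` reproduces the duplicate-creation failure, while `run_concurrently(8, bootstrap_lock=threading.Lock())` returns an empty error list because only one caller at a time can be between the check and the create.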
### TASK-02 - Rebuild and prove Doctor unified search on the live scratch stack
Status: DONE
Dependency: TASK-01
Owners: QA, Developer
Task description:
- Rebuild and redeploy AdvisoryAI, then rerun the live Doctor unified-search matrix and direct starter-query probes.
- Recheck the mission-control action sweep after tightening the `View all` selector so the QA artifact reflects actual product behavior.
Completion criteria:
- [x] `advisory-ai-web` startup logs show a successful knowledge rebuild on the live stack.
- [x] Live Playwright Doctor unified-search evidence is clean on the scratch deployment.
- [x] Mission-control action sweep passes without the stale `View all` false negative.
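
One way a generic text match can mislead a harness is ambiguity: several links can share the same label. The sketch below illustrates the difference between matching by text and matching by `href`, using only the Python standard library. The markup, the `/ops/alerts` path, and the helper names are hypothetical; only the `/releases/runs` target and the `View all` label come from this sprint, and the real harness is the Playwright script named above, not this code.

```python
from html.parser import HTMLParser

# Hypothetical markup: two unrelated cards that both render a "View all" link.
SAMPLE_HTML = """
<section><h2>Alerts</h2><a href="/ops/alerts">View all</a></section>
<section><h2>Runs</h2><a href="/releases/runs">View all</a></section>
"""


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every anchor in a document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_anchor = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self.links.append((self._href, "".join(self._text).strip()))
            self._in_anchor = False


def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def by_text(links, text):
    """Generic text match: every anchor whose label equals the text."""
    return [link for link in links if link[1] == text]


def by_href(links, href):
    """Tightened match: only anchors that point at the expected route."""
    return [link for link in links if link[0] == href]
```

With this sample, `by_text(collect_links(SAMPLE_HTML), "View all")` yields two candidates, whereas `by_href(collect_links(SAMPLE_HTML), "/releases/runs")` yields exactly the anchor the sweep is meant to exercise.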
## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2026-03-11 | Sprint created after the fresh-stack unified-search matrix isolated Doctor failures to an empty knowledge scope and container logs showed the knowledge startup rebuild failing with PostgreSQL SQLSTATE `23505` (`unique_violation`) during schema bootstrap. | QA / 3rd line support |
| 2026-03-11 | Root cause traced to concurrent `EnsureSchemaAsync()` callers from AdvisoryAI hosted services. Applied a PostgreSQL advisory transaction lock to the knowledge store and added a focused concurrent-startup regression test. | Architect / Developer |
| 2026-03-11 | Tightened the mission-board Playwright harness so `View all` binds to the real `/releases/runs` anchor instead of a generic text match. | QA / Developer |
| 2026-03-11 | Rebuilt and redeployed `advisory-ai-web`; live startup logs now show a successful knowledge rebuild (`documents=470`, `chunks=9051`, `doctor_projections=8`). Reran the live unified-search matrix cleanly (`4 routes checked, 0 issues`), directly rechecked Doctor starter queries with grounded results, and confirmed the mission-control action sweep passes with zero failed actions/runtime issues. | QA / Developer |
## Decisions & Risks
- Decision: keep Doctor mapped to the knowledge scope. The live failure was caused by the knowledge corpus not rebuilding on startup, not by the Doctor route using the wrong search domain.
- Decision: fix concurrency inside the knowledge store rather than by trying to sequence hosted services manually. Multiple startup callers are valid and the store must stay safe under them.
- Decision: use a PostgreSQL advisory transaction lock inside the store bootstrap path so the first-run contract remains correct regardless of how many hosted services touch the knowledge store during startup.
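
The last decision can be illustrated with a small sketch of the statement sequence a transaction-scoped advisory lock implies. This is an assumption-laden outline, not the store's code: it assumes the blocking `pg_advisory_xact_lock(bigint)` function is used, the lock name `advisoryai.knowledge.schema` and the example DDL are hypothetical, and deriving the key from a SHA-256 digest is just one way to get a stable 64-bit value. The functions only build SQL text; they do not connect to a database.

```python
import hashlib


def advisory_lock_key(name: str) -> int:
    """Derive a stable signed 64-bit key from a lock name (illustrative scheme)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    # PostgreSQL advisory locks take a bigint, so keep 8 bytes as a signed value.
    return int.from_bytes(digest[:8], "big", signed=True)


def bootstrap_statements(lock_name: str, ddl: list[str]) -> list[str]:
    """Return the statements for one lock-guarded schema bootstrap transaction."""
    key = advisory_lock_key(lock_name)
    return [
        "BEGIN",
        # Blocks until no other transaction holds the same key; the lock is
        # released automatically when this transaction commits or rolls back.
        f"SELECT pg_advisory_xact_lock({key})",
        *ddl,
        "COMMIT",
    ]


if __name__ == "__main__":
    # Hypothetical lock name and DDL, for illustration only.
    for statement in bootstrap_statements(
        "advisoryai.knowledge.schema",
        ["CREATE TABLE IF NOT EXISTS kb_doc (id text PRIMARY KEY)"],
    ):
        print(statement)
```

Because the lock is tied to the transaction, a caller that fails partway through bootstrap releases it on rollback, so later callers are not left waiting on a lock nobody will free.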
## Next Checkpoints
- Archive this sprint once the change is committed locally; Doctor search is restored on the live scratch stack.