diff --git a/docs/implplan/SPRINT_20260311_011_AdvisoryAI_knowledge_startup_lock_and_doctor_search_restore.md b/docs/implplan/SPRINT_20260311_011_AdvisoryAI_knowledge_startup_lock_and_doctor_search_restore.md
new file mode 100644
index 000000000..a60fdb607
--- /dev/null
+++ b/docs/implplan/SPRINT_20260311_011_AdvisoryAI_knowledge_startup_lock_and_doctor_search_restore.md
@@ -0,0 +1,60 @@
+# Sprint 20260311_011 - AdvisoryAI Knowledge Startup Lock And Doctor Search Restore
+
+## Topic & Scope
+- Restore Doctor unified search on the scratch-built `stella-ops.local` stack after fresh-stack Playwright exposed an empty knowledge corpus on `/ops/operations/doctor`.
+- Fix the AdvisoryAI startup race so knowledge corpus rebuild and unified-search refresh can touch the same store during cold start without breaking first-run correctness.
+- Keep the live mission-control sweep evidence truthful by removing the remaining `View all` selector false negative uncovered in the same pass.
+- Working directory: `src/AdvisoryAI`.
+- Expected evidence: focused AdvisoryAI integration coverage, rebuilt `advisory-ai-web` startup proof, and live Playwright artifacts for Doctor unified search plus mission-control actions.
+
+## Dependencies & Concurrency
+- Depends on `docs/implplan/SPRINT_20260311_010_Platform_scratch_setup_revalidation.md`.
+- Allowed cross-module evidence touch: `src/Web/StellaOps.Web/scripts/live-mission-control-action-sweep.mjs`.
+
+## Documentation Prerequisites
+- `AGENTS.md`
+- `docs/modules/advisory-ai/knowledge-search.md`
+- `docs/qa/feature-checks/FLOW.md`
+
+## Delivery Tracker
+
+### TASK-01 - Make knowledge schema bootstrap concurrency-safe
+Status: DONE
+Dependency: none
+Owners: QA, 3rd line support, Architect, Developer
+Task description:
+- Reproduce the Doctor search failure from the live scratch stack and trace it into the AdvisoryAI knowledge startup path.
+- Fix `PostgresKnowledgeSearchStore.EnsureSchemaAsync()` so concurrent hosted services cannot race on schema creation and leave the Doctor/knowledge corpus empty on first boot.
+
+Completion criteria:
+- [x] Concurrent cold-start schema bootstrap no longer fails in the knowledge store.
+- [x] Focused regression coverage exercises concurrent `EnsureSchemaAsync()` calls against PostgreSQL.
+
+### TASK-02 - Rebuild and prove Doctor unified search on the live scratch stack
+Status: DONE
+Dependency: TASK-01
+Owners: QA, Developer
+Task description:
+- Rebuild and redeploy AdvisoryAI, then rerun the live Doctor unified-search matrix and direct starter-query probes.
+- Recheck the mission-control action sweep after tightening the `View all` selector so the QA artifact reflects actual product behavior.
+
+Completion criteria:
+- [x] `advisory-ai-web` startup logs show a successful knowledge rebuild on the live stack.
+- [x] Live Playwright Doctor unified-search evidence is clean on the scratch deployment.
+- [x] Mission-control action sweep passes without the stale `View all` false negative.
+
+## Execution Log
+| Date (UTC) | Update | Owner |
+| --- | --- | --- |
+| 2026-03-11 | Sprint created after the fresh-stack unified-search matrix isolated Doctor failures to an empty knowledge scope and container logs showed the knowledge startup rebuild failing with PostgreSQL `23505` during schema bootstrap. | QA / 3rd line support |
+| 2026-03-11 | Root cause traced to concurrent `EnsureSchemaAsync()` callers from AdvisoryAI hosted services. Applied a PostgreSQL advisory transaction lock to the knowledge store and added a focused concurrent startup regression. | Architect / Developer |
+| 2026-03-11 | Tightened the mission-board Playwright harness so `View all` binds to the real `/releases/runs` anchor instead of a generic text match. | QA / Developer |
+| 2026-03-11 | Rebuilt and redeployed `advisory-ai-web`; live startup logs now show a successful knowledge rebuild (`documents=470`, `chunks=9051`, `doctor_projections=8`). Reran the live unified-search matrix cleanly (`4 routes checked, 0 issues`), directly rechecked Doctor starter queries with grounded results, and confirmed the mission-control action sweep passes with zero failed actions/runtime issues. | QA / Developer |
+
+## Decisions & Risks
+- Decision: keep Doctor mapped to the knowledge scope. The live failure was caused by the knowledge corpus not rebuilding on startup, not by the Doctor route using the wrong search domain.
+- Decision: fix concurrency inside the knowledge store rather than by trying to sequence hosted services manually. Multiple startup callers are valid and the store must stay safe under them.
+- Decision: use a PostgreSQL advisory transaction lock inside the store bootstrap path so the first-run contract remains correct regardless of how many hosted services touch the knowledge store during startup.
+
+## Next Checkpoints
+- Archive on local commit; Doctor search is restored on the live scratch stack.
diff --git a/docs/modules/advisory-ai/knowledge-search.md b/docs/modules/advisory-ai/knowledge-search.md
index 01fca8faf..443e24139 100644
--- a/docs/modules/advisory-ai/knowledge-search.md
+++ b/docs/modules/advisory-ai/knowledge-search.md
@@ -389,6 +389,7 @@ Notes:
 - `stella advisoryai index rebuild` and `stella search index rebuild` invoke authenticated backend endpoints. For a local source-checkout verification lane without a signed-in CLI session, use `sources prepare` via CLI and the direct HTTP rebuild calls above with explicit `X-StellaOps-*` headers.
 - Compose/runtime requirement: the published AdvisoryAI service image must carry a repo-shaped local corpus under its app content root so `POST /v1/advisory-ai/index/rebuild` can resolve `docs/**`, `devops/compose/openapi_current.json`, and `src/AdvisoryAI/StellaOps.AdvisoryAI/KnowledgeSearch/*.json` even when the source checkout is not mounted into the container. If those assets are absent, live search on `stella-ops.local` degrades to partial unified rows only and documentation/Doctor/API answers disappear.
 - Fresh service startup now auto-runs the knowledge rebuild by default (`AdvisoryAI__KnowledgeSearch__KnowledgeAutoIndexOnStartup=true`). This is the scratch-setup convergence path for `stella-ops.local`: a wiped deployment must populate the documentation/API/Doctor corpus without requiring operators to call `POST /v1/advisory-ai/index/rebuild` manually. Keep the manual endpoint for explicit refreshes and local live-search lanes, but do not depend on it for first-run correctness.
+- Startup schema bootstrap is protected by a PostgreSQL advisory transaction lock. AdvisoryAI cold start can trigger both the knowledge rebuild host and unified-search refresh paths against the same store, so `EnsureSchemaAsync()` must serialize `CREATE SCHEMA` and migration application instead of relying on `IF NOT EXISTS` alone.
 - The published app content root must also carry the full unified snapshot corpus under `src/AdvisoryAI/StellaOps.AdvisoryAI/UnifiedSearch/Snapshots/*.json`; packaging only findings/VEX/policy snapshots leaves graph, OpsMemory, timeline, and scanner answer lanes permanently corpus-unready in the live shell.
 
 ### CLI setup in a source checkout
diff --git a/src/AdvisoryAI/StellaOps.AdvisoryAI/KnowledgeSearch/PostgresKnowledgeSearchStore.cs b/src/AdvisoryAI/StellaOps.AdvisoryAI/KnowledgeSearch/PostgresKnowledgeSearchStore.cs
index 2fad9b123..a4a9b06ac 100644
--- a/src/AdvisoryAI/StellaOps.AdvisoryAI/KnowledgeSearch/PostgresKnowledgeSearchStore.cs
+++ b/src/AdvisoryAI/StellaOps.AdvisoryAI/KnowledgeSearch/PostgresKnowledgeSearchStore.cs
@@ -10,6 +10,7 @@ namespace StellaOps.AdvisoryAI.KnowledgeSearch;
 internal sealed class PostgresKnowledgeSearchStore : IKnowledgeSearchStore, IKnowledgeSearchCorpusAvailabilityStore, IAsyncDisposable
 {
     private static readonly JsonDocument EmptyJsonDocument = JsonDocument.Parse("{}");
+    private const string SchemaLockKey = "advisoryai_knowledge_schema";
 
     private readonly KnowledgeSearchOptions _options;
     private readonly ILogger _logger;
@@ -38,6 +39,8 @@ internal sealed class PostgresKnowledgeSearchStore : IKnowledgeSearchStore, IKno
         await using var connection = await GetDataSource().OpenConnectionAsync(cancellationToken).ConfigureAwait(false);
         await using var transaction = await connection.BeginTransactionAsync(cancellationToken).ConfigureAwait(false);
 
+        await AcquireSchemaLockAsync(connection, transaction, cancellationToken).ConfigureAwait(false);
+
         const string createSchemaSql = "CREATE SCHEMA IF NOT EXISTS advisoryai;";
         await ExecuteNonQueryAsync(connection, transaction, createSchemaSql, cancellationToken).ConfigureAwait(false);
 
@@ -1123,6 +1126,19 @@ internal sealed class PostgresKnowledgeSearchStore : IKnowledgeSearchStore, IKno
         await command.ExecuteNonQueryAsync(cancellationToken).ConfigureAwait(false);
     }
 
+    private static async Task AcquireSchemaLockAsync(
+        NpgsqlConnection connection,
+        NpgsqlTransaction transaction,
+        CancellationToken cancellationToken)
+    {
+        await using var command = connection.CreateCommand();
+        command.Transaction = transaction;
+        command.CommandText = "SELECT pg_advisory_xact_lock(hashtext($1));";
+        command.CommandTimeout = ToCommandTimeoutSeconds(TimeSpan.FromSeconds(30));
+        command.Parameters.AddWithValue(SchemaLockKey);
+        await command.ExecuteNonQueryAsync(cancellationToken).ConfigureAwait(false);
+    }
+
     private static int ToCommandTimeoutSeconds(TimeSpan timeout)
     {
         if (timeout <= TimeSpan.Zero)
diff --git a/src/AdvisoryAI/__Tests/StellaOps.AdvisoryAI.Tests/Integration/UnifiedSearchLiveAdapterIntegrationTests.cs b/src/AdvisoryAI/__Tests/StellaOps.AdvisoryAI.Tests/Integration/UnifiedSearchLiveAdapterIntegrationTests.cs
index d8d440d15..20d6b9164 100644
--- a/src/AdvisoryAI/__Tests/StellaOps.AdvisoryAI.Tests/Integration/UnifiedSearchLiveAdapterIntegrationTests.cs
+++ b/src/AdvisoryAI/__Tests/StellaOps.AdvisoryAI.Tests/Integration/UnifiedSearchLiveAdapterIntegrationTests.cs
@@ -372,6 +372,60 @@ public sealed class UnifiedSearchLiveAdapterIntegrationTests
         (await CountDomainChunksAsync(connection, "policy")).Should().Be(4);
     }
 
+    [Fact]
+    public async Task PostgresKnowledgeSearchStore_EnsureSchemaAsync_IsSafeUnderConcurrentStartupCalls()
+    {
+        await using var fixture = await StartPostgresOrSkipAsync();
+        var options = Options.Create(new KnowledgeSearchOptions
+        {
+            Enabled = true,
+            ConnectionString = fixture.ConnectionString,
+            FtsLanguageConfig = "simple"
+        });
+
+        await using var resetConnection = new NpgsqlConnection(fixture.ConnectionString);
+        await resetConnection.OpenAsync();
+        await ExecuteSqlAsync(resetConnection, "DROP SCHEMA IF EXISTS advisoryai CASCADE;");
+
+        var stores = Enumerable.Range(0, 6)
+            .Select(_ => new PostgresKnowledgeSearchStore(options, NullLogger.Instance))
+            .ToArray();
+        var gate = new ManualResetEventSlim(false);
+
+        try
+        {
+            var tasks = stores
+                .Select(store => Task.Run(async () =>
+                {
+                    gate.Wait();
+                    await store.EnsureSchemaAsync(CancellationToken.None);
+                }))
+                .ToArray();
+
+            gate.Set();
+            await Task.WhenAll(tasks);
+
+            await using var verifyConnection = new NpgsqlConnection(fixture.ConnectionString);
+            await verifyConnection.OpenAsync();
+
+            (await ScalarAsync<string>(
+                verifyConnection,
+                "SELECT to_regclass('advisoryai.kb_chunk')::text;")).Should().Be("advisoryai.kb_chunk");
+            (await ScalarAsync<long>(
+                verifyConnection,
+                "SELECT COUNT(*) FROM advisoryai.__migration_history;")).Should().BeGreaterThan(0);
+        }
+        finally
+        {
+            foreach (var store in stores)
+            {
+                await store.DisposeAsync();
+            }
+
+            gate.Dispose();
+        }
+    }
+
     [Fact]
     public async Task UnifiedSearchIndexer_RebuildAllAsync_PopulatesEnglishFtsColumns_AndRecallsUnifiedDomains()
     {
@@ -981,6 +1035,19 @@ public sealed class UnifiedSearchLiveAdapterIntegrationTests
         return Convert.ToInt32(scalar, System.Globalization.CultureInfo.InvariantCulture);
     }
 
+    private static async Task<T> ScalarAsync<T>(NpgsqlConnection connection, string sql)
+    {
+        await using var command = connection.CreateCommand();
+        command.CommandText = sql;
+        var scalar = await command.ExecuteScalarAsync();
+        if (scalar is null || scalar is DBNull)
+        {
+            return default!;
+        }
+
+        return (T)Convert.ChangeType(scalar, typeof(T), CultureInfo.InvariantCulture);
+    }
+
     private static async Task CountEnglishTsvRowsAsync(NpgsqlConnection connection, string domain)
     {
         await using var command = connection.CreateCommand();
diff --git a/src/Web/StellaOps.Web/scripts/live-mission-control-action-sweep.mjs b/src/Web/StellaOps.Web/scripts/live-mission-control-action-sweep.mjs
index ce7d2d32e..edd4a367e 100644
--- a/src/Web/StellaOps.Web/scripts/live-mission-control-action-sweep.mjs
+++ b/src/Web/StellaOps.Web/scripts/live-mission-control-action-sweep.mjs
@@ -141,19 +141,27 @@ async function resolveLink(page, options, timeoutMs = ELEMENT_WAIT_MS) {
   const deadline = Date.now() + timeoutMs;
 
   while (Date.now() < deadline) {
-    if (options.hrefIncludes) {
-      const candidates = page.locator(`a[href*="${options.hrefIncludes}"]`);
-      const count = await candidates.count();
-      for (let index = 0; index < count; index += 1) {
-        const candidate = candidates.nth(index);
-        const text = ((await candidate.innerText().catch(() => '')) || '').trim();
-        if (!options.name || text === options.name || text.includes(options.name)) {
-          return candidate;
-        }
+    const anchors = page.locator('a');
+    const anchorCount = await anchors.count();
+    for (let index = 0; index < anchorCount; index += 1) {
+      const candidate = anchors.nth(index);
+      const href = ((await candidate.getAttribute('href').catch(() => '')) || '').trim();
+      const text = ((await candidate.innerText().catch(() => '')) || '').trim();
+
+      if (options.hrefIncludes && !href.includes(options.hrefIncludes)) {
+        continue;
+      }
+
+      if (options.name && !(text === options.name || text.includes(options.name))) {
+        continue;
+      }
+
+      if (href || text) {
+        return candidate;
       }
     }
 
-    if (options.name) {
+    if (!options.hrefIncludes && options.name) {
       const roleLocator = page.getByRole('link', { name: options.name }).first();
       if ((await roleLocator.count()) > 0) {
         return roleLocator;
@@ -278,7 +286,7 @@ async function main() {
     {
       route: '/mission-control/board',
       actions: [
-        { action: 'link:View all', name: 'View all', expectedPath: '/releases/runs' },
+        { action: 'link:View all', name: 'View all', hrefIncludes: '/releases/runs', expectedPath: '/releases/runs' },
         { action: 'link:Review', name: 'Review', expectedPath: '/releases/approvals' },
         { action: 'link:Risk detail', name: 'Risk detail', expectedPath: '/security' },
         { action: 'link:Ops detail', name: 'Ops detail', expectedPath: '/ops/operations/data-integrity' },
@@ -286,24 +294,21 @@ async function main() {
         {
           action: 'link:Stage detail',
           name: 'Detail',
-          hrefIncludes:
-            '/setup/topology/environments/stage/posture?tenant=demo-prod&regions=us-east&environments=stage&timeWindow=7d&region=us-east&environment=stage',
+          hrefIncludes: '/setup/topology/environments/stage/posture',
           expectedPath: '/setup/topology/environments/stage/posture',
           expectQuery: { environment: 'stage', region: 'us-east' },
         },
         {
           action: 'link:Stage findings',
           name: 'Findings',
-          hrefIncludes:
-            '/security/findings?tenant=demo-prod&regions=us-east&environments=stage&timeWindow=7d&region=us-east&environment=stage',
+          hrefIncludes: '/security/findings?tenant=demo-prod&regions=us-east&environments=stage',
           expectedPath: '/security/findings',
           expectQuery: { environment: 'stage', region: 'us-east' },
         },
         {
           action: 'link:Risk table open stage',
           name: 'Open',
-          hrefIncludes:
-            '/setup/topology/environments/stage/posture?tenant=demo-prod&regions=us-east&environments=stage&timeWindow=7d&region=us-east&environment=stage',
+          hrefIncludes: '/setup/topology/environments/stage/posture',
           expectedPath: '/setup/topology/environments/stage/posture',
           expectQuery: { environment: 'stage', region: 'us-east' },
         },