Restore Doctor search after AdvisoryAI cold-start race

This commit is contained in:
master
2026-03-11 21:19:42 +02:00
parent 08006100a5
commit 66e67f1a97
5 changed files with 166 additions and 17 deletions

@@ -0,0 +1,60 @@
# Sprint 20260311_011 - AdvisoryAI Knowledge Startup Lock And Doctor Search Restore
## Topic & Scope
- Restore Doctor unified search on the scratch-built `stella-ops.local` stack after fresh-stack Playwright exposed an empty knowledge corpus on `/ops/operations/doctor`.
- Fix the AdvisoryAI startup race so knowledge corpus rebuild and unified-search refresh can touch the same store during cold start without breaking first-run correctness.
- Keep the live mission-control sweep evidence truthful by removing the remaining `View all` selector false negative uncovered in the same pass.
- Working directory: `src/AdvisoryAI`.
- Expected evidence: focused AdvisoryAI integration coverage, rebuilt `advisory-ai-web` startup proof, and live Playwright artifacts for Doctor unified search plus mission-control actions.
## Dependencies & Concurrency
- Depends on `docs/implplan/SPRINT_20260311_010_Platform_scratch_setup_revalidation.md`.
- Allowed cross-module evidence touch: `src/Web/StellaOps.Web/scripts/live-mission-control-action-sweep.mjs`.
## Documentation Prerequisites
- `AGENTS.md`
- `docs/modules/advisory-ai/knowledge-search.md`
- `docs/qa/feature-checks/FLOW.md`
## Delivery Tracker
### TASK-01 - Make knowledge schema bootstrap concurrency-safe
Status: DONE
Dependency: none
Owners: QA, 3rd line support, Architect, Developer
Task description:
- Reproduce the Doctor search failure from the live scratch stack and trace it into the AdvisoryAI knowledge startup path.
- Fix `PostgresKnowledgeSearchStore.EnsureSchemaAsync()` so concurrent hosted services cannot race on schema creation and leave the Doctor/knowledge corpus empty on first boot.
Completion criteria:
- [x] Concurrent cold-start schema bootstrap no longer fails in the knowledge store.
- [x] Focused regression coverage exercises concurrent `EnsureSchemaAsync()` calls against PostgreSQL.
### TASK-02 - Rebuild and prove Doctor unified search on the live scratch stack
Status: DONE
Dependency: TASK-01
Owners: QA, Developer
Task description:
- Rebuild and redeploy AdvisoryAI, then rerun the live Doctor unified-search matrix and direct starter-query probes.
- Recheck the mission-control action sweep after tightening the `View all` selector so the QA artifact reflects actual product behavior.
Completion criteria:
- [x] `advisory-ai-web` startup logs show a successful knowledge rebuild on the live stack.
- [x] Live Playwright Doctor unified-search evidence is clean on the scratch deployment.
- [x] Mission-control action sweep passes without the stale `View all` false negative.
## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2026-03-11 | Sprint created after the fresh-stack unified-search matrix isolated Doctor failures to an empty knowledge scope and container logs showed the knowledge startup rebuild failing with PostgreSQL `23505` during schema bootstrap. | QA / 3rd line support |
| 2026-03-11 | Root cause traced to concurrent `EnsureSchemaAsync()` callers from AdvisoryAI hosted services. Applied a PostgreSQL advisory transaction lock to the knowledge store and added a focused concurrent startup regression. | Architect / Developer |
| 2026-03-11 | Tightened the mission-board Playwright harness so `View all` binds to the real `/releases/runs` anchor instead of a generic text match. | QA / Developer |
| 2026-03-11 | Rebuilt and redeployed `advisory-ai-web`; live startup logs now show a successful knowledge rebuild (`documents=470`, `chunks=9051`, `doctor_projections=8`). Reran the live unified-search matrix cleanly (`4 routes checked, 0 issues`), directly rechecked Doctor starter queries with grounded results, and confirmed the mission-control action sweep passes with zero failed actions/runtime issues. | QA / Developer |
## Decisions & Risks
- Decision: keep Doctor mapped to the knowledge scope. The live failure was caused by the knowledge corpus not rebuilding on startup, not by the Doctor route using the wrong search domain.
- Decision: fix concurrency inside the knowledge store rather than by trying to sequence hosted services manually. Multiple startup callers are valid and the store must stay safe under them.
- Decision: use a PostgreSQL advisory transaction lock inside the store bootstrap path so the first-run contract remains correct regardless of how many hosted services touch the knowledge store during startup.
## Next Checkpoints
- Archive on local commit; Doctor search is restored on the live scratch stack.
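
The locking decision above can be illustrated with a small in-process sketch. It is an analogy written in JavaScript (the language of the sweep script in this commit), not the store's implementation: `FakeCatalog`, `KeyedLock`, and `bootstrapConcurrently` are hypothetical names, and the fake catalog only models a check-then-insert window of the kind that would explain the `23505` seen during schema bootstrap. In the real fix, `pg_advisory_xact_lock` plays the role of `KeyedLock` and is released when the transaction ends.

```javascript
// Illustrative in-process analogy only. FakeCatalog and KeyedLock are
// hypothetical stand-ins, not PostgreSQL and not the AdvisoryAI store.

const tick = () => new Promise((resolve) => setTimeout(resolve, 0));

class FakeCatalog {
  constructor() {
    this.schemas = new Set();
  }

  // Models a create-if-missing step whose existence check and insert are
  // separate, so two callers can both pass the check before either inserts.
  async createSchemaIfNotExists(name) {
    if (this.schemas.has(name)) {
      return;
    }
    await tick(); // another caller may run between the check and the insert
    if (this.schemas.has(name)) {
      const error = new Error(`duplicate key: schema "${name}"`);
      error.code = '23505'; // unique_violation
      throw error;
    }
    this.schemas.add(name);
  }
}

class KeyedLock {
  constructor() {
    this.tails = new Map();
  }

  // Runs fn only after every earlier holder of the same key has settled.
  async withLock(key, fn) {
    const previous = this.tails.get(key) ?? Promise.resolve();
    const run = previous.then(() => fn());
    this.tails.set(key, run.catch(() => {}));
    return run;
  }
}

// Starts `callers` bootstrap attempts at once and returns how many failed.
async function bootstrapConcurrently(callers, useLock) {
  const catalog = new FakeCatalog();
  const lock = new KeyedLock();
  const ensureSchema = () =>
    useLock
      ? lock.withLock('advisoryai_knowledge_schema', () => catalog.createSchemaIfNotExists('advisoryai'))
      : catalog.createSchemaIfNotExists('advisoryai');
  const results = await Promise.allSettled(Array.from({ length: callers }, () => ensureSchema()));
  return results.filter((result) => result.status === 'rejected').length;
}
```

In this model, six callers without the lock produce five duplicate-key failures, while the same six callers behind the lock produce none, because every later caller finds the schema already present.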

@@ -389,6 +389,7 @@ Notes:
- `stella advisoryai index rebuild` and `stella search index rebuild` invoke authenticated backend endpoints. For a local source-checkout verification lane without a signed-in CLI session, use `sources prepare` via CLI and the direct HTTP rebuild calls above with explicit `X-StellaOps-*` headers.
- Compose/runtime requirement: the published AdvisoryAI service image must carry a repo-shaped local corpus under its app content root so `POST /v1/advisory-ai/index/rebuild` can resolve `docs/**`, `devops/compose/openapi_current.json`, and `src/AdvisoryAI/StellaOps.AdvisoryAI/KnowledgeSearch/*.json` even when the source checkout is not mounted into the container. If those assets are absent, live search on `stella-ops.local` degrades to partial unified rows only and documentation/Doctor/API answers disappear.
- Fresh service startup now auto-runs the knowledge rebuild by default (`AdvisoryAI__KnowledgeSearch__KnowledgeAutoIndexOnStartup=true`). This is the scratch-setup convergence path for `stella-ops.local`: a wiped deployment must populate the documentation/API/Doctor corpus without requiring operators to call `POST /v1/advisory-ai/index/rebuild` manually. Keep the manual endpoint for explicit refreshes and local live-search lanes, but do not depend on it for first-run correctness.
- Startup schema bootstrap is protected by a PostgreSQL advisory transaction lock. AdvisoryAI cold start can trigger both the knowledge rebuild host and unified-search refresh paths against the same store, so `EnsureSchemaAsync()` must serialize `CREATE SCHEMA` and migration application instead of relying on `IF NOT EXISTS` alone.
- The published app content root must also carry the full unified snapshot corpus under `src/AdvisoryAI/StellaOps.AdvisoryAI/UnifiedSearch/Snapshots/*.json`; packaging only findings/VEX/policy snapshots leaves graph, OpsMemory, timeline, and scanner answer lanes permanently corpus-unready in the live shell.
### CLI setup in a source checkout

@@ -10,6 +10,7 @@ namespace StellaOps.AdvisoryAI.KnowledgeSearch;
 internal sealed class PostgresKnowledgeSearchStore : IKnowledgeSearchStore, IKnowledgeSearchCorpusAvailabilityStore, IAsyncDisposable
 {
     private static readonly JsonDocument EmptyJsonDocument = JsonDocument.Parse("{}");
+    private const string SchemaLockKey = "advisoryai_knowledge_schema";
     private readonly KnowledgeSearchOptions _options;
     private readonly ILogger<PostgresKnowledgeSearchStore> _logger;
@@ -38,6 +39,8 @@ internal sealed class PostgresKnowledgeSearchStore : IKnowledgeSearchStore, IKno
         await using var connection = await GetDataSource().OpenConnectionAsync(cancellationToken).ConfigureAwait(false);
         await using var transaction = await connection.BeginTransactionAsync(cancellationToken).ConfigureAwait(false);
+        await AcquireSchemaLockAsync(connection, transaction, cancellationToken).ConfigureAwait(false);
         const string createSchemaSql = "CREATE SCHEMA IF NOT EXISTS advisoryai;";
         await ExecuteNonQueryAsync(connection, transaction, createSchemaSql, cancellationToken).ConfigureAwait(false);
@@ -1123,6 +1126,19 @@ internal sealed class PostgresKnowledgeSearchStore : IKnowledgeSearchStore, IKno
         await command.ExecuteNonQueryAsync(cancellationToken).ConfigureAwait(false);
     }
+    private static async Task AcquireSchemaLockAsync(
+        NpgsqlConnection connection,
+        NpgsqlTransaction transaction,
+        CancellationToken cancellationToken)
+    {
+        await using var command = connection.CreateCommand();
+        command.Transaction = transaction;
+        command.CommandText = "SELECT pg_advisory_xact_lock(hashtext($1));";
+        command.CommandTimeout = ToCommandTimeoutSeconds(TimeSpan.FromSeconds(30));
+        command.Parameters.AddWithValue(SchemaLockKey);
+        await command.ExecuteNonQueryAsync(cancellationToken).ConfigureAwait(false);
+    }
     private static int ToCommandTimeoutSeconds(TimeSpan timeout)
     {
         if (timeout <= TimeSpan.Zero)

@@ -372,6 +372,60 @@ public sealed class UnifiedSearchLiveAdapterIntegrationTests
         (await CountDomainChunksAsync(connection, "policy")).Should().Be(4);
     }
+    [Fact]
+    public async Task PostgresKnowledgeSearchStore_EnsureSchemaAsync_IsSafeUnderConcurrentStartupCalls()
+    {
+        await using var fixture = await StartPostgresOrSkipAsync();
+        var options = Options.Create(new KnowledgeSearchOptions
+        {
+            Enabled = true,
+            ConnectionString = fixture.ConnectionString,
+            FtsLanguageConfig = "simple"
+        });
+        await using var resetConnection = new NpgsqlConnection(fixture.ConnectionString);
+        await resetConnection.OpenAsync();
+        await ExecuteSqlAsync(resetConnection, "DROP SCHEMA IF EXISTS advisoryai CASCADE;");
+        var stores = Enumerable.Range(0, 6)
+            .Select(_ => new PostgresKnowledgeSearchStore(options, NullLogger<PostgresKnowledgeSearchStore>.Instance))
+            .ToArray();
+        var gate = new ManualResetEventSlim(false);
+        try
+        {
+            var tasks = stores
+                .Select(store => Task.Run(async () =>
+                {
+                    gate.Wait();
+                    await store.EnsureSchemaAsync(CancellationToken.None);
+                }))
+                .ToArray();
+            gate.Set();
+            await Task.WhenAll(tasks);
+            await using var verifyConnection = new NpgsqlConnection(fixture.ConnectionString);
+            await verifyConnection.OpenAsync();
+            (await ScalarAsync<string?>(
+                verifyConnection,
+                "SELECT to_regclass('advisoryai.kb_chunk')::text;")).Should().Be("advisoryai.kb_chunk");
+            (await ScalarAsync<int>(
+                verifyConnection,
+                "SELECT COUNT(*) FROM advisoryai.__migration_history;")).Should().BeGreaterThan(0);
+        }
+        finally
+        {
+            foreach (var store in stores)
+            {
+                await store.DisposeAsync();
+            }
+            gate.Dispose();
+        }
+    }
     [Fact]
     public async Task UnifiedSearchIndexer_RebuildAllAsync_PopulatesEnglishFtsColumns_AndRecallsUnifiedDomains()
     {
@@ -981,6 +1035,19 @@ public sealed class UnifiedSearchLiveAdapterIntegrationTests
         return Convert.ToInt32(scalar, System.Globalization.CultureInfo.InvariantCulture);
     }
+    private static async Task<T> ScalarAsync<T>(NpgsqlConnection connection, string sql)
+    {
+        await using var command = connection.CreateCommand();
+        command.CommandText = sql;
+        var scalar = await command.ExecuteScalarAsync();
+        if (scalar is null || scalar is DBNull)
+        {
+            return default!;
+        }
+        return (T)Convert.ChangeType(scalar, typeof(T), CultureInfo.InvariantCulture);
+    }
     private static async Task<int> CountEnglishTsvRowsAsync(NpgsqlConnection connection, string domain)
     {
         await using var command = connection.CreateCommand();

@@ -141,19 +141,27 @@ async function resolveLink(page, options, timeoutMs = ELEMENT_WAIT_MS) {
   const deadline = Date.now() + timeoutMs;
   while (Date.now() < deadline) {
-    if (options.hrefIncludes) {
-      const candidates = page.locator(`a[href*="${options.hrefIncludes}"]`);
-      const count = await candidates.count();
-      for (let index = 0; index < count; index += 1) {
-        const candidate = candidates.nth(index);
-        const text = ((await candidate.innerText().catch(() => '')) || '').trim();
-        if (!options.name || text === options.name || text.includes(options.name)) {
-          return candidate;
-        }
+    const anchors = page.locator('a');
+    const anchorCount = await anchors.count();
+    for (let index = 0; index < anchorCount; index += 1) {
+      const candidate = anchors.nth(index);
+      const href = ((await candidate.getAttribute('href').catch(() => '')) || '').trim();
+      const text = ((await candidate.innerText().catch(() => '')) || '').trim();
+      if (options.hrefIncludes && !href.includes(options.hrefIncludes)) {
+        continue;
+      }
+      if (options.name && !(text === options.name || text.includes(options.name))) {
+        continue;
+      }
+      if (href || text) {
+        return candidate;
       }
     }
-    if (options.name) {
+    if (!options.hrefIncludes && options.name) {
       const roleLocator = page.getByRole('link', { name: options.name }).first();
       if ((await roleLocator.count()) > 0) {
         return roleLocator;
@@ -278,7 +286,7 @@ async function main() {
     {
       route: '/mission-control/board',
       actions: [
-        { action: 'link:View all', name: 'View all', expectedPath: '/releases/runs' },
+        { action: 'link:View all', name: 'View all', hrefIncludes: '/releases/runs', expectedPath: '/releases/runs' },
         { action: 'link:Review', name: 'Review', expectedPath: '/releases/approvals' },
         { action: 'link:Risk detail', name: 'Risk detail', expectedPath: '/security' },
         { action: 'link:Ops detail', name: 'Ops detail', expectedPath: '/ops/operations/data-integrity' },
@@ -286,24 +294,21 @@ async function main() {
         {
           action: 'link:Stage detail',
           name: 'Detail',
-          hrefIncludes:
-            '/setup/topology/environments/stage/posture?tenant=demo-prod&regions=us-east&environments=stage&timeWindow=7d&region=us-east&environment=stage',
+          hrefIncludes: '/setup/topology/environments/stage/posture',
           expectedPath: '/setup/topology/environments/stage/posture',
           expectQuery: { environment: 'stage', region: 'us-east' },
         },
         {
           action: 'link:Stage findings',
           name: 'Findings',
-          hrefIncludes:
-            '/security/findings?tenant=demo-prod&regions=us-east&environments=stage&timeWindow=7d&region=us-east&environment=stage',
+          hrefIncludes: '/security/findings?tenant=demo-prod&regions=us-east&environments=stage',
           expectedPath: '/security/findings',
           expectQuery: { environment: 'stage', region: 'us-east' },
         },
         {
           action: 'link:Risk table open stage',
           name: 'Open',
-          hrefIncludes:
-            '/setup/topology/environments/stage/posture?tenant=demo-prod&regions=us-east&environments=stage&timeWindow=7d&region=us-east&environment=stage',
+          hrefIncludes: '/setup/topology/environments/stage/posture',
           expectedPath: '/setup/topology/environments/stage/posture',
           expectQuery: { environment: 'stage', region: 'us-east' },
         },
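
The selector changes in the sweep script can be sketched without a browser. This is a simplified illustration, not the harness itself: anchors are plain `{ href, text }` objects rather than Playwright locators, `pickAnchor` and `matchesExpectation` are hypothetical helpers, the `/ops/operations/jobs` link is invented sample data, and the `https://` base URL exists only so relative hrefs can be parsed. How the real script evaluates `expectedPath` and `expectQuery` is not shown in the diff, so the second helper is one plausible reading.

```javascript
// Hypothetical stand-ins for illustration; not the sweep script's own code.

// Returns the first anchor that satisfies every supplied filter, or null.
function pickAnchor(anchors, { hrefIncludes, name } = {}) {
  for (const anchor of anchors) {
    const href = (anchor.href || '').trim();
    const text = (anchor.text || '').trim();
    if (hrefIncludes && !href.includes(hrefIncludes)) {
      continue; // points at a different destination
    }
    if (name && !text.includes(name)) {
      continue; // carries a different label
    }
    if (href || text) {
      return anchor;
    }
  }
  return null;
}

// Checks the path exactly and the query by key, so parameter order and
// extra parameters in the href do not matter.
function matchesExpectation(href, { expectedPath, expectQuery = {} }) {
  const url = new URL(href, 'https://stella-ops.local'); // assumed base
  if (url.pathname !== expectedPath) {
    return false;
  }
  return Object.entries(expectQuery).every(([key, value]) => url.searchParams.get(key) === value);
}

// Invented sample: two links share the same label on one page.
const sampleAnchors = [
  { href: '/ops/operations/jobs', text: 'View all' },
  { href: '/releases/runs', text: 'View all' },
];

const byTextOnly = pickAnchor(sampleAnchors, { name: 'View all' });
const byHrefAndText = pickAnchor(sampleAnchors, { name: 'View all', hrefIncludes: '/releases/runs' });
```

With the sample data, the text-only lookup returns the first `View all` link regardless of destination, while adding `hrefIncludes` selects the `/releases/runs` anchor. Matching on the path and asserting `environment` and `region` separately also means a reordered query string still passes, which is the likely reason the long exact-query `hrefIncludes` values were shortened.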