stella-ops.org/git.stella-ops.org

master 00bf2fa99a Repair live unified search corpus runtime

2026-03-09 19:44:16 +02:00

6.9 KiB

Unified Search Operations Runbook

Scope

Runbook for AdvisoryAI unified search setup, operations, troubleshooting, performance, and rollout control.

Setup

Configure AdvisoryAI:KnowledgeSearch:ConnectionString.
Configure AdvisoryAI:UnifiedSearch options.
For live compose/runtime, set AdvisoryAI:KnowledgeSearch:FindingsAdapterBaseUrl, ...:VexAdapterBaseUrl, and ...:PolicyAdapterBaseUrl together so findings, VEX, and policy ingest from live services instead of partial fallback snapshots.
Ensure the published AdvisoryAI image carries the repo-shaped local corpus under /app, including src/AdvisoryAI/StellaOps.AdvisoryAI/UnifiedSearch/Snapshots/{findings,vex,policy,graph,opsmemory,timeline,scanner}.snapshot.json.
Ensure model artifact path exists when VectorEncoderType=onnx:
- default: models/all-MiniLM-L6-v2.onnx
Rebuild indexes in order when verifying live search quality:
- POST /v1/advisory-ai/index/rebuild
- POST /v1/search/index/rebuild
Verify the query endpoint:
- POST /v1/search/query with the X-StellaOps-Tenant header and the advisory-ai:operate scope.
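The rebuild order and the final query check above could be scripted roughly as follows. This is a minimal sketch using only the Python standard library. The endpoint paths and the X-StellaOps-Tenant header come from this runbook; the base URL, the bearer-token transport for the advisory-ai:operate scope, and the request body field `q` are assumptions that should be checked against the actual API contract.

```python
import json
import urllib.request

# Paths and header name come from the runbook; everything else is a placeholder.
REBUILD_STEPS = [
    "/v1/advisory-ai/index/rebuild",  # knowledge index first
    "/v1/search/index/rebuild",       # unified search index second
]
QUERY_PATH = "/v1/search/query"


def build_request(base_url, path, tenant, token, body=None):
    """Build (but do not send) a POST request carrying the tenant header."""
    data = json.dumps(body or {}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=data,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-StellaOps-Tenant": tenant,
            # Assumption: the advisory-ai:operate scope is carried by a bearer token.
            "Authorization": f"Bearer {token}",
        },
    )


def plan_verification(base_url, tenant, token, query_text):
    """Return the ordered requests: both rebuilds, then one query."""
    requests = [build_request(base_url, p, tenant, token) for p in REBUILD_STEPS]
    # Assumption: the query body uses a field named "q" (hypothetical).
    requests.append(
        build_request(base_url, QUERY_PATH, tenant, token, {"q": query_text})
    )
    return requests
```

Each returned request can then be sent in sequence with `urllib.request.urlopen(req, timeout=...)`, stopping at the first failure so the second rebuild never runs against a stale knowledge index.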

Key Endpoints

POST /v1/search/query
POST /v1/search/synthesize
POST /v1/search/index/rebuild
POST /v1/advisory-ai/search/analytics
GET /v1/advisory-ai/search/quality/metrics
GET /v1/advisory-ai/search/quality/alerts

Monitoring

Track per-tenant and global:

Query throughput (query, click, zero_result, synthesis events)
Self-serve journey signals (answer_frame, reformulation, rescue_action)
P50/P95/P99 latency for /v1/search/query
Zero-result rate
Fallback answer rate, clarify rate, insufficient-evidence rate
Reformulation count, rescue-action count, abandoned fallback count
Synthesis quota denials
Index size and rebuild duration
Active encoder diagnostics (diagnostics.activeEncoder)
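As an illustration of the per-tenant and global tracking above, the sketch below derives a zero-result rate from raw analytics events. The event type names (`query`, `zero_result`) come from this runbook; the `(tenant, event_type)` record shape, the `"*"` global bucket, and the definition of the rate as zero_result events divided by query events are assumptions.

```python
from collections import Counter, defaultdict


def zero_result_rates(events):
    """Compute zero_result / query per tenant plus a global figure.

    `events` is an iterable of (tenant, event_type) pairs. This shape is
    hypothetical; only the event type names come from the runbook.
    """
    counts = defaultdict(Counter)
    for tenant, event_type in events:
        counts[tenant][event_type] += 1
        counts["*"][event_type] += 1  # "*" is the global bucket

    rates = {}
    for tenant, c in counts.items():
        queries = c["query"]
        # Avoid division by zero for tenants that have no query events.
        rates[tenant] = c["zero_result"] / queries if queries else 0.0
    return rates
```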

Performance Targets

Instant results: P50 < 100ms, P95 < 200ms, P99 < 300ms
Full results (federated): P50 < 200ms, P95 < 500ms, P99 < 800ms
Deterministic synthesis: P50 < 30ms, P95 < 50ms
LLM synthesis: TTFB P50 < 1s, total P95 < 5s
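The targets above can be checked against a batch of measured latencies. The sketch below covers the instant and full-result profiles and uses the nearest-rank percentile definition, which is an assumption; the monitoring stack in use may interpolate differently and report slightly different values.

```python
import math

# Targets in milliseconds, copied from the "Instant results" and
# "Full results (federated)" rows above.
TARGETS_MS = {
    "instant": {50: 100, 95: 200, 99: 300},
    "full": {50: 200, 95: 500, 99: 800},
}


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_targets(samples_ms, profile):
    """Return {percentile: (observed, target, ok)} for one profile."""
    report = {}
    for pct, target in TARGETS_MS[profile].items():
        observed = percentile(samples_ms, pct)
        report[pct] = (observed, target, observed < target)
    return report
```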

SQL Query Tuning and EXPLAIN Evidence

Unified search read paths rely on:

FTS query over advisoryai.kb_chunk.body_tsv*
Trigram fuzzy fallback (% / similarity())
Vector nearest-neighbor (embedding_vec <=> query_vector)

Recommended validation commands:

EXPLAIN (ANALYZE, BUFFERS)
SELECT c.chunk_id
FROM advisoryai.kb_chunk c
WHERE c.body_tsv_en @@ websearch_to_tsquery('english', @query)
ORDER BY ts_rank_cd(c.body_tsv_en, websearch_to_tsquery('english', @query), 32) DESC, c.chunk_id
LIMIT 20;

EXPLAIN (ANALYZE, BUFFERS)
SELECT c.chunk_id
FROM advisoryai.kb_chunk c
WHERE c.embedding_vec IS NOT NULL
ORDER BY c.embedding_vec <=> CAST(@query_vector AS vector), c.chunk_id
LIMIT 20;

Index expectations:

idx_kb_chunk_body_tsv_en (GIN over body_tsv_en)
idx_kb_chunk_body_trgm (GIN trigram over body)
idx_kb_chunk_embedding_vec_hnsw (HNSW over embedding_vec)

Automated EXPLAIN evidence is captured by:

UnifiedSearchLiveAdapterIntegrationTests.PostgresKnowledgeSearchStore_ExplainAnalyze_ShowsIndexedSearchPlans
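For ad hoc checks outside the integration test, captured EXPLAIN output can be scanned for the expected index names. This is a simple text-matching sketch, not how the test above works; the helper names are hypothetical. It relies on PostgreSQL's text plan format naming the index in scan nodes and labeling sequential scans as `Seq Scan on <table>`.

```python
# Index names copied from the "Index expectations" list above.
EXPECTED_INDEXES = (
    "idx_kb_chunk_body_tsv_en",
    "idx_kb_chunk_body_trgm",
    "idx_kb_chunk_embedding_vec_hnsw",
)


def missing_indexes(plan_text, expected=EXPECTED_INDEXES):
    """Return expected index names that never appear in the plan text."""
    return [name for name in expected if name not in plan_text]


def has_seq_scan(plan_text, table="kb_chunk"):
    """True if any plan line reports a sequential scan on the table."""
    return any(
        "Seq Scan on" in line and table in line
        for line in plan_text.splitlines()
    )
```

A single query plan normally touches only one of these indexes, so pass the relevant subset as `expected` when checking one plan at a time.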

Load and Capacity Envelope

Validated test envelope (in-process benchmark harness):

50 concurrent requests sustained
P95 < 500ms, P99 < 800ms

Sizing guidance:

Up to 100k chunks: 2 vCPU / 4 GB RAM
100k-500k chunks: 4 vCPU / 8 GB RAM
More than 500k chunks or heavy synthesis: 8 vCPU / 16 GB RAM, split synthesis workers
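The sizing tiers can be expressed as a small lookup, for example in a capacity-planning script. How the exact boundaries (100k and 500k chunks) are assigned is an assumption in this sketch, as is the `heavy_synthesis` flag, since the guidance does not define a threshold for heavy synthesis.

```python
def recommend_sizing(chunk_count, heavy_synthesis=False):
    """Map a chunk count to the runbook's sizing tiers.

    Boundary handling (exactly 100k or 500k chunks) is an assumption:
    each boundary value is placed in the smaller tier.
    """
    if chunk_count < 0:
        raise ValueError("chunk_count must be non-negative")
    if heavy_synthesis or chunk_count > 500_000:
        return {"vcpu": 8, "ram_gb": 16, "split_synthesis_workers": True}
    if chunk_count > 100_000:
        return {"vcpu": 4, "ram_gb": 8, "split_synthesis_workers": False}
    return {"vcpu": 2, "ram_gb": 4, "split_synthesis_workers": False}
```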

Feature Flags and Rollout

Config path: AdvisoryAI:UnifiedSearch:TenantFeatureFlags

Enabled
FederationEnabled
SynthesisEnabled

Example:

{
  "AdvisoryAI": {
    "UnifiedSearch": {
      "TenantFeatureFlags": {
        "tenant-alpha": { "Enabled": true, "FederationEnabled": true, "SynthesisEnabled": false },
        "tenant-beta":  { "Enabled": true, "FederationEnabled": false, "SynthesisEnabled": false }
      }
    }
  }
}
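To audit a rollout, the flags for one tenant can be read from a parsed copy of this configuration. The sketch below assumes that a missing tenant or missing flag resolves to a caller-supplied default; how the service itself treats tenants absent from TenantFeatureFlags is not stated here, so treat that fallback as hypothetical.

```python
FLAG_NAMES = ("Enabled", "FederationEnabled", "SynthesisEnabled")


def tenant_flags(config, tenant, default=False):
    """Look up one tenant's flags under AdvisoryAI:UnifiedSearch:TenantFeatureFlags.

    Assumption: a missing tenant or missing flag resolves to `default`.
    The service's real fallback behaviour is not documented in the runbook.
    """
    flags = (
        config.get("AdvisoryAI", {})
        .get("UnifiedSearch", {})
        .get("TenantFeatureFlags", {})
        .get(tenant, {})
    )
    return {name: bool(flags.get(name, default)) for name in FLAG_NAMES}
```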

Troubleshooting

Symptom: empty results

Verify tenant header is present.
Verify UnifiedSearch.Enabled and tenant flag Enabled.
Run index rebuild and check chunk count.
If suggestions also fail, verify that both rebuild steps were run in order, then re-check with a known live query such as "database connectivity".
If only findings answer lanes work while VEX/policy/graph/OpsMemory remain corpus-unready, verify the published snapshot files exist under /app/src/AdvisoryAI/StellaOps.AdvisoryAI/UnifiedSearch/Snapshots/ and confirm the VEX/policy adapter base URLs are configured in runtime env.

Symptom: poor semantic recall

Verify VectorEncoderType and active encoder diagnostics.
Confirm ONNX model path is accessible and valid.
Rebuild index after encoder switch.

Symptom: synthesis unavailable

Check SynthesisEnabled (global + tenant).
Check quota counters and provider configuration.

Symptom: self-serve search feels weak

Inspect GET /v1/advisory-ai/search/quality/metrics?period=7d.
Watch fallbackAnswerRate, clarifyRate, insufficientRate, reformulationCount, rescueActionCount, and abandonedFallbackCount.
Inspect GET /v1/advisory-ai/search/quality/alerts for fallback_loop and abandoned_fallback.
Treat repeated fallback loops as ranking/context gaps; treat abandoned fallback sessions as UX/product gaps.
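The triage rule above can be applied mechanically to a list of alert types. The alert names `fallback_loop` and `abandoned_fallback` come from this runbook; the input shape (a plain list of alert type strings extracted from the alerts response) and the `unclassified` bucket are assumptions, because the response schema is not described here.

```python
# Mapping taken from the triage rule above; alert names come from the runbook.
GAP_BY_ALERT = {
    "fallback_loop": "ranking/context",
    "abandoned_fallback": "ux/product",
}


def triage_alerts(alert_types):
    """Count alerts per gap category; unknown types go to 'unclassified'."""
    summary = {}
    for alert_type in alert_types:
        gap = GAP_BY_ALERT.get(alert_type, "unclassified")
        summary[gap] = summary.get(gap, 0) + 1
    return summary
```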

Symptom: high latency

Check federated backend timeout budget.
Review EXPLAIN (ANALYZE) plans.
Verify index health and cardinality growth by tenant.

Backup and Recovery

The unified index is derivable state: it is rebuilt from the source systems rather than restored from a backup.
Recovery sequence:
1. Restore primary domain systems (findings/vex/policy/docs sources).
2. Restore AdvisoryAI DB schema.
3. Trigger full index rebuild.
4. Validate with quality benchmark fast subset.

Validation Commands

# Fast PR-level quality gate
dotnet test src/AdvisoryAI/__Tests/StellaOps.AdvisoryAI.Tests/StellaOps.AdvisoryAI.Tests.csproj \
  -- --filter-class StellaOps.AdvisoryAI.Tests.UnifiedSearch.UnifiedSearchQualityBenchmarkFastSubsetTests

# Full benchmark + tuning evidence
dotnet test src/AdvisoryAI/__Tests/StellaOps.AdvisoryAI.Tests/StellaOps.AdvisoryAI.Tests.csproj \
  -- --filter-class StellaOps.AdvisoryAI.Tests.UnifiedSearch.UnifiedSearchQualityBenchmarkTests

# Performance envelope
dotnet test src/AdvisoryAI/__Tests/StellaOps.AdvisoryAI.Tests/StellaOps.AdvisoryAI.Tests.csproj \
  -- --filter-class StellaOps.AdvisoryAI.Tests.UnifiedSearch.UnifiedSearchPerformanceEnvelopeTests

# Self-serve telemetry and gap surfacing slice
dotnet build src/AdvisoryAI/__Tests/StellaOps.AdvisoryAI.Tests/StellaOps.AdvisoryAI.Tests.csproj -v minimal
src/AdvisoryAI/__Tests/StellaOps.AdvisoryAI.Tests/bin/Debug/net10.0/StellaOps.AdvisoryAI.Tests.exe \
  -method "StellaOps.AdvisoryAI.Tests.Integration.UnifiedSearchSprintIntegrationTests.G10_SelfServeMetrics_IncludeFallbackReformulationAndRescueSignals" \
  -method "StellaOps.AdvisoryAI.Tests.Integration.UnifiedSearchSprintIntegrationTests.G10_RecoveredFallbackSessions_DoNotCountAsAbandoned" \
  -reporter verbose -noColor
