Unified Search Ranking Benchmark and Tuning Report

Corpus

  • File: src/AdvisoryAI/__Tests/StellaOps.AdvisoryAI.Tests/TestData/unified-search-quality-corpus.json
  • Cases: 250 queries
  • Archetypes: cve_lookup, package_image, documentation, doctor_diagnostic, policy_search, audit_timeline, cross_domain, conversational_followup
  • Labels: relevance grades 0..3

Metrics

  • Precision@1, @3, @5, @10
  • Recall@10
  • NDCG@10
  • Entity-card top hit accuracy
  • Cross-domain recall
  • Ranking stability hash (SHA-256)
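The rank-based metrics above can be sketched as follows. This is a minimal illustration, not the benchmark's implementation: the function names are hypothetical, the relevance threshold (grade >= 1 counts as relevant) is an assumption, and the exponential gain `2^grade - 1` is one common NDCG convention that the benchmark may or may not use.

```python
import math

def precision_at_k(grades, k, threshold=1):
    """Fraction of the top-k slots holding a result with grade >= threshold.

    `grades` lists the relevance grade (0..3) of each returned result in rank order.
    The threshold of 1 is an assumption, not taken from the benchmark.
    """
    return sum(1 for g in grades[:k] if g >= threshold) / k

def recall_at_k(grades, k, total_relevant, threshold=1):
    """Share of all relevant items for the query that appear in the top k."""
    if total_relevant == 0:
        return 0.0
    return sum(1 for g in grades[:k] if g >= threshold) / total_relevant

def dcg_at_k(grades, k):
    """Discounted cumulative gain with exponential gain (assumed convention)."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades[:k]))

def ndcg_at_k(grades, k):
    """DCG normalised by the DCG of the same grades sorted best-first."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0

# Hypothetical ranking for one query: grades of the returned results, best rank first.
ranked = [3, 0, 2, 1, 0]
p1 = precision_at_k(ranked, 1)
p3 = precision_at_k(ranked, 3)
r10 = recall_at_k(ranked, 10, total_relevant=3)
ndcg10 = ndcg_at_k(ranked, 10)
```

Corpus-level figures would then be averages of these per-query values over the queries in the corpus.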

Quality Gates

  • P@1 >= 0.80
  • NDCG@10 >= 0.70
  • Entity-card accuracy >= 0.85
  • Cross-domain recall >= 0.60
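A gate check over these thresholds might look like the sketch below. The dictionary keys and the `failed_gates` helper are hypothetical names chosen for illustration; only the threshold values, and the tuned-defaults figures used as sample input, come from this report.

```python
# Minimum acceptable value per metric, as listed in the quality gates.
GATES = {
    "p_at_1": 0.80,
    "ndcg_at_10": 0.70,
    "entity_card_accuracy": 0.85,
    "cross_domain_recall": 0.60,
}

def failed_gates(metrics):
    """Return the names of gates whose metric is missing or below its minimum."""
    return [
        name
        for name, minimum in GATES.items()
        if metrics.get(name, 0.0) < minimum
    ]

# Tuned-defaults values reported in this document.
tuned = {
    "p_at_1": 0.9600,
    "ndcg_at_10": 0.9598,
    "entity_card_accuracy": 0.9600,
    "cross_domain_recall": 1.0000,
}
tuned_failures = failed_gates(tuned)
```

Treating a missing metric as a failure keeps the check conservative: a run that does not report a metric cannot pass by omission.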

Tuning Method

  • Deterministic grid search over weighting parameters used by DomainWeightCalculator.
  • Candidate values per parameter:
    • CveBoostFindings: {0.35, 0.45}
    • CveBoostVex: {0.30, 0.38}
    • PackageBoostGraph: {0.20, 0.36, 0.48}
    • PackageBoostScanner: {0.12, 0.28, 0.40}
    • AuditBoostTimeline: {0.10, 0.24, 0.34}
    • PolicyBoostPolicy: {0.30, 0.38}
  • Tie-breakers: NDCG@10, then P@1, then stability hash.
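The grid above contains 2 × 2 × 3 × 3 × 3 × 2 = 216 combinations. The sketch below enumerates them in a fixed order and applies the listed tie-breakers. It is an illustration under assumptions: `candidates` and `pick_best` are hypothetical helpers, the scoring of each combination is left out, and breaking a final tie by the lexicographically smallest hash is a guess at what "then stability hash" means.

```python
from itertools import product

# Candidate values per parameter, copied from the list above.
GRID = {
    "CveBoostFindings": [0.35, 0.45],
    "CveBoostVex": [0.30, 0.38],
    "PackageBoostGraph": [0.20, 0.36, 0.48],
    "PackageBoostScanner": [0.12, 0.28, 0.40],
    "AuditBoostTimeline": [0.10, 0.24, 0.34],
    "PolicyBoostPolicy": [0.30, 0.38],
}

def candidates(grid):
    """Yield every parameter combination in a fixed, repeatable order."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

def pick_best(scored):
    """Choose among (params, ndcg_at_10, p_at_1, stability_hash) tuples.

    Higher NDCG@10 wins, then higher P@1, then the smaller hash string
    (the hash ordering is an assumption).
    """
    return min(scored, key=lambda s: (-s[1], -s[2], s[3]))

all_candidates = list(candidates(GRID))
grid_size = len(all_candidates)
```

Because `itertools.product` walks the value lists in a fixed order and the final tie-breaker is a total order on strings, the selected combination does not depend on evaluation order.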

Baseline vs Tuned

Values populated from UnifiedSearchQualityBenchmarkTests output.

| Variant | P@1 | NDCG@10 | Entity Accuracy | Cross-domain Recall | Gates Passed |
| --- | --- | --- | --- | --- | --- |
| Baseline (legacy weighting) | 0.9560 | 0.9522 | 0.9560 | 1.0000 | Yes |
| Tuned defaults | 0.9600 | 0.9598 | 0.9600 | 1.0000 | Yes |

Reference hashes from benchmark output:

  • Baseline: FF32EBE1DF1705A524B20B5A114B0CF496F1CA05147FC9FD869312903B8F40E9
  • Tuned defaults: B5A12ACFE304E6A4620BBB2E9280FEE2E29E952B3E832F92C69FFA10760DA957

Tuned Defaults Applied

  • UnifiedSearchOptions.BaseDomainWeights
    • knowledge=1.05, findings=1.20, vex=1.15, policy=1.10, graph=1.15, timeline=1.05, scanner=1.10, opsmemory=1.05
  • UnifiedSearchOptions.Weighting
    • cve/security/policy/troubleshoot/package/audit/role boosts aligned with tuned values in UnifiedSearchOptions.cs

Determinism

  • Repeated runs produce an identical stability hash for a fixed corpus and fixed options.
  • Fast subset (50 queries) and full suite (250 queries) both run in CI lanes.
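One way to obtain a ranking stability hash with this property is to hash a canonical serialization of every query's ordered result list. The sketch below assumes that approach. The input shape (query id mapped to ordered result ids), the separators, and the `stability_hash` name are all hypothetical, since this report does not describe what the benchmark actually feeds into SHA-256.

```python
import hashlib

def stability_hash(rankings):
    """SHA-256 over all rankings, as uppercase hex.

    `rankings` maps a query id to its ordered list of result ids.
    Queries are visited in sorted order so dictionary insertion order
    cannot affect the digest; result order within a query is preserved
    because it is exactly what the hash is meant to detect.
    """
    digest = hashlib.sha256()
    for query_id in sorted(rankings):
        digest.update(query_id.encode("utf-8"))
        digest.update(b"\n")
        for result_id in rankings[query_id]:
            digest.update(result_id.encode("utf-8"))
            digest.update(b"\n")
        digest.update(b"\x00")  # marks the end of one query's results
    return digest.hexdigest().upper()

# Hypothetical rankings for two queries.
run_a = {"q1": ["doc-3", "doc-1"], "q2": ["doc-7"]}
run_b = {"q2": ["doc-7"], "q1": ["doc-3", "doc-1"]}  # same rankings, different dict order
run_c = {"q1": ["doc-1", "doc-3"], "q2": ["doc-7"]}  # top two results swapped

same = stability_hash(run_a) == stability_hash(run_b)
changed = stability_hash(run_a) != stability_hash(run_c)
```

Under this scheme any change in result order for any query changes the digest, which is what makes a single hash comparison usable as a regression check.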