# Unified Search Ranking Benchmark and Tuning Report
## Corpus
- File: `src/AdvisoryAI/__Tests/StellaOps.AdvisoryAI.Tests/TestData/unified-search-quality-corpus.json`
- Cases: 250 queries
- Archetypes: `cve_lookup`, `package_image`, `documentation`, `doctor_diagnostic`, `policy_search`, `audit_timeline`, `cross_domain`, `conversational_followup`
- Labels: relevance grades `0..3`
## Metrics
- Precision@1, @3, @5, @10
- Recall@10
- NDCG@10
- Entity-card top hit accuracy
- Cross-domain recall
- Ranking stability hash (SHA-256)
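
As a rough illustration of the ranking metrics listed above, the following Python sketch computes Precision@k, Recall@k, and NDCG@k for a single query from graded labels. It is not the benchmark's implementation: the function names are hypothetical, a grade above 0 is assumed to count as relevant, and NDCG is assumed to use linear gain with a log2 discount (an exponential gain of `2^grade - 1` is another common choice, and the report does not say which one applies).

```python
import math
from typing import Dict, List


def precision_at_k(ranked: List[str], grades: Dict[str, int], k: int) -> float:
    """Share of the top-k slots filled by results graded above 0 (assumed relevance cut-off)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in ranked[:k] if grades.get(doc, 0) > 0)
    return hits / k


def recall_at_k(ranked: List[str], grades: Dict[str, int], k: int) -> float:
    """Share of all relevant labelled documents that appear in the top k."""
    relevant = {doc for doc, grade in grades.items() if grade > 0}
    if not relevant:
        return 0.0
    found = sum(1 for doc in ranked[:k] if doc in relevant)
    return found / len(relevant)


def dcg(gains: List[int]) -> float:
    """Discounted cumulative gain with linear gain and a log2 position discount."""
    return sum(gain / math.log2(position + 2) for position, gain in enumerate(gains))


def ndcg_at_k(ranked: List[str], grades: Dict[str, int], k: int) -> float:
    """DCG of the actual ranking divided by the DCG of the ideal ordering of the labels."""
    gains = [grades.get(doc, 0) for doc in ranked[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Corpus-level figures would then be the mean of each per-query value across all cases, assuming the benchmark averages per query.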
## Quality Gates
- P@1 >= 0.80
- NDCG@10 >= 0.70
- Entity-card accuracy >= 0.85
- Cross-domain recall >= 0.60
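
The gates above amount to a set of minimum thresholds. A minimal sketch of such a check follows. The dictionary keys and the `failed_gates` helper are hypothetical names chosen for illustration, and a missing metric is assumed to count as a failure.

```python
from typing import Dict, List

# Minimum values taken from the quality gates listed above.
# The key names are illustrative, not identifiers from the code base.
QUALITY_GATES: Dict[str, float] = {
    "p_at_1": 0.80,
    "ndcg_at_10": 0.70,
    "entity_card_accuracy": 0.85,
    "cross_domain_recall": 0.60,
}


def failed_gates(metrics: Dict[str, float]) -> List[str]:
    """Return the names of gates whose metric is missing or below its minimum."""
    failures = []
    for name, minimum in QUALITY_GATES.items():
        value = metrics.get(name)
        if value is None or value < minimum:
            failures.append(name)
    return failures
```

An empty result corresponds to "Gates Passed: Yes" in the results table.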
## Tuning Method
- Deterministic grid search over weighting parameters used by `DomainWeightCalculator`.
- Parameter ranges:
  - `CveBoostFindings`: {0.35, 0.45}
  - `CveBoostVex`: {0.30, 0.38}
  - `PackageBoostGraph`: {0.20, 0.36, 0.48}
  - `PackageBoostScanner`: {0.12, 0.28, 0.40}
  - `AuditBoostTimeline`: {0.10, 0.24, 0.34}
  - `PolicyBoostPolicy`: {0.30, 0.38}
- Tie-breakers: NDCG@10, then P@1, then stability hash.
## Baseline vs Tuned
_Values populated from `UnifiedSearchQualityBenchmarkTests` output._
| Variant | P@1 | NDCG@10 | Entity Accuracy | Cross-domain Recall | Gates Passed |
| --- | --- | --- | --- | --- | --- |
| Baseline (legacy weighting) | 0.9560 | 0.9522 | 0.9560 | 1.0000 | Yes |
| Tuned defaults | 0.9600 | 0.9598 | 0.9600 | 1.0000 | Yes |

Reference hashes from benchmark output:
- Baseline: `FF32EBE1DF1705A524B20B5A114B0CF496F1CA05147FC9FD869312903B8F40E9`
- Tuned defaults: `B5A12ACFE304E6A4620BBB2E9280FEE2E29E952B3E832F92C69FFA10760DA957`
## Tuned Defaults Applied
- `UnifiedSearchOptions.BaseDomainWeights`
  - knowledge=1.05, findings=1.20, vex=1.15, policy=1.10, graph=1.15, timeline=1.05, scanner=1.10, opsmemory=1.05
- `UnifiedSearchOptions.Weighting`
  - cve/security/policy/troubleshoot/package/audit/role boosts aligned with the tuned values in `UnifiedSearchOptions.cs`
## Determinism
- Repeated runs produce an identical stability hash for a fixed corpus and fixed options.
- A fast subset (50 queries) and the full suite (250 queries) both run in CI lanes.
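
To illustrate how a ranking stability hash can make this property checkable, the Python sketch below hashes the ordered result IDs of every query with SHA-256. The serialization (queries sorted by ID, one tab- and comma-separated line per query) and the function name are assumptions made for this example. The report does not document the format the benchmark actually hashes, so this sketch would not reproduce the reference hashes above.

```python
import hashlib
from typing import Dict, List


def ranking_stability_hash(rankings: Dict[str, List[str]]) -> str:
    """Hash the ranked result IDs of all queries into one uppercase hex digest."""
    digest = hashlib.sha256()
    # Sort by query ID so the hash does not depend on dictionary insertion order.
    for query_id in sorted(rankings):
        line = query_id + "\t" + ",".join(rankings[query_id]) + "\n"
        digest.update(line.encode("utf-8"))
    return digest.hexdigest().upper()
```

Any change in the order of results for any query changes the digest, so comparing the hash across runs is a cheap check that ranking stayed deterministic.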