git.stella-ops.org/docs/ops/scanner-analyzers-operations.md
2025-10-27 07:57:55 +02:00


Scanner Analyzer Benchmarks Operations Guide

Purpose

Keep the language analyzer microbench under the <5s SBOM pledge. CI emits Prometheus metrics and JSON fixtures so trend dashboards and alerts stay in lockstep with the repository baseline.

Grafana note: Import docs/ops/scanner-analyzers-grafana-dashboard.json into your Prometheus-backed Grafana stack to monitor scanner_analyzer_bench_* metrics and alert on regressions.

Publishing workflow

  1. CI (or an engineer running locally) runs the harness:
    dotnet run \
      --project src/StellaOps.Bench/Scanner.Analyzers/StellaOps.Bench.ScannerAnalyzers/StellaOps.Bench.ScannerAnalyzers.csproj \
      -- \
      --repo-root . \
      --out src/StellaOps.Bench/Scanner.Analyzers/baseline.csv \
      --json out/bench/scanner-analyzers/latest.json \
      --prom out/bench/scanner-analyzers/latest.prom \
      --commit "$(git rev-parse HEAD)" \
      --environment "${CI_ENVIRONMENT_NAME:-local}"
    
  2. Publish the artefacts (baseline.csv, latest.json, latest.prom) to bench-artifacts/<date>/.
  3. Promtail (or the CI job) pushes latest.prom into Prometheus; JSON lands in long-term storage for workbook snapshots.
  4. The harness exits non-zero if:
    • max_ms for any scenario breaches its configured threshold; or
    • max_ms regresses ≥20% versus baseline.csv.
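
The two exit conditions in step 4 can be sketched in a few lines of Python. This is only an illustration, not the harness's actual code (the harness is the .NET project invoked in step 1). The function name `should_fail`, its parameters, and the choice of a strict `>` for the threshold breach are assumptions made for the example.

```python
def should_fail(max_ms: float, threshold_ms: float, baseline_max_ms: float,
                regression_limit: float = 1.20) -> bool:
    """Return True when a scenario should make the run exit non-zero.

    Hypothetical helper: names and comparison operators are assumptions,
    not taken from the harness source.
    """
    # Condition 1: the slowest sample breaches the configured threshold.
    if max_ms > threshold_ms:
        return True
    # Condition 2: max_ms regressed by 20% or more versus the baseline.
    # A non-positive baseline is skipped to avoid dividing by zero.
    if baseline_max_ms > 0 and max_ms / baseline_max_ms >= regression_limit:
        return True
    return False


# A scenario well inside both limits passes.
print(should_fail(max_ms=100.0, threshold_ms=200.0, baseline_max_ms=90.0))
# A scenario under its threshold but 20% slower than baseline fails.
print(should_fail(max_ms=120.0, threshold_ms=200.0, baseline_max_ms=100.0))
```

A run would exit non-zero if this check returned True for any scenario.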

Grafana dashboard

  • Import docs/ops/scanner-analyzers-grafana-dashboard.json.
  • Point the template variable datasource to the Prometheus instance ingesting scanner_analyzer_bench_* metrics.
  • Panels:
    • Max Duration (ms) compares live runs vs baseline.
    • Regression Ratio vs Limit plots (max / baseline_max - 1) * 100.
    • Breached Scenarios stat panel sourced from scanner_analyzer_bench_regression_breached.
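
For reference, the expression plotted by the Regression Ratio vs Limit panel can be reproduced offline. The sketch below assumes that `max` and `baseline_max` are the scenario's current and baseline maximum durations in milliseconds; the function name is hypothetical.

```python
def regression_percent(max_ms: float, baseline_max_ms: float) -> float:
    """Percentage change of max_ms relative to the baseline maximum.

    Mirrors the panel expression (max / baseline_max - 1) * 100.
    """
    return (max_ms / baseline_max_ms - 1) * 100


print(regression_percent(150.0, 100.0))  # 50.0  (50% slower than baseline)
print(regression_percent(75.0, 100.0))   # -25.0 (25% faster than baseline)
```

If the regression ratio metric is max / baseline_max (an assumption here), a ratio of 1.20 corresponds to +20% on this panel.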

Alerting & on-call response

  • Primary alert: fire when scanner_analyzer_bench_regression_ratio{scenario=~".+"} >= 1.20 for 2 consecutive samples (10 min by default). Suggested PromQL:
    max_over_time(scanner_analyzer_bench_regression_ratio[10m]) >= 1.20
    
  • Suppress duplicates using the scenario label.
  • Pager payload should include scenario, max_ms, baseline_max_ms, and commit.
  • Immediate triage steps:
    1. Check the latest.json artefact for the failing scenario and confirm its commit and environment.
    2. Re-run the harness with --captured-at and --baseline pointing at the last known good CSV to verify determinism.
    3. If regression persists, open an incident ticket tagged scanner-analyzer-perf and page the owning language guild.
    4. Roll back the offending change or update the baseline after sign-off from the guild lead and Perf captain.
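
When replaying stored ratios during triage, it can help to check the "2 consecutive samples" condition directly. The sketch below is a hypothetical helper, not part of the harness or the alert rules; the function names and sample lists are invented for illustration. Note that `max_over_time` over a window is satisfied by a single sample at or above the limit, so the consecutive check is the stricter of the two.

```python
from typing import Sequence


def breaches_consecutively(ratios: Sequence[float], limit: float = 1.20,
                           required: int = 2) -> bool:
    """True if `required` back-to-back samples are at or above `limit`."""
    streak = 0
    for ratio in ratios:
        # Extend the streak on a breach, reset it otherwise.
        streak = streak + 1 if ratio >= limit else 0
        if streak >= required:
            return True
    return False


def breaches_any(ratios: Sequence[float], limit: float = 1.20) -> bool:
    """True if any sample in the window is at or above `limit`
    (the behaviour of comparing max_over_time against the limit)."""
    return len(ratios) > 0 and max(ratios) >= limit


# One isolated spike: the window maximum breaches, the consecutive rule does not.
samples = [1.00, 1.25, 1.10, 1.30]
print(breaches_any(samples), breaches_consecutively(samples))
```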

Document the outcome in docs/12_PERFORMANCE_WORKBOOK.md (section 8) so trendlines reflect any accepted regressions.