
rb-score

Deterministic scorer for the reachability benchmark.

What it does

  • Validates submissions against schemas/submission.schema.json and truth against schemas/truth.schema.json.
  • Computes precision/recall/F1 (micro, sink-level).
  • Computes explainability score per prediction (0-3) and averages it.
  • Checks duplicate predictions for determinism (inconsistent duplicates lower the determinism rate).
  • Surfaces runtime metadata from the submission (run block).
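
As a rough illustration of the micro-averaged, sink-level metrics listed above, the sketch below pools true positives, false positives, and false negatives across all cases before computing precision, recall, and F1. The data shapes (a dict mapping sink id to label, and a set of sink ids predicted reachable) and the label strings "reachable" and "unreachable" are assumptions made for this example; they are not taken from the schemas or from rb_score.py. The handling of unknown and unlisted sinks follows the Notes section.

```python
def micro_prf(cases):
    """Pool TP/FP/FN over all cases, then compute precision, recall, F1.

    cases: iterable of (truth, predicted) pairs where
      truth     -- dict: sink id -> "reachable" | "unreachable" | "unknown" (assumed labels)
      predicted -- set of sink ids the submission claims are reachable (assumed shape)
    """
    tp = fp = fn = 0
    for truth, predicted in cases:
        for sink in predicted:
            label = truth.get(sink)
            if label is None:
                fp += 1  # sink not present in truth: strict posture
            elif label == "reachable":
                tp += 1
            elif label == "unreachable":
                fp += 1
            # label == "unknown": ignored for FN/FP counting
        for sink, label in truth.items():
            if label == "reachable" and sink not in predicted:
                fn += 1  # reachable sink the submission missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    demo = [
        ({"a": "reachable", "b": "unreachable", "c": "unknown"}, {"a", "b", "c", "x"}),
        ({"d": "reachable"}, set()),
    ]
    print(micro_prf(demo))
```

Pooling the counts before dividing is what makes the average "micro": a case with many sinks weighs more than a case with few, unlike a per-case (macro) average.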

Install (offline-friendly)

python -m pip install -r requirements.txt

Usage

./rb_score.py --truth ../../benchmark/truth/public.json --submission ../../benchmark/submissions/sample.json --format json

Output

  • text (default): short human-readable summary.
  • json: deterministic JSON with top-level metrics and per-case breakdown.
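
One common way to get deterministic JSON is to sort keys and fix the separators, so that equal results serialize to identical bytes regardless of dict insertion order. The sketch below uses only Python's standard json module. Whether rb_score.py serializes exactly this way is an assumption, and the report fields shown are hypothetical.

```python
import json


def to_deterministic_json(report):
    """Serialize a report so equal content always yields the same string."""
    return json.dumps(
        report,
        sort_keys=True,          # stable key order at every nesting level
        separators=(",", ":"),   # no whitespace that could vary
        ensure_ascii=True,       # stable escaping of non-ASCII characters
    )


if __name__ == "__main__":
    # Hypothetical report contents, for illustration only.
    first = {"f1": 0.4, "cases": {"case-b": 1, "case-a": 2}}
    second = {"cases": {"case-a": 2, "case-b": 1}, "f1": 0.4}
    assert to_deterministic_json(first) == to_deterministic_json(second)
    print(to_deterministic_json(first))
```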

Tests

python -m unittest tests/test_scoring.py

Explainability tiers (task 513-009) are covered by test_explainability_tiers in tests/test_scoring.py.

Notes

  • Predictions for sinks not present in truth count as false positives (strict posture).
  • Truth sinks with label unknown are ignored for FN/FP counting.
  • Explainability tiering: 0 = no context; 1 = path with >= 2 nodes; 2 = entry + path with >= 3 nodes; 3 = guards present.
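
The tiering rule in the last note can be sketched as a small function that returns the highest tier a prediction qualifies for, plus a helper that averages the tiers. The field names path, entry, and guards are hypothetical stand-ins for whatever the submission schema actually uses, and treating guards as sufficient for tier 3 on their own is an assumption; see test_explainability_tiers for the real behavior.

```python
def explainability_tier(prediction):
    """Return an explainability tier from 0 to 3 for one prediction.

    prediction: dict with optional keys (hypothetical names):
      "path"   -- list of nodes leading to the sink
      "entry"  -- entry point identifier
      "guards" -- list of guard conditions along the path
    """
    path = prediction.get("path") or []
    if prediction.get("guards"):
        return 3  # guards present (assumed sufficient on their own)
    if prediction.get("entry") and len(path) >= 3:
        return 2  # entry plus a path of at least 3 nodes
    if len(path) >= 2:
        return 1  # path of at least 2 nodes
    return 0      # no context


def mean_explainability(predictions):
    """Average the per-prediction tiers; 0.0 when there are no predictions."""
    tiers = [explainability_tier(p) for p in predictions]
    return sum(tiers) / len(tiers) if tiers else 0.0


if __name__ == "__main__":
    sample = [
        {},
        {"path": ["a", "b"]},
        {"entry": "main", "path": ["a", "b", "c"]},
        {"entry": "main", "path": ["a", "b", "c"], "guards": ["x > 0"]},
    ]
    print([explainability_tier(p) for p in sample], mean_explainability(sample))
```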