Scanner Reachability Ground-Truth Corpus

This document defines the deterministic toy-service corpus used to validate reachability tier classification quality in Scanner tests.

Location

  • src/Scanner/__Tests/__Datasets/toys/

Service Set

  • svc-01-log4shell-java
  • svc-02-prototype-pollution-node
  • svc-03-pickle-deserialization-python
  • svc-04-text-template-go
  • svc-05-xmlserializer-dotnet
  • svc-06-erb-injection-ruby

Each service contains:

  • Minimal source code with a known vulnerability pattern.
  • labels.yaml with tier ground truth for one or more CVEs.
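
The layout described above can be checked with a short script. This is a minimal sketch, not the harness itself: it assumes each service is a direct child directory of the corpus root and that labels.yaml sits at the top of the service directory. The function name find_structure_problems is hypothetical.

```python
from pathlib import Path

# Service names as listed in this document.
EXPECTED_SERVICES = (
    "svc-01-log4shell-java",
    "svc-02-prototype-pollution-node",
    "svc-03-pickle-deserialization-python",
    "svc-04-text-template-go",
    "svc-05-xmlserializer-dotnet",
    "svc-06-erb-injection-ruby",
)


def find_structure_problems(corpus_root):
    """Return problems in the fixed service order, so output is reproducible."""
    root = Path(corpus_root)
    problems = []
    for name in EXPECTED_SERVICES:
        service_dir = root / name
        if not service_dir.is_dir():
            problems.append(f"{name}: missing service directory")
        # Assumption: labels.yaml lives directly inside the service directory.
        elif not (service_dir / "labels.yaml").is_file():
            problems.append(f"{name}: missing labels.yaml")
    return problems
```

Calling find_structure_problems("src/Scanner/__Tests/__Datasets/toys") from the repository root would return an empty list when every service directory and its labels.yaml are present.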

labels.yaml Contract (v1)

  • Required top-level fields: schema_version, service, language, entrypoint, cves.
  • Each CVE entry requires: id, package, tier, rationale.
  • Allowed tier values:
    • R0: unreachable
    • R1: present in dependency only
    • R2: imported but not called
    • R3: called but not reachable from entrypoint
    • R4: reachable from entrypoint
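
The contract above can be expressed as a small validator. The sketch below assumes labels.yaml has already been parsed into a Python dict by a YAML parser of your choice. The function name validate_labels, the error message wording, and every value in example_labels are illustrative assumptions, not copied from the corpus or the harness.

```python
REQUIRED_TOP_LEVEL = ("schema_version", "service", "language", "entrypoint", "cves")
REQUIRED_CVE_FIELDS = ("id", "package", "tier", "rationale")
ALLOWED_TIERS = ("R0", "R1", "R2", "R3", "R4")


def validate_labels(doc):
    """Return a sorted list of contract violations for one parsed labels.yaml."""
    errors = []
    for field in REQUIRED_TOP_LEVEL:
        if field not in doc:
            errors.append(f"missing top-level field: {field}")

    cves = doc.get("cves", [])
    # Each service labels one or more CVEs, so an empty list is a violation.
    if "cves" in doc and (not isinstance(cves, list) or not cves):
        errors.append("cves must be a non-empty list")
        cves = []

    for index, entry in enumerate(cves):
        if not isinstance(entry, dict):
            errors.append(f"cves[{index}]: entry must be a mapping")
            continue
        for field in REQUIRED_CVE_FIELDS:
            if field not in entry:
                errors.append(f"cves[{index}]: missing field: {field}")
        tier = entry.get("tier")
        if tier is not None and tier not in ALLOWED_TIERS:
            errors.append(f"cves[{index}]: tier {tier!r} is not one of R0..R4")

    # Sorting keeps the report identical across runs.
    return sorted(errors)


# Hypothetical document; the field values are placeholders for illustration.
example_labels = {
    "schema_version": "v1",
    "service": "svc-01-log4shell-java",
    "language": "java",
    "entrypoint": "placeholder-entrypoint",
    "cves": [
        {
            "id": "CVE-2021-44228",
            "package": "placeholder-package",
            "tier": "R4",
            "rationale": "placeholder rationale",
        }
    ],
}
```

A document that satisfies the contract yields an empty list; anything else yields one message per violation.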

Deterministic Validation Harness

  • Test suite: src/Scanner/__Tests/StellaOps.Scanner.Reachability.Tests/Benchmarks/ReachabilityTierCorpusTests.cs
  • Harness capabilities:
    • Validates corpus structure and required schema fields.
    • Verifies R0..R4 coverage across the toy corpus.
    • Maps R0..R4 into Scanner confidence tiers for compatibility checks.
    • Computes precision, recall, and F1 per tier using deterministic ordering.
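
As an illustration of the metric step, the sketch below computes one-vs-rest precision, recall, and F1 for each tier from (expected, predicted) pairs. It is an assumption about how such a computation could look, not the logic in ReachabilityTierCorpusTests.cs. The function name per_tier_metrics and the sample pairs are hypothetical, and returning 0.0 for an undefined ratio is a convention chosen here.

```python
TIERS = ("R0", "R1", "R2", "R3", "R4")


def per_tier_metrics(pairs):
    """Map each tier to (precision, recall, f1), iterating tiers in fixed order."""
    pairs = list(pairs)
    metrics = {}
    for tier in TIERS:
        tp = sum(1 for expected, predicted in pairs if expected == tier and predicted == tier)
        fp = sum(1 for expected, predicted in pairs if expected != tier and predicted == tier)
        fn = sum(1 for expected, predicted in pairs if expected == tier and predicted != tier)
        # Assumed convention: an undefined ratio (zero denominator) counts as 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[tier] = (precision, recall, f1)
    return metrics


# Hypothetical results: one R2 case is misclassified as R3.
sample_pairs = [
    ("R0", "R0"),
    ("R1", "R1"),
    ("R2", "R3"),
    ("R3", "R3"),
    ("R4", "R4"),
    ("R4", "R4"),
]
sample_metrics = per_tier_metrics(sample_pairs)
```

With these sample pairs, R3 has precision 0.5 and recall 1.0 (one true positive, one false positive), while R2 scores 0.0 on all three metrics because its only case was missed. Iterating over the fixed TIERS tuple, rather than over whatever tiers happen to appear in the input, keeps the report order stable between runs.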

Offline Posture

  • No external network access is required for corpus loading or metric computation.
  • Dataset files are copied into the test output directory, so local and CI runs read the same inputs.