# Subgraph Extraction for Proof of Exposure

_Last updated: 2025-12-23. Owner: Scanner Guild._

This document specifies the algorithm and implementation strategy for extracting minimal reachability subgraphs from richgraph-v1 documents. These subgraphs power Proof of Exposure (PoE) artifacts, which provide compact, offline-verifiable evidence of vulnerability reachability.

---

## 1. Overview

### 1.1 Purpose

Given a richgraph-v1 call graph and a specific CVE, extract a **minimal subgraph** containing:

- All call paths from **entry points** (HTTP handlers, CLI commands, cron jobs) to **vulnerable sinks** (CVE-affected functions)
- Only the nodes and edges that participate in reachability
- Guard predicates (feature flags, platform conditionals) for auditor evaluation

### 1.2 Inputs

| Input | Type | Source | Example |
|-------|------|--------|---------|
| `graph_hash` | `string` | Scanner output | `blake3:a1b2c3d4e5f6...` |
| `build_id` | `string` | ELF/PE/image digest | `gnu-build-id:5f0c7c3c...` |
| `component_ref` | `string` | PURL or SBOM ref | `pkg:maven/log4j@2.14.1` |
| `vuln_id` | `string` | CVE identifier | `CVE-2021-44228` |
| `policy_digest` | `string` | Policy version hash | `sha256:abc123...` |
| `options` | `ResolverOptions` | Configuration | `{maxDepth: 10, maxPaths: 5}` |

### 1.3 Outputs

| Output | Type | Description |
|--------|------|-------------|
| `Subgraph` | Record | Minimal subgraph with nodes, edges, entry/sink refs |
| `null` | n/a | Returned when no reachable paths exist |

### 1.4 Key Properties

- **Deterministic**: The same inputs always produce the same subgraph (stable ordering, reproducible hashes)
- **Minimal**: Contains only the nodes and edges that participate in entry→sink paths
- **Bounded**: Respects the `maxDepth` and `maxPaths` limits
- **Auditable**: Includes guard predicates and confidence scores

---

## 2. Algorithm Design

### 2.1 High-Level Flow

```
┌─────────────────────────────────────────────────────────────────┐
│                  Subgraph Extraction Pipeline                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. Load richgraph-v1 from CAS                                  │
│     ↓                                                           │
│  2. Resolve Entry Set (EntryTrace + Framework Adapters)         │
│     ↓                                                           │
│  3. Resolve Sink Set (CVE→Symbol Mapping)                       │
│     ↓                                                           │
│  4. Run Bounded BFS (Entry → Sink, maxDepth, maxPaths)          │
│     ↓                                                           │
│  5. Prune Paths (Shortest + Highest Confidence)                 │
│     ↓                                                           │
│  6. Extract Subgraph (Nodes + Edges from Selected Paths)        │
│     ↓                                                           │
│  7. Normalize & Sort (Deterministic Ordering)                   │
│     ↓                                                           │
│  8. Build Subgraph Record with Metadata                         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

### 2.2 Bounded BFS Algorithm

**Objective:** Find all paths from the entry set to the sink set within `maxDepth` hops.

**Pseudocode:**

```python
from collections import deque


def bounded_bfs(graph, entry_set, sink_set, max_depth, max_paths):
    paths = []
    # Each queue item is (current node, path so far, hops taken)
    queue = deque((entry_node, [entry_node], 0) for entry_node in entry_set)

    while queue and len(paths) < max_paths:
        current, path, depth = queue.popleft()  # O(1), unlike list.pop(0)

        # Found a sink node
        if current in sink_set:
            paths.append(path)
            continue

        # Max depth reached
        if depth >= max_depth:
            continue

        # Explore neighbors
        for edge in graph.edges_from(current):
            neighbor = edge.to

            # Avoid cycles
            if neighbor in path:
                continue

            new_path = path + [neighbor]
            queue.append((neighbor, new_path, depth + 1))

    return paths
```

**Optimizations:**

1. **Early termination**: Stop once `max_paths` paths are found
2. **Cycle detection**: Skip nodes already in the current path
3. **Confidence pruning**: Deprioritize low-confidence edges (< 0.5)
4. **Runtime prioritization**: Favor runtime-observed edges when available

### 2.3 Path Pruning Strategy

When BFS finds more than `max_paths` paths, prune to the best candidates.

**Scoring Formula:**

```
score = (1.0 / path_length) * avg_confidence * runtime_boost

Where:
- path_length: Number of hops
- avg_confidence: Average edge confidence
- runtime_boost: 1.5 if any edge is runtime-observed, else 1.0
```

**Selection Algorithm:**

1. Compute the score for all paths
2. Sort by score (descending)
3. Take the top `max_paths`
4. Always include the shortest path (even if it falls below the cutoff)

### 2.4 Deterministic Ordering

To ensure reproducible hashes, all arrays must be sorted deterministically.

**Node Ordering:**

```csharp
nodes = nodes.OrderBy(n => n.Symbol)
             .ThenBy(n => n.ModuleHash)
             .ThenBy(n => n.Addr)
             .ToArray();
```

**Edge Ordering:**

```csharp
edges = edges.OrderBy(e => e.Caller.Symbol)
             .ThenBy(e => e.Callee.Symbol)
             .ToArray();
```

**Guard Ordering:**

```csharp
edge.Guards = edge.Guards.OrderBy(g => g).ToArray();
```

---

## 3. Entry Set Resolution

### 3.1 Strategy

Entry points are where execution begins. We identify them through:

1. **Semantic EntryTrace Analysis**: HTTP handlers, gRPC endpoints, CLI commands
2. **Framework Adapters**: Spring Boot `@RequestMapping`, ASP.NET `[HttpGet]`, etc.
3. **Synthetic Roots**: ELF `.init_array`, `.preinit_array`, constructors, TLS callbacks
4. **Manual Configuration**: User-specified entry points in scanner config

### 3.2 Entry Point Types

| Type | Detection Method | Example Symbol |
|------|------------------|----------------|
| HTTP Handler | Framework attribute scan | `UserController.GetById(int)` |
| gRPC Endpoint | Protobuf service definition | `GreeterService.SayHello(Request)` |
| CLI Command | `Main()` or command-line parser | `Program.Main(string[])` |
| Scheduled Job | Cron/timer attribute | `BackgroundWorker.ProcessQueue()` |
| Init Section | ELF `.init_array` | `__libc_csu_init` |
| Message Handler | Message queue consumer | `KafkaConsumer.OnMessage(Message)` |

### 3.3 EntryTrace Integration

**Existing Module:** `StellaOps.Scanner.EntryTrace`

**API:**

```csharp
public interface IEntryPointResolver
{
    Task<EntryPointSet> ResolveAsync(
        RichGraphV1 graph,
        BuildContext context,
        CancellationToken cancellationToken = default
    );
}

public record EntryPointSet(
    IReadOnlyList<EntryPoint> Points,
    EntryPointIntent Intent,  // WebServer, Worker, CliTool, etc.
    double Confidence
);

public record EntryPoint(
    string SymbolId,
    string Display,
    EntryPointType Type,      // HTTP, GRPC, CLI, Scheduled, etc.
    string? FrameworkHint     // "Spring Boot", "ASP.NET Core", etc.
);
```

### 3.4 Fallback Strategy

If no entry points are detected:

1. Use all nodes with `in-degree == 0` (no callers)
2. Use `main()` or the equivalent language entry point
3. Use synthetic roots (`.init_array`, constructors)
4. **Fail with warning** if none are found (manual configuration required)

---

## 4. Sink Set Resolution

### 4.1 Strategy

Sinks are vulnerable functions identified by CVE-to-symbol mapping.

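
To make the idea of a sink set concrete, here is a minimal Python sketch of exact-match sink resolution. It rests on several assumptions: `GraphNode`, `AFFECTED_SYMBOLS`, and `resolve_sink_set` are hypothetical names invented for illustration, and the in-memory dictionary merely stands in for whatever service supplies the CVE-to-symbol mapping. The symbol IDs are the Log4j examples listed in Section 4.2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraphNode:
    """Hypothetical stand-in for a richgraph-v1 node."""
    symbol_id: str
    display: str


# Hypothetical in-memory mapping keyed by (vuln_id, component_ref).
# A real deployment would query the CVE-to-symbol mapping service instead.
AFFECTED_SYMBOLS = {
    ("CVE-2021-44228", "pkg:maven/log4j@2.14.1"): [
        "pkg:maven/log4j@2.14.1:org.apache.logging.log4j.core.lookup."
        "JndiLookup.lookup(LogEvent, String)",
        "pkg:maven/log4j@2.14.1:org.apache.logging.log4j.core.net."
        "JndiManager.lookup(String)",
    ],
}


def resolve_sink_set(nodes, vuln_id, component_ref):
    """Return the graph nodes whose symbol ID is affected by the CVE.

    The result is sorted by symbol ID so repeated runs yield the same order.
    """
    affected = set(AFFECTED_SYMBOLS.get((vuln_id, component_ref), []))
    matches = [n for n in nodes if n.symbol_id in affected]
    return sorted(matches, key=lambda n: n.symbol_id)


nodes = [
    GraphNode("app:com.example.Api.handle(Request)", "Api.handle"),
    GraphNode(
        "pkg:maven/log4j@2.14.1:org.apache.logging.log4j.core.net."
        "JndiManager.lookup(String)",
        "JndiManager.lookup",
    ),
]

sinks = resolve_sink_set(nodes, "CVE-2021-44228", "pkg:maven/log4j@2.14.1")
print([n.display for n in sinks])   # ['JndiManager.lookup']

# An unknown CVE yields an empty sink set (the case described in Section 9.3).
no_sinks = resolve_sink_set(nodes, "CVE-0000-0000", "pkg:maven/log4j@2.14.1")
print(no_sinks)                     # []
```

Only one of the two affected symbols appears in this toy graph, so the sink set has a single node. An empty result signals that the CVE does not apply to the graph rather than that an error occurred.
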
**Data Source:** `IVulnSurfaceService` (see `docs/reachability/cve-symbol-mapping.md`)

### 4.2 CVE→Symbol Mapping Flow

```
CVE-2021-44228
  → Advisory Linksets
  → Patch Diff Analysis
  → Affected Symbols:
      - pkg:maven/log4j@2.14.1:org.apache.logging.log4j.core.lookup.JndiLookup.lookup(LogEvent, String)
      - pkg:maven/log4j@2.14.1:org.apache.logging.log4j.core.net.JndiManager.lookup(String)
```

### 4.3 Sink Resolution API

```csharp
public interface IVulnSurfaceService
{
    Task<IReadOnlyList<AffectedSymbol>> GetAffectedSymbolsAsync(
        string vulnId,
        string componentRef,
        CancellationToken cancellationToken = default
    );
}

public record AffectedSymbol(
    string SymbolId,
    string MethodKey,
    string Display,
    ChangeType ChangeType,  // Added, Modified, Deleted
    double Confidence
);
```

### 4.4 Sink Matching in Graph

**Exact Match (Preferred):**

```csharp
var sinkNodes = graph.Nodes
    .Where(n => affectedSymbols.Any(s => s.SymbolId == n.SymbolId))
    .ToList();
```

**Fuzzy Match (Fallback for Stripped Binaries):**

```csharp
var sinkNodes = graph.Nodes
    .Where(n => affectedSymbols.Any(s => FuzzyMatch(s, n)))
    .ToList();

bool FuzzyMatch(AffectedSymbol symbol, GraphNode node)
{
    // Match by method signature, demangled name, or code_id
    return symbol.Display.Contains(node.Display) ||
           symbol.MethodKey == node.MethodKey ||
           (symbol.CodeId != null && symbol.CodeId == node.CodeId);
}
```

---

## 5. Guard Predicate Handling

### 5.1 Guard Types

Guards are conditions that control edge reachability:

| Guard Type | Example | Representation |
|------------|---------|----------------|
| Feature Flag | `if (featureFlags.darkMode)` | `feature:dark-mode` |
| Platform | `#ifdef _WIN32` | `platform:windows` |
| Build Tag | `//go:build linux` | `build:linux` |
| Configuration | `if (config.enableCache)` | `config:enable-cache` |
| Runtime Check | `if (user.isAdmin())` | `runtime:admin-check` |

### 5.2 Guard Extraction

**Source-Level (Preferred):**

- Parse the AST for conditional blocks around call sites
- Extract predicate expressions
- Normalize to guard format (e.g., `feature:dark-mode`)

**Binary-Level (Fallback):**

- Identify branch instructions (`je`, `jne`, `cbz`, etc.)
- Link them to the preceding comparison/test instructions
- Heuristic: Flag as `guard:unknown-condition`

### 5.3 Guard Propagation

Guards propagate through call chains:

```
Entry: main()
  ↓ (no guards)
Edge: main() → processRequest()
  ↓ (guard: feature:dark-mode)
Edge: processRequest() → themeService.apply()
  ↓ (inherited guard: feature:dark-mode)
Sink: themeService.apply()
```

**Rule:** If any edge in a path has guards, all downstream edges inherit them.

### 5.4 Guard Metadata in Subgraph

```csharp
public record Edge(
    FunctionId Caller,
    FunctionId Callee,
    string[] Guards  // ["feature:dark-mode", "platform:linux"]
);
```

---

## 6. BuildID Propagation

### 6.1 BuildID Sources

| Binary Format | BuildID Field | Example |
|---------------|---------------|---------|
| ELF | `.note.gnu.build-id` | `5f0c7c3c4d5e6f7a8b9c0d1e2f3a4b5c` |
| PE (Windows) | PDB GUID + Age | `{12345678-1234-5678-1234-567812345678}-1` |
| Mach-O (macOS) | LC_UUID | `12345678-1234-5678-1234-567812345678` |
| Container Image | Image Digest | `sha256:abc123...` |

### 6.2 Extraction Logic

**Priority:**

1. ELF Build-ID (if present)
2. PE PDB GUID (if present)
3. Mach-O UUID (if present)
4. Container image digest (fallback)
5. File SHA-256 (last resort)

**Format:**

```csharp
string buildId = format switch
{
    "elf" => $"gnu-build-id:{ExtractElfBuildId(binary)}",
    "pe" => $"pe-pdb-guid:{ExtractPePdbGuid(binary)}",
    "macho" => $"macho-uuid:{ExtractMachoUuid(binary)}",
    "oci" => $"oci-digest:{imageDigest}",
    _ => $"file-sha256:{ComputeSha256(binary)}"
};
```

### 6.3 BuildID in Subgraph

```csharp
public record Subgraph(
    string BuildId,  // "gnu-build-id:5f0c7c3c..."
    // ... other fields
);
```

**Verification Use Case:** Auditors can match `BuildId` against an image digest or binary hash to confirm that a PoE applies to a specific build.

---

## 7. Integration with Existing Modules

### 7.1 Module Dependencies

```
SubgraphExtractor
  ├─> IRichGraphStore (fetch richgraph-v1 from CAS)
  ├─> IEntryPointResolver (EntryTrace module)
  ├─> IVulnSurfaceService (CVE-symbol mapping)
  ├─> IBinaryFeatureExtractor (BuildID extraction)
  └─> ILogger
```

### 7.2 Dependency Injection Setup

```csharp
// Startup.cs or ServiceCollectionExtensions.cs
services.AddScoped();
services.AddScoped();
services.AddScoped();
services.AddScoped();
services.AddScoped();
```

### 7.3 Configuration

**File:** `etc/scanner.yaml`

```yaml
reachability:
  subgraphExtraction:
    maxDepth: 10
    maxPaths: 5
    includeGuards: true
    requireRuntimeConfirmation: false

    # Entry point resolution
    entryPoints:
      enableFrameworkAdapters: true
      enableSyntheticRoots: true
      fallbackToZeroInDegree: true
      manualEntries: []  # Optional: ["com.example.Main.main()"]

    # Sink resolution
    sinks:
      usePatchDiffs: true
      useAdvisoryLinksets: true
      fuzzyMatchConfidenceThreshold: 0.6

    # Guard extraction
    guards:
      enabled: true
      sourceLevel: true
      binaryLevel: false  # Experimental
      normalizePredicates: true
```

---

## 8. Performance Considerations

### 8.1 Graph Size Limits

| Graph Size | Max Depth | Max Paths | Expected Time |
|------------|-----------|-----------|---------------|
| Small (< 1K nodes) | 15 | 10 | < 100ms |
| Medium (1K-10K nodes) | 12 | 5 | < 500ms |
| Large (10K-100K nodes) | 10 | 3 | < 2s |
| Huge (> 100K nodes) | 8 | 1 | < 5s |

### 8.2 Caching Strategy

**Cache Key:** `(graph_hash, vuln_id, component_ref, policy_digest)`

**Cache Location:** In-memory (LRU cache, max 100 entries) or Redis

**TTL:** 1 hour (subgraphs are deterministic, so cached entries can safely be long-lived)

### 8.3 Parallelization

**Opportunity:** Extract subgraphs for multiple CVEs in parallel

```csharp
var tasks = vulnerabilities.Select(vuln =>
    resolver.ResolveAsync(new ReachabilityResolutionRequest(
        graphHash, buildId, componentRef, vuln.Id, policyDigest, options
    ))
);

var subgraphs = await Task.WhenAll(tasks);
```

**Caveat:** Limit concurrency to avoid memory pressure (e.g., max 10 parallel extractions)

---

## 9. Error Handling & Edge Cases

### 9.1 No Reachable Paths

**Scenario:** BFS finds no paths from entry to sink.

**Action:** Return `null` (not an error, just unreachable)

**Logging:**

```csharp
_logger.LogInformation(
    "No reachable paths found for {VulnId} in {ComponentRef} (graph: {GraphHash})",
    vulnId, componentRef, graphHash
);
```

### 9.2 Entry Set Empty

**Scenario:** Entry point resolution finds no entries.

**Action:** Try the fallback strategies (Section 3.4), then fail with a warning

**Error:**

```csharp
throw new SubgraphExtractionException(
    $"Failed to resolve entry points for graph {graphHash}. " +
    "Consider configuring manual entry points in scanner config."
);
```

### 9.3 Sink Set Empty

**Scenario:** CVE-symbol mapping finds no affected symbols in the graph.

**Action:** Return `null` (CVE not applicable to this component/graph)

**Logging:**

```csharp
_logger.LogWarning(
    "No affected symbols found for {VulnId} in {ComponentRef}. " +
    "CVE may not apply to this version or symbols may be stripped.",
    vulnId, componentRef
);
```

### 9.4 Cycle Detection

**Scenario:** BFS encounters circular dependencies.

**Action:** Skip nodes already in the current path (see Section 2.2)

**Note:** Recursion and mutual recursion are common; cycles are not errors.

### 9.5 Max Depth Exceeded

**Scenario:** All paths exceed `maxDepth` without reaching a sink.

**Action:** Return `null` or a partial subgraph (configurable)

**Logging:**

```csharp
_logger.LogWarning(
    "All paths for {VulnId} exceeded max depth {MaxDepth}. " +
    "Consider increasing maxDepth or investigating graph complexity.",
    vulnId, maxDepth
);
```

---

## 10. Testing Strategy

### 10.1 Unit Tests

**File:** `SubgraphExtractorTests.cs`

**Coverage:**

- Single path extraction (happy path)
- Multiple paths with pruning
- Max depth limiting
- Guard predicate extraction
- Deterministic ordering
- Entry/sink resolution
- No reachable paths (null return)
- Cycle handling

### 10.2 Golden Fixtures

**Directory:** `tests/Reachability/Subgraph/Fixtures/`

**Fixtures:**

| Fixture | Description | Expected Output |
|---------|-------------|-----------------|
| `log4j-cve-2021-44228.json` | Log4j RCE with 3 paths | 3 paths, 8 nodes, 12 edges |
| `stripped-binary-c.json` | C/C++ stripped binary | 1 path with code_id nodes |
| `guarded-path-dotnet.json` | .NET with feature flags | 2 paths, guards on edges |
| `no-path.json` | Unreachable vulnerability | null (no paths) |
| `large-graph.json` | 10K nodes, 50K edges | 5 paths (pruned), < 2s |

### 10.3 Determinism Tests

**Objective:** Verify that the same inputs produce the same subgraph hash

```csharp
[Theory]
[InlineData("log4j-cve-2021-44228.json")]
[InlineData("stripped-binary-c.json")]
public async Task ExtractSubgraph_WithSameInputs_ProducesSameHash(string fixture)
{
    var graph = LoadFixture(fixture);

    var sg1 = await _extractor.ExtractAsync(graph, entrySet, sinkSet, options);
    var sg2 = await _extractor.ExtractAsync(graph, entrySet, sinkSet, options);

    var hash1 = ComputeBlake3(sg1);
    var hash2 = ComputeBlake3(sg2);

    Assert.Equal(hash1, hash2);
}
```

---

## 11. Future Enhancements

### 11.1 Dynamic Dispatch Resolution

**Challenge:** Virtual method calls, interface dispatch, reflection

**Proposal:** Use runtime traces to resolve ambiguous edges

**Impact:** More accurate paths for OOP languages (Java, C#, C++)

### 11.2 Inter-Procedural Analysis

**Challenge:** Calls across compilation units, shared libraries

**Proposal:** Link graphs from multiple artifacts (container layers)

**Impact:** Detect cross-component vulnerabilities

### 11.3 Path Ranking with ML

**Challenge:** Which paths matter most to auditors?

**Proposal:** Train a model on auditor feedback (path selections, ignores)

**Impact:** Prioritize the most relevant paths in PoE

### 11.4 Guard Evidence Linking

**Challenge:** Guards without clear evidence (feature flag states unknown)

**Proposal:** Link to runtime configuration snapshots or policy documents

**Impact:** Stronger PoE claims with verifiable guard states

---

## 12. Cross-References

- **Sprint:** `docs/implplan/SPRINT_3500_0001_0001_proof_of_exposure_mvp.md`
- **Advisory:** `docs/product-advisories/23-Dec-2026 - Binary Mapping as Attestable Proof.md`
- **Reachability Docs:** `docs/reachability/function-level-evidence.md`, `docs/reachability/lattice.md`
- **EntryTrace:** `docs/modules/scanner/operations/entrypoint-static-analysis.md`
- **CVE Mapping:** `docs/reachability/cve-symbol-mapping.md`

---

_Last updated: 2025-12-23. See Sprint 3500.0001.0001 for implementation plan._