# Subgraph Extraction for Proof of Exposure

_Last updated: 2025-12-23. Owner: Scanner Guild._

This document specifies the algorithm and implementation strategy for extracting minimal reachability subgraphs from richgraph-v1 documents. These subgraphs power Proof of Exposure (PoE) artifacts that provide compact, offline-verifiable evidence of vulnerability reachability.

---

## 1. Overview

### 1.1 Purpose

Given a richgraph-v1 call graph and a specific CVE, extract a **minimal subgraph** containing:
- All call paths from **entry points** (HTTP handlers, CLI commands, cron jobs) to **vulnerable sinks** (CVE-affected functions)
- Only the nodes and edges that participate in reachability
- Guard predicates (feature flags, platform conditionals) for auditor evaluation

### 1.2 Inputs

| Input | Type | Source | Example |
|-------|------|--------|---------|
| `graph_hash` | `string` | Scanner output | `blake3:a1b2c3d4e5f6...` |
| `build_id` | `string` | ELF/PE/image digest | `gnu-build-id:5f0c7c3c...` |
| `component_ref` | `string` | PURL or SBOM ref | `pkg:maven/org.apache.logging.log4j/log4j-core@2.14.1` |
| `vuln_id` | `string` | CVE identifier | `CVE-2021-44228` |
| `policy_digest` | `string` | Policy version hash | `sha256:abc123...` |
| `options` | `ResolverOptions` | Configuration | `{maxDepth: 10, maxPaths: 5}` |

### 1.3 Outputs

| Output | Type | Description |
|--------|------|-------------|
| `Subgraph` | Record | Minimal subgraph with nodes, edges, entry/sink refs |
| `null` | — | Returned when no reachable paths exist |

### 1.4 Key Properties

- **Deterministic**: Same inputs always produce same subgraph (stable ordering, reproducible hashes)
- **Minimal**: Only nodes/edges participating in entry→sink paths
- **Bounded**: Respects `maxDepth` and `maxPaths` limits
- **Auditable**: Includes guard predicates and confidence scores

---

## 2. Algorithm Design

### 2.1 High-Level Flow

```
┌─────────────────────────────────────────────────────────────────┐
│                   Subgraph Extraction Pipeline                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. Load richgraph-v1 from CAS                                 │
│     ↓                                                           │
│  2. Resolve Entry Set (EntryTrace + Framework Adapters)        │
│     ↓                                                           │
│  3. Resolve Sink Set (CVE→Symbol Mapping)                      │
│     ↓                                                           │
│  4. Run Bounded BFS (Entry → Sink, maxDepth, maxPaths)         │
│     ↓                                                           │
│  5. Prune Paths (Shortest + Highest Confidence)                │
│     ↓                                                           │
│  6. Extract Subgraph (Nodes + Edges from Selected Paths)       │
│     ↓                                                           │
│  7. Normalize & Sort (Deterministic Ordering)                  │
│     ↓                                                           │
│  8. Build Subgraph Record with Metadata                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
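
Step 6 of the pipeline can be sketched as follows. This is a minimal illustration, assuming each selected path is a list of node identifiers; `extract_subgraph` is a hypothetical helper name, not part of the Scanner API.

```python
def extract_subgraph(paths):
    """Collect the nodes and edges that appear on the selected paths.

    `paths` is a list of node-id lists, e.g. [["main", "handler", "sink"]].
    Returns (nodes, edges) as sorted lists so the result does not depend
    on the order in which paths were discovered.
    """
    nodes = set()
    edges = set()
    for path in paths:
        nodes.update(path)
        # Consecutive node pairs along a path are the call edges it uses.
        edges.update(zip(path, path[1:]))
    return sorted(nodes), sorted(edges)


nodes, edges = extract_subgraph([
    ["main", "handle", "lookup"],
    ["main", "log", "lookup"],
])
```

Using sets means a node or edge shared by several paths is emitted once, which is what keeps the subgraph minimal.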

### 2.2 Bounded BFS Algorithm

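**Objective:** Find up to `maxPaths` simple (cycle-free) paths from the entry set to the sink set, each at most `maxDepth` hops long. Because BFS dequeues paths in order of non-decreasing length, the shortest paths are found first.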

**Pseudocode:**
```python
from collections import deque


def bounded_bfs(graph, entry_set, sink_set, max_depth, max_paths):
    """Return up to max_paths cycle-free entry→sink paths, shortest first."""
    paths = []
    # Sort entries so the search order (and thus the result) is deterministic.
    queue = deque((entry, [entry], 0) for entry in sorted(entry_set))

    while queue and len(paths) < max_paths:
        current, path, depth = queue.popleft()  # O(1), unlike list.pop(0)

        # Found a sink node: record the path and do not expand past it
        if current in sink_set:
            paths.append(path)
            continue

        # Max depth reached
        if depth >= max_depth:
            continue

        # Explore neighbors
        for edge in graph.edges_from(current):
            neighbor = edge.to

            # Avoid cycles: skip nodes already on the current path
            if neighbor in path:
                continue

            queue.append((neighbor, path + [neighbor], depth + 1))

    return paths
```

**Optimizations:**
1. **Early termination**: Stop when `max_paths` found
2. **Cycle detection**: Skip nodes already in current path
3. **Confidence pruning**: Deprioritize low-confidence edges (confidence < 0.5); not shown in the pseudocode above
4. **Runtime prioritization**: Favor runtime-observed edges when available; not shown in the pseudocode above
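
Optimizations 3 and 4 can be approximated by ordering a node's outgoing edges before they are enqueued. The sketch below is one possible interpretation, not the Scanner implementation: the `Edge` fields `confidence` and `runtime_observed` and the helper `prioritized_edges` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    to: str
    confidence: float = 1.0          # assumed field name
    runtime_observed: bool = False   # assumed field name


def prioritized_edges(edges, min_confidence=0.5):
    """Order outgoing edges so the most trustworthy are explored first.

    Sort key: runtime-observed edges first, then edges at or above the
    confidence threshold, then higher confidence, then target name as a
    deterministic tie-breaker.
    """
    return sorted(
        edges,
        key=lambda e: (
            not e.runtime_observed,
            e.confidence < min_confidence,
            -e.confidence,
            e.to,
        ),
    )


out = prioritized_edges([
    Edge("c", confidence=0.4),
    Edge("b", confidence=0.9),
    Edge("a", confidence=0.6, runtime_observed=True),
])
order = [e.to for e in out]
```

Low-confidence edges are moved to the back rather than dropped, which matches "deprioritize" in the list above: they are still explored if the path budget allows.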

### 2.3 Path Pruning Strategy

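When the search yields more than `max_paths` candidate paths, prune to the best candidates. The pseudocode in Section 2.2 stops as soon as `max_paths` paths are collected, so pruning only takes effect if the search gathers a larger candidate pool first: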

**Scoring Formula:**
```
score = (1.0 / path_length) * avg_confidence * runtime_boost

Where:
- path_length: Number of hops
- avg_confidence: Average edge confidence
- runtime_boost: 1.5 if any edge is runtime-observed, else 1.0
```

**Selection Algorithm:**
1. Compute score for all paths
2. Sort by score (descending)
3. Take top `max_paths`
4. Always include shortest path (even if below cutoff)
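
The scoring formula and selection steps above can be sketched as follows. Two points are assumptions rather than part of the specification: a zero-hop path (an entry that is itself a sink) is given a score of 1.0 to avoid dividing by zero, and the shortest path replaces the lowest-ranked pick so that the result never exceeds `max_paths`. The function names are hypothetical.

```python
def score_path(confidences, runtime_flags):
    """Score one path from its per-edge confidences and runtime flags."""
    hops = len(confidences)
    if hops == 0:
        # Assumption: an entry that is itself a sink gets a fixed score.
        return 1.0
    avg_confidence = sum(confidences) / hops
    runtime_boost = 1.5 if any(runtime_flags) else 1.0
    return (1.0 / hops) * avg_confidence * runtime_boost


def select_paths(candidates, max_paths):
    """Pick the best paths; candidates are (path, confidences, runtime_flags)."""
    if not candidates or max_paths < 1:
        return []
    ranked = sorted(
        candidates,
        # Highest score first; the path itself breaks ties deterministically.
        key=lambda c: (-score_path(c[1], c[2]), c[0]),
    )
    selected = ranked[:max_paths]
    shortest = min(candidates, key=lambda c: (len(c[0]), c[0]))
    if shortest not in selected:
        # Assumption: the shortest path displaces the lowest-ranked pick.
        selected[-1] = shortest
    return [c[0] for c in selected]


candidates = [
    (["e", "s"], [0.2], [False]),
    (["e", "a", "s"], [0.9, 0.9], [True, False]),
    (["e", "b", "s"], [0.8, 0.8], [False, False]),
]
chosen = select_paths(candidates, 2)
```

In this example the direct path `e → s` scores lowest because of its 0.2 confidence, yet it is kept as the shortest path and displaces `e → b → s`.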

### 2.4 Deterministic Ordering

To ensure reproducible hashes, all arrays must be sorted deterministically:

**Node Ordering:**
```csharp
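// Ordinal comparison avoids culture-dependent string ordering.
nodes = nodes.OrderBy(n => n.Symbol, StringComparer.Ordinal)
              .ThenBy(n => n.ModuleHash, StringComparer.Ordinal)
              .ThenBy(n => n.Addr)
              .ToArray();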
```

**Edge Ordering:**
```csharp
edges = edges.OrderBy(e => e.Caller.Symbol, StringComparer.Ordinal)
              .ThenBy(e => e.Callee.Symbol, StringComparer.Ordinal)
              .ToArray();
```

**Guard Ordering:**
```csharp
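// Positional record properties are init-only, so build sorted copies with `with`.
edges = edges.Select(e => e with { Guards = e.Guards.OrderBy(g => g, StringComparer.Ordinal).ToArray() })
             .ToArray();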
```
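
To illustrate why the ordering matters, the sketch below canonicalizes a subgraph and hashes it, so that two inputs differing only in array order yield the same digest. It is an illustration under assumptions: the dictionary field names are invented for the example, and SHA-256 from the standard library stands in for BLAKE3.

```python
import hashlib
import json


def canonical_subgraph_hash(nodes, edges):
    """Hash a subgraph so that input ordering does not affect the digest.

    nodes: list of dicts with "symbol", "module_hash", "addr"
    edges: list of dicts with "caller", "callee", "guards"
    """
    ordered_nodes = sorted(
        nodes, key=lambda n: (n["symbol"], n["module_hash"], n["addr"])
    )
    ordered_edges = sorted(
        # Sort each edge's guards first, then sort the edges themselves.
        ({**e, "guards": sorted(e["guards"])} for e in edges),
        key=lambda e: (e["caller"], e["callee"], e["guards"]),
    )
    payload = json.dumps(
        {"nodes": ordered_nodes, "edges": ordered_edges},
        sort_keys=True,
        separators=(",", ":"),
    )
    # SHA-256 stands in for BLAKE3, which is not in the standard library.
    return "sha256:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Python compares strings by code point, which corresponds to ordinal comparison; a culture-aware comparison could order the same symbols differently on different hosts.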

---

## 3. Entry Set Resolution

### 3.1 Strategy

Entry points are where execution begins. We identify them through:

1. **Semantic EntryTrace Analysis**: HTTP handlers, gRPC endpoints, CLI commands
2. **Framework Adapters**: Spring Boot `@RequestMapping`, ASP.NET `[HttpGet]`, etc.
3. **Synthetic Roots**: ELF `.init_array` and `.preinit_array`, constructors, PE TLS callbacks
4. **Manual Configuration**: User-specified entry points in scanner config

### 3.2 Entry Point Types

| Type | Detection Method | Example Symbol |
|------|------------------|----------------|
| HTTP Handler | Framework attribute scan | `UserController.GetById(int)` |
| gRPC Endpoint | Protobuf service definition | `GreeterService.SayHello(Request)` |
| CLI Command | `Main()` or command-line parser | `Program.Main(string[])` |
| Scheduled Job | Cron/timer attribute | `BackgroundWorker.ProcessQueue()` |
| Init Section | ELF `.init_array` | `__libc_csu_init` |
| Message Handler | Message queue consumer | `KafkaConsumer.OnMessage(Message)` |

### 3.3 EntryTrace Integration

**Existing Module:** `StellaOps.Scanner.EntryTrace`

**API:**
```csharp
public interface IEntryPointResolver
{
    Task<EntryPointSet> ResolveAsync(
        RichGraphV1 graph,
        BuildContext context,
        CancellationToken cancellationToken = default
    );
}

public record EntryPointSet(
    IReadOnlyList<EntryPoint> Points,
    EntryPointIntent Intent,  // WebServer, Worker, CliTool, etc.
    double Confidence
);

public record EntryPoint(
    string SymbolId,
    string Display,
    EntryPointType Type,  // HTTP, GRPC, CLI, Scheduled, etc.
    string? FrameworkHint  // "Spring Boot", "ASP.NET Core", etc.
);
```

### 3.4 Fallback Strategy

If no entry points are detected, try the following in order:
1. Use all nodes with `in-degree == 0` (no callers)
2. Use `main()` or equivalent language entry point
3. Use synthetic roots (`.init_array`, constructors)
4. **Fail with warning** if none found (manual configuration required)
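
The fallback chain above can be sketched as a list of strategies tried in order. This is an illustration only: `fallback_entries`, its parameters, and the strategy labels are hypothetical, and the graph is reduced to node ids and `(caller, callee)` pairs.

```python
def fallback_entries(nodes, edges, language_mains=("main",), synthetic_roots=()):
    """Return (strategy, entries) using the first fallback that yields nodes.

    nodes: iterable of node ids; edges: iterable of (caller, callee) pairs.
    """
    node_set = set(nodes)
    called = {callee for _, callee in edges}
    strategies = [
        ("zero-in-degree", node_set - called),
        ("language-main", node_set & set(language_mains)),
        ("synthetic-roots", node_set & set(synthetic_roots)),
    ]
    for name, found in strategies:
        if found:
            # Sorted output keeps the entry set deterministic.
            return name, sorted(found)
    raise LookupError("no entry points found; manual configuration required")
```

A fully cyclic graph shows why the later fallbacks exist: if `main` and `loop` call each other, no node has in-degree zero, and the language entry point is the first strategy that produces a result.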

---

## 4. Sink Set Resolution

### 4.1 Strategy

Sinks are vulnerable functions identified by CVE-to-symbol mapping.

**Data Source:** `IVulnSurfaceService` (see `docs/reachability/cve-symbol-mapping.md`)

### 4.2 CVE→Symbol Mapping Flow

```
CVE-2021-44228 →
  Advisory Linksets →
    Patch Diff Analysis →
      Affected Symbols:
        - pkg:maven/org.apache.logging.log4j/log4j-core@2.14.1:org.apache.logging.log4j.core.lookup.JndiLookup.lookup(LogEvent, String)
        - pkg:maven/org.apache.logging.log4j/log4j-core@2.14.1:org.apache.logging.log4j.core.net.JndiManager.lookup(String)
```

### 4.3 Sink Resolution API

```csharp
public interface IVulnSurfaceService
{
    Task<IReadOnlyList<AffectedSymbol>> GetAffectedSymbolsAsync(
        string vulnId,
        string componentRef,
        CancellationToken cancellationToken = default
    );
}

public record AffectedSymbol(
    string SymbolId,
    string MethodKey,
    string Display,
    ChangeType ChangeType,  // Added, Modified, Deleted
    double Confidence
);
```

### 4.4 Sink Matching in Graph

**Exact Match (Preferred):**
```csharp
var sinkNodes = graph.Nodes
    .Where(n => affectedSymbols.Any(s => s.SymbolId == n.SymbolId))
    .ToList();
```

**Fuzzy Match (Fallback for Stripped Binaries):**
```csharp
var sinkNodes = graph.Nodes
    .Where(n => affectedSymbols.Any(s => FuzzyMatch(s, n)))
    .ToList();

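bool FuzzyMatch(AffectedSymbol symbol, GraphNode node)
{
    // Match by method signature, demangled name, or code_id.
    // Empty display names are skipped: string.Contains("") is always true
    // and would turn every unnamed (stripped) node into a sink.
    // NOTE: CodeId is assumed here; AffectedSymbol (Section 4.3) does not declare it.
    return (!string.IsNullOrEmpty(node.Display) && symbol.Display.Contains(node.Display)) ||
           symbol.MethodKey == node.MethodKey ||
           (symbol.CodeId != null && symbol.CodeId == node.CodeId);
}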
```

---

## 5. Guard Predicate Handling

### 5.1 Guard Types

Guards are conditions that control edge reachability:

| Guard Type | Example | Representation |
|------------|---------|----------------|
| Feature Flag | `if (featureFlags.darkMode)` | `feature:dark-mode` |
| Platform | `#ifdef _WIN32` | `platform:windows` |
| Build Tag | `//go:build linux` | `build:linux` |
| Configuration | `if (config.enableCache)` | `config:enable-cache` |
| Runtime Check | `if (user.isAdmin())` | `runtime:admin-check` |

### 5.2 Guard Extraction

**Source-Level (Preferred):**
- Parse AST for conditional blocks around call sites
- Extract predicate expressions
- Normalize to guard format (e.g., `feature:dark-mode`)

**Binary-Level (Fallback):**
- Identify branch instructions (`je`, `jne`, `cbz`, etc.)
- Link to preceding comparison/test instructions
- Heuristic: Flag as `guard:unknown-condition`

### 5.3 Guard Propagation

Guards propagate through call chains:

```
Entry: main()
  ↓ (no guards)
Edge: main() → processRequest()
  ↓ (guard: feature:dark-mode)
Edge: processRequest() → themeService.apply()
  ↓ (inherited guard: feature:dark-mode)
Sink: themeService.apply()
```

**Rule:** If any edge in a path carries guards, every downstream edge on that path inherits them.
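
The inheritance rule can be sketched as a single pass along a path that accumulates guards. This is a minimal illustration, assuming edges are `(caller, callee, guards)` tuples in call order; `propagate_guards` is a hypothetical helper.

```python
def propagate_guards(path_edges):
    """Accumulate guards along a path.

    path_edges: list of (caller, callee, guards) in call order.
    Returns the same edges, each with its own guards plus every guard
    seen on earlier edges of the path, sorted for determinism.
    """
    inherited = set()
    result = []
    for caller, callee, guards in path_edges:
        inherited |= set(guards)
        result.append((caller, callee, sorted(inherited)))
    return result


effective = propagate_guards([
    ("main", "processRequest", ["feature:dark-mode"]),
    ("processRequest", "themeService.apply", []),
])
```

Here the second edge declares no guards of its own but ends up with `feature:dark-mode`, because reaching it requires passing the guarded first edge.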

### 5.4 Guard Metadata in Subgraph

```csharp
public record Edge(
    FunctionId Caller,
    FunctionId Callee,
    string[] Guards  // ["feature:dark-mode", "platform:linux"]
);
```

---

## 6. BuildID Propagation

### 6.1 BuildID Sources

| Binary Format | BuildID Field | Example |
|---------------|---------------|---------|
| ELF | `.note.gnu.build-id` | `5f0c7c3c4d5e6f7a8b9c0d1e2f3a4b5c` |
| PE (Windows) | PDB GUID + Age | `{12345678-1234-5678-1234-567812345678}-1` |
| Mach-O (macOS) | LC_UUID | `12345678-1234-5678-1234-567812345678` |
| Container Image | Image Digest | `sha256:abc123...` |

### 6.2 Extraction Logic

**Priority:**
1. ELF Build-ID (if present)
2. PE PDB GUID (if present)
3. Mach-O UUID (if present)
4. Container image digest (fallback)
5. File SHA-256 (last resort)

**Format:**
```csharp
string buildId = format switch
{
    "elf" => $"gnu-build-id:{ExtractElfBuildId(binary)}",
    "pe" => $"pe-pdb-guid:{ExtractPePdbGuid(binary)}",
    "macho" => $"macho-uuid:{ExtractMachoUuid(binary)}",
    "oci" => $"oci-digest:{imageDigest}",
    _ => $"file-sha256:{ComputeSha256(binary)}"
};
```

### 6.3 BuildID in Subgraph

```csharp
public record Subgraph(
    string BuildId,  // "gnu-build-id:5f0c7c3c..."
    // ... other fields
);
```

**Verification Use Case:** Auditors can match `BuildId` against an image digest or binary hash to confirm that the PoE applies to a specific build.
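
A verifier-side check might look like the sketch below. It assumes the prefixed format from Section 6.2; the helper names are hypothetical, and comparing the value part case-insensitively is an assumption that suits hex digests but may not suit every scheme.

```python
KNOWN_SCHEMES = (
    "gnu-build-id", "pe-pdb-guid", "macho-uuid", "oci-digest", "file-sha256",
)


def parse_build_id(build_id):
    """Split a prefixed BuildId into (scheme, value).

    Only the first colon separates scheme from value, so an OCI digest
    such as "oci-digest:sha256:abc" keeps its inner "sha256:" prefix.
    """
    scheme, sep, value = build_id.partition(":")
    if not sep or not value or scheme not in KNOWN_SCHEMES:
        raise ValueError(f"unrecognized BuildId: {build_id!r}")
    return scheme, value


def same_build(poe_build_id, observed_build_id):
    """Compare two BuildIds, ignoring letter case in the value part."""
    a_scheme, a_value = parse_build_id(poe_build_id)
    b_scheme, b_value = parse_build_id(observed_build_id)
    return a_scheme == b_scheme and a_value.lower() == b_value.lower()
```

Requiring the schemes to match prevents a file hash from being mistaken for a build-id that happens to share the same hex string.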

---

## 7. Integration with Existing Modules

### 7.1 Module Dependencies

```
SubgraphExtractor
  ├─> IRichGraphStore (fetch richgraph-v1 from CAS)
  ├─> IEntryPointResolver (EntryTrace module)
  ├─> IVulnSurfaceService (CVE-symbol mapping)
  ├─> IBinaryFeatureExtractor (BuildID extraction)
  └─> ILogger<SubgraphExtractor>
```

### 7.2 Dependency Injection Setup

```csharp
// Startup.cs or ServiceCollectionExtensions.cs
services.AddScoped<IReachabilityResolver, ReachabilityResolver>();
services.AddScoped<ISubgraphExtractor, SubgraphExtractor>();
services.AddScoped<IEntryPointResolver, EntryPointResolver>();
services.AddScoped<IVulnSurfaceService, VulnSurfaceService>();
services.AddScoped<IBinaryFeatureExtractor, BinaryFeatureExtractor>();
```

### 7.3 Configuration

**File:** `etc/scanner.yaml`

```yaml
reachability:
  subgraphExtraction:
    maxDepth: 10
    maxPaths: 5
    includeGuards: true
    requireRuntimeConfirmation: false

    # Entry point resolution
    entryPoints:
      enableFrameworkAdapters: true
      enableSyntheticRoots: true
      fallbackToZeroInDegree: true
      manualEntries: []  # Optional: ["com.example.Main.main()"]

    # Sink resolution
    sinks:
      usePatchDiffs: true
      useAdvisoryLinksets: true
      fuzzyMatchConfidenceThreshold: 0.6

    # Guard extraction
    guards:
      enabled: true
      sourceLevel: true
      binaryLevel: false  # Experimental
      normalizePredicates: true
```

---

## 8. Performance Considerations

### 8.1 Graph Size Limits

| Graph Size | Max Depth | Max Paths | Expected Time |
|------------|-----------|-----------|---------------|
| Small (< 1K nodes) | 15 | 10 | < 100ms |
| Medium (1K-10K nodes) | 12 | 5 | < 500ms |
| Large (10K-100K nodes) | 10 | 3 | < 2s |
| Huge (> 100K nodes) | 8 | 1 | < 5s |
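
The table can be read as a lookup from node count to search limits, sketched below. The table's ranges share their endpoints, so the boundary handling here is an assumption: exactly 10K nodes is treated as Large and exactly 100K as Large. `limits_for` is a hypothetical helper.

```python
def limits_for(node_count):
    """Map a node count to (max_depth, max_paths) using the table's tiers."""
    if node_count < 1_000:
        return 15, 10       # Small
    if node_count < 10_000:
        return 12, 5        # Medium
    if node_count <= 100_000:
        return 10, 3        # Large (assumption: includes 10K and 100K)
    return 8, 1             # Huge
```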

### 8.2 Caching Strategy

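**Cache Key:** `(graph_hash, vuln_id, component_ref, policy_digest)`. If `options` (`maxDepth`, `maxPaths`) can vary between requests, include them in the key as well, since they change the extracted subgraph.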

**Cache Location:** In-memory (LRU cache, max 100 entries) or Redis

**TTL:** 1 hour. Subgraphs are deterministic for a given cache key, so a longer TTL is safe where memory allows.
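
One way to derive a stable key from these inputs is sketched below. It is an illustration, not the Scanner's cache implementation: `subgraph_cache_key` is a hypothetical name, and folding `options` into the key is an assumption made so that different `maxDepth` or `maxPaths` values do not share an entry.

```python
import hashlib
import json


def subgraph_cache_key(graph_hash, vuln_id, component_ref, policy_digest, options=None):
    """Build a stable cache key from the extraction inputs."""
    # A JSON array keeps field boundaries unambiguous; sort_keys makes the
    # options dict independent of insertion order.
    payload = json.dumps(
        [graph_hash, vuln_id, component_ref, policy_digest, options or {}],
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Serializing the fields as a JSON array rather than joining them with a separator avoids collisions when a field value itself contains the separator character.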

### 8.3 Parallelization

**Opportunity:** Extract subgraphs for multiple CVEs in parallel

```csharp
var tasks = vulnerabilities.Select(vuln =>
    resolver.ResolveAsync(new ReachabilityResolutionRequest(
        graphHash, buildId, componentRef, vuln.Id, policyDigest, options
    ))
);

var subgraphs = await Task.WhenAll(tasks);
```

**Caveat:** Limit concurrency to avoid memory pressure (e.g., max 10 parallel extractions)
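
The concurrency cap can be sketched with a semaphore. The example uses Python's `asyncio` to show the pattern; `extract_all` and `extract_one` are hypothetical names, and the default limit of 10 simply mirrors the example figure in the caveat.

```python
import asyncio


async def extract_all(vuln_ids, extract_one, max_parallel=10):
    """Run `extract_one(vuln_id)` for every id, at most `max_parallel` at once."""
    gate = asyncio.Semaphore(max_parallel)

    async def guarded(vuln_id):
        # Each extraction waits here until a slot is free.
        async with gate:
            return await extract_one(vuln_id)

    # gather returns results in the same order as the input ids.
    return await asyncio.gather(*(guarded(v) for v in vuln_ids))


async def _demo():
    active = 0
    peak = 0

    async def fake_extract(vuln_id):
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0)  # yield so other tasks get a chance to run
        active -= 1
        return f"subgraph:{vuln_id}"

    ids = ["CVE-1", "CVE-2", "CVE-3", "CVE-4", "CVE-5"]
    results = await extract_all(ids, fake_extract, max_parallel=2)
    return results, peak


results, peak = asyncio.run(_demo())
```

All tasks are created up front, but only `max_parallel` of them hold a slot at any time, so peak memory is bounded by the number of concurrent extractions rather than the number of CVEs.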

---

## 9. Error Handling & Edge Cases

### 9.1 No Reachable Paths

**Scenario:** BFS finds no paths from entry to sink.

**Action:** Return `null` (not an error, just unreachable)

**Logging:**
```csharp
_logger.LogInformation(
    "No reachable paths found for {VulnId} in {ComponentRef} (graph: {GraphHash})",
    vulnId, componentRef, graphHash
);
```

### 9.2 Entry Set Empty

**Scenario:** Entry point resolution finds no entries.

**Action:** Try fallback strategies (Section 3.4), then fail with warning

**Error:**
```csharp
throw new SubgraphExtractionException(
    $"Failed to resolve entry points for graph {graphHash}. " +
    "Consider configuring manual entry points in scanner config."
);
```

### 9.3 Sink Set Empty

**Scenario:** CVE-symbol mapping finds no affected symbols in graph.

**Action:** Return `null` (CVE not applicable to this component/graph)

**Logging:**
```csharp
_logger.LogWarning(
    "No affected symbols found for {VulnId} in {ComponentRef}. " +
    "CVE may not apply to this version or symbols may be stripped.",
    vulnId, componentRef
);
```

### 9.4 Cycle Detection

**Scenario:** BFS encounters circular dependencies.

**Action:** Skip nodes already in current path (see Section 2.2)

**Note:** Recursion and mutual recursion are common; cycles are not errors.

### 9.5 Max Depth Exceeded

**Scenario:** All paths exceed `maxDepth` without reaching sink.

**Action:** Return `null` or partial subgraph (configurable)

**Logging:**
```csharp
_logger.LogWarning(
    "All paths for {VulnId} exceeded max depth {MaxDepth}. " +
    "Consider increasing maxDepth or investigating graph complexity.",
    vulnId, maxDepth
);
```

---

## 10. Testing Strategy

### 10.1 Unit Tests

**File:** `SubgraphExtractorTests.cs`

**Coverage:**
- Single path extraction (happy path)
- Multiple paths with pruning
- Max depth limiting
- Guard predicate extraction
- Deterministic ordering
- Entry/sink resolution
- No reachable paths (null return)
- Cycle handling

### 10.2 Golden Fixtures

**Directory:** `tests/Reachability/Subgraph/Fixtures/`

**Fixtures:**
| Fixture | Description | Expected Output |
|---------|-------------|-----------------|
| `log4j-cve-2021-44228.json` | Log4j RCE with 3 paths | 3 paths, 8 nodes, 12 edges |
| `stripped-binary-c.json` | C/C++ stripped binary | 1 path with code_id nodes |
| `guarded-path-dotnet.json` | .NET with feature flags | 2 paths, guards on edges |
| `no-path.json` | Unreachable vulnerability | null (no paths) |
| `large-graph.json` | 10K nodes, 50K edges | 5 paths (pruned), < 2s |

### 10.3 Determinism Tests

**Objective:** Verify same inputs produce same subgraph hash

```csharp
[Theory]
[InlineData("log4j-cve-2021-44228.json")]
[InlineData("stripped-binary-c.json")]
public async Task ExtractSubgraph_WithSameInputs_ProducesSameHash(string fixture)
{
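    var graph = LoadFixture(fixture);
    // entrySet, sinkSet, and options are assumed to be set up per fixture (e.g., as test-class fields).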

    var sg1 = await _extractor.ExtractAsync(graph, entrySet, sinkSet, options);
    var sg2 = await _extractor.ExtractAsync(graph, entrySet, sinkSet, options);

    var hash1 = ComputeBlake3(sg1);
    var hash2 = ComputeBlake3(sg2);

    Assert.Equal(hash1, hash2);
}
```

---

## 11. Future Enhancements

### 11.1 Dynamic Dispatch Resolution

**Challenge:** Virtual method calls, interface dispatch, reflection

**Proposal:** Use runtime traces to resolve ambiguous edges

**Impact:** More accurate paths for OOP languages (Java, C#, C++)

### 11.2 Inter-Procedural Analysis

**Challenge:** Calls across compilation units, shared libraries

**Proposal:** Link graphs from multiple artifacts (container layers)

**Impact:** Detect cross-component vulnerabilities

### 11.3 Path Ranking with ML

**Challenge:** Which paths matter most to auditors?

**Proposal:** Train model on auditor feedback (path selections, ignores)

**Impact:** Prioritize most relevant paths in PoE

### 11.4 Guard Evidence Linking

**Challenge:** Guards without clear evidence (feature flag states unknown)

**Proposal:** Link to runtime configuration snapshots or policy documents

**Impact:** Stronger PoE claims with verifiable guard states

---

## 12. Cross-References

- **Sprint:** `docs/implplan/SPRINT_3500_0001_0001_proof_of_exposure_mvp.md`
- **Advisory:** `docs/product-advisories/23-Dec-2026 - Binary Mapping as Attestable Proof.md`
- **Reachability Docs:** `docs/reachability/function-level-evidence.md`, `docs/reachability/lattice.md`
- **EntryTrace:** `docs/modules/scanner/operations/entrypoint-static-analysis.md`
- **CVE Mapping:** `docs/reachability/cve-symbol-mapping.md`

---

_See Sprint 3500.0001.0001 for implementation plan._