Here’s a quick, plain‑English idea you can use right away: not all code diffs are equal—some actually change what’s reachable at runtime (and thus security posture), while others just refactor internals. A “Smart‑Diff” pipeline flags only the diffs that open or close attack paths by combining (1) call‑stack traces, (2) dependency graphs, and (3) dataflow.
Why this matters (background)
- Text diffs ≠ behavior diffs. A rename or refactor can look big in Git but do nothing to reachable flows from external entry points (HTTP, gRPC, CLI, message consumers).
- Security triage gets noisy because scanners attach CVEs to all present packages, not to the code paths you can actually hit.
- Dataflow‑aware diffs shrink noise and make VEX generation honest: “vuln present but not exploitable because the sink is unreachable from any policy‑defined entrypoint.”
Minimal architecture (fits Stella Ops)
- Entrypoint map (per service): controllers, handlers, consumers.
- Call graph + dataflow (per commit): Roslyn for C#, golang.org/x/tools/go/callgraph for Go, plus taint rules (source→sink).
- Reachability cache keyed by (commit, entrypoint, package@version).
- Smart‑Diff = reachable_paths(commit_B) – reachable_paths(commit_A).
  - If a path to a sensitive sink is newly reachable → High.
  - If a path disappears → auto‑generate VEX “not affected (no reachable path)”.
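The Smart‑Diff definition above is a plain set difference, which can be sketched in Go as follows. This is a minimal illustration, not the product's implementation: the `PathSet` type, the `Minus` helper and the "entrypoint -> sink" string encoding are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// PathSet holds reachable paths encoded as "entrypoint -> sink" strings
// (an assumed encoding, chosen only for this sketch).
type PathSet map[string]struct{}

// NewPathSet builds a set from a list of path strings.
func NewPathSet(paths ...string) PathSet {
	s := make(PathSet, len(paths))
	for _, p := range paths {
		s[p] = struct{}{}
	}
	return s
}

// Minus returns the sorted elements of a that are absent from b.
func Minus(a, b PathSet) []string {
	var out []string
	for p := range a {
		if _, ok := b[p]; !ok {
			out = append(out, p)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	base := NewPathSet("GET /health -> Log.Write", "POST /export -> File.Write")
	head := NewPathSet("GET /health -> Log.Write", "POST /run -> Process.Start")

	// Newly reachable paths raise risk; vanished paths are VEX candidates.
	fmt.Println("newly reachable:", Minus(head, base))
	fmt.Println("no longer reachable:", Minus(base, head))
}
```

Sorting the result keeps the output deterministic, which matters once the diff feeds a CI gate.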
Tiny working seeds
C# (.NET 10) — Roslyn skeleton to diff call‑reachability
// SmartDiff.csproj targets net10.0
// Needs the Microsoft.CodeAnalysis.CSharp.Workspaces and Microsoft.CodeAnalysis.Workspaces.MSBuild packages.
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.FindSymbols;
using Microsoft.CodeAnalysis.MSBuild;

public static class SmartDiff
{
    public static async Task<HashSet<string>> ReachableSinks(string solutionPath, string[] entrypoints, string[] sinks)
    {
        // Older Roslyn versions need MSBuildLocator.RegisterDefaults() to be called before this point.
        using var workspace = MSBuildWorkspace.Create();
        var solution = await workspace.OpenSolutionAsync(solutionPath);
        var index = new HashSet<string>();
        foreach (var proj in solution.Projects)
        {
            var comp = await proj.GetCompilationAsync();
            if (comp is null) continue;
            // Resolve entrypoints & sinks by display name (walks referenced assemblies too, so this is slow)
            var epSymbols = comp.GlobalNamespace.GetMembers().SelectMany(Descend)
                .OfType<IMethodSymbol>().Where(m => entrypoints.Contains(m.ToDisplayString())).ToList();
            var sinkSymbols = comp.GlobalNamespace.GetMembers().SelectMany(Descend)
                .OfType<IMethodSymbol>().Where(m => sinks.Contains(m.ToDisplayString())).ToList();
            foreach (var ep in epSymbols)
                foreach (var sink in sinkSymbols)
                {
                    // Placeholder heuristic: only checks that the sink is referenced anywhere in the
                    // solution. It does not yet prove a path from ep to sink.
                    var refs = await SymbolFinder.FindReferencesAsync(sink, solution);
                    if (refs.SelectMany(r => r.Locations).Any()) // replace with real graph walk
                        index.Add($"{ep.ToDisplayString()} -> {sink.ToDisplayString()}");
                }
        }
        return index;

        // Recursively yields every member below a namespace or type.
        static IEnumerable<ISymbol> Descend(INamespaceOrTypeSymbol sym)
        {
            foreach (var m in sym.GetMembers())
            {
                yield return m;
                if (m is INamespaceOrTypeSymbol nt) foreach (var x in Descend(nt)) yield return x;
            }
        }
    }
}
Go — SSA & callgraph seed
// go.mod: require golang.org/x/tools latest
package main

import (
	"fmt"
	"log"

	"golang.org/x/tools/go/callgraph/cha"
	"golang.org/x/tools/go/packages"
	"golang.org/x/tools/go/ssa"
	"golang.org/x/tools/go/ssa/ssautil"
)

func main() {
	cfg := &packages.Config{Mode: packages.LoadAllSyntax, Tests: false}
	pkgs, err := packages.Load(cfg, "./...")
	if err != nil {
		log.Fatal(err)
	}
	if packages.PrintErrors(pkgs) > 0 {
		log.Fatal("packages contain errors")
	}
	// ssautil.AllPackages creates SSA packages for the loaded packages and their dependencies.
	prog, _ := ssautil.AllPackages(pkgs, ssa.BuilderMode(0))
	prog.Build()
	cg := cha.CallGraph(prog)
	// TODO: map entrypoints & sinks, then walk cg from EPs to sinks
	fmt.Println("nodes:", len(cg.Nodes))
}
How to use it in your pipeline (fast win)
- Pre‑merge job:
  - Build call graph for HEAD and HEAD^.
  - Compute Smart‑Diff.
  - If any new EP→sink path appears, fail with a short, proof‑linked note: “New reachable path: POST /Invoices -> PdfExporter.Save(string path) (writes outside sandbox).”
- Post‑scan VEX:
  - For each CVE on a package, mark Affected only if any EP can reach a symbol that uses that package’s vulnerable surface.
Evidence to show in the UI
- “Path card”: EP → … → Sink, with file:line hop‑list and commit hash.
- “What changed”: before/after path diff (green removed, red added).
- “Why it matters”: sink classification (network write, file write, deserialization, SQL, crypto).
Developer checklist (Stella Ops style)
- Define entrypoints per service (attribute or YAML).
- Define sink taxonomy (FS, NET, DESER, SQL, CRYPTO).
- Implement language adapters: .NET (Roslyn), Go (SSA), later Java (Soot/WALA).
- Add a ReachabilityCache (Postgres table keyed by commit+lang+service).
- Wire a SmartDiffJob in CI; emit SARIF + CycloneDX vulnerability-assertions extension or OpenVEX.
- Gate merges on newly‑reachable sensitive sinks; auto‑VEX when paths disappear.
If you want, I can turn this into a small repo scaffold (Roslyn + Go adapters, Postgres schema, a GitLab/GitHub pipeline, and a minimal UI “path card”).
Below is a concrete development implementation plan to take the “Smart‑Diff” idea (reachability + dataflow + dependency/vuln context) into a shippable product integrated into your pipeline (Stella Ops style). I’ll assume the initial languages are .NET (C#) and Go, and the initial goal is PR gating + VEX automation with strong evidence (paths + file/line hops).
1) Product definition
Problem you’re solving
Security noise comes from:
- “Vuln exists in dependency” ≠ “vuln exploitable from any entrypoint”
- Git diffs look big even when behavior is unchanged
- Teams struggle to triage “is this change actually risky?”
What Smart‑Diff should do (core behavior)
Given base commit A and head commit B:
- Identify entrypoints (web handlers, RPC methods, message consumers, CLI commands).
- Identify sinks (file write, command exec, SQL, SSRF, deserialization, crypto misuse, templating, etc.).
- Compute reachable paths from entrypoints → sinks (call graph + dataflow/taint).
- Emit Smart‑Diff:
  - Newly reachable EP→sink paths (risk ↑)
  - Removed EP→sink paths (risk ↓)
  - Changed paths (same sink but different sanitization/guards)
- Attach dependency vulnerability context:
  - If a vulnerable API surface is reachable (or data reaches it), mark “affected/exploitable”
  - Otherwise generate VEX: “not affected” / “not exploitable” with evidence
MVP definition (minimum shippable)
A PR check that:
- Flags new reachable paths to a small set of high‑risk sinks (e.g., command exec, unsafe deserialization, filesystem write, SSRF/network dial, raw SQL).
- Produces:
  - SARIF report (for code scanning UI)
  - JSON artifact containing proof paths (EP → … → sink with file:line)
  - Optional VEX statement for dependency vulnerabilities (if you already have an SCA feed)
2) Architecture you can actually build
High‑level components
- Policy & Taxonomy Service
  - Defines entrypoints, sources, sinks, sanitizers, confidence rules
  - Versioned and centrally managed (but supports repo overrides)
- Analyzer Workers (language adapters)
  - .NET analyzer (Roslyn + control flow)
  - Go analyzer (SSA + callgraph)
  - Outputs standardized IR (Intermediate Representation)
- Graph Store + Reachability Engine
  - Stores symbol nodes + call edges + dataflow edges
  - Computes reachable sinks per entrypoint
  - Computes diff between commits A and B
- Vulnerability Mapper + VEX Generator
  - Maps vulnerable packages/functions → “surfaces”
  - Joins with reachability results
  - Emits OpenVEX (or CycloneDX VEX) with evidence links
- CI/PR Integrations
  - CLI that runs in CI
  - Optional server mode (cache + incremental processing)
- UI/API
  - Path cards: “what changed”, “why it matters”, “proof”
  - Filters by sink class, confidence, service, entrypoint
Data contracts (standardized IR)
Make every analyzer output the same shapes so the rest of the pipeline is language‑agnostic:
- Symbols
  - symbol_id: stable hash of (lang, module, fully-qualified name, signature)
  - metadata: file, line ranges, kind (method/function), accessibility
- Edges
  - Call edge: caller_symbol_id -> callee_symbol_id
  - Dataflow edge: source_symbol_id -> sink_symbol_id with variable/parameter traces
  - Edge metadata: type, confidence, reason (static, reflection guess, interface dispatch, etc.)
- Entrypoints / Sources / Sinks
  - entrypoint: (symbol_id, route/topic/command metadata)
  - sink: (symbol_id, sink_type, severity, cwe mapping optional)
- Paths
  - entrypoint -> ... -> sink
  - hop list: symbol_id + file:line, plus “dataflow step evidence” when relevant
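One way to picture these contracts is as Go structs with a hashed symbol ID. This is a sketch under stated assumptions: the field names, the SHA‑256 choice, the NUL separator and the truncation to 16 hex characters are illustrative, not a fixed schema.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// Symbol is one node of the IR. Field names are illustrative only.
type Symbol struct {
	ID        string
	Lang      string
	Module    string
	FQN       string
	Signature string
	File      string
	LineStart int
	LineEnd   int
}

// Edge is a call or dataflow edge between two symbols.
type Edge struct {
	From, To   string // symbol IDs
	Kind       string // "call" or "dataflow"
	Confidence string // "high", "medium" or "low"
	Reason     string // e.g. "static", "interface dispatch"
}

// SymbolID hashes (lang, module, fqn, signature). File and line are left out
// on purpose so that moving code around does not change the ID.
func SymbolID(lang, module, fqn, signature string) string {
	// The NUL separator keeps ("a", "bc") and ("ab", "c") from colliding.
	joined := strings.Join([]string{lang, module, fqn, signature}, "\x00")
	sum := sha256.Sum256([]byte(joined))
	return hex.EncodeToString(sum[:8]) // truncated for readability (an assumption)
}

func main() {
	s := Symbol{
		Lang:      "go",
		Module:    "example.com/invoices", // hypothetical module
		FQN:       "handlers.Upload",
		Signature: "func(http.ResponseWriter, *http.Request)",
		File:      "handlers/upload.go",
		LineStart: 12,
		LineEnd:   40,
	}
	s.ID = SymbolID(s.Lang, s.Module, s.FQN, s.Signature)
	e := Edge{From: s.ID, To: SymbolID("go", "std", "os.WriteFile", ""), Kind: "call", Confidence: "high", Reason: "static"}
	fmt.Println(s.ID, "->", e.To, e.Kind)
}
```

Keeping file and line out of the hash is what lets a pure refactor (moving a method between files) produce the same symbol IDs on both commits.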
3) Workstreams and deliverables
Workstream A — Policy, taxonomy, configuration
Deliverables
- smartdiff.policy.yaml schema and validator
- A default sink taxonomy:
  - CMD_EXEC, UNSAFE_DESER, SQL_RAW, SSRF, FILE_WRITE, PATH_TRAVERSAL, TEMPLATE_INJECTION, CRYPTO_WEAK, AUTHZ_BYPASS (expand later)
- Initial sanitizer patterns:
  - For example: parameter validation, safe deserialization wrappers, ORM parameterized APIs, path normalization, allowlists
Implementation notes
- Start strict and small: 10–20 sinks, 10 sources, 10 sanitizers.
- Provide repo-level overrides:
  - smartdiff.policy.yaml in repo root
  - Central policies referenced by version tag
Acceptance criteria
- A service can onboard by configuring:
  - entrypoint discovery mode (auto + manual)
  - sink classes to enforce
  - severity threshold to fail PR
Workstream B — .NET analyzer (Roslyn)
Deliverables
- Build pipeline that produces:
  - call graph (methods and invocations)
  - basic control-flow guards for reachability (optional for MVP)
  - taint propagation for common patterns (MVP: parameter → sink)
- Entrypoint discovery for:
  - ASP.NET controllers ([HttpGet], [HttpPost])
  - Minimal APIs (MapGet/MapPost)
  - gRPC service methods
  - message consumers (configurable attributes/interfaces)
Implementation notes (practical path)
- MVP static callgraph:
  - Use Roslyn semantic model to resolve invocation targets
  - For virtual/interface calls: conservative resolution to possible implementations within the compilation
- MVP taint:
  - “Sources”: request params/body, headers, query string, message payloads
  - “Sinks”: wrappers around Process.Start, SqlCommand, File.WriteAllText, HttpClient.Send, deserializers, etc.
  - Propagate taint across:
    - parameter → local → argument
    - return values
    - simple assignments and concatenations (heuristic)
- Confidence scoring:
  - Direct static call resolution: high
  - Reflection/dynamic: low (flag separately)
Acceptance criteria
- On a demo ASP.NET service, if a PR adds HttpPost /upload → File.WriteAllBytes(userPath, ...), Smart‑Diff flags a new EP→FILE_WRITE path and shows hops with file/line.
Workstream C — Go analyzer (SSA)
Deliverables
- SSA build + callgraph extraction
- Entrypoint discovery for:
  - net/http handlers
  - common routers (Gin/Echo/Chi) via adapter rules
  - gRPC methods
  - consumers (Kafka/NATS/etc.) by config
Implementation notes
- Use golang.org/x/tools/go/packages + ssa build
- Callgraph:
  - start with CHA (Class Hierarchy Analysis) for speed
  - later add pointer analysis for precision on interfaces
- Taint:
  - sources: http.Request, router params, message payloads
  - sinks: os/exec, database/sql raw query, file I/O, net/http outbound, unsafe deserialization libs
Acceptance criteria
- A PR that adds exec.Command(req.FormValue("cmd")) becomes a new EP→CMD_EXEC finding.
Workstream D — Graph store + reachability computation
Deliverables
- Schema in Postgres (recommended first) for:
  - commits, services, languages
  - symbols, edges, entrypoints, sinks
  - computed reachable “facts” (entrypoint→sink with shortest path(s))
- Reachability engine:
  - BFS/DFS per entrypoint with early cutoffs
  - path reconstruction storage (store predecessor map or store k-shortest paths)
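The per-entrypoint search with a predecessor map can be sketched as a breadth-first search over an adjacency list. The `Graph` type and `ShortestPath` function are hypothetical names for this example, and the sketch assumes symbol names are non-empty strings because the empty string marks the start of the path.

```go
package main

import "fmt"

// Graph maps a caller symbol to its callees (an adjacency list).
type Graph map[string][]string

// ShortestPath runs a breadth-first search from entry and returns the hop list
// to the first node for which isSink is true, or nil if no sink is reachable.
func ShortestPath(g Graph, entry string, isSink func(string) bool) []string {
	// The predecessor map doubles as the visited set; "" marks the entrypoint.
	pred := map[string]string{entry: ""}
	queue := []string{entry}
	for len(queue) > 0 {
		node := queue[0]
		queue = queue[1:]
		if isSink(node) {
			// Early cutoff: walk predecessors back to the entrypoint, then reverse.
			var path []string
			for n := node; n != ""; n = pred[n] {
				path = append(path, n)
			}
			for i, j := 0, len(path)-1; i < j; i, j = i+1, j-1 {
				path[i], path[j] = path[j], path[i]
			}
			return path
		}
		for _, next := range g[node] {
			if _, seen := pred[next]; !seen {
				pred[next] = node
				queue = append(queue, next)
			}
		}
	}
	return nil
}

func main() {
	g := Graph{
		"Upload":   {"Validate", "Save"},
		"Validate": {"Upload"}, // a cycle, handled by the visited check
		"Save":     {"os.WriteFile"},
	}
	sinks := map[string]bool{"os.WriteFile": true}
	path := ShortestPath(g, "Upload", func(s string) bool { return sinks[s] })
	fmt.Println(path) // prints [Upload Save os.WriteFile]
}
```

Because BFS visits nodes in order of distance, the first sink found yields a shortest hop list, which is usually the easiest proof path for a reviewer to read.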
Implementation notes
- Don’t start with a graph DB unless you must.
- Use Postgres tables + indexes:
  - edges(from_symbol, to_symbol, commit_id, kind)
  - symbols(symbol_id, lang, module, fqn, file, line_start, line_end)
  - reachability(entrypoint_id, sink_id, commit_id, path_hash, confidence, severity, evidence_json)
- Cache:
  - keyed by (commit, policy_version, analyzer_version)
  - avoids recompute on re-runs
Acceptance criteria
- For any analyzed commit, you can answer:
  - “Which sinks are reachable from these entrypoints?”
  - “Show me one proof path per (entrypoint, sink_type).”
Workstream E — Smart‑Diff engine (the “diff” part)
Deliverables
- Diff algorithm producing three buckets:
  - added_paths, removed_paths, changed_paths
- “Changed” means:
  - same entrypoint + sink type, but path differs OR taint/sanitization differs OR confidence changes
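A minimal sketch of the three-bucket diff follows, assuming findings are keyed by (entrypoint, sink type) and that everything else about a finding fits in a small comparable struct. The `Key`, `Finding` and `Compare` names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Key identifies a finding by entrypoint and sink type.
type Key struct{ Entrypoint, SinkType string }

// Finding carries the details that may change while the key stays the same.
type Finding struct {
	Hops       string // hop list joined into one string (assumed encoding)
	Sanitized  bool
	Confidence string
}

// Diff holds the three buckets.
type Diff struct{ Added, Removed, Changed []Key }

// Compare classifies every key seen in base or head.
func Compare(base, head map[Key]Finding) Diff {
	var d Diff
	for k, h := range head {
		b, ok := base[k]
		switch {
		case !ok:
			d.Added = append(d.Added, k)
		case b != h: // path, sanitization or confidence differs
			d.Changed = append(d.Changed, k)
		}
	}
	for k := range base {
		if _, ok := head[k]; !ok {
			d.Removed = append(d.Removed, k)
		}
	}
	// Sort each bucket so the output does not depend on map iteration order.
	for _, ks := range [][]Key{d.Added, d.Removed, d.Changed} {
		sort.Slice(ks, func(i, j int) bool {
			if ks[i].Entrypoint != ks[j].Entrypoint {
				return ks[i].Entrypoint < ks[j].Entrypoint
			}
			return ks[i].SinkType < ks[j].SinkType
		})
	}
	return d
}

func main() {
	base := map[Key]Finding{
		{"POST /upload", "FILE_WRITE"}: {Hops: "Upload>Save", Sanitized: true, Confidence: "high"},
		{"GET /report", "SQL_RAW"}:     {Hops: "Report>Query", Confidence: "high"},
	}
	head := map[Key]Finding{
		{"POST /upload", "FILE_WRITE"}: {Hops: "Upload>Save", Sanitized: false, Confidence: "high"},
		{"POST /run", "CMD_EXEC"}:      {Hops: "Run>Exec", Confidence: "medium"},
	}
	fmt.Printf("%+v\n", Compare(base, head))
}
```

In the usage example the upload path lands in the changed bucket because only its sanitization flag differs, which is the case a plain set difference on paths would miss.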
Implementation notes
- Identify a path by a stable fingerprint:
  - path_id = hash(entrypoint_symbol + sink_symbol + sink_type + policy_version + analyzer_version)
- Store:
  - top-k paths for each pair for evidence (k=1 for MVP, add more later)
- Severity gating rules:
  - Example:
    - New path to CMD_EXEC = fail
    - New path to FILE_WRITE = warn unless under /tmp allowlist
    - New path to SQL_RAW = fail unless parameterized sanitizer present
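The three example gating rules above could be encoded roughly as follows. This is a sketch only: the `NewPath` fields and the `Gate` function are hypothetical, the /tmp allowlist is hard-coded, and a real policy would come from configuration.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// Verdict is the outcome of a gating rule.
type Verdict string

const (
	Pass Verdict = "pass"
	Warn Verdict = "warn"
	Fail Verdict = "fail"
)

// NewPath describes a newly reachable entrypoint-to-sink path.
type NewPath struct {
	SinkType  string
	Target    string // destination file for FILE_WRITE; unused otherwise (assumed field)
	Sanitized bool   // true when a parameterized-query sanitizer was seen on the path
}

// Gate applies the three example rules from the list above.
func Gate(p NewPath) Verdict {
	switch p.SinkType {
	case "CMD_EXEC":
		return Fail
	case "FILE_WRITE":
		// Clean the path first so "/tmp/../etc/passwd" does not pass the allowlist.
		if strings.HasPrefix(path.Clean(p.Target), "/tmp/") {
			return Pass
		}
		return Warn
	case "SQL_RAW":
		if p.Sanitized {
			return Pass
		}
		return Fail
	}
	return Pass // sink types without a rule are not gated in this sketch
}

func main() {
	for _, p := range []NewPath{
		{SinkType: "CMD_EXEC"},
		{SinkType: "FILE_WRITE", Target: "/tmp/report.pdf"},
		{SinkType: "FILE_WRITE", Target: "/tmp/../etc/passwd"},
		{SinkType: "SQL_RAW", Sanitized: true},
	} {
		fmt.Println(p.SinkType, p.Target, "=>", Gate(p))
	}
}
```

Note that the target path is only known statically when it is a constant or a recognizable prefix; for fully dynamic paths the analyzer would have to fall back to the warn verdict.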
Acceptance criteria
- Given commits A and B:
  - If B introduces a new reachable sink, CI fails with a single actionable card:
    - EP: route / handler
    - Sink: type + symbol
    - Proof: hop list
    - Why: policy rule triggered
Workstream F — Vulnerability mapping + VEX
Deliverables
- Ingest dependency inventory (SBOM or lockfiles)
- Map vulnerabilities to “surfaces”
  - package → vulnerable module/function patterns
  - minimal version/range matching (from your existing vuln feed)
- Decision logic:
  - Affected if any reachable path intersects vulnerable surface OR dataflow reaches vulnerable sink
  - else Not affected / Not exploitable with justification
Implementation notes
- Start with a pragmatic approach:
  - package‑level reachability: “is any symbol in that package reachable?”
  - then iterate toward function‑level surfaces
- VEX output:
  - include commit hash, policy version, evidence paths
  - embed links to internal “path card” URLs if available
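The package-level starting point can be sketched as a simple join between a vulnerability list and a reachability index. The `Vuln` and `Statement` types are simplified stand-ins, not the OpenVEX schema; the status and justification strings follow OpenVEX's vocabulary, while the vulnerability IDs and package names are placeholders.

```go
package main

import "fmt"

// Vuln links a vulnerability ID to the package it affects (from an SCA feed).
type Vuln struct{ ID, Package string }

// Statement is a simplified VEX-like record, not the full OpenVEX document.
type Statement struct {
	VulnID        string
	Status        string   // "affected" or "not_affected"
	Justification string   // set only for not_affected
	Evidence      []string // proof paths, set only for affected
}

// Decide marks a vulnerability affected when any reachable path touches its
// package. reachable maps a package name to the proof paths that reach it.
func Decide(vulns []Vuln, reachable map[string][]string) []Statement {
	out := make([]Statement, 0, len(vulns))
	for _, v := range vulns {
		if paths := reachable[v.Package]; len(paths) > 0 {
			out = append(out, Statement{VulnID: v.ID, Status: "affected", Evidence: paths})
			continue
		}
		out = append(out, Statement{
			VulnID:        v.ID,
			Status:        "not_affected",
			Justification: "vulnerable_code_not_in_execute_path",
		})
	}
	return out
}

func main() {
	vulns := []Vuln{
		{ID: "CVE-0000-0001", Package: "example.com/yamlparser"}, // placeholder IDs
		{ID: "CVE-0000-0002", Package: "example.com/imaging"},
	}
	reachable := map[string][]string{
		"example.com/yamlparser": {"POST /config -> LoadConfig -> yamlparser.Unmarshal"},
	}
	for _, s := range Decide(vulns, reachable) {
		fmt.Printf("%s %s %s %v\n", s.VulnID, s.Status, s.Justification, s.Evidence)
	}
}
```

Package-level reachability is deliberately coarse: it over-reports "affected" when only a harmless function of the package is reached, which is the conservative direction for a first version.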
Acceptance criteria
- For a known vulnerable dependency, the system emits:
  - VEX “not affected” if package code is never reached from any entrypoint, with proof references.
Workstream G — CI integration + developer UX
Deliverables
- A single CLI:
  - smartdiff analyze --commit <sha> --service <svc> --lang <dotnet|go>
  - smartdiff diff --base <shaA> --head <shaB> --out sarif
- CI templates for:
  - GitHub Actions / GitLab CI
- Outputs:
  - SARIF
  - JSON evidence bundle
  - optional OpenVEX file
Acceptance criteria
- Teams can enable Smart‑Diff by adding:
  - CI job + config file
  - no additional infra required for MVP (local artifacts mode)
- When infra is available, enable server caching mode for speed.
Workstream H — UI “Path Cards”
Deliverables
- UI components:
  - Path card list with filters (sink type, severity, confidence)
  - “What changed” diff view:
    - red = added hops
    - green = removed hops
  - “Evidence” panel:
    - file:line for each hop
    - code snippets (optional)
- APIs:
  - GET /smartdiff/{repo}/{pr}/findings
  - GET /smartdiff/{repo}/{commit}/path/{path_id}
Acceptance criteria
- A developer can click one finding and understand:
  - how the data got there
  - exactly what line introduced the risk
  - how to fix (sanitize/guard/allowlist)
4) Milestone plan (sequenced, no time promises)
Milestone 0 — Foundation
- Repo scaffolding:
  - smartdiff-cli/
  - analyzers/dotnet/
  - analyzers/go/
  - core-ir/ (schemas + validation)
  - server/ (optional; can come later)
- Define IR JSON schema + versioning rules
- Implement policy YAML + validator + sample policies
- Implement “local mode” artifact output
Exit criteria
- You can run smartdiff analyze and get a valid IR file for at least one trivial repo.
Milestone 1 — Callgraph reachability MVP
- .NET: build call edges + entrypoint discovery (basic)
- Go: build call edges + entrypoint discovery (basic)
- Graph store: in-memory or local sqlite/postgres
- Compute reachable sinks (callgraph only, no taint)
Exit criteria
- On a demo repo, you can list:
  - entrypoints
  - reachable sinks (callgraph reachability only)
  - a proof path (hop list)
Milestone 2 — Smart‑Diff MVP (PR gating)
- Compute diff between base/head reachable sink sets
- Produce SARIF with:
  - rule id = sink type
  - message includes entrypoint + sink + link to evidence JSON
- CI templates + documentation
Exit criteria
- In PR checks, the job fails on new EP→sink paths and links to a proof.
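For the SARIF output in this milestone, a sketch of the mapping from a finding to a SARIF 2.1.0 result might look like this. The `Finding` struct and the tool name "smartdiff" are assumptions, the Go types cover only a small subset of the SARIF object model, and the evidence link mentioned above is left out for brevity.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Finding is a hypothetical Smart-Diff result; field names are assumptions.
type Finding struct {
	SinkType, Entrypoint, Sink, File string
	Line                             int
}

// A small subset of the SARIF 2.1.0 object model.
type (
	Log struct {
		Version string `json:"version"`
		Runs    []Run  `json:"runs"`
	}
	Run struct {
		Tool    Tool     `json:"tool"`
		Results []Result `json:"results"`
	}
	Tool struct {
		Driver Driver `json:"driver"`
	}
	Driver struct {
		Name string `json:"name"`
	}
	Result struct {
		RuleID    string     `json:"ruleId"`
		Level     string     `json:"level"`
		Message   Message    `json:"message"`
		Locations []Location `json:"locations"`
	}
	Message struct {
		Text string `json:"text"`
	}
	Location struct {
		Physical Physical `json:"physicalLocation"`
	}
	Physical struct {
		Artifact Artifact `json:"artifactLocation"`
		Region   Region   `json:"region"`
	}
	Artifact struct {
		URI string `json:"uri"`
	}
	Region struct {
		StartLine int `json:"startLine"`
	}
)

// BuildLog maps each finding to one result whose rule id is the sink type.
func BuildLog(findings []Finding) Log {
	results := make([]Result, 0, len(findings))
	for _, f := range findings {
		results = append(results, Result{
			RuleID:  f.SinkType,
			Level:   "error",
			Message: Message{Text: "New reachable path: " + f.Entrypoint + " -> " + f.Sink},
			Locations: []Location{{Physical: Physical{
				Artifact: Artifact{URI: f.File},
				Region:   Region{StartLine: f.Line},
			}}},
		})
	}
	return Log{Version: "2.1.0", Runs: []Run{{Tool: Tool{Driver: Driver{Name: "smartdiff"}}, Results: results}}}
}

func main() {
	doc := BuildLog([]Finding{{
		SinkType: "CMD_EXEC", Entrypoint: "POST /run", Sink: "os/exec.Command",
		File: "handlers/run.go", Line: 27,
	}})
	out, err := json.MarshalIndent(doc, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```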
Milestone 3 — Taint/dataflow MVP (high-value sinks only)
- Add taint propagation to reduce false positives:
  - differentiate “sink reachable” vs “untrusted data reaches sink”
- Add sanitizer recognition
- Add confidence scoring + suppression mechanisms (policy allowlists)
Exit criteria
- A sink is only “high severity” if it is both reachable and tainted (or policy says otherwise).
Milestone 4 — VEX integration MVP
- Join reachability with dependency vulnerabilities
- Emit OpenVEX (and/or CycloneDX VEX)
- Store evidence references (paths) inside VEX justification
Exit criteria
- For a repo with a vulnerable dependency, you can automatically produce:
  - affected/not affected with evidence.
Milestone 5 — Scale and precision improvements
- Incremental analysis (only analyze changed projects/packages)
- Better dynamic dispatch handling (Go pointer analysis, .NET interface dispatch expansion)
- Optional runtime telemetry integration:
  - import production traces to prioritize “actually observed” entrypoints
Exit criteria
- Works on large services with acceptable run time and stable noise levels.
5) Backlog you can paste into Jira (epics + key stories)
Epic: Policy & taxonomy
- Story: Define smartdiff.policy.yaml schema and validator AC: invalid configs fail with clear errors; configs are versioned.
- Story: Provide default sink list and severities AC: at least 10 sink rules with test cases.
Epic: .NET analyzer
- Story: Resolve method invocations to symbols (Roslyn) AC: correct targets for direct calls; conservative handling for virtual calls.
- Story: Discover ASP.NET routes and bind to entrypoint symbols AC: entrypoints include route/method metadata.
Epic: Go analyzer
- Story: SSA build and callgraph extraction AC: function nodes and edges generated for a multi-package repo.
- Story: net/http entrypoint discovery AC: handler functions recognized as entrypoints with path labels.
Epic: Reachability engine
- Story: Compute reachable sinks per entrypoint AC: store at least one path with hop list.
- Story: Smart‑Diff A vs B AC: added/removed paths computed deterministically.
Epic: CI/SARIF
- Story: Emit SARIF results AC: findings appear in code scanning UI; include file/line.
Epic: Taint analysis
- Story: Propagate taint from request to sink for 3 sink classes AC: produces “tainted” evidence with a variable/argument trace.
- Story: Sanitizer recognition AC: path marked “sanitized” and downgraded per policy.
Epic: VEX
- Story: Generate OpenVEX statements from reachability + vuln feed AC: for “not affected” includes justification and evidence references.
6) Key engineering decisions (recommended defaults)
Storage
- Start with Postgres (or even local sqlite for MVP) for simplicity.
- Introduce a graph DB only if:
  - you need very large multi-commit graph queries at low latency
  - Postgres performance becomes a hard blocker
Confidence model
Every edge/path should carry:
- confidence: High/Med/Low
- reasons: e.g., DirectCall, InterfaceDispatch, ReflectionGuess, RouterHeuristic
This lets you:
- gate only on high-confidence paths in early rollout
- keep low-confidence as “informational”
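A small sketch of how such a confidence model might be carried through a path is shown below. Treating a path as only as trustworthy as its weakest edge is an assumption of this example, not something the plan prescribes, and the `PathConfidence` and `Blocks` names are hypothetical.

```go
package main

import "fmt"

// Confidence is ordered so that values can be compared with < and >=.
type Confidence int

const (
	Low Confidence = iota
	Medium
	High
)

func (c Confidence) String() string {
	return [...]string{"low", "medium", "high"}[c]
}

// Edge carries its own confidence and the reason for it.
type Edge struct {
	From, To   string
	Confidence Confidence
	Reason     string // e.g. DirectCall, InterfaceDispatch, ReflectionGuess
}

// PathConfidence returns the lowest confidence among the edges
// (weakest-link rule, an assumption of this sketch).
func PathConfidence(edges []Edge) Confidence {
	c := High
	for _, e := range edges {
		if e.Confidence < c {
			c = e.Confidence
		}
	}
	return c
}

// Blocks reports whether a path meets the gating threshold.
// Paths below the threshold would be shown as informational only.
func Blocks(edges []Edge, threshold Confidence) bool {
	return PathConfidence(edges) >= threshold
}

func main() {
	path := []Edge{
		{"Upload", "IStore.Save", High, "DirectCall"},
		{"IStore.Save", "DiskStore.Save", Medium, "InterfaceDispatch"},
	}
	fmt.Println(PathConfidence(path), Blocks(path, High), Blocks(path, Medium))
}
```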
Suppression model
- Local suppressions:
  - smartdiff.suppress.yaml with rule id + symbol id + reason + expiry
- Policy allowlists:
  - allow file writes only under certain directories
  - allow outbound network only to configured domains
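Local suppressions with an expiry could be checked along these lines. The `Suppression` struct mirrors the four fields listed above; the `IsSuppressed` function and the `day` helper are hypothetical and exist only for this sketch.

```go
package main

import (
	"fmt"
	"time"
)

// Suppression mirrors one entry of smartdiff.suppress.yaml.
type Suppression struct {
	RuleID   string
	SymbolID string
	Reason   string
	Expiry   time.Time
}

// day is a small helper that builds a UTC date (hypothetical, for the example).
func day(y, m, d int) time.Time {
	return time.Date(y, time.Month(m), d, 0, 0, 0, 0, time.UTC)
}

// IsSuppressed reports whether a finding matches an unexpired suppression.
// Expired entries are ignored, so the finding resurfaces automatically.
func IsSuppressed(rules []Suppression, ruleID, symbolID string, now time.Time) bool {
	for _, s := range rules {
		if s.RuleID == ruleID && s.SymbolID == symbolID && now.Before(s.Expiry) {
			return true
		}
	}
	return false
}

func main() {
	rules := []Suppression{{
		RuleID:   "FILE_WRITE",
		SymbolID: "3f2a9c01d4e5b6a7", // placeholder symbol id
		Reason:   "export directory is sandboxed",
		Expiry:   day(2030, 1, 1),
	}}
	fmt.Println(IsSuppressed(rules, "FILE_WRITE", "3f2a9c01d4e5b6a7", day(2029, 6, 1)))
	fmt.Println(IsSuppressed(rules, "FILE_WRITE", "3f2a9c01d4e5b6a7", day(2030, 6, 1)))
}
```

Passing the current time as a parameter keeps the check deterministic in tests.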
7) Testing strategy (to avoid “cool demo, unusable tool”)
Unit tests
- Symbol hashing stability tests
- Call resolution tests:
  - overloads, generics, interfaces, lambdas
- Policy parsing/validation tests
Integration tests (must-have)
- Golden repos in testdata/:
  - one ASP.NET minimal API
  - one MVC controller app
  - one Go net/http + one Gin app
- Golden outputs:
  - expected entrypoints
  - expected reachable sinks
  - expected diff between commits
Regression tests
- A curated corpus of “known issues”:
  - false positives you fixed should never return
  - false negatives: ensure known risky path is always found
Performance tests
- Measure:
  - analysis time per 50k LOC
  - memory peak
  - graph size
- Budget enforcement:
  - if over budget, degrade gracefully (lower precision, mark low confidence)
8) Example configs and outputs (to make onboarding easy)
Example policy YAML (minimal)
version: 1
service: invoices-api
entrypoints:
  autodiscover:
    dotnet:
      aspnet: true
    go:
      net_http: true
sinks:
  - type: CMD_EXEC
    severity: high
    match:
      dotnet:
        symbols:
          - "System.Diagnostics.Process.Start(string)"
      go:
        symbols:
          - "os/exec.Command"
  - type: FILE_WRITE
    severity: medium
    match:
      dotnet:
        namespaces: ["System.IO"]
      go:
        symbols: ["os.WriteFile"]
gating:
  fail_on:
    - sink_type: CMD_EXEC
      when: "added && confidence >= medium"
    - sink_type: FILE_WRITE
      when: "added && tainted && confidence >= medium"
Evidence JSON shape (what the UI consumes)
{
  "commit": "abc123",
  "entrypoint": {"symbol": "InvoicesController.Upload()", "route": "POST /upload"},
  "sink": {"type": "FILE_WRITE", "symbol": "System.IO.File.WriteAllBytes"},
  "confidence": "high",
  "tainted": true,
  "path": [
    {"symbol": "InvoicesController.Upload()", "file": "Controllers/InvoicesController.cs", "line": 42},
    {"symbol": "UploadService.Save()", "file": "Services/UploadService.cs", "line": 18},
    {"symbol": "System.IO.File.WriteAllBytes", "file": null, "line": null}
  ]
}
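A consumer of this evidence shape might decode it with Go structs like the following. The struct and function names are hypothetical; the JSON keys are taken from the example above. Pointer fields are used for file and line because the last hop (a framework symbol) carries null for both.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hop is one step of the proof path.
type Hop struct {
	Symbol string  `json:"symbol"`
	File   *string `json:"file"` // nil when the symbol has no source location
	Line   *int    `json:"line"`
}

// Evidence mirrors the JSON shape shown above.
type Evidence struct {
	Commit     string `json:"commit"`
	Entrypoint struct {
		Symbol string `json:"symbol"`
		Route  string `json:"route"`
	} `json:"entrypoint"`
	Sink struct {
		Type   string `json:"type"`
		Symbol string `json:"symbol"`
	} `json:"sink"`
	Confidence string `json:"confidence"`
	Tainted    bool   `json:"tainted"`
	Path       []Hop  `json:"path"`
}

// ParseEvidence decodes one evidence document.
func ParseEvidence(data []byte) (Evidence, error) {
	var ev Evidence
	err := json.Unmarshal(data, &ev)
	return ev, err
}

func main() {
	doc := `{
	  "commit": "abc123",
	  "entrypoint": {"symbol": "InvoicesController.Upload()", "route": "POST /upload"},
	  "sink": {"type": "FILE_WRITE", "symbol": "System.IO.File.WriteAllBytes"},
	  "confidence": "high",
	  "tainted": true,
	  "path": [
	    {"symbol": "InvoicesController.Upload()", "file": "Controllers/InvoicesController.cs", "line": 42},
	    {"symbol": "System.IO.File.WriteAllBytes", "file": null, "line": null}
	  ]
	}`
	ev, err := ParseEvidence([]byte(doc))
	if err != nil {
		panic(err)
	}
	for _, h := range ev.Path {
		if h.File == nil || h.Line == nil {
			fmt.Println(h.Symbol, "(no source location)")
			continue
		}
		fmt.Printf("%s %s:%d\n", h.Symbol, *h.File, *h.Line)
	}
}
```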
9) Risks and mitigations (explicit)
- Dynamic behavior (reflection, DI, router magic)
  - Mitigation: conservative fallbacks + confidence labels + optional runtime traces later
- Noise from huge callgraphs
  - Mitigation: sink-first slicing (compute reachability backwards from sinks), entrypoint scoping, k‑shortest paths only
- Large repo build failures
  - Mitigation: analyzer runs inside build containers; allow partial analysis with explicit “incomplete” result flag
- Teams rejecting gating
  - Mitigation: staged rollout:
    - Observe-only mode → warn-only → fail-only for high-confidence CMD_EXEC/UNSAFE_DESER
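The sink-first slicing mitigation mentioned above can be sketched as a backward walk over the reversed call graph: only nodes that can reach some sink are kept, and any entrypoint outside that set can be skipped. The `CanReachSink` name and the map-based graph are assumptions of this example.

```go
package main

import "fmt"

// CanReachSink returns every node from which at least one sink is reachable,
// found by walking the call graph backwards from the sinks.
func CanReachSink(calls map[string][]string, sinks []string) map[string]bool {
	// Build the reversed graph: callee -> callers.
	callers := make(map[string][]string)
	for from, tos := range calls {
		for _, to := range tos {
			callers[to] = append(callers[to], from)
		}
	}
	seen := make(map[string]bool)
	stack := append([]string(nil), sinks...)
	for len(stack) > 0 {
		n := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if seen[n] {
			continue
		}
		seen[n] = true
		stack = append(stack, callers[n]...)
	}
	return seen
}

func main() {
	calls := map[string][]string{
		"GET /health": {"Status"},
		"POST /run":   {"RunJob"},
		"RunJob":      {"exec.Command"},
		"Status":      {"fmt.Sprintf"},
	}
	slice := CanReachSink(calls, []string{"exec.Command"})
	for _, ep := range []string{"GET /health", "POST /run"} {
		fmt.Println(ep, "can reach a sink:", slice[ep])
	}
}
```

Since sinks are usually far fewer than entrypoints, one backward pass like this tends to prune most of the graph before any per-entrypoint path search runs.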
10) Definition of done (what “implemented” means)
You should consider Smart‑Diff “implemented” when:
- A repo can enable it with one config + one CI job.
- PRs get:
  - a small number of actionable findings (not hundreds)
  - each finding has a proof path with file/line hops
- It reliably detects at least:
  - new command execution paths
  - new unsafe deserialization paths
  - new tainted filesystem write paths
- It can optionally emit VEX decisions backed by reachability evidence.
If you want the next step, I can also give you:
- a concrete repo layout with module boundaries,
- the Postgres schema (tables + indexes),
- and a language adapter interface (so adding Java/Python later is straightforward).