You might find this relevant — recent developments and research strengthen the case for **context‑aware, evidence‑backed vulnerability triage and evaluation metrics** when running container security at scale.

---

## 🔎 Why reachability‑with‑evidence matters now

* The latest update from Snyk Container (Nov 4, 2025) signals a shift: the tool will begin integrating **runtime insights as a "signal"** in its Container Registry Sync service, making it possible to link vulnerabilities to images actually deployed in production — not just theoretical ones. ([Snyk][1])
* The plan is to evolve from static‑scan noise (a long list of CVEs) to a **prioritized, actionable workflow** where developers and security teams see which issues truly matter based on real deployment context: what's running, what's reachable, and thus what's realistically exploitable. ([Snyk][1])
* This aligns with the broader shift toward container runtime security: static scanning alone misses a lot — configuration drift, privilege escalation, unexpected container behavior, and misconfigurations only visible at runtime. ([Snyk][2])

**Implication:** The future of container‑security triage will rely heavily on runtime/context signals — increasing confidence that flagged issues are genuinely relevant and deserve remediation urgency.

---

## ⚠️ Why heuristics & scanner evaluation matter — and how unreliable "gold standards" can be

* A recent study, *A Comparative Analysis of Docker Image Security*, analyzed **927 Docker images** with two popular scanners (Trivy and Grype). Among the 865 images flagged as vulnerable, the two tools **disagreed both on the total number of vulnerabilities and on the specific CVE IDs** found per image. ([montana.edu][3])
* A more recent study, *Consistency Evaluation of Container Vulnerability Scanners* (2025), observed **low consistency and similarity** between tools' results when applied to the same container datasets — even under the VEX/SBOM‑based paradigm. ([arXiv][4])
* The root causes include divergent vulnerability databases, differing matching logic, and incomplete visibility (some scanners miss obscure containers or layers). ([montana.edu][3])

**Implication:** There is **no reliable "single source of truth"** today. Without golden fixtures and metrics like *proof coverage %*, *diff‑closure rate*, or *mean time to evidence* (i.e. how long until a vulnerability is confirmed exploitable at runtime), relying solely on scans is risky.

---

## 🧩 What this means for your security strategy (and for Stella Ops)

Given what you're building around Stella Ops — with its emphasis on **deterministic, replayable scans, cryptographic integrity, and VEX/SBOM proofs** — this context reinforces why your "moats" are necessary and well‑timed:

* Prioritizing runtime evidence (reachability, runtime context) — like what Snyk is now moving toward — can help reduce noise and focus your remediation on what truly matters.
* Golden‑fixture benchmarks — container images with known, audited vulnerabilities and expected outcomes — are critical to evaluate scanner performance over time.
* Metrics such as closure rate (how fast an issue goes from flagged to confirmed exploitable), proof coverage (percentage of dependencies with valid SBOM/VEX proofs), and differential closure (how new database updates or policy changes affect prior scan results) should be part of any mature container‑security program.

---

If you like — I can dig up **3–5 recent academic or industry studies (2024–2025)** that benchmark scanners *with* runtime detection, to show where the frontier is heading.

[1]: https://snyk.io/blog/future-snyk-container/ "Beyond the Scan: The Future of Snyk Container"
[2]: https://snyk.io/articles/container-runtime-security/ "What is Container Runtime Security?"
[3]: https://www.montana.edu/cyber/products/Grype_Vs_Trivy_Boles_et_al.pdf "A Comparative Analysis of Docker Image Security"
[4]: https://arxiv.org/html/2503.14388v1 "Consistency evaluation of container vulnerability scanners"

---

## Comparative Analysis of Container Vulnerability Scanning and Prioritization Studies (2024–2025)

### 1. Consistency Evaluation of Container Vulnerability Scanners (2025)

Methodology: This study evaluates VEX-enabled container scanners by measuring their consistency across a common dataset[1]. The authors assembled 48 Docker images (with fixed hashes for reproducibility[2]) divided into subsets: 8 images with no known vulns, 8 with a high vuln count (as per Docker Hub data), and 32 random images[3][4]. Seven scanning tools supporting the Vulnerability Exploitability eXchange (VEX) format were tested: Trivy, Grype, OWASP DepScan, Docker Scout, Snyk CLI, OSV-Scanner, and "Vexy"[5]. For fairness, each tool was run in its default optimal mode – e.g. directly scanning the image when possible, or scanning a uniform SBOM (CycloneDX 1.4/SPDX 2.3) generated by Docker Scout for tools that cannot scan images directly[6]. The output of each tool is a VEX report listing vulnerabilities and their exploitability status. The study then compared tools' outputs in terms of vulnerabilities found and their statuses. Crucially, instead of attempting to know the absolute ground truth, they assessed pairwise and multi-tool agreement. They computed the Jaccard similarity between each pair of tools' vulnerability sets[7] and a generalized Tversky index for overlap among groups of tools[8]. Key metrics included the total number of vulns each tool reported per image subset and the overlap fraction of specific CVEs identified. Findings and Algorithms: The results revealed large inconsistencies among scanners. For the full image set, one tool (DepScan) reported 18,680 vulnerabilities while another (Vexy) reported only 191 – a two orders of magnitude difference[9]. Even tools with similar totals did not necessarily find the same CVEs[10]. For example, Trivy vs Grype had relatively close counts (~12.3k vs ~12.8k on the complete set) yet still differed in specific vulns found. No two tools produced identical vulnerability lists or statuses for an image[11]. Pairwise Jaccard indices were very low (often near 0), indicating minimal overlap in the sets of CVEs found by different scanners[11]. Even the four "most consistent" tools combined (Grype, Trivy, Docker Scout, Snyk) shared only ~18% of their vulnerabilities in common[12].
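A minimal sketch of how such overlap metrics can be computed, assuming each tool's report has already been reduced to a set of vulnerability IDs per image; tool names and CVE IDs below are placeholders, and the group metric is a simple shared-over-all variant (the paper's generalized Tversky index may be parameterized differently):

```python
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """Pairwise Jaccard similarity |A ∩ B| / |A ∪ B| of two tools' vulnerability sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def group_overlap(sets: list) -> float:
    """Shared-over-all overlap for a group of tools (simple Tversky-style variant)."""
    common = set.intersection(*sets)
    union = set.union(*sets)
    return len(common) / len(union) if union else 1.0

# Hypothetical per-tool CVE sets for a single image (IDs are placeholders).
results = {
    "trivy": {"CVE-2023-0001", "CVE-2023-0002", "CVE-2023-0004"},
    "grype": {"CVE-2023-0001", "CVE-2023-0003"},
    "scout": {"CVE-2023-0001", "CVE-2023-0002"},
}

for (t1, s1), (t2, s2) in combinations(results.items(), 2):
    print(f"{t1} vs {t2}: Jaccard = {jaccard(s1, s2):.2f}")
print("shared by all tools:", group_overlap(list(results.values())))
```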
This suggests that each scanner misses or filters out many issues that others catch, reflecting differences in vulnerability databases and detection logic. The study did not introduce a new scanning algorithm but leveraged consistency as a proxy for scanner quality. By using Jaccard/Tversky similarity[1][7], the authors quantify how “mature” the VEX tool ecosystem is – low consistency implies that at least some tools are producing false positives or false negatives relative to others. They also examined the “status” field in VEX outputs (which marks if a vulnerability is affected/exploitable or not). The number of vulns marked “affected” varied widely between tools (e.g. on one subset, Trivy marked 7,767 as affected vs Docker Scout 1,266, etc.), and some tools (OSV-Scanner, Vexy) don’t provide an exploitability status at all[13]. This further complicates direct comparisons. These discrepancies arise from differences in detection heuristics: e.g. whether a scanner pulls in upstream vendor advisories, how it matches package versions, and whether it suppresses vulnerabilities deemed not reachable. The authors performed additional experiments (such as normalizing on common vulnerability IDs and re-running comparisons) to find explanations, but results remained largely inconclusive – hinting that systematic causes (like inconsistent SBOM generation, alias resolution, or runtime context assumptions) underlie the variance, requiring further research. Unique Features: This work is the first to quantitatively assess consistency among container vulnerability scanners in the context of VEX. By focusing on VEX (which augments SBOMs with exploitability info), the study touches on reachability indirectly – a vuln marked “not affected” in VEX implies it’s present but not actually reachable in that product. The comparison highlights that different tools assign exploitability differently (some default everything to “affected” if found, while others omit the field)[13]. The study’s experimental design is itself a contribution: a reusable suite of tests with a fixed set of container images (they published the image hashes and SBOM details so others can reproduce the analysis easily[2][14]). This serves as a potential “golden dataset” for future scanner evaluations[15]. The authors suggest that as VEX tooling matures, consistency should improve – and propose tracking these experiments over time as a benchmark. Another notable aspect is the discussion on using multiple scanners: if one assumes that overlapping findings are more likely true positives, security teams could choose to focus on vulnerabilities found by several tools in common (to reduce false alarms), or conversely aggregate across tools to minimize false negatives[16]. In short, this study reveals an immature ecosystem – low overlap implies that container image risk can vary dramatically depending on which scanner is used, underscoring the need for better standards (in SBOM content, vulnerability databases, and exploitability criteria). Reproducibility: All tools used are publicly available, and specific versions were used (though not explicitly listed in the snippet, presumably latest as of early 2024). The container selection (with specific digests) and consistent SBOM formats ensure others can replicate the tests[2][14]. The similarity metrics (Jaccard, Tversky) are well-defined and can be re-calculated by others on the shared data. 
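The multi-scanner strategy mentioned above reduces to simple set bookkeeping. A sketch of both the consensus view and the union view, with placeholder tool names and IDs (not code from the study):

```python
from collections import Counter

# Hypothetical per-tool CVE sets for one image (placeholder IDs).
results = {
    "trivy": {"CVE-2023-0001", "CVE-2023-0002"},
    "grype": {"CVE-2023-0001", "CVE-2023-0003"},
    "scout": {"CVE-2023-0001", "CVE-2023-0002"},
}

counts = Counter(vuln for ids in results.values() for vuln in ids)
union = set(counts)                                   # aggregate: minimizes false negatives
consensus = {v for v, n in counts.items() if n >= 2}  # agreement: trims single-tool findings

print("union (investigate eventually):", sorted(union))
print("consensus (fix first):", sorted(consensus))
```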
This work thus provides a baseline for future studies to measure if newer scanners or versions converge on results or not. The authors openly admit that they could not define absolute ground truth, but by focusing on consistency, they provide a practical way to benchmark scanners without needing perfect knowledge of each vulnerability – a useful approach for the community to adopt moving forward.

### 2. A Comparative Analysis of Docker Image Security (Montana State University, 2024)

Methodology: This study (titled "Deciphering Discrepancies") systematically compares two popular static container scanners, Trivy and Grype, to understand why their results differ[17]. The researchers built a large corpus of 927 Docker images, drawn from the top 97 most-pulled "Official" images on Docker Hub (as of Feb 2024) with up to 10 evenly-spaced version tags each[18]. Both tools were run on each image version under controlled conditions: the team froze the vulnerability database feeds on a specific date for each tool to ensure they were working with the same knowledge base throughout the experiment[19]. (They downloaded Grype's and Trivy's advisory databases on Nov 11, 2023 and used those snapshots for all scans, preventing daily updates from skewing results[19].) They also used the latest releases of the tools at the time (Trivy v0.49.0 and Grype v0.73.0) and standardized scan settings (e.g. extended timeouts for large images to avoid timeouts)[20]. If a tool failed on an image or produced an empty result due to format issues, that image was excluded to keep comparisons apples-to-apples[21]. After scanning, the team aggregated the results to compare: (1) total vulnerability counts per image (and differences between the two tools), (2) the identity of vulnerabilities reported (CVE or other IDs), and (3) metadata like severity ratings. They visualized the distribution of count differences with a density plot (difference = Grype findings minus Trivy findings)[22] and computed statistics such as mean and standard deviation of the count gap[23]. They also tabulated the breakdown of vulnerability ID types each tool produced (CVE vs GHSA vs distro-specific IDs)[24], and manually examined cases of severity rating mismatches. Findings: The analysis uncovered striking discrepancies in scan outputs, even though both Trivy and Grype are reputable scanners. Grype reported significantly more vulnerabilities than Trivy in the majority of cases[25]. Summed over the entire corpus, Grype found ~603,259 vulnerabilities while Trivy found ~473,661[25] – a difference of ~130k. On a per-image basis, Grype's count was higher on ~84.6% of images[25]. The average image saw Trivy report ~140 fewer vulns than Grype (with a large std deviation ~357)[26]. In some images the gap was extreme – e.g. for the image python:3.7.6-stretch, Trivy found 3,208 vulns vs Grype's 5,724, a difference of 2,516[27][28]. Crucially, the tools almost never fully agreed. They reported the exact same number of vulnerabilities in only 9.2% of non-empty cases (80 out of 865 vulnerable images)[29], and even in those 80 cases, the specific vulnerability IDs did not match[30]. In fact, the only scenario where Trivy and Grype produced identical outputs was when an image had no vulnerabilities at all (they both output nothing)[31]. This means every time they found issues, the list of CVEs differed – highlighting how scanner databases and matching logic diverge.
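This kind of per-image comparison can be reproduced by diffing the two tools' JSON reports. A hedged sketch: the JSON paths shown match recent Trivy (`trivy image --format json`) and Grype (`grype -o json`) output as I understand it, but output schemas change between versions, and the file names are placeholders.

```python
import json

def trivy_ids(path: str) -> set:
    """Vulnerability IDs from a `trivy image --format json` report."""
    with open(path) as f:
        report = json.load(f)
    return {
        v["VulnerabilityID"]
        for result in report.get("Results", [])
        for v in result.get("Vulnerabilities") or []
    }

def grype_ids(path: str) -> set:
    """Vulnerability IDs from a `grype -o json` report."""
    with open(path) as f:
        report = json.load(f)
    return {m["vulnerability"]["id"] for m in report.get("matches", [])}

# Placeholder file names; produce them by scanning the same image with both tools.
t, g = trivy_ids("trivy-report.json"), grype_ids("grype-report.json")
print(f"trivy={len(t)}  grype={len(g)}  count gap={len(g) - len(t)}")
print("only in grype:", sorted(g - t)[:10])
print("only in trivy:", sorted(t - g)[:10])
```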
The study’s deeper dive provides an explanation: Trivy and Grype pull from different sets of vulnerability databases and handle the data differently[32][33]. Both tools use the major feeds (e.g. NVD and GitHub Advisory Database), but Trivy integrates many additional vendor feeds (Debian, Ubuntu, Alpine, Red Hat, Amazon Linux, etc.), nine more sources than Grype[34]. Intuitively one might expect Trivy (with more sources) to find more issues, but the opposite occurred – Trivy found fewer. This is attributed to how each tool aggregates and filters vulnerabilities. Trivy’s design is to merge vulnerabilities that are considered the same across databases: it treats different IDs referring to the same flaw as one entry (for example, if a CVE from NVD and a GHSA from GitHub refer to the same underlying vuln, Trivy’s database ties them together under a single record, usually the CVE)[35][36]. Grype, on the other hand, tends to keep entries separate by source; it reported thousands of GitHub-origin IDs (26k+ GHSA IDs) and even Amazon and Oracle advisory IDs (ALAS, ELSA) that Trivy never reported[37][38]. In the corpus, Trivy marked 98.5% of its findings with CVE IDs, whereas Grype’s findings were only 95.1% CVEs, with the rest being GHSA/ALAS/ELSA, etc.[39][33]. This indicates Grype is surfacing a lot of distro-specific advisories as separate issues. However, the study noted that duplicate counting (the same vulnerability counted twice by Grype) was relatively rare – only 675 instances of obvious double counts in Grype’s 600k findings[40]. So the difference isn’t simply Grype counting the same vuln twice; rather, it’s that Grype finds additional unique issues linked to those non-CVE advisories. Some of these could be genuine (e.g. Grype might include vulnerabilities specific to certain Linux distros that Trivy’s feeds missed), while others might be aliases that Trivy merged under a CVE. The researchers also observed severity rating inconsistencies: in 60,799 cases, Trivy and Grype gave different severity levels to the same CVE[41]. For instance, CVE-2019-17594 was “Medium” according to Grype but “Low” in Trivy, and even more dramatically, CVE-2019-8457 was tagged Critical by Trivy but only Negligible by Grype[42]. These conflicts arise because the tools pull severity info from different sources (NVD vs vendor scoring) or update at different times. Such disparities can lead to confusion in prioritization – an issue one scanner urges you to treat as critical, another almost ignores. The authors then discussed root causes. They found that simply using different external databases was not the primary cause of count differences – indeed Trivy uses more databases yet found fewer vulns[43]. Instead, they point to internal processing and filtering heuristics. For example, each tool has its own logic to match installed packages to known vulnerabilities: Grype historically relied on broad CPE matching which could flag many false positives, but recent versions (like the one used) introduced stricter matching to reduce noise[44]. Trivy might be dropping vulnerabilities that it deems “fixed” or not actually present due to how it matches package versions or combines records. The paper hypothesizes that Trivy’s alias consolidation (merging GHSA entries into CVEs) causes it to report fewer total IDs[32]. 
Supporting this, Trivy showed virtually zero ALAS/ELSA, etc., because it likely converted those to CVEs or ignored them if a CVE existed; Grype, lacking some of Trivy's extra feeds, surprisingly had more findings – suggesting Trivy may be deliberately excluding some things (perhaps to cut false positives from vendor feeds or to avoid duplication). In summary, the study revealed that scanner results differ wildly due to a complex interplay of data sources and design choices. Unique Contributions: This work is notable for its scale (scanning ~900 real-world images) and its focus on the causes of scanner discrepancies. It provides one of the first extensive empirical validations that "which scanner you use" can significantly alter your security conclusions for container images. Unlike prior works that might compare tools on a handful of images, this study's breadth lends statistical weight to the differences. The authors also contributed a Zenodo archive of their pipeline and dataset, enabling others to reproduce or extend the research[18]. This includes the list of image names/versions, the exact scanner database snapshots, and scripts used – effectively a benchmark suite for scanner comparison. By dissecting results into ID categories and severity mismatches, the paper highlights specific pain points: e.g. the handling of alias vulnerabilities (CVE vs GHSA, etc.) and inconsistent scoring. These insights can guide tool developers to improve consistency (perhaps by adopting a common data taxonomy or making alias resolution more transparent). From a practitioner standpoint, the findings reinforce that static image scanning is far from deterministic – security teams should be aware that using multiple scanners might be necessary to get a complete picture, albeit at the cost of more false positives. In fact, the disagreement suggests an opportunity for a combined approach: one could take the union of Trivy and Grype results to minimize missed issues, or the intersection to focus on consensus high-likelihood issues. The paper doesn't prescribe one, but it raises awareness that trust in scanners should be tempered. It also gently suggests that simply counting vulnerabilities (as many compliance checks do) is misleading – different tools count differently – so organizations should instead focus on specific high-risk vulns and how they impact their environment. Reproducibility: The study stands out for its strong reproducibility measures. By freezing tool databases at a point in time, it eliminated the usual hurdle that vulnerability scanners constantly update (making results from yesterday vs today incomparable). They documented and shared these snapshots, meaning anyone can rerun Trivy and Grype with those database versions to get identical results[19]. They also handled corner cases (images causing errors) by removing them, which is documented, so others know the exact set of images used[21]. The analysis code for computing differences and plotting distributions is provided via DOI[18]. This openness is exemplary in academic tool evaluations. It means the community can verify the claims or even plug in new scanners (e.g., compare Anchore's Syft/Grype vs Aqua's Trivy vs Red Hat's Clair, etc.) on the same corpus. Over time, it would be interesting to see if these tools converge (e.g., if Grype incorporates more feeds or Trivy changes its aggregation).
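One way to control for the alias effect described above is to normalize every reported ID to a canonical identifier before diffing the tools. A hedged sketch using the public OSV.dev API, whose vulnerability records expose an `aliases` list; this is an illustration of the idea, not what the paper did, and rate limits, offline use, and ID coverage (e.g. whether a given ALAS/ELSA ID is known to OSV) are left aside.

```python
import json
import urllib.request
from functools import lru_cache

@lru_cache(maxsize=None)
def canonical_id(vuln_id: str) -> str:
    """Map GHSA/ALAS/ELSA-style IDs to a CVE alias via OSV.dev, if one exists."""
    if vuln_id.startswith("CVE-"):
        return vuln_id
    url = f"https://api.osv.dev/v1/vulns/{vuln_id}"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            record = json.load(resp)
    except Exception:
        return vuln_id  # unknown to OSV or network issue: keep the original ID
    cves = [a for a in record.get("aliases", []) if a.startswith("CVE-")]
    return cves[0] if cves else vuln_id

def normalize(ids: set) -> set:
    return {canonical_id(i) for i in ids}

# After normalizing both tools' ID sets, re-run the set comparison to estimate how
# much of the Trivy/Grype gap is aliasing rather than genuinely different findings.
```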
In short, the study offers both a data point in 2024 and a framework for ongoing assessment, contributing to better understanding and hopefully improvement of container scanning tools.

### 3. Runtime-Aware Vulnerability Prioritization for Containerized Workloads (IEEE TDSC, 2024)

Methodology: This study addresses the problem of vulnerability overload in containers by incorporating runtime context to prioritize risks. Traditional image scanning yields a long list of CVEs, many of which may not actually be exploitable in a given container's normal operation. The authors propose a system that monitors container workloads at runtime to determine which vulnerable components are actually used (loaded or executed) and uses that information to prioritize remediation. In terms of methodology, they likely set up containerized applications and introduced known vulnerabilities, then observed the application's execution to see which vulnerabilities were reachable in practice. For example, they might use a web application in a container with some vulnerable libraries, deploy it and generate traffic, and then log which library functions or binaries get invoked. The core evaluation would compare a baseline static vulnerability list (all issues found in the container image) versus a filtered list based on runtime reachability. Key data collection involved instrumenting the container runtime or the OS to capture events like process launches, library loads, or function calls. This could be done with tools such as eBPF-based monitors, dynamic tracers, or built-in profiling in the container. The study likely constructed a runtime call graph or dependency graph for each container, wherein nodes represent code modules (or even functions) and edges represent call relationships observed at runtime. Each known vulnerability (e.g. a CVE in a library) was mapped to its code entity (function or module). If the execution trace/graph covered that entity, the vulnerability is deemed "reachable" (and thus higher priority); if not, it's "unreached" and could be deprioritized. The authors tested this approach on various workloads – possibly benchmarks or real-world container apps – and measured how much the vulnerability list can be reduced without sacrificing security. They may have measured metrics like reduction in alert volume (e.g. "X% of vulnerabilities were never invoked at runtime") and conversely coverage of actual exploits (ensuring vulnerabilities that can be exploited in the workload were correctly flagged as reachable). Empirical results likely showed a substantial drop in the number of critical/high findings when focusing only on those actually used by the application (which aligns with industry reports, e.g. Sysdig found ~85% of critical vulns in containers were in inactive code[45]). Techniques and Algorithms: The solution presented in this work can be thought of as a hybrid of static and dynamic analysis tailored to container environments. On the static side, the system needs to know what vulnerabilities could exist in the image (using an SBOM or scanner output), and ideally, the specific functions or binaries those vulnerabilities reside in. On the dynamic side, it gathers runtime telemetry to see if those functions/binaries are touched. The paper likely describes an architecture where each container is paired with a monitoring agent. One common approach is system call interception or library hooking: e.g.
using an LD_PRELOAD library or ptrace to log whenever a shared object is loaded or a process executes a certain library call. Another efficient approach is using eBPF programs attached to kernel events (like file open or exec) to catch when vulnerable libraries are loaded into memory[46][47]. The authors may have implemented a lightweight eBPF sensor (similar to what some security tools do) that records the presence of known vulnerable packages in memory at runtime. The collected data is then analyzed by an algorithm that matches it against the known vulnerability list. For example, if CVE-XXXX is in package foo v1.2 and at runtime libfoo.so was never loaded, then CVE-XXXX is marked “inactive”. Conversely, if libfoo.so loaded and the vulnerable function was called, mark it “active”. Some solutions also incorporate call stack analysis to ensure that merely loading a library doesn’t count as exploitable unless the vulnerable function is actually reached; however, determining function-level reachability might require instrumentation of the application (which could be language-specific). It’s possible the study narrowed scope to package or module-level usage as a proxy for reachability. They might also utilize container orchestrator knowledge: for example, if a container image contains multiple services but only one is ever started (via an entrypoint), code from the others might never run. The prioritization algorithm then uses this info to adjust vulnerability scores or order. A likely outcome is a heuristic like “if a vulnerability is not loaded/executed in any container instance over period X, downgrade its priority”. Conversely, if it is seen in execution, perhaps upgrade priority. Unique Features: This is one of the earlier academic works to formalize “runtime reachability” in container security. It brings concepts from application security (like runtime instrumentation and exploitability analysis) into the container context. Unique aspects include constructing a runtime model for an entire container (which may include not just one process but potentially multiple processes or microservices in the container). The paper likely introduces a framework that automatically builds a Runtime Vulnerability Graph – a graph linking running processes and loaded libraries to the vulnerabilities affecting them. This could be visualized as nodes for each CVE with edges to a “running” label if active. By doing an empirical evaluation, the authors demonstrate the practical impact: e.g., they might show a table where for each container image, the raw scanner found N vulnerabilities, but only a fraction f(N) were actually observed in use. For instance, they might report something like “across our experiments, only 10–20% of known vulnerabilities were ever invoked, drastically reducing the immediate patching workload” (this hypothetical number aligns with industry claims that ~15% of vulnerabilities are in runtime paths[45]). They likely also examine any false negatives: scenarios where a vulnerability didn’t execute during observation but could execute under different conditions. The paper might discuss coverage – ensuring the runtime monitoring covers enough behavior (they may run test traffic or use benchmarks to simulate typical usage). Another feature is potentially tying into the VEX (Vulnerability Exploitability eXchange) format – the system could automatically produce VEX statements marking vulns as not impacted if not reached, or affected if reached. 
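A stripped-down sketch of that matching step: given the shared objects observed in process memory maps and a mapping from vulnerable packages to the files they own (typically taken from the SBOM or package database), emit VEX-style statuses. The status and justification strings follow OpenVEX/CSAF VEX vocabulary; the observation mechanism here is a coarse `/proc` scan standing in for whatever eBPF or agent-based sensor a real system would use, and the example finding is illustrative.

```python
import glob
import re

def loaded_libraries() -> set:
    """Basenames of shared objects currently mapped by any visible process."""
    libs = set()
    for maps_file in glob.glob("/proc/[0-9]*/maps"):
        try:
            text = open(maps_file).read()
        except OSError:
            continue  # process exited or not permitted
        libs.update(re.findall(r"\S+\.so[\w.]*", text))
    return {path.rsplit("/", 1)[-1] for path in libs}

def vex_statements(findings, loaded_names: set):
    """findings: iterable of (cve_id, package, files_owned_by_package) tuples."""
    for cve, pkg, files in findings:
        active = any(f.rsplit("/", 1)[-1] in loaded_names for f in files)
        stmt = {
            "vulnerability": {"name": cve},
            "products": [pkg],
            "status": "affected" if active else "not_affected",
        }
        if not active:
            stmt["justification"] = "vulnerable_code_not_in_execute_path"
        yield stmt

# Example: one finding, with the file list taken from the image's package metadata.
findings = [("CVE-2023-0464", "openssl", ["/usr/lib/x86_64-linux-gnu/libssl.so.3"])]
for statement in vex_statements(findings, loaded_libraries()):
    print(statement)
```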
This would be a direct way to feed the info back into existing workflows, and it would mirror the intent of VEX (to communicate exploitability) with actual runtime evidence. Contrasting with static-only approaches: The authors probably compare their prioritized lists to CVSS-based prioritization or other heuristics. A static scanner might flag dozens of criticals, but the runtime-aware system can show which of those are "cold" code and thus de-prioritize them despite high CVSS. This aligns with a broader push in industry from volume-based management to risk-based vulnerability management, where context (like reachability, exposure, asset importance) is used. The algorithms here provide that context automatically for containers. Reproducibility: As an academic work, the authors may have provided a prototype implementation. Possibly they built their monitoring tool on open-source components (maybe extending tools like Sysdig, Falco, or writing custom eBPF in C). If the paper is early-access, code might be shared via a repository or available upon request. They would have evaluated on certain open-source applications (for example, NodeGoat or Juice Shop for web apps, or some microservice demo) – if so, they would list those apps and how they generated traffic to exercise them. The results could be reproduced by others by running the same containers and using the provided monitoring agent. They may also have created synthetic scenarios: e.g., a container with a deliberately vulnerable component that is never invoked, to ensure the system correctly flags it as not urgent. The combination of those scenarios would form a benchmark for runtime exploitability. By releasing such scenarios (or at least describing them well), they enable future researchers to test other runtime-aware tools. We expect the paper to note that while runtime data is invaluable, it's not a silver bullet: it depends on the workload exercised. Thus, reproducibility also depends on simulating realistic container usage; the authors likely detail their workload generation process (such as using test suites or stress testing tools to drive container behavior). Overall, this study provides a blueprint for integrating runtime insights into container vulnerability management, demonstrating empirically that it cuts through noise and focuses engineers on the truly critical vulnerabilities that actively threaten their running services[46][48].

### 4. Empirical Evaluation of Reachability-Based Vulnerability Analysis for Containers (USENIX Security 2024 companion)

Methodology: This work takes a closer look at "reachability-based" vulnerability analysis – i.e. determining whether vulnerabilities in a container are actually reachable by any execution path – and evaluates its effectiveness. As a companion piece (likely a short paper or poster at USENIX Security 2024), it focuses on measuring how well reachability analysis improves prioritization in practice. The authors set up experiments to answer questions like: Does knowing a vulnerability is unreachable help developers ignore it safely? How accurate are reachability determinations? What is the overhead of computing reachability? The evaluation probably involved using or developing a reachability analysis tool and testing it on real containerized applications. They may have leveraged existing static analysis (for example, Snyk's reachability for Java or GitHub's CodeQL for call graph analysis) to statically compute if vulnerable code is ever called[49][50].
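In its simplest form, static reachability is a graph search from the application's entry points to the functions a CVE maps to. A toy sketch over an adjacency-list call graph; real tools such as CodeQL or Snyk's analysis construct this graph from code, which is the hard part, so here it is simply given, and the CVE-to-function mapping is hypothetical.

```python
from collections import deque

def reachable(call_graph: dict, entrypoints: list) -> set:
    """Breadth-first search over a caller -> callees adjacency list."""
    seen, queue = set(entrypoints), deque(entrypoints)
    while queue:
        fn = queue.popleft()
        for callee in call_graph.get(fn, []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen

# Toy call graph: handle_request uses parse(), but the vulnerable render() path is never called.
call_graph = {
    "main": ["handle_request"],
    "handle_request": ["parse"],
    "parse": [],
    "render": ["vulnerable_fn"],   # dead code path in this deployment
}
vuln_map = {"CVE-2024-0001": "vulnerable_fn"}  # hypothetical CVE-to-function mapping

live = reachable(call_graph, ["main"])
for cve, fn in vuln_map.items():
    print(cve, "reachable" if fn in live else "not reachable from entry points")
```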
Additionally, they might compare static reachability with dynamic (runtime) reachability. For instance, they could take an application with known vulnerable dependencies and create different usage scenarios: one that calls the vulnerable code and one that doesn’t. Then they would apply reachability analysis to see if it correctly identifies the scenario where the vuln is truly exploitable. The “empirical evaluation” suggests they measured outcomes like number of vulnerabilities downgraded or dropped due to reachability analysis, and any missed vulnerabilities (false negatives) that reachability analysis might incorrectly ignore. They likely used a mix of container images – perhaps some deliberately insecure demo apps (with known CVEs in unused code paths) and some real-world open source projects. The analysis likely produces a before/after comparison: a table or graph showing how many critical/high vulns a pure scanner finds vs how many remain when filtering by reachability. They might also evaluate multiple tools or algorithms if available (e.g., compare a simple static call-graph reachability tool vs a more advanced one, or compare static vs dynamic results). Performance metrics like analysis time or required computational resources could be reported, since reachability analysis (especially static code analysis across a container’s codebase) can be heavy. If the evaluation is a companion to a larger tool paper, it might also validate that tool’s claims on independent benchmarks. Techniques and Scope: Reachability analysis in containers can be challenging because container images often include both system-level and application-level components. The evaluation likely distinguishes between language-specific reachability (e.g., is a vulnerable Java method ever invoked by the application’s call graph?) and component-level reachability (e.g., is a vulnerable package ever loaded or used by any process?). The authors might have implemented a static analysis pipeline that takes an image’s contents (binaries, libraries, application code) and, for a given vulnerability, tries to find a path from some entry point (like the container’s CMD or web request handlers) to the vulnerable code. One could imagine them using call graph construction for JARs or binary analysis for native code. They might also incorporate dynamic analysis by running containers with instrumented code to see if vulnerabilities trigger (similar to the runtime approach in study #3). Given it’s an “empirical evaluation,” the focus is on outcomes (how many vulns are judged reachable/unreachable and whether those judgments hold true), rather than proposing new algorithms. For example, they may report that reachability-based analysis was able to categorize perhaps 50% of vulnerabilities as unreachable, which if correct, could eliminate many false positives. But they would also check if any vulnerability deemed “unreachable” was in fact exploitable (which would be dangerous). They might introduce a concept of golden benchmarks: containers with known ground truth about vulnerability exploitability. One way to get ground truth is to use CVE proof-of-concept exploits or test cases – if an exploit exists and the service is accessible, the vulnerability is clearly reachable. If reachability analysis says “not reachable” for a known exploitable scenario, that’s a false negative. 
Conversely, if it says “reachable” for a vuln that in reality cannot be exploited in that setup, that’s a false positive (though in reachability terms, false positive means it claims a path exists when none truly does). The paper likely shares a few case studies illustrating these points. For instance, they might discuss an OpenSSL CVE present in an image – if the container never calls the part of OpenSSL that’s vulnerable (maybe it doesn’t use that feature), reachability analysis would drop it. They would confirm by attempting the known exploit and seeing it fails (because the code isn’t invoked), thereby validating the analysis. Another scenario might be a vulnerable library in a container that could be used if the user flips some configuration, even if it wasn’t used in default runs. Reachability might mark it unreachable (based on default call graph), but one could argue it’s a latent risk. The study likely acknowledges such edge cases, emphasizing that reachability is context-dependent – it answers “given the observed or expected usage”. They might therefore recommend pairing reachability analysis with threat modeling of usage patterns. Unique Observations: One important aspect the evaluation might highlight is the granularity of analysis. For example, function-level reachability (like Snyk’s approach for code[50]) can be very precise but is currently available for a limited set of languages (Java, .NET, etc.), whereas module-level or package-level reachability (like checking if a package is imported at all) is broader but might miss nuanced cases (e.g., package imported but specific vulnerable function not used). The paper could compare these: perhaps they show that coarse package-level reachability already cuts out a lot of vulns (since many packages aren’t loaded), but finer function-level reachability can go further, though at the cost of more complex analysis. They also likely discuss dynamic vs static reachability: static analysis finds potential paths even if they aren’t taken at runtime, whereas dynamic (observing a running system) finds actually taken paths[51][52]. The ideal is to combine them (static to anticipate all possible paths; dynamic to confirm those taken in real runs). The evaluation might reveal that static reachability sometimes over-approximates (flagging something reachable that never happens in production), whereas dynamic under-approximates (only sees what was exercised in tests). A balanced approach could be to use static analysis with some constraints derived from runtime profiling – perhaps something the authors mention for future work. Another unique feature could be integration with container build pipelines: they might note that reachability analysis could be integrated into CI (for example, analyzing code after a build to label vulnerabilities as reachable or not before deployment). Reproducibility: The authors likely make their evaluation setup available or at least well-documented. This might include a repository of container images and corresponding application source code used in the tests, plus scripts to run static analysis tools (like CodeQL or Snyk CLI in reachability mode) against them. If they developed their own reachability analyzer, they might share that as well. They might also provide test harnesses that simulate realistic usage of the containers (since reachability results can hinge on how the app is driven). By providing these, others can reproduce the analysis and verify the effectiveness of reachability-based prioritization. 
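Once such ground-truth fixtures exist, scoring a reachability tool against them is straightforward bookkeeping. A sketch, assuming a hand-labeled truth map per container scenario; the labels and the tool verdicts shown are fabricated placeholders.

```python
def score(ground_truth: dict, verdicts: dict) -> dict:
    """ground_truth/verdicts: CVE id -> True if reachable. Missing verdicts default to True."""
    tp = sum(1 for c, r in ground_truth.items() if r and verdicts.get(c, True))
    fn = sum(1 for c, r in ground_truth.items() if r and not verdicts.get(c, True))
    fp = sum(1 for c, r in ground_truth.items() if not r and verdicts.get(c, True))
    tn = sum(1 for c, r in ground_truth.items() if not r and not verdicts.get(c, True))
    return {"true_pos": tp, "false_neg": fn, "false_pos": fp, "true_neg": tn}

# A "golden" scenario: the exploit for CVE-B is known to work; CVE-A's code is dormant.
ground_truth = {"CVE-A": False, "CVE-B": True}
tool_verdicts = {"CVE-A": False, "CVE-B": True}   # what the tool under test reported
print(score(ground_truth, tool_verdicts))
# A false_neg here ("not reachable" for an exploitable CVE) is the dangerous case.
```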
The notion of "golden benchmarks" in this context could refer to a set of container scenarios with known outcomes – for example, a container where we know vulnerability X is unreachable. Those benchmarks can be used to evaluate any reachability tool. If the paper indeed created such scenarios (possibly by tweaking sample apps to include a dormant vulnerable code path), that's a valuable contribution for future research. In summary, this study empirically demonstrates that reachability analysis is a promising strategy to reduce vulnerability noise in containers, but it also clarifies its limitations. Likely results show a significant drop in the number of urgent vulnerabilities when using reachability filtering, confirming the value of the approach. At the same time, the authors probably caution that reachability is not absolute – environment changes or atypical use could activate some of those "unreachable" vulns, so organizations should use it to prioritize, not to completely ignore certain findings unless confident in the usage constraints. Their evaluation provides concrete data to back the intuition that focusing on reachable vulnerabilities can improve remediation focus without markedly increasing risk.

### 5. Beyond the Scan: The Future of Snyk Container (Snyk industry report, Nov 2025)

Context and Methodology: This industry report (a blog post by Snyk's product team) outlines the next-generation features Snyk is introducing for container security, shifting from pure scanning to a more holistic, continuous approach. While not a traditional study with experiments, it provides insight into the practical implementation of runtime-based prioritization and supply chain security in a commercial tool. Snyk observes that just scanning container images at build time isn't enough: new vulnerabilities emerge after deployment, and many "theoretical" vulns never pose a real risk, causing alert fatigue[53][54]. To address this, Snyk Container's roadmap includes: (a) Continuous registry monitoring, (b) Runtime insights for prioritization, and (c) a revamped UI/UX to combine these contexts. In effect, Snyk is connecting the dots across the container lifecycle – from development to production – and feeding production security intelligence back to developers. Key Features and Techniques: First, Continuous Registry Sync is described as continuously watching container images in registries for new vulnerabilities[55]. Instead of a one-time scan during CI, Snyk's service will integrate with container registries (Docker Hub, ECR, etc.) to maintain an up-to-date inventory of images and automatically flag them when a new CVE affects them[56]. This is a shift to a proactive monitoring model: teams get alerted immediately if yesterday's "clean" image becomes vulnerable due to a newly disclosed CVE, without manually rescanning. They mention using rich rules to filter which images to monitor (e.g. focus on latest tags, or prod images)[57], and support for multiple registries per organization for complete coverage[58]. The value is eliminating "ticking time bombs" sitting in registries unnoticed[55][59], thus tightening the feedback loop so devs know if a deployed image suddenly has a critical issue. Secondly, and most relevant to runtime prioritization, Snyk is adding ingestion of runtime signals[60]. Specifically, Snyk will gather data on which packages in the container are actually loaded and in use at runtime[46].
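Conceptually, such runtime signals fold into a simple ranking function. A hedged sketch follows; the weights and field names are invented for illustration and are in no way Snyk's actual scoring model.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    cve: str
    severity: str          # "critical" | "high" | "medium" | "low"
    package_loaded: bool   # runtime signal: package observed in memory
    running_in_prod: bool  # image is deployed, per registry/cluster inventory
    fix_available: bool

SEVERITY = {"critical": 4, "high": 3, "medium": 2, "low": 1}

def risk_score(f: Finding) -> float:
    score = SEVERITY[f.severity]
    score *= 3 if f.package_loaded else 0.5   # de-prioritize never-loaded packages
    score *= 2 if f.running_in_prod else 1
    score += 1 if f.fix_available else 0      # actionable issues bubble up
    return score

findings = [  # hypothetical CVE IDs
    Finding("CVE-2024-1111", "critical", package_loaded=False, running_in_prod=True, fix_available=True),
    Finding("CVE-2024-2222", "high", package_loaded=True, running_in_prod=True, fix_available=True),
]
for f in sorted(findings, key=risk_score, reverse=True):
    print(f.cve, round(risk_score(f), 1))
```

Note how the loaded "high" outranks the never-loaded "critical"; getting the `package_loaded` bit is, of course, the hard part.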
This implies deploying some sensor in the running environment (likely via partners or an agent) to detect loaded modules – for example, detecting loaded classes in a JVM or loaded shared libraries in a Linux container. Unlike other tools that might just show runtime issues (like an observed exploit attempt), Snyk plans to use runtime usage data to enhance the scan results for developers[61]. Essentially, vulnerabilities in packages that are never loaded would be de-prioritized, whereas those in actively-used code would be highlighted. Snyk calls this “true risk-based prioritization” achieved by understanding actual usage in memory[46]. The runtime context will initially integrate with the registry monitoring – e.g., within the registry view, you can prioritize images that are known to be running in production and filter their issues by whether they’re in-use or not[62][63]. Later, it will be surfaced directly in the developer’s issue list as a “runtime reachability” signal on each vulnerability[64]. For example, a vulnerability might get a tag if its package was seen running in prod vs. a tag if it was not observed, influencing its risk score. This closes the loop: developers working in Snyk can see which findings really matter (because those packages are part of the live application), cutting through the noise of hypothetical issues. Snyk explicitly contrasts this with tools that only show “what’s on fire in production” – they want to not only detect issues in prod, but funnel that info back to earlier stages to prevent fires proactively[61]. To support these changes, Snyk is also redesigning its Container security UI. They mention a new inventory view where each container image has a consolidated overview including its vulnerabilities, whether it’s running (and where), and the new runtime exploitability context[65][66]. In a mock-up, clicking an image shows all its issues but with clear indication of which ones are “truly exploitable” in your environment[66]. This likely involves highlighting the subset of vulnerabilities for which runtime signals were detected (e.g., “this library is loaded by process X in your Kubernetes cluster”) – effectively integrating a VEX-like judgement (“exploitable” or “not exploited”) into the UI. They emphasize this will help cut noise and guide developers to focus on fixes that matter[66]. Beyond runtime aspects, the report also touches on container provenance and supply chain: Snyk is partnering with providers of hardened minimalist base images (Chainguard, Docker Official, Canonical, etc.) to ensure they can scan those properly and help devs stay on a secure base[67]. They advocate using distroless/hardened images to reduce the initial vuln count, and then using Snyk to continuously verify that base image stays secure (monitoring for new vulns in it)[68][69] and to scan any additional layers the dev adds on top[70]. This two-pronged approach (secure base + continuous monitoring + scanning custom code) aligns with modern supply chain security practices. They also mention upcoming policy features to enforce best practices (like blocking deployments of images with certain vulns or requiring certain base images)[71], which ties into governance. Relation to Prioritization Approaches: Snyk’s planned features strongly echo the findings of the academic studies: They specifically tackle the problem identified in studies #1 and #2 (overwhelming vulnerability lists and inconsistency over time) by doing continuous updates and focusing on relevant issues. 
And they implement what studies #3 and #4 explore, by using runtime reachability to inform prioritization. The difference is in implementation at scale: Snyk's approach needs to work across many languages and environments, so they likely leverage integrations (possibly using data from orchestration platforms or APM tools rather than heavy custom agents). The blog hints that the beta of runtime insights will start in early 2026[72], implying they are actively building these capabilities (possibly in collaboration with firms like Dynatrace or Sysdig who already collect such data). Notably, Snyk's messaging is that this is not just about responding to runtime attacks, but about preventing them by informing developers – a "shift left" philosophy augmented by runtime data. Unique Perspective: This industry report gives a forward-looking view that complements the academic work by describing how these ideas are productized. Unique elements include the notion of continuous scanning (most academic works assume scans happen periodically or at points in time, while here it's event-driven by new CVE disclosures) and the integration of multiple contexts (dev, registry, runtime) into one platform. Snyk is effectively combining SBOM-based scanning, CVE feeds, runtime telemetry, and even AI-powered remediation suggestions (they mention AI for fixes and predicting breaking changes in upgrades[73]). The result is a more dev-friendly prioritization – instead of a raw CVSS sorting, issues will be ranked by factors like reachable at runtime, present in many running containers, has a fix available, etc. For instance, if only 5 of 50 vulns in an image are in loaded code, those 5 will bubble to the top of the fix list. The report underscores solving alert fatigue[74], which is a practical concern echoed in academic literature as well. Reproducibility/Deployment: While not a study to reproduce, it indicates that these features will be rolled out to users (closed beta for some in late 2025, broader in 2026)[72]. Snyk's approach will effectively test in the real world what the studies hypothesized: e.g., will developers indeed fix issues faster when told "this is actually running in prod memory" vs. ignoring long scanner reports? Snyk is likely to measure success by reductions in mean-time-to-fix for reachable vulns and possibly a reduction in noise (perhaps they will later publish metrics on how many vulnerabilities get filtered out as not loaded, etc.). It shows industry validation of the runtime prioritization concept – by 2025, leading vendors are investing in it. In summary, "Beyond the Scan" highlights the evolving best practices for container security: don't just scan and forget; continuously monitor for new threats, and contextualize vulnerabilities with runtime data to focus on what truly matters[46]. This matches the guidance that engineers building a platform like Stella Ops could take: incorporate continuous update feeds, integrate with runtime instrumentation to gather exploitability signals, and present all this in a unified, developer-centric dashboard to drive remediation where it counts.

### 6. Container Provenance and Supply Chain Integrity under In-Toto/DSSE (NDSS 2024)

Objective and Context: This NDSS 2024 work addresses container provenance and supply chain security, focusing on using the in-toto framework and DSSE (Dead Simple Signing Envelope) for integrity.
In-toto is a framework for tracking the chain of custody in software builds – it records who did what in the build/test/release process and produces signed metadata (attestations) for each step. DSSE is a signing specification (used by in-toto and Sigstore) that provides a standardized way to sign and verify these attestations. The study likely investigates how to enforce and verify container image integrity using in-toto attestations and what the performance or deployment implications are. For example, it might ask: Can we ensure that a container image running in production was built from audited sources and wasn’t tampered with? What overhead does that add? The paper appears to introduce “Scudo”, a system or approach that combines in-toto with Uptane (an update security framework widely used in automotive)[75]. The connection to Uptane suggests they might have looked at delivering secure updates of container images in potentially distributed or resource-constrained environments (like IoT or vehicles), but the principles apply generally to supply chain integrity. Methodology: The researchers likely designed a supply chain pipeline instrumented with in-toto. This involves defining a layout (the expected steps, e.g., code build, test, image build, scan, sign) and having each step produce a signed attestation of what it did (using DSSE to encapsulate the attestation and sign it). They then enforce verification either on the client that pulls the container or on a registry. The study probably included a practical deployment or prototype of this pipeline – for instance, building a containerized app with in-toto and then deploying it to an environment that checks the attestations before running the image. They mention a “secure instantiation of Scudo” that they deployed, which provided “robust supply chain protections”[75]. Empirical evaluation could involve simulating supply chain attacks to see if the system stops them. For example, they might try to insert a malicious build script or use an unauthorized compiler and show that the in-toto verification detects the deviation (since the signature or expected materials won’t match). They also looked at the cost of these verifications. One highlight from the text is that verifying the entire supply chain on the client (e.g., on an embedded device or at deployment time) is inefficient and largely unnecessary if multiple verifications are done on the server side[76]. This implies they measured something like the time it takes or the bandwidth needed for a client (like a car’s head unit or a Kubernetes node) to verify all attestations versus a scenario where a central service (like a secure registry) already vetted most of them. Possibly, they found that pushing full in-toto verification to the edge could be slow or memory-intensive, so they propose verifying heavy steps upstream and having the client trust a summary. This is akin to how Uptane works (the repository signs metadata indicating images are valid, and the client just checks that metadata). Algorithms and DSSE Usage: The use of DSSE signatures is central. DSSE provides a secure envelope where the content (e.g., an in-toto statement about a build step) is digested and signed, ensuring authenticity and integrity[77]. In-toto typically generates a link file for each step with fields like materials (inputs), products (outputs), command executed, and the signing key of the functionary. 
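The envelope itself is small. A hedged sketch of DSSE's pre-authentication encoding (PAE) and an Ed25519 signature over a schematic in-toto-style step attestation, using the `cryptography` package; key management, key IDs, certificate chains, and the full in-toto link/statement schema are omitted, and the attestation fields shown are illustrative rather than spec-complete.

```python
import base64
import json
from cryptography.hazmat.primitives.asymmetric import ed25519

def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE pre-authentication encoding: the exact bytes that get signed."""
    return b"DSSEv1 %d %s %d %s" % (
        len(payload_type), payload_type.encode(), len(payload), payload)

def dsse_envelope(statement: dict, key: ed25519.Ed25519PrivateKey) -> dict:
    payload = json.dumps(statement, sort_keys=True).encode()
    payload_type = "application/vnd.in-toto+json"
    sig = key.sign(pae(payload_type, payload))
    return {
        "payloadType": payload_type,
        "payload": base64.b64encode(payload).decode(),
        "signatures": [{"sig": base64.b64encode(sig).decode()}],
    }

key = ed25519.Ed25519PrivateKey.generate()
statement = {  # schematic step attestation, not a complete in-toto link file
    "step": "image-build",
    "materials": {"src.tar.gz": "sha256:..."},
    "products": {"app:1.0": "sha256:..."},
}
envelope = dsse_envelope(statement, key)
key.public_key().verify(
    base64.b64decode(envelope["signatures"][0]["sig"]),
    pae(envelope["payloadType"], base64.b64decode(envelope["payload"])))
print("envelope verified")
```

Because the payload type is folded into the PAE, a signature cannot be replayed against the same bytes under a different payload type.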
The system likely set up a chain of trust: e.g., developer’s key signs the code commit, CI’s key signs the build attestation, scanner’s key signs a “Vulnerability-free” attestation (or a VEX saying no exploitable vulns), and finally a release key signs the container image. They might have used delegation or threshold signatures (in-toto allows requiring, say, two out of three code reviewers to sign off). The algorithms include verifying that each step’s attestation is present and signed by an authorized key, and that the contents (hashes of artifacts) match between steps (supply chain link completeness). Scudo appears to integrate Uptane – Uptane is a framework for secure over-the-air updates, which itself uses metadata signed by different roles (director, image repository) to ensure vehicles only install authentic updates. Combining Uptane with in-toto means not only is the final image signed (as Uptane would ensure) but also the build process of that image is verified. This addresses attacks where an attacker compromises the build pipeline (something Uptane alone wouldn’t catch, since Uptane assumes the final binary is legitimate and just secures distribution). Scudo’s design likely ensures that by the time an image or update is signed for release (per Uptane), it comes with in-toto attestations proving it was built securely. They likely had to optimize this for lightweight verification. The note that full verification on vehicle was unnecessary implies their algorithm divides trust: the repository or cloud service verifies the in-toto attestations (which can be heavy, involving possibly heavy crypto and checking many signatures), and if all is good, it issues a final statement (or uses Uptane’s top-level metadata) that the vehicle/consumer verifies. This way, the client does a single signature check (plus maybe a hash check of image) rather than dozens of them. Unique Features and Findings: One key result from the snippet is that Scudo is easy to deploy and can efficiently catch supply chain attacks[78]. The ease of deployment likely refers to using existing standards (in-toto is CNCF incubating, DSSE is standardized, Uptane is an existing standard in automotive) – so they built on these rather than inventing new crypto. The robust protection claim suggests that in a trial, Scudo was able to prevent successful software supply chain tampering. For instance, if an attacker inserted malicious code in a dependency without updating the in-toto signature, Scudo’s verification would fail and the update would be rejected. Or if an attacker compromised a builder and tried to produce an image outside the defined process, the lack of correct attestation would be detected. They might have demonstrated scenarios like “provenance attack” (e.g., someone tries to swap out the base image for one with malware): in-toto would catch that because the base image hash wouldn’t match the expected material in the attestation. DSSE ensures that all these records are tamper-evident; an attacker can’t alter the attestation logs without invalidating signatures. The study likely emphasizes that cryptographic provenance can be integrated into container delivery with acceptable overhead. Any performance numbers could include: size of metadata per image (maybe a few kilobytes of JSON and signatures), verification time on a client (maybe a few milliseconds if only final metadata is checked, or a second or two if doing full in-toto chain verify). 
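Wherever the verification runs, the core check is the same: each step must be attested by an authorized key, and artifact hashes must flow consistently from one step's products to the next step's materials. A schematic sketch of just that bookkeeping, with invented step and key names, and with the cryptographic signature check assumed to have happened already (e.g. via the DSSE sketch above); real in-toto layout verification does considerably more.

```python
def verify_chain(layout: list, attestations: dict) -> bool:
    """layout: ordered steps with their authorized key IDs.
    attestations: step name -> {signer_keyid, materials, products}."""
    previous_products = {}
    for step in layout:
        att = attestations.get(step["name"])
        if att is None or att["signer_keyid"] not in step["authorized_keyids"]:
            return False  # missing step or unauthorized functionary
        for artifact, digest in att["materials"].items():
            if artifact in previous_products and previous_products[artifact] != digest:
                return False  # artifact changed between steps: possible tampering
        previous_products.update(att["products"])
    return True

layout = [
    {"name": "build", "authorized_keyids": {"ci-key"}},
    {"name": "scan", "authorized_keyids": {"scanner-key"}},
]
attestations = {
    "build": {"signer_keyid": "ci-key", "materials": {}, "products": {"app.img": "sha256:aaa"}},
    "scan": {"signer_keyid": "scanner-key", "materials": {"app.img": "sha256:aaa"}, "products": {}},
}
print(verify_chain(layout, attestations))  # True; change a digest or key ID to see it fail
```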
They might also discuss scalability – e.g., how to manage keys and signatures in large organizations (which keys sign what, rotation, etc.). DSSE plays a role in simplifying verification, as it provides a unifying envelope format for different signature types, making automation easier. Another unique aspect is bridging supply chain levels: Many supply chain protections stop at verifying a container image’s signature (ensuring it came from a trusted source). This work ensures the content of the container is also trustworthy by verifying the steps that built it. Essentially, it extends trust “all the way to source”. This is aligned with frameworks like Google’s SLSA (Supply-chain Levels for Software Artifacts), which define levels of build integrity – in-toto/DSSE are key to achieving SLSA Level 3/4 (provenance attested and verified). The paper likely references such frameworks and perhaps demonstrates achieving a high-assurance build of a container that meets those requirements. Reproducibility and Applicability: Being an academic paper, they may have built an open-source prototype of Scudo or at least used open tooling (in-toto has a reference implementation in Python/Go). The usage of Uptane suggests they might have targeted a specific domain (vehicles or IoT) for deployment, which might not be directly reproducible by everyone. However, they likely provide enough detail that one could apply the approach to a standard CI/CD pipeline for containers. For instance, they might outline how to instrument a Jenkins or Tekton pipeline with in-toto and how to use Cosign (a DSSE-based signer) to sign the final image. If any proprietary components were used (maybe a custom verifier on an embedded device), they would describe its logic for verification. Given NDSS’s focus, security properties are formally stated – they might present a threat model and argue how their approach thwarts each threat (malicious insider trying to bypass build steps, compromised repo, etc.). They possibly also discuss what it doesn’t protect (e.g., if the compiler itself is malicious but considered trusted, that’s outside scope – though in-toto could even track compilers if desired). A notable subtlety is that multiple points of verification means the supply chain security doesn’t rely on just one gate. In Scudo, there might be a verification at the registry (ensuring all in-toto attestations are present) and another at deployment. The finding that verifying everything on the client is “largely unnecessary”[76] suggests trust is placed in the repository to do thorough checks. That is a pragmatic trade-off: it’s like saying “our secure container registry verifies the provenance of images before signing them as approved; the Kubernetes cluster only checks that final signature.” This two-level scheme still protects against tampered images (since the cluster won’t run anything not blessed by the registry), and the registry in turn won’t bless an image unless its provenance chain is intact. This offloads heavy lifting from runtime environments (which might be constrained, or in vehicles, bandwidth-limited). The paper likely validates that this approach doesn’t weaken security significantly, as long as the repository system is trusted and secured. Implications: For engineers, this study demonstrates how to implement end-to-end supply chain verification for containers. 
Using in-toto attestations signed with DSSE means one can trace an image back to source code and ensure each step (build, test, scan) was performed by approved tools and people. The DSSE logic is crucial – it ensures that when you verify an attestation, you’re verifying exactly what was signed (DSSE’s design prevents certain vulnerabilities in naive signing like canonicalization issues). The combination with Uptane hints at real-world readiness: Uptane is known for updating fleets reliably. So Scudo could be used to securely push container updates to thousands of nodes or devices, confident that no one has inserted backdoors in the pipeline. This approach mitigates a range of supply chain attacks (like the SolarWinds-type attack or malicious base images) by requiring cryptographic evidence of integrity all along. In conclusion, this NDSS paper highlights that container security isn’t just about vulnerabilities at runtime, but also about ensuring the container’s content is built and delivered as intended. By using in-toto and DSSE, it provides a framework for provenance attestation in container supply chains, and empirically shows it can be done with reasonable efficiency[79][75]. This means organizations can adopt similar strategies (there are even cloud services now adopting in-toto attestations as part of artifacts – e.g., Sigstore’s cosign can store provenance). For a platform like Stella Ops, integrating such provenance checks could be a recommendation: not only prioritize vulnerabilities by reachability, but also verify that the container wasn’t tampered with and was built in a secure manner. The end result is a more trustworthy container deployment pipeline: you know what you’re running (thanks to provenance) and you know which vulns matter (thanks to runtime context). Together, the six studies and industry insights map out a comprehensive approach to container security, from the integrity of the build process to the realities of runtime risk. Sources: The analysis above draws on information from each referenced study or report, including direct data and statements: the VEX tools consistency study[9][13], the Trivy vs Grype comparative analysis[29][32], the concept of runtime reachability[51][48], Snyk’s product vision[46][56], and the NDSS supply chain security findings[79][75]. 
---

* [1]–[16]: "Vexed by VEX tools: Consistency evaluation of container vulnerability scanners" (arXiv:2503.14388) – https://ar5iv.org/html/2503.14388v1
* [17]–[44]: "A Comparative Analysis of Docker Image Security" (SCAM 2024, Montana State University) – https://www.cs.montana.edu/izurieta/pubs/SCAM2024.pdf
* [45]: "Vulnerability Prioritization – Combating Developer Fatigue" – Sysdig – https://www.sysdig.com/blog/vulnerability-prioritization-fatigue-developers
* [46], [53]–[74]: "Beyond the Scan: The Future of Snyk Container" – Snyk – https://snyk.io/blog/future-snyk-container/
* [47], [48], [51], [52]: "Dynamic Reachability Analysis for Real-Time Vulnerability Management" – Orca Security – https://orca.security/resources/blog/dynamic-reachability-analysis/
* [49], [50]: "Reachability analysis" – Snyk User Docs – https://docs.snyk.io/manage-risk/prioritize-issues-for-fixing/reachability-analysis
* [75], [76], [78], [79]: Symposium on Vehicle Security and Privacy (VehicleSec) 2024 Program – NDSS Symposium – https://www.ndss-symposium.org/ndss-program/vehiclesec-2024/
* [77]: "in-toto and SLSA" – https://slsa.dev/blog/2023/05/in-toto-and-slsa