Here’s a crisp plan to publish a small, public “vulnerable binaries” dataset (PHP, JS, C#) and a way to compare reachability results across tools—so you can ship something useful fast, gather feedback, and iterate.
Scope (MVP)
- Languages: PHP (composer), JavaScript (npm), C# (.NET).
- Artifacts per sample: 1) minimal app, 2) lockfile, 3) SBOM (CycloneDX JSON), 4) VEX (OSV/CycloneDX VEX), 5) ground‑truth reachability notes, 6) scriptable repro (Docker).
- Size: 3–5 samples per language (9–15 total). Keep each sample ≤200 LOC.
Repo layout
vuln-reach-dataset/
  LICENSE
  README.md
  schema/
    ground-truth.schema.json
    run-matrix.schema.json
  runners/
    run_all.sh
    run_all.ps1
  results/
    <tool>/<lang>/<sample>/run.json
  samples/
    php/...
    js/...
    csharp/...
“Ground truth” format (minimal)
{
"sample_id": "php-001-phar-deserialize",
"lang": "php",
"package_manager": "composer",
"vuln_ids": ["CVE-2019-XXXX","OSV:GHSA-..."],
"entrypoints": ["public/index.php"],
"reachable_symbols": [
{"purl":"pkg:composer/vendor/package@1.2.3","symbol":"Vendor\\Unsafe::unserialize"},
{"purl":"pkg:composer/monolog/monolog@2.9.0","symbol":"Monolog\\Logger::pushHandler","note":"benign"}
],
"evidence": [
{"type":"path","file":"public/index.php","line":18,"desc":"tainted input -> unserialize"},
{"type":"exec","cmd":"curl 'http://localhost/?p=O:...'", "result":"triggered sink"}
]
}
Samples to include (suggested)
PHP (composer)
- php-001-phar-deserialize
  - Risk: unsafe unserialize() on user input; optional PHAR gadget.
  - Ground truth: reachable sink unserialize.
- php-002-xxe-simplexml
  - Risk: XML external entity in simplexml_load_string with libxml protections off (e.g. LIBXML_NOENT passed, which enables entity substitution).
  - Ground truth: reachable XXE sink.
- php-003-ssrf-guzzle
  - Risk: user‑controlled URL into Guzzle client.
  - Ground truth: SSRF call chain to Client::request.
JavaScript (npm)
- js-001-prototype-pollution
  - Risk: lodash.merge (known vulns historically) with user object.
  - Ground truth: polluted {__proto__} path reaches object creation site.
- js-002-yaml-unsafe-load
  - Risk: js-yaml load on untrusted text.
  - Ground truth: call to load reachable from HTTP route.
- js-003-ssrf-node-fetch
  - Risk: user URL to node-fetch.
  - Ground truth: request issued to attacker-controlled host.
C# (.NET)
- cs-001-binaryformatter-deserialize
  - Risk: BinaryFormatter.Deserialize on user input (legacy).
  - Ground truth: reachable call to Deserialize.
- cs-002-processstartinfo-injection
  - Risk: Process.Start with unsanitized arg (Windows/Linux).
  - Ground truth: taint to Process.Start.
- cs-003-xmlreader-xxe
  - Risk: insecure XmlReader settings (DtdProcessing = Parse).
  - Ground truth: external entity resolved.
Each sample should:
- Pin a known vulnerable version in lockfile.
- Provide a positive (reachable) and negative (not reachable) path.
- Include a tiny HTTP entrypoint to exercise the path.
SBOM & VEX per sample
- CycloneDX 1.6 JSON SBOM produced via native tool (composer, npm, dotnet) + converter.
- VEX: one document marking the vulnerability as exploitable for the positive path (in CycloneDX VEX, analysis.state = exploitable) and as not_affected for the negative path, with a justification (e.g., code_not_reachable: “vulnerable code not invoked”).
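As a rough illustration of the VEX rule above, the following Python sketch builds a minimal CycloneDX-style VEX document for either path. The function name `build_vex` and its parameters are hypothetical; the only values taken from the CycloneDX vocabulary are the `exploitable` and `not_affected` states and the `code_not_reachable` justification.

```python
import json


def build_vex(sample_id: str, vuln_id: str, purl: str,
              reachable: bool, detail: str) -> dict:
    """Build a minimal CycloneDX-style VEX document (hypothetical helper)."""
    analysis = {"detail": detail}
    if reachable:
        # Positive path: the vulnerable code can be reached and exploited.
        analysis["state"] = "exploitable"
    else:
        # Negative path: the package is present but the sink is never invoked.
        analysis["state"] = "not_affected"
        analysis["justification"] = "code_not_reachable"
    return {
        "bomFormat": "CycloneDX",
        "specVersion": "1.6",
        "version": 1,
        "metadata": {"component": {"type": "application", "name": sample_id}},
        "vulnerabilities": [
            {"id": vuln_id, "affects": [{"ref": purl}], "analysis": analysis}
        ],
    }


if __name__ == "__main__":
    doc = build_vex(
        "js-002-yaml-unsafe-load",
        "OSV:PLACEHOLDER-js-yaml-unsafe-load",
        "pkg:npm/js-yaml@4.1.0",
        reachable=True,
        detail="yaml.load() reachable from POST /parse-unsafe.",
    )
    print(json.dumps(doc, indent=2))
```

Generating the file from one function keeps the positive and negative documents structurally identical, so only the analysis block differs between them.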
Runner & result format (tool-agnostic)
- Runners call each selected tool (e.g., “ToolA”, “ToolB”), then normalize outputs to:
{
"tool": "ToolA",
"version": "x.y.z",
"sample_id": "js-002-yaml-unsafe-load",
"detected_vulns": ["OSV:GHSA-..."],
"reachable_symbols_reported": [
{"purl":"pkg:npm/js-yaml@4.1.0","symbol":"load"}
],
"verdict": {
"reachable": true,
"confidence": 0.92
},
"raw": "path/to/original/tool/output.json"
}
Comparison metrics
For each sample:
- TP (tool says reachable & ground truth reachable)
- FP (tool says reachable but ground truth not reachable)
- FN (tool says not reachable but ground truth reachable)
- TN (tool says not reachable & ground truth not reachable)
Aggregate per language & tool:
- Precision, recall, F1, and Reachability Accuracy = (TP+TN)/All.
- Optional: Path depth agreement (did tool cite the expected symbol/edge?).
- Optional: Time-to-result (seconds) and scan mode (static, dynamic, hybrid).
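The metrics above can be sketched in a few lines of Python. This is a minimal illustration, not part of the dataset tooling: the `Confusion` class and its method names are hypothetical, and undefined ratios (for example precision when the tool reports nothing as reachable) are returned as 0.0 by assumption.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion-matrix counts for one tool (optionally one language)."""
    tp: int = 0
    fp: int = 0
    fn: int = 0
    tn: int = 0

    def add(self, predicted: bool, actual: bool) -> None:
        """Record one sample: tool verdict vs. ground truth."""
        if predicted and actual:
            self.tp += 1
        elif predicted and not actual:
            self.fp += 1
        elif not predicted and actual:
            self.fn += 1
        else:
            self.tn += 1

    def metrics(self) -> dict:
        """Return precision, recall, F1 and reachability accuracy."""
        def ratio(num: int, den: int) -> float:
            # Assumption: an undefined ratio is reported as 0.0.
            return num / den if den else 0.0

        precision = ratio(self.tp, self.tp + self.fp)
        recall = ratio(self.tp, self.tp + self.fn)
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        total = self.tp + self.fp + self.fn + self.tn
        return {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "accuracy": ratio(self.tp + self.tn, total),
        }


if __name__ == "__main__":
    c = Confusion()
    for predicted, actual in [(True, True), (True, False), (False, True)]:
        c.add(predicted, actual)
    print(c.metrics())
```

Keeping one `Confusion` instance per (tool, language) pair gives the per-language aggregates directly; summing the four counters across languages gives the per-tool totals.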
Minimal example (JS) — samples/js/js-002-yaml-unsafe-load
package.json
package-lock.json
server.js # express route POST /parse -> js-yaml load(body.text)
README.md
sbom.cdx.json
vex.cdx.json
repro.sh # npm ci; node server.js; curl -XPOST ...
GROUND_TRUTH.json
- Positive path: POST {"text":"a: &a 1\nb: *a"} to exercise parser.
- Negative path: guarded route that rejects user input unless whitelisted.
Publishing checklist
- License: CC BY 4.0 (dataset) + MIT (runners).
- Data hygiene: no real secrets; deterministic scripts; pinned versions.
- Repro: one‑command docker compose up per language.
- Docs:
  - What is “reachability”? (vulnerable code is actually callable from app inputs).
  - How we built ground truth (static review + runnable PoC).
  - How to add a new sample (template folder + PR checklist).
Fast path to first release (1–2 days of focused work)
- Ship one sample per language with full ground truth + SBOM/VEX.
- Include one tool runner (even a no‑op placeholder) and the result schema.
- Add a results/README with the confusion‑matrix table filled for these 3 samples.
- Open issues inviting contributions: more samples, more tools, more sinks.
Why this helps
- Creates a neutral, reproducible yardstick for reachability.
- Lets vendors & researchers compare apples to apples.
- Encourages PRs (small, self‑contained samples) and early citations for Stella Ops.
If you want, I can generate the repo skeleton (folders, sample stubs, JSON schemas, and runner scripts) so you can push it directly to GitHub.
Here’s a “drop in” developer guide + concrete samples you can paste into your repo (or split into README.md / docs/DEVELOPER_GUIDE.md). I’ll show:
- How the project is structured
- Very detailed example samples for PHP, JS, C#
- How tool authors integrate their reachability tool
- How contributors add new samples
You can tweak names/IDs, but everything below is self‑consistent.
1. Repository structure (recap)
vuln-reach-dataset/
  README.md
  docs/
    DEVELOPER_GUIDE.md # this file (or paste sections into README)
  schema/
    ground-truth.schema.json
    run-matrix.schema.json
  samples/
    php/
      php-001-phar-deserialize/
      php-002-xxe-simplexml/
      ...
    js/
      js-002-yaml-unsafe-load/
      ...
    csharp/
      cs-001-binaryformatter-deserialize/
      ...
  runners/
    run_all.sh
    run_all.ps1
    run_with_tool_mytool.py # example tool integration
  results/
    mytool/
      php/php-001-phar-deserialize/run.json
      js/js-002-yaml-unsafe-load/run.json
      ...
Core idea:
Each samples/<lang>/<sample_id>/ folder contains:
- A minimal runnable app containing a known vulnerability
- A positive path (vulnerable code reachable) and (ideally) a negative path (package present but not reachable)
- GROUND_TRUTH.json describing what is actually reachable
- SBOM + VEX files describing vulnerabilities at the component level
- A repro.sh script to run the app and trigger the bug
Tool authors plug in by reading each sample folder, running their scanner, and writing normalized results to results/<tool>/<lang>/<sample_id>/run.json.
2. Ground truth schema (what tools are judged against)
Minimal JSON format (you can store a full JSON Schema in schema/ground-truth.schema.json):
{
"sample_id": "php-001-phar-deserialize",
"lang": "php",
"package_manager": "composer",
"vuln_ids": [
"OSV:PLACEHOLDER-2019-XXXX"
],
"entrypoints": [
"public/index.php"
],
"reachable_symbols": [
{
"purl": "pkg:composer/example/vendor@1.2.3",
"symbol": "Example\\Unsafe::unserialize",
"kind": "sink",
"note": "User-controlled input can reach this sink in /?mode=unsafe&data=..."
},
{
"purl": "pkg:composer/example/vendor@1.2.3",
"symbol": "Example\\Unsafe::unserialize",
"kind": "sink",
"note": "NOT reached in /?mode=safe (negative path)."
}
],
"evidence": [
{
"type": "path",
"file": "public/index.php",
"line": 25,
"desc": "Tainted $_GET['data'] flows into Example\\Unsafe::unserialize"
},
{
"type": "exec",
"cmd": "curl 'http://localhost:8000/?mode=unsafe&data=...payload...'",
"result": "Trigger behavior / exploit / exception"
}
]
}
Fields are intentionally simple:
- reachable_symbols describes what is reachable and from which package/version.
- evidence explains why we marked it reachable (code path + repro command).
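Until a full JSON Schema exists, a small structural check can catch most mistakes in a GROUND_TRUTH.json. The sketch below uses only the standard library; the function name `validate_ground_truth` is hypothetical, and treating every top-level field from the example as required is an assumption, since the guide does not say which fields are mandatory.

```python
import json
import sys

# Assumption: every top-level field shown in the example is required.
REQUIRED_FIELDS = {
    "sample_id": str,
    "lang": str,
    "package_manager": str,
    "vuln_ids": list,
    "entrypoints": list,
    "reachable_symbols": list,
    "evidence": list,
}


def validate_ground_truth(doc: dict) -> list:
    """Return a list of human-readable problems (empty when the doc looks fine)."""
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in doc:
            errors.append(f"missing field: {name}")
        elif not isinstance(doc[name], expected_type):
            errors.append(f"{name}: expected {expected_type.__name__}")

    symbols = doc.get("reachable_symbols", [])
    if isinstance(symbols, list):
        for index, entry in enumerate(symbols):
            if not isinstance(entry, dict):
                errors.append(f"reachable_symbols[{index}]: expected object")
                continue
            for key in ("purl", "symbol"):
                if not entry.get(key):
                    errors.append(f"reachable_symbols[{index}]: missing {key}")
            purl = entry.get("purl")
            if isinstance(purl, str) and not purl.startswith("pkg:"):
                errors.append(f"reachable_symbols[{index}]: purl must start with 'pkg:'")
    return errors


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python validate_ground_truth.py path/to/GROUND_TRUTH.json
    with open(sys.argv[1], encoding="utf-8") as handle:
        problems = validate_ground_truth(json.load(handle))
    print("\n".join(problems) or "OK")
    sys.exit(1 if problems else 0)
```

Returning a list of problems rather than raising on the first one lets a contributor fix a new sample in a single pass.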
3. PHP sample (php-001-phar-deserialize)
3.1 Folder layout
samples/php/php-001-phar-deserialize/:
composer.json
composer.lock # pinned, checked-in
public/
index.php
src/
UnsafeDeser.php
sbom.cdx.json
vex.cdx.json
GROUND_TRUTH.json
repro.sh
Dockerfile # optional, but recommended
README.md # local sample README
3.2 composer.json
Pin a vulnerable (or pretend-vulnerable) version. Here example/vendor is a placeholder for the vulnerable package; JSON does not allow inline comments, so that note lives in this sentence rather than in the file:
{
  "name": "dataset/php-001-phar-deserialize",
  "description": "Minimal PHP app demonstrating unsafe unserialize reachability.",
  "require": {
    "php": "^8.1",
    "example/vendor": "1.2.3"
  },
  "autoload": {
    "psr-4": {
      "Dataset\\Php001\\": "src/"
    }
  }
}
3.3 src/UnsafeDeser.php
<?php
namespace Dataset\Php001;

class UnsafeDeser
{
    /**
     * Vulnerable sink: directly calls unserialize() on user-controlled data.
     */
    public static function unsafeUnserialize(string $data): mixed
    {
        // This is the "vulnerable symbol" tools should flag as reachable.
        return unserialize($data);
    }

    /**
     * Example of a "benign" path: input is compared, not deserialized.
     */
    public static function safeCompare(string $data): bool
    {
        return $data === 'ok';
    }
}
3.4 public/index.php
<?php
declare(strict_types=1);

require __DIR__ . '/../vendor/autoload.php';

use Dataset\Php001\UnsafeDeser;

$mode = $_GET['mode'] ?? 'safe';
$data = $_GET['data'] ?? 's:2:"ok";';

if ($mode === 'unsafe') {
    // POSITIVE PATH: user input -> vulnerable sink
    $result = UnsafeDeser::unsafeUnserialize($data);
    echo "UNSAFE RESULT:\n";
    var_dump($result);
} else {
    // NEGATIVE PATH: package is present but sink not invoked
    $isOk = UnsafeDeser::safeCompare($data);
    echo "SAFE RESULT:\n";
    var_dump($isOk);
}
3.5 GROUND_TRUTH.json
{
"sample_id": "php-001-phar-deserialize",
"lang": "php",
"package_manager": "composer",
"vuln_ids": [
"OSV:PLACEHOLDER-2019-UNSERIALIZE"
],
"entrypoints": [
"public/index.php"
],
"reachable_symbols": [
{
"purl": "pkg:composer/example/vendor@1.2.3",
"symbol": "Dataset\\Php001\\UnsafeDeser::unsafeUnserialize",
"kind": "sink",
"note": "Reachable when mode=unsafe (positive path)."
}
],
"evidence": [
{
"type": "path",
"file": "public/index.php",
"line": 15,
"desc": "$_GET['data'] flows into UnsafeDeser::unsafeUnserialize without validation."
},
{
"type": "exec",
"cmd": "php -S 0.0.0.0:8000 -t public",
"result": "Dev server started at http://0.0.0.0:8000"
},
{
"type": "exec",
"cmd": "curl -sG 'http://localhost:8000/' --data-urlencode 'mode=unsafe' --data-urlencode 'data=O:4:\"Test\":0:{}'",
"result": "unserialize() returns an object (__PHP_Incomplete_Class, because class Test is not defined in the app)"
}
]
}
3.6 Minimal SBOM (sbom.cdx.json)
Very small CycloneDX 1.6 example (trim or enrich as needed):
{
"bomFormat": "CycloneDX",
"specVersion": "1.6",
"version": 1,
"metadata": {
"component": {
"type": "application",
"name": "php-001-phar-deserialize"
}
},
"components": [
{
"type": "library",
"name": "example/vendor",
"version": "1.2.3",
"purl": "pkg:composer/example/vendor@1.2.3"
}
]
}
3.7 Minimal VEX (vex.cdx.json)
CycloneDX VEX example:
{
"bomFormat": "CycloneDX",
"specVersion": "1.6",
"version": 1,
"metadata": {
"component": {
"type": "application",
"name": "php-001-phar-deserialize"
}
},
"vulnerabilities": [
{
"id": "OSV:PLACEHOLDER-2019-UNSERIALIZE",
"source": {
"name": "OSV",
"url": "https://osv.dev/"
},
"affects": [
{
"ref": "pkg:composer/example/vendor@1.2.3"
}
],
"analysis": {
"state": "exploitable",
"detail": "UnsafeDeser::unsafeUnserialize is reachable from HTTP query parameter 'data' when mode=unsafe."
}
}
]
}
3.8 repro.sh
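#!/usr/bin/env bash
set -euxo pipefail
# Install dependencies
composer install --no-interaction --no-progress
# Start built-in PHP server in background
php -S 0.0.0.0:8000 -t public &
SERVER_PID=$!
# Always stop the server on exit, even if a request below fails
trap 'kill "$SERVER_PID" 2>/dev/null || true' EXIT
# Give server a moment to start
sleep 2
# -G with --data-urlencode percent-encodes the payload (quotes, braces);
# -f makes curl exit non-zero on HTTP errors so the script fails visibly.
echo "[+] Safe path (should NOT reach vulnerable sink)"
curl -fsS -G 'http://localhost:8000/' --data-urlencode 'mode=safe' --data-urlencode 'data=s:2:"ok";'
echo "[+] Unsafe path (should reach vulnerable sink)"
curl -fsS -G 'http://localhost:8000/' --data-urlencode 'mode=unsafe' --data-urlencode 'data=O:4:"Test":0:{}'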
4. JavaScript sample (js-002-yaml-unsafe-load)
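This example is intentionally simple: an Express server that calls js-yaml’s load() on user input. Note that js-yaml 4.x made load() safe by default (the code-executing YAML types are no longer in the default schema), so the pinned 4.1.0 and the placeholder vulnerability ID only illustrate the reachability pattern; to exercise a genuinely unsafe load(), pin an older 3.x release that has a published advisory.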
4.1 Layout
samples/js/js-002-yaml-unsafe-load/:
package.json
package-lock.json
server.js
sbom.cdx.json
vex.cdx.json
GROUND_TRUTH.json
repro.sh
Dockerfile (optional)
README.md
4.2 package.json
{
"name": "js-002-yaml-unsafe-load",
"version": "1.0.0",
"description": "Minimal Node.js sample demonstrating unsafe js-yaml load reachability.",
"main": "server.js",
"scripts": {
"start": "node server.js"
},
"dependencies": {
"express": "^4.19.0",
"js-yaml": "4.1.0"
}
}
4.3 server.js
const express = require('express');
const yaml = require('js-yaml');
const app = express();
app.use(express.text({ type: '*/*' }));

// POSITIVE PATH: unsafe load of attacker-controlled YAML
app.post('/parse-unsafe', (req, res) => {
  try {
    const doc = yaml.load(req.body); // vulnerable symbol
    res.json({ parsed: doc });
  } catch (err) {
    res.status(400).json({ error: String(err) });
  }
});

// NEGATIVE PATH: same dependency, but not reachable as a sink
app.post('/parse-safe', (req, res) => {
  // Pretend we validated and reject anything non-whitelisted
  if (req.body.length > 100) {
    return res.status(400).json({ error: 'Too big' });
  }
  // No call to yaml.load() here; dependency is present but sink not invoked
  res.json({ length: req.body.length });
});

const port = process.env.PORT || 3000;
app.listen(port, () => {
  console.log(`js-002-yaml-unsafe-load listening on http://localhost:${port}`);
});
4.4 GROUND_TRUTH.json
{
"sample_id": "js-002-yaml-unsafe-load",
"lang": "javascript",
"package_manager": "npm",
"vuln_ids": [
"OSV:PLACEHOLDER-js-yaml-unsafe-load"
],
"entrypoints": [
"server.js"
],
"reachable_symbols": [
{
"purl": "pkg:npm/js-yaml@4.1.0",
"symbol": "load",
"kind": "sink",
"note": "Reachable from POST /parse-unsafe body."
}
],
"evidence": [
{
"type": "path",
"file": "server.js",
"line": 9,
"desc": "req.body passes directly into yaml.load() without validation."
},
{
"type": "exec",
"cmd": "node server.js",
"result": "Server listening on http://localhost:3000"
},
{
"type": "exec",
"cmd": "curl -XPOST localhost:3000/parse-unsafe -d 'foo: bar'",
"result": "JSON response with {\"foo\":\"bar\"}"
}
]
}
4.5 SBOM & VEX
Very similar to the PHP example, but with npm purl:
{
"bomFormat": "CycloneDX",
"specVersion": "1.6",
"version": 1,
"components": [
{
"type": "library",
"name": "js-yaml",
"version": "4.1.0",
"purl": "pkg:npm/js-yaml@4.1.0"
}
]
}
and:
{
"bomFormat": "CycloneDX",
"specVersion": "1.6",
"version": 1,
"vulnerabilities": [
{
"id": "OSV:PLACEHOLDER-js-yaml-unsafe-load",
"affects": [
{ "ref": "pkg:npm/js-yaml@4.1.0" }
],
"analysis": {
"state": "exploitable",
"detail": "yaml.load() reachable from POST /parse-unsafe."
}
}
]
}
4.6 repro.sh
#!/usr/bin/env bash
set -euxo pipefail
npm ci
node server.js &
SERVER_PID=$!
# Always stop the server on exit, even if a request below fails
trap 'kill "$SERVER_PID" 2>/dev/null || true' EXIT
sleep 2
# -f makes curl exit non-zero on HTTP errors so the script fails visibly
echo "[+] Positive (reachable) path"
curl -fsS -XPOST localhost:3000/parse-unsafe -d 'foo: bar'
echo "[+] Negative (not reaching sink) path"
curl -fsS -XPOST localhost:3000/parse-safe -d 'foo: bar'
5. C# sample (cs-001-binaryformatter-deserialize)
Minimal ASP.NET Core-style sample that uses BinaryFormatter.Deserialize on request data.
5.1 Layout
samples/csharp/cs-001-binaryformatter-deserialize/:
Cs001BinaryFormatter.csproj
Program.cs
sbom.cdx.json
vex.cdx.json
GROUND_TRUTH.json
repro.sh
Dockerfile (optional)
README.md
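5.2 Cs001BinaryFormatter.csproj
<Project Sdk="Microsoft.NET.Sdk.Web">
  <PropertyGroup>
    <TargetFramework>net8.0</TargetFramework>
    <Nullable>enable</Nullable>
    <ImplicitUsings>enable</ImplicitUsings>
    <!-- BinaryFormatter is disabled by default in ASP.NET Core apps; re-enable it for this sample only -->
    <EnableUnsafeBinaryFormatterSerialization>true</EnableUnsafeBinaryFormatterSerialization>
  </PropertyGroup>
  <ItemGroup>
    <!-- Pretend vulnerable library -->
    <PackageReference Include="Example.VulnerableLib" Version="1.0.0" />
  </ItemGroup>
</Project>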
5.3 Program.cs
using System.Runtime.Serialization.Formatters.Binary;
using System.Text;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

app.MapPost("/deserialize-unsafe", async (HttpContext ctx) =>
{
    // POSITIVE PATH: body -> BinaryFormatter.Deserialize
    using var ms = new MemoryStream(await ToBytes(ctx.Request.Body));
#pragma warning disable SYSLIB0011
    var formatter = new BinaryFormatter();
    var obj = formatter.Deserialize(ms); // vulnerable symbol
#pragma warning restore SYSLIB0011
    await ctx.Response.WriteAsJsonAsync(new { success = true, type = obj?.GetType().FullName });
});

app.MapPost("/deserialize-safe", async (HttpContext ctx) =>
{
    // NEGATIVE PATH: we read input, but never deserialize
    using var reader = new StreamReader(ctx.Request.Body, Encoding.UTF8);
    var text = await reader.ReadToEndAsync();
    await ctx.Response.WriteAsJsonAsync(new { length = text.Length });
});

app.Run();

static async Task<byte[]> ToBytes(Stream stream)
{
    using var ms = new MemoryStream();
    await stream.CopyToAsync(ms);
    return ms.ToArray();
}
5.4 GROUND_TRUTH.json
{
"sample_id": "cs-001-binaryformatter-deserialize",
"lang": "csharp",
"package_manager": "nuget",
"vuln_ids": [
"OSV:PLACEHOLDER-BinaryFormatter"
],
"entrypoints": [
"Program.cs"
],
"reachable_symbols": [
{
"purl": "pkg:nuget/Example.VulnerableLib@1.0.0",
"symbol": "System.Runtime.Serialization.Formatters.Binary.BinaryFormatter::Deserialize",
"kind": "sink",
"note": "Reachable from POST /deserialize-unsafe body."
}
],
"evidence": [
{
"type": "path",
"file": "Program.cs",
"line": 15,
"desc": "Request body copied verbatim into BinaryFormatter.Deserialize."
},
{
"type": "exec",
"cmd": "dotnet run",
"result": "App listening on http://localhost:5000"
},
{
"type": "exec",
"cmd": "curl -XPOST http://localhost:5000/deserialize-unsafe --data-binary @payload.bin",
"result": "Response includes type name of deserialized object."
}
]
}
SBOM/VEX same pattern as previous examples, with purl: "pkg:nuget/Example.VulnerableLib@1.0.0".
6. Tool output schema and integration
This is the normalized output your runners should produce for each (tool, sample) pair.
6.1 run.json schema (results/<tool>/<lang>/<sample_id>/run.json)
{
"tool": "mytool",
"version": "1.2.3",
"sample_id": "js-002-yaml-unsafe-load",
"lang": "javascript",
"detected_vulns": [
"OSV:PLACEHOLDER-js-yaml-unsafe-load"
],
"reachable_symbols_reported": [
{
"purl": "pkg:npm/js-yaml@4.1.0",
"symbol": "load",
"kind": "sink",
"evidence": "Taint flow from POST /parse-unsafe body to js-yaml load()."
}
],
"verdict": {
"reachable": true,
"confidence": 0.92
},
"timing": {
"scan_ms": 2300
},
"raw": "tool-output.json"
}
Fields:
- reachable is your top-level yes/no reachability verdict for the specific vulnerability(ies) listed in the sample.
- reachable_symbols_reported should map onto GROUND_TRUTH.reachable_symbols where possible.
- raw is optional: a path to the original, un-normalized tool output.
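To make the mapping between reachable_symbols_reported and GROUND_TRUTH.reachable_symbols concrete, here is a minimal Python sketch that compares the two. The function name `symbol_agreement` is hypothetical, and matching on the exact (purl, symbol) pair is an assumption; real tools may need looser matching (for example, ignoring namespace separators).

```python
def _pairs(entries: list) -> set:
    """Collect (purl, symbol) pairs, skipping entries that lack either key."""
    return {
        (entry["purl"], entry["symbol"])
        for entry in entries
        if isinstance(entry, dict) and entry.get("purl") and entry.get("symbol")
    }


def symbol_agreement(ground_truth: dict, run: dict) -> dict:
    """Compare a run.json document against a GROUND_TRUTH.json document.

    Assumption: symbols agree only when purl and symbol match exactly.
    """
    expected = _pairs(ground_truth.get("reachable_symbols", []))
    reported = _pairs(run.get("reachable_symbols_reported", []))
    return {
        "matched": sorted(expected & reported),  # tool cited the expected symbol
        "missed": sorted(expected - reported),   # expected but not reported
        "extra": sorted(reported - expected),    # reported but not in ground truth
        "verdict_reachable": bool(run.get("verdict", {}).get("reachable", False)),
    }


if __name__ == "__main__":
    truth = {"reachable_symbols": [{"purl": "pkg:npm/js-yaml@4.1.0", "symbol": "load"}]}
    run = {
        "reachable_symbols_reported": [{"purl": "pkg:npm/js-yaml@4.1.0", "symbol": "load"}],
        "verdict": {"reachable": True, "confidence": 0.92},
    }
    print(symbol_agreement(truth, run))
```

The top-level verdict drives the confusion matrix, while the matched/missed/extra lists show whether a correct verdict was reached for the expected reason.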
7. Example integration: running a tool against all samples
7.1 Simple Bash runner (runners/run_all.sh)
#!/usr/bin/env bash
set -euo pipefail

TOOL_NAME="${1:-mytool}"
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"

for lang_dir in "$ROOT_DIR/samples"/*; do
  [ -d "$lang_dir" ] || continue   # skip stray files next to language folders
  lang="$(basename "$lang_dir")"
  for sample_dir in "$lang_dir"/*; do
    [ -d "$sample_dir" ] || continue
    sample_id="$(basename "$sample_dir")"
    echo "[*] Running $TOOL_NAME on $lang/$sample_id"
    sbom="$sample_dir/sbom.cdx.json"
    vex="$sample_dir/vex.cdx.json"
    out_dir="$ROOT_DIR/results/$TOOL_NAME/$lang/$sample_id"
    mkdir -p "$out_dir"

    # Example: assume your tool supports a CLI like:
    #   mytool scan --sbom sbom.cdx.json --vex vex.cdx.json --project-root .
    "$TOOL_NAME" scan \
      --sbom "$sbom" \
      --vex "$vex" \
      --project-root "$sample_dir" \
      > "$out_dir/tool-output.json"

    # Normalize output to run.json via helper script
    python3 "$ROOT_DIR/runners/normalize_${TOOL_NAME}.py" \
      "$sample_dir" \
      "$out_dir/tool-output.json" \
      > "$out_dir/run.json"
  done
done
7.2 Python normalizer example (runners/normalize_mytool.py)
This script turns your proprietary tool output into our run.json schema.
#!/usr/bin/env python3
import json
import sys
from pathlib import Path

sample_dir = Path(sys.argv[1])
tool_output_path = Path(sys.argv[2])

ground_truth = json.loads((sample_dir / "GROUND_TRUTH.json").read_text())
tool_output = json.loads(tool_output_path.read_text())

# Example: adapt based on your tool's own schema
run = {
    "tool": "mytool",
    "version": tool_output.get("tool_version", "unknown"),
    "sample_id": ground_truth["sample_id"],
    "lang": ground_truth["lang"],
    "detected_vulns": tool_output.get("vuln_ids", []),
    "reachable_symbols_reported": [],
    "verdict": {
        "reachable": bool(tool_output.get("reachable", False)),
        "confidence": float(tool_output.get("confidence", 0.0))
    },
    "timing": {
        "scan_ms": tool_output.get("scan_ms", None)
    },
    "raw": str(tool_output_path.name)
}

# Copy each sink the tool reported into the normalized symbol list
for r in tool_output.get("reachable_sinks", []):
    run["reachable_symbols_reported"].append({
        "purl": r.get("purl"),
        "symbol": r.get("symbol"),
        "kind": r.get("kind", "sink"),
        "evidence": r.get("evidence", "")
    })

print(json.dumps(run, indent=2))
So tool authors only need to:
- Implement a CLI to scan a project (given SBOM/VEX).
- Implement a small normalizer to produce run.json.
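Once run.json files exist, they have to be joined with the ground truth before any scoring can happen. The sketch below walks the results/ tree and pairs each run with its sample. The function name `collect_rows` is hypothetical, and it assumes a sample counts as reachable whenever its GROUND_TRUTH.json lists at least one reachable symbol; the guide does not define a sample-level truth flag.

```python
import json
from pathlib import Path


def collect_rows(root: Path) -> list:
    """Join results/<tool>/<lang>/<sample_id>/run.json with its ground truth."""
    rows = []
    for run_path in sorted((root / "results").glob("*/*/*/run.json")):
        # Path layout: .../results/<tool>/<lang>/<sample_id>/run.json
        tool, lang, sample_id = run_path.parts[-4:-1]
        truth_path = root / "samples" / lang / sample_id / "GROUND_TRUTH.json"
        if not truth_path.is_file():
            continue  # result for an unknown sample; skipped in this sketch
        run = json.loads(run_path.read_text())
        truth = json.loads(truth_path.read_text())
        rows.append({
            "tool": tool,
            "lang": lang,
            "sample_id": sample_id,
            "predicted": bool(run.get("verdict", {}).get("reachable", False)),
            # Assumption: any listed reachable symbol means "reachable".
            "actual": bool(truth.get("reachable_symbols")),
        })
    return rows


if __name__ == "__main__":
    for row in collect_rows(Path(".")):
        outcome = "match" if row["predicted"] == row["actual"] else "MISMATCH"
        print(f'{row["tool"]:<12} {row["lang"]:<8} {row["sample_id"]:<40} {outcome}')
```

Run from the repository root, this prints one line per (tool, sample) pair; the predicted/actual booleans are the inputs for the TP/FP/FN/TN counts.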
8. Adding a new sample (for contributors)
This is what you’d document so others can extend the dataset.
- Pick a language & ID
  - Folder: samples/<lang>/<lang-short-id>-NNN-<name>/
  - Example: samples/php/php-004-guzzle-ssrf/
- Create a minimal app
  - It must install with one command (composer install, npm ci, dotnet restore, etc.).
  - Include:
    - Positive path: user-controllable data reaches the vulnerable sink.
    - Negative path (if possible): same dependency present but sink not reachable.
- Pin dependencies
  - Commit lockfiles (composer.lock, package-lock.json, etc.).
  - Make sure the vulnerable version is used.
- Write GROUND_TRUTH.json
  - Fill all required fields from the schema above.
  - Be explicit about which symbol(s) are reachable and how to reproduce.
- Generate SBOM
  - Use your preferred SBOM generator and convert to CycloneDX 1.6 JSON (sbom.cdx.json).
  - Ensure PURLs match those you reference in GROUND_TRUTH.json.
- Write VEX (vex.cdx.json)
  - At minimum: one vulnerability with analysis.state = exploitable or not_affected.
  - Link to the SBOM component via affects.ref.
- Add repro.sh
  - Script that:
    - Installs deps.
    - Starts the app.
    - Executes at least one positive and one negative HTTP/CLI call.
  - Must exit non‑zero on obvious failure.
- Document briefly in local README
  - What vulnerability pattern this sample represents (e.g., SSRF, XXE, unsafe deserialization).
  - Expected tool behavior (what should be marked reachable).
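Several items in this checklist can be verified mechanically before a PR is opened. The Python sketch below is one possible pre-submit check; the name `check_sample` and the required-file list are assumptions based on the sample layouts shown earlier, and it assumes affects.ref carries the component PURL as in the examples (lockfiles are not checked because their names differ per language).

```python
import json
import sys
from pathlib import Path

# Assumption: files every sample folder is expected to contain.
REQUIRED_FILES = ["GROUND_TRUTH.json", "sbom.cdx.json", "vex.cdx.json", "repro.sh", "README.md"]


def check_sample(sample_dir: Path) -> list:
    """Return a list of problems found in one sample folder (empty when clean)."""
    problems = [
        f"missing file: {name}"
        for name in REQUIRED_FILES
        if not (sample_dir / name).is_file()
    ]
    if problems:
        return problems  # the remaining checks need all files present

    truth = json.loads((sample_dir / "GROUND_TRUTH.json").read_text())
    sbom = json.loads((sample_dir / "sbom.cdx.json").read_text())
    vex = json.loads((sample_dir / "vex.cdx.json").read_text())

    if truth.get("sample_id") != sample_dir.name:
        problems.append("sample_id does not match folder name")

    sbom_purls = {c.get("purl") for c in sbom.get("components", [])}
    for sym in truth.get("reachable_symbols", []):
        if sym.get("purl") not in sbom_purls:
            problems.append(f"ground-truth purl not in SBOM: {sym.get('purl')}")
    for vuln in vex.get("vulnerabilities", []):
        for affected in vuln.get("affects", []):
            if affected.get("ref") not in sbom_purls:
                problems.append(f"VEX ref not in SBOM: {affected.get('ref')}")
    return problems


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_sample.py samples/php/php-001-phar-deserialize
    found = check_sample(Path(sys.argv[1]))
    print("\n".join(found) or "OK")
    sys.exit(1 if found else 0)
```

A script like this could also run in CI for every folder under samples/, so PURL mismatches between the SBOM, VEX, and ground truth are caught before review.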
If you want, you can literally copy-paste the code and JSON above as your initial three samples (php-001, js-002, cs-001) and then we can layer in more patterns (XXE, SSRF, prototype pollution, etc.) the same way.