up
@@ -25,10 +25,18 @@
 2) Write NDJSON with stable ordering; compute SHA-256 for each file; write manifest.
 3) Run validation script to assert counts, schema shape, and hash reproducibility.
 
-## Open items (to resolve before data generation)
+## Interim fixtures (delivered 2025-12-01)
+- Synthetic deterministic graphs generated under `samples/graph/interim/`:
+  - `graph-50k` (50k nodes, ~200k edges)
+  - `graph-100k` (100k nodes, ~500k edges)
+- Minimal node schema (`id, kind, name, version, tenant`), seeded RNG, stable ordering, manifests with hashes.
+- Purpose: unblock BENCH-GRAPH-21-001/002 while the overlay format is finalized. Overlays are not included yet.
+
+## Open items (to resolve before canonical data generation)
 - Confirm overlay field set and file naming (Graph Guild, due 2025-11-22).
 - Confirm allowed mock SBOM source list and artifact naming (Graph Guild / SBOM Service Guild).
 - Provide expected node/edge cardinality breakdown (packages vs files vs relationships) to guide generation.
 
 ## Next steps
-- Blocked pending overlay/schema confirmation; revisit after 2025-11-22 checkpoint.
+- Keep SAMPLES-GRAPH-24-003 blocked until overlay/schema confirmation; interim fixtures are available for benches in the meantime.
+- Once the overlay schema is final, extend the generator to emit overlays + CAS manifests and promote the result to the official fixture.
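
Step 3 above calls for a validation script that asserts counts and schema shape. A minimal sketch of that check for node files follows. It assumes the interim layout (`nodes.ndjson` next to a `manifest.json` that has a `counts.nodes` entry) and the five-field node schema; the name `validate_nodes` is illustrative and not part of the repository.

```python
import json
from pathlib import Path

# Assumed node schema, taken from the interim fixture description.
NODE_KEYS = {"id", "kind", "name", "version", "tenant"}


def validate_nodes(fixture_dir: Path) -> int:
    """Check node schema shape and the node count recorded in manifest.json."""
    manifest = json.loads((fixture_dir / "manifest.json").read_text(encoding="utf-8"))
    seen = 0
    with (fixture_dir / "nodes.ndjson").open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            rec = json.loads(line)
            if set(rec) != NODE_KEYS:
                raise ValueError(f"line {line_no}: unexpected keys {sorted(rec)}")
            seen += 1
    expected = manifest["counts"]["nodes"]
    if seen != expected:
        raise ValueError(f"node count {seen} does not match manifest ({expected})")
    return seen
```

An edge check would look the same with a different key set; hash reproducibility is a separate comparison against the `hashes` entries.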
27
samples/graph/interim/README.md
Normal file
@@ -0,0 +1,27 @@
# Interim Graph Fixtures (synthetic)

Generated by `samples/graph/interim/generate.py` to unblock BENCH-GRAPH-21-001/002 while SAMPLES-GRAPH-24-003 remains blocked.

## Contents
- `graph-50k/`
  - `nodes.ndjson` (50,000 package nodes)
  - `edges.ndjson` (199,988 depends_on edges)
  - `manifest.json` (hashes/counts)
- `graph-100k/`
  - `nodes.ndjson` (100,000 package nodes)
  - `edges.ndjson` (499,980 depends_on edges)
  - `manifest.json` (hashes/counts)
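
The edge counts recorded in the committed manifests (199,988 and 499,980) can be derived from the generator's fanout rule without running it. The sketch below assumes the reading of `make_edges` in which the `break` sits at the level of the `while` body, so each of the last `fanout` nodes stops after a single target; that reading is the one that reproduces the manifest counts. The helper name `expected_edges` is illustrative.

```python
import math


def expected_edges(node_count: int) -> int:
    """Edge count implied by generate.py's fanout rule (assumed reading)."""
    # Same fanout formula as generate_fixture().
    fanout = max(1, int(math.log10(node_count)))
    # The last `fanout` nodes leave the sampling loop after one target each.
    tail = min(fanout, node_count)
    # Every other node gets exactly `fanout` distinct targets.
    return (node_count - tail) * fanout + tail


print(expected_edges(50_000))   # 199988
print(expected_edges(100_000))  # 499980
```

For 50k nodes the fanout is 4, giving 49,996 × 4 + 4 edges; for 100k nodes it is 5, giving 99,995 × 5 + 5.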

## Determinism
- Seeded RNG (`seed=42`) selects edge targets; the fanout itself is fixed per fixture.
- Stable ordering, UTF-8, sorted keys.
- Hashes in `manifest.json` for verification.
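
One way to use the manifest hashes for verification is sketched below. It assumes the manifest layout written by `generate.py` (a `hashes` object mapping file name to SHA-256 hex digest); `verify_fixture` is a hypothetical helper, not a script that ships with the fixtures.

```python
import hashlib
import json
from pathlib import Path
from typing import List


def sha256_file(path: Path, chunk_size: int = 8192) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_fixture(fixture_dir: Path) -> List[str]:
    """Return the names of files whose hash differs from manifest.json."""
    manifest = json.loads((fixture_dir / "manifest.json").read_text(encoding="utf-8"))
    mismatches = []
    for name, expected in sorted(manifest["hashes"].items()):
        if sha256_file(fixture_dir / name) != expected:
            mismatches.append(name)
    return mismatches
```

An empty list means every listed file matches; calling it on `samples/graph/interim/graph-50k` after regenerating is a quick reproducibility check.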

## How to regenerate
```bash
python samples/graph/interim/generate.py
```

## Notes
- Node schema is minimal (`id, kind, name, version, tenant`); edges carry `id, kind, source, target, tenant`. Overlay format is still pending; add overlays once Graph Guild finalizes fields.
- Use these fixtures for throughput/latency benches and UI scripting; swap to canonical SAMPLES-GRAPH-24-003 once available.
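
For bench or UI scripts that need the graph in memory, a small loader along these lines may be enough. It is a sketch that assumes the edge record shape written by `generate.py` (`source` and `target` fields); `load_adjacency` is an illustrative name.

```python
import json
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def load_adjacency(edges_path: Path) -> Dict[str, List[str]]:
    """Stream edges.ndjson into a source -> [targets] mapping."""
    adjacency: Dict[str, List[str]] = defaultdict(list)
    with edges_path.open(encoding="utf-8") as f:
        for line in f:
            edge = json.loads(line)
            adjacency[edge["source"]].append(edge["target"])
    # Return a plain dict so missing keys raise instead of silently appearing.
    return dict(adjacency)
```

Reading line by line keeps memory bounded by the adjacency map rather than by the raw file.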
125
samples/graph/interim/generate.py
Normal file
@@ -0,0 +1,125 @@
#!/usr/bin/env python3
"""
Deterministic interim graph fixture generator.

Produces two fixtures (50k and 100k nodes) with simple package/version nodes
and dependency edges. Output shape is NDJSON with stable ordering.
"""
from __future__ import annotations

import hashlib
import json
import math
import random
from pathlib import Path
from typing import Iterable, List

ROOT = Path(__file__).resolve().parent
OUT_DIR = ROOT
TENANT = "demo-tenant"


def chunked(seq: Iterable, size: int):
    chunk = []
    for item in seq:
        chunk.append(item)
        if len(chunk) >= size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def make_nodes(count: int) -> List[dict]:
    nodes: List[dict] = []
    for i in range(1, count + 1):
        nodes.append(
            {
                "id": f"pkg-{i:06d}",
                "kind": "package",
                "name": f"package-{i:06d}",
                "version": f"1.{(i % 10)}.{(i % 7)}",
                "tenant": TENANT,
            }
        )
    return nodes


def make_edges(nodes: List[dict], fanout: int) -> List[dict]:
    edges: List[dict] = []
    rng = random.Random(42)
    n = len(nodes)
    for idx, node in enumerate(nodes):
        # Connect each node to up to `fanout` distinct targets drawn from
        # [idx + 1, n]. Node numbers are 1-based, so idx + 1 is this node's own
        # number: the range covers the node itself plus all later nodes.
        targets = set()
        while len(targets) < fanout:
            t = rng.randint(idx + 1, n)
            if t <= n:
                targets.add(t)
            # The last `fanout` nodes stop after a single target.
            if idx + fanout >= n:
                break
        for t in sorted(targets):
            edges.append(
                {
                    "id": f"edge-{node['id']}-{t:06d}",
                    "kind": "depends_on",
                    "source": node["id"],
                    "target": f"pkg-{t:06d}",
                    "tenant": TENANT,
                }
            )
    return edges


def write_ndjson(path: Path, records: Iterable[dict]):
    with path.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, separators=(",", ":"), sort_keys=True))
            f.write("\n")


def sha256_file(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def generate_fixture(name: str, node_count: int):
    fixture_dir = OUT_DIR / name
    fixture_dir.mkdir(parents=True, exist_ok=True)

    print(f"Generating {name} with {node_count} nodes…")
    nodes = make_nodes(node_count)
    # keep fanout small to limit edges and file size
    fanout = max(1, int(math.log10(node_count)))
    edges = make_edges(nodes, fanout=fanout)

    nodes_path = fixture_dir / "nodes.ndjson"
    edges_path = fixture_dir / "edges.ndjson"
    manifest_path = fixture_dir / "manifest.json"

    write_ndjson(nodes_path, nodes)
    write_ndjson(edges_path, edges)

    manifest = {
        "version": "1.0.0",
        "tenant": TENANT,
        "counts": {"nodes": len(nodes), "edges": len(edges)},
        "hashes": {
            "nodes.ndjson": sha256_file(nodes_path),
            "edges.ndjson": sha256_file(edges_path),
        },
    }
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    print(f"Wrote manifest {manifest_path}")


def main():
    generate_fixture("graph-50k", 50_000)
    generate_fixture("graph-100k", 100_000)


if __name__ == "__main__":
    main()
499980
samples/graph/interim/graph-100k/edges.ndjson
Normal file
File diff suppressed because it is too large
12
samples/graph/interim/graph-100k/manifest.json
Normal file
@@ -0,0 +1,12 @@
{
  "counts": {
    "edges": 499980,
    "nodes": 100000
  },
  "hashes": {
    "edges.ndjson": "4f09d36e908b4cc5136ef74fdb716f657156765c7ccf5f5fb4f46a744a2681ff",
    "nodes.ndjson": "74723965607ae70dbc34c658d75cc7f5491f3e27780cb8c5a2e1eb25620165b2"
  },
  "tenant": "demo-tenant",
  "version": "1.0.0"
}
100000
samples/graph/interim/graph-100k/nodes.ndjson
Normal file
File diff suppressed because it is too large
199988
samples/graph/interim/graph-50k/edges.ndjson
Normal file
File diff suppressed because it is too large
12
samples/graph/interim/graph-50k/manifest.json
Normal file
@@ -0,0 +1,12 @@
{
  "counts": {
    "edges": 199988,
    "nodes": 50000
  },
  "hashes": {
    "edges.ndjson": "811fc5e34399191e8c8ce2139f418b9ae3b151527ddfe853a8d39fc079179042",
    "nodes.ndjson": "8583293ef89d6ef60815d15060f92ffdcefafdfef135b1171e1512438522f447"
  },
  "tenant": "demo-tenant",
  "version": "1.0.0"
}
50000
samples/graph/interim/graph-50k/nodes.ndjson
Normal file
File diff suppressed because it is too large
@@ -1,18 +1,24 @@
-# Generation driver (stub) — SAMPLES-GRAPH-24-003
+# Interim & final fixture generation — SAMPLES-GRAPH-24-003
 
-> Blocked: overlay schema + mock SBOM bundle list pending. Script outline only.
+## Current status
+- Interim synthetic fixtures (50k/100k) are generated via `samples/graph/interim/generate.py` (deterministic, hashes in manifest). Use these for BENCH-GRAPH-21-001/002 until overlay schema is finalized.
+- Canonical fixture remains blocked on overlay field confirmation from Graph Guild.
 
-## Outline
-1) Input bundle(s): scanner surface mock bundle v1 (or real caches when available).
-2) Deterministic seeding: `RANDOM_SEED=424242`; time source frozen at `2025-11-22T00:00:00Z`.
-3) Steps (once unblocked):
-   - Parse SBOM mock bundle, expand to node/edge sets following Graph schema.
-   - Generate policy overlay snapshot with placeholder verdicts until final fields confirmed.
+## Plan for canonical fixture
+1) **Inputs:** scanner surface mock bundle v1 (or real caches when cleared), overlay schema from Graph Guild, tenant `demo-tenant`.
+2) **Determinism:** `RANDOM_SEED=424242`, timestamps frozen to `2025-11-22T00:00:00Z`, UTF-8, sorted keys/rows.
+3) **Generation steps (once unblocked):**
+   - Parse mock SBOM bundle → node/edge sets per Graph schema.
+   - Generate policy overlay snapshot using final overlay fields; include verdict, ruleId, severity, provenance hash.
    - Write NDJSON (`nodes.ndjson`, `edges.ndjson`, `overlays/policy.ndjson`) sorted by `id`.
-   - Emit `manifest.json` with SHA-256, counts, timestamps.
-   - Add `verify.sh` to recompute hashes and validate counts.
+   - Emit `manifest.json` with SHA-256, counts, timestamps; DSSE-sign manifest for offline kits.
+   - Add `verify.sh` to recompute hashes and validate counts/overlay fields.
 
-## TODO when unblocked
-- Fill overlay field mapping once Graph Guild confirms schema (checkpoint 2025-11-22).
-- Confirm allowed mock SBOM source list with SBOM / Graph guilds.
-- Implement generator script in Python or C# (deterministic ordering, no network access).
+## TODO to unblock
+- Receive overlay field mapping + file naming from Graph Guild (was due 2025-11-22).
+- Confirm allowed mock SBOM source list and artifact naming (Graph Guild / SBOM Service Guild).
+- Provide expected node/edge cardinality breakdown to guide generation.
+
+## Scripts
+- Interim: `samples/graph/interim/generate.py`
+- Canonical (to write): `samples/graph/scripts/generate-canonical.py` + `verify.sh` (DSSE + hash check), once schema confirmed.
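
The determinism rules in the plan (fixed seed, frozen timestamp, rows sorted by `id`, sorted keys) could be applied roughly as sketched below. This is not the canonical generator, which is still to be written: the manifest layout (`generatedAt`, `seed`, `files`) and the helper names are assumptions for illustration, and DSSE signing is left out.

```python
import hashlib
import json
from typing import Dict, Iterable

# Values taken from the plan above.
RANDOM_SEED = 424242
FROZEN_TIMESTAMP = "2025-11-22T00:00:00Z"


def to_ndjson(records: Iterable[dict]) -> bytes:
    """Serialize records as NDJSON, rows sorted by `id`, keys sorted."""
    rows = sorted(records, key=lambda r: r["id"])
    text = "".join(
        json.dumps(r, separators=(",", ":"), sort_keys=True) + "\n" for r in rows
    )
    return text.encode("utf-8")


def build_manifest(files: Dict[str, bytes]) -> dict:
    """Hypothetical manifest layout: per-file SHA-256 and row count."""
    return {
        "generatedAt": FROZEN_TIMESTAMP,  # frozen, never the wall clock
        "seed": RANDOM_SEED,
        "files": {
            name: {
                "sha256": hashlib.sha256(data).hexdigest(),
                "rows": data.count(b"\n"),
            }
            for name, data in sorted(files.items())
        },
    }
```

Because rows and keys are sorted before hashing and the timestamp is a constant, two runs over the same input produce byte-identical files and an identical manifest, which is what a later `verify.sh` would rely on.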