Golden Corpus Folder Layout
Sprint: SPRINT_20260121_036_BinaryIndex_golden_corpus_bundle_verification
Task: GCB-006 - Document corpus folder layout and maintenance procedures
Overview
The golden corpus is a curated dataset of binary pairs, each built before and after a security patch, used for:
- Validating binary matching algorithms
- Benchmarking reproducibility verification
- Training machine learning models for function identification
- Generating audit-ready evidence bundles
Root Layout
golden-corpus/
├── corpus/ # Security pairs organized by distro
│ ├── debian/
│ ├── ubuntu/
│ └── alpine/
├── mirrors/ # Local mirrors of upstream sources
│ ├── debian/
│ ├── ubuntu/
│ ├── alpine/
│ └── osv/
├── harness/ # Build and verification tooling
│ ├── chroots/
│ ├── lifter-matcher/
│ ├── sbom-canonicalizer/
│ └── verifier/
├── evidence/ # Generated evidence bundles
│ └── <pkg>-<advisory>-bundle.oci.tar
└── bench/ # Benchmark data and baselines
  ├── baselines/
  └── results/
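The root layout above can be checked mechanically before a run. The following is a minimal sketch, assuming the directory names shown in the tree are mandatory; `EXPECTED_LAYOUT` and `missing_directories` are hypothetical names, not part of the existing tooling.

```python
from pathlib import Path

# Top-level directories and their expected children, copied from the tree above.
# Treating every one of them as required is an assumption.
EXPECTED_LAYOUT = {
    "corpus": ["debian", "ubuntu", "alpine"],
    "mirrors": ["debian", "ubuntu", "alpine", "osv"],
    "harness": ["chroots", "lifter-matcher", "sbom-canonicalizer", "verifier"],
    "evidence": [],
    "bench": ["baselines", "results"],
}


def missing_directories(root: Path) -> list[str]:
    """Return the relative paths of expected directories that are absent under root."""
    missing = []
    for top, children in EXPECTED_LAYOUT.items():
        for rel in [top] + [f"{top}/{child}" for child in children]:
            if not (root / rel).is_dir():
                missing.append(rel)
    return missing
```

Calling `missing_directories(Path("golden-corpus"))` returns an empty list when every listed directory exists.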
Corpus Directory Structure
Each security pair follows a consistent structure:
corpus/<distro>/<package>/<advisory-id>/
├── pre/ # Pre-patch (vulnerable) artifacts
│ ├── src/ # Source code
│ │ ├── *.tar.gz # Original source tarball
│ │ ├── debian/ # Packaging metadata
│ │ └── buildinfo # Build reproducibility info
│ └── debs/ # Built binaries
│   ├── *.deb # Binary packages
│   ├── *.ddeb # Debug symbols
│   └── buildlog # Build log
├── post/ # Post-patch (fixed) artifacts
│ ├── src/
│ └── debs/
└── metadata/
  ├── advisory.json # Advisory details
  ├── osv.json # OSV format vulnerability
  ├── pair-manifest.json # Pair configuration
  └── ground-truth.json # Function-level ground truth
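Because every pair lives at `corpus/<distro>/<package>/<advisory-id>/`, the corpus can be enumerated with a three-level glob. This is a rough sketch; `iter_pairs`, `missing_entries`, and `COMMON_ENTRIES` are hypothetical names, and the check covers only the entries that appear in all three distro layouts in this document.

```python
from pathlib import Path
from typing import Iterator

# Entries shared by the Debian, Ubuntu, and Alpine layouts in this document.
# Binary output directories (debs/, apks/) differ per distro, so they are not checked.
COMMON_ENTRIES = ("pre/src", "post/src", "metadata")


def iter_pairs(corpus_root: Path) -> Iterator[tuple[str, str, str, Path]]:
    """Yield (distro, package, advisory_id, path) for each pair directory."""
    for path in sorted(corpus_root.glob("*/*/*")):
        if path.is_dir():
            distro, package, advisory_id = path.relative_to(corpus_root).parts
            yield distro, package, advisory_id, path


def missing_entries(pair_dir: Path) -> list[str]:
    """Return the common entries that are absent from a pair directory."""
    return [rel for rel in COMMON_ENTRIES if not (pair_dir / rel).is_dir()]
```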
Debian Example
corpus/debian/openssl/DSA-5678-1/
├── pre/
│ ├── src/
│ │ ├── openssl_3.0.10.orig.tar.gz
│ │ ├── openssl_3.0.10-1.debian.tar.xz
│ │ ├── openssl_3.0.10-1.dsc
│ │ └── openssl_3.0.10-1.buildinfo
│ └── debs/
│   ├── libssl3_3.0.10-1_amd64.deb
│   ├── libssl3-dbgsym_3.0.10-1_amd64.ddeb
│   └── build.log
├── post/
│ ├── src/
│ │ ├── openssl_3.0.11.orig.tar.gz
│ │ ├── openssl_3.0.11-1.debian.tar.xz
│ │ └── ...
│ └── debs/
│   └── ...
└── metadata/
  ├── advisory.json
  └── ground-truth.json
Ubuntu Example
corpus/ubuntu/curl/USN-1234-1/
├── pre/
│ ├── src/
│ │ └── curl_8.4.0-1ubuntu1.tar.xz
│ └── debs/
│   └── libcurl4_8.4.0-1ubuntu1_amd64.deb
├── post/
│ └── ...
└── metadata/
  ├── advisory.json
  └── usn.json
Alpine Example
corpus/alpine/zlib/CVE-2022-37434/
├── pre/
│ ├── src/
│ │ └── APKBUILD
│ └── apks/
│   └── zlib-1.2.12-r2.apk
├── post/
│ └── ...
└── metadata/
  └── secdb-entry.json
Mirrors Directory Structure
Local mirrors cache upstream artifacts for offline operation:
mirrors/
├── debian/
│ ├── archive/ # snapshot.debian.org mirrors
│ │ └── pool/main/o/openssl/
│ ├── snapshot/ # Point-in-time snapshots
│ │ └── 20260101T000000Z/
│ └── buildinfo/ # buildinfos.debian.net cache
│   └── <source-name>/
├── ubuntu/
│ ├── archive/ # archive.ubuntu.com mirrors
│ ├── usn-index/ # USN metadata
│ │ └── usn-db.json
│ └── launchpad/ # Build logs from Launchpad
├── alpine/
│ ├── packages/ # Alpine package mirror
│ └── secdb/ # Security database
│   └── community.json
└── osv/
  ├── all.zip # Full OSV database
  └── debian/ # Distro-specific extracts
Harness Directory Structure
Build and verification tooling:
harness/
├── chroots/ # Build environments
│ ├── debian-bookworm-amd64/
│ ├── debian-bullseye-amd64/
│ ├── ubuntu-noble-amd64/
│ └── alpine-3.19-amd64/
├── lifter-matcher/ # Binary analysis tools
│ ├── ghidra/ # Ghidra installation
│ ├── bsim-server/ # BSim database server
│ └── semantic-diffing/ # Semantic diff tools
├── sbom-canonicalizer/ # SBOM normalization
│ └── config/
└── verifier/ # Standalone verifier
  ├── stella-verifier # Verifier binary
  └── trust-profiles/ # Trust profiles
Evidence Directory Structure
Generated bundles for audit/compliance:
evidence/
├── openssl-DSA-5678-1-bundle.oci.tar
├── curl-USN-1234-1-bundle.oci.tar
└── manifests/
  └── inventory.json
Bundle Internal Structure (OCI Format)
openssl-DSA-5678-1-bundle.oci.tar/
├── oci-layout # OCI layout version
├── index.json # OCI index with referrers
├── blobs/
│ └── sha256/
│   ├── <manifest> # Bundle manifest
│   ├── <sbom-pre> # Pre-patch SBOM
│   ├── <sbom-post> # Post-patch SBOM
│   ├── <binary-pre> # Pre-patch binary
│   ├── <binary-post> # Post-patch binary
│   ├── <delta-sig> # DSSE delta-sig predicate
│   ├── <provenance> # Build provenance
│   └── <timestamp> # RFC 3161 timestamp
└── manifest.json # Signed bundle manifest
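In an OCI image layout, each file under `blobs/sha256/` is named after the SHA-256 digest of its own content, so a bundle can be given a basic integrity check without extra tooling. The sketch below assumes the bundle is a plain tar archive that follows this convention; `verify_blob_digests` is a hypothetical helper, not the `stella-verifier` binary, and it does not check signatures or timestamps.

```python
import hashlib
import tarfile


def verify_blob_digests(bundle_path: str) -> dict[str, bool]:
    """Map each blobs/sha256/<digest> member to whether its content hashes to <digest>."""
    results = {}
    with tarfile.open(bundle_path, "r") as tar:
        for member in tar.getmembers():
            name = member.name.removeprefix("./")
            if not member.isfile() or not name.startswith("blobs/sha256/"):
                continue
            expected = name.rsplit("/", 1)[-1]
            digest = hashlib.sha256()
            fileobj = tar.extractfile(member)
            # Hash in 1 MiB chunks so large binaries are not loaded into memory at once.
            for chunk in iter(lambda: fileobj.read(1 << 20), b""):
                digest.update(chunk)
            results[name] = digest.hexdigest() == expected
    return results
```

Any entry mapped to `False` indicates a blob whose content no longer matches its file name.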
Bench Directory Structure
Benchmark data and KPI baselines:
bench/
├── baselines/
│ ├── current.json # Active KPI baseline
│ └── archive/ # Historical baselines
│   ├── baseline-20260115.json
│   └── baseline-20260108.json
├── results/
│ ├── 20260122120000.json # Validation run results
│ └── ...
└── reports/
  └── regression-report-*.md
Baseline File Format
{
"baselineId": "baseline-20260122120000",
"createdAt": "2026-01-22T12:00:00Z",
"source": "abc123def456",
"description": "Post-semantic-diffing-v2 baseline",
"precision": 0.95,
"recall": 0.92,
"falseNegativeRate": 0.08,
"deterministicReplayRate": 1.0,
"ttfrpP95Ms": 150,
"additionalKpis": {}
}
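A regression check against the baseline can be sketched as a field-by-field comparison. This is an illustration only: it assumes a results file carries the same KPI field names as the baseline, and the direction of "better" for each KPI is inferred from the field names rather than stated in this document. `find_regressions` and `rel_tolerance` are hypothetical.

```python
# Assumed direction of improvement for each KPI in the baseline format above.
HIGHER_IS_BETTER = ("precision", "recall", "deterministicReplayRate")
LOWER_IS_BETTER = ("falseNegativeRate", "ttfrpP95Ms")


def find_regressions(baseline: dict, result: dict, rel_tolerance: float = 0.0) -> list[str]:
    """Return the KPI names where result is worse than baseline beyond a relative tolerance."""
    regressions = []
    for key in HIGHER_IS_BETTER:
        if result[key] < baseline[key] * (1 - rel_tolerance):
            regressions.append(key)
    for key in LOWER_IS_BETTER:
        if result[key] > baseline[key] * (1 + rel_tolerance):
            regressions.append(key)
    return regressions
```

In CI, `baseline` would typically be loaded from `bench/baselines/current.json` with `json.load`, and a non-empty return value would fail the job.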
File Naming Conventions
| Type | Pattern | Example |
|---|---|---|
| Advisory ID (Debian) | DSA-<number>-<revision> | DSA-5678-1 |
| Advisory ID (Ubuntu) | USN-<number>-<revision> | USN-1234-1 |
| Advisory ID (Alpine) | CVE-<year>-<number> | CVE-2022-37434 |
| Bundle file | <pkg>-<advisory>-bundle.oci.tar | openssl-DSA-5678-1-bundle.oci.tar |
| Baseline file | baseline-<timestamp>.json | baseline-20260122120000.json |
| Results file | <timestamp>.json | 20260122120000.json |
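The advisory and bundle patterns in the table translate directly into regular expressions. The sketch below is one possible encoding; the helper names are hypothetical, and the four-digit year in the CVE pattern is an assumption drawn from the example rather than from the table.

```python
import re
from typing import Optional

# One pattern per advisory naming convention from the table above.
ADVISORY_PATTERNS = {
    "debian": re.compile(r"DSA-\d+-\d+"),
    "ubuntu": re.compile(r"USN-\d+-\d+"),
    "alpine": re.compile(r"CVE-\d{4}-\d+"),
}

# <pkg>-<advisory>-bundle.oci.tar; package names may themselves contain hyphens.
BUNDLE_RE = re.compile(
    r"(?P<pkg>.+)-(?P<advisory>(?:DSA|USN)-\d+-\d+|CVE-\d{4}-\d+)-bundle\.oci\.tar"
)


def advisory_convention(advisory_id: str) -> Optional[str]:
    """Return which distro's naming convention the advisory ID matches, if any."""
    for distro, pattern in ADVISORY_PATTERNS.items():
        if pattern.fullmatch(advisory_id):
            return distro
    return None


def parse_bundle_name(filename: str) -> Optional[tuple[str, str]]:
    """Split a bundle file name into (package, advisory ID), or return None."""
    match = BUNDLE_RE.fullmatch(filename)
    if match is None:
        return None
    return match.group("pkg"), match.group("advisory")
```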
Metadata Files
advisory.json
{
"advisoryId": "DSA-5678-1",
"cves": ["CVE-2024-1234", "CVE-2024-5678"],
"package": "openssl",
"vulnerableVersions": ["3.0.10-1"],
"fixedVersions": ["3.0.11-1"],
"severity": "high",
"publishedAt": "2024-11-15T00:00:00Z",
"summary": "Multiple vulnerabilities in OpenSSL"
}
pair-manifest.json
{
"pairId": "openssl-DSA-5678-1",
"package": "openssl",
"distribution": "debian",
"suite": "bookworm",
"architecture": "amd64",
"preVersion": "3.0.10-1",
"postVersion": "3.0.11-1",
"binaries": [
"libssl3",
"libcrypto3"
],
"createdAt": "2026-01-15T10:00:00Z",
"validatedAt": "2026-01-22T12:00:00Z"
}
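The two metadata files above overlap in several fields, which makes a consistency check possible at ingestion time. The following sketch assumes that `pairId` is always `<package>-<advisoryId>`, as the samples suggest, and that the pair's versions must appear in the advisory's version lists; neither rule is stated explicitly in this document, and `cross_check` is a hypothetical helper.

```python
def cross_check(advisory: dict, manifest: dict) -> list[str]:
    """Return human-readable inconsistencies between advisory.json and pair-manifest.json."""
    problems = []
    if advisory["package"] != manifest["package"]:
        problems.append("package differs between advisory and manifest")
    if manifest["preVersion"] not in advisory["vulnerableVersions"]:
        problems.append("preVersion is not listed in vulnerableVersions")
    if manifest["postVersion"] not in advisory["fixedVersions"]:
        problems.append("postVersion is not listed in fixedVersions")
    # Assumed convention: pairId is the package name joined to the advisory ID.
    expected_pair_id = f"{manifest['package']}-{advisory['advisoryId']}"
    if manifest["pairId"] != expected_pair_id:
        problems.append(f"pairId should be {expected_pair_id}")
    return problems
```

For the sample files shown in this section, the function returns an empty list.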
ground-truth.json
{
"pairId": "openssl-DSA-5678-1",
"binary": "libcrypto.so.3",
"functions": [
{
"name": "EVP_DigestInit_ex",
"preAddress": "0x12345",
"postAddress": "0x12347",
"status": "modified",
"confidence": 1.0
},
{
"name": "EVP_DigestUpdate",
"preAddress": "0x12400",
"postAddress": "0x12400",
"status": "unchanged",
"confidence": 1.0
}
],
"metadata": {
"generatedBy": "manual-annotation",
"reviewedBy": "security-team",
"reviewedAt": "2026-01-20T14:00:00Z"
}
}
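Ground truth in this shape can be used to score a matcher's output. The sketch below treats `"modified"` as the positive class and takes the matcher's output as a set of function names it flagged as modified. How the harness actually derives the `precision`, `recall`, and `falseNegativeRate` KPIs is not described here, so this is an illustration of the standard definitions, and `score_predictions` is a hypothetical name.

```python
def score_predictions(ground_truth: dict, predicted_modified: set[str]) -> dict[str, float]:
    """Score a set of functions predicted as modified against ground-truth.json content."""
    actual = {
        fn["name"] for fn in ground_truth["functions"] if fn["status"] == "modified"
    }
    true_pos = len(actual & predicted_modified)
    false_pos = len(predicted_modified - actual)
    false_neg = len(actual - predicted_modified)
    # Guard against empty denominators when nothing is predicted or nothing is modified.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    fnr = false_neg / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return {"precision": precision, "recall": recall, "falseNegativeRate": fnr}
```

When at least one function is modified, `falseNegativeRate` equals `1 - recall`, which matches the relationship between the two values in the sample baseline (0.92 and 0.08).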
Access Patterns
Read-Only Access
- Validation harness reads corpus pairs
- CI reads baselines for regression checks
- Auditors read evidence bundles
Write Access
- Corpus ingestion adds new pairs
- Baseline update writes new baseline files
- Bundle export creates evidence bundles
Sync Access
- Mirror sync updates upstream caches
- Scheduled jobs refresh OSV database
Storage Requirements
| Component | Typical Size | Growth Rate |
|---|---|---|
| Corpus (per pair) | 50-500 MB | N/A |
| Mirrors (Debian) | 10-50 GB | Monthly |
| Mirrors (Ubuntu) | 5-20 GB | Monthly |
| Mirrors (Alpine) | 1-5 GB | Monthly |
| OSV Database | 500 MB | Weekly |
| Evidence bundles | 100-500 MB each | Per pair |
| Baselines | < 10 KB each | Per run |
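For capacity planning, the ranges in the table can be combined into a rough lower and upper bound for a given number of pairs. This sketch assumes one evidence bundle per pair, ignores baselines (under 10 KB each) and mirror growth over time, and uses 1 GB = 1024 MB; `estimate_storage_gb` is a hypothetical helper.

```python
# (low, high) sizes taken from the table above.
CORPUS_MB_PER_PAIR = (50, 500)
EVIDENCE_MB_PER_PAIR = (100, 500)
MIRRORS_GB = {"debian": (10, 50), "ubuntu": (5, 20), "alpine": (1, 5)}
OSV_MB = 500


def estimate_storage_gb(pair_count: int) -> tuple[float, float]:
    """Return a (low, high) storage estimate in GB for pair_count pairs."""
    bounds = []
    for i in (0, 1):  # 0 = low end of each range, 1 = high end
        per_pair_mb = CORPUS_MB_PER_PAIR[i] + EVIDENCE_MB_PER_PAIR[i]
        mirrors_gb = sum(sizes[i] for sizes in MIRRORS_GB.values())
        bounds.append((pair_count * per_pair_mb + OSV_MB) / 1024 + mirrors_gb)
    return bounds[0], bounds[1]
```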