# Golden Corpus Folder Layout

Sprint: SPRINT_20260121_036_BinaryIndex_golden_corpus_bundle_verification

Task: GCB-006 - Document corpus folder layout and maintenance procedures

## Overview

The golden corpus is a curated dataset of pre/post security patch binary pairs used for:

- Validating binary matching algorithms
- Benchmarking reproducibility verification
- Training machine learning models for function identification
- Generating audit-ready evidence bundles

## Root Layout

```
golden-corpus/
├── corpus/                    # Security pairs organized by distro
│   ├── debian/
│   ├── ubuntu/
│   └── alpine/
├── mirrors/                   # Local mirrors of upstream sources
│   ├── debian/
│   ├── ubuntu/
│   ├── alpine/
│   └── osv/
├── harness/                   # Build and verification tooling
│   ├── chroots/
│   ├── lifter-matcher/
│   ├── sbom-canonicalizer/
│   └── verifier/
├── evidence/                  # Generated evidence bundles
│   └── <package>-<advisory-id>-bundle.oci.tar
└── bench/                     # Benchmark data and baselines
    ├── baselines/
    └── results/
```

## Corpus Directory Structure

Each security pair follows a consistent structure:

```
corpus/<distro>/<package>/<advisory-id>/
├── pre/                       # Pre-patch (vulnerable) artifacts
│   ├── src/                   # Source code
│   │   ├── *.tar.gz           # Original source tarball
│   │   ├── debian/            # Packaging metadata
│   │   └── buildinfo          # Build reproducibility info
│   └── debs/                  # Built binaries
│       ├── *.deb              # Binary packages
│       ├── *.ddeb             # Debug symbols
│       └── buildlog           # Build log
├── post/                      # Post-patch (fixed) artifacts
│   ├── src/
│   └── debs/
└── metadata/
    ├── advisory.json          # Advisory details
    ├── osv.json               # OSV format vulnerability
    ├── pair-manifest.json     # Pair configuration
    └── ground-truth.json      # Function-level ground truth
```

### Debian Example

```
corpus/debian/openssl/DSA-5678-1/
├── pre/
│   ├── src/
│   │   ├── openssl_3.0.10.orig.tar.gz
│   │   ├── openssl_3.0.10-1.debian.tar.xz
│   │   ├── openssl_3.0.10-1.dsc
│   │   └── openssl_3.0.10-1.buildinfo
│   └── debs/
│       ├── libssl3_3.0.10-1_amd64.deb
│       ├── libssl3-dbgsym_3.0.10-1_amd64.ddeb
│       └── build.log
├── post/
│   ├── src/
│   │   ├── openssl_3.0.11.orig.tar.gz
│   │   ├── openssl_3.0.11-1.debian.tar.xz
│   │   └── ...
│   └── debs/
│       └── ...
└── metadata/
    ├── advisory.json
    └── ground-truth.json
```

### Ubuntu Example

```
corpus/ubuntu/curl/USN-1234-1/
├── pre/
│   ├── src/
│   │   └── curl_8.4.0-1ubuntu1.tar.xz
│   └── debs/
│       └── libcurl4_8.4.0-1ubuntu1_amd64.deb
├── post/
│   └── ...
└── metadata/
    ├── advisory.json
    └── usn.json
```

### Alpine Example

```
corpus/alpine/zlib/CVE-2022-37434/
├── pre/
│   ├── src/
│   │   └── APKBUILD
│   └── apks/
│       └── zlib-1.2.12-r2.apk
├── post/
│   └── ...
└── metadata/
    └── secdb-entry.json
```

## Mirrors Directory Structure

Local mirrors cache upstream artifacts for offline operation:

```
mirrors/
├── debian/
│   ├── archive/               # snapshot.debian.org mirrors
│   │   └── pool/main/o/openssl/
│   ├── snapshot/              # Point-in-time snapshots
│   │   └── 20260101T000000Z/
│   └── buildinfo/             # buildinfos.debian.net cache
│       └── /
├── ubuntu/
│   ├── archive/               # archive.ubuntu.com mirrors
│   ├── usn-index/             # USN metadata
│   │   └── usn-db.json
│   └── launchpad/             # Build logs from Launchpad
├── alpine/
│   ├── packages/              # Alpine package mirror
│   └── secdb/                 # Security database
│       └── community.json
└── osv/
    ├── all.zip                # Full OSV database
    └── debian/                # Distro-specific extracts
```

## Harness Directory Structure

Build and verification tooling:

```
harness/
├── chroots/                   # Build environments
│   ├── debian-bookworm-amd64/
│   ├── debian-bullseye-amd64/
│   ├── ubuntu-noble-amd64/
│   └── alpine-3.19-amd64/
├── lifter-matcher/            # Binary analysis tools
│   ├── ghidra/                # Ghidra installation
│   ├── bsim-server/           # BSim database server
│   └── semantic-diffing/      # Semantic diff tools
├── sbom-canonicalizer/        # SBOM normalization
│   └── config/
└── verifier/                  # Standalone verifier
    ├── stella-verifier        # Verifier binary
    └── trust-profiles/        # Trust profiles
```

## Evidence Directory Structure

Generated bundles for audit/compliance:

```
evidence/
├── openssl-DSA-5678-1-bundle.oci.tar
├── curl-USN-1234-1-bundle.oci.tar
└── manifests/
    └── inventory.json
```

### Bundle Internal Structure (OCI Format)

```
openssl-DSA-5678-1-bundle.oci.tar/
├── oci-layout                 # OCI layout version
├── index.json                 # OCI index with referrers
├── blobs/
│   └── sha256/
│       ├── <digest>           # Bundle manifest
│       ├── <digest>           # Pre-patch SBOM
│       ├── <digest>           # Post-patch SBOM
│       ├── <digest>           # Pre-patch binary
│       ├── <digest>           # Post-patch binary
│       ├── <digest>           # DSSE delta-sig predicate
│       ├── <digest>           # Build provenance
│       └── <digest>           # RFC 3161 timestamp
└── manifest.json              # Signed bundle manifest
```

## Bench Directory Structure

Benchmark data and KPI baselines:

```
bench/
├── baselines/
│   ├── current.json           # Active KPI baseline
│   └── archive/               # Historical baselines
│       ├── baseline-20260115.json
│       └── baseline-20260108.json
├── results/
│   ├── 20260122120000.json    # Validation run results
│   └── ...
└── reports/
    └── regression-report-*.md
```

### Baseline File Format

```json
{
  "baselineId": "baseline-20260122120000",
  "createdAt": "2026-01-22T12:00:00Z",
  "source": "abc123def456",
  "description": "Post-semantic-diffing-v2 baseline",
  "precision": 0.95,
  "recall": 0.92,
  "falseNegativeRate": 0.08,
  "deterministicReplayRate": 1.0,
  "ttfrpP95Ms": 150,
  "additionalKpis": {}
}
```

## File Naming Conventions

| Type | Pattern | Example |
|------|---------|---------|
| Advisory ID (Debian) | `DSA-<number>-<revision>` | `DSA-5678-1` |
| Advisory ID (Ubuntu) | `USN-<number>-<revision>` | `USN-1234-1` |
| Advisory ID (Alpine) | `CVE-<year>-<number>` | `CVE-2022-37434` |
| Bundle file | `<package>-<advisory-id>-bundle.oci.tar` | `openssl-DSA-5678-1-bundle.oci.tar` |
| Baseline file | `baseline-<timestamp>.json` | `baseline-20260122120000.json` |
| Results file | `<timestamp>.json` | `20260122120000.json` |

## Metadata Files

### advisory.json

```json
{
  "advisoryId": "DSA-5678-1",
  "cves": ["CVE-2024-1234", "CVE-2024-5678"],
  "package": "openssl",
  "vulnerableVersions": ["3.0.10-1"],
  "fixedVersions": ["3.0.11-1"],
  "severity": "high",
  "publishedAt": "2024-11-15T00:00:00Z",
  "summary": "Multiple vulnerabilities in OpenSSL"
}
```

### pair-manifest.json

```json
{
  "pairId": "openssl-DSA-5678-1",
  "package": "openssl",
  "distribution": "debian",
  "suite": "bookworm",
  "architecture": "amd64",
  "preVersion": "3.0.10-1",
  "postVersion": "3.0.11-1",
  "binaries": [
    "libssl3",
    "libcrypto3"
  ],
  "createdAt": "2026-01-15T10:00:00Z",
  "validatedAt": "2026-01-22T12:00:00Z"
}
```

### ground-truth.json

```json
{
  "pairId": "openssl-DSA-5678-1",
  "binary": "libcrypto.so.3",
  "functions": [
    {
      "name": "EVP_DigestInit_ex",
      "preAddress": "0x12345",
      "postAddress": "0x12347",
      "status": "modified",
      "confidence": 1.0
    },
    {
      "name": "EVP_DigestUpdate",
      "preAddress": "0x12400",
      "postAddress": "0x12400",
      "status": "unchanged",
      "confidence": 1.0
    }
  ],
  "metadata": {
    "generatedBy": "manual-annotation",
    "reviewedBy": "security-team",
    "reviewedAt": "2026-01-20T14:00:00Z"
  }
}
```

## Access Patterns

### Read-Only Access

- Validation harness reads corpus pairs
- CI reads baselines for regression checks
- Auditors read evidence bundles

### Write Access

- Corpus ingestion adds new pairs
- Baseline update writes new baseline files
- Bundle export creates evidence bundles

### Sync Access

- Mirror sync updates upstream caches
- Scheduled jobs refresh OSV database

## Storage Requirements

| Component | Typical Size | Growth Rate |
|-----------|--------------|-------------|
| Corpus (per pair) | 50-500 MB | N/A |
| Mirrors (Debian) | 10-50 GB | Monthly |
| Mirrors (Ubuntu) | 5-20 GB | Monthly |
| Mirrors (Alpine) | 1-5 GB | Monthly |
| OSV Database | 500 MB | Weekly |
| Evidence bundles | 100-500 MB each | Per pair |
| Baselines | < 10 KB each | Per run |

## Related Documentation

- [Ground Truth Corpus Overview](ground-truth-corpus.md)
- [Golden Corpus Maintenance](golden-corpus-maintenance.md)
- [Corpus Ingestion Operations](corpus-ingestion-operations.md)
- [Golden Corpus Operations Runbook](../../runbooks/golden-corpus-operations.md)