git.stella-ops.org/docs/modules/binary-index/golden-corpus-layout.md
2026-01-22 19:08:46 +02:00

Golden Corpus Folder Layout

Sprint: SPRINT_20260121_036_BinaryIndex_golden_corpus_bundle_verification
Task: GCB-006 - Document corpus folder layout and maintenance procedures

Overview

The golden corpus is a curated dataset of binary pairs, each pair consisting of the same package built before and after a security patch. It is used for:

  • Validating binary matching algorithms
  • Benchmarking reproducibility verification
  • Training machine learning models for function identification
  • Generating audit-ready evidence bundles

Root Layout

golden-corpus/
├── corpus/                    # Security pairs organized by distro
│   ├── debian/
│   ├── ubuntu/
│   └── alpine/
├── mirrors/                   # Local mirrors of upstream sources
│   ├── debian/
│   ├── ubuntu/
│   ├── alpine/
│   └── osv/
├── harness/                   # Build and verification tooling
│   ├── chroots/
│   ├── lifter-matcher/
│   ├── sbom-canonicalizer/
│   └── verifier/
├── evidence/                  # Generated evidence bundles
│   └── <pkg>-<advisory>-bundle.oci.tar
└── bench/                     # Benchmark data and baselines
    ├── baselines/
    ├── results/
    └── reports/

Corpus Directory Structure

Each security pair follows a consistent structure:

corpus/<distro>/<package>/<advisory-id>/
├── pre/                       # Pre-patch (vulnerable) artifacts
│   ├── src/                   # Source code
│   │   ├── *.tar.gz          # Original source tarball
│   │   ├── debian/           # Packaging metadata
│   │   └── buildinfo         # Build reproducibility info
│   └── debs/                  # Built binaries
│       ├── *.deb             # Binary packages
│       ├── *.ddeb            # Debug symbols
│       └── build.log         # Build log
├── post/                      # Post-patch (fixed) artifacts
│   ├── src/
│   └── debs/
└── metadata/
    ├── advisory.json         # Advisory details
    ├── osv.json              # OSV format vulnerability
    ├── pair-manifest.json    # Pair configuration
    └── ground-truth.json     # Function-level ground truth
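
The layout above lends itself to a mechanical check. The following is a minimal sketch, not part of the documented tooling: `pair_dir` and `missing_dirs` are hypothetical helper names, and treating `pre/src`, `post/src`, and `metadata` as required is an assumption. The binaries directory is left out of the check because its name differs by distro (`debs/` for Debian and Ubuntu, `apks/` for Alpine).

```python
from pathlib import Path

# Sub-paths taken from the layout above. Treating all of them as required
# is an assumption of this sketch, not a documented rule.
EXPECTED_DIRS = ("pre/src", "post/src", "metadata")


def pair_dir(root: Path, distro: str, package: str, advisory_id: str) -> Path:
    """Build <root>/corpus/<distro>/<package>/<advisory-id>/."""
    return root / "corpus" / distro / package / advisory_id


def missing_dirs(pair: Path) -> list[str]:
    """Return the expected sub-directories that do not exist under `pair`."""
    return [rel for rel in EXPECTED_DIRS if not (pair / rel).is_dir()]
```

For example, `missing_dirs(pair_dir(Path("golden-corpus"), "debian", "openssl", "DSA-5678-1"))` returns an empty list when all three directories are present.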

Debian Example

corpus/debian/openssl/DSA-5678-1/
├── pre/
│   ├── src/
│   │   ├── openssl_3.0.10.orig.tar.gz
│   │   ├── openssl_3.0.10-1.debian.tar.xz
│   │   ├── openssl_3.0.10-1.dsc
│   │   └── openssl_3.0.10-1.buildinfo
│   └── debs/
│       ├── libssl3_3.0.10-1_amd64.deb
│       ├── libssl3-dbgsym_3.0.10-1_amd64.ddeb
│       └── build.log
├── post/
│   ├── src/
│   │   ├── openssl_3.0.11.orig.tar.gz
│   │   ├── openssl_3.0.11-1.debian.tar.xz
│   │   └── ...
│   └── debs/
│       └── ...
└── metadata/
    ├── advisory.json
    └── ground-truth.json

Ubuntu Example

corpus/ubuntu/curl/USN-1234-1/
├── pre/
│   ├── src/
│   │   └── curl_8.4.0-1ubuntu1.tar.xz
│   └── debs/
│       └── libcurl4_8.4.0-1ubuntu1_amd64.deb
├── post/
│   └── ...
└── metadata/
    ├── advisory.json
    └── usn.json

Alpine Example

corpus/alpine/zlib/CVE-2022-37434/
├── pre/
│   ├── src/
│   │   └── APKBUILD
│   └── apks/
│       └── zlib-1.2.12-r2.apk
├── post/
│   └── ...
└── metadata/
    └── secdb-entry.json

Mirrors Directory Structure

Local mirrors cache upstream artifacts for offline operation:

mirrors/
├── debian/
│   ├── archive/              # snapshot.debian.org mirrors
│   │   └── pool/main/o/openssl/
│   ├── snapshot/             # Point-in-time snapshots
│   │   └── 20260101T000000Z/
│   └── buildinfo/            # buildinfos.debian.net cache
│       └── <source-name>/
├── ubuntu/
│   ├── archive/              # archive.ubuntu.com mirrors
│   ├── usn-index/            # USN metadata
│   │   └── usn-db.json
│   └── launchpad/            # Build logs from Launchpad
├── alpine/
│   ├── packages/             # Alpine package mirror
│   └── secdb/                # Security database
│       └── community.json
└── osv/
    ├── all.zip               # Full OSV database
    └── debian/               # Distro-specific extracts

Harness Directory Structure

Build and verification tooling:

harness/
├── chroots/                  # Build environments
│   ├── debian-bookworm-amd64/
│   ├── debian-bullseye-amd64/
│   ├── ubuntu-noble-amd64/
│   └── alpine-3.19-amd64/
├── lifter-matcher/           # Binary analysis tools
│   ├── ghidra/               # Ghidra installation
│   ├── bsim-server/          # BSim database server
│   └── semantic-diffing/     # Semantic diff tools
├── sbom-canonicalizer/       # SBOM normalization
│   └── config/
└── verifier/                 # Standalone verifier
    ├── stella-verifier       # Verifier binary
    └── trust-profiles/       # Trust profiles

Evidence Directory Structure

Generated bundles for audit/compliance:

evidence/
├── openssl-DSA-5678-1-bundle.oci.tar
├── curl-USN-1234-1-bundle.oci.tar
└── manifests/
    └── inventory.json

Bundle Internal Structure (OCI Format)

openssl-DSA-5678-1-bundle.oci.tar/
├── oci-layout               # OCI layout version
├── index.json               # OCI index with referrers
├── blobs/
│   └── sha256/
│       ├── <manifest>       # Bundle manifest
│       ├── <sbom-pre>       # Pre-patch SBOM
│       ├── <sbom-post>      # Post-patch SBOM
│       ├── <binary-pre>     # Pre-patch binary
│       ├── <binary-post>    # Post-patch binary
│       ├── <delta-sig>      # DSSE delta-sig predicate
│       ├── <provenance>     # Build provenance
│       └── <timestamp>      # RFC 3161 timestamp
└── manifest.json            # Signed bundle manifest
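
In an OCI image layout, each file under `blobs/sha256/` is named after the SHA-256 digest of its own content, so a bundle can be given a basic integrity check without any external metadata. The sketch below illustrates that check only. `mismatched_blobs` is a hypothetical name, and it is not a description of what `stella-verifier` does; signature and timestamp verification are out of scope here.

```python
import hashlib
import tarfile

BLOB_PREFIX = "blobs/sha256/"


def mismatched_blobs(bundle_path: str) -> list[str]:
    """Return blob member names whose content does not hash to their file name."""
    bad = []
    with tarfile.open(bundle_path) as tar:
        for member in tar.getmembers():
            # Some tar writers prefix member names with "./"; normalise that.
            name = member.name[2:] if member.name.startswith("./") else member.name
            if not (member.isfile() and name.startswith(BLOB_PREFIX)):
                continue
            # Reads the whole blob into memory, which is fine for a sketch.
            data = tar.extractfile(member).read()
            if hashlib.sha256(data).hexdigest() != name[len(BLOB_PREFIX):]:
                bad.append(name)
    return bad
```

An empty return value means every blob matches its digest-derived name.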

Bench Directory Structure

Benchmark data and KPI baselines:

bench/
├── baselines/
│   ├── current.json         # Active KPI baseline
│   └── archive/             # Historical baselines
│       ├── baseline-20260115.json
│       └── baseline-20260108.json
├── results/
│   ├── 20260122120000.json  # Validation run results
│   └── ...
└── reports/
    └── regression-report-*.md

Baseline File Format

{
  "baselineId": "baseline-20260122120000",
  "createdAt": "2026-01-22T12:00:00Z",
  "source": "abc123def456",
  "description": "Post-semantic-diffing-v2 baseline",
  "precision": 0.95,
  "recall": 0.92,
  "falseNegativeRate": 0.08,
  "deterministicReplayRate": 1.0,
  "ttfrpP95Ms": 150,
  "additionalKpis": {}
}
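
A regression check against a baseline of this shape could look like the sketch below. Several things here are assumptions rather than documented behaviour: that result files carry the same KPI field names as the baseline, that `ttfrpP95Ms` is a latency where lower is better, and the helper name `regressions` itself.

```python
import json

# Which KPIs must not drop and which must not rise is an assumption
# of this sketch, inferred from the field names.
HIGHER_IS_BETTER = ("precision", "recall", "deterministicReplayRate")
LOWER_IS_BETTER = ("falseNegativeRate", "ttfrpP95Ms")


def regressions(baseline: dict, result: dict) -> list[str]:
    """List the KPI names where `result` is worse than `baseline`."""
    worse = [k for k in HIGHER_IS_BETTER if result[k] < baseline[k]]
    worse += [k for k in LOWER_IS_BETTER if result[k] > baseline[k]]
    return worse


baseline = json.loads("""
{
  "precision": 0.95,
  "recall": 0.92,
  "falseNegativeRate": 0.08,
  "deterministicReplayRate": 1.0,
  "ttfrpP95Ms": 150
}
""")
# Hypothetical run result: recall dropped, so the false-negative rate rose.
result = dict(baseline, recall=0.90, falseNegativeRate=0.10)
print(regressions(baseline, result))  # ['recall', 'falseNegativeRate']
```

A real check would probably allow a small tolerance per KPI; that is omitted here because no thresholds are given in this document.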

File Naming Conventions

| Type | Pattern | Example |
|------|---------|---------|
| Advisory ID (Debian) | DSA-<number>-<revision> | DSA-5678-1 |
| Advisory ID (Ubuntu) | USN-<number>-<revision> | USN-1234-1 |
| Advisory ID (Alpine) | CVE-<year>-<number> | CVE-2022-37434 |
| Bundle file | <pkg>-<advisory>-bundle.oci.tar | openssl-DSA-5678-1-bundle.oci.tar |
| Baseline file | baseline-<timestamp>.json | baseline-20260122120000.json |
| Results file | <timestamp>.json | 20260122120000.json |
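
These conventions translate directly into patterns. The sketch below is illustrative: the helper names are hypothetical, and the exact digit counts (any number of digits for `<number>` and `<revision>`, four digits for `<year>`) are assumptions, since the conventions only give placeholders.

```python
import re

# One pattern per distro, following the advisory ID conventions above.
ADVISORY_PATTERNS = {
    "debian": re.compile(r"DSA-\d+-\d+"),
    "ubuntu": re.compile(r"USN-\d+-\d+"),
    "alpine": re.compile(r"CVE-\d{4}-\d+"),
}


def is_valid_advisory_id(distro: str, advisory_id: str) -> bool:
    """Check an advisory ID against the pattern for its distro."""
    pattern = ADVISORY_PATTERNS.get(distro)
    return bool(pattern and pattern.fullmatch(advisory_id))


def bundle_file_name(package: str, advisory_id: str) -> str:
    """Build <pkg>-<advisory>-bundle.oci.tar."""
    return f"{package}-{advisory_id}-bundle.oci.tar"
```

With these, `bundle_file_name("openssl", "DSA-5678-1")` yields `openssl-DSA-5678-1-bundle.oci.tar`, matching the bundle example above.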

Metadata Files

advisory.json

{
  "advisoryId": "DSA-5678-1",
  "cves": ["CVE-2024-1234", "CVE-2024-5678"],
  "package": "openssl",
  "vulnerableVersions": ["3.0.10-1"],
  "fixedVersions": ["3.0.11-1"],
  "severity": "high",
  "publishedAt": "2024-11-15T00:00:00Z",
  "summary": "Multiple vulnerabilities in OpenSSL"
}

pair-manifest.json

{
  "pairId": "openssl-DSA-5678-1",
  "package": "openssl",
  "distribution": "debian",
  "suite": "bookworm",
  "architecture": "amd64",
  "preVersion": "3.0.10-1",
  "postVersion": "3.0.11-1",
  "binaries": [
    "libssl3",
    "libcrypto3"
  ],
  "createdAt": "2026-01-15T10:00:00Z",
  "validatedAt": "2026-01-22T12:00:00Z"
}
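
The manifest carries enough information to locate the pair on disk and to run a few sanity checks. The sketch below assumes that `pairId` has the form `<package>-<advisory-id>`, which is inferred from the example rather than stated, and the consistency rules in `manifest_problems` are likewise illustrative, not documented requirements.

```python
from datetime import datetime


def parse_utc(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing 'Z' (UTC)."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def corpus_relpath(manifest: dict) -> str:
    """Derive corpus/<distro>/<package>/<advisory-id> from the manifest.

    Assumes pairId is "<package>-<advisory-id>".
    """
    package = manifest["package"]
    advisory_id = manifest["pairId"][len(package) + 1:]
    return f"corpus/{manifest['distribution']}/{package}/{advisory_id}"


def manifest_problems(manifest: dict) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    if not manifest["pairId"].startswith(manifest["package"] + "-"):
        problems.append("pairId does not start with package name")
    if manifest["preVersion"] == manifest["postVersion"]:
        problems.append("preVersion equals postVersion")
    if parse_utc(manifest["validatedAt"]) < parse_utc(manifest["createdAt"]):
        problems.append("validatedAt precedes createdAt")
    return problems
```

For the manifest shown above, `corpus_relpath` gives `corpus/debian/openssl/DSA-5678-1` and `manifest_problems` returns an empty list.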

ground-truth.json

{
  "pairId": "openssl-DSA-5678-1",
  "binary": "libcrypto.so.3",
  "functions": [
    {
      "name": "EVP_DigestInit_ex",
      "preAddress": "0x12345",
      "postAddress": "0x12347",
      "status": "modified",
      "confidence": 1.0
    },
    {
      "name": "EVP_DigestUpdate",
      "preAddress": "0x12400",
      "postAddress": "0x12400",
      "status": "unchanged",
      "confidence": 1.0
    }
  ],
  "metadata": {
    "generatedBy": "manual-annotation",
    "reviewedBy": "security-team",
    "reviewedAt": "2026-01-20T14:00:00Z"
  }
}
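
Ground truth in this form can be used to score a matcher. How the `precision` and `recall` KPIs in the baseline are actually computed is not specified in this document; the sketch below assumes one plausible definition, in which functions with `status` equal to `modified` are the positive class and the matcher reports a set of function names it believes were modified. The name `score_modified` is hypothetical.

```python
def score_modified(ground_truth: dict, predicted_modified: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for the 'modified' class.

    Empty denominators are scored as 1.0; this convention is a choice
    made for the sketch, not a documented rule.
    """
    actual = {
        f["name"] for f in ground_truth["functions"] if f["status"] == "modified"
    }
    true_positives = len(actual & predicted_modified)
    precision = true_positives / len(predicted_modified) if predicted_modified else 1.0
    recall = true_positives / len(actual) if actual else 1.0
    return precision, recall
```

Against the example above, a matcher that flags both `EVP_DigestInit_ex` and `EVP_DigestUpdate` scores a precision of 0.5 and a recall of 1.0, because only the first function is marked `modified`.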

Access Patterns

Read-Only Access

  • Validation harness reads corpus pairs
  • CI reads baselines for regression checks
  • Auditors read evidence bundles

Write Access

  • Corpus ingestion adds new pairs
  • Baseline update writes new baseline files
  • Bundle export creates evidence bundles

Sync Access

  • Mirror sync updates upstream caches
  • Scheduled jobs refresh OSV database

Storage Requirements

| Component | Typical Size | Growth Rate |
|-----------|--------------|-------------|
| Corpus (per pair) | 50-500 MB | N/A |
| Mirrors (Debian) | 10-50 GB | Monthly |
| Mirrors (Ubuntu) | 5-20 GB | Monthly |
| Mirrors (Alpine) | 1-5 GB | Monthly |
| OSV Database | 500 MB | Weekly |
| Evidence bundles | 100-500 MB each | Per pair |
| Baselines | < 10 KB each | Per run |
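
For capacity planning, the figures above can be combined into a rough range. The sketch below assumes one evidence bundle per pair, all three distro mirrors plus the OSV database present, 1 GB = 1000 MB, and baselines small enough to ignore. The function name `estimate_gb` is hypothetical.

```python
def estimate_gb(pairs: int) -> tuple[float, float]:
    """Return a (low, high) estimate in GB for a corpus with `pairs` pairs."""
    # Per pair: corpus artifacts (50-500 MB) plus one bundle (100-500 MB).
    per_pair_mb = (50 + 100, 500 + 500)
    # Fixed cost: Debian + Ubuntu + Alpine mirrors plus the OSV database.
    fixed_gb = (10 + 5 + 1 + 0.5, 50 + 20 + 5 + 0.5)
    low, high = (
        fixed + pairs * mb / 1000 for fixed, mb in zip(fixed_gb, per_pair_mb)
    )
    return low, high


print(estimate_gb(100))  # (31.5, 175.5)
```

With 100 pairs the per-pair data contributes 15 to 100 GB on top of 16.5 to 75.5 GB of mirrors and OSV data.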