# Golden Corpus Folder Layout

Sprint: SPRINT_20260121_036_BinaryIndex_golden_corpus_bundle_verification
Task: GCB-006 - Document corpus folder layout and maintenance procedures

## Overview

The golden corpus is a curated dataset of pre/post security patch binary pairs used for:
- Validating binary matching algorithms
- Benchmarking reproducibility verification
- Training machine learning models for function identification
- Generating audit-ready evidence bundles

## Root Layout

```
golden-corpus/
├── corpus/                      # Security pairs organized by distro
│   ├── debian/
│   ├── ubuntu/
│   └── alpine/
├── mirrors/                     # Local mirrors of upstream sources
│   ├── debian/
│   ├── ubuntu/
│   ├── alpine/
│   └── osv/
├── harness/                     # Build and verification tooling
│   ├── chroots/
│   ├── lifter-matcher/
│   ├── sbom-canonicalizer/
│   └── verifier/
├── evidence/                    # Generated evidence bundles
│   └── <pkg>-<advisory>-bundle.oci.tar
└── bench/                       # Benchmark data and baselines
    ├── baselines/
    └── results/
```
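
As a rough illustration, the root layout above can be checked with a few lines of Python. This is only a sketch: the `EXPECTED_LAYOUT` table and the `missing_directories` helper are hypothetical names introduced here, not part of the harness, and the check covers only the directories shown in the tree.

```python
from pathlib import Path

# Top-level directories and their immediate children, copied from the tree above.
# (Hypothetical table for illustration; not read from any real configuration.)
EXPECTED_LAYOUT = {
    "corpus": ["debian", "ubuntu", "alpine"],
    "mirrors": ["debian", "ubuntu", "alpine", "osv"],
    "harness": ["chroots", "lifter-matcher", "sbom-canonicalizer", "verifier"],
    "evidence": [],
    "bench": ["baselines", "results"],
}


def missing_directories(root: Path) -> list[str]:
    """Return the relative paths of expected directories that are absent under root."""
    missing = []
    for top, children in EXPECTED_LAYOUT.items():
        for rel in [top, *(f"{top}/{child}" for child in children)]:
            if not (root / rel).is_dir():
                missing.append(rel)
    return missing
```

Calling `missing_directories(Path("golden-corpus"))` on a complete checkout would be expected to return an empty list.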

## Corpus Directory Structure

Each security pair follows a consistent structure:

```
corpus/<distro>/<package>/<advisory-id>/
├── pre/                         # Pre-patch (vulnerable) artifacts
│   ├── src/                     # Source code
│   │   ├── *.tar.gz             # Original source tarball
│   │   ├── debian/              # Packaging metadata
│   │   └── buildinfo            # Build reproducibility info
│   └── debs/                    # Built binaries
│       ├── *.deb                # Binary packages
│       ├── *.ddeb               # Debug symbols
│       └── buildlog             # Build log
├── post/                        # Post-patch (fixed) artifacts
│   ├── src/
│   └── debs/
└── metadata/
    ├── advisory.json            # Advisory details
    ├── osv.json                 # OSV format vulnerability
    ├── pair-manifest.json       # Pair configuration
    └── ground-truth.json        # Function-level ground truth
```

### Debian Example

```
corpus/debian/openssl/DSA-5678-1/
├── pre/
│   ├── src/
│   │   ├── openssl_3.0.10.orig.tar.gz
│   │   ├── openssl_3.0.10-1.debian.tar.xz
│   │   ├── openssl_3.0.10-1.dsc
│   │   └── openssl_3.0.10-1.buildinfo
│   └── debs/
│       ├── libssl3_3.0.10-1_amd64.deb
│       ├── libssl3-dbgsym_3.0.10-1_amd64.ddeb
│       └── build.log
├── post/
│   ├── src/
│   │   ├── openssl_3.0.11.orig.tar.gz
│   │   ├── openssl_3.0.11-1.debian.tar.xz
│   │   └── ...
│   └── debs/
│       └── ...
└── metadata/
    ├── advisory.json
    └── ground-truth.json
```

### Ubuntu Example

```
corpus/ubuntu/curl/USN-1234-1/
├── pre/
│   ├── src/
│   │   └── curl_8.4.0-1ubuntu1.tar.xz
│   └── debs/
│       └── libcurl4_8.4.0-1ubuntu1_amd64.deb
├── post/
│   └── ...
└── metadata/
    ├── advisory.json
    └── usn.json
```

### Alpine Example

```
corpus/alpine/zlib/CVE-2022-37434/
├── pre/
│   ├── src/
│   │   └── APKBUILD
│   └── apks/
│       └── zlib-1.2.12-r2.apk
├── post/
│   └── ...
└── metadata/
    └── secdb-entry.json
```
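
The per-pair structure and the three distro examples above differ mainly in the package directory (`debs/` for Debian and Ubuntu, `apks/` for Alpine). A minimal sketch of how a tool might compute the subdirectories to expect for a pair is shown below. The names `PACKAGE_DIR`, `pair_dir`, and `expected_subdirs` are hypothetical, and the sketch assumes `post/` mirrors `pre/`, which the examples elide with `...`.

```python
from pathlib import PurePosixPath

# Package directory per distro, as seen in the examples above.
PACKAGE_DIR = {"debian": "debs", "ubuntu": "debs", "alpine": "apks"}


def pair_dir(distro: str, package: str, advisory_id: str) -> PurePosixPath:
    """Build corpus/<distro>/<package>/<advisory-id> relative to the corpus root."""
    return PurePosixPath("corpus", distro, package, advisory_id)


def expected_subdirs(distro: str) -> list[str]:
    """List the subdirectories a pair is assumed to contain for the given distro."""
    pkg = PACKAGE_DIR[distro]
    subdirs = [f"{side}/{sub}" for side in ("pre", "post") for sub in ("src", pkg)]
    return subdirs + ["metadata"]
```

For example, `pair_dir("alpine", "zlib", "CVE-2022-37434")` corresponds to the Alpine example path above.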

## Mirrors Directory Structure

Local mirrors cache upstream artifacts for offline operation:

```
mirrors/
├── debian/
│   ├── archive/                 # snapshot.debian.org mirrors
│   │   └── pool/main/o/openssl/
│   ├── snapshot/                # Point-in-time snapshots
│   │   └── 20260101T000000Z/
│   └── buildinfo/               # buildinfos.debian.net cache
│       └── <source-name>/
├── ubuntu/
│   ├── archive/                 # archive.ubuntu.com mirrors
│   ├── usn-index/               # USN metadata
│   │   └── usn-db.json
│   └── launchpad/               # Build logs from Launchpad
├── alpine/
│   ├── packages/                # Alpine package mirror
│   └── secdb/                   # Security database
│       └── community.json
└── osv/
    ├── all.zip                  # Full OSV database
    └── debian/                  # Distro-specific extracts
```

## Harness Directory Structure

Build and verification tooling:

```
harness/
├── chroots/                     # Build environments
│   ├── debian-bookworm-amd64/
│   ├── debian-bullseye-amd64/
│   ├── ubuntu-noble-amd64/
│   └── alpine-3.19-amd64/
├── lifter-matcher/              # Binary analysis tools
│   ├── ghidra/                  # Ghidra installation
│   ├── bsim-server/             # BSim database server
│   └── semantic-diffing/        # Semantic diff tools
├── sbom-canonicalizer/          # SBOM normalization
│   └── config/
└── verifier/                    # Standalone verifier
    ├── stella-verifier          # Verifier binary
    └── trust-profiles/          # Trust profiles
```

## Evidence Directory Structure

Generated bundles for audit/compliance:

```
evidence/
├── openssl-DSA-5678-1-bundle.oci.tar
├── curl-USN-1234-1-bundle.oci.tar
└── manifests/
    └── inventory.json
```

### Bundle Internal Structure (OCI Format)

```
openssl-DSA-5678-1-bundle.oci.tar/
├── oci-layout                   # OCI layout version
├── index.json                   # OCI index with referrers
├── blobs/
│   └── sha256/
│       ├── <manifest>           # Bundle manifest
│       ├── <sbom-pre>           # Pre-patch SBOM
│       ├── <sbom-post>          # Post-patch SBOM
│       ├── <binary-pre>         # Pre-patch binary
│       ├── <binary-post>        # Post-patch binary
│       ├── <delta-sig>          # DSSE delta-sig predicate
│       ├── <provenance>         # Build provenance
│       └── <timestamp>          # RFC 3161 timestamp
└── manifest.json                # Signed bundle manifest
```
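
In an OCI image layout, each file under `blobs/sha256/` is named after the SHA-256 digest of its own content, so a bundle can be given a quick integrity check without any project tooling. The sketch below assumes the bundle is a plain tar archive laid out as shown above. The function name `mismatched_blobs` is hypothetical, and the check does not replace signature or timestamp verification by the standalone verifier.

```python
import hashlib
import tarfile


def mismatched_blobs(bundle_path: str) -> list[str]:
    """Return blob member names whose SHA-256 digest differs from their file name."""
    bad = []
    with tarfile.open(bundle_path) as tar:
        for member in tar.getmembers():
            # Some tar writers prefix member names with "./"; normalize that away.
            name = member.name.removeprefix("./")
            if not (member.isfile() and name.startswith("blobs/sha256/")):
                continue
            digest = hashlib.sha256(tar.extractfile(member).read()).hexdigest()
            if digest != name.rsplit("/", 1)[-1]:
                bad.append(name)
    return bad
```

An empty result only shows that the blobs are self-consistent. It says nothing about who produced them.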

## Bench Directory Structure

Benchmark data and KPI baselines:

```
bench/
├── baselines/
│   ├── current.json             # Active KPI baseline
│   └── archive/                 # Historical baselines
│       ├── baseline-20260115.json
│       └── baseline-20260108.json
├── results/
│   ├── 20260122120000.json      # Validation run results
│   └── ...
└── reports/
    └── regression-report-*.md
```

### Baseline File Format

```json
{
  "baselineId": "baseline-20260122120000",
  "createdAt": "2026-01-22T12:00:00Z",
  "source": "abc123def456",
  "description": "Post-semantic-diffing-v2 baseline",
  "precision": 0.95,
  "recall": 0.92,
  "falseNegativeRate": 0.08,
  "deterministicReplayRate": 1.0,
  "ttfrpP95Ms": 150,
  "additionalKpis": {}
}
```
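
A regression check against a baseline like the one above could look roughly like the following. This is a sketch under several assumptions: that a results file exposes the same KPI keys as the baseline, that `ttfrpP95Ms` is a latency where lower is better, and that a single absolute `tolerance` is acceptable. The `regressions` function is hypothetical and is not the CI's actual comparison logic.

```python
# KPI keys taken from the baseline format above; the direction of each is an assumption.
HIGHER_IS_BETTER = ("precision", "recall", "deterministicReplayRate")
LOWER_IS_BETTER = ("falseNegativeRate", "ttfrpP95Ms")


def regressions(baseline: dict, current: dict, tolerance: float = 0.0) -> dict:
    """Map each regressed KPI to a (baseline value, current value) tuple."""
    found = {}
    for key in HIGHER_IS_BETTER:
        if current[key] < baseline[key] - tolerance:
            found[key] = (baseline[key], current[key])
    for key in LOWER_IS_BETTER:
        if current[key] > baseline[key] + tolerance:
            found[key] = (baseline[key], current[key])
    return found
```

Note that in the sample baseline `falseNegativeRate` equals `1 - recall` (0.08 and 0.92), so a drop in recall would be expected to show up under both keys.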

## File Naming Conventions

| Type | Pattern | Example |
|------|---------|---------|
| Advisory ID (Debian) | `DSA-<number>-<revision>` | `DSA-5678-1` |
| Advisory ID (Ubuntu) | `USN-<number>-<revision>` | `USN-1234-1` |
| Advisory ID (Alpine) | `CVE-<year>-<number>` | `CVE-2022-37434` |
| Bundle file | `<pkg>-<advisory>-bundle.oci.tar` | `openssl-DSA-5678-1-bundle.oci.tar` |
| Baseline file | `baseline-<timestamp>.json` | `baseline-20260122120000.json` |
| Results file | `<timestamp>.json` | `20260122120000.json` |
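
The patterns in the table translate fairly directly into regular expressions. The sketch below is illustrative only: the exact digit counts are assumptions read off the examples (a 14-digit timestamp, with the 8-digit date form from the archived baselines also accepted), and the constant and function names are hypothetical.

```python
import re

# One pattern per distro, following the "Advisory ID" rows of the table.
ADVISORY_PATTERNS = {
    "debian": re.compile(r"DSA-\d+-\d+"),
    "ubuntu": re.compile(r"USN-\d+-\d+"),
    "alpine": re.compile(r"CVE-\d{4}-\d{4,}"),
}

# Assumed timestamp shape: YYYYMMDDhhmmss, or a bare YYYYMMDD date for baselines.
BASELINE_FILE = re.compile(r"baseline-\d{8}(?:\d{6})?\.json")
RESULTS_FILE = re.compile(r"\d{14}\.json")
BUNDLE_FILE = re.compile(
    r"(?P<pkg>.+)-(?P<advisory>(?:DSA|USN)-\d+-\d+|CVE-\d{4}-\d{4,})-bundle\.oci\.tar"
)


def is_valid_advisory_id(distro: str, advisory_id: str) -> bool:
    """Check an advisory ID against the naming pattern for its distro."""
    return ADVISORY_PATTERNS[distro].fullmatch(advisory_id) is not None
```

Using `fullmatch` rather than `match` avoids accepting names with trailing characters, such as `DSA-5678-1.bak`.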

## Metadata Files

### advisory.json

```json
{
  "advisoryId": "DSA-5678-1",
  "cves": ["CVE-2024-1234", "CVE-2024-5678"],
  "package": "openssl",
  "vulnerableVersions": ["3.0.10-1"],
  "fixedVersions": ["3.0.11-1"],
  "severity": "high",
  "publishedAt": "2024-11-15T00:00:00Z",
  "summary": "Multiple vulnerabilities in OpenSSL"
}
```

### pair-manifest.json

```json
{
  "pairId": "openssl-DSA-5678-1",
  "package": "openssl",
  "distribution": "debian",
  "suite": "bookworm",
  "architecture": "amd64",
  "preVersion": "3.0.10-1",
  "postVersion": "3.0.11-1",
  "binaries": [
    "libssl3",
    "libcrypto3"
  ],
  "createdAt": "2026-01-15T10:00:00Z",
  "validatedAt": "2026-01-22T12:00:00Z"
}
```
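
In the sample manifest, `pairId` reads as `<package>-<advisory-id>`. If that convention holds in general (an assumption inferred from this one example), the manifest alone is enough to locate the pair directory and its evidence bundle. The `locate_pair` helper below is a hypothetical sketch of that lookup.

```python
def locate_pair(manifest: dict) -> dict:
    """Derive the advisory ID, pair directory, and bundle path from a pair manifest."""
    package = manifest["package"]
    prefix = package + "-"
    if not manifest["pairId"].startswith(prefix):
        # The pairId convention is an assumption, so fail loudly when it does not hold.
        raise ValueError(f"pairId {manifest['pairId']!r} does not start with {prefix!r}")
    advisory_id = manifest["pairId"][len(prefix):]
    return {
        "advisoryId": advisory_id,
        "pairDir": f"corpus/{manifest['distribution']}/{package}/{advisory_id}",
        "bundle": f"evidence/{package}-{advisory_id}-bundle.oci.tar",
    }
```

Both returned paths are relative to the `golden-corpus/` root.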

### ground-truth.json

```json
{
  "pairId": "openssl-DSA-5678-1",
  "binary": "libcrypto.so.3",
  "functions": [
    {
      "name": "EVP_DigestInit_ex",
      "preAddress": "0x12345",
      "postAddress": "0x12347",
      "status": "modified",
      "confidence": 1.0
    },
    {
      "name": "EVP_DigestUpdate",
      "preAddress": "0x12400",
      "postAddress": "0x12400",
      "status": "unchanged",
      "confidence": 1.0
    }
  ],
  "metadata": {
    "generatedBy": "manual-annotation",
    "reviewedBy": "security-team",
    "reviewedAt": "2026-01-20T14:00:00Z"
  }
}
```
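
One plausible way to score a matcher against a ground-truth file like this is to treat functions with `"status": "modified"` as the positive class and compare them with the set of functions the matcher reports as changed. The sketch below makes that assumption explicit. This document does not define how the precision and recall KPIs are actually computed, and the `score` function is hypothetical.

```python
def score(ground_truth: dict, predicted_modified: set[str]) -> dict:
    """Compute precision, recall, and false negative rate for predicted modified functions."""
    actual = {f["name"] for f in ground_truth["functions"] if f["status"] == "modified"}
    true_pos = len(actual & predicted_modified)
    false_pos = len(predicted_modified - actual)
    false_neg = len(actual - predicted_modified)
    # With no predictions (or no positives), report 1.0 rather than dividing by zero.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 1.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 1.0
    return {"precision": precision, "recall": recall, "falseNegativeRate": 1.0 - recall}
```

With the sample file, a matcher that flags both `EVP_DigestInit_ex` and `EVP_DigestUpdate` finds the one modified function but also raises one false positive.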

## Access Patterns

### Read-Only Access
- Validation harness reads corpus pairs
- CI reads baselines for regression checks
- Auditors read evidence bundles

### Write Access
- Corpus ingestion adds new pairs
- Baseline update writes new baseline files
- Bundle export creates evidence bundles

### Sync Access
- Mirror sync updates upstream caches
- Scheduled jobs refresh OSV database

## Storage Requirements

| Component | Typical Size | Growth Rate |
|-----------|--------------|-------------|
| Corpus (per pair) | 50-500 MB | N/A |
| Mirrors (Debian) | 10-50 GB | Monthly |
| Mirrors (Ubuntu) | 5-20 GB | Monthly |
| Mirrors (Alpine) | 1-5 GB | Monthly |
| OSV Database | 500 MB | Weekly |
| Evidence bundles | 100-500 MB each | Per pair |
| Baselines | < 10 KB each | Per run |
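
For capacity planning, the figures in the table can be combined into a rough range. The sketch below assumes one evidence bundle per pair, decimal units (1 GB = 1000 MB), and ignores baselines because they are negligible. The `estimate_storage_gb` helper is hypothetical.

```python
MB_PER_GB = 1000  # decimal units, an assumption

# (low, high) sizes copied from the table above.
PER_PAIR_MB = {"corpus": (50, 500), "evidence": (100, 500)}
FIXED_GB = {
    "mirrors-debian": (10, 50),
    "mirrors-ubuntu": (5, 20),
    "mirrors-alpine": (1, 5),
    "osv": (0.5, 0.5),
}


def estimate_storage_gb(pair_count: int) -> tuple[float, float]:
    """Return a (low, high) storage estimate in GB for the given number of pairs."""
    low = sum(lo for lo, _ in FIXED_GB.values())
    high = sum(hi for _, hi in FIXED_GB.values())
    low += pair_count * sum(lo for lo, _ in PER_PAIR_MB.values()) / MB_PER_GB
    high += pair_count * sum(hi for _, hi in PER_PAIR_MB.values()) / MB_PER_GB
    return low, high
```

For 100 pairs this gives a range of 31.5 GB to 175.5 GB: 16.5 to 75.5 GB for mirrors and OSV data, plus 15 to 100 GB for pairs and bundles.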

## Related Documentation

- [Ground Truth Corpus Overview](ground-truth-corpus.md)
- [Golden Corpus Maintenance](golden-corpus-maintenance.md)
- [Corpus Ingestion Operations](corpus-ingestion-operations.md)
- [Golden Corpus Operations Runbook](../../runbooks/golden-corpus-operations.md)