tests fixes and sprints work
@@ -1239,7 +1239,183 @@ binaryindex:
|
||||
|
||||
---
|
||||
|
||||
## 10. References
|
||||
## 10. Golden Corpus for Patch Provenance
|
||||
|
||||
> **Sprint:** SPRINT_20260121_034/035/036 - Golden Corpus Implementation
|
||||
|
||||
The BinaryIndex module supports a **golden corpus** of patch-paired artifacts that enables offline SBOM reproducibility and binary-level patch provenance verification.
|
||||
|
||||
### 10.1 Corpus Purpose
|
||||
|
||||
The golden corpus provides:
|
||||
- **Auditor-ready evidence bundles** for air-gapped customers
- **Regression testing** for binary matching accuracy
- **Proof of patch status** independent of package metadata
|
||||
|
||||
### 10.2 Corpus Sources
|
||||
|
||||
| Source | Type | Purpose |
|--------|------|---------|
| Debian Security Tracker / DSAs | Advisory | Primary advisory linkage |
| Debian Snapshot | Binary archive | Pre/post patch binary pairs |
| Ubuntu Security Notices | Advisory | Ubuntu-specific advisories |
| Alpine secdb | Advisory | Alpine YAML advisories |
| OSV dump | Unified schema | Cross-reference and commit ranges |
|
||||
|
||||
### 10.2.1 Symbol Source Connectors
|
||||
|
||||
> **Sprint:** SPRINT_20260121_035_BinaryIndex_golden_corpus_connectors_cli
|
||||
|
||||
The corpus ingestion layer uses pluggable connectors to retrieve symbols and metadata from upstream sources:
|
||||
|
||||
| Connector ID | Implementation | Protocol | Data Retrieved |
|--------------|----------------|----------|----------------|
| `debuginfod-fedora` | `DebuginfodConnector` | debuginfod HTTP | ELF debug symbols by Build-ID |
| `debuginfod-ubuntu` | `DebuginfodConnector` | debuginfod HTTP | ELF debug symbols by Build-ID |
| `ddeb-ubuntu` | `DdebConnector` | APT/HTTP | `.ddeb` debug packages |
| `buildinfo-debian` | `BuildinfoConnector` | HTTP | `.buildinfo` reproducibility records |
| `secdb-alpine` | `AlpineSecDbConnector` | Git/HTTP | `secfixes` YAML from APKBUILD |
|
||||
|
||||
**Connector Interface:**
|
||||
|
||||
```csharp
public interface ISymbolSourceConnector
{
    string ConnectorId { get; }
    string DisplayName { get; }
    string[] SupportedDistros { get; }

    Task<ConnectorStatus> GetStatusAsync(CancellationToken ct);
    Task SyncAsync(SyncOptions options, CancellationToken ct);
    Task<SymbolLookupResult?> LookupByBuildIdAsync(string buildId, CancellationToken ct);
    Task<IAsyncEnumerable<SymbolRecord>> SearchAsync(SymbolSearchQuery query, CancellationToken ct);
}
```
|
||||
|
||||
**Debuginfod Connector:**
|
||||
|
||||
The `DebuginfodConnector` implements the [debuginfod protocol](https://sourceware.org/elfutils/Debuginfod.html) for retrieving debug symbols:
|
||||
|
||||
- Endpoint: `GET /buildid/<build-id>/debuginfo`
- Supports federated queries across multiple debuginfod servers
- Caches retrieved symbols in RustFS blob storage
- Rate-limited to respect upstream server policies
|
||||
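As a minimal sketch of the endpoint above, the following Bash helpers build the debuginfo URL for a Build-ID and try a list of servers in order. The function names and the client-side fallback loop are illustrative assumptions; they are not taken from `DebuginfodConnector`, which may federate, cache, and rate-limit differently.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of the stella CLI or DebuginfodConnector).

# Build the debuginfod URL for a Build-ID.
# $1: server base URL (a trailing slash is tolerated), $2: hex Build-ID
debuginfod_url() {
    local base="${1%/}" build_id="${2,,}"   # ${2,,} lowercases (bash 4+)
    if [[ ! "$build_id" =~ ^[0-9a-f]+$ ]]; then
        echo "invalid build-id: $2" >&2
        return 1
    fi
    printf '%s/buildid/%s/debuginfo\n' "$base" "$build_id"
}

# Try each server in order; the first successful download wins.
# $1: Build-ID, $2: output file, remaining arguments: server base URLs
fetch_debuginfo() {
    local build_id="$1" out="$2" server url
    shift 2
    for server in "$@"; do
        url="$(debuginfod_url "$server" "$build_id")" || return 1
        if curl -sSfL --connect-timeout 5 -o "$out" "$url"; then
            return 0
        fi
    done
    return 1
}
```

A call might look like `fetch_debuginfo "$BUILD_ID" libssl.debug https://debuginfod.fedoraproject.org https://debuginfod.ubuntu.com`, where the server list is only an example.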
|
||||
**Ubuntu ddeb Connector:**
|
||||
|
||||
The `DdebConnector` retrieves Ubuntu debug symbol packages (`.ddeb`):
|
||||
|
||||
- Sources: `ddebs.ubuntu.com` mirror
- Indexes: Reads `Packages.xz` for package metadata
- Extraction: Unpacks `.ddeb` AR archives to extract DWARF symbols
- Mapping: Links debug symbols to binary packages via Build-ID
|
||||
|
||||
**Debian Buildinfo Connector:**
|
||||
|
||||
The `BuildinfoConnector` retrieves Debian buildinfo files for reproducibility verification:
|
||||
|
||||
- Source: `buildinfos.debian.net` and snapshot archives
- Purpose: Provides build environment metadata for reproducible builds
- Fields extracted: `Build-Date`, `Build-Architecture`, `Checksums-Sha256`
- Integration: Cross-references with binary packages for provenance
|
||||
|
||||
**Alpine SecDB Connector:**
|
||||
|
||||
The `AlpineSecDbConnector` parses Alpine's security database:
|
||||
|
||||
- Source: `secfixes` blocks in APKBUILD files
- Repository: `alpine/aports` Git repository
- Format: YAML blocks mapping CVEs to fixed versions
- Example:
  ```yaml
  secfixes:
    3.0.11-r0:
      - CVE-2024-0727
      - CVE-2024-0728
  ```
|
||||
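To illustrate how a `secfixes` block maps CVEs to fixed versions, here is a small Bash/awk sketch that flattens the YAML shown above into `CVE<TAB>version` pairs. The function name is hypothetical and the parser only handles this simple two-level shape; it is not how `AlpineSecDbConnector` is implemented. If the block is stored as `#`-prefixed comment lines in an APKBUILD, the leading `#` would need to be stripped first.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read a secfixes YAML block on stdin and
# print one "CVE<TAB>fixed-version" line per entry.
secfixes_to_pairs() {
    awk '
        # Skip the top-level "secfixes:" key
        /^[ \t]*secfixes:[ \t]*$/ { next }

        # A version key looks like "  3.0.11-r0:"
        /^[ \t]*[^- \t][^:]*:[ \t]*$/ {
            version = $0
            gsub(/^[ \t]+|:[ \t]*$/, "", version)
            next
        }

        # A list item looks like "    - CVE-2024-0727"
        /^[ \t]*-[ \t]+/ {
            id = $0
            sub(/^[ \t]*-[ \t]+/, "", id)
            sub(/[ \t]+$/, "", id)
            if (version != "") printf "%s\t%s\n", id, version
        }
    '
}
```

Fed the example block, this prints each of the two CVEs paired with `3.0.11-r0`.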
|
||||
**OSV Dump Parser:**
|
||||
|
||||
The `OsvDumpParser` processes Google OSV database dumps for advisory cross-correlation:
|
||||
|
||||
- Source: `osv.dev` bulk exports (JSON)
- Purpose: CVE → commit range extraction for patch identification
- Cross-reference: Correlates OSV entries with distribution advisories
- Inconsistency detection: Identifies discrepancies between OSV and distro advisories
|
||||
|
||||
```csharp
public interface IOsvDumpParser
{
    IAsyncEnumerable<OsvParsedEntry> ParseDumpAsync(Stream osvDumpStream, CancellationToken ct);
    OsvCveIndex BuildCveIndex(IEnumerable<OsvParsedEntry> entries);
    IEnumerable<AdvisoryCorrelation> CrossReferenceWithExternal(
        OsvCveIndex osvIndex,
        IEnumerable<ExternalAdvisory> externalAdvisories);
    IEnumerable<AdvisoryInconsistency> DetectInconsistencies(
        IEnumerable<AdvisoryCorrelation> correlations);
}
```
|
||||
|
||||
**CLI Access:**
|
||||
|
||||
All connectors are manageable via the `stella groundtruth sources` CLI commands:
|
||||
|
||||
```bash
# List all connectors
stella groundtruth sources list

# Sync specific connector
stella groundtruth sources sync --source buildinfo-debian --full

# Enable/disable connectors
stella groundtruth sources enable ddeb-ubuntu
stella groundtruth sources disable debuginfod-fedora
```
|
||||
|
||||
See the [Ground-Truth CLI Guide](../cli/guides/ground-truth-cli.md) for complete CLI documentation.
|
||||
|
||||
### 10.3 Key Performance Indicators
|
||||
|
||||
| KPI | Target | Description |
|-----|--------|-------------|
| Per-function match rate | >= 90% | Functions matched in post-patch binary |
| False-negative patch detection | <= 5% | Patched functions incorrectly classified |
| SBOM canonical-hash stability | 3/3 | Determinism across independent runs |
| Binary reconstruction equivalence | Trend | Rebuilt binary matches original |
| End-to-end verify time (p95, cold) | Trend | Offline verification performance |
|
||||
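The "SBOM canonical-hash stability 3/3" target can be read as: three independent runs must produce byte-identical canonical SBOMs. A minimal Bash sketch of that comparison follows, assuming each run has written its canonical SBOM to a file; the function name and file names are hypothetical and the real harness may compute this differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if every given file has the same SHA-256.
hashes_stable() {
    local first="" digest file
    for file in "$@"; do
        if [[ ! -r "$file" ]]; then
            echo "cannot read: $file" >&2
            return 2
        fi
        digest="$(sha256sum "$file" | cut -d' ' -f1)"
        if [[ -z "$first" ]]; then
            first="$digest"
        elif [[ "$digest" != "$first" ]]; then
            echo "digest mismatch: $file" >&2
            return 1
        fi
    done
    # Fail when no files were given at all
    [[ -n "$first" ]]
}
```

For example, `hashes_stable run1/sbom.json run2/sbom.json run3/sbom.json` exits 0 only when all three digests agree.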
|
||||
### 10.4 Validation Harness
|
||||
|
||||
The validation harness (`IValidationHarness`) orchestrates end-to-end verification:
|
||||
|
||||
```
Binary Pair (pre/post) → Symbol Recovery → IR Lifting → Fingerprinting → Matching → Metrics
```
|
||||
|
||||
### 10.5 Evidence Bundle Format
|
||||
|
||||
Evidence bundles follow OCI/ORAS conventions:
|
||||
|
||||
```
<pkg>-<advisory>-bundle.oci.tar
├── manifest.json                # OCI manifest
└── blobs/
    ├── sha256:<sbom>            # Canonical SBOM
    ├── sha256:<pre-bin>         # Pre-fix binary
    ├── sha256:<post-bin>        # Post-fix binary
    ├── sha256:<delta-sig>       # DSSE delta-sig predicate
    └── sha256:<timestamp>       # RFC 3161 timestamp
```
|
||||
|
||||
### 10.6 Related Documentation
|
||||
|
||||
- [Golden Corpus KPIs](../../benchmarks/golden-corpus-kpis.md)
- [Golden Corpus Seed List](../../benchmarks/golden-corpus-seed-list.md)
- [Ground-Truth Corpus Specification](../../benchmarks/ground-truth-corpus.md)
|
||||
|
||||
---
|
||||
|
||||
## 11. References
|
||||
|
||||
- Advisory: `docs/product/advisories/21-Dec-2025 - Mapping Evidence Within Compiled Binaries.md`
|
||||
- Scanner Native Analysis: `src/Scanner/StellaOps.Scanner.Analyzers.Native/`
|
||||
@@ -1248,8 +1424,9 @@ binaryindex:
|
||||
- **Semantic Diffing Sprint:** `docs/implplan/SPRINT_20260105_001_001_BINDEX_semdiff_ir_semantics.md`
|
||||
- **Semantic Library:** `src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.Semantic/`
|
||||
- **Semantic Tests:** `src/BinaryIndex/__Tests/StellaOps.BinaryIndex.Semantic.Tests/`
|
||||
- **Golden Corpus Sprints:** `docs/implplan/SPRINT_20260121_034_BinaryIndex_golden_corpus_foundation.md`
|
||||
|
||||
---
|
||||
|
||||
*Document Version: 1.1.1*
|
||||
*Last Updated: 2026-01-14*
|
||||
*Document Version: 1.2.0*
|
||||
*Last Updated: 2026-01-21*
|
||||
|
||||
347
docs/modules/binary-index/golden-corpus-layout.md
Normal file
@@ -0,0 +1,347 @@
|
||||
# Golden Corpus Folder Layout
|
||||
|
||||
Sprint: SPRINT_20260121_036_BinaryIndex_golden_corpus_bundle_verification
|
||||
Task: GCB-006 - Document corpus folder layout and maintenance procedures
|
||||
|
||||
## Overview
|
||||
|
||||
The golden corpus is a curated dataset of pre/post security patch binary pairs used for:
|
||||
- Validating binary matching algorithms
- Benchmarking reproducibility verification
- Training machine learning models for function identification
- Generating audit-ready evidence bundles
|
||||
|
||||
## Root Layout
|
||||
|
||||
```
golden-corpus/
├── corpus/                  # Security pairs organized by distro
│   ├── debian/
│   ├── ubuntu/
│   └── alpine/
├── mirrors/                 # Local mirrors of upstream sources
│   ├── debian/
│   ├── ubuntu/
│   ├── alpine/
│   └── osv/
├── harness/                 # Build and verification tooling
│   ├── chroots/
│   ├── lifter-matcher/
│   ├── sbom-canonicalizer/
│   └── verifier/
├── evidence/                # Generated evidence bundles
│   └── <pkg>-<advisory>-bundle.oci.tar
└── bench/                   # Benchmark data and baselines
    ├── baselines/
    └── results/
```
|
||||
|
||||
## Corpus Directory Structure
|
||||
|
||||
Each security pair follows a consistent structure:
|
||||
|
||||
```
corpus/<distro>/<package>/<advisory-id>/
├── pre/                         # Pre-patch (vulnerable) artifacts
│   ├── src/                     # Source code
│   │   ├── *.tar.gz             # Original source tarball
│   │   ├── debian/              # Packaging metadata
│   │   └── buildinfo            # Build reproducibility info
│   └── debs/                    # Built binaries
│       ├── *.deb                # Binary packages
│       ├── *.ddeb               # Debug symbols
│       └── buildlog             # Build log
├── post/                        # Post-patch (fixed) artifacts
│   ├── src/
│   └── debs/
└── metadata/
    ├── advisory.json            # Advisory details
    ├── osv.json                 # OSV format vulnerability
    ├── pair-manifest.json       # Pair configuration
    └── ground-truth.json        # Function-level ground truth
```
|
||||
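The path convention above is simple enough to check mechanically. The following Bash sketch derives a pair directory from its three components and confirms that the top-level `pre/`, `post/`, and `metadata/` directories exist. Both function names are hypothetical; the corpus ingestion tooling may perform stricter checks.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for the corpus/<distro>/<package>/<advisory-id>/ layout.

# Print the relative directory of a security pair.
# $1: distro, $2: package, $3: advisory ID
pair_dir() {
    printf 'corpus/%s/%s/%s\n' "$1" "$2" "$3"
}

# Succeed only if the pair directory has pre/, post/ and metadata/.
# $1: path to the pair directory
check_pair_layout() {
    local dir="$1" sub missing=0
    for sub in pre post metadata; do
        if [[ ! -d "$dir/$sub" ]]; then
            echo "missing: $dir/$sub" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For instance, `check_pair_layout "golden-corpus/$(pair_dir debian openssl DSA-5678-1)"` reports any missing top-level directory on stderr.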
|
||||
### Debian Example
|
||||
|
||||
```
corpus/debian/openssl/DSA-5678-1/
├── pre/
│   ├── src/
│   │   ├── openssl_3.0.10.orig.tar.gz
│   │   ├── openssl_3.0.10-1.debian.tar.xz
│   │   ├── openssl_3.0.10-1.dsc
│   │   └── openssl_3.0.10-1.buildinfo
│   └── debs/
│       ├── libssl3_3.0.10-1_amd64.deb
│       ├── libssl3-dbgsym_3.0.10-1_amd64.ddeb
│       └── build.log
├── post/
│   ├── src/
│   │   ├── openssl_3.0.11.orig.tar.gz
│   │   ├── openssl_3.0.11-1.debian.tar.xz
│   │   └── ...
│   └── debs/
│       └── ...
└── metadata/
    ├── advisory.json
    └── ground-truth.json
```
|
||||
|
||||
### Ubuntu Example
|
||||
|
||||
```
corpus/ubuntu/curl/USN-1234-1/
├── pre/
│   ├── src/
│   │   └── curl_8.4.0-1ubuntu1.tar.xz
│   └── debs/
│       └── libcurl4_8.4.0-1ubuntu1_amd64.deb
├── post/
│   └── ...
└── metadata/
    ├── advisory.json
    └── usn.json
```
|
||||
|
||||
### Alpine Example
|
||||
|
||||
```
corpus/alpine/zlib/CVE-2022-37434/
├── pre/
│   ├── src/
│   │   └── APKBUILD
│   └── apks/
│       └── zlib-1.2.12-r2.apk
├── post/
│   └── ...
└── metadata/
    └── secdb-entry.json
```
|
||||
|
||||
## Mirrors Directory Structure
|
||||
|
||||
Local mirrors cache upstream artifacts for offline operation:
|
||||
|
||||
```
mirrors/
├── debian/
│   ├── archive/                 # snapshot.debian.org mirrors
│   │   └── pool/main/o/openssl/
│   ├── snapshot/                # Point-in-time snapshots
│   │   └── 20260101T000000Z/
│   └── buildinfo/               # buildinfos.debian.net cache
│       └── <source-name>/
├── ubuntu/
│   ├── archive/                 # archive.ubuntu.com mirrors
│   ├── usn-index/               # USN metadata
│   │   └── usn-db.json
│   └── launchpad/               # Build logs from Launchpad
├── alpine/
│   ├── packages/                # Alpine package mirror
│   └── secdb/                   # Security database
│       └── community.json
└── osv/
    ├── all.zip                  # Full OSV database
    └── debian/                  # Distro-specific extracts
```
|
||||
|
||||
## Harness Directory Structure
|
||||
|
||||
Build and verification tooling:
|
||||
|
||||
```
harness/
├── chroots/                     # Build environments
│   ├── debian-bookworm-amd64/
│   ├── debian-bullseye-amd64/
│   ├── ubuntu-noble-amd64/
│   └── alpine-3.19-amd64/
├── lifter-matcher/              # Binary analysis tools
│   ├── ghidra/                  # Ghidra installation
│   ├── bsim-server/             # BSim database server
│   └── semantic-diffing/        # Semantic diff tools
├── sbom-canonicalizer/          # SBOM normalization
│   └── config/
└── verifier/                    # Standalone verifier
    ├── stella-verifier          # Verifier binary
    └── trust-profiles/          # Trust profiles
```
|
||||
|
||||
## Evidence Directory Structure
|
||||
|
||||
Generated bundles for audit/compliance:
|
||||
|
||||
```
evidence/
├── openssl-DSA-5678-1-bundle.oci.tar
├── curl-USN-1234-1-bundle.oci.tar
└── manifests/
    └── inventory.json
```
|
||||
|
||||
### Bundle Internal Structure (OCI Format)
|
||||
|
||||
```
openssl-DSA-5678-1-bundle.oci.tar/
├── oci-layout                   # OCI layout version
├── index.json                   # OCI index with referrers
├── blobs/
│   └── sha256/
│       ├── <manifest>           # Bundle manifest
│       ├── <sbom-pre>           # Pre-patch SBOM
│       ├── <sbom-post>          # Post-patch SBOM
│       ├── <binary-pre>         # Pre-patch binary
│       ├── <binary-post>        # Post-patch binary
│       ├── <delta-sig>          # DSSE delta-sig predicate
│       ├── <provenance>         # Build provenance
│       └── <timestamp>          # RFC 3161 timestamp
└── manifest.json                # Signed bundle manifest
```
|
||||
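Because blobs in this layout live under `blobs/sha256/` and are named by their digest, a basic integrity pass over an unpacked bundle can be sketched in Bash as below. This assumes the `.oci.tar` has already been extracted with `tar -xf` and that each blob's file name is its hex SHA-256; `verify_blobs` is a hypothetical helper, not a `stella` command, and it does not check the DSSE signature or the RFC 3161 timestamp.

```bash
#!/usr/bin/env bash
# Hypothetical helper: check that every blob's file name equals its SHA-256.
# $1: directory containing the unpacked bundle (with blobs/sha256/ inside)
verify_blobs() {
    local root="$1" blob expected actual checked=0 status=0
    for blob in "$root"/blobs/sha256/*; do
        [[ -f "$blob" ]] || continue          # skips the unmatched glob too
        expected="$(basename "$blob")"
        actual="$(sha256sum "$blob" | cut -d' ' -f1)"
        checked=$((checked + 1))
        if [[ "$actual" != "$expected" ]]; then
            echo "digest mismatch: $blob" >&2
            status=1
        fi
    done
    if [[ "$checked" -eq 0 ]]; then
        echo "no blobs found under $root/blobs/sha256" >&2
        return 1
    fi
    return "$status"
}
```

A non-zero exit means at least one blob was altered or no blobs were found.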
|
||||
## Bench Directory Structure
|
||||
|
||||
Benchmark data and KPI baselines:
|
||||
|
||||
```
bench/
├── baselines/
│   ├── current.json             # Active KPI baseline
│   └── archive/                 # Historical baselines
│       ├── baseline-20260115.json
│       └── baseline-20260108.json
├── results/
│   ├── 20260122120000.json      # Validation run results
│   └── ...
└── reports/
    └── regression-report-*.md
```
|
||||
|
||||
### Baseline File Format
|
||||
|
||||
```json
{
  "baselineId": "baseline-20260122120000",
  "createdAt": "2026-01-22T12:00:00Z",
  "source": "abc123def456",
  "description": "Post-semantic-diffing-v2 baseline",
  "precision": 0.95,
  "recall": 0.92,
  "falseNegativeRate": 0.08,
  "deterministicReplayRate": 1.0,
  "ttfrpP95Ms": 150,
  "additionalKpis": {}
}
```
|
||||
|
||||
## File Naming Conventions
|
||||
|
||||
| Type | Pattern | Example |
|------|---------|---------|
| Advisory ID (Debian) | `DSA-<number>-<revision>` | `DSA-5678-1` |
| Advisory ID (Ubuntu) | `USN-<number>-<revision>` | `USN-1234-1` |
| Advisory ID (Alpine) | `CVE-<year>-<number>` | `CVE-2022-37434` |
| Bundle file | `<pkg>-<advisory>-bundle.oci.tar` | `openssl-DSA-5678-1-bundle.oci.tar` |
| Baseline file | `baseline-<timestamp>.json` | `baseline-20260122120000.json` |
| Results file | `<timestamp>.json` | `20260122120000.json` |
|
||||
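The patterns in this table translate directly into regular expressions. The Bash sketch below classifies an advisory ID by its pattern and derives the bundle file name from it. The function names are hypothetical, and the regexes are one reading of the table (for example, a four-digit year for CVE IDs), not a specification.

```bash
#!/usr/bin/env bash
# Hypothetical helpers based on the naming table above.

# Print the distro family implied by an advisory ID, or fail if it matches
# none of the documented patterns.
advisory_kind() {
    local id="$1"
    if [[ "$id" =~ ^DSA-[0-9]+-[0-9]+$ ]]; then
        echo debian
    elif [[ "$id" =~ ^USN-[0-9]+-[0-9]+$ ]]; then
        echo ubuntu
    elif [[ "$id" =~ ^CVE-[0-9]{4}-[0-9]+$ ]]; then
        echo alpine
    else
        return 1
    fi
}

# Print "<pkg>-<advisory>-bundle.oci.tar" for a valid advisory ID.
bundle_filename() {
    local pkg="$1" advisory="$2"
    advisory_kind "$advisory" > /dev/null || return 1
    printf '%s-%s-bundle.oci.tar\n' "$pkg" "$advisory"
}
```

Rejecting malformed IDs before building a file name keeps typos such as `DSA-5678` (missing revision) out of the `evidence/` directory.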
|
||||
## Metadata Files
|
||||
|
||||
### advisory.json
|
||||
|
||||
```json
{
  "advisoryId": "DSA-5678-1",
  "cves": ["CVE-2024-1234", "CVE-2024-5678"],
  "package": "openssl",
  "vulnerableVersions": ["3.0.10-1"],
  "fixedVersions": ["3.0.11-1"],
  "severity": "high",
  "publishedAt": "2024-11-15T00:00:00Z",
  "summary": "Multiple vulnerabilities in OpenSSL"
}
```
|
||||
|
||||
### pair-manifest.json
|
||||
|
||||
```json
{
  "pairId": "openssl-DSA-5678-1",
  "package": "openssl",
  "distribution": "debian",
  "suite": "bookworm",
  "architecture": "amd64",
  "preVersion": "3.0.10-1",
  "postVersion": "3.0.11-1",
  "binaries": [
    "libssl3",
    "libcrypto3"
  ],
  "createdAt": "2026-01-15T10:00:00Z",
  "validatedAt": "2026-01-22T12:00:00Z"
}
```
|
||||
|
||||
### ground-truth.json
|
||||
|
||||
```json
{
  "pairId": "openssl-DSA-5678-1",
  "binary": "libcrypto.so.3",
  "functions": [
    {
      "name": "EVP_DigestInit_ex",
      "preAddress": "0x12345",
      "postAddress": "0x12347",
      "status": "modified",
      "confidence": 1.0
    },
    {
      "name": "EVP_DigestUpdate",
      "preAddress": "0x12400",
      "postAddress": "0x12400",
      "status": "unchanged",
      "confidence": 1.0
    }
  ],
  "metadata": {
    "generatedBy": "manual-annotation",
    "reviewedBy": "security-team",
    "reviewedAt": "2026-01-20T14:00:00Z"
  }
}
```
|
||||
|
||||
## Access Patterns
|
||||
|
||||
### Read-Only Access
|
||||
- Validation harness reads corpus pairs
- CI reads baselines for regression checks
- Auditors read evidence bundles
|
||||
|
||||
### Write Access
|
||||
- Corpus ingestion adds new pairs
- Baseline update writes new baseline files
- Bundle export creates evidence bundles
|
||||
|
||||
### Sync Access
|
||||
- Mirror sync updates upstream caches
- Scheduled jobs refresh OSV database
|
||||
|
||||
## Storage Requirements
|
||||
|
||||
| Component | Typical Size | Growth Rate |
|-----------|--------------|-------------|
| Corpus (per pair) | 50-500 MB | N/A |
| Mirrors (Debian) | 10-50 GB | Monthly |
| Mirrors (Ubuntu) | 5-20 GB | Monthly |
| Mirrors (Alpine) | 1-5 GB | Monthly |
| OSV Database | 500 MB | Weekly |
| Evidence bundles | 100-500 MB each | Per pair |
| Baselines | < 10 KB each | Per run |
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [Ground Truth Corpus Overview](ground-truth-corpus.md)
- [Golden Corpus Maintenance](golden-corpus-maintenance.md)
- [Corpus Ingestion Operations](corpus-ingestion-operations.md)
- [Golden Corpus Operations Runbook](../../runbooks/golden-corpus-operations.md)
|
||||
492
docs/modules/binary-index/golden-corpus-maintenance.md
Normal file
@@ -0,0 +1,492 @@
|
||||
# Golden Corpus Maintenance
|
||||
|
||||
Sprint: SPRINT_20260121_036_BinaryIndex_golden_corpus_bundle_verification
|
||||
Task: GCB-006 - Document corpus folder layout and maintenance procedures
|
||||
|
||||
## Overview
|
||||
|
||||
This document describes maintenance procedures for the golden corpus, including:
|
||||
- Mirror synchronization
- Baseline management
- Evidence bundle generation
- Health monitoring
|
||||
|
||||
## Mirror Synchronization
|
||||
|
||||
### Automated Sync Schedule
|
||||
|
||||
Mirror sync should be automated via cron jobs or CI scheduled workflows.
|
||||
|
||||
#### Recommended Schedule
|
||||
|
||||
| Mirror | Frequency | Rationale |
|--------|-----------|-----------|
| Debian archive | Daily | Security updates published daily |
| Debian buildinfo | Daily | Matches archive updates |
| Ubuntu archive | Daily | Security updates published daily |
| Ubuntu USN index | Hourly | USN metadata changes frequently |
| Alpine secdb | Daily | Less frequent updates |
| OSV database | Hourly | Aggregates multiple sources |
|
||||
|
||||
### Sync Scripts
|
||||
|
||||
#### Debian Mirror Sync
|
||||
|
||||
```bash
#!/bin/bash
# sync-debian-mirrors.sh
# Syncs Debian archives and buildinfo

set -euo pipefail

MIRRORS_ROOT="${MIRRORS_ROOT:-/data/golden-corpus/mirrors}"
DEBIAN_MIRROR="${DEBIAN_MIRROR:-https://snapshot.debian.org}"
BUILDINFO_URL="${BUILDINFO_URL:-https://buildinfos.debian.net}"

# Packages to mirror (security-relevant)
PACKAGES=(openssl curl zlib glibc libxml2 libpng)

# Debian pool directories use the first letter of the source package name,
# except for lib* packages, which use "lib" plus the next letter (e.g. libx).
pool_prefix() {
    case "$1" in
        lib*) echo "${1:0:4}" ;;
        *)    echo "${1:0:1}" ;;
    esac
}

# Sync source packages
for pkg in "${PACKAGES[@]}"; do
    echo "Syncing Debian sources for: $pkg"
    prefix="$(pool_prefix "$pkg")"

    # Create package directory
    mkdir -p "$MIRRORS_ROOT/debian/archive/pool/main/$prefix/$pkg"

    # Download available versions
    rsync -avz --progress \
        "rsync://snapshot.debian.org/snapshot/debian/pool/main/$prefix/$pkg/" \
        "$MIRRORS_ROOT/debian/archive/pool/main/$prefix/$pkg/"
done

# Sync buildinfo files
for pkg in "${PACKAGES[@]}"; do
    echo "Syncing buildinfo for: $pkg"

    mkdir -p "$MIRRORS_ROOT/debian/buildinfo/$pkg"

    # Use wget to fetch buildinfo index and files
    wget -r -np -nH --cut-dirs=2 -P "$MIRRORS_ROOT/debian/buildinfo/$pkg" \
        "$BUILDINFO_URL/api/v1/buildinfo/$pkg/" || true
done

echo "Debian mirror sync complete"
date > "$MIRRORS_ROOT/debian/.last-sync"
```
|
||||
|
||||
#### Ubuntu Mirror Sync
|
||||
|
||||
```bash
#!/bin/bash
# sync-ubuntu-mirrors.sh
# Syncs Ubuntu archives and USN metadata

set -euo pipefail

MIRRORS_ROOT="${MIRRORS_ROOT:-/data/golden-corpus/mirrors}"
UBUNTU_ARCHIVE="https://archive.ubuntu.com/ubuntu"
USN_API="https://ubuntu.com/security/notices.json"

# Sync USN database
echo "Syncing Ubuntu USN database..."
mkdir -p "$MIRRORS_ROOT/ubuntu/usn-index"
curl -sSL "$USN_API" -o "$MIRRORS_ROOT/ubuntu/usn-index/usn-db.json.tmp"
mv "$MIRRORS_ROOT/ubuntu/usn-index/usn-db.json.tmp" "$MIRRORS_ROOT/ubuntu/usn-index/usn-db.json"

# Sync packages (similar to Debian)
PACKAGES=(openssl curl zlib1g libxml2)

for pkg in "${PACKAGES[@]}"; do
    echo "Syncing Ubuntu sources for: $pkg"
    # Pool prefix: first letter, or "lib" plus the next letter for lib* packages
    case "$pkg" in
        lib*) prefix="${pkg:0:4}" ;;
        *)    prefix="${pkg:0:1}" ;;
    esac
    mkdir -p "$MIRRORS_ROOT/ubuntu/archive/pool/main/$prefix/$pkg"
    # ... sync logic
done

echo "Ubuntu mirror sync complete"
date > "$MIRRORS_ROOT/ubuntu/.last-sync"
```
|
||||
|
||||
#### Alpine SecDB Sync
|
||||
|
||||
```bash
#!/bin/bash
# sync-alpine-secdb.sh
# Syncs Alpine security database

set -euo pipefail

MIRRORS_ROOT="${MIRRORS_ROOT:-/data/golden-corpus/mirrors}"
ALPINE_SECDB="https://secdb.alpinelinux.org"

mkdir -p "$MIRRORS_ROOT/alpine/secdb"

# Download all security databases
for branch in v3.17 v3.18 v3.19 v3.20 edge; do
    for repo in main community; do
        echo "Syncing Alpine secdb: $branch/$repo"
        curl -sSL "$ALPINE_SECDB/$branch/$repo.json" \
            -o "$MIRRORS_ROOT/alpine/secdb/${branch}-${repo}.json" || true
    done
done

echo "Alpine secdb sync complete"
date > "$MIRRORS_ROOT/alpine/.last-sync"
```
|
||||
|
||||
#### OSV Database Sync
|
||||
|
||||
```bash
#!/bin/bash
# sync-osv.sh
# Syncs OSV vulnerability database

set -euo pipefail

MIRRORS_ROOT="${MIRRORS_ROOT:-/data/golden-corpus/mirrors}"
OSV_URL="https://osv-vulnerabilities.storage.googleapis.com"

mkdir -p "$MIRRORS_ROOT/osv"

# Download full database
echo "Downloading OSV all.zip..."
curl -sSL "$OSV_URL/all.zip" -o "$MIRRORS_ROOT/osv/all.zip.tmp"
mv "$MIRRORS_ROOT/osv/all.zip.tmp" "$MIRRORS_ROOT/osv/all.zip"

# Extract ecosystem-specific databases
for ecosystem in Debian Ubuntu Alpine; do
    mkdir -p "$MIRRORS_ROOT/osv/$ecosystem"
    unzip -o -q "$MIRRORS_ROOT/osv/all.zip" "$ecosystem/*" -d "$MIRRORS_ROOT/osv/" || true
done

echo "OSV sync complete"
date > "$MIRRORS_ROOT/osv/.last-sync"
```
|
||||
|
||||
### Cron Configuration
|
||||
|
||||
```cron
# /etc/cron.d/golden-corpus-sync

# Mirror sync jobs
0 */4 * * * corpus /opt/golden-corpus/scripts/sync-debian-mirrors.sh >> /var/log/corpus/debian-sync.log 2>&1
0 */4 * * * corpus /opt/golden-corpus/scripts/sync-ubuntu-mirrors.sh >> /var/log/corpus/ubuntu-sync.log 2>&1
0 6 * * * corpus /opt/golden-corpus/scripts/sync-alpine-secdb.sh >> /var/log/corpus/alpine-sync.log 2>&1
0 * * * * corpus /opt/golden-corpus/scripts/sync-osv.sh >> /var/log/corpus/osv-sync.log 2>&1

# Health check
*/15 * * * * corpus /opt/golden-corpus/scripts/check-mirror-health.sh >> /var/log/corpus/health.log 2>&1
```
|
||||
|
||||
## Baseline Management
|
||||
|
||||
### When to Update Baselines
|
||||
|
||||
Update the KPI baseline when:
|
||||
1. Algorithm improvements are merged (expected KPI improvement)
2. New corpus pairs are added (may change baseline metrics)
3. False positives/negatives are corrected in ground truth
4. Analysis tools receive a major version upgrade
|
||||
|
||||
### Baseline Update Procedure
|
||||
|
||||
#### 1. Run Full Validation
|
||||
|
||||
```bash
# Run validation on the full corpus
stella groundtruth validate run \
    --matcher semantic-diffing \
    --output bench/results/$(date +%Y%m%d%H%M%S).json \
    --verbose
```
|
||||
|
||||
#### 2. Review Results
|
||||
|
||||
```bash
# Check metrics
stella groundtruth validate metrics --run-id latest

# Compare against current baseline
stella groundtruth validate check \
    --results bench/results/latest.json \
    --baseline bench/baselines/current.json
```
|
||||
|
||||
#### 3. Update Baseline
|
||||
|
||||
Only if regression check passes or improvements are expected:
|
||||
|
||||
```bash
# Archive current baseline
cp bench/baselines/current.json \
    bench/baselines/archive/baseline-$(date +%Y%m%d).json

# Update baseline
stella groundtruth baseline update \
    --from-results bench/results/latest.json \
    --output bench/baselines/current.json \
    --description "Post algorithm-v2.3 update" \
    --source "$(git rev-parse HEAD)"
```
|
||||
|
||||
#### 4. Commit and Document
|
||||
|
||||
```bash
# Commit the baseline update
git add bench/baselines/
git commit -m "chore(bench): update golden corpus baseline

Reason: Algorithm v2.3 improvements
Previous baseline: baseline-20260115.json

Metrics:
- Precision: 0.95 -> 0.97 (+2pp)
- Recall: 0.92 -> 0.94 (+2pp)
- FN Rate: 0.08 -> 0.06 (-2pp)
- Determinism: 100%
- TTFRP p95: 150ms -> 140ms (-7%)"

git push
```
|
||||
|
||||
### Baseline Rollback
|
||||
|
||||
If a baseline update causes issues:
|
||||
|
||||
```bash
# Restore previous baseline
cp bench/baselines/archive/baseline-20260115.json \
    bench/baselines/current.json

git add bench/baselines/current.json
git commit -m "revert(bench): rollback baseline to 20260115"
git push
```
|
||||
|
||||
## Evidence Bundle Generation
|
||||
|
||||
### Manual Bundle Export
|
||||
|
||||
```bash
# Export bundle for specific packages
stella groundtruth bundle export \
    --packages openssl,curl,zlib \
    --distros debian,ubuntu \
    --output evidence/security-bundle-$(date +%Y%m%d).tar.gz \
    --sign-with-cosign \
    --include-debug \
    --include-kpis \
    --include-timestamps
```
|
||||
|
||||
### Automated Bundle Generation
|
||||
|
||||
Schedule bundle generation for compliance reporting:
|
||||
|
||||
```bash
#!/bin/bash
# generate-compliance-bundles.sh
# Run monthly for audit evidence

set -euo pipefail

EVIDENCE_DIR="/data/golden-corpus/evidence"
MONTH=$(date +%Y%m)

# Generate bundles for each distro
for distro in debian ubuntu alpine; do
    stella groundtruth bundle export \
        --distros "$distro" \
        --packages all \
        --output "$EVIDENCE_DIR/$distro-bundle-$MONTH.tar.gz" \
        --sign-with-cosign \
        --include-kpis \
        --include-timestamps
done

# Create manifest
echo "{\"month\": \"$MONTH\", \"bundles\": [\"debian\", \"ubuntu\", \"alpine\"]}" \
    > "$EVIDENCE_DIR/manifest-$MONTH.json"
```
|
||||
|
||||
### Bundle Verification
|
||||
|
||||
Always verify bundles after generation:
|
||||
|
||||
```bash
# Verify bundle integrity
stella groundtruth bundle import \
    --input evidence/security-bundle-20260122.tar.gz \
    --verify \
    --trusted-keys /etc/stellaops/trusted-keys.pub \
    --trust-profile /etc/stellaops/trust-profiles/global.json \
    --output verification-report.md
```
|
||||
|
||||
## Health Monitoring
|
||||
|
||||
### Doctor Checks
|
||||
|
||||
Run Doctor checks regularly to validate corpus health:
|
||||
|
||||
```bash
# Run all corpus-related checks
stella doctor --check "check.binaryanalysis.corpus.*"

# Specific checks
stella doctor --check check.binaryanalysis.corpus.mirror.freshness
stella doctor --check check.binaryanalysis.corpus.kpi.baseline
stella doctor --check check.binaryanalysis.debuginfod.availability
```
|
||||
|
||||
### Health Check Script
|
||||
|
||||
```bash
#!/bin/bash
# check-mirror-health.sh
# Validates mirror freshness and connectivity

set -euo pipefail

MIRRORS_ROOT="${MIRRORS_ROOT:-/data/golden-corpus/mirrors}"
STALE_THRESHOLD_DAYS=7
ALERTS=""

check_mirror() {
    local mirror_name=$1
    local last_sync_file=$2
    local max_age=$3

    if [[ ! -f "$last_sync_file" ]]; then
        ALERTS+="CRITICAL: $mirror_name has never been synced\n"
        return
    fi

    # Declare first, assign afterwards, so `local` does not mask a failing command
    local last_sync last_sync_epoch now_epoch age_days
    last_sync=$(cat "$last_sync_file")
    if ! last_sync_epoch=$(date -d "$last_sync" +%s); then
        ALERTS+="CRITICAL: $mirror_name has an unreadable sync marker\n"
        return
    fi
    now_epoch=$(date +%s)
    age_days=$(( (now_epoch - last_sync_epoch) / 86400 ))

    if [[ $age_days -gt $max_age ]]; then
        ALERTS+="WARNING: $mirror_name is $age_days days old (threshold: $max_age)\n"
    fi
}

# Check each mirror
check_mirror "Debian" "$MIRRORS_ROOT/debian/.last-sync" $STALE_THRESHOLD_DAYS
check_mirror "Ubuntu" "$MIRRORS_ROOT/ubuntu/.last-sync" $STALE_THRESHOLD_DAYS
check_mirror "Alpine" "$MIRRORS_ROOT/alpine/.last-sync" $STALE_THRESHOLD_DAYS
# OSV syncs hourly; age is counted in whole days, so alert once it exceeds 1 day
check_mirror "OSV" "$MIRRORS_ROOT/osv/.last-sync" 1

# Check connectivity
for url in \
    "https://snapshot.debian.org" \
    "https://buildinfos.debian.net" \
    "https://ubuntu.com/security/notices.json" \
    "https://secdb.alpinelinux.org"; do

    if ! curl -sSf --connect-timeout 5 "$url" > /dev/null 2>&1; then
        ALERTS+="ERROR: Cannot reach $url\n"
    fi
done

# Report results
if [[ -n "$ALERTS" ]]; then
    echo -e "Golden Corpus Health Issues:\n$ALERTS"
    # Send alert (customize for your alerting system)
    # curl -X POST -d "$ALERTS" https://alerts.example.com/webhook
    exit 1
fi

echo "All mirrors healthy at $(date)"
```
|
||||
|
||||
### Monitoring Metrics
|
||||
|
||||
Export these metrics to your monitoring system:
|
||||
|
||||
| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| `corpus.mirrors.age_seconds` | Time since last mirror sync | > 7 days |
| `corpus.pairs.total` | Total number of security pairs | N/A (info) |
| `corpus.validation.precision` | Latest precision rate | < baseline - 0.01 |
| `corpus.validation.recall` | Latest recall rate | < baseline - 0.01 |
| `corpus.validation.determinism` | Deterministic replay rate | < 1.0 |
| `corpus.bundle.count` | Number of evidence bundles | N/A (info) |
| `corpus.baseline.age_days` | Days since baseline update | > 30 days |
|
||||
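The `< baseline - 0.01` thresholds for precision and recall amount to a tolerance comparison. Since Bash has no floating-point arithmetic, the sketch below delegates the comparison to `awk`. The function name and the default tolerance argument are illustrative assumptions; `stella groundtruth validate check` may apply different rules.

```bash
#!/usr/bin/env bash
# Hypothetical helper: exit 0 when a KPI has regressed, i.e. when the current
# value is more than `tolerance` below the baseline value.
# $1: current value, $2: baseline value, $3: tolerance (default 0.01)
kpi_regressed() {
    local current="$1" baseline="$2" tolerance="${3:-0.01}"
    awk -v c="$current" -v b="$baseline" -v t="$tolerance" \
        'BEGIN { exit !(c + 0 < b - t) }'
}
```

For example, `kpi_regressed 0.93 0.95 && echo "precision regression"` prints the message because 0.93 is below 0.95 - 0.01.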
|
||||
### Prometheus Metrics Example
|
||||
|
||||
```yaml
# prometheus-corpus-metrics.yaml
groups:
  - name: golden-corpus
    rules:
      - alert: CorpusMirrorStale
        expr: corpus_mirror_age_seconds > 604800  # 7 days
        labels:
          severity: warning
        annotations:
          summary: "Corpus mirror {{ $labels.mirror }} is stale"

      - alert: CorpusRegressionDetected
        expr: corpus_validation_precision < corpus_baseline_precision - 0.01
        labels:
          severity: critical
        annotations:
          summary: "Precision regression detected in golden corpus validation"

      - alert: CorpusDeterminismFailure
        expr: corpus_validation_determinism < 1.0
        labels:
          severity: critical
        annotations:
          summary: "Non-deterministic replay detected"
```
|
||||
|
||||
## Cleanup and Archival
|
||||
|
||||
### Archive Old Results
|
||||
|
||||
```bash
#!/bin/bash
# archive-old-results.sh
# Archives results older than 90 days

set -euo pipefail
shopt -s nullglob

RESULTS_DIR="/data/golden-corpus/bench/results"
ARCHIVE_DIR="/data/golden-corpus/bench/archive"
AGE_DAYS=90

mkdir -p "$ARCHIVE_DIR"

find "$RESULTS_DIR" -name "*.json" -mtime +"$AGE_DAYS" -exec \
    mv {} "$ARCHIVE_DIR/" \;

# Compress archived results by month (file names start with YYYYMM)
cd "$ARCHIVE_DIR"
for month in $(printf '%s\n' *.json | cut -c1-6 | sort -u); do
    archive="results-$month.tar"
    # Reopen an existing monthly archive so earlier results are not overwritten
    if [[ -f "$archive.gz" ]]; then
        gunzip "$archive.gz"
    fi
    tar -rf "$archive" "${month}"*.json && \
        rm -f "${month}"*.json
    gzip "$archive"
done
```
|
||||
|
||||
### Prune Old Baselines
|
||||
|
||||
Keep only the last N baselines:
|
||||
|
||||
```bash
#!/bin/bash
# prune-baselines.sh
# Keeps only the 10 most recent baseline archives

BASELINE_ARCHIVE="/data/golden-corpus/bench/baselines/archive"
KEEP_COUNT=10

# Stop if the directory is missing, so nothing is deleted from the wrong place
cd "$BASELINE_ARCHIVE" || exit 1
ls -t baseline-*.json | tail -n +$((KEEP_COUNT + 1)) | xargs -r rm -f
```
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [Golden Corpus Folder Layout](golden-corpus-layout.md)
- [Ground Truth Corpus Overview](ground-truth-corpus.md)
- [Golden Corpus Operations Runbook](../../runbooks/golden-corpus-operations.md)
|
||||