Golden Set Schema Documentation

Version: 1.0.0
Module: BinaryIndex.GoldenSet
Last Updated: 2026-01-10

Overview

Golden Sets are ground-truth definitions of how a vulnerability manifests at the code level. Each one captures the specific functions, basic block edges, sinks, and constants that characterize a vulnerability, enabling:

  • Deterministic vulnerability detection via fingerprint matching
  • Backport verification through pre/post patch comparison
  • Audit trail for security claims with content-addressed provenance

YAML Schema

Golden sets are stored as human-readable YAML files for git-friendliness and easy review.

Full Example

# GoldenSet.yaml schema v1.0.0
id: "CVE-2024-0727"
component: "openssl"

targets:
  - function: "PKCS12_parse"
    edges:
      - "bb3->bb7"
      - "bb7->bb9"
    sinks:
      - "memcpy"
      - "OPENSSL_malloc"
    constants:
      - "0x400"
      - "0xdeadbeef"
    taint_invariant: "len(field) <= 0x400 required before memcpy"
    source_file: "crypto/pkcs12/p12_kiss.c"
    source_line: 142

  - function: "PKCS12_unpack_p7data"
    edges:
      - "bb1->bb3"
    sinks:
      - "d2i_ASN1_OCTET_STRING"

witness:
  arguments:
    - "--file"
    - "<fuzz.bin>"
  invariant: "Malformed PKCS12 with oversized authsafe"
  poc_file_ref: "sha256:abc123def456abc123def456abc123def456abc123def456abc123def456abc1"

metadata:
  author_id: "security-team@example.com"
  created_at: "2025-01-10T12:00:00Z"
  source_ref: "https://nvd.nist.gov/vuln/detail/CVE-2024-0727"
  reviewed_by: "senior-analyst@example.com"
  reviewed_at: "2025-01-11T09:00:00Z"
  tags:
    - "memory-corruption"
    - "heap-overflow"
    - "pkcs12"
  schema_version: "1.0.0"

Minimal Example

id: "CVE-2024-0727"
component: "openssl"
targets:
  - function: "vulnerable_function"
metadata:
  author_id: "analyst@example.com"
  created_at: "2025-01-10T12:00:00Z"
  source_ref: "https://nvd.nist.gov/vuln/detail/CVE-2024-0727"

Field Reference

Root Fields

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| id | string | Yes | Vulnerability identifier (CVE-YYYY-NNNN or GHSA-xxxx-xxxx-xxxx) |
| component | string | Yes | Affected component name (e.g., "openssl", "glibc") |
| targets | array | Yes | List of vulnerable code targets (min 1) |
| witness | object | No | Reproduction witness input |
| metadata | object | Yes | Authorship and review metadata |

Vulnerable Target Fields

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| function | string | Yes | Function name (symbol or demangled) |
| edges | array | No | Basic block edges (format: "bbN->bbM") |
| sinks | array | No | Sink functions reached (e.g., "memcpy") |
| constants | array | No | Magic values identifying the vulnerability |
| taint_invariant | string | No | Human-readable exploitation invariant |
| source_file | string | No | Source file hint |
| source_line | integer | No | Source line hint |

Witness Input Fields

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| arguments | array | No | Command-line arguments to trigger vulnerability |
| invariant | string | No | Human-readable precondition |
| poc_file_ref | string | No | Content-addressed PoC file reference |

Metadata Fields

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| author_id | string | Yes | Author identifier (email or handle) |
| created_at | string | Yes | Creation timestamp (ISO 8601 UTC) |
| source_ref | string | Yes | Advisory URL or commit hash |
| reviewed_by | string | No | Reviewer identifier |
| reviewed_at | string | No | Review timestamp (ISO 8601 UTC) |
| tags | array | No | Classification tags |
| schema_version | string | No | Schema version (default: "1.0.0") |
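
The required fields listed in the tables above can be checked with a few lines of code. The following is a minimal sketch, assuming the golden set has already been parsed into a Python dict; the helper name `missing_required` is hypothetical and is not part of StellaOps.BinaryIndex.GoldenSet.

```python
# Required keys, copied from the field reference tables.
REQUIRED_ROOT = ("id", "component", "targets", "metadata")
REQUIRED_TARGET = ("function",)
REQUIRED_METADATA = ("author_id", "created_at", "source_ref")


def missing_required(golden_set: dict) -> list[str]:
    """Return dotted paths of required fields that are absent (hypothetical helper)."""
    missing = [key for key in REQUIRED_ROOT if key not in golden_set]

    targets = golden_set.get("targets") or []
    if "targets" in golden_set and len(targets) < 1:
        missing.append("targets[0]")  # at least one target is required
    for index, target in enumerate(targets):
        missing += [
            f"targets[{index}].{key}" for key in REQUIRED_TARGET if key not in target
        ]

    if "metadata" in golden_set:
        metadata = golden_set["metadata"] or {}
        missing += [
            f"metadata.{key}" for key in REQUIRED_METADATA if key not in metadata
        ]
    return missing


# The minimal example from this document, as a dict.
minimal = {
    "id": "CVE-2024-0727",
    "component": "openssl",
    "targets": [{"function": "vulnerable_function"}],
    "metadata": {
        "author_id": "analyst@example.com",
        "created_at": "2025-01-10T12:00:00Z",
        "source_ref": "https://nvd.nist.gov/vuln/detail/CVE-2024-0727",
    },
}
print(missing_required(minimal))  # []
```

This only checks presence. Types and string formats are left to schema validation.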

JSON Schema

The following JSON Schema can be used for validation:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://stella-ops.org/schemas/golden-set/v1.0.0",
  "title": "Golden Set Definition",
  "type": "object",
  "required": ["id", "component", "targets", "metadata"],
  "properties": {
    "id": {
      "type": "string",
      "pattern": "^CVE-\\d{4}-\\d{4,}$|^GHSA-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{4}$",
      "description": "Vulnerability identifier"
    },
    "component": {
      "type": "string",
      "minLength": 1,
      "description": "Affected component name"
    },
    "targets": {
      "type": "array",
      "minItems": 1,
      "items": { "$ref": "#/$defs/vulnerableTarget" },
      "description": "Vulnerable code targets"
    },
    "witness": { 
      "$ref": "#/$defs/witnessInput",
      "description": "Reproduction witness input"
    },
    "metadata": { 
      "$ref": "#/$defs/metadata",
      "description": "Authorship and review metadata"
    }
  },
  "$defs": {
    "vulnerableTarget": {
      "type": "object",
      "required": ["function"],
      "properties": {
        "function": { 
          "type": "string", 
          "minLength": 1,
          "description": "Function name"
        },
        "edges": {
          "type": "array",
          "items": { 
            "type": "string", 
            "pattern": "^bb\\d+->bb\\d+$" 
          },
          "description": "Basic block edges"
        },
        "sinks": {
          "type": "array",
          "items": { "type": "string" },
          "description": "Sink functions"
        },
        "constants": {
          "type": "array",
          "items": { "type": "string" },
          "description": "Magic values"
        },
        "taint_invariant": { 
          "type": "string",
          "description": "Exploitation invariant"
        },
        "source_file": { 
          "type": "string",
          "description": "Source file hint"
        },
        "source_line": { 
          "type": "integer", 
          "minimum": 1,
          "description": "Source line hint"
        }
      }
    },
    "witnessInput": {
      "type": "object",
      "properties": {
        "arguments": { 
          "type": "array", 
          "items": { "type": "string" },
          "description": "Command-line arguments"
        },
        "invariant": { 
          "type": "string",
          "description": "Human-readable precondition"
        },
        "poc_file_ref": { 
          "type": "string", 
          "pattern": "^sha256:[a-f0-9]{64}$",
          "description": "Content-addressed PoC reference"
        }
      }
    },
    "metadata": {
      "type": "object",
      "required": ["author_id", "created_at", "source_ref"],
      "properties": {
        "author_id": { 
          "type": "string",
          "description": "Author identifier"
        },
        "created_at": { 
          "type": "string", 
          "format": "date-time",
          "description": "Creation timestamp (ISO 8601)"
        },
        "source_ref": { 
          "type": "string", 
          "format": "uri",
          "description": "Advisory URL or commit hash"
        },
        "reviewed_by": { 
          "type": "string",
          "description": "Reviewer identifier"
        },
        "reviewed_at": { 
          "type": "string", 
          "format": "date-time",
          "description": "Review timestamp (ISO 8601)"
        },
        "tags": { 
          "type": "array", 
          "items": { "type": "string" },
          "description": "Classification tags"
        },
        "schema_version": { 
          "type": "string", 
          "pattern": "^\\d+\\.\\d+\\.\\d+$",
          "description": "Schema version"
        }
      }
    }
  }
}
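
For a quick look at what the string patterns in the schema accept, they can be exercised on their own. This is a rough sketch using only the Python standard library, not a JSON Schema validator; the function name `pattern_errors` is hypothetical, and `fullmatch` is assumed to be an adequate stand-in for the schema's `^...$` anchors.

```python
import re

# Patterns copied from the JSON Schema above, with the ^...$ anchors replaced
# by fullmatch(). re.ASCII restricts \d to 0-9, as in ECMA-262 regexes.
ID_PATTERN = re.compile(
    r"CVE-\d{4}-\d{4,}|GHSA-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{4}", re.ASCII
)
POC_REF_PATTERN = re.compile(r"sha256:[a-f0-9]{64}")
VERSION_PATTERN = re.compile(r"\d+\.\d+\.\d+", re.ASCII)


def pattern_errors(golden_set: dict) -> list[str]:
    """Return the paths of fields whose values fail their schema pattern."""
    errors = []
    if not ID_PATTERN.fullmatch(str(golden_set.get("id", ""))):
        errors.append("id")

    # witness and its fields are optional, so only check what is present.
    poc_ref = golden_set.get("witness", {}).get("poc_file_ref")
    if poc_ref is not None and not POC_REF_PATTERN.fullmatch(poc_ref):
        errors.append("witness.poc_file_ref")

    version = golden_set.get("metadata", {}).get("schema_version")
    if version is not None and not VERSION_PATTERN.fullmatch(version):
        errors.append("metadata.schema_version")
    return errors


good = {"id": "CVE-2024-0727", "witness": {"poc_file_ref": "sha256:" + "ab" * 32}}
bad = {"id": "CVE-24-1", "witness": {"poc_file_ref": "sha256:abc123"}}
print(pattern_errors(good))  # []
print(pattern_errors(bad))   # ['id', 'witness.poc_file_ref']
```

Note that the digest after `sha256:` must be exactly 64 lowercase hex characters, and that CVE identifiers may carry more than four digits in the sequence part.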

Edge Format

Basic block edges follow the format bbN->bbM where:

  • bbN is the source basic block identifier
  • bbM is the target basic block identifier
  • The -> separator indicates control flow direction

Examples:

  • bb3->bb7 - Control flows from block 3 to block 7
  • bb7->bb9 - Control flows from block 7 to block 9
  • bb1->bb3 - Control flows from block 1 to block 3

Edge identifiers match common disassembler output (IDA, Ghidra, Binary Ninja).
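
As an illustration of the format, an edge string can be split into its source and target block numbers. This is a sketch only; `parse_edge` is a hypothetical helper, and the regular expression mirrors the `edges` pattern from the JSON Schema.

```python
import re

# Same shape as the schema pattern "^bb\d+->bb\d+$", with capture groups added.
EDGE_RE = re.compile(r"bb(\d+)->bb(\d+)", re.ASCII)


def parse_edge(edge: str) -> tuple[int, int]:
    """Return (source_block, target_block) for a 'bbN->bbM' edge string."""
    match = EDGE_RE.fullmatch(edge)
    if match is None:
        raise ValueError(f"not a bbN->bbM edge: {edge!r}")
    return int(match.group(1)), int(match.group(2))


print(parse_edge("bb3->bb7"))  # (3, 7)
```

Whitespace around the separator (for example "bb3 -> bb7") does not match the pattern and is rejected.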

Sink Registry

Known sinks are validated against the sink registry. Categories include:

| Category | Examples | CWEs |
|----------|----------|------|
| memory | memcpy, strcpy, free, malloc | CWE-120, CWE-787, CWE-415, CWE-416 |
| command_injection | system, exec, popen | CWE-78 |
| code_injection | dlopen, LoadLibrary | CWE-427 |
| path_traversal | fopen, open | CWE-22 |
| network | connect, send, recv | CWE-918, CWE-319 |
| sql_injection | sqlite3_exec, mysql_query | CWE-89 |
| crypto | EVP_DecryptUpdate, PKCS12_parse | CWE-327, CWE-295 |

Unknown sinks generate validation warnings but do not block acceptance.
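
The warn-but-accept behavior can be sketched as follows. The registry here is a hypothetical in-memory dict seeded only with the example sinks from the table above; the actual ISinkRegistry contents and API are not shown in this document, so a sink reported as unknown here may well be known to the real registry.

```python
# Hypothetical registry: only the example rows from the category table.
SINK_CATEGORIES = {
    "memory": ["memcpy", "strcpy", "free", "malloc"],
    "command_injection": ["system", "exec", "popen"],
    "code_injection": ["dlopen", "LoadLibrary"],
    "path_traversal": ["fopen", "open"],
    "network": ["connect", "send", "recv"],
    "sql_injection": ["sqlite3_exec", "mysql_query"],
    "crypto": ["EVP_DecryptUpdate", "PKCS12_parse"],
}
# Invert to sink -> category for direct lookup.
SINK_TO_CATEGORY = {
    sink: category for category, sinks in SINK_CATEGORIES.items() for sink in sinks
}


def check_sinks(sinks: list[str]) -> tuple[dict[str, str], list[str]]:
    """Split sinks into known ones (with category) and warnings for the rest."""
    known, warnings = {}, []
    for sink in sinks:
        category = SINK_TO_CATEGORY.get(sink)
        if category is None:
            warnings.append(f"unknown sink: {sink}")  # warn, do not reject
        else:
            known[sink] = category
    return known, warnings


known, warnings = check_sinks(["memcpy", "OPENSSL_malloc"])
print(known)     # {'memcpy': 'memory'}
print(warnings)  # ['unknown sink: OPENSSL_malloc']
```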

Content Addressing

Golden sets are content-addressed using SHA-256:

  1. The definition is serialized to canonical JSON (sorted keys, no whitespace)
  2. A SHA-256 hash is computed over the UTF-8 bytes
  3. The digest is formatted as sha256:<64-hex-chars>

Example: sha256:a1b2c3d4e5f6...

Content addressing enables:

  • Deduplication in storage
  • Audit trail verification
  • Immutable references in attestations
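
The three steps can be approximated with the Python standard library. This is a sketch under stated assumptions, not the StellaOps implementation: `content_digest` is a hypothetical name, and the exact canonicalization rules (number formatting, escaping of non-ASCII characters, handling of absent optional fields) are not specified in this document, so digests produced here may differ from those of the real serializer.

```python
import hashlib
import json


def content_digest(definition: dict) -> str:
    """Return 'sha256:<64-hex-chars>' for a golden set definition."""
    # Step 1: canonical JSON. Keys are sorted at every nesting level and the
    # separators carry no whitespace. ensure_ascii=False is an assumption.
    canonical = json.dumps(
        definition, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    # Step 2: SHA-256 over the UTF-8 bytes.
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # Step 3: prefix with the algorithm name.
    return f"sha256:{digest}"


a = content_digest({"id": "CVE-2024-0727", "component": "openssl"})
b = content_digest({"component": "openssl", "id": "CVE-2024-0727"})
print(a == b)  # True: key order in the input does not affect the digest
```

Because keys are sorted before hashing, two YAML files that differ only in key order yield the same digest, which is what makes deduplication possible.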

Status Workflow

Golden sets progress through these statuses:

Draft → InReview → Approved
          ↓
        Draft (if changes requested)
          
Approved → Deprecated (if CVE retracted)
         → Archived (for historical reference)

| Status | Description |
|--------|-------------|
| Draft | Initial creation, editable |
| InReview | Submitted for review |
| Approved | Active in corpus, used for detection |
| Deprecated | CVE retracted or superseded |
| Archived | Historical reference only |
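
The workflow above can be modeled as a small transition table. This is an illustrative sketch: the `Status` enum and `can_transition` function are hypothetical, and treating Deprecated and Archived as terminal is an assumption based on the diagram showing no outgoing arrows from them.

```python
from enum import Enum


class Status(Enum):
    DRAFT = "Draft"
    IN_REVIEW = "InReview"
    APPROVED = "Approved"
    DEPRECATED = "Deprecated"
    ARCHIVED = "Archived"


# Allowed moves, read directly off the workflow diagram.
ALLOWED = {
    Status.DRAFT: {Status.IN_REVIEW},
    Status.IN_REVIEW: {Status.APPROVED, Status.DRAFT},  # Draft if changes requested
    Status.APPROVED: {Status.DEPRECATED, Status.ARCHIVED},
    Status.DEPRECATED: set(),  # assumed terminal
    Status.ARCHIVED: set(),    # assumed terminal
}


def can_transition(current: Status, target: Status) -> bool:
    """Return True if the workflow permits moving from current to target."""
    return target in ALLOWED[current]


print(can_transition(Status.IN_REVIEW, Status.DRAFT))   # True
print(can_transition(Status.DRAFT, Status.APPROVED))    # False: review is required
```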

Best Practices

Authoring Golden Sets

  1. Start minimal - Begin with function name only, add edges/sinks as verified
  2. Use authoritative sources - NVD, vendor advisories, upstream commits
  3. Document invariants - Explain exploitation conditions in human-readable text
  4. Tag appropriately - Use consistent classification tags
  5. Review carefully - Treat golden sets like unit tests

Edge Selection

  1. Focus on vulnerable paths - Only include edges on the exploitation path
  2. Avoid over-specification - Fewer edges make matching more robust
  3. Document rationale - Explain why specific edges are included

Sink Selection

  1. Use known sinks - Prefer sinks from the registry
  2. Include all relevant sinks - List all sinks on the vulnerable path
  3. Order consistently - Alphabetical ordering aids diffing

API Reference

See StellaOps.BinaryIndex.GoldenSet for:

  • GoldenSetDefinition - Domain model
  • IGoldenSetValidator - Validation service
  • IGoldenSetStore - Storage interface
  • GoldenSetYamlSerializer - YAML serialization
  • ISinkRegistry - Sink lookup service