# Golden Set Schema Documentation

> **Version:** 1.0.0
> **Module:** BinaryIndex.GoldenSet
> **Last Updated:** 2026-01-10

## Overview

Golden Sets are ground-truth definitions of how a vulnerability manifests at the code level. They capture the specific functions, basic block edges, sinks, and constants that characterize a vulnerability, enabling:

- **Deterministic vulnerability detection** via fingerprint matching
- **Backport verification** through pre/post patch comparison
- **Audit trail** for security claims with content-addressed provenance

## YAML Schema

Golden sets are stored as human-readable YAML files for git-friendliness and easy review.

### Full Example

```yaml
# GoldenSet.yaml schema v1.0.0
id: "CVE-2024-0727"
component: "openssl"

targets:
  - function: "PKCS12_parse"
    edges:
      - "bb3->bb7"
      - "bb7->bb9"
    sinks:
      - "memcpy"
      - "OPENSSL_malloc"
    constants:
      - "0x400"
      - "0xdeadbeef"
    taint_invariant: "len(field) <= 0x400 required before memcpy"
    source_file: "crypto/pkcs12/p12_kiss.c"
    source_line: 142

  - function: "PKCS12_unpack_p7data"
    edges:
      - "bb1->bb3"
    sinks:
      - "d2i_ASN1_OCTET_STRING"

witness:
  arguments:
    - "--file"
    - ""
  invariant: "Malformed PKCS12 with oversized authsafe"
  poc_file_ref: "sha256:abc123def456abc123def456abc123def456abc123def456abc123def456abc1"

metadata:
  author_id: "security-team@example.com"
  created_at: "2025-01-10T12:00:00Z"
  source_ref: "https://nvd.nist.gov/vuln/detail/CVE-2024-0727"
  reviewed_by: "senior-analyst@example.com"
  reviewed_at: "2025-01-11T09:00:00Z"
  tags:
    - "memory-corruption"
    - "heap-overflow"
    - "pkcs12"
  schema_version: "1.0.0"
```

### Minimal Example

```yaml
id: "CVE-2024-0727"
component: "openssl"

targets:
  - function: "vulnerable_function"

metadata:
  author_id: "analyst@example.com"
  created_at: "2025-01-10T12:00:00Z"
  source_ref: "https://nvd.nist.gov/vuln/detail/CVE-2024-0727"
```

## Field Reference

### Root Fields

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `id` | string | Yes | Vulnerability identifier (CVE-YYYY-NNNN or GHSA-xxxx-xxxx-xxxx) |
| `component` | string | Yes | Affected component name (e.g., "openssl", "glibc") |
| `targets` | array | Yes | List of vulnerable code targets (min 1) |
| `witness` | object | No | Reproduction witness input |
| `metadata` | object | Yes | Authorship and review metadata |

### Vulnerable Target Fields

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `function` | string | Yes | Function name (symbol or demangled) |
| `edges` | array | No | Basic block edges (format: "bbN->bbM") |
| `sinks` | array | No | Sink functions reached (e.g., "memcpy") |
| `constants` | array | No | Magic values identifying the vulnerability |
| `taint_invariant` | string | No | Human-readable exploitation invariant |
| `source_file` | string | No | Source file hint |
| `source_line` | integer | No | Source line hint (1-based) |

### Witness Input Fields

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `arguments` | array | No | Command-line arguments to trigger vulnerability |
| `invariant` | string | No | Human-readable precondition |
| `poc_file_ref` | string | No | Content-addressed PoC file reference (`sha256:` followed by 64 lowercase hex characters) |

### Metadata Fields

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `author_id` | string | Yes | Author identifier (email or handle) |
| `created_at` | string | Yes | Creation timestamp (ISO 8601 UTC) |
| `source_ref` | string | Yes | Advisory URL or commit hash |
| `reviewed_by` | string | No | Reviewer identifier |
| `reviewed_at` | string | No | Review timestamp (ISO 8601 UTC) |
| `tags` | array | No | Classification tags |
| `schema_version` | string | No | Schema version (default: "1.0.0") |

Note: the JSON Schema below annotates `source_ref` with `format: uri`, so validators configured to enforce formats may reject a bare commit hash.

## JSON Schema

The following JSON Schema can be used for validation:

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://stella-ops.org/schemas/golden-set/v1.0.0",
  "title": "Golden Set Definition",
  "type": "object",
  "required": ["id", "component", "targets", "metadata"],
  "properties": {
    "id": {
      "type": "string",
      "pattern": "^CVE-\\d{4}-\\d{4,}$|^GHSA-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{4}$",
      "description": "Vulnerability identifier"
    },
    "component": {
      "type": "string",
      "minLength": 1,
      "description": "Affected component name"
    },
    "targets": {
      "type": "array",
      "minItems": 1,
      "items": { "$ref": "#/$defs/vulnerableTarget" },
      "description": "Vulnerable code targets"
    },
    "witness": {
      "$ref": "#/$defs/witnessInput",
      "description": "Reproduction witness input"
    },
    "metadata": {
      "$ref": "#/$defs/metadata",
      "description": "Authorship and review metadata"
    }
  },
  "$defs": {
    "vulnerableTarget": {
      "type": "object",
      "required": ["function"],
      "properties": {
        "function": {
          "type": "string",
          "minLength": 1,
          "description": "Function name"
        },
        "edges": {
          "type": "array",
          "items": {
            "type": "string",
            "pattern": "^bb\\d+->bb\\d+$"
          },
          "description": "Basic block edges"
        },
        "sinks": {
          "type": "array",
          "items": { "type": "string" },
          "description": "Sink functions"
        },
        "constants": {
          "type": "array",
          "items": { "type": "string" },
          "description": "Magic values"
        },
        "taint_invariant": {
          "type": "string",
          "description": "Exploitation invariant"
        },
        "source_file": {
          "type": "string",
          "description": "Source file hint"
        },
        "source_line": {
          "type": "integer",
          "minimum": 1,
          "description": "Source line hint"
        }
      }
    },
    "witnessInput": {
      "type": "object",
      "properties": {
        "arguments": {
          "type": "array",
          "items": { "type": "string" },
          "description": "Command-line arguments"
        },
        "invariant": {
          "type": "string",
          "description": "Human-readable precondition"
        },
        "poc_file_ref": {
          "type": "string",
          "pattern": "^sha256:[a-f0-9]{64}$",
          "description": "Content-addressed PoC reference"
        }
      }
    },
    "metadata": {
      "type": "object",
      "required": ["author_id", "created_at", "source_ref"],
      "properties": {
        "author_id": {
          "type": "string",
          "description": "Author identifier"
        },
        "created_at": {
          "type": "string",
          "format": "date-time",
          "description": "Creation timestamp (ISO 8601)"
        },
        "source_ref": {
          "type": "string",
          "format": "uri",
          "description": "Advisory URL or commit hash"
        },
        "reviewed_by": {
          "type": "string",
          "description": "Reviewer identifier"
        },
        "reviewed_at": {
          "type": "string",
          "format": "date-time",
          "description": "Review timestamp (ISO 8601)"
        },
        "tags": {
          "type": "array",
          "items": { "type": "string" },
          "description": "Classification tags"
        },
        "schema_version": {
          "type": "string",
          "pattern": "^\\d+\\.\\d+\\.\\d+$",
          "description": "Schema version"
        }
      }
    }
  }
}
```

## Edge Format

Basic block edges follow the format `bbN->bbM` where:

- `bbN` is the source basic block identifier
- `bbM` is the target basic block identifier
- The `->` separator indicates control flow direction

Examples:

- `bb3->bb7` - Control flows from block 3 to block 7
- `bb7->bb9` - Control flows from block 7 to block 9
- `bb1->bb3` - Control flows from block 1 to block 3

Edge identifiers match common disassembler output (IDA, Ghidra, Binary Ninja).

## Sink Registry

Sinks listed in a golden set are validated against the sink registry. Categories include:

| Category | Examples | CWEs |
|----------|----------|------|
| `memory` | memcpy, strcpy, free, malloc | CWE-120, CWE-787, CWE-415, CWE-416 |
| `command_injection` | system, exec, popen | CWE-78 |
| `code_injection` | dlopen, LoadLibrary | CWE-427 |
| `path_traversal` | fopen, open | CWE-22 |
| `network` | connect, send, recv | CWE-918, CWE-319 |
| `sql_injection` | sqlite3_exec, mysql_query | CWE-89 |
| `crypto` | EVP_DecryptUpdate, PKCS12_parse | CWE-327, CWE-295 |

Unknown sinks generate validation warnings but do not block acceptance.

## Content Addressing

Golden sets are content-addressed using SHA-256:

1. The definition is serialized to canonical JSON (sorted keys, no whitespace)
2. The SHA-256 hash is computed over the UTF-8 bytes of that serialization
3. The digest is formatted as `sha256:<64-hex-chars>`

Example: `sha256:a1b2c3d4e5f6...`

Content addressing enables:

- Deduplication in storage
- Audit trail verification
- Immutable references in attestations

## Status Workflow

Golden sets progress through these statuses:

```
Draft → InReview → Approved
           ↓
        Draft (if changes requested)

Approved → Deprecated (if CVE retracted)
         → Archived (for historical reference)
```

| Status | Description |
|--------|-------------|
| `Draft` | Initial creation, editable |
| `InReview` | Submitted for review |
| `Approved` | Active in corpus, used for detection |
| `Deprecated` | CVE retracted or superseded |
| `Archived` | Historical reference only |

## Best Practices

### Authoring Golden Sets

1. **Start minimal** - Begin with the function name only; add edges and sinks as they are verified
2. **Use authoritative sources** - NVD, vendor advisories, upstream commits
3. **Document invariants** - Explain exploitation conditions in human-readable text
4. **Tag appropriately** - Use consistent classification tags
5. **Review carefully** - Treat golden sets like unit tests

### Edge Selection

1. **Focus on vulnerable paths** - Only include edges on the exploitation path
2. **Avoid over-specification** - Fewer edges = more robust matching
3. **Document rationale** - Explain why specific edges are included

### Sink Selection

1. **Use known sinks** - Prefer sinks from the registry
2. **Include all relevant sinks** - List all sinks on the vulnerable path
3. **Order consistently** - Alphabetical ordering aids diffing

## API Reference

See [StellaOps.BinaryIndex.GoldenSet](../../../src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.GoldenSet/) for:

- `GoldenSetDefinition` - Domain model
- `IGoldenSetValidator` - Validation service
- `IGoldenSetStore` - Storage interface
- `GoldenSetYamlSerializer` - YAML serialization
- `ISinkRegistry` - Sink lookup service

## Related Documentation

- [BinaryIndex Architecture](architecture.md)
- [Delta Signature Matching](delta-signatures.md)
- [VEX Evidence Generation](../vex-lens/architecture.md)
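
## Appendix: Content Addressing Sketch

The three content-addressing steps (canonical JSON, SHA-256 over the UTF-8 bytes, `sha256:` prefix) and the `bbN->bbM` edge format can be illustrated with a short Python sketch. It is not the StellaOps implementation: the function names `parse_edge`, `canonical_json`, and `content_digest` are hypothetical, and the canonicalization details (non-ASCII characters emitted unescaped, default number formatting, absent fields simply omitted) are assumptions. Digests produced here may therefore differ from those computed by the `StellaOps.BinaryIndex.GoldenSet` library.

```python
import hashlib
import json
import re

# Edge pattern taken from the JSON Schema in this document.
EDGE_PATTERN = re.compile(r"bb(\d+)->bb(\d+)")


def parse_edge(edge: str) -> tuple[int, int]:
    """Split "bbN->bbM" into (N, M); raise ValueError for any other shape."""
    match = EDGE_PATTERN.fullmatch(edge)
    if match is None:
        raise ValueError(f"not a bbN->bbM edge: {edge!r}")
    return int(match.group(1)), int(match.group(2))


def canonical_json(definition: dict) -> bytes:
    """Step 1: sorted keys, no insignificant whitespace, encoded as UTF-8.

    Assumption: non-ASCII characters are written as-is, not \\u-escaped.
    """
    text = json.dumps(
        definition,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return text.encode("utf-8")


def content_digest(definition: dict) -> str:
    """Steps 2 and 3: SHA-256 over the canonical bytes, as sha256:<64 hex>."""
    return "sha256:" + hashlib.sha256(canonical_json(definition)).hexdigest()


# Key order in the input does not affect the digest.
first = {"id": "CVE-2024-0727", "component": "openssl"}
second = {"component": "openssl", "id": "CVE-2024-0727"}
print(canonical_json(first).decode("utf-8"))
print(content_digest(first) == content_digest(second))
print(parse_edge("bb3->bb7"))
```

Sorting keys before hashing is what makes the digest independent of authoring order, so two YAML files that differ only in key order or formatting map to the same reference. List order is still significant in this sketch, which is one reason consistent ordering of sinks and edges helps.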