# Golden Set Schema Documentation
> **Version:** 1.0.0
> **Module:** BinaryIndex.GoldenSet
> **Last Updated:** 2026-01-10
## Overview
Golden Sets are ground-truth definitions of how a vulnerability manifests at the code level. Each one captures the specific functions, basic block edges, sinks, and constants that characterize a vulnerability, enabling:
- **Deterministic vulnerability detection** via fingerprint matching
- **Backport verification** through pre/post patch comparison
- **Audit trail** for security claims with content-addressed provenance
## YAML Schema
Golden sets are stored as human-readable YAML files for git-friendliness and easy review.
### Full Example
```yaml
# GoldenSet.yaml schema v1.0.0
id: "CVE-2024-0727"
component: "openssl"
targets:
  - function: "PKCS12_parse"
    edges:
      - "bb3->bb7"
      - "bb7->bb9"
    sinks:
      - "memcpy"
      - "OPENSSL_malloc"
    constants:
      - "0x400"
      - "0xdeadbeef"
    taint_invariant: "len(field) <= 0x400 required before memcpy"
    source_file: "crypto/pkcs12/p12_kiss.c"
    source_line: 142
  - function: "PKCS12_unpack_p7data"
    edges:
      - "bb1->bb3"
    sinks:
      - "d2i_ASN1_OCTET_STRING"
witness:
  arguments:
    - "--file"
    - "<fuzz.bin>"
  invariant: "Malformed PKCS12 with oversized authsafe"
  # Placeholder digest: exactly 64 lowercase hex characters, as the schema pattern requires
  poc_file_ref: "sha256:abc123def456abc123def456abc123def456abc123def456abc123def456abc1"
metadata:
  author_id: "security-team@example.com"
  created_at: "2025-01-10T12:00:00Z"
  source_ref: "https://nvd.nist.gov/vuln/detail/CVE-2024-0727"
  reviewed_by: "senior-analyst@example.com"
  reviewed_at: "2025-01-11T09:00:00Z"
  tags:
    - "memory-corruption"
    - "heap-overflow"
    - "pkcs12"
  schema_version: "1.0.0"
```
### Minimal Example
```yaml
id: "CVE-2024-0727"
component: "openssl"
targets:
  - function: "vulnerable_function"
metadata:
  author_id: "analyst@example.com"
  created_at: "2025-01-10T12:00:00Z"
  source_ref: "https://nvd.nist.gov/vuln/detail/CVE-2024-0727"
```
## Field Reference
### Root Fields
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `id` | string | Yes | Vulnerability identifier (`CVE-YYYY-NNNN`, with four or more digits in the sequence number, or `GHSA-xxxx-xxxx-xxxx`) |
| `component` | string | Yes | Affected component name (e.g., "openssl", "glibc") |
| `targets` | array | Yes | List of vulnerable code targets (min 1) |
| `witness` | object | No | Reproduction witness input |
| `metadata` | object | Yes | Authorship and review metadata |
### Vulnerable Target Fields
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `function` | string | Yes | Function name (symbol or demangled) |
| `edges` | array | No | Basic block edges (format: "bbN->bbM") |
| `sinks` | array | No | Sink functions reached (e.g., "memcpy") |
| `constants` | array | No | Magic values identifying the vulnerability |
| `taint_invariant` | string | No | Human-readable exploitation invariant |
| `source_file` | string | No | Source file hint |
| `source_line` | integer | No | Source line hint |
### Witness Input Fields
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `arguments` | array | No | Command-line arguments to trigger vulnerability |
| `invariant` | string | No | Human-readable precondition |
| `poc_file_ref` | string | No | Content-addressed PoC file reference |
### Metadata Fields
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `author_id` | string | Yes | Author identifier (email or handle) |
| `created_at` | string | Yes | Creation timestamp (ISO 8601 UTC) |
| `source_ref` | string | Yes | Advisory URL or commit hash |
| `reviewed_by` | string | No | Reviewer identifier |
| `reviewed_at` | string | No | Review timestamp (ISO 8601 UTC) |
| `tags` | array | No | Classification tags |
| `schema_version` | string | No | Schema version (default: "1.0.0") |
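
The field tables above map naturally onto a few typed records. The following Python sketch mirrors them with dataclasses. The class names and the `golden_set_from_dict` helper are hypothetical (they are not the C# `GoldenSetDefinition` model), and the helper assumes the YAML has already been parsed into a plain dictionary.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VulnerableTarget:
    function: str                          # required
    edges: tuple[str, ...] = ()            # optional, "bbN->bbM" strings
    sinks: tuple[str, ...] = ()
    constants: tuple[str, ...] = ()
    taint_invariant: Optional[str] = None
    source_file: Optional[str] = None
    source_line: Optional[int] = None


@dataclass(frozen=True)
class Metadata:
    author_id: str                         # required
    created_at: str                        # required, ISO 8601 UTC
    source_ref: str                        # required
    reviewed_by: Optional[str] = None
    reviewed_at: Optional[str] = None
    tags: tuple[str, ...] = ()
    schema_version: str = "1.0.0"          # default from the Metadata Fields table


@dataclass(frozen=True)
class GoldenSet:
    id: str
    component: str
    targets: tuple[VulnerableTarget, ...]
    metadata: Metadata
    witness: Optional[dict] = None         # kept as a raw mapping in this sketch


def golden_set_from_dict(doc: dict) -> GoldenSet:
    """Build a GoldenSet from a parsed mapping; a missing required key raises KeyError."""
    targets = tuple(
        VulnerableTarget(
            function=t["function"],
            edges=tuple(t.get("edges", ())),
            sinks=tuple(t.get("sinks", ())),
            constants=tuple(t.get("constants", ())),
            taint_invariant=t.get("taint_invariant"),
            source_file=t.get("source_file"),
            source_line=t.get("source_line"),
        )
        for t in doc["targets"]
    )
    if not targets:
        raise ValueError("targets must contain at least one entry")
    meta = doc["metadata"]
    metadata = Metadata(
        author_id=meta["author_id"],
        created_at=meta["created_at"],
        source_ref=meta["source_ref"],
        reviewed_by=meta.get("reviewed_by"),
        reviewed_at=meta.get("reviewed_at"),
        tags=tuple(meta.get("tags", ())),
        schema_version=meta.get("schema_version", "1.0.0"),
    )
    return GoldenSet(doc["id"], doc["component"], targets, metadata, doc.get("witness"))
```

Frozen dataclasses and tuples are used here because a golden set is meant to be an immutable, content-addressed record; that choice belongs to this sketch, not to the module.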
## JSON Schema
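The following JSON Schema (draft 2020-12) can be used for validation. Note that `source_ref` is declared with `format: uri`, so a validator that enforces `format` will reject a bare commit hash even though the field reference allows one.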
```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://stella-ops.org/schemas/golden-set/v1.0.0",
  "title": "Golden Set Definition",
  "type": "object",
  "required": ["id", "component", "targets", "metadata"],
  "properties": {
    "id": {
      "type": "string",
      "pattern": "^CVE-\\d{4}-\\d{4,}$|^GHSA-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{4}$",
      "description": "Vulnerability identifier"
    },
    "component": {
      "type": "string",
      "minLength": 1,
      "description": "Affected component name"
    },
    "targets": {
      "type": "array",
      "minItems": 1,
      "items": { "$ref": "#/$defs/vulnerableTarget" },
      "description": "Vulnerable code targets"
    },
    "witness": {
      "$ref": "#/$defs/witnessInput",
      "description": "Reproduction witness input"
    },
    "metadata": {
      "$ref": "#/$defs/metadata",
      "description": "Authorship and review metadata"
    }
  },
  "$defs": {
    "vulnerableTarget": {
      "type": "object",
      "required": ["function"],
      "properties": {
        "function": {
          "type": "string",
          "minLength": 1,
          "description": "Function name"
        },
        "edges": {
          "type": "array",
          "items": {
            "type": "string",
            "pattern": "^bb\\d+->bb\\d+$"
          },
          "description": "Basic block edges"
        },
        "sinks": {
          "type": "array",
          "items": { "type": "string" },
          "description": "Sink functions"
        },
        "constants": {
          "type": "array",
          "items": { "type": "string" },
          "description": "Magic values"
        },
        "taint_invariant": {
          "type": "string",
          "description": "Exploitation invariant"
        },
        "source_file": {
          "type": "string",
          "description": "Source file hint"
        },
        "source_line": {
          "type": "integer",
          "minimum": 1,
          "description": "Source line hint"
        }
      }
    },
    "witnessInput": {
      "type": "object",
      "properties": {
        "arguments": {
          "type": "array",
          "items": { "type": "string" },
          "description": "Command-line arguments"
        },
        "invariant": {
          "type": "string",
          "description": "Human-readable precondition"
        },
        "poc_file_ref": {
          "type": "string",
          "pattern": "^sha256:[a-f0-9]{64}$",
          "description": "Content-addressed PoC reference"
        }
      }
    },
    "metadata": {
      "type": "object",
      "required": ["author_id", "created_at", "source_ref"],
      "properties": {
        "author_id": {
          "type": "string",
          "description": "Author identifier"
        },
        "created_at": {
          "type": "string",
          "format": "date-time",
          "description": "Creation timestamp (ISO 8601)"
        },
        "source_ref": {
          "type": "string",
          "format": "uri",
          "description": "Advisory URL or commit hash"
        },
        "reviewed_by": {
          "type": "string",
          "description": "Reviewer identifier"
        },
        "reviewed_at": {
          "type": "string",
          "format": "date-time",
          "description": "Review timestamp (ISO 8601)"
        },
        "tags": {
          "type": "array",
          "items": { "type": "string" },
          "description": "Classification tags"
        },
        "schema_version": {
          "type": "string",
          "pattern": "^\\d+\\.\\d+\\.\\d+$",
          "description": "Schema version"
        }
      }
    }
  }
}
```
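
For quick checks without a JSON Schema library, the `pattern` constraints in the schema above can be applied directly with regular expressions. This is a minimal sketch, not the `IGoldenSetValidator` implementation. `check_patterns` is a hypothetical helper that expects an already-parsed dictionary and covers only the required root fields plus the `id`, `edges`, and `poc_file_ref` patterns.

```python
import re

# Patterns copied from the JSON Schema; re.ASCII keeps \d limited to 0-9.
ID_RE = re.compile(
    r"^CVE-\d{4}-\d{4,}$|^GHSA-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{4}$", re.ASCII
)
EDGE_RE = re.compile(r"^bb\d+->bb\d+$", re.ASCII)
POC_RE = re.compile(r"^sha256:[a-f0-9]{64}$", re.ASCII)
REQUIRED_ROOT = ("id", "component", "targets", "metadata")


def check_patterns(doc: dict) -> list[str]:
    """Return a list of human-readable errors; an empty list means no violations found."""
    errors = [f"missing required field: {key}" for key in REQUIRED_ROOT if key not in doc]

    if "id" in doc and not ID_RE.fullmatch(str(doc["id"])):
        errors.append(f"invalid id: {doc['id']!r}")

    for index, target in enumerate(doc.get("targets", [])):
        for edge in target.get("edges", []):
            if not EDGE_RE.fullmatch(edge):
                errors.append(f"targets[{index}]: invalid edge {edge!r}")

    poc = doc.get("witness", {}).get("poc_file_ref")
    if poc is not None and not POC_RE.fullmatch(poc):
        errors.append(f"invalid poc_file_ref: {poc!r}")

    return errors
```

A full validator would also enforce types, `minItems`, and the nested `required` lists; a JSON Schema library is the better tool once those matter.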
## Edge Format
Basic block edges follow the format `bbN->bbM` where:
- `bbN` is the source basic block identifier
- `bbM` is the target basic block identifier
- The `->` separator indicates control flow direction

Examples:
- `bb3->bb7` - Control flows from block 3 to block 7
- `bb7->bb9` - Control flows from block 7 to block 9
- `bb1->bb3` - Control flows from block 1 to block 3

Edge identifiers match common disassembler output (IDA, Ghidra, Binary Ninja).
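
As an illustration of the edge format, the sketch below parses a `bbN->bbM` string into its source and target block numbers. The `Edge` type and `parse_edge` function are hypothetical names introduced here, and treating block identifiers as plain non-negative integers is an assumption drawn from the schema pattern.

```python
import re
from typing import NamedTuple

# Same shape as the schema pattern, with capture groups for the two block numbers.
_EDGE_RE = re.compile(r"bb(\d+)->bb(\d+)", re.ASCII)


class Edge(NamedTuple):
    source: int  # block the control flow leaves
    target: int  # block the control flow enters


def parse_edge(text: str) -> Edge:
    """Parse 'bbN->bbM' into Edge(N, M); raise ValueError on any other input."""
    match = _EDGE_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a basic block edge: {text!r}")
    return Edge(int(match.group(1)), int(match.group(2)))
```

For example, `parse_edge("bb3->bb7")` yields `Edge(source=3, target=7)`.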
## Sink Registry
Known sinks are validated against the sink registry. Categories include:
| Category | Examples | CWEs |
|----------|----------|------|
| `memory` | memcpy, strcpy, free, malloc | CWE-120, CWE-787, CWE-415, CWE-416 |
| `command_injection` | system, exec, popen | CWE-78 |
| `code_injection` | dlopen, LoadLibrary | CWE-427 |
| `path_traversal` | fopen, open | CWE-22 |
| `network` | connect, send, recv | CWE-918, CWE-319 |
| `sql_injection` | sqlite3_exec, mysql_query | CWE-89 |
| `crypto` | EVP_DecryptUpdate, PKCS12_parse | CWE-327, CWE-295 |

Unknown sinks generate validation warnings but do not block acceptance.
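
The warn-but-accept behaviour can be sketched as a simple lookup. The table below contains only the example sinks listed above, so it is a stand-in for the real registry, whose full contents are not given here. `lookup_category` and `sink_warnings` are hypothetical helpers, not the `ISinkRegistry` API.

```python
from typing import Optional

# Stand-in registry built from the example column of the table above.
SINK_CATEGORIES = {
    "memory": {"memcpy", "strcpy", "free", "malloc"},
    "command_injection": {"system", "exec", "popen"},
    "code_injection": {"dlopen", "LoadLibrary"},
    "path_traversal": {"fopen", "open"},
    "network": {"connect", "send", "recv"},
    "sql_injection": {"sqlite3_exec", "mysql_query"},
    "crypto": {"EVP_DecryptUpdate", "PKCS12_parse"},
}


def lookup_category(sink: str) -> Optional[str]:
    """Return the category a sink belongs to, or None if it is not registered."""
    for category, sinks in SINK_CATEGORIES.items():
        if sink in sinks:
            return category
    return None


def sink_warnings(sinks: list[str]) -> list[str]:
    """Collect one warning per unknown sink; warnings never reject the golden set."""
    return [f"unknown sink: {sink}" for sink in sinks if lookup_category(sink) is None]
```

Returning warnings as data, rather than raising, keeps acceptance and reporting separate, which is what "do not block acceptance" implies.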
## Content Addressing
Golden sets are content-addressed using SHA256:
1. Definition is serialized to canonical JSON (sorted keys, no whitespace)
2. SHA256 hash is computed over UTF-8 bytes
3. Digest is formatted as `sha256:<64-hex-chars>`

Example: `sha256:a1b2c3d4e5f6...`

Content addressing enables:
- Deduplication in storage
- Audit trail verification
- Immutable references in attestations
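
The three steps above can be sketched with the Python standard library. The exact canonicalization rules (number formatting, Unicode escaping, handling of absent optional fields) are not specified in this document, so the choices below are assumptions, and digests from this sketch may not match those produced by the real serializer. `content_digest` is a hypothetical name.

```python
import hashlib
import json


def content_digest(definition: dict) -> str:
    """Compute a sha256:<hex> digest over a sorted-key, whitespace-free JSON encoding."""
    canonical = json.dumps(
        definition,
        sort_keys=True,          # step 1: sorted keys
        separators=(",", ":"),   # step 1: no insignificant whitespace
        ensure_ascii=False,      # assumption: emit non-ASCII as raw UTF-8, not \u escapes
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()  # step 2
    return f"sha256:{digest}"                                       # step 3
```

Because keys are sorted before hashing, two definitions that differ only in key order produce the same digest, which is what makes deduplication work.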
## Status Workflow
Golden sets progress through these statuses:
```
Draft → InReview → Approved
InReview → Draft (if changes requested)
Approved → Deprecated (if CVE retracted)
Approved → Archived (for historical reference)
```
| Status | Description |
|--------|-------------|
| `Draft` | Initial creation, editable |
| `InReview` | Submitted for review |
| `Approved` | Active in corpus, used for detection |
| `Deprecated` | CVE retracted or superseded |
| `Archived` | Historical reference only |
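
The workflow can be expressed as a small transition table. This sketch treats `Deprecated` and `Archived` as terminal because the workflow shows no transitions out of them; that is an assumption, as is the `transition` helper itself.

```python
# Allowed status transitions, taken from the workflow described above.
ALLOWED_TRANSITIONS = {
    "Draft": {"InReview"},
    "InReview": {"Approved", "Draft"},       # Draft when changes are requested
    "Approved": {"Deprecated", "Archived"},
    "Deprecated": set(),                     # assumed terminal
    "Archived": set(),                       # assumed terminal
}


def transition(current: str, new: str) -> str:
    """Return the new status if the move is allowed, otherwise raise ValueError."""
    if current not in ALLOWED_TRANSITIONS:
        raise ValueError(f"unknown status: {current!r}")
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"transition not allowed: {current} -> {new}")
    return new
```

With this table, `transition("InReview", "Draft")` succeeds, while `transition("Draft", "Approved")` raises because a golden set cannot skip review.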
## Best Practices
### Authoring Golden Sets
1. **Start minimal** - Begin with function name only, add edges/sinks as verified
2. **Use authoritative sources** - NVD, vendor advisories, upstream commits
3. **Document invariants** - Explain exploitation conditions in human-readable text
4. **Tag appropriately** - Use consistent classification tags
5. **Review carefully** - Treat golden sets like unit tests
### Edge Selection
1. **Focus on vulnerable paths** - Only include edges on the exploitation path
2. **Avoid over-specification** - Fewer edges = more robust matching
3. **Document rationale** - Explain why specific edges are included
### Sink Selection
1. **Use known sinks** - Prefer sinks from the registry
2. **Include all relevant sinks** - List all sinks on the vulnerable path
3. **Order consistently** - Alphabetical ordering aids diffing
## API Reference
See [StellaOps.BinaryIndex.GoldenSet](../../../src/BinaryIndex/__Libraries/StellaOps.BinaryIndex.GoldenSet/) for:
- `GoldenSetDefinition` - Domain model
- `IGoldenSetValidator` - Validation service
- `IGoldenSetStore` - Storage interface
- `GoldenSetYamlSerializer` - YAML serialization
- `ISinkRegistry` - Sink lookup service
## Related Documentation
- [BinaryIndex Architecture](architecture.md)
- [Delta Signature Matching](delta-signatures.md)
- [VEX Evidence Generation](../vex-lens/architecture.md)