# Golden Set Authoring Guide

This document describes the workflow for authoring and curating Golden Sets: ground-truth definitions of how a vulnerability manifests at the code level, used for binary vulnerability detection.

## Overview

Golden Sets are YAML-based definitions that describe:

- **Vulnerable functions** - Entry points where vulnerabilities manifest
- **Sink functions** - Dangerous API calls that enable exploitation
- **Edge patterns** - Control flow patterns indicating vulnerability presence
- **Constants** - Magic numbers, buffer sizes, or version markers
- **Witness inputs** - Example triggers for the vulnerability

## Architecture

```
┌──────────────────────────────────────────────────────────────────────────────┐
│                        Golden Set Authoring Pipeline                         │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌────────────────┐    ┌─────────────────┐    ┌──────────────────────┐       │
│  │ CVE/Advisory   │───>│   Extractors    │───>│  Draft Golden Set    │       │
│  │ Sources        │    │ (NVD/OSV/GHSA)  │    │                      │       │
│  └────────────────┘    └─────────────────┘    └──────────────────────┘       │
│                                 │                        │                   │
│                                 v                        v                   │
│                        ┌─────────────────┐    ┌──────────────────────┐       │
│                        │ Upstream Commit │    │   AI Enrichment      │       │
│                        │    Analyzer     │───>│     Service          │       │
│                        └─────────────────┘    └──────────────────────┘       │
│                                                          │                   │
│                                                          v                   │
│                        ┌─────────────────┐    ┌──────────────────────┐       │
│                        │   Validator     │<───│   Review Workflow    │       │
│                        └─────────────────┘    └──────────────────────┘       │
│                                 │                        │                   │
│                                 v                        v                   │
│                        ┌─────────────────────────────────────────────┐       │
│                        │             PostgreSQL Storage              │       │
│                        │       (content-addressed, versioned)        │       │
│                        └─────────────────────────────────────────────┘       │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘
```

## Components

### 1. Extractors

Extractors pull vulnerability data from advisory sources:

```csharp
// Extract from NVD/OSV/GHSA
var extractor = serviceProvider.GetRequiredService();
var result = await extractor.ExtractAsync(
    "CVE-2024-1234",
    "openssl",
    new ExtractionOptions
    {
        UseAiEnrichment = true,
        IncludeUpstreamCommits = true,
        IncludeRelatedCves = true
    });
```

**Supported Sources:**

- **NVD** - National Vulnerability Database
- **OSV** - Open Source Vulnerabilities
- **GHSA** - GitHub Security Advisories

### 2. Upstream Commit Analyzer

Analyzes fix commits to extract:

- Modified functions (from hunk headers)
- Added constants (hex values, buffer sizes)
- Added conditions (bounds checks, NULL checks)

```csharp
var analyzer = serviceProvider.GetRequiredService();

// Parse commit URL
var parsed = analyzer.ParseCommitUrl("https://github.com/curl/curl/commit/abc123");

// Analyze commits
var result = await analyzer.AnalyzeAsync([
    "https://github.com/curl/curl/commit/abc123",
    "https://github.com/curl/curl/commit/def456"
]);

// Result contains:
// - ModifiedFunctions: ["parse_header", "validate_length"]
// - AddedConstants: ["0x1000", "sizeof(buffer)"]
// - AddedConditions: ["bounds_check", "null_check"]
```

**Supported Platforms:**

- GitHub (`github.com/owner/repo/commit/hash`)
- GitLab (`gitlab.com/owner/repo/-/commit/hash`)
- Bitbucket (`bitbucket.org/owner/repo/commits/hash`)

### 3. CWE-to-Sink Mapper

Maps CWE classifications to relevant sink functions:

```csharp
// Get sinks for buffer overflow CWEs
var sinks = CweToSinkMapper.GetSinksForCwes(["CWE-120", "CWE-122"]);
// Returns: ["memcpy", "strcpy", "sprintf", "gets", ...]

// Get all mapped CWEs
var cwes = CweToSinkMapper.GetMappedCwes();
```

**Supported CWE Categories:**

| Category | CWE IDs | Example Sinks |
|----------|---------|---------------|
| Buffer Overflow | CWE-120, CWE-121, CWE-122, CWE-787 | `memcpy`, `strcpy`, `sprintf` |
| Format String | CWE-134 | `printf`, `fprintf`, `sprintf` |
| Integer Overflow | CWE-190, CWE-191 | `malloc`, `calloc`, `realloc` |
| Use After Free | CWE-416 | `free`, `delete`, `delete[]` |
| Command Injection | CWE-78 | `system`, `popen`, `execve` |
| SQL Injection | CWE-89 | `PQexec`, `mysql_query`, `sqlite3_exec` |
| Path Traversal | CWE-22 | `fopen`, `open`, `access` |
| NULL Pointer | CWE-476 | (dereference detection) |

### 4. AI Enrichment Service

Optional AI-assisted enrichment using advisory text and commit analysis:

```csharp
var enrichmentService = serviceProvider.GetRequiredService();

if (enrichmentService.IsAvailable)
{
    var result = await enrichmentService.EnrichAsync(
        draftGoldenSet,
        new GoldenSetEnrichmentContext
        {
            CommitAnalysis = commitResult,
            CweIds = ["CWE-787"],
            AdvisoryText = "Buffer overflow in parse_header..."
        });

    // Result.EnrichedDraft contains improved definition
    // Result.ActionsApplied describes what was added/refined
}
```

**Enrichment Actions:**

- `function_added` - New vulnerable function identified
- `sink_added` - New sink function from CWE mapping
- `constant_extracted` - Magic value from commits
- `edge_suggested` - Control flow pattern suggested
- `witness_hint_added` - Example trigger input

### 5. Review Workflow

State machine for golden set curation:

```
Draft ──> InReview ──> Approved ──> Deprecated ──> Archived
             │           │            │
             └───────────┴────────────┴── (can return to Draft)
```

```csharp
var reviewService = serviceProvider.GetRequiredService();

// Submit for review
await reviewService.SubmitForReviewAsync("CVE-2024-1234", "author@example.com");

// Approve
await reviewService.ApproveAsync("CVE-2024-1234", "reviewer@example.com", "LGTM");

// Or request changes
await reviewService.RequestChangesAsync(
    "CVE-2024-1234",
    "reviewer@example.com",
    "Needs specific function name",
    [new ChangeRequest { Field = "targets[0].functionName", Suggestion = "parse_header" }]);
```

## Golden Set Schema

```yaml
# CVE-2024-1234.golden.yaml
schema_version: "1.0"
id: CVE-2024-1234
component: openssl
targets:
  - function: parse_header
    sinks:
      - memcpy
      - strcpy
    constants:
      - "0x1000"
      - "sizeof(buffer)"
    edges:
      - bb1->bb2  # bounds check bypass
witness:
  stdin: "AAAA..."
  argv:
    - "--vulnerable-option"
  env:
    BUFFER_SIZE: "99999"
metadata:
  author_id: researcher@example.com
  source_ref: https://nvd.nist.gov/vuln/detail/CVE-2024-1234
  created_at: 2024-01-15T10:30:00Z
  tags:
    - memory-corruption
    - heap-overflow
```

## Configuration

```yaml
# appsettings.yaml
BinaryIndex:
  GoldenSet:
    SchemaVersion: "1.0"
    Validation:
      ValidateCveExists: true
      ValidateSinks: true
      StrictEdgeFormat: true
      OfflineMode: false
    Storage:
      PostgresSchema: golden_sets
      ConnectionStringName: BinaryIndex
    Caching:
      SinkRegistryCacheMinutes: 60
      DefinitionCacheMinutes: 15
    Authoring:
      EnableAiEnrichment: true
      EnableCommitAnalysis: true
      MaxCommitsToAnalyze: 5
      AutoAcceptConfidenceThreshold: 0.8
```

## Service Registration

```csharp
// Program.cs or Startup.cs
services.AddGoldenSetServices(configuration);
services.AddGoldenSetAuthoring();
services.AddGoldenSetPostgresStorage();

// Optional: Add HTTP client for commit analysis
services.AddHttpClient("upstream-commits", client =>
{
    client.Timeout = TimeSpan.FromSeconds(30);
    client.DefaultRequestHeaders.Add("User-Agent", "StellaOps-GoldenSet/1.0");
});
```

## CLI Usage

```bash
# Initialize a golden set from CVE
stella scanner golden init CVE-2024-1234 --component openssl

# With options
stella scanner golden init CVE-2024-1234 \
  --component openssl \
  --output ./golden-sets/CVE-2024-1234.yaml \
  --no-ai \
  --store

# Interactive mode for refinement
stella scanner golden init CVE-2024-1234 --interactive

# Export as JSON
stella scanner golden init CVE-2024-1234 --json
```

## Validation Rules

1. **CVE Format** - Must match `CVE-YYYY-NNNNN` or `GHSA-xxxx-xxxx-xxxx`
2. **Component Required** - Non-empty component name
3. **Targets Required** - At least one vulnerable target
4. **Sinks Validation** - Sinks must be in the sink registry
5. **Edge Format** - Must match `bbN->bbM` pattern (if strict mode)
6. **Constants Format** - Hex constants must be valid (`0x...`)

## Best Practices

1. **Start with Commit Analysis** - Fix commits are the most reliable source
2. **Use CWE Mapping** - Automatic sink suggestions based on vulnerability type
3. **Validate Locally** - Always validate before submitting for review
4. **Include Witness Data** - Example inputs help verify detection accuracy
5. **Tag Appropriately** - Use consistent tags for categorization
6. **Document Source** - Always include `source_ref` for traceability

## Metrics

Track authoring quality with:

- **Extraction Confidence** - Overall, per-source, per-field
- **Enrichment Actions** - What was added automatically
- **Review Iterations** - How many rounds before approval
- **Detection Rate** - How well the golden set detects known-vulnerable binaries

## See Also

- [Golden Set Schema Reference](../schemas/golden-set-schema.md)
- [Sink Registry](../modules/scanner/sink-registry.md)
- [Binary Analysis Architecture](../modules/scanner/architecture.md)
- [Vulnerability Detection](../modules/scanner/vulnerability-detection.md)
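
## Appendix: Format Check Sketch

The format checks listed under Validation Rules (ID, edge, and hex constant formats) could be approximated as below. This is a minimal sketch, not the validator shipped with the tooling: the class `GoldenSetFormatChecks` and its methods are hypothetical names, and the exact patterns are assumptions. In particular, it assumes a CVE number may have four or more digits (so that `CVE-2024-1234` passes) and that each GHSA group is four lowercase alphanumeric characters.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the documented API. It only mirrors the
// format rules listed under "Validation Rules"; registry and CVE-existence
// checks are out of scope here.
public static class GoldenSetFormatChecks
{
    // Assumption: four-digit year, then four or more digits.
    private static readonly Regex CveId = new(@"^CVE-[0-9]{4}-[0-9]{4,}\z");

    // Assumption: three groups of four lowercase alphanumerics.
    private static readonly Regex GhsaId = new(@"^GHSA(-[0-9a-z]{4}){3}\z");

    // Strict edge format: bbN->bbM with decimal block numbers.
    private static readonly Regex Edge = new(@"^bb[0-9]+->bb[0-9]+\z");

    // Hex constant: "0x" followed by at least one hex digit.
    private static readonly Regex Hex = new(@"^0x[0-9A-Fa-f]+\z");

    public static bool IsValidId(string id) =>
        CveId.IsMatch(id) || GhsaId.IsMatch(id);

    public static bool IsValidEdge(string edge) => Edge.IsMatch(edge);

    // Only constants that start with "0x" are checked as hex; anything else
    // (for example "sizeof(buffer)") is accepted unchanged.
    public static bool IsValidConstant(string constant) =>
        !constant.StartsWith("0x", StringComparison.Ordinal) || Hex.IsMatch(constant);
}
```

With these assumptions, `IsValidId("CVE-2024-1234")` and `IsValidEdge("bb1->bb2")` are accepted, while `IsValidConstant("0xZZ")` is rejected. Sink registry lookups and CVE existence checks would still need the real services.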