# Golden Set Authoring Guide

This guide describes the workflow for authoring and curating Golden Sets: ground-truth definitions of how a vulnerability manifests at the code level, used for binary vulnerability detection.

## Overview

Golden Sets are YAML-based definitions that describe:
- **Vulnerable functions** - Entry points where vulnerabilities manifest
- **Sink functions** - Dangerous API calls that enable exploitation
- **Edge patterns** - Control flow patterns indicating vulnerability presence
- **Constants** - Magic numbers, buffer sizes, or version markers
- **Witness inputs** - Example triggers for the vulnerability

## Architecture

```
┌──────────────────────────────────────────────────────────────────────────────┐
│                         Golden Set Authoring Pipeline                        │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌────────────────┐    ┌─────────────────┐    ┌──────────────────────┐      │
│  │  CVE/Advisory  │───>│   Extractors    │───>│    Draft Golden Set  │      │
│  │    Sources     │    │  (NVD/OSV/GHSA) │    │                      │      │
│  └────────────────┘    └─────────────────┘    └──────────────────────┘      │
│                               │                          │                   │
│                               v                          v                   │
│                        ┌─────────────────┐    ┌──────────────────────┐      │
│                        │ Upstream Commit │    │   AI Enrichment      │      │
│                        │    Analyzer     │───>│   Service            │      │
│                        └─────────────────┘    └──────────────────────┘      │
│                                                          │                   │
│                                                          v                   │
│                        ┌─────────────────┐    ┌──────────────────────┐      │
│                        │   Validator     │<───│   Review Workflow    │      │
│                        └─────────────────┘    └──────────────────────┘      │
│                               │                          │                   │
│                               v                          v                   │
│                        ┌─────────────────────────────────────────────┐      │
│                        │           PostgreSQL Storage                │      │
│                        │    (content-addressed, versioned)           │      │
│                        └─────────────────────────────────────────────┘      │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘
```
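The storage layer is described above as content-addressed. As a rough illustration of what that could mean, the sketch below derives a stable digest from a definition's YAML text. The `ContentAddressSketch` class, the choice of SHA-256, the `sha256:` prefix, and the line-ending normalization are all assumptions made for this example; the guide does not specify how the real storage layer computes its addresses.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper: not part of the Golden Set API described in this guide.
public static class ContentAddressSketch
{
    // Returns a digest that depends only on the definition's content,
    // so identical definitions always map to the same address.
    public static string Compute(string yaml)
    {
        // Assumption: normalize line endings so CRLF and LF files hash identically.
        var normalized = yaml.Replace("\r\n", "\n");
        var bytes = Encoding.UTF8.GetBytes(normalized);

        // SHA256.HashData and Convert.ToHexString require .NET 5 or later.
        var hash = SHA256.HashData(bytes);
        return "sha256:" + Convert.ToHexString(hash).ToLowerInvariant();
    }
}
```

Under these assumptions, two byte-identical definitions share one address, and any edit to a definition produces a new address that can be stored as a new version.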

## Components

### 1. Extractors

Extractors pull vulnerability data from advisory sources:

```csharp
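// GetRequiredService<T>() is an extension method from this namespace.
using Microsoft.Extensions.DependencyInjection;

// Resolve the extractor, which pulls advisory data from NVD/OSV/GHSA.
var extractor = serviceProvider.GetRequiredService<IGoldenSetExtractor>();

var result = await extractor.ExtractAsync(
    "CVE-2024-1234",   // vulnerability ID
    "openssl",         // affected component
    new ExtractionOptions
    {
        UseAiEnrichment = true,          // run the optional AI enrichment step
        IncludeUpstreamCommits = true,   // also analyze upstream fix commits
        IncludeRelatedCves = true
    });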
```

**Supported Sources:**
- **NVD** - National Vulnerability Database
- **OSV** - Open Source Vulnerabilities
- **GHSA** - GitHub Security Advisories

### 2. Upstream Commit Analyzer

Analyzes fix commits to extract:
- Modified functions (from hunk headers)
- Added constants (hex values, buffer sizes)
- Added conditions (bounds checks, NULL checks)

```csharp
var analyzer = serviceProvider.GetRequiredService<IUpstreamCommitAnalyzer>();

// Parse commit URL
var parsed = analyzer.ParseCommitUrl("https://github.com/curl/curl/commit/abc123");

// Analyze commits
var result = await analyzer.AnalyzeAsync([
    "https://github.com/curl/curl/commit/abc123",
    "https://github.com/curl/curl/commit/def456"
]);

// Result contains:
// - ModifiedFunctions: ["parse_header", "validate_length"]
// - AddedConstants: ["0x1000", "sizeof(buffer)"]
// - AddedConditions: ["bounds_check", "null_check"]
```
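To make "modified functions (from hunk headers)" concrete, here is a minimal sketch of how function names can be read from the context text that `git diff` appends to each `@@ ... @@` hunk header. `HunkHeaderSketch` is a hypothetical helper written for this guide, not the `IUpstreamCommitAnalyzer` implementation, and it assumes C-style function signatures.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: illustrates the idea only.
public static class HunkHeaderSketch
{
    // Matches "@@ -a,b +c,d @@ <context>" and captures the trailing context text.
    private static readonly Regex HunkHeader = new Regex(
        @"^@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@\s*(?<context>.*)$");

    // Captures the first identifier that is directly followed by "(".
    private static readonly Regex FunctionName = new Regex(
        @"(?<name>[A-Za-z_][A-Za-z0-9_]*)\s*\(");

    // Returns distinct function names in order of first appearance.
    public static IReadOnlyList<string> ExtractModifiedFunctions(string unifiedDiff)
    {
        var seen = new HashSet<string>(StringComparer.Ordinal);
        var ordered = new List<string>();

        foreach (var rawLine in unifiedDiff.Split('\n'))
        {
            var hunk = HunkHeader.Match(rawLine.TrimEnd('\r'));
            if (!hunk.Success) continue; // not a hunk header

            var name = FunctionName.Match(hunk.Groups["context"].Value);
            if (name.Success && seen.Add(name.Groups["name"].Value))
            {
                ordered.Add(name.Groups["name"].Value);
            }
        }

        return ordered;
    }
}
```

Hunk context is a heuristic supplied by the diff tool, so a production analyzer would likely need to confirm the names against the changed source; this sketch skips that step.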

**Supported Platforms:**
- GitHub (`github.com/owner/repo/commit/hash`)
- GitLab (`gitlab.com/owner/repo/-/commit/hash`)
- Bitbucket (`bitbucket.org/owner/repo/commits/hash`)
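The three URL shapes listed above can be recognized with one pattern per platform. The sketch below is an illustration only: `CommitUrlSketch` and `ParsedCommitUrl` are hypothetical names, the real `ParseCommitUrl` return type is not documented here, and nested GitLab groups are deliberately not handled.

```csharp
#nullable enable
using System;
using System.Text.RegularExpressions;

// Hypothetical result type for this sketch.
public sealed record ParsedCommitUrl(string Platform, string Owner, string Repo, string Hash);

public static class CommitUrlSketch
{
    // One pattern per supported platform, mirroring the URL shapes listed above.
    // Hashes are accepted as 4 to 64 hex characters (abbreviated or full).
    private static readonly (string Platform, Regex Pattern)[] Patterns =
    {
        ("github", new Regex(
            @"^https?://github\.com/(?<owner>[^/]+)/(?<repo>[^/]+)/commit/(?<hash>[0-9a-fA-F]{4,64})/?$")),
        ("gitlab", new Regex(
            @"^https?://gitlab\.com/(?<owner>[^/]+)/(?<repo>[^/]+)/-/commit/(?<hash>[0-9a-fA-F]{4,64})/?$")),
        ("bitbucket", new Regex(
            @"^https?://bitbucket\.org/(?<owner>[^/]+)/(?<repo>[^/]+)/commits/(?<hash>[0-9a-fA-F]{4,64})/?$")),
    };

    // Returns the parsed parts, or null when the URL matches no known platform.
    public static ParsedCommitUrl? TryParse(string url)
    {
        foreach (var (platform, pattern) in Patterns)
        {
            var match = pattern.Match(url);
            if (match.Success)
            {
                return new ParsedCommitUrl(
                    platform,
                    match.Groups["owner"].Value,
                    match.Groups["repo"].Value,
                    match.Groups["hash"].Value);
            }
        }

        return null;
    }
}
```

Returning `null` for unrecognized URLs lets a caller skip unsupported hosts instead of failing the whole extraction.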

### 3. CWE-to-Sink Mapper

Maps CWE classifications to relevant sink functions:

```csharp
// Get sinks for buffer overflow CWEs
var sinks = CweToSinkMapper.GetSinksForCwes(["CWE-120", "CWE-122"]);
// Returns: ["memcpy", "strcpy", "sprintf", "gets", ...]

// Get all mapped CWEs
var cwes = CweToSinkMapper.GetMappedCwes();
```

**Supported CWE Categories:**

| Category | CWE IDs | Example Sinks |
|----------|---------|---------------|
| Buffer Overflow | CWE-120, CWE-121, CWE-122, CWE-787 | `memcpy`, `strcpy`, `sprintf` |
| Format String | CWE-134 | `printf`, `fprintf`, `sprintf` |
| Integer Overflow/Underflow | CWE-190, CWE-191 | `malloc`, `calloc`, `realloc` |
| Use After Free | CWE-416 | `free`, `delete`, `delete[]` |
| Command Injection | CWE-78 | `system`, `popen`, `execve` |
| SQL Injection | CWE-89 | `PQexec`, `mysql_query`, `sqlite3_exec` |
| Path Traversal | CWE-22 | `fopen`, `open`, `access` |
| NULL Pointer | CWE-476 | (dereference detection) |
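
The table above is essentially a lookup from CWE ID to a list of sinks. A minimal sketch of such a lookup follows; `CweSinkTableSketch` is a hypothetical stand-in for `CweToSinkMapper`, it carries only the example sinks from the table, and the sorted, de-duplicated return order is an assumption made so the output is deterministic.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for CweToSinkMapper, populated from the table above.
public static class CweSinkTableSketch
{
    private static readonly string[] BufferOverflow = { "memcpy", "strcpy", "sprintf" };
    private static readonly string[] IntegerOverflow = { "malloc", "calloc", "realloc" };

    private static readonly Dictionary<string, string[]> SinksByCwe =
        new(StringComparer.OrdinalIgnoreCase)
        {
            ["CWE-120"] = BufferOverflow,
            ["CWE-121"] = BufferOverflow,
            ["CWE-122"] = BufferOverflow,
            ["CWE-787"] = BufferOverflow,
            ["CWE-134"] = new[] { "printf", "fprintf", "sprintf" },
            ["CWE-190"] = IntegerOverflow,
            ["CWE-191"] = IntegerOverflow,
            ["CWE-416"] = new[] { "free", "delete", "delete[]" },
            ["CWE-78"] = new[] { "system", "popen", "execve" },
            ["CWE-89"] = new[] { "PQexec", "mysql_query", "sqlite3_exec" },
            ["CWE-22"] = new[] { "fopen", "open", "access" },
            ["CWE-476"] = Array.Empty<string>(), // dereference detection, no sink call
        };

    // Returns the union of sinks for the given CWEs, de-duplicated and sorted.
    // Unknown CWE IDs are ignored rather than treated as errors.
    public static IReadOnlyList<string> GetSinksForCwes(IEnumerable<string> cweIds) =>
        cweIds
            .Where(id => SinksByCwe.ContainsKey(id))
            .SelectMany(id => SinksByCwe[id])
            .Distinct(StringComparer.Ordinal)
            .OrderBy(sink => sink, StringComparer.Ordinal)
            .ToList();

    public static IReadOnlyCollection<string> GetMappedCwes() => SinksByCwe.Keys;
}
```

Because several CWEs share the same sinks, the union is de-duplicated so that a draft built from `CWE-120` and `CWE-122` lists `memcpy` only once.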

### 4. AI Enrichment Service

Optional AI-assisted enrichment using advisory text and commit analysis:

```csharp
var enrichmentService = serviceProvider.GetRequiredService<IGoldenSetEnrichmentService>();

if (enrichmentService.IsAvailable)
{
    var result = await enrichmentService.EnrichAsync(
        draftGoldenSet,
        new GoldenSetEnrichmentContext
        {
            CommitAnalysis = commitResult,
            CweIds = ["CWE-787"],
            AdvisoryText = "Buffer overflow in parse_header..."
        });

    // Result.EnrichedDraft contains improved definition
    // Result.ActionsApplied describes what was added/refined
}
```

**Enrichment Actions:**
- `function_added` - New vulnerable function identified
- `sink_added` - New sink function from CWE mapping
- `constant_extracted` - Magic value from commits
- `edge_suggested` - Control flow pattern suggested
- `witness_hint_added` - Example trigger input

### 5. Review Workflow

State machine for golden set curation:

```
   Draft ──> InReview ──> Approved ──> Deprecated ──> Archived
     │           │            │
     └───────────┴────────────┴── (can return to Draft)
```

```csharp
var reviewService = serviceProvider.GetRequiredService<IGoldenSetReviewService>();

// Submit for review
await reviewService.SubmitForReviewAsync("CVE-2024-1234", "author@example.com");

// Approve
await reviewService.ApproveAsync("CVE-2024-1234", "reviewer@example.com", "LGTM");

// Or request changes
await reviewService.RequestChangesAsync(
    "CVE-2024-1234",
    "reviewer@example.com",
    "Needs specific function name",
    [new ChangeRequest { Field = "targets[0].functionName", Suggestion = "parse_header" }]);
```
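
The state diagram above can be encoded as a table of allowed transitions. The sketch below is one reading of that diagram, not the `IGoldenSetReviewService` implementation: `ReviewState` and `ReviewTransitionsSketch` are hypothetical names, and it assumes that only `InReview` and `Approved` can return to `Draft`, while `Deprecated` and `Archived` cannot.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical state names taken from the diagram above.
public enum ReviewState { Draft, InReview, Approved, Deprecated, Archived }

public static class ReviewTransitionsSketch
{
    // Each state maps to the states it may move to next.
    private static readonly Dictionary<ReviewState, ReviewState[]> Allowed = new()
    {
        [ReviewState.Draft] = new[] { ReviewState.InReview },
        [ReviewState.InReview] = new[] { ReviewState.Approved, ReviewState.Draft },
        [ReviewState.Approved] = new[] { ReviewState.Deprecated, ReviewState.Draft },
        [ReviewState.Deprecated] = new[] { ReviewState.Archived },
        [ReviewState.Archived] = Array.Empty<ReviewState>(), // terminal state
    };

    public static bool CanTransition(ReviewState from, ReviewState to) =>
        Array.IndexOf(Allowed[from], to) >= 0;

    // Returns the new state, or throws when the diagram does not allow the move.
    public static ReviewState Transition(ReviewState from, ReviewState to) =>
        CanTransition(from, to)
            ? to
            : throw new InvalidOperationException($"Transition {from} -> {to} is not allowed.");
}
```

Under this reading, a change request moves a golden set from `InReview` back to `Draft`, and an attempt to approve a `Draft` directly is rejected.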

## Golden Set Schema

```yaml
# CVE-2024-1234.golden.yaml
schema_version: "1.0"
id: CVE-2024-1234
component: openssl

targets:
  - function: parse_header
    sinks:
      - memcpy
      - strcpy
    constants:
      - "0x1000"
      - "sizeof(buffer)"
    edges:
      - bb1->bb2  # bounds check bypass

witness:
  stdin: "AAAA..."
  argv:
    - "--vulnerable-option"
  env:
    BUFFER_SIZE: "99999"

metadata:
  author_id: researcher@example.com
  source_ref: https://nvd.nist.gov/vuln/detail/CVE-2024-1234
  created_at: 2024-01-15T10:30:00Z
  tags:
    - memory-corruption
    - heap-overflow
```

## Configuration

```yaml
# appsettings.yaml
BinaryIndex:
  GoldenSet:
    SchemaVersion: "1.0"
    Validation:
      ValidateCveExists: true
      ValidateSinks: true
      StrictEdgeFormat: true
      OfflineMode: false
    Storage:
      PostgresSchema: golden_sets
      ConnectionStringName: BinaryIndex
    Caching:
      SinkRegistryCacheMinutes: 60
      DefinitionCacheMinutes: 15
    Authoring:
      EnableAiEnrichment: true
      EnableCommitAnalysis: true
      MaxCommitsToAnalyze: 5
      AutoAcceptConfidenceThreshold: 0.8
```

## Service Registration

```csharp
// Program.cs or Startup.cs
services.AddGoldenSetServices(configuration);
services.AddGoldenSetAuthoring();
services.AddGoldenSetPostgresStorage();

// Optional: Add HTTP client for commit analysis
services.AddHttpClient("upstream-commits", client =>
{
    client.Timeout = TimeSpan.FromSeconds(30);
    client.DefaultRequestHeaders.Add("User-Agent", "StellaOps-GoldenSet/1.0");
});
```

## CLI Usage

```bash
# Initialize a golden set from CVE
stella scanner golden init CVE-2024-1234 --component openssl

# With options
stella scanner golden init CVE-2024-1234 \
    --component openssl \
    --output ./golden-sets/CVE-2024-1234.yaml \
    --no-ai \
    --store

# Interactive mode for refinement
stella scanner golden init CVE-2024-1234 --interactive

# Export as JSON
stella scanner golden init CVE-2024-1234 --json
```

## Validation Rules

1. **ID Format** - Must match `CVE-YYYY-NNNN` (four or more sequence digits, as in `CVE-2024-1234`) or `GHSA-xxxx-xxxx-xxxx`
2. **Component Required** - Non-empty component name
3. **Targets Required** - At least one vulnerable target
4. **Sinks Validation** - Sinks must be in the sink registry
5. **Edge Format** - Must match the `bbN->bbM` pattern (when `StrictEdgeFormat` is enabled)
6. **Constants Format** - Hex constants must be valid (`0x...`)
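
The format rules above (rules 1, 5, and 6) lend themselves to simple pattern checks, which can be handy for validating a draft locally before submitting it. The sketch below is not the project's validator: `GoldenSetFormatSketch` is a hypothetical name, the CVE pattern accepts four or more sequence digits (as in the `CVE-2024-1234` ID used throughout this guide), the GHSA pattern is a loose approximation of `GHSA-xxxx-xxxx-xxxx`, and non-hex constants such as `sizeof(buffer)` are assumed to pass through unchecked.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical local checks for the format rules listed above.
public static class GoldenSetFormatSketch
{
    private static readonly Regex CveId = new(@"^CVE-[0-9]{4}-[0-9]{4,}$");
    private static readonly Regex GhsaId = new(@"^GHSA(-[0-9a-z]{4}){3}$");
    private static readonly Regex Edge = new(@"^bb[0-9]+->bb[0-9]+$");
    private static readonly Regex HexConstant = new(@"^0x[0-9A-Fa-f]+$");

    // Rule 1: the ID must be a CVE or GHSA identifier.
    public static bool IsValidId(string id) =>
        CveId.IsMatch(id) || GhsaId.IsMatch(id);

    // Rule 5: edges name a source and target basic block, e.g. "bb1->bb2".
    public static bool IsValidEdge(string edge) => Edge.IsMatch(edge);

    // Rule 6: anything that starts with "0x" must be well-formed hex;
    // other constants (for example "sizeof(buffer)") are not checked here.
    public static bool IsValidConstant(string constant) =>
        !constant.StartsWith("0x", StringComparison.OrdinalIgnoreCase)
        || HexConstant.IsMatch(constant);
}
```

Rules 2 to 4 need the parsed definition and the sink registry, so they are left out of this sketch.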

## Best Practices

1. **Start with Commit Analysis** - Fix commits are the most reliable source
2. **Use CWE Mapping** - Automatic sink suggestions based on vulnerability type
3. **Validate Locally** - Always validate before submitting for review
4. **Include Witness Data** - Example inputs help verify detection accuracy
5. **Tag Appropriately** - Use consistent tags for categorization
6. **Document Source** - Always include `source_ref` for traceability

## Metrics

Track authoring quality with:
- **Extraction Confidence** - Overall, per-source, per-field
- **Enrichment Actions** - What was added automatically
- **Review Iterations** - How many rounds before approval
- **Detection Rate** - How well the golden set detects known-vulnerable binaries

## See Also

- [Golden Set Schema Reference](../schemas/golden-set-schema.md)
- [Sink Registry](../modules/scanner/sink-registry.md)
- [Binary Analysis Architecture](../modules/scanner/architecture.md)
- [Vulnerability Detection](../modules/scanner/vulnerability-detection.md)