# Golden Set Authoring Guide

This document describes the workflow for authoring and curating Golden Sets - ground-truth definitions of how a vulnerability manifests at the code level, used for binary vulnerability detection.

## Overview

Golden Sets are YAML-based definitions that describe:
- **Vulnerable functions** - Entry points where vulnerabilities manifest
- **Sink functions** - Dangerous API calls that enable exploitation
- **Edge patterns** - Control flow patterns indicating vulnerability presence
- **Constants** - Magic numbers, buffer sizes, or version markers
- **Witness inputs** - Example triggers for the vulnerability

## Architecture

```
┌──────────────────────────────────────────────────────────────────────────────┐
│                        Golden Set Authoring Pipeline                         │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌────────────────┐    ┌─────────────────┐    ┌──────────────────────┐       │
│  │ CVE/Advisory   │───>│   Extractors    │───>│   Draft Golden Set   │       │
│  │   Sources      │    │ (NVD/OSV/GHSA)  │    │                      │       │
│  └────────────────┘    └─────────────────┘    └──────────────────────┘       │
│                                 │                        │                   │
│                                 v                        v                   │
│                        ┌─────────────────┐    ┌──────────────────────┐       │
│                        │ Upstream Commit │    │   AI Enrichment      │       │
│                        │    Analyzer     │───>│      Service         │       │
│                        └─────────────────┘    └──────────────────────┘       │
│                                                          │                   │
│                                                          v                   │
│                        ┌─────────────────┐    ┌──────────────────────┐       │
│                        │    Validator    │<───│   Review Workflow    │       │
│                        └─────────────────┘    └──────────────────────┘       │
│                                 │                        │                   │
│                                 v                        v                   │
│                        ┌─────────────────────────────────────────────┐       │
│                        │             PostgreSQL Storage              │       │
│                        │       (content-addressed, versioned)        │       │
│                        └─────────────────────────────────────────────┘       │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘
```
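
The storage box in the diagram is labelled content-addressed. This guide does not say how the address is derived, so the following is only a sketch of one common approach: hash the definition text and use the digest as the key. The function name `ComputeContentAddress`, the `sha256:` prefix, and the line-ending normalization are all assumptions made for illustration, not the storage layer's actual scheme.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical content address (assumption, not the real storage scheme):
// SHA-256 over the UTF-8 bytes of the definition text, with CRLF normalized
// to LF, rendered as "sha256:<lowercase hex>".
string ComputeContentAddress(string definitionText)
{
    var normalized = definitionText.Replace("\r\n", "\n");
    var digest = SHA256.HashData(Encoding.UTF8.GetBytes(normalized));
    return "sha256:" + Convert.ToHexString(digest).ToLowerInvariant();
}
```

Under this scheme two definitions with identical content map to the same address regardless of platform line endings, which is the property that makes deduplication and versioned lookups straightforward.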

## Components

### 1. Extractors

Extractors pull vulnerability data from advisory sources:

```csharp
// Extract from NVD/OSV/GHSA
var extractor = serviceProvider.GetRequiredService<IGoldenSetExtractor>();

var result = await extractor.ExtractAsync(
    "CVE-2024-1234",
    "openssl",
    new ExtractionOptions
    {
        UseAiEnrichment = true,
        IncludeUpstreamCommits = true,
        IncludeRelatedCves = true
    });
```

**Supported Sources:**
- **NVD** - National Vulnerability Database
- **OSV** - Open Source Vulnerabilities
- **GHSA** - GitHub Security Advisories

### 2. Upstream Commit Analyzer

Analyzes fix commits to extract:
- Modified functions (from hunk headers)
- Added constants (hex values, buffer sizes)
- Added conditions (bounds checks, NULL checks)

```csharp
var analyzer = serviceProvider.GetRequiredService<IUpstreamCommitAnalyzer>();

// Parse commit URL
var parsed = analyzer.ParseCommitUrl("https://github.com/curl/curl/commit/abc123");

// Analyze commits
var result = await analyzer.AnalyzeAsync([
    "https://github.com/curl/curl/commit/abc123",
    "https://github.com/curl/curl/commit/def456"
]);

// Result contains:
// - ModifiedFunctions: ["parse_header", "validate_length"]
// - AddedConstants: ["0x1000", "sizeof(buffer)"]
// - AddedConditions: ["bounds_check", "null_check"]
```
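
To illustrate the "modified functions from hunk headers" step: in a unified diff, Git usually appends the enclosing function's signature line after the second `@@` of each hunk header. The sketch below pulls function names out of that trailing context. `ExtractModifiedFunctions` is a hypothetical helper written for this guide, not part of `IUpstreamCommitAnalyzer`, and it assumes C-style signatures where the function name directly precedes the first `(`.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (assumption, not the analyzer's real implementation):
// collects function names from the context text after the second "@@".
List<string> ExtractModifiedFunctions(string unifiedDiff)
{
    // Example header: "@@ -12,7 +12,9 @@ static int parse_header(const char *buf)"
    var hunkHeader = new Regex(@"^@@ -[0-9]+(?:,[0-9]+)? \+[0-9]+(?:,[0-9]+)? @@ ?(.*)$");
    // Assumes C-style code: the identifier directly before the first "(".
    var functionName = new Regex(@"([A-Za-z_]\w*)\s*\(");

    var found = new List<string>();
    foreach (var rawLine in unifiedDiff.Split('\n'))
    {
        var header = hunkHeader.Match(rawLine.TrimEnd('\r'));
        if (!header.Success) continue;

        var name = functionName.Match(header.Groups[1].Value);
        if (name.Success && !found.Contains(name.Groups[1].Value))
            found.Add(name.Groups[1].Value);
    }
    return found;
}
```

Hunk headers without trailing context (for example, changes at file scope) contribute nothing, so a real analyzer would likely need a fallback such as scanning the hunk body.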

**Supported Platforms:**
- GitHub (`github.com/owner/repo/commit/hash`)
- GitLab (`gitlab.com/owner/repo/-/commit/hash`)
- Bitbucket (`bitbucket.org/owner/repo/commits/hash`)
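
The three URL shapes listed above can be recognized with one pattern per platform. The following is a minimal sketch under stated assumptions: `ParseCommitUrlSketch` and its tuple return type are hypothetical (the real return type of `ParseCommitUrl` is not shown in this guide), only `https` URLs are accepted, hashes are assumed to be 6 to 40 hex characters, and nested GitLab groups are not handled.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (assumption): returns null for unsupported or malformed URLs.
(string Platform, string Owner, string Repo, string Hash)? ParseCommitUrlSketch(string url)
{
    var patterns = new (string Platform, string Pattern)[]
    {
        ("github",    @"^https://github\.com/([^/]+)/([^/]+)/commit/([0-9a-fA-F]{6,40})$"),
        ("gitlab",    @"^https://gitlab\.com/([^/]+)/([^/]+)/-/commit/([0-9a-fA-F]{6,40})$"),
        ("bitbucket", @"^https://bitbucket\.org/([^/]+)/([^/]+)/commits/([0-9a-fA-F]{6,40})$"),
    };

    foreach (var (platform, pattern) in patterns)
    {
        var match = Regex.Match(url, pattern);
        if (match.Success)
            return (platform, match.Groups[1].Value, match.Groups[2].Value, match.Groups[3].Value);
    }
    return null; // unsupported host or malformed URL
}
```

Keeping the patterns in a table makes it cheap to add another host later without touching the matching loop.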

### 3. CWE-to-Sink Mapper

Maps CWE classifications to relevant sink functions:

```csharp
// Get sinks for buffer overflow CWEs
var sinks = CweToSinkMapper.GetSinksForCwes(["CWE-120", "CWE-122"]);
// Returns: ["memcpy", "strcpy", "sprintf", "gets", ...]

// Get all mapped CWEs
var cwes = CweToSinkMapper.GetMappedCwes();
```

**Supported CWE Categories:**

| Category | CWE IDs | Example Sinks |
|----------|---------|---------------|
| Buffer Overflow | CWE-120, CWE-121, CWE-122, CWE-787 | `memcpy`, `strcpy`, `sprintf` |
| Format String | CWE-134 | `printf`, `fprintf`, `sprintf` |
| Integer Overflow | CWE-190, CWE-191 | `malloc`, `calloc`, `realloc` |
| Use After Free | CWE-416 | `free`, `delete`, `delete[]` |
| Command Injection | CWE-78 | `system`, `popen`, `execve` |
| SQL Injection | CWE-89 | `PQexec`, `mysql_query`, `sqlite3_exec` |
| Path Traversal | CWE-22 | `fopen`, `open`, `access` |
| NULL Pointer | CWE-476 | (dereference detection) |
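
Conceptually, the mapping in the table is a lookup from CWE ID to a list of sinks, merged across all requested CWEs. The sketch below shows that idea with a small subset of the table. `GetSinksSketch` and `sinksByCwe` are hypothetical names used for illustration, and the real `CweToSinkMapper` data and behavior (for example, how unknown CWE IDs are treated) may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative subset of the table above (assumption: the real data may differ).
var sinksByCwe = new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase)
{
    ["CWE-120"] = new[] { "memcpy", "strcpy", "sprintf" },
    ["CWE-134"] = new[] { "printf", "fprintf", "sprintf" },
    ["CWE-78"]  = new[] { "system", "popen", "execve" },
};

// Hypothetical stand-in for GetSinksForCwes: union of the sinks for every
// requested CWE, de-duplicated; unknown CWE IDs contribute nothing.
List<string> GetSinksSketch(IEnumerable<string> cweIds) =>
    cweIds
        .SelectMany(id => sinksByCwe.TryGetValue(id, out var sinks) ? sinks : Array.Empty<string>())
        .Distinct()
        .ToList();
```

Because `sprintf` appears under both buffer overflow and format string CWEs, de-duplication matters as soon as a vulnerability carries more than one CWE.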

### 4. AI Enrichment Service

Optional AI-assisted enrichment using advisory text and commit analysis:

```csharp
var enrichmentService = serviceProvider.GetRequiredService<IGoldenSetEnrichmentService>();

if (enrichmentService.IsAvailable)
{
    var result = await enrichmentService.EnrichAsync(
        draftGoldenSet,
        new GoldenSetEnrichmentContext
        {
            CommitAnalysis = commitResult,
            CweIds = ["CWE-787"],
            AdvisoryText = "Buffer overflow in parse_header..."
        });

    // Result.EnrichedDraft contains improved definition
    // Result.ActionsApplied describes what was added/refined
}
```

**Enrichment Actions:**
- `function_added` - New vulnerable function identified
- `sink_added` - New sink function from CWE mapping
- `constant_extracted` - Magic value from commits
- `edge_suggested` - Control flow pattern suggested
- `witness_hint_added` - Example trigger input

### 5. Review Workflow

State machine for golden set curation:

```
Draft ──> InReview ──> Approved ──> Deprecated ──> Archived
  │           │            │
  └───────────┴────────────┴── (can return to Draft)
```

```csharp
var reviewService = serviceProvider.GetRequiredService<IGoldenSetReviewService>();

// Submit for review
await reviewService.SubmitForReviewAsync("CVE-2024-1234", "author@example.com");

// Approve
await reviewService.ApproveAsync("CVE-2024-1234", "reviewer@example.com", "LGTM");

// Or request changes
await reviewService.RequestChangesAsync(
    "CVE-2024-1234",
    "reviewer@example.com",
    "Needs specific function name",
    [new ChangeRequest { Field = "targets[0].functionName", Suggestion = "parse_header" }]);
```
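
The state diagram can be read as a transition table. The sketch below encodes one reading of it: forward moves follow the arrows, and `InReview` and `Approved` may return to `Draft`. The names `allowedTransitions` and `CanTransition` are hypothetical, states are plain strings for brevity, and the review service's actual rules may be stricter or looser than this.

```csharp
using System;
using System.Collections.Generic;

// One reading of the state diagram (assumption), keyed by current state.
var allowedTransitions = new Dictionary<string, string[]>
{
    ["Draft"]      = new[] { "InReview" },
    ["InReview"]   = new[] { "Approved", "Draft" },
    ["Approved"]   = new[] { "Deprecated", "Draft" },
    ["Deprecated"] = new[] { "Archived" },
    ["Archived"]   = Array.Empty<string>(),
};

// Returns true only when the diagram allows moving directly from one state to the other.
bool CanTransition(string from, string to) =>
    allowedTransitions.TryGetValue(from, out var targets) && Array.IndexOf(targets, to) >= 0;
```

For example, `CanTransition("InReview", "Draft")` corresponds to a reviewer requesting changes, while `CanTransition("Draft", "Approved")` is rejected because approval has to pass through review.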

## Golden Set Schema

```yaml
# CVE-2024-1234.golden.yaml
schema_version: "1.0"
id: CVE-2024-1234
component: openssl

targets:
  - function: parse_header
    sinks:
      - memcpy
      - strcpy
    constants:
      - "0x1000"
      - "sizeof(buffer)"
    edges:
      - bb1->bb2  # bounds check bypass

witness:
  stdin: "AAAA..."
  argv:
    - "--vulnerable-option"
  env:
    BUFFER_SIZE: "99999"

metadata:
  author_id: researcher@example.com
  source_ref: https://nvd.nist.gov/vuln/detail/CVE-2024-1234
  created_at: 2024-01-15T10:30:00Z
  tags:
    - memory-corruption
    - heap-overflow
```

## Configuration

```yaml
# appsettings.yaml
BinaryIndex:
  GoldenSet:
    SchemaVersion: "1.0"
    Validation:
      ValidateCveExists: true
      ValidateSinks: true
      StrictEdgeFormat: true
      OfflineMode: false
    Storage:
      PostgresSchema: golden_sets
      ConnectionStringName: BinaryIndex
    Caching:
      SinkRegistryCacheMinutes: 60
      DefinitionCacheMinutes: 15
    Authoring:
      EnableAiEnrichment: true
      EnableCommitAnalysis: true
      MaxCommitsToAnalyze: 5
      AutoAcceptConfidenceThreshold: 0.8
```

## Service Registration

```csharp
// Program.cs or Startup.cs
services.AddGoldenSetServices(configuration);
services.AddGoldenSetAuthoring();
services.AddGoldenSetPostgresStorage();

// Optional: Add HTTP client for commit analysis
services.AddHttpClient("upstream-commits", client =>
{
    client.Timeout = TimeSpan.FromSeconds(30);
    client.DefaultRequestHeaders.Add("User-Agent", "StellaOps-GoldenSet/1.0");
});
```

## CLI Usage

```bash
# Initialize a golden set from CVE
stella scanner golden init CVE-2024-1234 --component openssl

# With options
stella scanner golden init CVE-2024-1234 \
  --component openssl \
  --output ./golden-sets/CVE-2024-1234.yaml \
  --no-ai \
  --store

# Interactive mode for refinement
stella scanner golden init CVE-2024-1234 --interactive

# Export as JSON
stella scanner golden init CVE-2024-1234 --json
```

## Validation Rules

1. **CVE Format** - Must match `CVE-YYYY-NNNN` (four or more sequence digits) or `GHSA-xxxx-xxxx-xxxx`
2. **Component Required** - Non-empty component name
3. **Targets Required** - At least one vulnerable target
4. **Sinks Validation** - Sinks must be in the sink registry
5. **Edge Format** - Must match the `bbN->bbM` pattern (when `StrictEdgeFormat` is enabled)
6. **Constants Format** - Hex constants must be valid (`0x...`)
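
The format rules (1, 5 and 6) lend themselves to simple pattern checks. The following is a rough sketch, not the validator's actual code: the function names are hypothetical and the regular expressions are assumptions inferred from the rule descriptions and from IDs such as `CVE-2024-1234` used in this guide. In particular, the GHSA pattern is deliberately loose, and non-hex constants such as `sizeof(buffer)` are passed through unchecked.

```csharp
using System;
using System.Text.RegularExpressions;

// Rule 1 (assumed patterns): CVE IDs with a four-digit year and four or more
// sequence digits, or GHSA IDs made of three four-character groups.
bool IsValidId(string id) =>
    Regex.IsMatch(id, @"^CVE-[0-9]{4}-[0-9]{4,}$") ||
    Regex.IsMatch(id, @"^GHSA(-[0-9a-z]{4}){3}$");

// Rule 5: edges written as basic-block pairs, e.g. "bb1->bb2".
bool IsValidEdge(string edge) => Regex.IsMatch(edge, @"^bb[0-9]+->bb[0-9]+$");

// Rule 6: only values that start with "0x" are treated as hex and checked;
// anything else is accepted as an opaque constant.
bool IsValidConstant(string value) =>
    !value.StartsWith("0x", StringComparison.OrdinalIgnoreCase) ||
    Regex.IsMatch(value, @"^0[xX][0-9a-fA-F]+$");
```

Running checks like these locally before submitting for review catches formatting problems without needing access to the sink registry or the advisory sources.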

## Best Practices

1. **Start with Commit Analysis** - Fix commits are the most reliable source
2. **Use CWE Mapping** - Automatic sink suggestions based on vulnerability type
3. **Validate Locally** - Always validate before submitting for review
4. **Include Witness Data** - Example inputs help verify detection accuracy
5. **Tag Appropriately** - Use consistent tags for categorization
6. **Document Source** - Always include `source_ref` for traceability

## Metrics

Track authoring quality with:
- **Extraction Confidence** - Overall, per-source, per-field
- **Enrichment Actions** - What was added automatically
- **Review Iterations** - How many rounds before approval
- **Detection Rate** - How well the golden set detects known-vulnerable binaries

## See Also

- [Golden Set Schema Reference](../schemas/golden-set-schema.md)
- [Sink Registry](../modules/scanner/sink-registry.md)
- [Binary Analysis Architecture](../modules/scanner/architecture.md)
- [Vulnerability Detection](../modules/scanner/vulnerability-detection.md)