# Corpus Contribution Guide

**Sprint:** SPRINT_3500_0003_0001
**Task:** CORPUS-014 - Document corpus contribution guide

## Overview

The Ground-Truth Corpus is a collection of validated test samples used to measure scanner accuracy. Each sample has a known reachability status and expected findings, enabling deterministic quality metrics.

## Corpus Structure

```
datasets/reachability/
├── corpus.json                 # Index of all samples
├── schemas/
│   └── corpus-sample.v1.json   # JSON schema for samples
├── samples/
│   ├── gt-0001/                # Sample directory
│   │   ├── sample.json         # Sample metadata
│   │   ├── expected.json       # Expected findings
│   │   ├── sbom.json           # Input SBOM
│   │   └── source/             # Optional source files
│   └── ...
└── baselines/
    └── v1.0.0.json             # Baseline metrics
```

## Sample Format

### sample.json

```json
{
  "id": "gt-0001",
  "name": "Python SQL Injection - Reachable",
  "description": "Flask app with reachable SQL injection via user input",
  "language": "python",
  "ecosystem": "pypi",
  "scenario": "webapi",
  "entrypoints": ["app.py:main"],
  "reachability_tier": "tainted_sink",
  "created_at": "2025-01-15T00:00:00Z",
  "author": "security-team",
  "tags": ["sql-injection", "flask", "reachable"]
}
```

### expected.json

```json
{
  "findings": [
    {
      "vuln_key": "CVE-2024-1234:pkg:pypi/sqlalchemy@1.4.0",
      "tier": "tainted_sink",
      "rule_key": "py.sql.injection.param_concat",
      "sink_class": "sql",
      "location_hint": "app.py:42"
    }
  ]
}
```

## Contributing a Sample

### Step 1: Choose a Scenario

Select a scenario that is not yet well covered in the corpus:

| Scenario | Description | Example |
|----------|-------------|---------|
| `webapi` | Web application endpoint | Flask, FastAPI, Express |
| `cli` | Command-line tool | argparse, click, commander |
| `job` | Background/scheduled job | Celery, cron script |
| `lib` | Library code | Reusable package |

### Step 2: Create Sample Directory

```bash
cd datasets/reachability/samples
mkdir gt-NNNN
cd gt-NNNN
```

Use the next available sample ID (check `corpus.json` for the
highest).

### Step 3: Create Minimal Reproducible Case

**Requirements:**

- Smallest possible code that demonstrates the vulnerability
- Real or realistic vulnerability (use a CVE when possible)
- Clear entrypoint definition
- Deterministic behavior (no network, no randomness)

**Example Python Sample:**

```python
# app.py - gt-0001
from flask import Flask, request
import sqlite3

app = Flask(__name__)

@app.route("/user")
def get_user():
    user_id = request.args.get("id")  # Taint source
    conn = sqlite3.connect(":memory:")
    # SQL injection: user_id flows into the query without sanitization
    result = conn.execute(f"SELECT * FROM users WHERE id = {user_id}")  # Taint sink
    return str(result.fetchall())

if __name__ == "__main__":
    app.run()
```

### Step 4: Define Expected Findings

Create `expected.json` with all expected findings:

```json
{
  "findings": [
    {
      "vuln_key": "CWE-89:pkg:pypi/flask@2.0.0",
      "tier": "tainted_sink",
      "rule_key": "py.sql.injection",
      "sink_class": "sql",
      "location_hint": "app.py:13",
      "notes": "User input from request.args flows to sqlite3.execute"
    }
  ]
}
```

### Step 5: Create SBOM

Generate or create an SBOM for the sample:

```json
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.6",
  "version": 1,
  "components": [
    {
      "type": "library",
      "name": "flask",
      "version": "2.0.0",
      "purl": "pkg:pypi/flask@2.0.0"
    },
    {
      "type": "library",
      "name": "sqlite3",
      "version": "3.39.0",
      "purl": "pkg:pypi/sqlite3@3.39.0"
    }
  ]
}
```

### Step 6: Update Corpus Index

Add an entry to `corpus.json`:

```json
{
  "id": "gt-0001",
  "path": "samples/gt-0001",
  "language": "python",
  "tier": "tainted_sink",
  "scenario": "webapi",
  "expected_count": 1
}
```

### Step 7: Validate Locally

```bash
# Run corpus validation
dotnet test tests/reachability/StellaOps.Reachability.FixtureTests \
  --filter "FullyQualifiedName~CorpusFixtureTests"

# Run benchmark
stellaops bench corpus run --sample gt-0001 --verbose
```

## Tier Guidelines

### Imported Tier Samples

For `imported` tier samples:

- Vulnerability in a dependency
- No
execution path to the vulnerable code
- Package is in the lockfile but not called

**Example:** Unused dependency with a known CVE.

### Executed Tier Samples

For `executed` tier samples:

- Vulnerable code is called from an entrypoint
- No user-controlled data reaches the vulnerability
- Static or coverage analysis proves execution

**Example:** Hardcoded SQL query (no injection).

### Tainted→Sink Tier Samples

For `tainted_sink` tier samples:

- User-controlled input reaches the vulnerable code
- Clear source → sink data flow
- Include the sink class taxonomy

**Example:** User input flowing to a SQL query, command execution, etc.

## Sink Classes

When contributing `tainted_sink` samples, specify the sink class:

| Sink Class | Description | Examples |
|------------|-------------|----------|
| `sql` | SQL injection | sqlite3.execute, cursor.execute |
| `command` | Command injection | os.system, subprocess.run |
| `ssrf` | Server-side request forgery | requests.get, urllib.urlopen |
| `path` | Path traversal | open(), os.path.join |
| `deser` | Deserialization | pickle.loads, yaml.load |
| `eval` | Code evaluation | eval(), exec() |
| `xxe` | XML external entity | lxml.parse, ET.parse |
| `xss` | Cross-site scripting | innerHTML, document.write |

## Quality Criteria

Samples must meet these criteria:

- [ ] **Deterministic**: Same input → same output
- [ ] **Minimal**: Smallest code that demonstrates the issue
- [ ] **Documented**: Clear description and notes
- [ ] **Validated**: Passes local tests
- [ ] **Realistic**: Based on real vulnerability patterns
- [ ] **Self-contained**: No external network calls

## Negative Samples

Include "negative" samples where the scanner should NOT find vulnerabilities:

```json
{
  "id": "gt-0050",
  "name": "Python SQL - Properly Sanitized",
  "tier": "imported",
  "expected_count": 0,
  "notes": "Uses parameterized queries, no injection possible"
}
```

## Review Process

1. Create a PR with the new sample(s)
2. CI runs validation tests
3. Security team reviews expected findings
4.
QA team verifies determinism
5. Merge and update the baseline

## Updating Baselines

After adding samples, update the baseline metrics:

```bash
# Generate a new baseline
stellaops bench corpus run --all --output baselines/v1.1.0.json

# Compare to the previous baseline
stellaops bench corpus compare baselines/v1.0.0.json baselines/v1.1.0.json
```

## FAQ

### How many samples should I contribute?

Start with 2-3 high-quality samples covering different aspects of the same vulnerability class.

### Can I use synthetic vulnerabilities?

Yes, but prefer real CVE patterns when possible. Synthetic samples should document the vulnerability pattern clearly.

### What if my sample has multiple findings?

Include all expected findings in `expected.json`. Multi-finding samples are valuable for testing.

### How do I test tier classification?

Run with verbose output:

```bash
stellaops bench corpus run --sample gt-NNNN --verbose --show-evidence
```

## Related Documentation

- [Tiered Precision Curves](../benchmarks/tiered-precision-curves.md)
- [Reachability Analysis](../product-advisories/14-Dec-2025%20-%20Reachability%20Analysis%20Technical%20Reference.md)
- [Corpus Sample Schema](../../datasets/reachability/schemas/corpus-sample.v1.json)
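
## Appendix: Pre-PR Sanity Check (Illustrative)

Before running the full validation in Step 7, the per-sample layout described in this guide lends itself to a quick local sanity check. The sketch below is not part of the official tooling; the file names (`sample.json`, `expected.json`) and tier/sink-class fields come from this guide, while the `check_sample` helper and the specific checks it performs are illustrative assumptions about what CI enforces:

```python
import json
from pathlib import Path


def check_sample(sample_dir: Path) -> list[str]:
    """Illustrative pre-PR checks for a single gt-NNNN sample directory.

    Assumes the layout from this guide: sample.json and expected.json
    live directly inside the sample directory. Not the official validator.
    """
    problems: list[str] = []
    sample = json.loads((sample_dir / "sample.json").read_text())
    expected = json.loads((sample_dir / "expected.json").read_text())

    # The directory name and the declared sample ID should agree (gt-NNNN).
    if sample["id"] != sample_dir.name:
        problems.append(f"id {sample['id']!r} != directory {sample_dir.name!r}")

    # tainted_sink findings must name a sink class (see Sink Classes above).
    for finding in expected["findings"]:
        if finding.get("tier") == "tainted_sink" and not finding.get("sink_class"):
            problems.append(f"{finding.get('vuln_key')}: missing sink_class")

    return problems
```

Running a check like this over each new `gt-NNNN` directory catches simple metadata mistakes before the heavier `dotnet test` and `stellaops bench` runs in Step 7.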