# Corpus Contribution Guide

**Sprint:** SPRINT_3500_0003_0001
**Task:** CORPUS-014 - Document corpus contribution guide

## Overview

The Ground-Truth Corpus is a collection of validated test samples used to measure scanner accuracy. Each sample has a known reachability status and expected findings, enabling deterministic quality metrics.

## Corpus Structure

```
datasets/reachability/
├── corpus.json                 # Index of all samples
├── schemas/
│   └── corpus-sample.v1.json   # JSON schema for samples
├── samples/
│   ├── gt-0001/                # Sample directory
│   │   ├── sample.json         # Sample metadata
│   │   ├── expected.json       # Expected findings
│   │   ├── sbom.json           # Input SBOM
│   │   └── source/             # Optional source files
│   └── ...
└── baselines/
    └── v1.0.0.json             # Baseline metrics
```

## Sample Format

### sample.json

```json
{
  "id": "gt-0001",
  "name": "Python SQL Injection - Reachable",
  "description": "Flask app with reachable SQL injection via user input",
  "language": "python",
  "ecosystem": "pypi",
  "scenario": "webapi",
  "entrypoints": ["app.py:main"],
  "reachability_tier": "tainted_sink",
  "created_at": "2025-01-15T00:00:00Z",
  "author": "security-team",
  "tags": ["sql-injection", "flask", "reachable"]
}
```

### expected.json

```json
{
  "findings": [
    {
      "vuln_key": "CVE-2024-1234:pkg:pypi/sqlalchemy@1.4.0",
      "tier": "tainted_sink",
      "rule_key": "py.sql.injection.param_concat",
      "sink_class": "sql",
      "location_hint": "app.py:42"
    }
  ]
}
```

## Contributing a Sample

### Step 1: Choose a Scenario

Select a scenario that is not yet well covered in the corpus:

| Scenario | Description | Example |
|----------|-------------|---------|
| `webapi` | Web application endpoint | Flask, FastAPI, Express |
| `cli` | Command-line tool | argparse, click, commander |
| `job` | Background/scheduled job | Celery, cron script |
| `lib` | Library code | Reusable package |

### Step 2: Create Sample Directory

```bash
cd datasets/reachability/samples
mkdir gt-NNNN
cd gt-NNNN
```

Use the next available sample ID (check `corpus.json` for the
highest).

### Step 3: Create Minimal Reproducible Case

**Requirements:**

- Smallest possible code that demonstrates the vulnerability
- Real or realistic vulnerability (use a CVE when possible)
- Clear entrypoint definition
- Deterministic behavior (no network, no randomness)

**Example Python Sample:**

```python
# app.py - gt-0001
from flask import Flask, request
import sqlite3

app = Flask(__name__)

@app.route("/user")
def get_user():
    user_id = request.args.get("id")  # Taint source
    conn = sqlite3.connect(":memory:")
    # SQL injection: user_id flows into the query without sanitization
    result = conn.execute(f"SELECT * FROM users WHERE id = {user_id}")  # Taint sink
    return str(result.fetchall())

if __name__ == "__main__":
    app.run()
```

### Step 4: Define Expected Findings

Create `expected.json` with all expected findings:

```json
{
  "findings": [
    {
      "vuln_key": "CWE-89:pkg:pypi/flask@2.0.0",
      "tier": "tainted_sink",
      "rule_key": "py.sql.injection",
      "sink_class": "sql",
      "location_hint": "app.py:13",
      "notes": "User input from request.args flows to sqlite3.execute"
    }
  ]
}
```

### Step 5: Create SBOM

Generate or create an SBOM for the sample:

```json
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.6",
  "version": 1,
  "components": [
    {
      "type": "library",
      "name": "flask",
      "version": "2.0.0",
      "purl": "pkg:pypi/flask@2.0.0"
    },
    {
      "type": "library",
      "name": "sqlite3",
      "version": "3.39.0",
      "purl": "pkg:pypi/sqlite3@3.39.0"
    }
  ]
}
```

### Step 6: Update Corpus Index

Add an entry to `corpus.json`:

```json
{
  "id": "gt-0001",
  "path": "samples/gt-0001",
  "language": "python",
  "tier": "tainted_sink",
  "scenario": "webapi",
  "expected_count": 1
}
```

### Step 7: Validate Locally

```bash
# Run corpus validation
dotnet test tests/reachability/StellaOps.Reachability.FixtureTests \
  --filter "FullyQualifiedName~CorpusFixtureTests"

# Run benchmark
stellaops bench corpus run --sample gt-0001 --verbose
```

## Tier Guidelines

### Imported Tier Samples

For `imported` tier samples:

- Vulnerability in a dependency
- No
execution path to the vulnerable code
- Package is in the lockfile but not called

**Example:** Unused dependency with a known CVE.

### Executed Tier Samples

For `executed` tier samples:

- Vulnerable code is called from an entrypoint
- No user-controlled data reaches the vulnerability
- Static or coverage analysis proves execution

**Example:** Hardcoded SQL query (no injection).

### Tainted→Sink Tier Samples

For `tainted_sink` tier samples:

- User-controlled input reaches the vulnerable code
- Clear source → sink data flow
- Include the sink class taxonomy

**Example:** User input flowing to a SQL query, command execution, etc.

## Sink Classes

When contributing `tainted_sink` samples, specify the sink class:

| Sink Class | Description | Examples |
|------------|-------------|----------|
| `sql` | SQL injection | sqlite3.execute, cursor.execute |
| `command` | Command injection | os.system, subprocess.run |
| `ssrf` | Server-side request forgery | requests.get, urllib.urlopen |
| `path` | Path traversal | open(), os.path.join |
| `deser` | Deserialization | pickle.loads, yaml.load |
| `eval` | Code evaluation | eval(), exec() |
| `xxe` | XML external entity | lxml.parse, ET.parse |
| `xss` | Cross-site scripting | innerHTML, document.write |

## Quality Criteria

Samples must meet these criteria:

- [ ] **Deterministic**: Same input → same output
- [ ] **Minimal**: Smallest code that demonstrates the issue
- [ ] **Documented**: Clear description and notes
- [ ] **Validated**: Passes local tests
- [ ] **Realistic**: Based on real vulnerability patterns
- [ ] **Self-contained**: No external network calls

## Negative Samples

Include "negative" samples where the scanner should NOT find vulnerabilities:

```json
{
  "id": "gt-0050",
  "name": "Python SQL - Properly Sanitized",
  "tier": "imported",
  "expected_count": 0,
  "notes": "Uses parameterized queries, no injection possible"
}
```

## Review Process

1. Create a PR with the new sample(s)
2. CI runs validation tests
3. Security team reviews expected findings
4.
QA team verifies determinism
5. Merge and update the baseline

## Updating Baselines

After adding samples, update the baseline metrics:

```bash
# Generate a new baseline
stellaops bench corpus run --all --output baselines/v1.1.0.json

# Compare to the previous baseline
stellaops bench corpus compare baselines/v1.0.0.json baselines/v1.1.0.json
```

## FAQ

### How many samples should I contribute?

Start with 2-3 high-quality samples covering different aspects of the same vulnerability class.

### Can I use synthetic vulnerabilities?

Yes, but prefer real CVE patterns when possible. Synthetic samples should document the vulnerability pattern clearly.

### What if my sample has multiple findings?

Include all expected findings in `expected.json`. Multi-finding samples are valuable for testing.

### How do I test tier classification?

Run with verbose output:

```bash
stellaops bench corpus run --sample gt-NNNN --verbose --show-evidence
```

## Related Documentation

- [Tiered Precision Curves](../benchmarks/tiered-precision-curves.md)
- [Reachability Analysis](../product-advisories/14-Dec-2025%20-%20Reachability%20Analysis%20Technical%20Reference.md)
- [Corpus Sample Schema](../../datasets/reachability/schemas/corpus-sample.v1.json)
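
## Appendix: Pre-PR Sanity Check (Illustrative)

Before running the full validation in Step 7, the per-sample layout described in this guide lends itself to a quick local sanity check. The sketch below is not part of the official tooling; the file names (`sample.json`, `expected.json`) and tier/sink-class fields come from this guide, while the `check_sample` helper and the specific checks it performs are illustrative assumptions about what CI enforces:

```python
import json
from pathlib import Path


def check_sample(sample_dir: Path) -> list[str]:
    """Illustrative pre-PR checks for a single gt-NNNN sample directory.

    Assumes the layout from this guide: sample.json and expected.json
    live directly inside the sample directory. Not the official validator.
    """
    problems: list[str] = []
    sample = json.loads((sample_dir / "sample.json").read_text())
    expected = json.loads((sample_dir / "expected.json").read_text())

    # The directory name and the declared sample ID should agree (gt-NNNN).
    if sample["id"] != sample_dir.name:
        problems.append(f"id {sample['id']!r} != directory {sample_dir.name!r}")

    # tainted_sink findings must name a sink class (see Sink Classes above).
    for finding in expected["findings"]:
        if finding.get("tier") == "tainted_sink" and not finding.get("sink_class"):
            problems.append(f"{finding.get('vuln_key')}: missing sink_class")

    return problems
```

Running a check like this over each new `gt-NNNN` directory catches simple metadata mistakes before the heavier `dotnet test` and `stellaops bench` runs in Step 7.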