
Corpus Contribution Guide

Sprint: SPRINT_3500_0003_0001
Task: CORPUS-014 - Document corpus contribution guide

Overview

The Ground-Truth Corpus is a collection of validated test samples used to measure scanner accuracy. Each sample has known reachability status and expected findings, enabling deterministic quality metrics.

Corpus Structure

datasets/reachability/
├── corpus.json                # Index of all samples
├── schemas/
│   └── corpus-sample.v1.json  # JSON schema for samples
├── samples/
│   ├── gt-0001/               # Sample directory
│   │   ├── sample.json        # Sample metadata
│   │   ├── expected.json      # Expected findings
│   │   ├── sbom.json          # Input SBOM
│   │   └── source/            # Optional source files
│   └── ...
└── baselines/
    └── v1.0.0.json            # Baseline metrics

Sample Format

sample.json

{
  "id": "gt-0001",
  "name": "Python SQL Injection - Reachable",
  "description": "Flask app with reachable SQL injection via user input",
  "language": "python",
  "ecosystem": "pypi",
  "scenario": "webapi",
  "entrypoints": ["app.py:main"],
  "reachability_tier": "tainted_sink",
  "created_at": "2025-01-15T00:00:00Z",
  "author": "security-team",
  "tags": ["sql-injection", "flask", "reachable"]
}

expected.json

{
  "findings": [
    {
      "vuln_key": "CVE-2024-1234:pkg:pypi/sqlalchemy@1.4.0",
      "tier": "tainted_sink",
      "rule_key": "py.sql.injection.param_concat",
      "sink_class": "sql",
      "location_hint": "app.py:42"
    }
  ]
}
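The authoritative schema is schemas/corpus-sample.v1.json, but a quick local pre-check can catch missing keys before CI runs. The following sketch is hypothetical helper code (not part of the official tooling); the required key sets are inferred from the examples above:

```python
# Informal pre-check for sample.json / expected.json.
# Key sets below are taken from the examples in this guide; the JSON
# schema at schemas/corpus-sample.v1.json remains the source of truth.
SAMPLE_KEYS = {"id", "name", "description", "language", "ecosystem",
               "scenario", "entrypoints", "reachability_tier",
               "created_at", "author", "tags"}
FINDING_KEYS = {"vuln_key", "tier", "rule_key", "sink_class", "location_hint"}

def missing_keys(doc: dict, required: set) -> set:
    """Return the required keys absent from a parsed JSON document."""
    return required - doc.keys()

def check_sample(sample: dict, expected: dict) -> list:
    """Collect human-readable problems for a sample/expected pair."""
    problems = [f"sample.json missing: {k}"
                for k in sorted(missing_keys(sample, SAMPLE_KEYS))]
    for i, finding in enumerate(expected.get("findings", [])):
        problems += [f"findings[{i}] missing: {k}"
                     for k in sorted(missing_keys(finding, FINDING_KEYS))]
    return problems
```

An empty result means both files carry at least the documented fields; schema validation in CI still has the final word.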

Contributing a Sample

Step 1: Choose a Scenario

Select a scenario that is not well-covered in the corpus:

| Scenario | Description | Example |
|----------|-------------|---------|
| webapi | Web application endpoint | Flask, FastAPI, Express |
| cli | Command-line tool | argparse, click, commander |
| job | Background/scheduled job | Celery, cron script |
| lib | Library code | Reusable package |

Step 2: Create Sample Directory

cd datasets/reachability/samples
mkdir gt-NNNN
cd gt-NNNN

Use the next available sample ID (check corpus.json for the highest).
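Rather than eyeballing corpus.json, the next free ID can be computed. This helper is a sketch, and the assumption that the index is either a bare list of entries or an object with a "samples" array (each entry carrying an "id" like "gt-0001") is mine, not the official format:

```python
import re

def next_sample_id(corpus) -> str:
    """Compute the next gt-NNNN id from parsed corpus.json.

    Accepts either a bare list of entries or an object with a
    "samples" array; the exact top-level shape is an assumption.
    """
    entries = corpus.get("samples", []) if isinstance(corpus, dict) else corpus
    highest = 0
    for entry in entries:
        m = re.fullmatch(r"gt-(\d{4})", entry.get("id", ""))
        if m:
            highest = max(highest, int(m.group(1)))
    return f"gt-{highest + 1:04d}"
```

For example, an index whose highest entry is gt-0007 yields gt-0008.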

Step 3: Create Minimal Reproducible Case

Requirements:

  • Smallest possible code to demonstrate the vulnerability
  • Real or realistic vulnerability (use CVE when possible)
  • Clear entrypoint definition
  • Deterministic behavior (no network, no randomness)

Example Python Sample:

# app.py - gt-0001
from flask import Flask, request
import sqlite3

app = Flask(__name__)

@app.route("/user")
def get_user():
    user_id = request.args.get("id")  # Taint source
    conn = sqlite3.connect(":memory:")
    # SQL injection: user_id flows to query without sanitization
    result = conn.execute(f"SELECT * FROM users WHERE id = {user_id}")  # Taint sink
    return str(result.fetchall())

if __name__ == "__main__":
    app.run()

Step 4: Define Expected Findings

Create expected.json with all expected findings:

{
  "findings": [
    {
      "vuln_key": "CWE-89:pkg:pypi/flask@2.0.0",
      "tier": "tainted_sink",
      "rule_key": "py.sql.injection",
      "sink_class": "sql",
      "location_hint": "app.py:13",
      "notes": "User input from request.args flows to sqlite3.execute"
    }
  ]
}

Step 5: Create SBOM

Generate or create an SBOM for the sample:

{
  "bomFormat": "CycloneDX",
  "specVersion": "1.6",
  "version": 1,
  "components": [
    {
      "type": "library",
      "name": "flask",
      "version": "2.0.0",
      "purl": "pkg:pypi/flask@2.0.0"
    },
    {
      "type": "library",
      "name": "sqlite3",
      "version": "3.39.0",
      "purl": "pkg:pypi/sqlite3@3.39.0"
    }
  ]
}
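If no SBOM generator is available, a minimal CycloneDX document matching the shape above can be assembled by hand. This sketch emits only the fields shown in the example and is not a substitute for a real generator:

```python
import json

def make_component(name: str, version: str, ecosystem: str = "pypi") -> dict:
    """Build one CycloneDX library component with a purl."""
    return {
        "type": "library",
        "name": name,
        "version": version,
        "purl": f"pkg:{ecosystem}/{name}@{version}",
    }

def make_sbom(components: list) -> dict:
    """Wrap components in a minimal CycloneDX 1.6 envelope."""
    return {
        "bomFormat": "CycloneDX",
        "specVersion": "1.6",
        "version": 1,
        "components": components,
    }

if __name__ == "__main__":
    sbom = make_sbom([make_component("flask", "2.0.0")])
    print(json.dumps(sbom, indent=2))
```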

Step 6: Update Corpus Index

Add entry to corpus.json:

{
  "id": "gt-0001",
  "path": "samples/gt-0001",
  "language": "python",
  "tier": "tainted_sink",
  "scenario": "webapi",
  "expected_count": 1
}

Step 7: Validate Locally

# Run corpus validation
dotnet test tests/reachability/StellaOps.Reachability.FixtureTests \
  --filter "FullyQualifiedName~CorpusFixtureTests"

# Run benchmark
stellaops bench corpus run --sample gt-0001 --verbose

Tier Guidelines

Imported Tier Samples

For imported tier samples:

  • Vulnerability in a dependency
  • No execution path to vulnerable code
  • Package is in lockfile but not called

Example: Unused dependency with known CVE.
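The code for such a sample can be trivially small; the point is what it does NOT do. In this sketch the package name is hypothetical: "vulnerable-pkg" would appear in the lockfile and SBOM, but nothing in the code imports or calls it.

```python
# app.py - imported-tier sketch: "vulnerable-pkg" is declared in
# requirements.txt and therefore lands in the SBOM, but nothing
# below imports or calls it, so no execution path exists.

def main() -> int:
    print("hello from a sample that never touches vulnerable-pkg")
    return 0

if __name__ == "__main__":
    main()
```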

Executed Tier Samples

For executed tier samples:

  • Vulnerable code is called from entrypoint
  • No user-controlled data reaches the vulnerability
  • Static or coverage analysis proves execution

Example: Hardcoded SQL query (no injection).
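Continuing the SQL theme from Step 3, an executed-tier counterpart might look like this sketch (stdlib only): the same sqlite3 execute sink runs on every call, but the query is a hardcoded constant, so no user-controlled data can reach it.

```python
# executed-tier sketch: the sink API is reached from the
# entrypoint, but the query string is a compile-time constant.
import sqlite3

def get_admins() -> list:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, role TEXT)")
    conn.execute("INSERT INTO users VALUES (1, 'admin')")
    # Hardcoded query: execute() is exercised, but no taint flows in.
    return conn.execute("SELECT * FROM users WHERE role = 'admin'").fetchall()

if __name__ == "__main__":
    print(get_admins())
```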

Tainted→Sink Tier Samples

For tainted_sink tier samples:

  • User-controlled input reaches vulnerable code
  • Clear source → sink data flow
  • Include sink class taxonomy

Example: User input to SQL query, command execution, etc.

Sink Classes

When contributing tainted_sink samples, specify the sink class:

| Sink Class | Description | Examples |
|------------|-------------|----------|
| sql | SQL injection | sqlite3.execute, cursor.execute |
| command | Command injection | os.system, subprocess.run |
| ssrf | Server-side request forgery | requests.get, urllib.urlopen |
| path | Path traversal | open(), os.path.join |
| deser | Deserialization | pickle.loads, yaml.load |
| eval | Code evaluation | eval(), exec() |
| xxe | XML external entity | lxml.parse, ET.parse |
| xss | Cross-site scripting | innerHTML, document.write |
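As a second illustration of the command sink class, the tainted and safe variants differ only in how the argument reaches the shell. This sketch deliberately uses a harmless echo so it can actually run:

```python
import subprocess

def run_tainted(user_arg: str) -> str:
    """Tainted sink: user_arg is interpolated into a shell string."""
    # Command injection: shell=True plus string interpolation means
    # metacharacters in user_arg are executed by the shell.
    out = subprocess.run(f"echo {user_arg}", shell=True,
                         capture_output=True, text=True)
    return out.stdout.strip()

def run_safe(user_arg: str) -> str:
    """Safe: argument list, no shell, no injection surface."""
    out = subprocess.run(["echo", user_arg], capture_output=True, text=True)
    return out.stdout.strip()
```

A tainted_sink sample would include only the vulnerable variant; the safe variant is the shape a negative sample would take.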

Quality Criteria

Samples must meet these criteria:

  • Deterministic: Same input → same output
  • Minimal: Smallest code to demonstrate
  • Documented: Clear description and notes
  • Validated: Passes local tests
  • Realistic: Based on real vulnerability patterns
  • Self-contained: No external network calls

Negative Samples

Include "negative" samples where scanner should NOT find vulnerabilities:

{
  "id": "gt-0050",
  "name": "Python SQL - Properly Sanitized",
  "tier": "imported",
  "expected_count": 0,
  "notes": "Uses parameterized queries, no injection possible"
}
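The code behind such a negative sample might mirror the Step 3 example but use a parameterized query, so the user-supplied value is bound as data and never becomes SQL text (sketch, stdlib only):

```python
# negative-sample sketch: same shape as the gt-0001 app, but the
# user-supplied id is bound as a parameter, not concatenated.
import sqlite3

def get_user(user_id: str) -> list:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    conn.execute("INSERT INTO users VALUES (1, 'alice')")
    # Parameterized query: sqlite3 treats user_id as data, not SQL.
    return conn.execute("SELECT * FROM users WHERE id = ?",
                        (user_id,)).fetchall()
```

An injection attempt such as "1 OR 1=1" is compared as a literal value and matches nothing.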

Review Process

  1. Create PR with new sample(s)
  2. CI runs validation tests
  3. Security team reviews expected findings
  4. QA team verifies determinism
  5. Merge and update baseline

Updating Baselines

After adding samples, update baseline metrics:

# Generate new baseline
stellaops bench corpus run --all --output baselines/v1.1.0.json

# Compare to previous
stellaops bench corpus compare baselines/v1.0.0.json baselines/v1.1.0.json

FAQ

How many samples should I contribute?

Start with 2-3 high-quality samples covering different aspects of the same vulnerability class.

Can I use synthetic vulnerabilities?

Yes, but prefer real CVE patterns when possible. Synthetic samples should document the vulnerability pattern clearly.

What if my sample has multiple findings?

Include all expected findings in expected.json. Multi-finding samples are valuable for testing.

How do I test tier classification?

Run with verbose output:

stellaops bench corpus run --sample gt-NNNN --verbose --show-evidence