
Corpus Contribution Guide

Sprint: SPRINT_3500_0003_0001
Task: CORPUS-014 - Document corpus contribution guide

Overview

The Ground-Truth Corpus is a collection of validated test samples used to measure scanner accuracy. Each sample has known reachability status and expected findings, enabling deterministic quality metrics.

Corpus Structure

datasets/reachability/
├── corpus.json                # Index of all samples
├── schemas/
│   └── corpus-sample.v1.json  # JSON schema for samples
├── samples/
│   ├── gt-0001/               # Sample directory
│   │   ├── sample.json        # Sample metadata
│   │   ├── expected.json      # Expected findings
│   │   ├── sbom.json          # Input SBOM
│   │   └── source/            # Optional source files
│   └── ...
└── baselines/
    └── v1.0.0.json            # Baseline metrics

Sample Format

sample.json

{
  "id": "gt-0001",
  "name": "Python SQL Injection - Reachable",
  "description": "Flask app with reachable SQL injection via user input",
  "language": "python",
  "ecosystem": "pypi",
  "scenario": "webapi",
  "entrypoints": ["app.py:main"],
  "reachability_tier": "tainted_sink",
  "created_at": "2025-01-15T00:00:00Z",
  "author": "security-team",
  "tags": ["sql-injection", "flask", "reachable"]
}

expected.json

{
  "findings": [
    {
      "vuln_key": "CVE-2024-1234:pkg:pypi/sqlalchemy@1.4.0",
      "tier": "tainted_sink",
      "rule_key": "py.sql.injection.param_concat",
      "sink_class": "sql",
      "location_hint": "app.py:42"
    }
  ]
}
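The authoritative schema is schemas/corpus-sample.v1.json, but a quick local pre-check can catch missing keys before CI runs. The following sketch is hypothetical helper code (not part of the official tooling); the required key sets are inferred from the examples above:

```python
# Informal pre-check for sample.json / expected.json.
# Key sets below are taken from the examples in this guide; the JSON
# schema at schemas/corpus-sample.v1.json remains the source of truth.
SAMPLE_KEYS = {"id", "name", "description", "language", "ecosystem",
               "scenario", "entrypoints", "reachability_tier",
               "created_at", "author", "tags"}
FINDING_KEYS = {"vuln_key", "tier", "rule_key", "sink_class", "location_hint"}

def missing_keys(doc: dict, required: set) -> set:
    """Return the required keys absent from a parsed JSON document."""
    return required - doc.keys()

def check_sample(sample: dict, expected: dict) -> list:
    """Collect human-readable problems for a sample/expected pair."""
    problems = [f"sample.json missing: {k}"
                for k in sorted(missing_keys(sample, SAMPLE_KEYS))]
    for i, finding in enumerate(expected.get("findings", [])):
        problems += [f"findings[{i}] missing: {k}"
                     for k in sorted(missing_keys(finding, FINDING_KEYS))]
    return problems
```

An empty result means both files carry at least the documented fields; schema validation in CI still has the final word.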

Contributing a Sample

Step 1: Choose a Scenario

Select a scenario that is not well-covered in the corpus:

| Scenario | Description | Example |
|----------|-------------|---------|
| webapi | Web application endpoint | Flask, FastAPI, Express |
| cli | Command-line tool | argparse, click, commander |
| job | Background/scheduled job | Celery, cron script |
| lib | Library code | Reusable package |

Step 2: Create Sample Directory

cd datasets/reachability/samples
mkdir gt-NNNN
cd gt-NNNN

Use the next available sample ID (check corpus.json for the highest).
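Rather than eyeballing corpus.json, the next free ID can be computed. This helper is a sketch, and the assumption that the index is either a bare list of entries or an object with a "samples" array (each entry carrying an "id" like "gt-0001") is mine, not the official format:

```python
import re

def next_sample_id(corpus) -> str:
    """Compute the next gt-NNNN id from parsed corpus.json.

    Accepts either a bare list of entries or an object with a
    "samples" array; the exact top-level shape is an assumption.
    """
    entries = corpus.get("samples", []) if isinstance(corpus, dict) else corpus
    highest = 0
    for entry in entries:
        m = re.fullmatch(r"gt-(\d{4})", entry.get("id", ""))
        if m:
            highest = max(highest, int(m.group(1)))
    return f"gt-{highest + 1:04d}"
```

For example, an index whose highest entry is gt-0007 yields gt-0008.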

Step 3: Create Minimal Reproducible Case

Requirements:

  • Smallest possible code to demonstrate the vulnerability
  • Real or realistic vulnerability (use CVE when possible)
  • Clear entrypoint definition
  • Deterministic behavior (no network, no randomness)

Example Python Sample:

# app.py - gt-0001
from flask import Flask, request
import sqlite3

app = Flask(__name__)

@app.route("/user")
def get_user():
    user_id = request.args.get("id")  # Taint source
    conn = sqlite3.connect(":memory:")
    # SQL injection: user_id flows to query without sanitization
    result = conn.execute(f"SELECT * FROM users WHERE id = {user_id}")  # Taint sink
    return str(result.fetchall())

if __name__ == "__main__":
    app.run()

Step 4: Define Expected Findings

Create expected.json with all expected findings:

{
  "findings": [
    {
      "vuln_key": "CWE-89:pkg:pypi/flask@2.0.0",
      "tier": "tainted_sink",
      "rule_key": "py.sql.injection",
      "sink_class": "sql",
      "location_hint": "app.py:13",
      "notes": "User input from request.args flows to sqlite3.execute"
    }
  ]
}

Step 5: Create SBOM

Generate or create an SBOM for the sample:

{
  "bomFormat": "CycloneDX",
  "specVersion": "1.6",
  "version": 1,
  "components": [
    {
      "type": "library",
      "name": "flask",
      "version": "2.0.0",
      "purl": "pkg:pypi/flask@2.0.0"
    },
    {
      "type": "library",
      "name": "sqlite3",
      "version": "3.39.0",
      "purl": "pkg:pypi/sqlite3@3.39.0"
    }
  ]
}
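If no SBOM generator is available, a minimal CycloneDX document matching the shape above can be assembled by hand. This sketch emits only the fields shown in the example and is not a substitute for a real generator:

```python
import json

def make_component(name: str, version: str, ecosystem: str = "pypi") -> dict:
    """Build one CycloneDX library component with a purl."""
    return {
        "type": "library",
        "name": name,
        "version": version,
        "purl": f"pkg:{ecosystem}/{name}@{version}",
    }

def make_sbom(components: list) -> dict:
    """Wrap components in a minimal CycloneDX 1.6 envelope."""
    return {
        "bomFormat": "CycloneDX",
        "specVersion": "1.6",
        "version": 1,
        "components": components,
    }

if __name__ == "__main__":
    sbom = make_sbom([make_component("flask", "2.0.0")])
    print(json.dumps(sbom, indent=2))
```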

Step 6: Update Corpus Index

Add entry to corpus.json:

{
  "id": "gt-0001",
  "path": "samples/gt-0001",
  "language": "python",
  "tier": "tainted_sink",
  "scenario": "webapi",
  "expected_count": 1
}

Step 7: Validate Locally

# Run corpus validation
dotnet test tests/reachability/StellaOps.Reachability.FixtureTests \
  --filter "FullyQualifiedName~CorpusFixtureTests"

# Run benchmark
stellaops bench corpus run --sample gt-0001 --verbose

Tier Guidelines

Imported Tier Samples

For imported tier samples:

  • Vulnerability in a dependency
  • No execution path to vulnerable code
  • Package is in lockfile but not called

Example: Unused dependency with known CVE.
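The code for such a sample can be trivially small; the point is what it does NOT do. In this sketch the package name is hypothetical: "vulnerable-pkg" would appear in the lockfile and SBOM, but nothing in the code imports or calls it.

```python
# app.py - imported-tier sketch: "vulnerable-pkg" is declared in
# requirements.txt and therefore lands in the SBOM, but nothing
# below imports or calls it, so no execution path exists.

def main() -> int:
    print("hello from a sample that never touches vulnerable-pkg")
    return 0

if __name__ == "__main__":
    main()
```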

Executed Tier Samples

For executed tier samples:

  • Vulnerable code is called from entrypoint
  • No user-controlled data reaches the vulnerability
  • Static or coverage analysis proves execution

Example: Hardcoded SQL query (no injection).
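Continuing the SQL theme from Step 3, an executed-tier counterpart might look like this sketch (stdlib only): the same sqlite3 execute sink runs on every call, but the query is a hardcoded constant, so no user-controlled data can reach it.

```python
# executed-tier sketch: the sink API is reached from the
# entrypoint, but the query string is a compile-time constant.
import sqlite3

def get_admins() -> list:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, role TEXT)")
    conn.execute("INSERT INTO users VALUES (1, 'admin')")
    # Hardcoded query: execute() is exercised, but no taint flows in.
    return conn.execute("SELECT * FROM users WHERE role = 'admin'").fetchall()

if __name__ == "__main__":
    print(get_admins())
```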

Tainted→Sink Tier Samples

For tainted_sink tier samples:

  • User-controlled input reaches vulnerable code
  • Clear source → sink data flow
  • Include sink class taxonomy

Example: User input to SQL query, command execution, etc.

Sink Classes

When contributing tainted_sink samples, specify the sink class:

| Sink Class | Description | Examples |
|------------|-------------|----------|
| sql | SQL injection | sqlite3.execute, cursor.execute |
| command | Command injection | os.system, subprocess.run |
| ssrf | Server-side request forgery | requests.get, urllib.urlopen |
| path | Path traversal | open(), os.path.join |
| deser | Deserialization | pickle.loads, yaml.load |
| eval | Code evaluation | eval(), exec() |
| xxe | XML external entity | lxml.parse, ET.parse |
| xss | Cross-site scripting | innerHTML, document.write |
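As a second illustration of the command sink class, the tainted and safe variants differ only in how the argument reaches the shell. This sketch deliberately uses a harmless echo so it can actually run:

```python
import subprocess

def run_tainted(user_arg: str) -> str:
    """Tainted sink: user_arg is interpolated into a shell string."""
    # Command injection: shell=True plus string interpolation means
    # metacharacters in user_arg are executed by the shell.
    out = subprocess.run(f"echo {user_arg}", shell=True,
                         capture_output=True, text=True)
    return out.stdout.strip()

def run_safe(user_arg: str) -> str:
    """Safe: argument list, no shell, no injection surface."""
    out = subprocess.run(["echo", user_arg], capture_output=True, text=True)
    return out.stdout.strip()
```

A tainted_sink sample would include only the vulnerable variant; the safe variant is the shape a negative sample would take.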

Quality Criteria

Samples must meet these criteria:

  • Deterministic: Same input → same output
  • Minimal: Smallest code to demonstrate
  • Documented: Clear description and notes
  • Validated: Passes local tests
  • Realistic: Based on real vulnerability patterns
  • Self-contained: No external network calls

Negative Samples

Include "negative" samples where scanner should NOT find vulnerabilities:

{
  "id": "gt-0050",
  "name": "Python SQL - Properly Sanitized",
  "tier": "imported",
  "expected_count": 0,
  "notes": "Uses parameterized queries, no injection possible"
}
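The code behind such a negative sample might mirror the Step 3 example but use a parameterized query, so the user-supplied value is bound as data and never becomes SQL text (sketch, stdlib only):

```python
# negative-sample sketch: same shape as the gt-0001 app, but the
# user-supplied id is bound as a parameter, not concatenated.
import sqlite3

def get_user(user_id: str) -> list:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    conn.execute("INSERT INTO users VALUES (1, 'alice')")
    # Parameterized query: sqlite3 treats user_id as data, not SQL.
    return conn.execute("SELECT * FROM users WHERE id = ?",
                        (user_id,)).fetchall()
```

An injection attempt such as "1 OR 1=1" is compared as a literal value and matches nothing.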

Review Process

  1. Create PR with new sample(s)
  2. CI runs validation tests
  3. Security team reviews expected findings
  4. QA team verifies determinism
  5. Merge and update baseline

Updating Baselines

After adding samples, update baseline metrics:

# Generate new baseline
stellaops bench corpus run --all --output baselines/v1.1.0.json

# Compare to previous
stellaops bench corpus compare baselines/v1.0.0.json baselines/v1.1.0.json

FAQ

How many samples should I contribute?

Start with 2-3 high-quality samples covering different aspects of the same vulnerability class.

Can I use synthetic vulnerabilities?

Yes, but prefer real CVE patterns when possible. Synthetic samples should document the vulnerability pattern clearly.

What if my sample has multiple findings?

Include all expected findings in expected.json. Multi-finding samples are valuable for testing.

How do I test tier classification?

Run with verbose output:

stellaops bench corpus run --sample gt-NNNN --verbose --show-evidence