feat(rate-limiting): Implement core rate limiting functionality with configuration, decision-making, metrics, middleware, and service registration
- Add RateLimitConfig for configuration management with YAML binding support.
- Introduce RateLimitDecision to encapsulate the result of rate limit checks.
- Implement RateLimitMetrics for OpenTelemetry metrics tracking.
- Create RateLimitMiddleware for enforcing rate limits on incoming requests.
- Develop RateLimitService to orchestrate instance and environment rate limit checks.
- Add RateLimitServiceCollectionExtensions for dependency injection registration.
301
docs/contributing/corpus-contribution-guide.md
Normal file
@@ -0,0 +1,301 @@
# Corpus Contribution Guide

**Sprint:** SPRINT_3500_0003_0001
**Task:** CORPUS-014 - Document corpus contribution guide

## Overview

The Ground-Truth Corpus is a collection of validated test samples used to measure scanner accuracy. Each sample has a known reachability status and a known set of expected findings, enabling deterministic quality metrics.

## Corpus Structure

```
datasets/reachability/
├── corpus.json                      # Index of all samples
├── schemas/
│   └── corpus-sample.v1.json        # JSON schema for samples
├── samples/
│   ├── gt-0001/                     # Sample directory
│   │   ├── sample.json              # Sample metadata
│   │   ├── expected.json            # Expected findings
│   │   ├── sbom.json                # Input SBOM
│   │   └── source/                  # Optional source files
│   └── ...
└── baselines/
    └── v1.0.0.json                  # Baseline metrics
```

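Before opening a PR it can help to confirm that a sample directory matches this layout. The following is a minimal sketch, not part of the StellaOps tooling: the helper name `missing_sample_files` is hypothetical, and it assumes that `sample.json`, `expected.json`, and `sbom.json` are all mandatory while `source/` is optional, as the tree suggests.

```python
from pathlib import Path

# Assumption: these three files are required in every sample directory.
# source/ is marked optional in the tree, so it is not checked here.
REQUIRED_FILES = ("sample.json", "expected.json", "sbom.json")


def missing_sample_files(sample_dir: Path) -> list[str]:
    """Return the required file names that are absent from sample_dir."""
    return [name for name in REQUIRED_FILES if not (sample_dir / name).is_file()]
```

Calling `missing_sample_files(Path("datasets/reachability/samples/gt-0001"))` would return an empty list for a complete sample and the names of any missing files otherwise.
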
## Sample Format

### sample.json

```json
{
  "id": "gt-0001",
  "name": "Python SQL Injection - Reachable",
  "description": "Flask app with reachable SQL injection via user input",
  "language": "python",
  "ecosystem": "pypi",
  "scenario": "webapi",
  "entrypoints": ["app.py:main"],
  "reachability_tier": "tainted_sink",
  "created_at": "2025-01-15T00:00:00Z",
  "author": "security-team",
  "tags": ["sql-injection", "flask", "reachable"]
}
```

### expected.json

```json
{
  "findings": [
    {
      "vuln_key": "CVE-2024-1234:pkg:pypi/sqlalchemy@1.4.0",
      "tier": "tainted_sink",
      "rule_key": "py.sql.injection.param_concat",
      "sink_class": "sql",
      "location_hint": "app.py:42"
    }
  ]
}
```

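The `vuln_key` values in this guide appear to follow a `<vuln-id>:<purl>` shape. That shape is inferred from the examples and is not a documented contract, so treat the sketch below as illustrative only. `VulnKey` and `parse_vuln_key` are hypothetical names.

```python
from typing import NamedTuple


class VulnKey(NamedTuple):
    vuln_id: str  # e.g. "CVE-2024-1234" or "CWE-89"
    purl: str     # e.g. "pkg:pypi/sqlalchemy@1.4.0"


def parse_vuln_key(key: str) -> VulnKey:
    """Split an assumed '<vuln-id>:<purl>' key at the first ':pkg:' marker."""
    vuln_id, sep, rest = key.partition(":pkg:")
    if not sep or not vuln_id or not rest:
        raise ValueError(f"not a '<vuln-id>:<purl>' key: {key!r}")
    # partition() consumed the 'pkg:' prefix, so put it back on the purl.
    return VulnKey(vuln_id, "pkg:" + rest)
```

Splitting on `:pkg:` rather than on the first colon keeps the package URL intact, since a purl itself contains colons.
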
## Contributing a Sample

### Step 1: Choose a Scenario

Select a scenario that is not well-covered in the corpus:

| Scenario | Description | Example |
|----------|-------------|---------|
| `webapi` | Web application endpoint | Flask, FastAPI, Express |
| `cli` | Command-line tool | argparse, click, commander |
| `job` | Background/scheduled job | Celery, cron script |
| `lib` | Library code | Reusable package |

### Step 2: Create Sample Directory

```bash
cd datasets/reachability/samples
mkdir gt-NNNN
cd gt-NNNN
```

Replace `NNNN` with the next available sample ID; check `corpus.json` for the highest ID currently in use.

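Picking the next ID can be sketched as follows. The helper `next_sample_id` is hypothetical and takes a plain list of IDs, because this guide does not show how `corpus.json` nests its entries; it also assumes IDs always use the four-digit `gt-NNNN` form.

```python
import re

# Assumption: sample IDs are "gt-" followed by exactly four digits.
_ID_PATTERN = re.compile(r"^gt-(\d{4})$")


def next_sample_id(existing_ids: list[str]) -> str:
    """Return the ID one past the highest well-formed ID in existing_ids."""
    numbers = []
    for sample_id in existing_ids:
        match = _ID_PATTERN.match(sample_id)
        if match:  # silently skip anything that is not gt-NNNN
            numbers.append(int(match.group(1)))
    highest = max(numbers, default=0)
    return f"gt-{highest + 1:04d}"
```

Using the highest ID rather than the number of entries avoids collisions when earlier samples have been removed from the index.
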
### Step 3: Create Minimal Reproducible Case

**Requirements:**
- Smallest possible code to demonstrate the vulnerability
- Real or realistic vulnerability (use CVE when possible)
- Clear entrypoint definition
- Deterministic behavior (no network, no randomness)

**Example Python Sample:**

```python
# app.py - gt-0001
from flask import Flask, request
import sqlite3

app = Flask(__name__)

@app.route("/user")
def get_user():
    user_id = request.args.get("id")  # Taint source
    conn = sqlite3.connect(":memory:")
    # SQL injection: user_id flows to query without sanitization
    result = conn.execute(f"SELECT * FROM users WHERE id = {user_id}")  # Taint sink
    return str(result.fetchall())

if __name__ == "__main__":
    app.run()
```

### Step 4: Define Expected Findings

Create `expected.json` with all expected findings:

```json
{
  "findings": [
    {
      "vuln_key": "CWE-89:pkg:pypi/flask@2.0.0",
      "tier": "tainted_sink",
      "rule_key": "py.sql.injection",
      "sink_class": "sql",
      "location_hint": "app.py:12",
      "notes": "User input from request.args flows to sqlite3.execute"
    }
  ]
}
```

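A `location_hint` is easy to get wrong by one line, so it may be worth checking that each hint lands on the statement you intended. The sketch below assumes hints have the form `<file>:<line>` with 1-based line numbers, which matches the examples here but is not confirmed by the schema. `resolve_location_hint` is a hypothetical helper that works on in-memory source text.

```python
def resolve_location_hint(hint: str, sources: dict[str, str]) -> str:
    """Return the source line that an assumed '<file>:<line>' hint points at."""
    path, _, line_text = hint.rpartition(":")
    if not path or not line_text.isdigit():
        raise ValueError(f"malformed location hint: {hint!r}")
    if path not in sources:
        raise ValueError(f"unknown file in hint: {path!r}")
    lines = sources[path].splitlines()
    line_no = int(line_text)  # assumed to be 1-based
    if not 1 <= line_no <= len(lines):
        raise ValueError(f"{path} has no line {line_no}")
    return lines[line_no - 1]
```

For example, with `sources = {"app.py": "import sqlite3\n\nconn.execute(query)\n"}`, the hint `app.py:3` resolves to `conn.execute(query)`, while `app.py:9` raises `ValueError`.
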
### Step 5: Create SBOM

Generate or create an SBOM for the sample:

```json
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.6",
  "version": 1,
  "components": [
    {
      "type": "library",
      "name": "flask",
      "version": "2.0.0",
      "purl": "pkg:pypi/flask@2.0.0"
    },
    {
      "type": "library",
      "name": "sqlite3",
      "version": "3.39.0",
      "purl": "pkg:pypi/sqlite3@3.39.0"
    }
  ]
}
```

### Step 6: Update Corpus Index

Add an entry for the new sample to `corpus.json`:

```json
{
  "id": "gt-0001",
  "path": "samples/gt-0001",
  "language": "python",
  "tier": "tainted_sink",
  "scenario": "webapi",
  "expected_count": 1
}
```

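The index entry and `expected.json` can drift apart when findings are added later. Here is a minimal consistency check, assuming `expected_count` is meant to equal the number of items in the `findings` array; the guide implies this but does not state it outright. `count_mismatch` is a hypothetical helper.

```python
from typing import Optional


def count_mismatch(index_entry: dict, expected: dict) -> Optional[str]:
    """Return a message if expected_count and findings disagree, else None."""
    declared = index_entry.get("expected_count")
    actual = len(expected.get("findings", []))
    if declared != actual:
        sample_id = index_entry.get("id", "<unknown>")
        return f"{sample_id}: expected_count is {declared}, expected.json lists {actual}"
    return None
```

Returning a message instead of raising lets a caller collect mismatches across the whole corpus and report them together.
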
### Step 7: Validate Locally

```bash
# Run corpus validation
dotnet test tests/reachability/StellaOps.Reachability.FixtureTests \
  --filter "FullyQualifiedName~CorpusFixtureTests"

# Run benchmark
stellaops bench corpus run --sample gt-0001 --verbose
```

## Tier Guidelines

### Imported Tier Samples

For `imported` tier samples:
- Vulnerability in a dependency
- No execution path to vulnerable code
- Package is in lockfile but not called

**Example:** Unused dependency with known CVE.

### Executed Tier Samples

For `executed` tier samples:
- Vulnerable code is called from entrypoint
- No user-controlled data reaches the vulnerability
- Static or coverage analysis proves execution

**Example:** Hardcoded SQL query (no injection).

### Tainted→Sink Tier Samples

For `tainted_sink` tier samples:
- User-controlled input reaches vulnerable code
- Clear source → sink data flow
- Include sink class taxonomy

**Example:** User input to SQL query, command execution, etc.

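The three tiers can be read as a decision on two questions: is the vulnerable code executed from an entrypoint, and does user-controlled data reach it? The sketch below encodes that reading. It is an interpretation of the guidelines above, not the scanner's actual classification logic, and `Tier` and `classify_tier` are hypothetical names.

```python
from enum import Enum


class Tier(str, Enum):
    IMPORTED = "imported"          # dependency present, never called
    EXECUTED = "executed"          # called, but no user-controlled data
    TAINTED_SINK = "tainted_sink"  # user-controlled data reaches the sink


def classify_tier(executed: bool, tainted: bool) -> Tier:
    """Map the two reachability questions onto a tier name."""
    if tainted and not executed:
        # Tainted data cannot reach code that never runs; treat as author error.
        raise ValueError("a tainted sink must also be executed")
    if tainted:
        return Tier.TAINTED_SINK
    if executed:
        return Tier.EXECUTED
    return Tier.IMPORTED
```

Answering both questions explicitly for each new sample is a quick way to decide which value belongs in its `tier` field.
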
## Sink Classes

When contributing `tainted_sink` samples, specify the sink class:

| Sink Class | Description | Examples |
|------------|-------------|----------|
| `sql` | SQL injection | sqlite3.execute, cursor.execute |
| `command` | Command injection | os.system, subprocess.run |
| `ssrf` | Server-side request forgery | requests.get, urllib.request.urlopen |
| `path` | Path traversal | open(), os.path.join |
| `deser` | Deserialization | pickle.loads, yaml.load |
| `eval` | Code evaluation | eval(), exec() |
| `xxe` | XML external entity | lxml.etree.parse, ET.parse |
| `xss` | Cross-site scripting | innerHTML, document.write |

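A typo in `sink_class` is easy to miss in review, so a small lint over each finding can help. This sketch assumes the table above is the complete list of sink classes and that every `tainted_sink` finding must carry one. Both are readings of this guide rather than rules taken from the schema, and `finding_problems` is a hypothetical helper.

```python
# Assumption: the sink classes listed in the table are exhaustive.
SINK_CLASSES = frozenset(
    {"sql", "command", "ssrf", "path", "deser", "eval", "xxe", "xss"}
)


def finding_problems(finding: dict) -> list[str]:
    """Return human-readable problems with a finding's sink_class, if any."""
    problems = []
    sink_class = finding.get("sink_class")
    if finding.get("tier") == "tainted_sink" and sink_class is None:
        problems.append("tainted_sink finding is missing sink_class")
    if sink_class is not None and sink_class not in SINK_CLASSES:
        problems.append(f"unknown sink_class: {sink_class!r}")
    return problems
```

An empty list means the finding passed both checks.
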
## Quality Criteria

Samples must meet these criteria:

- [ ] **Deterministic**: Same input → same output
- [ ] **Minimal**: Smallest code to demonstrate
- [ ] **Documented**: Clear description and notes
- [ ] **Validated**: Passes local tests
- [ ] **Realistic**: Based on real vulnerability patterns
- [ ] **Self-contained**: No external network calls

## Negative Samples

Include "negative" samples, where the scanner should NOT report any vulnerabilities:

```json
{
  "id": "gt-0050",
  "name": "Python SQL - Properly Sanitized",
  "tier": "imported",
  "expected_count": 0,
  "notes": "Uses parameterized queries, no injection possible"
}
```

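The source for such a negative sample might look like the sketch below, which uses a `?` placeholder so that user input is bound as data rather than spliced into the SQL text. It is an illustrative stand-in and not the actual contents of `gt-0050`; the `users` table, its rows, and the helper names `make_db` and `get_user` are assumptions.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a small, fixed users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO users (id, name) VALUES (?, ?)",
        [(1, "alice"), (2, "bob")],
    )
    return conn


def get_user(conn: sqlite3.Connection, user_id: str) -> list[tuple]:
    """Look up a user by ID with a parameterized query."""
    # The ? placeholder binds user_id as a value; it is never parsed as SQL.
    cursor = conn.execute("SELECT id, name FROM users WHERE id = ?", (user_id,))
    return cursor.fetchall()
```

With this version, a payload such as `1 OR 1=1` is compared to the `id` column as a single literal value and matches no rows, so the scanner should report nothing for the query.
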
## Review Process

1. Create PR with new sample(s)
2. CI runs validation tests
3. Security team reviews expected findings
4. QA team verifies determinism
5. Merge and update baseline

## Updating Baselines

After adding samples, update baseline metrics:

```bash
# Generate new baseline
stellaops bench corpus run --all --output baselines/v1.1.0.json

# Compare to previous
stellaops bench corpus compare baselines/v1.0.0.json baselines/v1.1.0.json
```

## FAQ

### How many samples should I contribute?

Start with 2-3 high-quality samples covering different aspects of the same vulnerability class.

### Can I use synthetic vulnerabilities?

Yes, but prefer real CVE patterns when possible. Synthetic samples should document the vulnerability pattern clearly.

### What if my sample has multiple findings?

Include all expected findings in `expected.json`. Multi-finding samples are valuable for testing.

### How do I test tier classification?

Run with verbose output:
```bash
stellaops bench corpus run --sample gt-NNNN --verbose --show-evidence
```

## Related Documentation

- [Tiered Precision Curves](../benchmarks/tiered-precision-curves.md)
- [Reachability Analysis](../product-advisories/14-Dec-2025%20-%20Reachability%20Analysis%20Technical%20Reference.md)
- [Corpus Sample Schema](../../datasets/reachability/schemas/corpus-sample.v1.json)