feat: add security sink detection patterns for JavaScript/TypeScript

- Introduced `sink-detect.js` with various security sink detection patterns categorized by type (e.g., command injection, SQL injection, file operations).
- Implemented functions to build a lookup map for fast sink detection and to match sink calls against known patterns.
- Added `package-lock.json` for dependency management.
StellaOps Bot
2025-12-22 23:21:21 +02:00
parent 3ba7157b00
commit 5146204f1b
529 changed files with 73579 additions and 5985 deletions

# Router Chaos Testing Runbook
**Sprint:** SPRINT_5100_0005_0001
**Last Updated:** 2025-12-22
## Overview
This document describes the chaos testing approach for the StellaOps router, focusing on backpressure handling, graceful degradation under load, and recovery behavior.
## Test Categories
### 1. Load Testing (k6)
**Location:** `tests/load/router/`
#### Spike Test Scenarios
| Scenario | Rate | Duration | Purpose |
|----------|------|----------|---------|
| Baseline | 100 req/s | 1 min | Establish normal operation |
| 10x Spike | 1000 req/s | 30s | Moderate overload |
| 50x Spike | 5000 req/s | 30s | Severe overload |
| Recovery | 100 req/s | 2 min | Measure recovery time |
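The scenario table maps onto k6's staged load model. A minimal sketch of the `options` object `spike-test.js` might declare (the actual file contents are not shown here, so treat stage targets and thresholds as illustrative):

```javascript
// Hypothetical k6 stage layout mirroring the spike scenario table above.
// In a real k6 script this is exported as `options` next to a default
// function that issues requests against ROUTER_URL.
const options = {
  stages: [
    { duration: "1m", target: 100 },   // Baseline: establish normal operation
    { duration: "30s", target: 1000 }, // 10x spike: moderate overload
    { duration: "30s", target: 5000 }, // 50x spike: severe overload
    { duration: "2m", target: 100 },   // Recovery: measure return to baseline
  ],
  thresholds: {
    // Assumed failure budget for the run; tune per environment.
    http_req_failed: ["rate<0.05"],
  },
};
```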
#### Running Load Tests
```bash
# Install k6
brew install k6 # macOS
# or
choco install k6 # Windows
# Run spike test against local router
k6 run tests/load/router/spike-test.js \
-e ROUTER_URL=http://localhost:8080
# Run against staging
k6 run tests/load/router/spike-test.js \
-e ROUTER_URL=https://router.staging.stellaops.io
# Output results to JSON
k6 run tests/load/router/spike-test.js \
--out json=results.json
```
### 2. Backpressure Verification
**Location:** `tests/chaos/BackpressureVerificationTests.cs`
Tests verify:
- HTTP 429 responses include `Retry-After` header
- HTTP 503 responses include `Retry-After` header
- Retry-After values are reasonable (1-60 seconds)
- No data loss during throttling
#### Expected Behavior
| Load Level | Expected Response | Retry-After |
|------------|-------------------|-------------|
| Normal | 200 OK | N/A |
| High (>80% capacity) | 429 Too Many Requests | 1-10s |
| Critical (>95% capacity) | 503 Service Unavailable | 10-60s |
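The table above amounts to a small decision function over capacity utilization. The router itself is implemented in C#; the sketch below only illustrates the expected mapping, with the exact Retry-After formulas being an assumption:

```javascript
// Illustrative backpressure decision mirroring the Expected Behavior table.
// `utilization` is a 0-1 capacity ratio; thresholds 0.80 and 0.95 come from
// the table, the Retry-After interpolation is an assumption.
function backpressureResponse(utilization) {
  if (utilization > 0.95) {
    // Critical: shed load, ask clients to back off 10-60s.
    return { status: 503, retryAfter: Math.min(60, 10 + Math.round(utilization * 50)) };
  }
  if (utilization > 0.80) {
    // High: throttle with a short Retry-After (1-10s).
    return { status: 429, retryAfter: Math.max(1, Math.round((utilization - 0.80) * 50)) };
  }
  return { status: 200, retryAfter: null };
}
```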
### 3. Recovery Testing
**Location:** `tests/chaos/RecoveryTests.cs`
Tests verify:
- Router recovers within 30 seconds after load drops
- No request queue corruption
- Metrics return to baseline
#### Recovery Thresholds
| Metric | Target | Critical |
|--------|--------|----------|
| P95 Recovery Time | <15s | <30s |
| P99 Recovery Time | <25s | <45s |
| Data Loss | 0% | 0% |
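Recovery percentiles are computed from sampled recovery times. A minimal nearest-rank sketch (not necessarily the exact method `RecoveryTests.cs` uses) evaluated against the threshold table:

```javascript
// Nearest-rank percentile over recovery-time samples (seconds).
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(rank, sorted.length) - 1];
}

// Evaluate a run against the Recovery Thresholds table above.
function recoveryVerdict(samples) {
  const p95 = percentile(samples, 95);
  const p99 = percentile(samples, 99);
  return {
    p95,
    p99,
    pass: p95 < 15 && p99 < 25,       // within Target
    critical: p95 >= 30 || p99 >= 45, // beyond Critical
  };
}
```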
### 4. Valkey Failure Injection
**Location:** `tests/chaos/ValkeyFailureTests.cs`
Tests verify router behavior when Valkey (cache/session store) fails:
- Graceful degradation to stateless mode
- No crashes or hangs
- Proper error logging
- Recovery when Valkey returns
#### Failure Scenarios
| Scenario | Expected Behavior |
|----------|-------------------|
| Valkey unreachable | Fallback to direct processing |
| Valkey slow (>500ms) | Timeout and continue |
| Valkey returns | Resume normal caching |
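The degradation behavior in this table can be sketched as a small tracker that counts consecutive cache failures (unreachable or slower than the 500ms threshold) and falls back to direct processing. The class and field names here are hypothetical; the router's real fallback logic lives in its C# codebase:

```javascript
// Illustrative degradation tracker for the Valkey failure scenarios above.
class CacheFallback {
  constructor(slowThresholdMs = 500, trips = 3) {
    this.slowThresholdMs = slowThresholdMs;
    this.trips = trips;    // consecutive failures before degrading (assumed)
    this.failures = 0;
    this.degraded = false;
  }

  // Record one cache-call observation and return the processing mode.
  observe({ reachable, latencyMs }) {
    const failed = !reachable || latencyMs > this.slowThresholdMs;
    this.failures = failed ? this.failures + 1 : 0;
    if (this.failures >= this.trips) this.degraded = true; // fall back to direct
    if (!failed) this.degraded = false;                    // Valkey returned
    return this.degraded ? "direct" : "cached";
  }
}
```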
## CI Integration
**Workflow:** `.gitea/workflows/router-chaos.yml`
The chaos tests run:
- On every PR to `main` that touches router code
- Nightly against staging environment
- Before production deployments
### Workflow Stages
1. **Build** - Compile router and test projects
2. **Unit Tests** - Run BackpressureVerificationTests
3. **Integration Tests** - Run RecoveryTests, ValkeyFailureTests
4. **Load Tests** - Run k6 spike scenarios (staging only)
5. **Report** - Upload results as artifacts
## Interpreting Results
### Success Criteria
| Metric | Pass | Fail |
|--------|------|------|
| Request success rate during normal load | >=99% | <95% |
| Throttle response rate during spike | >0% (expected) | 0% (no backpressure) |
| Recovery time P95 | <30s | >=45s |
| Data loss | 0% | >0% |
### Common Failure Patterns
#### No Throttling Under Load
**Symptom:** 0% throttled requests during 50x spike
**Cause:** Backpressure not configured or circuit breaker disabled
**Fix:** Check router configuration `backpressure.enabled=true`
#### Slow Recovery
**Symptom:** Recovery time >45s
**Cause:** Request queue not draining properly
**Fix:** Check `maxQueueSize` and `drainTimeoutSeconds` settings
#### Missing Retry-After Header
**Symptom:** 429/503 without Retry-After
**Cause:** Header middleware not applied
**Fix:** Ensure `UseRetryAfterMiddleware()` is in pipeline
## Metrics & Dashboards
### Key Metrics to Monitor
```promql
# Throttle rate
rate(http_requests_total{status="429"}[5m]) / rate(http_requests_total[5m])
# Recovery time
histogram_quantile(0.95, rate(request_recovery_seconds_bucket[5m]))
# Queue depth
router_request_queue_depth
```
### Alert Thresholds
| Alert | Condition | Severity |
|-------|-----------|----------|
| High Throttle Rate | throttle_rate > 10% for 5m | Warning |
| Extended Throttle | throttle_rate > 50% for 2m | Critical |
| Slow Recovery | p95_recovery > 30s | Warning |
| No Recovery | p99_recovery > 60s | Critical |
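These thresholds translate into Prometheus alerting rules along the following lines. Rule names and the recovery-quantile expression are a sketch; adapt the metric names to your deployment:

```yaml
groups:
  - name: router-chaos
    rules:
      - alert: HighThrottleRate
        expr: |
          rate(http_requests_total{status="429"}[5m])
            / rate(http_requests_total[5m]) > 0.10
        for: 5m
        labels:
          severity: warning
      - alert: ExtendedThrottle
        expr: |
          rate(http_requests_total{status="429"}[5m])
            / rate(http_requests_total[5m]) > 0.50
        for: 2m
        labels:
          severity: critical
      - alert: SlowRecovery
        expr: histogram_quantile(0.95, rate(request_recovery_seconds_bucket[5m])) > 30
        labels:
          severity: warning
```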
## Troubleshooting
### Test Environment Setup
```bash
# Start router locally
docker-compose up router valkey
# Verify router health
curl http://localhost:8080/health
# Verify Valkey connection
docker exec -it valkey redis-cli ping
```
### Debug Mode
```bash
# Run tests with verbose logging
dotnet test tests/chaos/ --logger "console;verbosity=detailed"
# k6 with debug output
k6 run tests/load/router/spike-test.js --verbose
```
## References
- [Router Architecture](../modules/router/architecture.md)
- [Backpressure Design](../product-advisories/15-Dec-2025%20-%20Designing%20202%20+%20Retry-After%20Backpressure%20Control.md)
- [Testing Strategy](../product-advisories/20-Dec-2025%20-%20Testing%20strategy.md)

# Trust Lattice Operations Runbook
> **Version**: 1.0.0
> **Last Updated**: 2025-12-22
> **Audience**: Operations and Support teams
---
## 1. Overview
The Trust Lattice is a VEX claim scoring framework that produces explainable, deterministic verdicts. This runbook covers operational procedures for monitoring, troubleshooting, and maintaining the system.
---
## 2. System Components
| Component | Service | Purpose |
|-----------|---------|---------|
| TrustVector | Excititor | 3-component trust scoring (P/C/R) |
| ClaimScoreMerger | Policy | Merge scored claims into verdicts |
| PolicyGates | Policy | Enforce trust thresholds |
| VerdictManifest | Authority | Store signed verdicts |
| Calibration | Excititor | Adjust trust vectors over time |
---
## 3. Monitoring
### 3.1 Key Metrics
| Metric | Alert Threshold | Description |
|--------|-----------------|-------------|
| `trustlattice_score_latency_p95` | > 100ms | Claim scoring latency |
| `trustlattice_merge_conflicts_total` | Rate increase | Claims with status conflicts |
| `policy_gate_failures_total` | Rate increase | Gate rejections |
| `verdict_manifest_replay_failures` | > 0 | Non-deterministic verdicts |
| `calibration_drift_percent` | > 10% | Trust vector drift from baseline |
### 3.2 Dashboards
Access dashboards at:
- Grafana: `https://<grafana>/d/trustlattice`
- Prometheus queries:
```promql
# Average claim score by source class
avg(trustlattice_claim_score) by (source_class)
# Gate failure rate
rate(policy_gate_failures_total[5m])
# Confidence distribution
histogram_quantile(0.5, trustlattice_verdict_confidence_bucket)
```
### 3.3 Log Queries
Key log entries (Loki/ELK):
```
# Claim scoring
{app="excititor"} |= "ClaimScore computed"
# Gate failures
{app="policy"} |= "Gate failed" | json | gate_name != ""
# Verdict replay failures
{app="authority"} |= "Replay mismatch"
```
---
## 4. Common Operations
### 4.1 Viewing Current Trust Vectors
```bash
# Via CLI
stella trustvector list --source-class vendor
# Via API
curl -H "Authorization: Bearer $TOKEN" \
https://api.example.com/api/v1/trustlattice/vectors
```
### 4.2 Inspecting a Verdict
```bash
# Get verdict details
stella verdict show verd:acme:abc123:CVE-2025-12345:1734873600
# Verify verdict replay
stella verdict replay verd:acme:abc123:CVE-2025-12345:1734873600
```
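The verdict identifier in these commands appears to follow a `verd:<tenant>:<asset>:<CVE>:<unix-timestamp>` layout. A hypothetical parser, inferred from the example above rather than from a published spec:

```javascript
// Hypothetical parser for IDs like verd:acme:abc123:CVE-2025-12345:1734873600.
// Field layout is inferred from the runbook examples, not from a spec.
function parseVerdictId(id) {
  const parts = id.split(":");
  if (parts.length !== 5 || parts[0] !== "verd") {
    throw new Error(`unrecognized verdict id: ${id}`);
  }
  const [, tenant, asset, cve, ts] = parts;
  return { tenant, asset, cve, evaluatedAt: new Date(Number(ts) * 1000) };
}
```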
### 4.3 Viewing Gate Configuration
```bash
# List enabled gates
stella gates list --environment production
# Show gate thresholds
stella gates show minimumConfidence --environment production
```
### 4.4 Triggering Manual Calibration
```bash
# Trigger calibration epoch for a source
stella calibration run --source vendor:redhat \
--start 2025-11-01 --end 2025-12-01
# View calibration history
stella calibration history vendor:redhat
```
---
## 5. Emergency Procedures
### 5.1 High Gate Failure Rate
**Symptoms:**
- Spike in `policy_gate_failures_total`
- Many builds failing due to low confidence
**Steps:**
1. Check if VEX source is unavailable:
```bash
stella vex source status vendor:redhat
```
2. If source is stale, consider temporary threshold reduction:
```yaml
# etc/policy-gates.yaml
gates:
  minimumConfidence:
    thresholds:
      production: 0.60  # Reduced from 0.75
```
3. Restart Policy Engine to apply changes
4. Monitor and restore threshold once source recovers
### 5.2 Verdict Replay Failures
**Symptoms:**
- `verdict_manifest_replay_failures` > 0
- Audit compliance check failures
**Steps:**
1. Identify failing verdict:
```bash
stella verdict list --replay-status failed --limit 10
```
2. Compare original and replayed inputs:
```bash
stella verdict diff <manifestId>
```
3. Common causes:
- VEX document modified after verdict
- Clock drift during evaluation
- Policy configuration changed
4. For clock drift, verify NTP synchronization:
```bash
timedatectl status
```
### 5.3 Trust Vector Drift Emergency
**Symptoms:**
- `calibration_drift_percent` > 20%
- Sudden confidence changes across many assets
**Steps:**
1. Freeze calibration:
```bash
stella calibration freeze vendor:redhat
```
2. Investigate recent calibration epochs:
```bash
stella calibration history vendor:redhat --epochs 5
```
3. If false positive rate increased, rollback:
```bash
stella calibration rollback vendor:redhat --to-epoch 41
```
4. Unfreeze after investigation:
```bash
stella calibration unfreeze vendor:redhat
```
---
## 6. Configuration
### 6.1 Configuration Files
| File | Purpose |
|------|---------|
| `etc/trust-lattice.yaml` | Trust vector weights and defaults |
| `etc/policy-gates.yaml` | Gate thresholds and rules |
| `etc/excititor-calibration.yaml` | Calibration parameters |
### 6.2 Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `TRUSTLATTICE_WEIGHTS_PROVENANCE` | 0.45 | Provenance weight |
| `TRUSTLATTICE_WEIGHTS_COVERAGE` | 0.35 | Coverage weight |
| `TRUSTLATTICE_FRESHNESS_HALFLIFE` | 90 | Freshness half-life (days) |
| `GATES_MINIMUM_CONFIDENCE_PROD` | 0.75 | Production confidence threshold |
| `CALIBRATION_LEARNING_RATE` | 0.02 | Calibration learning rate |
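Using the defaults above, claim scoring reduces to a weighted sum of the P/C/R components decayed by freshness. The replayability weight is not listed; the sketch below assumes it is the remainder (0.20, so the weights sum to 1.0), which is an inference, not documented configuration:

```javascript
// Sketch of trust-weighted claim scoring using the environment defaults above.
// WEIGHTS.replayability = 0.20 is an assumption (remainder to 1.0).
const WEIGHTS = { provenance: 0.45, coverage: 0.35, replayability: 0.20 };
const FRESHNESS_HALFLIFE_DAYS = 90;

// Exponential decay: the multiplier halves every FRESHNESS_HALFLIFE_DAYS.
function freshnessMultiplier(ageDays) {
  return Math.pow(0.5, ageDays / FRESHNESS_HALFLIFE_DAYS);
}

function claimScore(vector, ageDays) {
  const base =
    WEIGHTS.provenance * vector.provenance +
    WEIGHTS.coverage * vector.coverage +
    WEIGHTS.replayability * vector.replayability;
  return base * freshnessMultiplier(ageDays);
}
```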
---
## 7. Maintenance Tasks
### 7.1 Daily
- [ ] Review gate failure alerts
- [ ] Check verdict replay success rate
- [ ] Monitor trust vector stability
### 7.2 Weekly
- [ ] Review calibration epoch results
- [ ] Analyze conflict rate trends
- [ ] Update trust vectors for new sources
### 7.3 Monthly
- [ ] Audit high-drift sources
- [ ] Review and tune gate thresholds
- [ ] Clean up expired verdict manifests
---
## 8. Contact
- **On-call**: #trustlattice-oncall (Slack)
- **Escalation**: VEX Guild Lead
- **Documentation**: `docs/modules/excititor/trust-lattice.md`
---
*Document Version: 1.0.0*
*Sprint: 7100.0003.0002*

# Trust Lattice Troubleshooting Guide
> **Version**: 1.0.0
> **Last Updated**: 2025-12-22
> **Audience**: Support and Development teams
---
## Quick Reference
| Symptom | Likely Cause | Section |
|---------|--------------|---------|
| Low confidence scores | Stale VEX data or missing sources | [2.1](#21-low-confidence-scores) |
| Gate failures blocking builds | Threshold too high or source issues | [2.2](#22-gate-failures) |
| Verdict replay mismatches | Non-deterministic inputs | [2.3](#23-verdict-replay-failures) |
| Unexpected trust changes | Calibration drift | [2.4](#24-calibration-issues) |
| Conflicting verdicts | Multi-source disagreement | [2.5](#25-claim-conflicts) |
---
## 1. Diagnostic Commands
### 1.1 Check System Health
```bash
# Excititor health
curl https://api.example.com/excititor/health
# Policy Engine health
curl https://api.example.com/policy/health
# Authority health
curl https://api.example.com/authority/health
```
### 1.2 Trace a Verdict
```bash
# Get detailed verdict explanation
stella verdict explain <manifestId>
# Output includes:
# - All claims considered
# - Trust vector scores
# - Strength/freshness multipliers
# - Gate evaluation results
# - Conflict detection
```
### 1.3 Check VEX Source Status
```bash
# List all sources with status
stella vex source list
# Check specific source
stella vex source status vendor:redhat
# Sample output:
# Source: vendor:redhat
# Status: healthy
# Last fetch: 2025-12-22T10:00:00Z
# Documents: 15234
# Freshness: 2.3 hours
```
---
## 2. Common Issues
### 2.1 Low Confidence Scores
**Symptoms:**
- Verdicts have confidence < 0.5
- Many "under_investigation" statuses
**Diagnosis:**
1. Check claim freshness:
```bash
stella claim analyze --cve CVE-2025-12345 --asset sha256:abc123
# Look for:
# - Freshness multiplier < 0.5 (claim older than 180 days)
# - No high-trust sources
```
2. Check trust vector values:
```bash
stella trustvector show vendor:redhat
# Low scores indicate:
# - Signature verification issues (P)
# - Poor scope matching (C)
# - Non-deterministic outputs (R)
```
3. Check for missing VEX coverage:
```bash
stella vex coverage --purl pkg:npm/lodash@4.17.21
# No claims? Source may not cover this package
```
**Resolution:**
- If freshness is low: Check if source is publishing updates
- If trust vector is low: Review source verification settings
- If coverage is missing: Add additional VEX sources
### 2.2 Gate Failures
**Symptoms:**
- Builds failing with "Gate: MinimumConfidenceGate FAILED"
- Policy violations despite VEX claims
**Diagnosis:**
1. Check gate thresholds:
```bash
stella gates show minimumConfidence
# Thresholds:
# production: 0.75
# staging: 0.60
# development: 0.40
```
2. Compare with verdict confidence:
```bash
stella verdict show <manifestId> | grep confidence
# confidence: 0.68 <- Below 0.75 production threshold
```
3. Check which gate failed:
```bash
stella verdict gates <manifestId>
# Gates:
# MinimumConfidenceGate: FAILED (0.68 < 0.75)
# SourceQuotaGate: PASSED
# UnknownsBudgetGate: PASSED
```
**Resolution:**
- Temporary: Lower threshold (with approval)
- Long-term: Add corroborating VEX sources
- If single-source: Check SourceQuotaGate corroboration
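The MinimumConfidenceGate evaluation shown above amounts to a per-environment threshold comparison. A sketch using the thresholds from `stella gates show` (output format is illustrative):

```javascript
// Sketch of MinimumConfidenceGate using the thresholds shown above.
const THRESHOLDS = { production: 0.75, staging: 0.60, development: 0.40 };

function minimumConfidenceGate(confidence, environment) {
  const threshold = THRESHOLDS[environment];
  if (threshold === undefined) throw new Error(`unknown environment: ${environment}`);
  const passed = confidence >= threshold;
  return {
    gate: "MinimumConfidenceGate",
    passed,
    detail: passed ? "PASSED" : `FAILED (${confidence} < ${threshold})`,
  };
}
```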
### 2.3 Verdict Replay Failures
**Symptoms:**
- Replay verification returns success: false
- Audit failures due to non-determinism
**Diagnosis:**
1. Get detailed diff:
```bash
stella verdict replay --diff <manifestId>
# Differences:
# result.confidence: 0.82 -> 0.79
# inputs.vexDocumentDigests[2]: sha256:abc... (missing)
```
2. Common causes:
| Difference | Likely Cause |
|------------|--------------|
| VEX digest mismatch | Document was modified after verdict |
| Confidence delta | Clock cutoff drift (freshness calc) |
| Missing claims | Source was unavailable during replay |
| Different status | Policy version changed |
3. Check input availability:
```bash
# Verify all pinned inputs exist
stella cas verify --digest sha256:abc123
```
**Resolution:**
- Clock drift: Ensure NTP synchronization across nodes
- Missing inputs: Restore from backup or acknowledge drift
- Policy change: Compare policy hashes between original and replay
### 2.4 Calibration Issues
**Symptoms:**
- Trust vectors changed unexpectedly
- Accuracy metrics declining
**Diagnosis:**
1. Review recent calibrations:
```bash
stella calibration history vendor:redhat --epochs 5
# Epoch 42: accuracy=0.92, delta=(-0.02, +0.02, 0)
# Epoch 41: accuracy=0.94, delta=(-0.01, +0.01, 0)
```
2. Check comparison results:
```bash
stella calibration epoch 42 --details
# Total claims: 1500
# Correct: 1380
# False positives: 45
# False negatives: 75
# Detected bias: OptimisticBias
```
3. Check for data quality issues:
```bash
# Look for corrupted truth data
stella calibration validate-truth --epoch 42
```
**Resolution:**
- High false-positive rate: Reduce provenance score
- High false-negative rate: Review coverage matching
- Data quality issue: Re-run with corrected truth set
- Emergency: Rollback to previous epoch
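The epoch deltas above (e.g. `delta=(-0.02, +0.02, 0)` with `CALIBRATION_LEARNING_RATE=0.02`) are consistent with a simple learning-rate nudge per detected bias. The mechanism below is an assumption for illustration, not the documented calibration algorithm:

```javascript
// Hypothetical calibration step: on OptimisticBias, nudge provenance down and
// coverage up by the learning rate, clamped to [0, 1].
const LEARNING_RATE = 0.02;

function calibrate(vector, bias) {
  const clamp = (x) => Math.min(1, Math.max(0, x));
  if (bias === "OptimisticBias") {
    // Source over-claims not_affected: trust its provenance slightly less.
    return {
      provenance: clamp(vector.provenance - LEARNING_RATE),
      coverage: clamp(vector.coverage + LEARNING_RATE),
      replayability: vector.replayability,
    };
  }
  return { ...vector }; // no detected bias: leave the vector unchanged
}
```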
### 2.5 Claim Conflicts
**Symptoms:**
- Verdicts show hasConflicts: true
- Confidence reduced due to conflict penalty
**Diagnosis:**
1. View conflict details:
```bash
stella verdict conflicts <manifestId>
# Conflicts:
# vendor:redhat claims: not_affected
# hub:osv claims: affected
# Conflict penalty applied: 0.25
```
2. Investigate source disagreement:
```bash
# Get raw claims from each source
stella vex claim --source vendor:redhat --cve CVE-2025-12345
stella vex claim --source hub:osv --cve CVE-2025-12345
```
3. Check claim timestamps:
```bash
# Older claim may be outdated
stella claim compare vendor:redhat hub:osv --cve CVE-2025-12345
```
**Resolution:**
- If one source is stale: Flag for review
- If genuine disagreement: Higher-trust source wins (by design)
- If persistent: Consider source override in policy
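The "higher-trust source wins" rule combined with the 0.25 conflict penalty seen in `stella verdict conflicts` output can be sketched as follows. The real merge lives in the Policy service's ClaimScoreMerger; this is only illustrative:

```javascript
// Illustrative merge: the highest-scored claim sets the status; any status
// disagreement marks the verdict conflicted and applies the penalty.
const CONFLICT_PENALTY = 0.25;

function mergeClaims(claims) {
  if (claims.length === 0) throw new Error("no claims to merge");
  const sorted = [...claims].sort((a, b) => b.score - a.score);
  const winner = sorted[0];
  const hasConflicts = claims.some((c) => c.status !== winner.status);
  const confidence = hasConflicts
    ? Math.max(0, winner.score - CONFLICT_PENALTY)
    : winner.score;
  return { status: winner.status, confidence, hasConflicts };
}
```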
---
## 3. Performance Issues
### 3.1 Slow Claim Scoring
**Symptoms:**
- Scoring latency > 100ms
- Timeouts during high load
**Diagnosis:**
```bash
# Check scoring performance
stella perf scoring --samples 100
# Look for:
# - Cache miss rate
# - Trust vector lookups
# - Freshness calculation overhead
```
**Resolution:**
- Enable trust vector caching
- Pre-compute freshness for common cutoffs
- Scale Excititor horizontally
### 3.2 Slow Verdict Replay
**Symptoms:**
- Replay verification > 5 seconds
- Timeout during audit
**Diagnosis:**
```bash
# Check input retrieval time
stella verdict replay --timing <manifestId>
# Timing:
# Input fetch: 3.2s
# Score compute: 0.1s
# Merge: 0.05s
# Total: 3.35s
```
**Resolution:**
- Ensure CAS storage is local or cached
- Pre-warm verdict cache for critical assets
- Increase timeout for large manifests
---
## 4. Integration Issues
### 4.1 VEX Source Not Recognized
**Symptoms:**
- Claims from source not included in verdicts
- Source shows as "unknown" class
**Resolution:**
1. Register source in configuration:
```yaml
# etc/trust-lattice.yaml
sources:
  - id: vendor:newvendor
    class: vendor
    trustVector:
      provenance: 0.85
      coverage: 0.70
      replayability: 0.60
```
2. Reload configuration:
```bash
stella config reload --service excititor
```
### 4.2 Gate Not Evaluating
**Symptoms:**
- Expected gate not appearing in results
- Gate shows as "disabled"
**Resolution:**
1. Check gate configuration:
```bash
stella gates list --show-disabled
```
2. Enable gate:
```yaml
# etc/policy-gates.yaml
gates:
  minimumConfidence:
    enabled: true  # Ensure this is true
```
---
## 5. Support Information
### 5.1 Collecting Diagnostic Bundle
```bash
stella support bundle --include trust-lattice \
--since 1h --output /tmp/diag.zip
```
Bundle includes:
- Trust vector snapshots
- Recent verdicts
- Gate evaluations
- Calibration history
- System metrics
### 5.2 Log Locations
| Service | Log Path |
|---------|----------|
| Excititor | `/var/log/stellaops/excititor.log` |
| Policy | `/var/log/stellaops/policy.log` |
| Authority | `/var/log/stellaops/authority.log` |
### 5.3 Contact
- **Support**: support@stella-ops.org
- **Documentation**: `docs/modules/excititor/trust-lattice.md`
- **GitHub Issues**: https://github.com/stella-ops/stella-ops/issues
---
*Document Version: 1.0.0*
*Sprint: 7100.0003.0002*