partly or unimplemented features - now implemented

2026-02-09 08:53:51 +02:00
parent 1bf6bbf395
commit 4bdc298ec1
674 changed files with 90194 additions and 2271 deletions
--- a/docs/modules/gateway/architecture.md
+++ b/docs/modules/gateway/architecture.md
@@ -44,8 +44,10 @@ src/Gateway/
 │   ├── Middleware/
 │   │   ├── TenantMiddleware.cs             # Tenant context extraction
 │   │   ├── RequestRoutingMiddleware.cs     # HTTP → binary routing
-│   │   ├── AuthenticationMiddleware.cs     # DPoP/mTLS validation
-│   │   └── RateLimitingMiddleware.cs       # Per-tenant throttling
+│   │   ├── SenderConstraintMiddleware.cs   # DPoP/mTLS validation
+│   │   ├── IdentityHeaderPolicyMiddleware.cs # Identity header sanitization
+│   │   ├── CorrelationIdMiddleware.cs      # Request correlation
+│   │   └── HealthCheckMiddleware.cs        # Health probe handling
 │   ├── Services/
 │   │   ├── GatewayHostedService.cs         # Transport lifecycle
 │   │   ├── OpenApiAggregationService.cs    # Spec aggregation
@@ -329,9 +331,37 @@ gateway:

 ### Rate Limiting

- Per-tenant: Configurable requests/minute
- Per-identity: Burst protection
- Global: Circuit breaker for overload
+The Gateway uses the Router's dual-window rate-limiting middleware, which includes a circuit breaker:
+
+- **Instance-level** (in-memory): Per-router-instance limits using sliding window counters
+  - High-precision sub-second buckets for fair rate distribution
+  - No external dependencies; always available
+- **Environment-level** (Valkey-backed): Cross-instance limits for distributed deployments
+  - Atomic Lua scripts for consistent counting across instances
+  - Circuit breaker pattern for fail-open behavior when Valkey is unavailable
+- **Activation gate**: Environment-level checks activate only above a configurable traffic threshold
+- **Response headers**: `X-RateLimit-Limit`, `X-RateLimit-Remaining`, `X-RateLimit-Reset`, `Retry-After`
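
The fail-open behaviour described for the environment-level check can be illustrated with a small sketch. This is not the Router's implementation: the class name `FailOpenBreaker`, its parameters, and the use of `ConnectionError` to signal an unavailable store are assumptions made for illustration, and the sketch is written in Python rather than the Gateway's C#.

```python
import time
from typing import Callable, Optional


class FailOpenBreaker:
    """Hypothetical fail-open circuit breaker around a remote rate-limit check."""

    def __init__(self, failure_threshold: int = 3, timeout_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the breaker is closed

    def allow(self, remote_check: Callable[[], bool]) -> bool:
        """Return the remote decision, or True (fail open) if the store is down."""
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.timeout_seconds:
                # Open: skip the remote store entirely and let the request pass.
                return True
            # Timeout elapsed: allow one probe; a single failure re-opens the breaker.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            allowed = remote_check()
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return True  # fail open: an unavailable store never blocks traffic
        self.failures = 0
        return allowed
```

In this sketch the instance-level limiter would still apply while the breaker is open, so failing open removes only the cross-instance limit, not all protection.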
+
+Configuration via `appsettings.yaml`:
+```yaml
+rate_limiting:
+  process_back_pressure_when_more_than_per_5min: 5000
+  for_instance:
+    rules:
+      - max_requests: 100
+        per_seconds: 1
+      - max_requests: 1000
+        per_seconds: 60
+  for_environment:
+    valkey_connection: "localhost:6379"
+    rules:
+      - max_requests: 10000
+        per_seconds: 60
+    circuit_breaker:
+      failure_threshold: 3
+      timeout_seconds: 30
+      half_open_timeout: 10
+```
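
To make the `for_instance` rules concrete, here is a minimal sketch of a bucketed sliding-window counter that enforces several rules at once (for example 100 requests per second and 1000 per minute). The class names, the `buckets_per_window` parameter, and the all-rules-must-pass policy are assumptions for illustration, not the Router's actual code; the sketch uses Python rather than C#.

```python
import time
from collections import deque
from typing import Callable, Deque, List


class SlidingWindowRule:
    """One `max_requests` / `per_seconds` rule, counted in sub-window buckets."""

    def __init__(self, max_requests: int, per_seconds: float,
                 buckets_per_window: int = 10) -> None:
        self.max_requests = max_requests
        self.buckets_per_window = buckets_per_window
        self.bucket_width = per_seconds / buckets_per_window
        self.buckets: Deque[List[int]] = deque()  # [bucket_index, count], oldest first
        self.total = 0

    def _evict(self, now: float) -> None:
        # Drop buckets that have slid out of the window ending at `now`.
        cutoff = int(now // self.bucket_width) - self.buckets_per_window
        while self.buckets and self.buckets[0][0] <= cutoff:
            self.total -= self.buckets.popleft()[1]

    def would_allow(self, now: float) -> bool:
        self._evict(now)
        return self.total < self.max_requests

    def record(self, now: float) -> None:
        idx = int(now // self.bucket_width)
        if self.buckets and self.buckets[-1][0] == idx:
            self.buckets[-1][1] += 1
        else:
            self.buckets.append([idx, 1])
        self.total += 1


class InstanceRateLimiter:
    """In-memory limiter: a request passes only if every rule has capacity."""

    def __init__(self, rules: List[SlidingWindowRule],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rules = rules
        self.clock = clock

    def try_acquire(self) -> bool:
        now = self.clock()
        if not all(rule.would_allow(now) for rule in self.rules):
            return False  # rejected requests are not counted against any rule
        for rule in self.rules:
            rule.record(now)
        return True
```

With the values from the configuration above, the limiter would be built as `InstanceRateLimiter([SlidingWindowRule(100, 1), SlidingWindowRule(1000, 60)])`. Smaller buckets track the true window more closely at the cost of more bookkeeping.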

 ---

@@ -443,12 +473,80 @@ spec:
 | Feature | Sprint | Status |
 |---------|--------|--------|
 | Core implementation | 3600.0001.0001 | TODO |
+| Performance Testing Pipeline | 038 | DONE |
 | WebSocket support | Future | Planned |
 | gRPC passthrough | Future | Planned |
 | GraphQL aggregation | Future | Exploration |

 ---

+## 14) Performance Testing Pipeline (k6 + Prometheus + Correlation IDs)
+
+### Overview
+
+The Gateway includes a performance testing pipeline made up of k6 load tests,
+Prometheus metric instrumentation, and Grafana dashboards for performance curve modelling.
+
+### k6 Scenarios (A–G)
+
+| Scenario | Purpose | VUs | Duration | Key Metric |
+|----------|---------|-----|----------|------------|
+| A — Health Baseline | Sub-ms health probe overhead | 10 | 1 min | P95 < 10 ms |
+| B — OpenAPI Aggregation | Spec cache under concurrent readers | 50 | 75 s | P95 < 200 ms |
+| C — Routing Throughput | Mixed-method routing at target RPS | 200 | 2 min | P50 < 2 ms, P99 < 5 ms |
+| D — Correlation ID | Propagation overhead measurement | 20 | 1 min | P95 < 5 ms overhead |
+| E — Rate Limit Boundary | Enforcement correctness at boundary | 100 | 1 min | Retry-After header |
+| F — Connection Ramp | Transport saturation (ramp to 1000 VUs) | 1000 | 2 min | No 503 responses |
+| G — Steady-State Soak | Memory leak / resource exhaustion | 50 | 10 min | Stable memory |
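
The latency targets in the table are percentile thresholds. k6 evaluates thresholds itself during a run; the sketch below only shows the underlying arithmetic on a list of recorded latencies, using the nearest-rank percentile definition. The function names and the dictionary shape for thresholds are hypothetical, and k6 may compute percentiles differently.

```python
import math
from typing import Dict, List, Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("percentile of an empty sample set is undefined")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def check_thresholds(latencies_ms: Sequence[float],
                     thresholds: Dict[int, float]) -> List[str]:
    """Return one message per violated threshold; `thresholds` maps percentile -> limit in ms."""
    violations = []
    for p, limit in sorted(thresholds.items()):
        observed = percentile(latencies_ms, p)
        if not observed < limit:  # targets are strict, e.g. P99 < 5 ms
            violations.append(f"P{p}={observed} ms (limit < {limit} ms)")
    return violations
```

For Scenario C this would be called as `check_thresholds(latencies, {50: 2.0, 99: 5.0})`; an empty list means both targets were met.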
+
+Run all scenarios:
+```bash
+k6 run --env BASE_URL=https://gateway.stella-ops.local src/Gateway/__Tests/load/gateway_performance.k6.js
+```
+
+Run a single scenario:
+```bash
+k6 run --env BASE_URL=https://gateway.stella-ops.local --env SCENARIO=scenario_c_routing_throughput src/Gateway/__Tests/load/gateway_performance.k6.js
+```
+
+### Performance Metrics (GatewayPerformanceMetrics)
+
+Meter: `StellaOps.Gateway.Performance`
+
+| Instrument | Type | Unit | Description |
+|------------|------|------|-------------|
+| `gateway.requests.total` | Counter | — | Total requests processed |
+| `gateway.errors.total` | Counter | — | Errors (4xx/5xx) |
+| `gateway.ratelimit.total` | Counter | — | Rate-limited requests (429) |
+| `gateway.request.duration` | Histogram | ms | Full request duration |
+| `gateway.auth.duration` | Histogram | ms | Auth middleware duration |
+| `gateway.transport.duration` | Histogram | ms | TCP/TLS transport duration |
+| `gateway.routing.duration` | Histogram | ms | Instance selection duration |
+
+### Grafana Dashboard
+
+Dashboard: `devops/telemetry/dashboards/stella-ops-gateway-performance.json`
+UID: `stella-ops-gateway-performance`
+
+Panels:
+1. **Overview row** — P50/P99 gauges, error rate, RPS
+2. **Latency Distribution** — Percentile time series (overall + per-service)
+3. **Throughput & Rate Limiting** — RPS by service, rate-limited requests by route
+4. **Pipeline Breakdown** — Auth/Routing/Transport P95 breakdown, errors by status
+5. **Connections & Resources** — Active connections, endpoints, memory usage
+
+### C# Models
+
+| Type | Purpose |
+|------|---------|
+| `GatewayPerformanceObservation` | Single request observation (all pipeline phases) |
+| `PerformanceScenarioConfig` | Scenario definition with SLO thresholds |
+| `PerformanceCurvePoint` | Aggregated window data with computed RPS/error rate |
+| `PerformanceTestSummary` | Complete test run result with threshold violations |
+| `GatewayPerformanceMetrics` | OTel service emitting Prometheus-compatible metrics |
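
As a rough illustration of how per-request observations could be rolled up into curve points with computed RPS and error rate, consider the sketch below. The field names, the fixed window size, and the `aggregate` function are assumptions; the real `GatewayPerformanceObservation` and `PerformanceCurvePoint` types are C# and may be shaped differently. Errors are counted as 4xx/5xx responses, matching the `gateway.errors.total` description.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Observation:
    """Hypothetical stand-in for a single request observation."""
    timestamp_s: float
    status_code: int
    duration_ms: float


@dataclass(frozen=True)
class CurvePoint:
    """Hypothetical stand-in for one aggregated window."""
    window_start_s: float
    requests: int
    rps: float
    error_rate: float


def aggregate(observations: Iterable[Observation],
              window_s: float = 10.0) -> List[CurvePoint]:
    """Group observations into fixed windows and compute RPS and error rate per window."""
    buckets: Dict[int, Tuple[int, int]] = {}  # window index -> (total, errors)
    for obs in observations:
        idx = int(obs.timestamp_s // window_s)
        total, errors = buckets.get(idx, (0, 0))
        is_error = 1 if obs.status_code >= 400 else 0
        buckets[idx] = (total + 1, errors + is_error)
    return [
        CurvePoint(
            window_start_s=idx * window_s,
            requests=total,
            rps=total / window_s,
            error_rate=errors / total,
        )
        for idx, (total, errors) in sorted(buckets.items())
    ]
```

Windows with no traffic are simply absent from the result, so a consumer that plots the curve would need to decide whether to treat gaps as zero RPS.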
+
+---
+
 ## 14) References

 - Router Architecture: `docs/modules/router/architecture.md`