feat(docs): Add comprehensive documentation for Vexer, Vulnerability Explorer, and Zastava modules

- Introduced AGENTS.md, README.md, TASKS.md, and implementation_plan.md for Vexer, detailing mission, responsibilities, key components, and operational notes.
- Established similar documentation structure for Vulnerability Explorer and Zastava modules, including their respective workflows, integrations, and observability notes.
- Created risk scoring profiles documentation outlining the core workflow, factor model, governance, and deliverables.
- Ensured all modules adhere to the Aggregation-Only Contract and maintain determinism and provenance in outputs.
2025-10-30 00:09:39 +02:00
parent 3154c67978
commit 7b5bdcf4d3
503 changed files with 16136 additions and 54638 deletions

@@ -0,0 +1,83 @@
# Authority Monitoring & Alerting Playbook
## Telemetry Sources
- **Traces:** Activity source `StellaOps.Authority` emits spans for every token flow (`authority.token.validate_*`, `authority.token.handle_*`, `authority.token.validate_access`). Key tags include `authority.endpoint`, `authority.grant_type`, `authority.username`, `authority.client_id`, and `authority.identity_provider`.
- **Metrics:** OpenTelemetry instrumentation (`AddAspNetCoreInstrumentation`, `AddHttpClientInstrumentation`, custom meter `StellaOps.Authority`) exports:
  - `http.server.request.duration` histogram (`http_route`, `http_status_code`, `authority.endpoint` tag via `aspnetcore` enrichment).
  - `process.runtime.gc.*`, `process.runtime.dotnet.*` (from `AddRuntimeInstrumentation`).
- **Logs:** Serilog writes structured events to stdout. Notable templates:
  - `"Password grant verification failed ..."` and `"Plugin {PluginName} denied access ... due to lockout"` (lockout spike detector).
  - `"Password grant validation failed for {Username}: provider '{Provider}' does not support MFA required for exception approvals."` (identifies users attempting `exceptions:approve` without MFA support; tie to fresh-auth errors).
  - `"Client credentials validation failed for {ClientId}: exception scopes require tenant assignment."` (signals misconfigured exception service identities).
  - `"Granting StellaOps bypass for remote {RemoteIp}"` (bypass usage).
  - `"Rate limit exceeded for path {Path} from {RemoteIp}"` (limiter alerts).
## Prometheus Metrics to Collect
| Metric | Query | Purpose |
| --- | --- | --- |
| `token_requests_total` | `sum by (grant_type, status) (rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))` | Token issuance volume per grant type (`grant_type` comes via `authority.grant_type` span attribute → Exemplars in Grafana). |
| `token_failure_ratio` | `sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token", http_status_code=~"4..\|5.."}[5m])) / sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))` | Share of `/token` requests returning 4xx or 5xx; alert when above 5% for 10 minutes. |
| `authorize_rate_limit_hits` | `sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority", limiter="authority-token"}[5m]))` | Detect rate limiter saturation (requires the OTEL ASP.NET rate limiter exporter). |
| `lockout_events` | `sum by (plugin) (rate(log_messages_total{app="stellaops-authority", level="Warning", message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[5m]))` | Derived from Loki/Promtail log counter. |
| `bypass_usage_total` | `sum(rate(log_messages_total{app="stellaops-authority", level="Information", message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m]))` | Track trusted bypass invocations. |
> **Exporter note:** Enable `aspnetcore` meters (`dotnet-counters` name `Microsoft.AspNetCore.Hosting`), or configure the OpenTelemetry Collector `metrics` pipeline with `metric_statements` to remap histogram counts into the shown series.
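
The table pairs each metric name with the query that produces it, which maps naturally onto Prometheus recording rules. The following is a minimal sketch under that assumption: the group name is hypothetical, and treating the "Metric" column as recording-rule names is an interpretation rather than something the playbook states. The log-derived rows (`lockout_events`, `bypass_usage_total`) would follow the same pattern.

```yaml
# Sketch of a Prometheus rule file; the group name is hypothetical.
groups:
  - name: stellaops-authority-recording
    rules:
      # Token issuance volume, per the first table row.
      - record: token_requests_total
        expr: |
          sum by (grant_type, status) (
            rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m])
          )
      # Failed (4xx/5xx) token requests divided by all token requests.
      - record: token_failure_ratio
        expr: |
          sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token", http_status_code=~"4..|5.."}[5m]))
          /
          sum(rate(http_server_duration_seconds_count{service_name="stellaops-authority", http_route="/token"}[5m]))
      # Rejections from the token rate limiter.
      - record: authorize_rate_limit_hits
        expr: |
          sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority", limiter="authority-token"}[5m]))
```

Recording the ratio once keeps alert expressions short and lets dashboards reuse the same series.
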
## Alert Rules
1. **Token Failure Surge**
   - _Expression_: `token_failure_ratio > 0.05`
   - _For_: `10m`
   - _Labels_: `severity="critical"`
   - _Annotations_: Include `topk(5, sum by (authority_identity_provider) (increase(authority_token_rejections_total[10m])))` as a diagnostic hint (requires span → metric transformation).
2. **Lockout Spike**
   - _Expression_: `sum(rate(log_messages_total{message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[15m])) > 10`
   - _For_: `15m`
   - Investigate credential stuffing; consider temporarily tightening `RateLimiting.Token`.
3. **Bypass Threshold**
   - _Expression_: `sum(rate(log_messages_total{message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m])) > 1`
   - _For_: `5m`
   - Alert severity `warning`; verify the calling host list.
4. **Rate Limiter Saturation**
   - _Expression_: `sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority"}[5m])) > 0`
   - Escalate if sustained for 5 minutes; confirm trusted clients aren't misconfigured.
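
The four rules above could be expressed as a Prometheus alerting rule file along these lines. This is a sketch, not the shipped configuration: the group and alert names are hypothetical, `token_failure_ratio` is assumed to be available as a recorded series, and only the `for` and `severity` values stated in the playbook are set.

```yaml
groups:
  - name: stellaops-authority-alerts   # hypothetical group name
    rules:
      - alert: AuthorityTokenFailureSurge   # hypothetical alert name
        expr: token_failure_ratio > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          # Query to run by hand while triaging; needs span-to-metric transformation.
          diagnostic_query: 'topk(5, sum by (authority_identity_provider) (increase(authority_token_rejections_total[10m])))'
      - alert: AuthorityLockoutSpike
        expr: |
          sum(rate(log_messages_total{message_template="Plugin {PluginName} denied access for {Username} due to lockout (retry after {RetryAfter})."}[15m])) > 10
        for: 15m
      - alert: AuthorityBypassThreshold
        expr: |
          sum(rate(log_messages_total{message_template="Granting StellaOps bypass for remote {RemoteIp}; required scopes {RequiredScopes}."}[5m])) > 1
        for: 5m
        labels:
          severity: warning
      - alert: AuthorityRateLimiterSaturation
        # The playbook gives no `for` clause here; escalation after 5 minutes is a manual step.
        expr: |
          sum(rate(aspnetcore_rate_limiting_rejections_total{service_name="stellaops-authority"}[5m])) > 0
```

Block scalars (`expr: |`) avoid having to escape the double quotes and braces inside the label matchers.
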
## Grafana Dashboard
- Import `docs/modules/authority/operations/grafana-dashboard.json` to provision baseline panels:
  - **Token Success vs Failure:** stacked rate visualization split by grant type.
  - **Rate Limiter Hits:** bar chart showing `authority-token` and `authority-authorize`.
  - **Bypass & Lockout Events:** dual-stat panel using Loki-derived counters.
  - **Trace Explorer Link:** panel links to `StellaOps.Authority` span search pre-filtered by `authority.grant_type`.
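
Where a manual import is impractical (for example in air-gapped clusters), Grafana's file-based dashboard provisioning is one way to load the JSON at startup. The sketch below assumes a provider name and a target directory that are not part of the playbook; adjust both to the deployment.

```yaml
# Hypothetical provisioning file, placed in Grafana's provisioning/dashboards directory.
apiVersion: 1
providers:
  - name: stellaops-authority        # assumed provider name
    type: file
    disableDeletion: true            # keep the dashboard if the file is temporarily absent
    options:
      # Assumed directory holding a copy of grafana-dashboard.json.
      path: /var/lib/grafana/dashboards/authority
```
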
## Collector Configuration Snippets
```yaml
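receivers:
  otlp:
    protocols:
      http:
exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"
  # Assumption: the logs pipeline below references a `loki` exporter that the
  # original snippet never defines. The endpoint is a placeholder, and exporter
  # availability depends on the Collector distribution and version in use.
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"
processors:
  batch:
  attributes/token_grant:
    actions:
      # Copy the dotted attribute onto the label name used by the queries.
      - key: grant_type
        action: upsert
        from_attribute: authority.grant_type
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [attributes/token_grant, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]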
```
## Operational Checklist
- [ ] Confirm `STELLAOPS_AUTHORITY__OBSERVABILITY__EXPORTERS` enables OTLP in production builds.
- [ ] Ensure Promtail captures container stdout with Serilog structured formatting.
- [ ] Periodically validate alert noise by running load tests that trigger the rate limiter.
- [ ] Include dashboard JSON in Offline Kit for air-gapped clusters; update version header when metrics change.