Add Policy DSL Validator, Schema Exporter, and Simulation Smoke tools

- Implemented PolicyDslValidator with command-line options for strict mode and JSON output.
- Created PolicySchemaExporter to generate JSON schemas for policy-related models.
- Developed PolicySimulationSmoke tool to validate policy simulations against expected outcomes.
- Added project files and necessary dependencies for each tool.
- Ensured proper error handling and usage instructions across tools.
2025-10-27 08:00:11 +02:00
parent 651b8e0fa3
commit 96d52884e8
712 changed files with 49449 additions and 6124 deletions

deploy/telemetry/.gitignore

@@ -0,0 +1 @@
certs/


@@ -0,0 +1,35 @@
# Telemetry Collector Assets

These assets provision the default OpenTelemetry Collector instance required by
`DEVOPS-OBS-50-001`. The collector acts as the secured ingest point for traces,
metrics, and logs emitted by StellaOps services.

## Contents

| File | Purpose |
| ---- | ------- |
| `otel-collector-config.yaml` | Baseline collector configuration (mutual TLS, OTLP receivers, Prometheus exporter). |
| `storage/prometheus.yaml` | Prometheus scrape configuration tuned for the collector and service tenants. |
| `storage/tempo.yaml` | Tempo configuration with multitenancy, WAL, and compaction settings. |
| `storage/loki.yaml` | Loki configuration enabling multitenant log ingestion with retention policies. |
| `storage/tenants/*.yaml` | Per-tenant overrides for Tempo and Loki rate/retention controls. |

## Development workflow

1. Generate development certificates (collector + client) using
   `ops/devops/telemetry/generate_dev_tls.sh`.
2. Launch the collector via `docker compose -f docker-compose.telemetry.yaml up`.
3. Launch the storage backends (Prometheus, Tempo, Loki) via
   `docker compose -f docker-compose.telemetry-storage.yaml up`.
4. Run the smoke test: `python ops/devops/telemetry/smoke_otel_collector.py`.
5. Explore the storage configuration (`storage/README.md`) to tune retention/limits.

The smoke test sends OTLP traffic over TLS and asserts the collector accepted
traces, metrics, and logs by scraping the Prometheus metrics endpoint.
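
The assertion step of such a smoke test can be sketched as follows. This is a minimal illustration, not the contents of `smoke_otel_collector.py`: the metric names in `EXPECTED` and the helper names `sum_samples` and `missing_signals` are assumptions, and the sample text stands in for a real scrape of the metrics endpoint.

```python
from collections import defaultdict

# Assumed counter names; the real smoke script may check different metrics.
EXPECTED = (
    "otelcol_receiver_accepted_spans",
    "otelcol_receiver_accepted_metric_points",
    "otelcol_receiver_accepted_log_records",
)


def sum_samples(exposition: str) -> dict[str, float]:
    """Sum sample values per metric name in Prometheus text exposition output.

    Simplified parser: it ignores timestamps and assumes the value is the
    first token after the label set.
    """
    totals: dict[str, float] = defaultdict(float)
    for raw in exposition.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name = line[: line.index("{")]
            value = line[line.rindex("}") + 1:].split()[0]
        else:
            name, value = line.split()[:2]
        totals[name] += float(value)
    return dict(totals)


def missing_signals(exposition: str, expected=EXPECTED) -> list[str]:
    """Return the expected counters that are absent or still at zero."""
    totals = sum_samples(exposition)
    return [name for name in expected if totals.get(name, 0.0) <= 0]


# Hypothetical scrape result used only to exercise the helpers.
SAMPLE = """\
# TYPE otelcol_receiver_accepted_spans counter
otelcol_receiver_accepted_spans{receiver="otlp",transport="grpc"} 3
otelcol_receiver_accepted_metric_points{receiver="otlp",transport="grpc"} 5
otelcol_receiver_accepted_log_records{receiver="otlp",transport="grpc"} 0
"""

if __name__ == "__main__":
    print(missing_signals(SAMPLE))
```

With the sample above, only the log-record counter is reported as missing, because its value is still zero.
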

## Kubernetes

The Helm chart consumes the same configuration (see `values.yaml`). Provide TLS
material via a secret referenced by `telemetry.collector.tls.secretName`,
containing `ca.crt`, `tls.crt`, and `tls.key`. Client certificates are required
for ingestion and should be issued by the same CA.
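
Before installing the chart, it can help to check that the secret carries all three entries. The sketch below is an assumption-laden illustration: `check_tls_secret` is a hypothetical helper, and it expects the secret's `data` map as Kubernetes stores it, with base64-encoded values. It only checks that each entry decodes and looks like PEM; it does not verify the certificate chain.

```python
import base64

# Keys the README says the secret must contain.
REQUIRED_KEYS = ("ca.crt", "tls.crt", "tls.key")


def check_tls_secret(data: dict[str, str]) -> list[str]:
    """Return a list of problems found in a secret's base64-encoded data map."""
    problems = []
    for key in REQUIRED_KEYS:
        encoded = data.get(key)
        if not encoded:
            problems.append(f"{key}: missing")
            continue
        try:
            decoded = base64.b64decode(encoded, validate=True)
        except ValueError:  # binascii.Error is a subclass of ValueError
            problems.append(f"{key}: not valid base64")
            continue
        if not decoded.lstrip().startswith(b"-----BEGIN "):
            problems.append(f"{key}: does not look like PEM")
    return problems


if __name__ == "__main__":
    # Placeholder PEM header; real material comes from your CA.
    pem = base64.b64encode(b"-----BEGIN CERTIFICATE-----\n").decode()
    print(check_tls_secret({"ca.crt": pem, "tls.crt": pem}))
```

An empty result means all three entries are present and plausibly PEM-encoded.
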


@@ -0,0 +1,67 @@
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: ${STELLAOPS_OTEL_TLS_CERT:?STELLAOPS_OTEL_TLS_CERT not set}
          key_file: ${STELLAOPS_OTEL_TLS_KEY:?STELLAOPS_OTEL_TLS_KEY not set}
          client_ca_file: ${STELLAOPS_OTEL_TLS_CA:?STELLAOPS_OTEL_TLS_CA not set}
          require_client_certificate: ${STELLAOPS_OTEL_REQUIRE_CLIENT_CERT:true}
      http:
        endpoint: 0.0.0.0:4318
        tls:
          cert_file: ${STELLAOPS_OTEL_TLS_CERT:?STELLAOPS_OTEL_TLS_CERT not set}
          key_file: ${STELLAOPS_OTEL_TLS_KEY:?STELLAOPS_OTEL_TLS_KEY not set}
          client_ca_file: ${STELLAOPS_OTEL_TLS_CA:?STELLAOPS_OTEL_TLS_CA not set}
          require_client_certificate: ${STELLAOPS_OTEL_REQUIRE_CLIENT_CERT:true}
processors:
  attributes/tenant-tag:
    actions:
      - key: tenant.id
        action: insert
        value: ${STELLAOPS_TENANT_ID:unknown}
  batch:
    send_batch_size: 1024
    timeout: 5s
exporters:
  logging:
    verbosity: normal
  prometheus:
    endpoint: ${STELLAOPS_OTEL_PROMETHEUS_ENDPOINT:0.0.0.0:9464}
    enable_open_metrics: true
    metric_expiration: 5m
    tls:
      cert_file: ${STELLAOPS_OTEL_TLS_CERT:?STELLAOPS_OTEL_TLS_CERT not set}
      key_file: ${STELLAOPS_OTEL_TLS_KEY:?STELLAOPS_OTEL_TLS_KEY not set}
      client_ca_file: ${STELLAOPS_OTEL_TLS_CA:?STELLAOPS_OTEL_TLS_CA not set}
  # Additional OTLP exporters can be configured by extending this section at runtime.
  # For example, set STELLAOPS_OTEL_UPSTREAM_ENDPOINT and mount certificates, then
  # add the exporter via a sidecar overlay.
extensions:
  health_check:
    endpoint: ${STELLAOPS_OTEL_HEALTH_ENDPOINT:0.0.0.0:13133}
  pprof:
    endpoint: ${STELLAOPS_OTEL_PPROF_ENDPOINT:0.0.0.0:1777}
service:
  telemetry:
    logs:
      level: ${STELLAOPS_OTEL_LOG_LEVEL:info}
  extensions: [health_check, pprof]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes/tenant-tag, batch]
      exporters: [logging]
    metrics:
      receivers: [otlp]
      processors: [attributes/tenant-tag, batch]
      exporters: [logging, prometheus]
    logs:
      receivers: [otlp]
      processors: [attributes/tenant-tag, batch]
      exporters: [logging]


@@ -0,0 +1,33 @@
# Telemetry Storage Stack

Configuration snippets for the default StellaOps observability backends used in
staging and production environments. The stack comprises:

- **Prometheus** for metrics (scraping the collector's Prometheus exporter)
- **Tempo** for traces (OTLP ingest via mTLS)
- **Loki** for logs (HTTP ingest with tenant isolation)

## Files

| Path | Description |
| ---- | ----------- |
| `prometheus.yaml` | Scrape configuration for the collector (mTLS + bearer token placeholder). |
| `tempo.yaml` | Tempo configuration with multitenancy enabled and local storage paths. |
| `loki.yaml` | Loki configuration enabling per-tenant overrides and boltdb-shipper storage. |
| `tenants/tempo-overrides.yaml` | Example tenant overrides for Tempo (retention, limits). |
| `tenants/loki-overrides.yaml` | Example tenant overrides for Loki (rate limits, retention). |
| `auth/` | Placeholder directory for Prometheus bearer token files (e.g., `token`). |

These configurations are referenced by the Docker Compose overlay
(`deploy/compose/docker-compose.telemetry-storage.yaml`) and the staging rollout documented in
`docs/ops/telemetry-storage.md`. Adjust paths, credentials, and overrides before running in
connected environments. Place the Prometheus bearer token in `auth/token` when using the
Compose overlay (the directory contains a `.gitkeep` placeholder and is gitignored by default).

## Security

- Both Tempo and Loki require mutual TLS.
- Prometheus uses mTLS plus a bearer token that should be minted by Authority.
- Update the overrides files to enforce per-tenant retention/ingestion limits.

For comprehensive deployment steps see `docs/ops/telemetry-storage.md`.
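
To reason about what a given tenant ends up with after editing the overrides files, the lookup can be sketched as below. This is an illustration only: `effective_limits` is a hypothetical helper, the dictionary mirrors the example values in `tenants/loki-overrides.yaml`, and treating `__default__` as a fallback merged under tenant-specific values is an assumption about the intended semantics, not confirmed Loki or Tempo behavior.

```python
# Mirrors the example values from tenants/loki-overrides.yaml.
LOKI_OVERRIDES = {
    "stellaops-dev": {
        "ingestion_rate_mb": 10,
        "ingestion_burst_size_mb": 20,
        "max_global_streams_per_user": 5000,
        "retention_period": "168h",
    },
    "stellaops-stage": {
        "ingestion_rate_mb": 20,
        "ingestion_burst_size_mb": 40,
        "max_global_streams_per_user": 10000,
        "retention_period": "336h",
    },
    "__default__": {
        "ingestion_rate_mb": 5,
        "ingestion_burst_size_mb": 10,
        "retention_period": "72h",
    },
}


def effective_limits(overrides: dict, tenant: str) -> dict:
    """Merge the assumed `__default__` entry with the tenant's own overrides."""
    merged = dict(overrides.get("__default__", {}))
    merged.update(overrides.get(tenant, {}))  # tenant values win
    return merged


if __name__ == "__main__":
    print(effective_limits(LOKI_OVERRIDES, "stellaops-dev"))
    print(effective_limits(LOKI_OVERRIDES, "some-other-tenant"))
```

Under this assumption, a tenant without its own entry receives only the `__default__` limits.
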


@@ -0,0 +1,48 @@
auth_enabled: true
server:
  http_listen_port: 3100
  log_level: info
common:
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory
  replication_factor: 1
  path_prefix: /var/loki
schema_config:
  configs:
    - from: 2024-01-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v13
      index:
        prefix: loki_index_
        period: 24h
storage_config:
  filesystem:
    directory: /var/loki/chunks
  boltdb_shipper:
    active_index_directory: /var/loki/index
    cache_location: /var/loki/index_cache
    shared_store: filesystem
ruler:
  storage:
    type: local
    local:
      directory: /var/loki/rules
  rule_path: /tmp/loki-rules
  enable_api: true
limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  max_entries_limit_per_query: 5000
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  per_tenant_override_config: /etc/telemetry/tenants/loki-overrides.yaml


@@ -0,0 +1,19 @@
global:
  scrape_interval: 15s
  evaluation_interval: 30s
scrape_configs:
  - job_name: "stellaops-otel-collector"
    scheme: https
    metrics_path: /
    tls_config:
      ca_file: ${PROMETHEUS_TLS_CA_FILE:-/etc/telemetry/tls/ca.crt}
      cert_file: ${PROMETHEUS_TLS_CERT_FILE:-/etc/telemetry/tls/client.crt}
      key_file: ${PROMETHEUS_TLS_KEY_FILE:-/etc/telemetry/tls/client.key}
      insecure_skip_verify: false
    authorization:
      type: Bearer
      credentials_file: ${PROMETHEUS_BEARER_TOKEN_FILE:-/etc/telemetry/auth/token}
    static_configs:
      - targets:
          - ${PROMETHEUS_COLLECTOR_TARGET:-stellaops-otel-collector:9464}


@@ -0,0 +1,56 @@
multitenancy_enabled: true
usage_report:
  reporting_enabled: false
server:
  http_listen_port: 3200
  log_level: info
distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          tls:
            cert_file: ${TEMPO_TLS_CERT_FILE:-/etc/telemetry/tls/server.crt}
            key_file: ${TEMPO_TLS_KEY_FILE:-/etc/telemetry/tls/server.key}
            client_ca_file: ${TEMPO_TLS_CA_FILE:-/etc/telemetry/tls/ca.crt}
            require_client_cert: true
        http:
          tls:
            cert_file: ${TEMPO_TLS_CERT_FILE:-/etc/telemetry/tls/server.crt}
            key_file: ${TEMPO_TLS_KEY_FILE:-/etc/telemetry/tls/server.key}
            client_ca_file: ${TEMPO_TLS_CA_FILE:-/etc/telemetry/tls/ca.crt}
            require_client_cert: true
ingester:
  lifecycler:
    ring:
      instance_availability_zone: ${TEMPO_ZONE:-zone-a}
  trace_idle_period: 10s
  max_block_bytes: 1_048_576
compactor:
  compaction:
    block_retention: 168h
metrics_generator:
  registry:
    external_labels:
      cluster: stellaops
storage:
  trace:
    backend: local
    local:
      path: /var/tempo/traces
    wal:
      path: /var/tempo/wal
  metrics:
    backend: prometheus
overrides:
  defaults:
    ingestion_rate_limit_bytes: 1048576
    max_traces_per_user: 200000
  per_tenant_override_config: /etc/telemetry/tenants/tempo-overrides.yaml


@@ -0,0 +1,19 @@
# Example Loki per-tenant overrides
# Adjust according to https://grafana.com/docs/loki/latest/configuration/#limits_config
stellaops-dev:
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  max_global_streams_per_user: 5000
  retention_period: 168h
stellaops-stage:
  ingestion_rate_mb: 20
  ingestion_burst_size_mb: 40
  max_global_streams_per_user: 10000
  retention_period: 336h
__default__:
  ingestion_rate_mb: 5
  ingestion_burst_size_mb: 10
  retention_period: 72h


@@ -0,0 +1,16 @@
# Example Tempo per-tenant overrides
# Consult https://grafana.com/docs/tempo/latest/configuration/#limits-configuration
# before applying in production.
stellaops-dev:
  traces_per_second_limit: 100000
  max_bytes_per_trace: 10485760
  max_search_bytes_per_trace: 20971520
stellaops-stage:
  traces_per_second_limit: 200000
  max_bytes_per_trace: 20971520
__default__:
  traces_per_second_limit: 50000
  max_bytes_per_trace: 5242880