# Policy observability

## Purpose

- Capture Policy Engine metrics, logs, traces, and incident workflows.

## Metrics

- policy_run_seconds{tenant,policy,mode}
- policy_run_queue_depth{tenant}
- policy_run_failures_total{tenant,policy,reason}
- policy_run_retries_total{tenant,policy}
- policy_run_inputs_pending_bytes{tenant}
- policy_rules_fired_total{tenant,policy,rule}
- policy_vex_overrides_total{tenant,policy,vendor,justification}
- policy_suppressions_total{tenant,policy,action}
- policy_selection_batch_duration_seconds{tenant,policy}
- policy_materialization_conflicts_total{tenant,policy}
- policy_api_requests_total{endpoint,method,status}
- policy_api_latency_seconds{endpoint,method}
- policy_api_rate_limited_total{endpoint}
- policy_queue_leases_active{tenant}
- policy_queue_lease_expirations_total{tenant}
- policy_delta_backlog_age_seconds{tenant,source}

## Logs

- Structured JSON with policyId, policyVersion, tenant, runId, rule, traceId, env.sealed.
- Categories: policy.run, policy.evaluate, policy.materialize, policy.simulate, policy.lifecycle.
- Rule-hit logs sample at 1% by default; incident mode raises to 100%.

## Traces

- policy.api, policy.select, policy.evaluate, policy.materialize, policy.simulate.
- Trace context propagated to CLI and UI.

## Alerts

- PolicyRunSlaBreach: p95 policy_run_seconds too high.
- PolicyQueueStuck: policy_delta_backlog_age_seconds > 600.
- DeterminismMismatch: ERR_POL_004 or replay diff.
- SimulationDrift: simulation exit 20 over threshold.
- VexOverrideSpike and SuppressionSurge.

## Incident mode

- POST /api/policy/incidents/activate toggles sampling to 100%.
- Retention extends to 30 days during incident.
- policy.incident.activated event emitted.

## Integration points

- Authority metrics for scope_denied events.
- Concelier and Excititor trace propagation via gRPC metadata.
- Offline kits export metrics and logs snapshots.
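
The rule-hit sampling described under Logs and Incident mode (1% by default, 100% while an incident is active) can be illustrated with a minimal Python sketch. It is not the Policy Engine's implementation: `RuleHitSampler`, `format_rule_hit`, and the `category` key are hypothetical names chosen for illustration, and treating `env.sealed` as a flat key rather than a nested object is an assumption.

```python
import json
import random

DEFAULT_SAMPLE_RATE = 0.01   # rule-hit logs sample at 1% by default
INCIDENT_SAMPLE_RATE = 1.0   # incident mode raises sampling to 100%


class RuleHitSampler:
    """Hypothetical helper that decides whether a rule-hit log line is emitted."""

    def __init__(self, rng=None):
        self.incident_mode = False
        # An injectable RNG keeps the sampler reproducible in tests.
        self._rng = rng if rng is not None else random.Random()

    @property
    def sample_rate(self):
        return INCIDENT_SAMPLE_RATE if self.incident_mode else DEFAULT_SAMPLE_RATE

    def should_log(self):
        # random() returns a float in [0.0, 1.0), so a rate of 1.0 always passes.
        return self._rng.random() < self.sample_rate


def format_rule_hit(policy_id, policy_version, tenant, run_id, rule, trace_id, sealed):
    """Build one structured JSON log line carrying the fields listed under Logs."""
    record = {
        "category": "policy.evaluate",  # assumed key name for the log category
        "policyId": policy_id,
        "policyVersion": policy_version,
        "tenant": tenant,
        "runId": run_id,
        "rule": rule,
        "traceId": trace_id,
        "env.sealed": sealed,           # assumed flat key; could equally be nested
    }
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    sampler = RuleHitSampler()
    sampler.incident_mode = True  # e.g. after an incident has been activated
    if sampler.should_log():
        print(format_rule_hit("p-1", 3, "tenant-a", "run-42", "rule-7", "trace-abc", False))
```

Keeping the rate as a property of a single flag means activating or clearing an incident changes sampling immediately, with no per-rule configuration to update.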