# P0 Product Metrics

> **Sprint:** SPRINT_20260117_028_Telemetry_p0_metrics
> **Task:** P0M-007 - Documentation

This document describes the four P0 (highest priority) product-level metrics for tracking Stella Ops operational health.

## Overview

These metrics serve as the primary scoreboard for product health and should guide prioritization decisions. Per the AI Economics Moat advisory: "Prioritize work that improves them."

| Metric | Target | Alert Threshold |
|--------|--------|-----------------|
| Time to First Verified Release | P90 < 4 hours | P90 > 24 hours |
| Mean Time to Answer "Why Blocked" | P90 < 5 minutes | P90 > 1 hour |
| Support Minutes per Customer | Trend toward 0 | > 30 min/month |
| Determinism Regressions | Zero | Any policy-level |

---

## Metric 1: Time to First Verified Release

**Name:** `stella_time_to_first_verified_release_seconds`
**Type:** Histogram

### Definition

Elapsed time from fresh install (first service startup) to first successful verified promotion (policy gate passed, evidence recorded).

### Labels

| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `deployment_type` | `fresh`, `upgrade` | Type of installation |

### Histogram Buckets

5m, 15m, 30m, 1h, 2h, 4h, 8h, 24h, 48h, 168h (1 week)

### Collection Points

1. **Install timestamp** - Recorded on first Authority service startup
2. **First promotion** - Recorded in Release Orchestrator on first verified promotion
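As a rough illustration of how these two collection points could feed the histogram, here is a minimal C# sketch using .NET's `System.Diagnostics.Metrics` API. The class, method, and meter names below are assumptions for this example, not the actual contents of `P0ProductMetrics.cs`:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical sketch; the real definitions live in P0ProductMetrics.cs.
public static class TimeToFirstReleaseSketch
{
    private static readonly Meter P0Meter = new("StellaOps.Telemetry");

    // Backs stella_time_to_first_verified_release_seconds.
    private static readonly Histogram<double> TimeToFirstVerifiedRelease =
        P0Meter.CreateHistogram<double>(
            "stella_time_to_first_verified_release_seconds",
            unit: "s",
            description: "Fresh install to first verified promotion");

    // Called once on the first verified promotion; installTimestamp is the
    // value persisted at first Authority startup.
    public static void RecordFirstVerifiedRelease(
        DateTimeOffset installTimestamp,
        string tenant,
        string deploymentType) // "fresh" or "upgrade"
    {
        double elapsedSeconds = (DateTimeOffset.UtcNow - installTimestamp).TotalSeconds;
        TimeToFirstVerifiedRelease.Record(
            elapsedSeconds,
            new KeyValuePair<string, object?>("tenant", tenant),
            new KeyValuePair<string, object?>("deployment_type", deploymentType));
    }
}
```

Note that with this API the explicit bucket boundaries listed above would typically be applied on the collection side (for example, via an OpenTelemetry view) rather than in the instrument definition itself.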
### Why This Matters

A short time-to-first-release indicates:

- Good onboarding experience
- Clear documentation
- Sensible default configurations
- Working integrations

### Dashboard Usage

The Grafana dashboard shows:

- Histogram heatmap of the time distribution
- P50/P90/P99 statistics
- Trend over time

### Alert Response

**Warning (P90 > 4 hours):**

1. Review recent onboarding experiences
2. Check for common configuration issues
3. Review documentation clarity

**Critical (P90 > 24 hours):**

1. Investigate blocked customers
2. Check for integration failures
3. Consider guided onboarding assistance

---

## Metric 2: Mean Time to Answer "Why Blocked"

**Name:** `stella_why_blocked_latency_seconds`
**Type:** Histogram

### Definition

Time from a block decision to the user viewing its explanation (via CLI, UI, or API).

### Labels

| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `surface` | `cli`, `ui`, `api` | Interface used to view explanation |
| `resolution_type` | `immediate`, `delayed` | Same session vs. different session |

### Histogram Buckets

1s, 5s, 30s, 1m, 5m, 15m, 1h, 4h, 24h

### Collection Points

1. **Block decision** - Timestamp stored in the verdict
2. **Explanation view** - Tracked when `stella explain block` or the UI equivalent is invoked

### Why This Matters

Short "why blocked" latency indicates:

- Clear block messaging
- Discoverable explanation tools
- Good explainability UX

Long latency may indicate:

- Users confused about where to find answers
- Documentation gaps
- UX friction

### Dashboard Usage

The Grafana dashboard shows:

- Histogram heatmap of the latency distribution
- Trend line over time
- Breakdown by surface (CLI vs. UI vs. API)

### Alert Response

**Warning (P90 > 5 minutes):**

1. Review block notification messaging
2. Check CLI command discoverability
3. Verify UI links are prominent

**Critical (P90 > 1 hour):**

1. Investigate user flows
2. Add proactive notifications
3. Review documentation and help text

---

## Metric 3: Support Minutes per Customer

**Name:** `stella_support_burden_minutes_total`
**Type:** Counter

### Definition

Accumulated support time per customer per month. This is a manual/semi-automated metric for solo operations tracking.

### Labels

| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `category` | `install`, `config`, `policy`, `integration`, `bug`, `other` | Support category |
| `month` | YYYY-MM | Month of support |

### Collection

Log support interactions using:

```bash
stella ops support log --tenant <tenant> --minutes <minutes> --category <category>
```

Or via the API:

```http
POST /v1/ops/support/log
{
  "tenant": "acme-corp",
  "minutes": 15,
  "category": "config"
}
```

### Why This Matters

This metric tracks operational scalability. For solo-scaled operations:

- Support burden should trend toward zero
- High support minutes indicate product gaps
- Categories identify areas needing improvement

### Dashboard Usage

The Grafana dashboard shows:

- Stacked bar chart by category
- Monthly trend per tenant
- Total support burden

### Alert Response

**Warning (> 30 min/month per tenant):**

1. Review support interactions for patterns
2. Identify documentation gaps
3. Create runbooks for common issues

**Critical (> 60 min/month per tenant):**

1. Escalate to product for feature work
2. Consider dedicated support time
3. Prioritize automation

---

## Metric 4: Determinism Regressions

**Name:** `stella_determinism_regressions_total`
**Type:** Counter

### Definition

Count of detected determinism failures in production (same inputs produced different outputs).

### Labels

| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `component` | `scanner`, `policy`, `attestor`, `export` | Component with regression |
| `severity` | `bitwise`, `semantic`, `policy` | Fidelity tier of regression |

### Severity Tiers

| Tier | Description | Impact |
|------|-------------|--------|
| `bitwise` | Byte-for-byte output differs | Low - cosmetic |
| `semantic` | Output semantically differs | Medium - potential confusion |
| `policy` | Policy decision differs | **Critical** - audit risk |

### Collection Points

1. **Scheduled verification jobs** - Regular determinism checks
2. **Replay verification failures** - User-initiated replays
3. **CI golden test failures** - Development-time detection
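To make the label scheme concrete, here is a minimal C# sketch of how a detection site might report a regression, again via `System.Diagnostics.Metrics`. The class, method, and meter names are illustrative assumptions, not the actual `P0ProductMetrics.cs` API:

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical sketch; the real definitions live in P0ProductMetrics.cs.
public static class DeterminismMetricsSketch
{
    private static readonly Meter P0Meter = new("StellaOps.Telemetry");

    // Backs stella_determinism_regressions_total.
    private static readonly Counter<long> DeterminismRegressions =
        P0Meter.CreateCounter<long>(
            "stella_determinism_regressions_total",
            description: "Detected determinism failures (same inputs, different outputs)");

    // Called from scheduled verification jobs, user-initiated replays,
    // or CI golden-test reporting whenever a mismatch is detected.
    public static void RecordRegression(string tenant, string component, string severity)
    {
        DeterminismRegressions.Add(
            1,
            new KeyValuePair<string, object?>("tenant", tenant),       // tenant identifier
            new KeyValuePair<string, object?>("component", component), // scanner | policy | attestor | export
            new KeyValuePair<string, object?>("severity", severity));  // bitwise | semantic | policy
    }
}
```

Keeping `component` and `severity` restricted to the fixed value sets above keeps label cardinality bounded, which matters for any Prometheus-backed counter.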
### Why This Matters

Determinism is a core moat. Regressions indicate:

- Non-deterministic code introduced
- External dependency changes
- Time-sensitive logic bugs

**Policy-level regressions are audit-breaking** and must be fixed immediately.

### Dashboard Usage

The Grafana dashboard shows:

- Counter with severity breakdown
- Alert status indicator
- Historical trend

### Alert Response

**Warning (any bitwise/semantic):**

1. Review recent deployments
2. Check for dependency updates
3. Investigate affected component

**Critical (any policy):**

1. **Immediate investigation required**
2. Consider rollback
3. Review all recent policy decisions
4. Notify affected customers

---

## Dashboard Access

The P0 metrics dashboard is available at:

```
/grafana/d/stella-p0-metrics
```

Or directly:

```bash
stella ops dashboard p0
```

### Dashboard Features

- **Tenant selector** - Filter by specific tenant
- **Time range** - Adjust analysis window
- **SLO indicators** - Green/yellow/red status
- **Drill-down links** - Navigate to detailed views

---

## Alerting Configuration

Alerts are configured in `devops/telemetry/alerts/stella-p0-alerts.yml`.

### Alert Channels

Configure alert destinations in Grafana:

- Slack/Teams for warnings
- PagerDuty for critical alerts
- Email for summaries

### Silencing Alerts

During maintenance windows:

```bash
stella ops alerts silence --duration 2h --reason "Planned maintenance"
```

---

## Implementation Notes

### Source Files

| Component | Location |
|-----------|----------|
| Metric definitions | `src/Telemetry/StellaOps.Telemetry.Core/P0ProductMetrics.cs` |
| Install timestamp | `src/Telemetry/StellaOps.Telemetry.Core/InstallTimestampService.cs` |
| Dashboard template | `devops/telemetry/grafana/dashboards/stella-ops-p0-metrics.json` |
| Alert rules | `devops/telemetry/alerts/stella-p0-alerts.yml` |

### Adding Custom Metrics

To add additional P0-level metrics:

1. Define in `P0ProductMetrics.cs`
2. Add collection points in relevant services
3. Create dashboard panel in Grafana JSON
4. Add alert rules
5. Update this documentation

---

## Related

- [Observability Guide](observability.md)
- [Alerting Configuration](alerting.md)
- [Runbook: Metric Collection Issues](../../operations/runbooks/telemetry-metrics-ops.md)

---

_Last updated: 2026-01-17 (UTC)_