
P0 Product Metrics

Sprint: SPRINT_20260117_028_Telemetry_p0_metrics
Task: P0M-007 - Documentation

This document describes the four P0 (highest priority) product-level metrics for tracking Stella Ops operational health.

Overview

These metrics serve as the primary scoreboard for product health and should guide prioritization decisions. Per the AI Economics Moat advisory: "Prioritize work that improves them."

| Metric | Target | Alert Threshold |
|--------|--------|-----------------|
| Time to First Verified Release | P90 < 4 hours | P90 > 24 hours |
| Mean Time to Answer "Why Blocked" | P90 < 5 minutes | P90 > 1 hour |
| Support Minutes per Customer | Trend toward 0 | > 30 min/month |
| Determinism Regressions | Zero | Any policy-level |

Metric 1: Time to First Verified Release

Name: `stella_time_to_first_verified_release_seconds`
Type: Histogram

Definition

Elapsed time from fresh install (first service startup) to first successful verified promotion (policy gate passed, evidence recorded).

Labels

| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `deployment_type` | `fresh`, `upgrade` | Type of installation |

Histogram Buckets

5m, 15m, 30m, 1h, 2h, 4h, 8h, 24h, 48h, 168h (1 week)

Collection Points

  1. Install timestamp - Recorded on first Authority service startup
  2. First promotion - Recorded in Release Orchestrator on first verified promotion
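
The two collection points above combine into a single histogram observation. A minimal sketch in Python (the shipped implementation lives in the C# telemetry core; function and constant names here are illustrative, not the real API):

```python
from datetime import datetime

# Upper bounds in seconds, matching the documented buckets:
# 5m, 15m, 30m, 1h, 2h, 4h, 8h, 24h, 48h, 168h (1 week).
BUCKET_BOUNDS_SECONDS = [300, 900, 1800, 3600, 7200, 14400, 28800, 86400, 172800, 604800]

def time_to_first_verified_release(install_ts: datetime, first_promotion_ts: datetime) -> float:
    """Elapsed seconds from first Authority startup to first verified promotion."""
    return (first_promotion_ts - install_ts).total_seconds()

def bucket_le(elapsed_seconds: float) -> str:
    """Return the `le` label of the smallest histogram bucket holding the observation."""
    for upper in BUCKET_BOUNDS_SECONDS:
        if elapsed_seconds <= upper:
            return str(upper)
    return "+Inf"
```

For example, a three-hour onboarding (10 800 s) lands in the 4 h bucket (`le="14400"`), inside the P90 < 4 hours target.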

Why This Matters

A short time-to-first-release indicates:

  • Good onboarding experience
  • Clear documentation
  • Sensible default configurations
  • Working integrations

Dashboard Usage

The Grafana dashboard shows:

  • Histogram heatmap of time distribution
  • P50/P90/P99 statistics
  • Trend over time

Alert Response

Warning (P90 > 4 hours):

  1. Review recent onboarding experiences
  2. Check for common configuration issues
  3. Review documentation clarity

Critical (P90 > 24 hours):

  1. Investigate blocked customers
  2. Check for integration failures
  3. Consider guided onboarding assistance

Metric 2: Mean Time to Answer "Why Blocked"

Name: `stella_why_blocked_latency_seconds`
Type: Histogram

Definition

Time from block decision to user viewing explanation (via CLI, UI, or API).

Labels

| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `surface` | `cli`, `ui`, `api` | Interface used to view the explanation |
| `resolution_type` | `immediate`, `delayed` | Same session vs. different session |

Histogram Buckets

1s, 5s, 30s, 1m, 5m, 15m, 1h, 4h, 24h

Collection Points

  1. Block decision - Timestamp stored in the verdict
  2. Explanation view - Tracked when `stella explain block` or the UI equivalent is invoked
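
One observation pairs the latency value with its labels. A hedged Python sketch (names are illustrative; the real implementation is in the C# telemetry core, and `same_session` is assumed to come from comparing session identifiers on the block decision and the explanation view):

```python
from datetime import datetime

def why_blocked_observation(blocked_at: datetime, viewed_at: datetime,
                            tenant: str, surface: str, same_session: bool) -> dict:
    """Build one histogram observation for stella_why_blocked_latency_seconds."""
    return {
        "value": (viewed_at - blocked_at).total_seconds(),
        "labels": {
            "tenant": tenant,
            "surface": surface,  # cli, ui, or api
            # resolution_type separates same-session answers from later lookups
            "resolution_type": "immediate" if same_session else "delayed",
        },
    }
```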

Why This Matters

Short "why blocked" latency indicates:

  • Clear block messaging
  • Discoverable explanation tools
  • Good explainability UX

Long latency may indicate:

  • Users confused about where to find answers
  • Documentation gaps
  • UX friction

Dashboard Usage

The Grafana dashboard shows:

  • Histogram heatmap of latency distribution
  • Trend line over time
  • Breakdown by surface (CLI vs UI vs API)

Alert Response

Warning (P90 > 5 minutes):

  1. Review block notification messaging
  2. Check CLI command discoverability
  3. Verify UI links are prominent

Critical (P90 > 1 hour):

  1. Investigate user flows
  2. Add proactive notifications
  3. Review documentation and help text

Metric 3: Support Minutes per Customer

Name: `stella_support_burden_minutes_total`
Type: Counter

Definition

Accumulated support time per customer per month. This is a manual/semi-automated metric for solo operations tracking.

Labels

| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `category` | `install`, `config`, `policy`, `integration`, `bug`, `other` | Support category |
| `month` | `YYYY-MM` | Month of support |

Collection

Log support interactions using:

```bash
stella ops support log --tenant <id> --minutes <n> --category <cat>
```

Or via API:

```http
POST /v1/ops/support/log
{
  "tenant": "acme-corp",
  "minutes": 15,
  "category": "config"
}
```
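
The monthly roll-up against the 30- and 60-minute thresholds can be sketched as follows (illustrative Python, not the shipped implementation; the entry shape mirrors the API payload plus a `month` label, and the constants mirror the alert thresholds in Alert Response):

```python
from collections import defaultdict

WARN_MINUTES = 30  # per-tenant monthly warning threshold
CRIT_MINUTES = 60  # per-tenant monthly critical threshold

def monthly_totals(entries: list) -> dict:
    """Sum logged support minutes per (tenant, month)."""
    totals = defaultdict(int)
    for entry in entries:
        totals[(entry["tenant"], entry["month"])] += entry["minutes"]
    return dict(totals)

def alert_level(minutes: int) -> str:
    """Map a monthly per-tenant total onto the documented alert levels."""
    if minutes > CRIT_MINUTES:
        return "critical"
    if minutes > WARN_MINUTES:
        return "warning"
    return "ok"
```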

Why This Matters

This metric tracks operational scalability. For solo-scaled operations:

  • Support burden should trend toward zero
  • High support minutes indicate product gaps
  • Categories identify areas needing improvement

Dashboard Usage

The Grafana dashboard shows:

  • Stacked bar chart by category
  • Monthly trend per tenant
  • Total support burden

Alert Response

Warning (> 30 min/month per tenant):

  1. Review support interactions for patterns
  2. Identify documentation gaps
  3. Create runbooks for common issues

Critical (> 60 min/month per tenant):

  1. Escalate to product for feature work
  2. Consider dedicated support time
  3. Prioritize automation

Metric 4: Determinism Regressions

Name: `stella_determinism_regressions_total`
Type: Counter

Definition

Count of detected determinism failures in production (same inputs produced different outputs).

Labels

| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `component` | `scanner`, `policy`, `attestor`, `export` | Component with the regression |
| `severity` | `bitwise`, `semantic`, `policy` | Fidelity tier of the regression |

Severity Tiers

| Tier | Description | Impact |
|------|-------------|--------|
| `bitwise` | Byte-for-byte output differs | Low (cosmetic) |
| `semantic` | Output semantically differs | Medium (potential confusion) |
| `policy` | Policy decision differs | Critical (audit risk) |
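
The tier ordering matters: a policy difference also implies semantic and bitwise drift, so classification should check the most severe tier first. A sketch under the assumption that each run records its raw output bytes, a canonical (semantic) form, and the policy decision (the field names are hypothetical):

```python
def classify_regression(baseline: dict, replay: dict) -> str:
    """Classify a determinism regression between two runs of the same inputs.

    Returns the severity tier, or "none" when the runs match byte-for-byte.
    The most severe tier is checked first.
    """
    if replay["decision"] != baseline["decision"]:
        return "policy"    # audit-breaking: the gate decision itself changed
    if replay["canonical"] != baseline["canonical"]:
        return "semantic"  # meaning changed, decision did not
    if replay["raw"] != baseline["raw"]:
        return "bitwise"   # byte-level drift only
    return "none"
```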

Collection Points

  1. Scheduled verification jobs - Regular determinism checks
  2. Replay verification failures - User-initiated replays
  3. CI golden test failures - Development-time detection

Why This Matters

Determinism is a core moat. Regressions indicate:

  • Non-deterministic code introduced
  • External dependency changes
  • Time-sensitive logic bugs

Policy-level regressions are audit-breaking and must be fixed immediately.

Dashboard Usage

The Grafana dashboard shows:

  • Counter with severity breakdown
  • Alert status indicator
  • Historical trend

Alert Response

Warning (any bitwise/semantic):

  1. Review recent deployments
  2. Check for dependency updates
  3. Investigate affected component

Critical (any policy):

  1. Immediate investigation required
  2. Consider rollback
  3. Review all recent policy decisions
  4. Notify affected customers

Dashboard Access

The P0 metrics dashboard is available at:

```
/grafana/d/stella-p0-metrics
```

Or directly:

```bash
stella ops dashboard p0
```

Dashboard Features

  • Tenant selector - Filter by specific tenant
  • Time range - Adjust analysis window
  • SLO indicators - Green/yellow/red status
  • Drill-down links - Navigate to detailed views

Alerting Configuration

Alerts are configured in `devops/telemetry/alerts/stella-p0-alerts.yml`.

Alert Channels

Configure alert destinations in Grafana:

  • Slack/Teams for warnings
  • PagerDuty for critical alerts
  • Email for summaries

Silencing Alerts

During maintenance windows:

```bash
stella ops alerts silence --duration 2h --reason "Planned maintenance"
```

Implementation Notes

Source Files

| Component | Location |
|-----------|----------|
| Metric definitions | `src/Telemetry/StellaOps.Telemetry.Core/P0ProductMetrics.cs` |
| Install timestamp | `src/Telemetry/StellaOps.Telemetry.Core/InstallTimestampService.cs` |
| Dashboard template | `devops/telemetry/grafana/dashboards/stella-ops-p0-metrics.json` |
| Alert rules | `devops/telemetry/alerts/stella-p0-alerts.yml` |

Adding Custom Metrics

To add additional P0-level metrics:

  1. Define in P0ProductMetrics.cs
  2. Add collection points in relevant services
  3. Create dashboard panel in Grafana JSON
  4. Add alert rules
  5. Update this documentation


Last updated: 2026-01-17 (UTC)