# P0 Product Metrics

> **Sprint:** SPRINT_20260117_028_Telemetry_p0_metrics
> **Task:** P0M-007 - Documentation

This document describes the four P0 (highest priority) product-level metrics for tracking Stella Ops operational health.

## Overview

These metrics serve as the primary scoreboard for product health and should guide prioritization decisions. Per the AI Economics Moat advisory: "Prioritize work that improves them."

| Metric | Target | Alert Threshold |
|--------|--------|-----------------|
| Time to First Verified Release | P90 < 4 hours | P90 > 24 hours |
| Mean Time to Answer "Why Blocked" | P90 < 5 minutes | P90 > 1 hour |
| Support Minutes per Customer | Trend toward 0 | > 30 min/month |
| Determinism Regressions | Zero | Any policy-level regression |

---

## Metric 1: Time to First Verified Release

**Name:** `stella_time_to_first_verified_release_seconds`
**Type:** Histogram

### Definition

Elapsed time from fresh install (first service startup) to first successful verified promotion (policy gate passed, evidence recorded).

### Labels

| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `deployment_type` | `fresh`, `upgrade` | Type of installation |

### Histogram Buckets

5m, 15m, 30m, 1h, 2h, 4h, 8h, 24h, 48h, 168h (1 week)

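These boundaries live in the metrics-pipeline configuration rather than in the instrument itself. A minimal sketch of how the bucket list above could be applied with the OpenTelemetry .NET SDK; the class and method names are illustrative assumptions, while the instrument name and the boundaries (converted to seconds) come from this page:

```csharp
using OpenTelemetry.Metrics;

// Sketch only: assumes the OpenTelemetry .NET SDK handles export. The class
// and method names are illustrative; the instrument name and bucket
// boundaries (5m ... 168h, expressed in seconds) come from this page.
public static class P0MetricViews
{
    public static MeterProviderBuilder AddTimeToFirstReleaseBuckets(
        this MeterProviderBuilder builder) =>
        builder.AddView(
            instrumentName: "stella_time_to_first_verified_release_seconds",
            new ExplicitBucketHistogramConfiguration
            {
                // 5m, 15m, 30m, 1h, 2h, 4h, 8h, 24h, 48h, 168h
                Boundaries = new double[]
                {
                    300, 900, 1800, 3600, 7200, 14400,
                    28800, 86400, 172800, 604800
                }
            });
}
```
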
### Collection Points

1. **Install timestamp** - Recorded on first Authority service startup
2. **First promotion** - Recorded by the Release Orchestrator on the first verified promotion

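A minimal sketch of how the two collection points might combine into a single recording with `System.Diagnostics.Metrics`; the class and meter names are assumptions (the real definitions live in `P0ProductMetrics.cs`), while the metric name, unit, and labels come from this page:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Sketch only: class/meter names are assumptions; the metric name, unit,
// and label set come from this page.
public sealed class TimeToFirstReleaseRecorder
{
    private static readonly Meter Meter = new("StellaOps.Telemetry");

    private static readonly Histogram<double> TimeToFirstVerifiedRelease =
        Meter.CreateHistogram<double>(
            "stella_time_to_first_verified_release_seconds",
            unit: "s",
            description: "Fresh install to first verified promotion");

    // installTimestamp: persisted at first Authority service startup.
    // Called once, by the Release Orchestrator, on the first verified promotion.
    public void RecordFirstVerifiedRelease(
        DateTimeOffset installTimestamp, string tenant, string deploymentType)
    {
        double elapsedSeconds =
            (DateTimeOffset.UtcNow - installTimestamp).TotalSeconds;

        TimeToFirstVerifiedRelease.Record(
            elapsedSeconds,
            new KeyValuePair<string, object?>("tenant", tenant),
            new KeyValuePair<string, object?>("deployment_type", deploymentType));
    }
}
```
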
### Why This Matters

A short time-to-first-release indicates:

- Good onboarding experience
- Clear documentation
- Sensible default configurations
- Working integrations

### Dashboard Usage

The Grafana dashboard shows:

- Histogram heatmap of time distribution
- P50/P90/P99 statistics
- Trend over time

### Alert Response

**Warning (P90 > 4 hours):**

1. Review recent onboarding experiences
2. Check for common configuration issues
3. Review documentation clarity

**Critical (P90 > 24 hours):**

1. Investigate blocked customers
2. Check for integration failures
3. Consider guided onboarding assistance

---

## Metric 2: Mean Time to Answer "Why Blocked"

**Name:** `stella_why_blocked_latency_seconds`
**Type:** Histogram

### Definition

Time from the block decision to the user viewing the explanation (via CLI, UI, or API).

### Labels

| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `surface` | `cli`, `ui`, `api` | Interface used to view explanation |
| `resolution_type` | `immediate`, `delayed` | Same session vs. a different session |

### Histogram Buckets

1s, 5s, 30s, 1m, 5m, 15m, 1h, 4h, 24h

### Collection Points

1. **Block decision** - Timestamp stored in the verdict
2. **Explanation view** - Tracked when `stella explain block` or its UI equivalent is invoked

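Under the same assumptions as the sketch for Metric 1, the latency can be computed at the moment the explanation is served, using the timestamp stored in the verdict; only the metric name and labels come from this page:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Sketch only: class/meter names are assumptions; the metric name and
// label set come from this page.
public sealed class WhyBlockedLatencyRecorder
{
    private static readonly Meter Meter = new("StellaOps.Telemetry");

    private static readonly Histogram<double> WhyBlockedLatency =
        Meter.CreateHistogram<double>(
            "stella_why_blocked_latency_seconds",
            unit: "s",
            description: "Block decision to explanation view");

    // blockedAt: the timestamp stored in the verdict at decision time.
    // Called whenever the explanation is served via CLI, UI, or API.
    public void RecordExplanationViewed(
        DateTimeOffset blockedAt, string tenant, string surface, bool sameSession)
    {
        WhyBlockedLatency.Record(
            (DateTimeOffset.UtcNow - blockedAt).TotalSeconds,
            new KeyValuePair<string, object?>("tenant", tenant),
            new KeyValuePair<string, object?>("surface", surface), // cli | ui | api
            new KeyValuePair<string, object?>(
                "resolution_type", sameSession ? "immediate" : "delayed"));
    }
}
```
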
### Why This Matters

Short "why blocked" latency indicates:

- Clear block messaging
- Discoverable explanation tools
- Good explainability UX

Long latency may indicate:

- Users confused about where to find answers
- Documentation gaps
- UX friction

### Dashboard Usage

The Grafana dashboard shows:

- Histogram heatmap of latency distribution
- Trend line over time
- Breakdown by surface (CLI vs UI vs API)

### Alert Response

**Warning (P90 > 5 minutes):**

1. Review block notification messaging
2. Check CLI command discoverability
3. Verify UI links are prominent

**Critical (P90 > 1 hour):**

1. Investigate user flows
2. Add proactive notifications
3. Review documentation and help text

---

## Metric 3: Support Minutes per Customer

**Name:** `stella_support_burden_minutes_total`
**Type:** Counter

### Definition

Accumulated support time per customer per month. This is a manual/semi-automated metric for solo operations tracking.

### Labels

| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `category` | `install`, `config`, `policy`, `integration`, `bug`, `other` | Support category |
| `month` | YYYY-MM | Month of support |

### Collection

Log support interactions using:

```bash
stella ops support log --tenant <id> --minutes <n> --category <cat>
```

Or via API:

```http
POST /v1/ops/support/log
{
  "tenant": "acme-corp",
  "minutes": 15,
  "category": "config"
}
```
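
Server-side, both entry points presumably end in a single counter increment. A minimal sketch with `System.Diagnostics.Metrics`; everything except the metric name and label set is an assumption:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Sketch only: class/meter names are assumptions; the metric name, unit,
// and label set come from this page.
public sealed class SupportBurdenRecorder
{
    private static readonly Meter Meter = new("StellaOps.Telemetry");

    private static readonly Counter<long> SupportMinutes =
        Meter.CreateCounter<long>(
            "stella_support_burden_minutes_total",
            unit: "min",
            description: "Accumulated support time per customer");

    // Backs the CLI command and the POST /v1/ops/support/log endpoint above.
    public void LogSupport(string tenant, long minutes, string category)
    {
        SupportMinutes.Add(
            minutes,
            new KeyValuePair<string, object?>("tenant", tenant),
            new KeyValuePair<string, object?>("category", category),
            // "month" label as YYYY-MM, per the label table above.
            new KeyValuePair<string, object?>(
                "month", DateTime.UtcNow.ToString("yyyy-MM")));
    }
}
```
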
### Why This Matters

This metric tracks operational scalability. For solo-scaled operations:

- Support burden should trend toward zero
- High support minutes indicate product gaps
- Categories identify areas needing improvement

### Dashboard Usage

The Grafana dashboard shows:

- Stacked bar chart by category
- Monthly trend per tenant
- Total support burden

### Alert Response

**Warning (> 30 min/month per tenant):**

1. Review support interactions for patterns
2. Identify documentation gaps
3. Create runbooks for common issues

**Critical (> 60 min/month per tenant):**

1. Escalate to product for feature work
2. Consider dedicated support time
3. Prioritize automation

---

## Metric 4: Determinism Regressions

**Name:** `stella_determinism_regressions_total`
**Type:** Counter

### Definition

Count of detected determinism failures in production (same inputs produced different outputs).

### Labels

| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `component` | `scanner`, `policy`, `attestor`, `export` | Component with regression |
| `severity` | `bitwise`, `semantic`, `policy` | Fidelity tier of regression |

### Severity Tiers

| Tier | Description | Impact |
|------|-------------|--------|
| `bitwise` | Byte-for-byte output differs | Low - cosmetic |
| `semantic` | Output semantically differs | Medium - potential confusion |
| `policy` | Policy decision differs | **Critical** - audit risk |

### Collection Points

1. **Scheduled verification jobs** - Regular determinism checks
2. **Replay verification failures** - User-initiated replays
3. **CI golden test failures** - Development-time detection

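Whichever collection point detects a mismatch, the counter is incremented with the affected component and severity tier. A minimal sketch under the same assumptions as the earlier ones (only the metric name and labels come from this page):

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Sketch only: class/meter names are assumptions; the metric name and
// label set come from this page.
public sealed class DeterminismRegressionRecorder
{
    private static readonly Meter Meter = new("StellaOps.Telemetry");

    private static readonly Counter<long> Regressions =
        Meter.CreateCounter<long>(
            "stella_determinism_regressions_total",
            description: "Detected determinism failures");

    // Called by verification jobs, replay verification, or CI bridges.
    public void RecordRegression(string tenant, string component, string severity)
    {
        Regressions.Add(
            1,
            new KeyValuePair<string, object?>("tenant", tenant),
            new KeyValuePair<string, object?>("component", component), // scanner|policy|attestor|export
            new KeyValuePair<string, object?>("severity", severity));  // bitwise|semantic|policy
    }
}
```
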
### Why This Matters

Determinism is a core moat. Regressions indicate:

- Non-deterministic code introduced
- External dependency changes
- Time-sensitive logic bugs

**Policy-level regressions are audit-breaking** and must be fixed immediately.

### Dashboard Usage

The Grafana dashboard shows:

- Counter with severity breakdown
- Alert status indicator
- Historical trend

### Alert Response

**Warning (any bitwise/semantic regression):**

1. Review recent deployments
2. Check for dependency updates
3. Investigate affected component

**Critical (any policy regression):**

1. **Immediate investigation required**
2. Consider rollback
3. Review all recent policy decisions
4. Notify affected customers

---

## Dashboard Access

The P0 metrics dashboard is available at:

```
/grafana/d/stella-p0-metrics
```

Or open it directly from the CLI:

```bash
stella ops dashboard p0
```

### Dashboard Features

- **Tenant selector** - Filter by specific tenant
- **Time range** - Adjust analysis window
- **SLO indicators** - Green/yellow/red status
- **Drill-down links** - Navigate to detailed views

---

## Alerting Configuration

Alerts are configured in `devops/telemetry/alerts/stella-p0-alerts.yml`.

### Alert Channels

Configure alert destinations in Grafana:

- Slack/Teams for warnings
- PagerDuty for critical alerts
- Email for summaries

### Silencing Alerts

During maintenance windows:

```bash
stella ops alerts silence --duration 2h --reason "Planned maintenance"
```

---

## Implementation Notes

### Source Files

| Component | Location |
|-----------|----------|
| Metric definitions | `src/Telemetry/StellaOps.Telemetry.Core/P0ProductMetrics.cs` |
| Install timestamp | `src/Telemetry/StellaOps.Telemetry.Core/InstallTimestampService.cs` |
| Dashboard template | `devops/telemetry/grafana/dashboards/stella-ops-p0-metrics.json` |
| Alert rules | `devops/telemetry/alerts/stella-p0-alerts.yml` |

### Adding Custom Metrics

To add additional P0-level metrics:

1. Define in `P0ProductMetrics.cs` (see the sketch below)
2. Add collection points in relevant services
3. Create dashboard panel in Grafana JSON
4. Add alert rules
5. Update this documentation

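For step 1, a hypothetical sketch of what a new definition might look like; the partial-class layout and meter name are assumptions, and the example metric itself is purely illustrative:

```csharp
using System.Diagnostics.Metrics;

// Hypothetical sketch of step 1: the partial-class layout and meter name
// are assumptions, and the example metric below is illustrative only.
public static partial class P0ProductMetrics
{
    private static readonly Meter Meter = new("StellaOps.Telemetry");

    // Example custom P0 metric; follow the stella_* naming used by the
    // four metrics above.
    public static readonly Counter<long> PolicyOverridesTotal =
        Meter.CreateCounter<long>(
            "stella_policy_overrides_total",
            description: "Manual policy gate overrides");
}
```
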
---
## Related

- [Observability Guide](observability.md)
- [Alerting Configuration](alerting.md)
- [Runbook: Metric Collection Issues](../../operations/runbooks/telemetry-metrics-ops.md)

---
_Last updated: 2026-01-17 (UTC)_