P0 Product Metrics
Sprint: SPRINT_20260117_028_Telemetry_p0_metrics
Task: P0M-007 - Documentation
This document describes the four P0 (highest priority) product-level metrics for tracking Stella Ops operational health.
Overview
These metrics serve as the primary scoreboard for product health and should guide prioritization decisions. Per the AI Economics Moat advisory: "Prioritize work that improves them."
| Metric | Target | Alert Threshold |
|---|---|---|
| Time to First Verified Release | P90 < 4 hours | P90 > 24 hours |
| Mean Time to Answer "Why Blocked" | P90 < 5 minutes | P90 > 1 hour |
| Support Minutes per Customer | Trend toward 0 | > 30 min/month |
| Determinism Regressions | Zero | Any policy-level |
Metric 1: Time to First Verified Release
Name: stella_time_to_first_verified_release_seconds
Type: Histogram
Definition
Elapsed time from fresh install (first service startup) to first successful verified promotion (policy gate passed, evidence recorded).
Labels
| Label | Values | Description |
|---|---|---|
| tenant | (varies) | Tenant identifier |
| deployment_type | fresh, upgrade | Type of installation |
Histogram Buckets
5m, 15m, 30m, 1h, 2h, 4h, 8h, 24h, 48h, 168h (1 week)
Collection Points
- Install timestamp - Recorded on first Authority service startup
- First promotion - Recorded in Release Orchestrator on first verified promotion
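The two collection points above boil down to a timestamp difference checked against the documented thresholds. A minimal Python sketch (the production implementation lives in P0ProductMetrics.cs; the function names here are illustrative, and only the 4-hour/24-hour thresholds come from this document):

```python
from datetime import datetime, timezone

WARN_SECONDS = 4 * 3600       # target: P90 < 4 hours
CRITICAL_SECONDS = 24 * 3600  # alert threshold: P90 > 24 hours

def time_to_first_verified_release(install_ts: datetime, promotion_ts: datetime) -> float:
    """Elapsed seconds between the install timestamp and the first verified promotion."""
    return (promotion_ts - install_ts).total_seconds()

def status(p90_seconds: float) -> str:
    """Classify a P90 value against the documented warning/critical thresholds."""
    if p90_seconds > CRITICAL_SECONDS:
        return "critical"
    if p90_seconds > WARN_SECONDS:
        return "warning"
    return "ok"

# A 2-hour first release is healthy; 25 hours would trip the critical alert.
two_hours = time_to_first_verified_release(
    datetime(2026, 1, 17, 8, 0, tzinfo=timezone.utc),
    datetime(2026, 1, 17, 10, 0, tzinfo=timezone.utc),
)
```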
Why This Matters
A short time-to-first-release indicates:
- Good onboarding experience
- Clear documentation
- Sensible default configurations
- Working integrations
Dashboard Usage
The Grafana dashboard shows:
- Histogram heatmap of time distribution
- P50/P90/P99 statistics
- Trend over time
Alert Response
Warning (P90 > 4 hours):
- Review recent onboarding experiences
- Check for common configuration issues
- Review documentation clarity
Critical (P90 > 24 hours):
- Investigate blocked customers
- Check for integration failures
- Consider guided onboarding assistance
Metric 2: Mean Time to Answer "Why Blocked"
Name: stella_why_blocked_latency_seconds
Type: Histogram
Definition
Time from a block decision to the user viewing its explanation (via CLI, UI, or API).
Labels
| Label | Values | Description |
|---|---|---|
| tenant | (varies) | Tenant identifier |
| surface | cli, ui, api | Interface used to view explanation |
| resolution_type | immediate, delayed | Same session vs. different session |
Histogram Buckets
1s, 5s, 30s, 1m, 5m, 15m, 1h, 4h, 24h
Collection Points
- Block decision - Timestamp stored in verdict
- Explanation view - Tracked when `stella explain block` or the UI equivalent is invoked
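The two collection points above yield both the latency observation and the resolution_type label. A sketch in Python (illustrative names; timestamps are epoch seconds, and the session-ID comparison for immediate vs. delayed is an assumption about how the label is derived):

```python
def why_blocked_observation(block_ts: float, view_ts: float,
                            block_session: str, view_session: str):
    """Derive the histogram observation (seconds) and the resolution_type label.

    A view in the same session as the block counts as "immediate";
    a view from a different session counts as "delayed".
    """
    latency_seconds = view_ts - block_ts
    resolution_type = "immediate" if view_session == block_session else "delayed"
    return latency_seconds, resolution_type
```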
Why This Matters
Short "why blocked" latency indicates:
- Clear block messaging
- Discoverable explanation tools
- Good explainability UX
Long latency may indicate:
- Users confused about where to find answers
- Documentation gaps
- UX friction
Dashboard Usage
The Grafana dashboard shows:
- Histogram heatmap of latency distribution
- Trend line over time
- Breakdown by surface (CLI vs UI vs API)
Alert Response
Warning (P90 > 5 minutes):
- Review block notification messaging
- Check CLI command discoverability
- Verify UI links are prominent
Critical (P90 > 1 hour):
- Investigate user flows
- Add proactive notifications
- Review documentation and help text
Metric 3: Support Minutes per Customer
Name: stella_support_burden_minutes_total
Type: Counter
Definition
Accumulated support time per customer per month. This is a manual/semi-automated metric for solo operations tracking.
Labels
| Label | Values | Description |
|---|---|---|
tenant |
(varies) | Tenant identifier |
category |
install, config, policy, integration, bug, other |
Support category |
month |
YYYY-MM | Month of support |
Collection
Log support interactions using the CLI:

```
stella ops support log --tenant <id> --minutes <n> --category <cat>
```

Or via API:

```
POST /v1/ops/support/log
{
  "tenant": "acme-corp",
  "minutes": 15,
  "category": "config"
}
```
Why This Matters
This metric tracks operational scalability. For solo-scaled operations:
- Support burden should trend toward zero
- High support minutes indicate product gaps
- Categories identify areas needing improvement
Dashboard Usage
The Grafana dashboard shows:
- Stacked bar chart by category
- Monthly trend per tenant
- Total support burden
Alert Response
Warning (> 30 min/month per tenant):
- Review support interactions for patterns
- Identify documentation gaps
- Create runbooks for common issues
Critical (> 60 min/month per tenant):
- Escalate to product for feature work
- Consider dedicated support time
- Prioritize automation
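The monthly aggregation behind this alert is a simple sum per (tenant, month). A Python sketch using the fields from the API payload above (the 30/60-minute thresholds come from this document; the function names are illustrative):

```python
from collections import defaultdict

WARN_MINUTES = 30      # warning: > 30 min/month per tenant
CRITICAL_MINUTES = 60  # critical: > 60 min/month per tenant

def monthly_burden(log_entries):
    """Sum logged support minutes per (tenant, month).

    Each entry mirrors the support-log payload:
    {"tenant": ..., "minutes": ..., "category": ..., "month": "YYYY-MM"}.
    """
    totals = defaultdict(int)
    for entry in log_entries:
        totals[(entry["tenant"], entry["month"])] += entry["minutes"]
    return dict(totals)

def flagged(totals):
    """Return tenants breaching the documented per-month thresholds."""
    out = {}
    for key, minutes in totals.items():
        if minutes > CRITICAL_MINUTES:
            out[key] = "critical"
        elif minutes > WARN_MINUTES:
            out[key] = "warning"
    return out
```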
Metric 4: Determinism Regressions
Name: stella_determinism_regressions_total
Type: Counter
Definition
Count of detected determinism failures in production (same inputs produced different outputs).
Labels
| Label | Values | Description |
|---|---|---|
tenant |
(varies) | Tenant identifier |
component |
scanner, policy, attestor, export |
Component with regression |
severity |
bitwise, semantic, policy |
Fidelity tier of regression |
Severity Tiers
| Tier | Description | Impact |
|---|---|---|
| bitwise | Byte-for-byte output differs | Low - cosmetic |
| semantic | Output semantically differs | Medium - potential confusion |
| policy | Policy decision differs | Critical - audit risk |
Collection Points
- Scheduled verification jobs - Regular determinism checks
- Replay verification failures - User-initiated replays
- CI golden test failures - Development-time detection
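A verification job at any of these collection points compares two runs of the same inputs and classifies the difference by tier. A Python sketch (the run structure and function names are illustrative assumptions; the tier ordering follows the severity table above):

```python
import hashlib

def digest(output_bytes: bytes) -> str:
    """Stable fingerprint of the raw output bytes."""
    return hashlib.sha256(output_bytes).hexdigest()

def classify_regression(run_a: dict, run_b: dict):
    """Compare two runs of the same inputs; return the regression tier or None.

    Each run is assumed to carry:
      "raw"      - the exact output bytes
      "parsed"   - a canonical semantic structure decoded from the output
      "decision" - the resulting policy decision

    Tiers are checked from most to least severe: policy > semantic > bitwise.
    """
    if run_a["decision"] != run_b["decision"]:
        return "policy"    # policy decision differs: audit risk
    if run_a["parsed"] != run_b["parsed"]:
        return "semantic"  # output semantically differs
    if digest(run_a["raw"]) != digest(run_b["raw"]):
        return "bitwise"   # byte-for-byte output differs only
    return None            # deterministic
```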
Why This Matters
Determinism is a core moat. Regressions indicate:
- Non-deterministic code introduced
- External dependency changes
- Time-sensitive logic bugs
Policy-level regressions are audit-breaking and must be fixed immediately.
Dashboard Usage
The Grafana dashboard shows:
- Counter with severity breakdown
- Alert status indicator
- Historical trend
Alert Response
Warning (any bitwise/semantic):
- Review recent deployments
- Check for dependency updates
- Investigate affected component
Critical (any policy):
- Immediate investigation required
- Consider rollback
- Review all recent policy decisions
- Notify affected customers
Dashboard Access
The P0 metrics dashboard is available at:

```
/grafana/d/stella-p0-metrics
```

Or directly from the CLI:

```
stella ops dashboard p0
```
Dashboard Features
- Tenant selector - Filter by specific tenant
- Time range - Adjust analysis window
- SLO indicators - Green/yellow/red status
- Drill-down links - Navigate to detailed views
Alerting Configuration
Alerts are configured in devops/telemetry/alerts/stella-p0-alerts.yml.
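For orientation, a Prometheus-style rule for the first metric might look like the following. This is only a sketch: the rule name is made up, and the actual format and contents of stella-p0-alerts.yml may differ; only the metric name and the 24-hour threshold come from this document.

```yaml
groups:
  - name: stella-p0
    rules:
      - alert: TimeToFirstVerifiedReleaseHigh
        # P90 over the histogram buckets, against the 24h critical threshold
        expr: |
          histogram_quantile(0.90,
            sum(rate(stella_time_to_first_verified_release_seconds_bucket[7d])) by (le)
          ) > 86400
        labels:
          severity: critical
```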
Alert Channels
Configure alert destinations in Grafana:
- Slack/Teams for warnings
- PagerDuty for critical alerts
- Email for summaries
Silencing Alerts
During maintenance windows, silence alerts with:

```
stella ops alerts silence --duration 2h --reason "Planned maintenance"
```
Implementation Notes
Source Files
| Component | Location |
|---|---|
| Metric definitions | src/Telemetry/StellaOps.Telemetry.Core/P0ProductMetrics.cs |
| Install timestamp | src/Telemetry/StellaOps.Telemetry.Core/InstallTimestampService.cs |
| Dashboard template | devops/telemetry/grafana/dashboards/stella-ops-p0-metrics.json |
| Alert rules | devops/telemetry/alerts/stella-p0-alerts.yml |
Adding Custom Metrics
To add additional P0-level metrics:

- Define the metric in P0ProductMetrics.cs
- Add collection points in the relevant services
- Create a dashboard panel in the Grafana JSON
- Add alert rules
- Update this documentation
Last updated: 2026-01-17 (UTC)