
P0 Product Metrics

Sprint: SPRINT_20260117_028_Telemetry_p0_metrics
Task: P0M-007 - Documentation

This document describes the four P0 (highest priority) product-level metrics for tracking Stella Ops operational health.

Overview

These metrics serve as the primary scoreboard for product health and should guide prioritization decisions. Per the AI Economics Moat advisory: "Prioritize work that improves them."

| Metric | Target | Alert Threshold |
|--------|--------|-----------------|
| Time to First Verified Release | P90 < 4 hours | P90 > 24 hours |
| Mean Time to Answer "Why Blocked" | P90 < 5 minutes | P90 > 1 hour |
| Support Minutes per Customer | Trend toward 0 | > 30 min/month |
| Determinism Regressions | Zero | Any policy-level |

Metric 1: Time to First Verified Release

Name: `stella_time_to_first_verified_release_seconds`
Type: Histogram

Definition

Elapsed time from fresh install (first service startup) to first successful verified promotion (policy gate passed, evidence recorded).

Labels

| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `deployment_type` | `fresh`, `upgrade` | Type of installation |

Histogram Buckets

5m, 15m, 30m, 1h, 2h, 4h, 8h, 24h, 48h, 168h (1 week)

Collection Points

  1. Install timestamp - Recorded on first Authority service startup
  2. First promotion - Recorded in Release Orchestrator on first verified promotion
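
The two collection points above combine into a single histogram observation. A minimal sketch in Python (the shipped implementation lives in the C# telemetry core; function and constant names here are illustrative, not the real API):

```python
from datetime import datetime

# Upper bounds in seconds, matching the documented buckets:
# 5m, 15m, 30m, 1h, 2h, 4h, 8h, 24h, 48h, 168h (1 week).
BUCKET_BOUNDS_SECONDS = [300, 900, 1800, 3600, 7200, 14400, 28800, 86400, 172800, 604800]

def time_to_first_verified_release(install_ts: datetime, first_promotion_ts: datetime) -> float:
    """Elapsed seconds from first Authority startup to first verified promotion."""
    return (first_promotion_ts - install_ts).total_seconds()

def bucket_le(elapsed_seconds: float) -> str:
    """Return the `le` label of the smallest histogram bucket holding the observation."""
    for upper in BUCKET_BOUNDS_SECONDS:
        if elapsed_seconds <= upper:
            return str(upper)
    return "+Inf"
```

For example, a three-hour onboarding (10 800 s) lands in the 4 h bucket (`le="14400"`), inside the P90 < 4 hours target.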

Why This Matters

A short time-to-first-release indicates:

  • Good onboarding experience
  • Clear documentation
  • Sensible default configurations
  • Working integrations

Dashboard Usage

The Grafana dashboard shows:

  • Histogram heatmap of time distribution
  • P50/P90/P99 statistics
  • Trend over time

Alert Response

Warning (P90 > 4 hours):

  1. Review recent onboarding experiences
  2. Check for common configuration issues
  3. Review documentation clarity

Critical (P90 > 24 hours):

  1. Investigate blocked customers
  2. Check for integration failures
  3. Consider guided onboarding assistance

Metric 2: Mean Time to Answer "Why Blocked"

Name: `stella_why_blocked_latency_seconds`
Type: Histogram

Definition

Time from block decision to user viewing explanation (via CLI, UI, or API).

Labels

| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `surface` | `cli`, `ui`, `api` | Interface used to view the explanation |
| `resolution_type` | `immediate`, `delayed` | Same session vs. different session |

Histogram Buckets

1s, 5s, 30s, 1m, 5m, 15m, 1h, 4h, 24h

Collection Points

  1. Block decision - Timestamp stored in the verdict
  2. Explanation view - Tracked when `stella explain block` or the UI equivalent is invoked
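
One observation pairs the latency value with its labels. A hedged Python sketch (names are illustrative; the real implementation is in the C# telemetry core, and `same_session` is assumed to come from comparing session identifiers on the block decision and the explanation view):

```python
from datetime import datetime

def why_blocked_observation(blocked_at: datetime, viewed_at: datetime,
                            tenant: str, surface: str, same_session: bool) -> dict:
    """Build one histogram observation for stella_why_blocked_latency_seconds."""
    return {
        "value": (viewed_at - blocked_at).total_seconds(),
        "labels": {
            "tenant": tenant,
            "surface": surface,  # cli, ui, or api
            # resolution_type separates same-session answers from later lookups
            "resolution_type": "immediate" if same_session else "delayed",
        },
    }
```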

Why This Matters

Short "why blocked" latency indicates:

  • Clear block messaging
  • Discoverable explanation tools
  • Good explainability UX

Long latency may indicate:

  • Users confused about where to find answers
  • Documentation gaps
  • UX friction

Dashboard Usage

The Grafana dashboard shows:

  • Histogram heatmap of latency distribution
  • Trend line over time
  • Breakdown by surface (CLI vs UI vs API)

Alert Response

Warning (P90 > 5 minutes):

  1. Review block notification messaging
  2. Check CLI command discoverability
  3. Verify UI links are prominent

Critical (P90 > 1 hour):

  1. Investigate user flows
  2. Add proactive notifications
  3. Review documentation and help text

Metric 3: Support Minutes per Customer

Name: `stella_support_burden_minutes_total`
Type: Counter

Definition

Accumulated support time per customer per month. This is a manual/semi-automated metric for solo operations tracking.

Labels

| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `category` | `install`, `config`, `policy`, `integration`, `bug`, `other` | Support category |
| `month` | `YYYY-MM` | Month of support |

Collection

Log support interactions using:

```bash
stella ops support log --tenant <id> --minutes <n> --category <cat>
```

Or via API:

```http
POST /v1/ops/support/log
{
  "tenant": "acme-corp",
  "minutes": 15,
  "category": "config"
}
```
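
The monthly roll-up against the 30- and 60-minute thresholds can be sketched as follows (illustrative Python, not the shipped implementation; the entry shape mirrors the API payload plus a `month` label, and the constants mirror the alert thresholds in Alert Response):

```python
from collections import defaultdict

WARN_MINUTES = 30  # per-tenant monthly warning threshold
CRIT_MINUTES = 60  # per-tenant monthly critical threshold

def monthly_totals(entries: list) -> dict:
    """Sum logged support minutes per (tenant, month)."""
    totals = defaultdict(int)
    for entry in entries:
        totals[(entry["tenant"], entry["month"])] += entry["minutes"]
    return dict(totals)

def alert_level(minutes: int) -> str:
    """Map a monthly per-tenant total onto the documented alert levels."""
    if minutes > CRIT_MINUTES:
        return "critical"
    if minutes > WARN_MINUTES:
        return "warning"
    return "ok"
```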

Why This Matters

This metric tracks operational scalability. For solo-scaled operations:

  • Support burden should trend toward zero
  • High support minutes indicate product gaps
  • Categories identify areas needing improvement

Dashboard Usage

The Grafana dashboard shows:

  • Stacked bar chart by category
  • Monthly trend per tenant
  • Total support burden

Alert Response

Warning (> 30 min/month per tenant):

  1. Review support interactions for patterns
  2. Identify documentation gaps
  3. Create runbooks for common issues

Critical (> 60 min/month per tenant):

  1. Escalate to product for feature work
  2. Consider dedicated support time
  3. Prioritize automation

Metric 4: Determinism Regressions

Name: `stella_determinism_regressions_total`
Type: Counter

Definition

Count of detected determinism failures in production (same inputs produced different outputs).

Labels

| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `component` | `scanner`, `policy`, `attestor`, `export` | Component with the regression |
| `severity` | `bitwise`, `semantic`, `policy` | Fidelity tier of the regression |

Severity Tiers

| Tier | Description | Impact |
|------|-------------|--------|
| `bitwise` | Byte-for-byte output differs | Low (cosmetic) |
| `semantic` | Output semantically differs | Medium (potential confusion) |
| `policy` | Policy decision differs | Critical (audit risk) |
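
The tier ordering matters: a policy difference also implies semantic and bitwise drift, so classification should check the most severe tier first. A sketch under the assumption that each run records its raw output bytes, a canonical (semantic) form, and the policy decision (the field names are hypothetical):

```python
def classify_regression(baseline: dict, replay: dict) -> str:
    """Classify a determinism regression between two runs of the same inputs.

    Returns the severity tier, or "none" when the runs match byte-for-byte.
    The most severe tier is checked first.
    """
    if replay["decision"] != baseline["decision"]:
        return "policy"    # audit-breaking: the gate decision itself changed
    if replay["canonical"] != baseline["canonical"]:
        return "semantic"  # meaning changed, decision did not
    if replay["raw"] != baseline["raw"]:
        return "bitwise"   # byte-level drift only
    return "none"
```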

Collection Points

  1. Scheduled verification jobs - Regular determinism checks
  2. Replay verification failures - User-initiated replays
  3. CI golden test failures - Development-time detection

Why This Matters

Determinism is a core moat. Regressions indicate:

  • Non-deterministic code introduced
  • External dependency changes
  • Time-sensitive logic bugs

Policy-level regressions are audit-breaking and must be fixed immediately.

Dashboard Usage

The Grafana dashboard shows:

  • Counter with severity breakdown
  • Alert status indicator
  • Historical trend

Alert Response

Warning (any bitwise/semantic):

  1. Review recent deployments
  2. Check for dependency updates
  3. Investigate affected component

Critical (any policy):

  1. Immediate investigation required
  2. Consider rollback
  3. Review all recent policy decisions
  4. Notify affected customers

Dashboard Access

The P0 metrics dashboard is available at:

```
/grafana/d/stella-p0-metrics
```

Or directly:

```bash
stella ops dashboard p0
```

Dashboard Features

  • Tenant selector - Filter by specific tenant
  • Time range - Adjust analysis window
  • SLO indicators - Green/yellow/red status
  • Drill-down links - Navigate to detailed views

Alerting Configuration

Alerts are configured in `devops/telemetry/alerts/stella-p0-alerts.yml`.

Alert Channels

Configure alert destinations in Grafana:

  • Slack/Teams for warnings
  • PagerDuty for critical alerts
  • Email for summaries

Silencing Alerts

During maintenance windows:

```bash
stella ops alerts silence --duration 2h --reason "Planned maintenance"
```

Implementation Notes

Source Files

| Component | Location |
|-----------|----------|
| Metric definitions | `src/Telemetry/StellaOps.Telemetry.Core/P0ProductMetrics.cs` |
| Install timestamp | `src/Telemetry/StellaOps.Telemetry.Core/InstallTimestampService.cs` |
| Dashboard template | `devops/telemetry/grafana/dashboards/stella-ops-p0-metrics.json` |
| Alert rules | `devops/telemetry/alerts/stella-p0-alerts.yml` |

Adding Custom Metrics

To add additional P0-level metrics:

  1. Define in P0ProductMetrics.cs
  2. Add collection points in relevant services
  3. Create dashboard panel in Grafana JSON
  4. Add alert rules
  5. Update this documentation


Last updated: 2026-01-17 (UTC)