# P0 Product Metrics

> **Sprint:** SPRINT_20260117_028_Telemetry_p0_metrics
> **Task:** P0M-007 - Documentation

This document describes the four P0 (highest priority) product-level metrics for tracking Stella Ops operational health.

## Overview

These metrics serve as the primary scoreboard for product health and should guide prioritization decisions. Per the AI Economics Moat advisory: "Prioritize work that improves them."

| Metric | Target | Alert Threshold |
|--------|--------|-----------------|
| Time to First Verified Release | P90 < 4 hours | P90 > 24 hours |
| Mean Time to Answer "Why Blocked" | P90 < 5 minutes | P90 > 1 hour |
| Support Minutes per Customer | Trend toward 0 | > 30 min/month |
| Determinism Regressions | Zero | Any policy-level regression |

---

## Metric 1: Time to First Verified Release

**Name:** `stella_time_to_first_verified_release_seconds`
**Type:** Histogram

### Definition

Elapsed time from fresh install (first service startup) to first successful verified promotion (policy gate passed, evidence recorded).

### Labels

| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `deployment_type` | `fresh`, `upgrade` | Type of installation |

### Histogram Buckets

5m, 15m, 30m, 1h, 2h, 4h, 8h, 24h, 48h, 168h (1 week)

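These boundaries live in the metrics-pipeline configuration rather than in the instrument itself. A minimal sketch of how the bucket list above could be applied with the OpenTelemetry .NET SDK; the class and method names are illustrative assumptions, while the instrument name and the boundaries (converted to seconds) come from this page:

```csharp
using OpenTelemetry.Metrics;

// Sketch only: assumes the OpenTelemetry .NET SDK handles export. The class
// and method names are illustrative; the instrument name and bucket
// boundaries (5m ... 168h, expressed in seconds) come from this page.
public static class P0MetricViews
{
    public static MeterProviderBuilder AddTimeToFirstReleaseBuckets(
        this MeterProviderBuilder builder) =>
        builder.AddView(
            instrumentName: "stella_time_to_first_verified_release_seconds",
            new ExplicitBucketHistogramConfiguration
            {
                // 5m, 15m, 30m, 1h, 2h, 4h, 8h, 24h, 48h, 168h
                Boundaries = new double[]
                {
                    300, 900, 1800, 3600, 7200, 14400,
                    28800, 86400, 172800, 604800
                }
            });
}
```
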
### Collection Points

1. **Install timestamp** - Recorded on first Authority service startup
2. **First promotion** - Recorded by the Release Orchestrator on the first verified promotion

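A minimal sketch of how the two collection points might combine into a single recording with `System.Diagnostics.Metrics`; the class and meter names are assumptions (the real definitions live in `P0ProductMetrics.cs`), while the metric name, unit, and labels come from this page:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Sketch only: class/meter names are assumptions; the metric name, unit,
// and label set come from this page.
public sealed class TimeToFirstReleaseRecorder
{
    private static readonly Meter Meter = new("StellaOps.Telemetry");

    private static readonly Histogram<double> TimeToFirstVerifiedRelease =
        Meter.CreateHistogram<double>(
            "stella_time_to_first_verified_release_seconds",
            unit: "s",
            description: "Fresh install to first verified promotion");

    // installTimestamp: persisted at first Authority service startup.
    // Called once, by the Release Orchestrator, on the first verified promotion.
    public void RecordFirstVerifiedRelease(
        DateTimeOffset installTimestamp, string tenant, string deploymentType)
    {
        double elapsedSeconds =
            (DateTimeOffset.UtcNow - installTimestamp).TotalSeconds;

        TimeToFirstVerifiedRelease.Record(
            elapsedSeconds,
            new KeyValuePair<string, object?>("tenant", tenant),
            new KeyValuePair<string, object?>("deployment_type", deploymentType));
    }
}
```
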
### Why This Matters

A short time-to-first-release indicates:

- Good onboarding experience
- Clear documentation
- Sensible default configurations
- Working integrations

### Dashboard Usage

The Grafana dashboard shows:

- Histogram heatmap of time distribution
- P50/P90/P99 statistics
- Trend over time

### Alert Response

**Warning (P90 > 4 hours):**

1. Review recent onboarding experiences
2. Check for common configuration issues
3. Review documentation clarity

**Critical (P90 > 24 hours):**

1. Investigate blocked customers
2. Check for integration failures
3. Consider guided onboarding assistance

---

## Metric 2: Mean Time to Answer "Why Blocked"

**Name:** `stella_why_blocked_latency_seconds`
**Type:** Histogram

### Definition

Time from the block decision to the user viewing the explanation (via CLI, UI, or API).

### Labels

| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `surface` | `cli`, `ui`, `api` | Interface used to view explanation |
| `resolution_type` | `immediate`, `delayed` | Same session vs. a different session |

### Histogram Buckets

1s, 5s, 30s, 1m, 5m, 15m, 1h, 4h, 24h

### Collection Points

1. **Block decision** - Timestamp stored in the verdict
2. **Explanation view** - Tracked when `stella explain block` or its UI equivalent is invoked

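Under the same assumptions as the sketch for Metric 1, the latency can be computed at the moment the explanation is served, using the timestamp stored in the verdict; only the metric name and labels come from this page:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Sketch only: class/meter names are assumptions; the metric name and
// label set come from this page.
public sealed class WhyBlockedLatencyRecorder
{
    private static readonly Meter Meter = new("StellaOps.Telemetry");

    private static readonly Histogram<double> WhyBlockedLatency =
        Meter.CreateHistogram<double>(
            "stella_why_blocked_latency_seconds",
            unit: "s",
            description: "Block decision to explanation view");

    // blockedAt: the timestamp stored in the verdict at decision time.
    // Called whenever the explanation is served via CLI, UI, or API.
    public void RecordExplanationViewed(
        DateTimeOffset blockedAt, string tenant, string surface, bool sameSession)
    {
        WhyBlockedLatency.Record(
            (DateTimeOffset.UtcNow - blockedAt).TotalSeconds,
            new KeyValuePair<string, object?>("tenant", tenant),
            new KeyValuePair<string, object?>("surface", surface), // cli | ui | api
            new KeyValuePair<string, object?>(
                "resolution_type", sameSession ? "immediate" : "delayed"));
    }
}
```
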
### Why This Matters

Short "why blocked" latency indicates:

- Clear block messaging
- Discoverable explanation tools
- Good explainability UX

Long latency may indicate:

- Users confused about where to find answers
- Documentation gaps
- UX friction

### Dashboard Usage

The Grafana dashboard shows:

- Histogram heatmap of latency distribution
- Trend line over time
- Breakdown by surface (CLI vs UI vs API)

### Alert Response

**Warning (P90 > 5 minutes):**

1. Review block notification messaging
2. Check CLI command discoverability
3. Verify UI links are prominent

**Critical (P90 > 1 hour):**

1. Investigate user flows
2. Add proactive notifications
3. Review documentation and help text

---

## Metric 3: Support Minutes per Customer

**Name:** `stella_support_burden_minutes_total`
**Type:** Counter

### Definition

Accumulated support time per customer per month. This is a manual/semi-automated metric for solo operations tracking.

### Labels

| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `category` | `install`, `config`, `policy`, `integration`, `bug`, `other` | Support category |
| `month` | YYYY-MM | Month of support |

### Collection

Log support interactions using:

```bash
stella ops support log --tenant <id> --minutes <n> --category <cat>
```

Or via API:

```http
POST /v1/ops/support/log
{
  "tenant": "acme-corp",
  "minutes": 15,
  "category": "config"
}
```
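
Server-side, both entry points presumably end in a single counter increment. A minimal sketch with `System.Diagnostics.Metrics`; everything except the metric name and label set is an assumption:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Sketch only: class/meter names are assumptions; the metric name, unit,
// and label set come from this page.
public sealed class SupportBurdenRecorder
{
    private static readonly Meter Meter = new("StellaOps.Telemetry");

    private static readonly Counter<long> SupportMinutes =
        Meter.CreateCounter<long>(
            "stella_support_burden_minutes_total",
            unit: "min",
            description: "Accumulated support time per customer");

    // Backs the CLI command and the POST /v1/ops/support/log endpoint above.
    public void LogSupport(string tenant, long minutes, string category)
    {
        SupportMinutes.Add(
            minutes,
            new KeyValuePair<string, object?>("tenant", tenant),
            new KeyValuePair<string, object?>("category", category),
            // "month" label as YYYY-MM, per the label table above.
            new KeyValuePair<string, object?>(
                "month", DateTime.UtcNow.ToString("yyyy-MM")));
    }
}
```
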
### Why This Matters

This metric tracks operational scalability. For solo-scaled operations:

- Support burden should trend toward zero
- High support minutes indicate product gaps
- Categories identify areas needing improvement

### Dashboard Usage

The Grafana dashboard shows:

- Stacked bar chart by category
- Monthly trend per tenant
- Total support burden

### Alert Response

**Warning (> 30 min/month per tenant):**

1. Review support interactions for patterns
2. Identify documentation gaps
3. Create runbooks for common issues

**Critical (> 60 min/month per tenant):**

1. Escalate to product for feature work
2. Consider dedicated support time
3. Prioritize automation

---

## Metric 4: Determinism Regressions

**Name:** `stella_determinism_regressions_total`
**Type:** Counter

### Definition

Count of detected determinism failures in production (same inputs produced different outputs).

### Labels

| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `component` | `scanner`, `policy`, `attestor`, `export` | Component with regression |
| `severity` | `bitwise`, `semantic`, `policy` | Fidelity tier of regression |

### Severity Tiers

| Tier | Description | Impact |
|------|-------------|--------|
| `bitwise` | Byte-for-byte output differs | Low - cosmetic |
| `semantic` | Output semantically differs | Medium - potential confusion |
| `policy` | Policy decision differs | **Critical** - audit risk |

### Collection Points

1. **Scheduled verification jobs** - Regular determinism checks
2. **Replay verification failures** - User-initiated replays
3. **CI golden test failures** - Development-time detection

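Whichever collection point detects a mismatch, the counter is incremented with the affected component and severity tier. A minimal sketch under the same assumptions as the earlier ones (only the metric name and labels come from this page):

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Sketch only: class/meter names are assumptions; the metric name and
// label set come from this page.
public sealed class DeterminismRegressionRecorder
{
    private static readonly Meter Meter = new("StellaOps.Telemetry");

    private static readonly Counter<long> Regressions =
        Meter.CreateCounter<long>(
            "stella_determinism_regressions_total",
            description: "Detected determinism failures");

    // Called by verification jobs, replay verification, or CI bridges.
    public void RecordRegression(string tenant, string component, string severity)
    {
        Regressions.Add(
            1,
            new KeyValuePair<string, object?>("tenant", tenant),
            new KeyValuePair<string, object?>("component", component), // scanner|policy|attestor|export
            new KeyValuePair<string, object?>("severity", severity));  // bitwise|semantic|policy
    }
}
```
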
### Why This Matters

Determinism is a core moat. Regressions indicate:

- Non-deterministic code introduced
- External dependency changes
- Time-sensitive logic bugs

**Policy-level regressions are audit-breaking** and must be fixed immediately.

### Dashboard Usage

The Grafana dashboard shows:

- Counter with severity breakdown
- Alert status indicator
- Historical trend

### Alert Response

**Warning (any bitwise/semantic regression):**

1. Review recent deployments
2. Check for dependency updates
3. Investigate affected component

**Critical (any policy regression):**

1. **Immediate investigation required**
2. Consider rollback
3. Review all recent policy decisions
4. Notify affected customers

---

## Dashboard Access

The P0 metrics dashboard is available at:

```
/grafana/d/stella-p0-metrics
```

Or open it directly from the CLI:

```bash
stella ops dashboard p0
```

### Dashboard Features

- **Tenant selector** - Filter by specific tenant
- **Time range** - Adjust analysis window
- **SLO indicators** - Green/yellow/red status
- **Drill-down links** - Navigate to detailed views

---

## Alerting Configuration

Alerts are configured in `devops/telemetry/alerts/stella-p0-alerts.yml`.

### Alert Channels

Configure alert destinations in Grafana:

- Slack/Teams for warnings
- PagerDuty for critical alerts
- Email for summaries

### Silencing Alerts

During maintenance windows:

```bash
stella ops alerts silence --duration 2h --reason "Planned maintenance"
```

---

## Implementation Notes

### Source Files

| Component | Location |
|-----------|----------|
| Metric definitions | `src/Telemetry/StellaOps.Telemetry.Core/P0ProductMetrics.cs` |
| Install timestamp | `src/Telemetry/StellaOps.Telemetry.Core/InstallTimestampService.cs` |
| Dashboard template | `devops/telemetry/grafana/dashboards/stella-ops-p0-metrics.json` |
| Alert rules | `devops/telemetry/alerts/stella-p0-alerts.yml` |

### Adding Custom Metrics

To add additional P0-level metrics:

1. Define in `P0ProductMetrics.cs` (see the sketch below)
2. Add collection points in relevant services
3. Create dashboard panel in Grafana JSON
4. Add alert rules
5. Update this documentation

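For step 1, a hypothetical sketch of what a new definition might look like; the partial-class layout and meter name are assumptions, and the example metric itself is purely illustrative:

```csharp
using System.Diagnostics.Metrics;

// Hypothetical sketch of step 1: the partial-class layout and meter name
// are assumptions, and the example metric below is illustrative only.
public static partial class P0ProductMetrics
{
    private static readonly Meter Meter = new("StellaOps.Telemetry");

    // Example custom P0 metric; follow the stella_* naming used by the
    // four metrics above.
    public static readonly Counter<long> PolicyOverridesTotal =
        Meter.CreateCounter<long>(
            "stella_policy_overrides_total",
            description: "Manual policy gate overrides");
}
```
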
---
## Related

- [Observability Guide](observability.md)
- [Alerting Configuration](alerting.md)
- [Runbook: Metric Collection Issues](../../operations/runbooks/telemetry-metrics-ops.md)

---
_Last updated: 2026-01-17 (UTC)_