git.stella-ops.org/docs/doctor/README.md
2026-01-13 18:53:39 +02:00


# Stella Ops Doctor
> Self-service diagnostics for Stella Ops deployments
## Overview
The Doctor system provides comprehensive diagnostics for Stella Ops deployments, enabling operators, DevOps engineers, and developers to:
- **Diagnose** what is working and what is not
- **Understand** why failures occur with collected evidence
- **Remediate** issues with copy/paste commands
- **Verify** fixes with re-runnable checks
## Quick Start
### CLI
```bash
# Quick health check
stella doctor
# Full diagnostic with all checks
stella doctor --full
# Check specific category
stella doctor --category database
# Export report for support
stella doctor export --output diagnostic-bundle.zip
# Apply safe fixes from a report (dry-run unless --apply is passed)
stella doctor fix --from doctor-report.json --apply
```
### UI
Navigate to `/ops/doctor` in the Stella Ops console to access the interactive Doctor Dashboard.
Fix actions are exposed in the UI and mirror CLI commands; destructive steps are never executed by Doctor.
### API
```text
# Run diagnostics
POST /api/v1/doctor/run
# Get available checks
GET /api/v1/doctor/checks
# Stream results
WebSocket /api/v1/doctor/stream
```
## Available Checks
The Doctor system includes 48+ diagnostic checks, grouped into the 7 plugin families below (the `scm.*` and `registry.*` rows each cover several plugins):
| Plugin | Category | Checks | Description |
|--------|----------|--------|-------------|
| `stellaops.doctor.core` | Core | 9 | Configuration, runtime, disk, memory, time, crypto |
| `stellaops.doctor.database` | Database | 8 | Connectivity, migrations, schema, connection pool |
| `stellaops.doctor.servicegraph` | ServiceGraph | 6 | Gateway, routing, service health |
| `stellaops.doctor.security` | Security | 9 | OIDC, LDAP, TLS, Vault |
| `stellaops.doctor.scm.*` | Integration.SCM | 8 | GitHub, GitLab connectivity/auth/permissions |
| `stellaops.doctor.registry.*` | Integration.Registry | 6 | Harbor, ECR connectivity/auth/pull |
| `stellaops.doctor.observability` | Observability | 4 | OTLP, logs, metrics |
### Check ID Convention
```
check.{category}.{subcategory}.{specific}
```
Examples:
- `check.config.required`
- `check.database.migrations.pending`
- `check.services.gateway.routing`
- `check.integration.scm.github.auth`
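A check ID can be sanity-checked against this convention before it is passed to `--check`. The sketch below is an assumption based only on the examples above: it accepts `check.` followed by two or more lowercase, dot-separated segments (the `check.config.required` example suggests the subcategory may be omitted). `is_valid_check_id` is a hypothetical helper, not part of the `stella` CLI, and the allowed character set is a guess.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the ID looks like check.{segment}.{segment}[...]
# Assumption: segments are lowercase and may contain digits, '_' or '-'.
is_valid_check_id() {
  local id="$1"
  local re='^check(\.[a-z][a-z0-9_-]*){2,}$'
  [[ "$id" =~ $re ]]
}

is_valid_check_id "check.database.migrations.pending" && echo "ok"
is_valid_check_id "database.migrations" || echo "rejected"
```

If the real convention allows other characters, only the `re` pattern needs to change.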
## CLI Reference
See [CLI Reference](./cli-reference.md) for complete command documentation.
### Common Commands
```bash
# Quick health check (tagged 'quick' checks only)
stella doctor --quick
# Full diagnostic with all checks
stella doctor --full
# Filter by category
stella doctor --category database
stella doctor --category security
# Filter by plugin
stella doctor --plugin scm.github
# Run single check
stella doctor --check check.database.migrations.pending
# Output formats
stella doctor --format json
stella doctor --format markdown
stella doctor --format text
# Filter output by severity
stella doctor --severity fail,warn
# Export diagnostic bundle
stella doctor export --output diagnostic.zip
stella doctor export --include-logs --log-duration 4h
```
## Exit Codes
| Code | Meaning |
|------|---------|
| 0 | All checks passed |
| 1 | One or more warnings |
| 2 | One or more failures |
| 3 | Doctor engine error |
| 4 | Invalid arguments |
| 5 | Timeout exceeded |
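In a CI pipeline these codes can drive a simple gate. The following is a minimal sketch, assuming `stella` is on `PATH`; `doctor_exit_label` and `doctor_gate` are hypothetical helpers, and the policy of tolerating warnings is an example choice rather than a recommendation from this document.

```bash
#!/usr/bin/env bash
# Map a Doctor exit code (per the table above) to a short label.
doctor_exit_label() {
  case "$1" in
    0) echo "passed" ;;
    1) echo "warnings" ;;
    2) echo "failures" ;;
    3) echo "engine-error" ;;
    4) echo "invalid-arguments" ;;
    5) echo "timeout" ;;
    *) echo "unknown" ;;
  esac
}

# Example policy (an assumption): tolerate warnings, block on everything else.
doctor_gate() {
  case "$1" in
    0|1) return 0 ;;
    *)   return 1 ;;
  esac
}

# Usage:
#   code=0; stella doctor --quick || code=$?
#   echo "doctor: $(doctor_exit_label "$code")"
#   doctor_gate "$code" || exit "$code"
```

Capturing the code with `|| code=$?` keeps the script alive under `set -e` so the label can still be printed.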
## Output Example
```
Stella Ops Doctor
=================
Running 47 checks across 8 plugins...
[PASS] check.config.required
  All required configuration values are present
[PASS] check.database.connectivity
  PostgreSQL connection successful (latency: 12ms)
[WARN] check.tls.certificates.expiry
  Diagnosis: TLS certificate expires in 14 days
  Evidence:
    Certificate: /etc/ssl/certs/stellaops.crt
    Subject: CN=stellaops.example.com
    Expires: 2026-01-26T00:00:00Z
    Days remaining: 14
  Likely Causes:
    1. Certificate renewal not scheduled
    2. ACME/Let's Encrypt automation not configured
  Fix Steps:
    # 1. Check current certificate
    openssl x509 -in /etc/ssl/certs/stellaops.crt -noout -dates
    # 2. Renew certificate (if using certbot)
    sudo certbot renew --cert-name stellaops.example.com
    # 3. Restart services to pick up new certificate
    sudo systemctl restart stellaops-gateway
  Verification:
    stella doctor --check check.tls.certificates.expiry
[FAIL] check.database.migrations.pending
  Diagnosis: 3 pending release migrations detected in schema 'auth'
  Evidence:
    Schema: auth
    Current version: 099_add_dpop_thumbprints
    Pending migrations:
      - 100_add_tenant_quotas
      - 101_add_audit_retention
      - 102_add_session_revocation
  Likely Causes:
    1. Release migrations not applied before deployment
    2. Migration files added after last deployment
  Fix Steps:
    # 1. Backup database first (RECOMMENDED)
    pg_dump -h localhost -U stella_admin -d stellaops -F c \
      -f stellaops_backup_$(date +%Y%m%d_%H%M%S).dump
    # 2. Apply pending release migrations
    stella system migrations-run --module Authority --category release
    # 3. Verify migrations applied
    stella system migrations-status --module Authority
  Verification:
    stella doctor --check check.database.migrations.pending
--------------------------------------------------------------------------------
Summary: 44 passed, 2 warnings, 1 failed (47 total)
Duration: 8.3s
--------------------------------------------------------------------------------
```
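For quick scripting against the text format, result lines can be tallied by their status prefix. This is a rough sketch that assumes every result starts with `[PASS]`, `[WARN]`, or `[FAIL]` as in the sample above; `doctor_tally` is a hypothetical helper, and the text layout is not a documented contract.

```bash
#!/usr/bin/env bash
# Count result lines by status in Doctor text output read from stdin.
# Assumption: each result line begins with [PASS], [WARN] or [FAIL]
# (leading indentation is tolerated).
doctor_tally() {
  awk '
    /^[[:space:]]*\[PASS\]/ { pass++ }
    /^[[:space:]]*\[WARN\]/ { warn++ }
    /^[[:space:]]*\[FAIL\]/ { fail++ }
    END { printf "pass=%d warn=%d fail=%d\n", pass, warn, fail }
  '
}

printf '%s\n' \
  '[PASS] check.config.required' \
  '[WARN] check.tls.certificates.expiry' \
  '[FAIL] check.database.migrations.pending' | doctor_tally
# prints: pass=1 warn=1 fail=1
```

Against a live system this would be used as `stella doctor --format text | doctor_tally`. For anything beyond a quick count, the JSON format is likely the more stable input.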
## Export Bundle
The Doctor export feature creates a diagnostic bundle for support escalation:
```bash
stella doctor export --output diagnostic-bundle.zip
```
The bundle contains:
- `doctor-report.json` - Full diagnostic report
- `doctor-report.md` - Human-readable report
- `environment.json` - Environment information
- `system-info.json` - System details (OS, runtime, memory)
- `config-sanitized.json` - Sanitized configuration (secrets redacted)
- `logs/` - Recent log files (optional)
- `README.md` - Bundle contents guide
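Before sending a bundle to support, it can be worth confirming that the expected files are present. The sketch below reads a file listing on stdin and prints any expected entry that is missing. It assumes the entries sit at the archive root under exactly the names listed above; `missing_bundle_entries` is a hypothetical helper, and `logs/` is skipped because it is optional.

```bash
#!/usr/bin/env bash
# Print the expected bundle entries that do not appear in a listing on stdin.
# Expected names are taken from the list above; logs/ is optional and not checked.
missing_bundle_entries() {
  local listing expected
  listing="$(cat)"
  for expected in doctor-report.json doctor-report.md environment.json \
                  system-info.json config-sanitized.json README.md; do
    # -x matches whole lines only, -F treats the name as a fixed string
    if ! grep -qxF -- "$expected" <<<"$listing"; then
      echo "$expected"
    fi
  done
}

# Usage (assumes Info-ZIP unzip is installed; -Z1 lists one entry name per line):
#   unzip -Z1 diagnostic-bundle.zip | missing_bundle_entries
```

An empty result means nothing is missing. A bundle created with `--no-config` would presumably report `config-sanitized.json` as missing, which is expected in that case.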
### Export Options
```bash
# Include logs from last 4 hours
stella doctor export --include-logs --log-duration 4h
# Exclude configuration
stella doctor export --no-config
# Custom output path
stella doctor export --output /tmp/support-bundle.zip
```
## Security
### Secret Redaction
All evidence output is sanitized. Sensitive values (passwords, tokens, connection strings) are replaced with `***REDACTED***` in:
- Console output
- JSON exports
- Diagnostic bundles
- Log files
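The exact redaction rules are not documented here. Purely as an illustration of the idea, the sketch below replaces the value of `key=value` or `key: value` pairs whose key looks sensitive. It is not Doctor's implementation: the `redact` function and the key list (`password`, `token`, `secret`) are assumptions.

```bash
#!/usr/bin/env bash
# Illustration only: mask values of sensitive-looking keys in text from stdin.
# The key list and the value delimiters (whitespace, ';', ',') are assumptions.
redact() {
  sed -E 's/((password|token|secret)[[:space:]]*[=:][[:space:]]*)[^[:space:];,]+/\1***REDACTED***/g'
}

echo 'Host=db;password=hunter2;Timeout=30' | redact
# prints: Host=db;password=***REDACTED***;Timeout=30
```

A pattern like this is case-sensitive and misses quoted or multi-line values, which is one reason to rely on Doctor's built-in sanitization rather than post-processing output by hand.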
### RBAC Permissions
| Scope | Description |
|-------|-------------|
| `doctor:run` | Execute doctor checks |
| `doctor:run:full` | Execute all checks, including sensitive ones |
| `doctor:export` | Export diagnostic reports |
| `admin:system` | Access system-level checks |
## Plugin Development
To create a custom Doctor plugin, implement `IDoctorPlugin`:
```csharp
public class MyCustomPlugin : IDoctorPlugin
{
    public string PluginId => "stellaops.doctor.custom";
    public string DisplayName => "Custom Checks";
    public Version Version => new(1, 0, 0);
    public DoctorCategory Category => DoctorCategory.Integration;

    // Return false to hide the plugin in environments where it does not apply.
    public bool IsAvailable(IServiceProvider services) => true;

    public IReadOnlyList<IDoctorCheck> GetChecks(DoctorPluginContext context)
    {
        return new IDoctorCheck[]
        {
            new MyCustomCheck()
        };
    }

    public Task InitializeAsync(DoctorPluginContext context, CancellationToken ct)
        => Task.CompletedTask;
}
```
Implement checks using `IDoctorCheck`:
```csharp
public class MyCustomCheck : IDoctorCheck
{
    public string CheckId => "check.custom.mycheck";
    public string Name => "My Custom Check";
    public string Description => "Validates custom configuration";
    public DoctorSeverity DefaultSeverity => DoctorSeverity.Fail;
    public IReadOnlyList<string> Tags => new[] { "custom", "quick" };
    public TimeSpan EstimatedDuration => TimeSpan.FromSeconds(2);

    public bool CanRun(DoctorPluginContext context) => true;

    public async Task<DoctorCheckResult> RunAsync(
        DoctorPluginContext context,
        CancellationToken ct)
    {
        // Perform check logic
        var isValid = await ValidateAsync(ct);
        if (isValid)
        {
            return DoctorCheckResult.Pass(
                checkId: CheckId,
                diagnosis: "Custom configuration is valid",
                evidence: new Evidence
                {
                    Description = "Validation passed",
                    Data = new Dictionary<string, string>
                    {
                        ["validated_at"] = context.TimeProvider.GetUtcNow().ToString("O")
                    }
                });
        }

        return DoctorCheckResult.Fail(
            checkId: CheckId,
            diagnosis: "Custom configuration is invalid",
            evidence: new Evidence
            {
                Description = "Validation failed",
                Data = new Dictionary<string, string>
                {
                    ["error"] = "Configuration file missing"
                }
            },
            remediation: new Remediation
            {
                Steps = new[]
                {
                    new RemediationStep
                    {
                        Order = 1,
                        Description = "Create configuration file",
                        Command = "cp /etc/stellaops/custom.yaml.sample /etc/stellaops/custom.yaml",
                        CommandType = CommandType.Shell
                    }
                }
            });
    }

    // Placeholder validation (illustrative only): replace with your own logic.
    private static Task<bool> ValidateAsync(CancellationToken ct)
        => Task.FromResult(System.IO.File.Exists("/etc/stellaops/custom.yaml"));
}
```
Register the plugin in DI:
```csharp
services.AddSingleton<IDoctorPlugin, MyCustomPlugin>();
```
## Architecture
```
+------------------+        +------------------+        +------------------+
|       CLI        |        |        UI        |        |     External     |
|  stella doctor   |        |   /ops/doctor    |        |    Monitoring    |
+--------+---------+        +--------+---------+        +--------+---------+
         |                           |                           |
         v                           v                           v
+--------------------------------------------------------------------------+
|                             Doctor API Layer                             |
|  POST /api/v1/doctor/run         GET /api/v1/doctor/checks               |
|  GET /api/v1/doctor/report       WebSocket /api/v1/doctor/stream         |
+--------------------------------------------------------------------------+
                                     |
                                     v
+--------------------------------------------------------------------------+
|                           Doctor Engine (Core)                           |
|   +------------------+    +------------------+    +------------------+   |
|   | Check Registry   |    | Check Executor   |    | Report Generator |   |
|   | - Discovery      |    | - Parallel exec  |    | - JSON/MD/Text   |   |
|   | - Filtering      |    | - Timeout mgmt   |    | - Remediation    |   |
|   +------------------+    +------------------+    +------------------+   |
+--------------------------------------------------------------------------+
                                     |
                                     v
+--------------------------------------------------------------------------+
|                              Plugin System                               |
+---+----------+----------+----------+----------+----------+----------+----+
    |          |          |          |          |          |          |
    v          v          v          v          v          v          v
+--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+
| Core   | | DB &   | | Service| | SCM    | |Registry| | Observ-| |Security|
| Plugin | | Migra- | | Graph  | | Plugin | | Plugin | | ability| | Plugin |
|        | | tions  | | Plugin | |        | |        | | Plugin | |        |
+--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+
```
## Related Documentation
- [CLI Reference](./cli-reference.md) - Complete CLI command reference
- [Doctor Capabilities Specification](./doctor-capabilities.md) - Full technical specification
- [Plugin Development Guide](./plugin-development.md) - Creating custom plugins
## Troubleshooting
### Doctor Engine Error (Exit Code 3)
If `stella doctor` returns exit code 3:
1. Check the error message for details
2. Verify required services are running
3. Check connectivity to databases
4. Review logs at `/var/log/stellaops/doctor.log`
### Timeout Exceeded (Exit Code 5)
If checks are timing out:
```bash
# Increase per-check timeout
stella doctor --timeout 60s
# Run with reduced parallelism
stella doctor --parallel 2
```
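The two flags above can be combined into an automatic retry. The wrapper below is a sketch, not a built-in feature: `doctor_with_retry` is a hypothetical function that reruns Doctor once when it exits with code 5, and the `60s` and `2` values simply mirror the examples above.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run Doctor once; on exit code 5 (timeout), retry a single
# time with a longer per-check timeout and reduced parallelism.
doctor_with_retry() {
  local code=0
  stella doctor "$@" || code=$?
  if [ "$code" -eq 5 ]; then
    echo "doctor timed out; retrying with --timeout 60s --parallel 2" >&2
    code=0
    stella doctor "$@" --timeout 60s --parallel 2 || code=$?
  fi
  # Propagate the final exit code so callers can still act on it.
  return "$code"
}

# Usage:
#   doctor_with_retry --category database
```

Retrying only on code 5 keeps real failures (code 2) and engine errors (code 3) visible on the first run.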
### Checks Not Found
If expected checks are not appearing:
1. Verify that the plugin is registered in DI
2. Check that the plugin's `IsAvailable()` and the check's `CanRun()` return true for your environment
3. Review plugin initialization logs