notify doctors work, audit work, new product advisory sprints

This commit is contained in:
master
2026-01-13 08:36:29 +02:00
parent b8868a5f13
commit 9ca7cb183e
343 changed files with 24492 additions and 3544 deletions

docs/doctor/README.md Normal file

@@ -0,0 +1,416 @@
# Stella Ops Doctor
> Self-service diagnostics for Stella Ops deployments
## Overview
The Doctor system provides comprehensive diagnostics for Stella Ops deployments, enabling operators, DevOps engineers, and developers to:
- **Diagnose** what is working and what is not
- **Understand** why failures occur with collected evidence
- **Remediate** issues with copy/paste commands
- **Verify** fixes with re-runnable checks
## Quick Start
### CLI
```bash
# Quick health check
stella doctor
# Full diagnostic with all checks
stella doctor --full
# Check specific category
stella doctor --category database
# Export report for support
stella doctor export --output diagnostic-bundle.zip
```
### UI
Navigate to `/ops/doctor` in the Stella Ops console to access the interactive Doctor Dashboard.
### API
```text
# Run diagnostics
POST /api/v1/doctor/run
# Get available checks
GET /api/v1/doctor/checks
# Stream results
WebSocket /api/v1/doctor/stream
```
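The endpoints above can be called with any HTTP client. The following is a minimal sketch using `curl`, not a documented client: the helper names `doctor_api_url` and `doctor_api_run`, bearer-token authentication, and the empty request body are all assumptions, so check the API specification before relying on them.

```bash
# Hypothetical helper: join a console base URL and a Doctor API path,
# tolerating a single trailing slash on the base URL.
doctor_api_url() {
  local base="${1%/}" path="$2"
  printf '%s%s\n' "$base" "$path"
}

# Hypothetical helper: start a diagnostic run over HTTP.
# Assumptions: bearer-token authentication and no request body.
doctor_api_run() {
  local base="$1" token="$2"
  curl --fail --silent --show-error \
    --request POST \
    --header "Authorization: Bearer ${token}" \
    "$(doctor_api_url "$base" /api/v1/doctor/run)"
}

# Usage (placeholder host and token variable):
#   doctor_api_run "https://stella.example.com" "$STELLA_TOKEN"
```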
## Available Checks
The Doctor system includes 50 diagnostic checks across 7 plugin groups (the `scm.*` and `registry.*` rows each cover several provider plugins):
| Plugin | Category | Checks | Description |
|--------|----------|--------|-------------|
| `stellaops.doctor.core` | Core | 9 | Configuration, runtime, disk, memory, time, crypto |
| `stellaops.doctor.database` | Database | 8 | Connectivity, migrations, schema, connection pool |
| `stellaops.doctor.servicegraph` | ServiceGraph | 6 | Gateway, routing, service health |
| `stellaops.doctor.security` | Security | 9 | OIDC, LDAP, TLS, Vault |
| `stellaops.doctor.scm.*` | Integration.SCM | 8 | GitHub, GitLab connectivity/auth/permissions |
| `stellaops.doctor.registry.*` | Integration.Registry | 6 | Harbor, ECR connectivity/auth/pull |
| `stellaops.doctor.observability` | Observability | 4 | OTLP, logs, metrics |
### Check ID Convention
```
check.{category}.{subcategory}.{specific}
```
Examples:
- `check.config.required`
- `check.database.migrations.pending`
- `check.services.gateway.routing`
- `check.integration.scm.github.auth`
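The examples above range from three to five dot-separated segments, so a validator has to be looser than the four-part template suggests. Here is a minimal sketch for linting IDs in scripts. The function name `is_valid_check_id` and the allowed character set (lowercase letters, digits, hyphens) are assumptions, not rules taken from the Doctor engine.

```bash
# Hypothetical helper: succeeds when the argument looks like a Doctor check ID.
# Assumption: a "check." prefix followed by at least two segments made of
# lowercase letters, digits, or hyphens.
is_valid_check_id() {
  local re='^check(\.[a-z0-9][a-z0-9-]*){2,}$'
  [[ "$1" =~ $re ]]
}

# Usage:
#   is_valid_check_id "check.database.migrations.pending" && echo "looks valid"
```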
## CLI Reference
See [CLI Reference](./cli-reference.md) for complete command documentation.
### Common Commands
```bash
# Quick health check (tagged 'quick' checks only)
stella doctor --quick
# Full diagnostic with all checks
stella doctor --full
# Filter by category
stella doctor --category database
stella doctor --category security
# Filter by plugin
stella doctor --plugin scm.github
# Run single check
stella doctor --check check.database.migrations.pending
# Output formats
stella doctor --format json
stella doctor --format markdown
stella doctor --format text
# Filter output by severity
stella doctor --severity fail,warn
# Export diagnostic bundle
stella doctor export --output diagnostic.zip
stella doctor export --include-logs --log-duration 4h
```
## Exit Codes
| Code | Meaning |
|------|---------|
| 0 | All checks passed |
| 1 | One or more warnings, no failures |
| 2 | One or more failures |
| 3 | Doctor engine error |
| 4 | Invalid arguments |
| 5 | Timeout exceeded |
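Wrapper scripts often need the table above as a lookup. The sketch below shows one way to do that. The function names are hypothetical, and the policy in `doctor_blocks_deploy` (tolerate warnings, block on everything else) is an illustrative choice rather than a Stella Ops recommendation.

```bash
# Hypothetical helper: translate a `stella doctor` exit code into a short label.
doctor_exit_meaning() {
  case "$1" in
    0) echo "all checks passed" ;;
    1) echo "one or more warnings" ;;
    2) echo "one or more failures" ;;
    3) echo "doctor engine error" ;;
    4) echo "invalid arguments" ;;
    5) echo "timeout exceeded" ;;
    *) echo "unknown exit code: $1" ;;
  esac
}

# Hypothetical gate for deploy scripts: warnings (1) are tolerated,
# any code from 2 upward blocks the rollout.
doctor_blocks_deploy() {
  [ "$1" -ge 2 ]
}

# Usage:
#   stella doctor --quick; code=$?
#   echo "doctor: $(doctor_exit_meaning "$code")"
#   if doctor_blocks_deploy "$code"; then exit 1; fi
```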
## Output Example
```
Stella Ops Doctor
=================

Running 47 checks across 8 plugins...

[PASS] check.config.required
  All required configuration values are present

[PASS] check.database.connectivity
  PostgreSQL connection successful (latency: 12ms)

[WARN] check.tls.certificates.expiry
  Diagnosis: TLS certificate expires in 14 days
  Evidence:
    Certificate: /etc/ssl/certs/stellaops.crt
    Subject: CN=stellaops.example.com
    Expires: 2026-01-26T00:00:00Z
    Days remaining: 14
  Likely Causes:
    1. Certificate renewal not scheduled
    2. ACME/Let's Encrypt automation not configured
  Fix Steps:
    # 1. Check current certificate
    openssl x509 -in /etc/ssl/certs/stellaops.crt -noout -dates
    # 2. Renew certificate (if using certbot)
    sudo certbot renew --cert-name stellaops.example.com
    # 3. Restart services to pick up new certificate
    sudo systemctl restart stellaops-gateway
  Verification:
    stella doctor --check check.tls.certificates.expiry

[FAIL] check.database.migrations.pending
  Diagnosis: 3 pending release migrations detected in schema 'auth'
  Evidence:
    Schema: auth
    Current version: 099_add_dpop_thumbprints
    Pending migrations:
      - 100_add_tenant_quotas
      - 101_add_audit_retention
      - 102_add_session_revocation
  Likely Causes:
    1. Release migrations not applied before deployment
    2. Migration files added after last deployment
  Fix Steps:
    # 1. Backup database first (RECOMMENDED)
    pg_dump -h localhost -U stella_admin -d stellaops -F c \
      -f stellaops_backup_$(date +%Y%m%d_%H%M%S).dump
    # 2. Apply pending release migrations
    stella system migrations-run --module Authority --category release
    # 3. Verify migrations applied
    stella system migrations-status --module Authority
  Verification:
    stella doctor --check check.database.migrations.pending

--------------------------------------------------------------------------------
Summary: 44 passed, 2 warnings, 1 failed (47 total)
Duration: 8.3s
--------------------------------------------------------------------------------
```
## Export Bundle
The Doctor export feature creates a diagnostic bundle for support escalation:
```bash
stella doctor export --output diagnostic-bundle.zip
```
The bundle contains:
- `doctor-report.json` - Full diagnostic report
- `doctor-report.md` - Human-readable report
- `environment.json` - Environment information
- `system-info.json` - System details (OS, runtime, memory)
- `config-sanitized.json` - Sanitized configuration (secrets redacted)
- `logs/` - Recent log files (optional)
- `README.md` - Bundle contents guide
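Before sending a bundle to support, it can be worth confirming that the files listed above are actually present. The sketch below assumes the entries sit at the root of the archive and that Info-ZIP `unzip` is available for the listing. The helper name `missing_bundle_entries` is hypothetical.

```bash
# Hypothetical helper: reads archive entry names (one per line) on stdin and
# prints every expected bundle file that is absent. logs/ is optional, so it
# is not checked. Returns non-zero when something is missing.
missing_bundle_entries() {
  local expected=(
    doctor-report.json
    doctor-report.md
    environment.json
    system-info.json
    config-sanitized.json
    README.md
  )
  local listing name status=0
  listing="$(cat)"
  for name in "${expected[@]}"; do
    # -x matches whole lines, -F treats the name as a fixed string.
    if ! grep -qxF -- "$name" <<<"$listing"; then
      echo "$name"
      status=1
    fi
  done
  return "$status"
}

# Usage (unzip -Z1 prints one entry name per line):
#   unzip -Z1 diagnostic-bundle.zip | missing_bundle_entries
```

A bundle created with `--no-config` would be expected to report `config-sanitized.json` as missing.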
### Export Options
```bash
# Include logs from last 4 hours
stella doctor export --include-logs --log-duration 4h
# Exclude configuration
stella doctor export --no-config
# Custom output path
stella doctor export --output /tmp/support-bundle.zip
```
## Security
### Secret Redaction
All evidence output is sanitized. Sensitive values (passwords, tokens, connection strings) are replaced with `***REDACTED***` in:
- Console output
- JSON exports
- Diagnostic bundles
- Log files
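Doctor performs this redaction itself. When attaching extra files to a support ticket alongside the bundle, a similar masking pass can be applied by hand. The sketch below is illustrative only and is not the engine's redaction code: the `redact_sensitive` name, the `KEY=VALUE` input format, and the keyword list are all assumptions.

```bash
# Illustrative only: not the Doctor engine's redaction logic.
# Reads KEY=VALUE lines on stdin; values whose key contains "password",
# "token", "secret", or "connectionstring" (case-insensitive) are masked.
redact_sensitive() {
  local line key lowered
  while IFS= read -r line || [ -n "$line" ]; do
    key="${line%%=*}"
    lowered="$(printf '%s' "$key" | tr '[:upper:]' '[:lower:]')"
    case "$lowered" in
      *password*|*token*|*secret*|*connectionstring*)
        # Only rewrite lines that really contain a KEY=VALUE pair.
        if [ "$key" != "$line" ]; then
          line="${key}=***REDACTED***"
        fi
        ;;
    esac
    printf '%s\n' "$line"
  done
}

# Usage:
#   redact_sensitive < service.env > service.env.redacted
```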
### RBAC Permissions
| Scope | Description |
|-------|-------------|
| `doctor:run` | Execute doctor checks |
| `doctor:run:full` | Execute all checks, including sensitive ones |
| `doctor:export` | Export diagnostic reports |
| `admin:system` | Access system-level checks |
## Plugin Development
To create a custom Doctor plugin, implement `IDoctorPlugin`:
```csharp
public class MyCustomPlugin : IDoctorPlugin
{
    public string PluginId => "stellaops.doctor.custom";
    public string DisplayName => "Custom Checks";
    public Version Version => new(1, 0, 0);
    public DoctorCategory Category => DoctorCategory.Integration;

    public bool IsAvailable(IServiceProvider services) => true;

    public IReadOnlyList<IDoctorCheck> GetChecks(DoctorPluginContext context)
    {
        return new IDoctorCheck[]
        {
            new MyCustomCheck()
        };
    }

    public Task InitializeAsync(DoctorPluginContext context, CancellationToken ct)
        => Task.CompletedTask;
}
```
Implement checks using `IDoctorCheck`:
```csharp
public class MyCustomCheck : IDoctorCheck
{
    public string CheckId => "check.custom.mycheck";
    public string Name => "My Custom Check";
    public string Description => "Validates custom configuration";
    public DoctorSeverity DefaultSeverity => DoctorSeverity.Fail;
    public IReadOnlyList<string> Tags => new[] { "custom", "quick" };
    public TimeSpan EstimatedDuration => TimeSpan.FromSeconds(2);

    public bool CanRun(DoctorPluginContext context) => true;

    public async Task<DoctorCheckResult> RunAsync(
        DoctorPluginContext context,
        CancellationToken ct)
    {
        // Perform check logic
        var isValid = await ValidateAsync(ct);

        if (isValid)
        {
            return DoctorCheckResult.Pass(
                checkId: CheckId,
                diagnosis: "Custom configuration is valid",
                evidence: new Evidence
                {
                    Description = "Validation passed",
                    Data = new Dictionary<string, string>
                    {
                        ["validated_at"] = context.TimeProvider.GetUtcNow().ToString("O")
                    }
                });
        }

        return DoctorCheckResult.Fail(
            checkId: CheckId,
            diagnosis: "Custom configuration is invalid",
            evidence: new Evidence
            {
                Description = "Validation failed",
                Data = new Dictionary<string, string>
                {
                    ["error"] = "Configuration file missing"
                }
            },
            remediation: new Remediation
            {
                Steps = new[]
                {
                    new RemediationStep
                    {
                        Order = 1,
                        Description = "Create configuration file",
                        Command = "cp /etc/stellaops/custom.yaml.sample /etc/stellaops/custom.yaml",
                        CommandType = CommandType.Shell
                    }
                }
            });
    }

    // Placeholder validation (replace with your own logic): passes when the
    // configuration file referenced by the remediation step exists.
    private static Task<bool> ValidateAsync(CancellationToken ct)
        => Task.FromResult(System.IO.File.Exists("/etc/stellaops/custom.yaml"));
}
```
Register the plugin in DI:
```csharp
services.AddSingleton<IDoctorPlugin, MyCustomPlugin>();
```
## Architecture
```
+------------------+       +------------------+       +------------------+
|       CLI        |       |        UI        |       |     External     |
|  stella doctor   |       |   /ops/doctor    |       |    Monitoring    |
+--------+---------+       +--------+---------+       +--------+---------+
         |                          |                          |
         v                          v                          v
+------------------------------------------------------------------------+
|                            Doctor API Layer                            |
|  POST /api/v1/doctor/run           GET /api/v1/doctor/checks           |
|  GET /api/v1/doctor/report         WebSocket /api/v1/doctor/stream     |
+------------------------------------------------------------------------+
                                    |
                                    v
+------------------------------------------------------------------------+
|                          Doctor Engine (Core)                          |
|  +------------------+    +------------------+    +------------------+  |
|  |  Check Registry  |    |  Check Executor  |    | Report Generator |  |
|  | - Discovery      |    | - Parallel exec  |    | - JSON/MD/Text   |  |
|  | - Filtering      |    | - Timeout mgmt   |    | - Remediation    |  |
|  +------------------+    +------------------+    +------------------+  |
+------------------------------------------------------------------------+
                                    |
                                    v
+------------------------------------------------------------------------+
|                             Plugin System                              |
+---+---------+---------+----------+----------+----------+----------+----+
    |         |         |          |          |          |          |
    v         v         v          v          v          v          v
+-------+ +-------+ +--------+ +--------+ +--------+ +--------+ +--------+
| Core  | | DB &  | |Service | |  SCM   | |Registry| |Observ- | |Security|
|Plugin | |Migra- | | Graph  | | Plugin | | Plugin | |ability | | Plugin |
|       | | tions | | Plugin | |        | |        | | Plugin | |        |
+-------+ +-------+ +--------+ +--------+ +--------+ +--------+ +--------+
```
## Related Documentation
- [CLI Reference](./cli-reference.md) - Complete CLI command reference
- [Doctor Capabilities Specification](./doctor-capabilities.md) - Full technical specification
- [Plugin Development Guide](./plugin-development.md) - Creating custom plugins
## Troubleshooting
### Doctor Engine Error (Exit Code 3)
If `stella doctor` returns exit code 3:
1. Check the error message for details
2. Verify required services are running
3. Check connectivity to databases
4. Review logs at `/var/log/stellaops/doctor.log`
### Timeout Exceeded (Exit Code 5)
If checks are timing out:
```bash
# Increase per-check timeout
stella doctor --timeout 60s
# Run with reduced parallelism
stella doctor --parallel 2
```
### Checks Not Found
If expected checks are not appearing:
1. Verify plugin is registered in DI
2. Check `CanRun()` returns true for your environment
3. Review plugin initialization logs


@@ -0,0 +1,396 @@
# Doctor CLI Reference
> Complete reference for `stella doctor` commands
## Commands
### stella doctor
Run diagnostic checks.
```bash
stella doctor [options]
```
#### Options
| Option | Short | Type | Default | Description |
|--------|-------|------|---------|-------------|
| `--format` | `-f` | enum | `text` | Output format: `text`, `json`, `markdown` |
| `--quick` | `-q` | flag | false | Run only quick checks (tagged `quick`) |
| `--full` | | flag | false | Run all checks including slow/intensive |
| `--category` | `-c` | string[] | all | Filter by category |
| `--plugin` | `-p` | string[] | all | Filter by plugin ID |
| `--check` | | string | | Run single check by ID |
| `--severity` | `-s` | enum[] | all | Filter output by severity |
| `--timeout` | `-t` | duration | 30s | Per-check timeout |
| `--parallel` | | int | 4 | Max parallel check execution |
| `--no-remediation` | | flag | false | Skip remediation output |
| `--verbose` | `-v` | flag | false | Include detailed evidence |
#### Categories
- `core` - Configuration, runtime, system checks
- `database` - Database connectivity, migrations, pools
- `service-graph` - Service health, gateway, routing
- `security` - Authentication, TLS, secrets
- `integration` - SCM, registry integrations
- `observability` - Telemetry, logging, metrics
#### Examples
```bash
# Quick health check
stella doctor
# Full diagnostic
stella doctor --full
# Database checks only
stella doctor --category database
# GitHub integration checks
stella doctor --plugin scm.github
# Single check
stella doctor --check check.database.connectivity
# JSON output (for CI/CD)
stella doctor --format json
# Show only failures and warnings
stella doctor --severity fail,warn
# Markdown report
stella doctor --format markdown > doctor-report.md
# Verbose with all evidence
stella doctor --verbose
# Custom timeout and parallelism
stella doctor --timeout 60s --parallel 2
```
### stella doctor export
Generate a diagnostic bundle for support.
```bash
stella doctor export [options]
```
#### Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--output` | path | `diagnostic-bundle.zip` | Output file path |
| `--include-logs` | flag | false | Include recent log files |
| `--log-duration` | duration | `1h` | Duration of logs to include |
| `--no-config` | flag | false | Exclude configuration |
#### Duration Format
Duration values can be specified as:
- `30m` - 30 minutes
- `1h` - 1 hour
- `4h` - 4 hours
- `24h` or `1d` - 24 hours
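Scripts that compute or compare durations before passing them to the CLI need to normalize these values. The sketch below converts the forms listed above to seconds. The helper name `duration_to_seconds` is hypothetical, and the assumed grammar (one positive integer followed by one unit letter from `s`, `m`, `h`, `d`) is inferred from this document's examples. The CLI's own parser may accept more or less.

```bash
# Hypothetical helper: convert a duration such as 30m, 4h, or 1d to seconds.
# Assumption: a positive integer followed by a single unit letter (s, m, h, d).
duration_to_seconds() {
  local re='^([0-9]+)([smhd])$'
  [[ "$1" =~ $re ]] || { echo "invalid duration: $1" >&2; return 1; }
  # Force base 10 so values with leading zeros are not read as octal.
  local value=$(( 10#${BASH_REMATCH[1]} ))
  case "${BASH_REMATCH[2]}" in
    s) echo "$value" ;;
    m) echo $(( value * 60 )) ;;
    h) echo $(( value * 3600 )) ;;
    d) echo $(( value * 86400 )) ;;
  esac
}

# Usage:
#   duration_to_seconds 4h
```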
#### Examples
```bash
# Basic export
stella doctor export --output diagnostic.zip
# Include logs from last 4 hours
stella doctor export --include-logs --log-duration 4h
# Without configuration (for privacy)
stella doctor export --no-config
# Full bundle with logs
stella doctor export \
--output support-bundle.zip \
--include-logs \
--log-duration 24h
```
#### Bundle Contents
The export creates a ZIP archive containing:
```
diagnostic-bundle.zip
+-- README.md                # Bundle contents guide
+-- doctor-report.json       # Full diagnostic report
+-- doctor-report.md         # Human-readable report
+-- environment.json         # Environment information
+-- system-info.json         # System details
+-- config-sanitized.json    # Configuration (secrets redacted)
+-- logs/                    # Log files (if --include-logs)
    +-- stellaops-*.log
```
### stella doctor list
List available checks.
```bash
stella doctor list [options]
```
#### Options
| Option | Type | Description |
|--------|------|-------------|
| `--category` | string | Filter by category |
| `--plugin` | string | Filter by plugin |
| `--format` | enum | Output format: `text`, `json` |
#### Examples
```bash
# List all checks
stella doctor list
# List database checks
stella doctor list --category database
# List as JSON
stella doctor list --format json
```
## Exit Codes
| Code | Name | Description |
|------|------|-------------|
| 0 | `Success` | All checks passed |
| 1 | `Warnings` | One or more warnings, no failures |
| 2 | `Failures` | One or more checks failed |
| 3 | `EngineError` | Doctor engine error |
| 4 | `InvalidArgs` | Invalid command arguments |
| 5 | `Timeout` | Timeout exceeded |
### Using Exit Codes in Scripts
```bash
#!/bin/bash
stella doctor --format json > report.json
exit_code=$?

case "$exit_code" in
  0)
    echo "All checks passed"
    ;;
  1)
    echo "Warnings detected - review report"
    ;;
  2)
    echo "Failures detected - action required"
    exit 1
    ;;
  *)
    echo "Doctor error (code: $exit_code)"
    exit 1
    ;;
esac
```
## CI/CD Integration
### GitHub Actions
```yaml
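- name: Run Stella Doctor
  run: |
    # The default bash shell on GitHub runners uses -e, so capture the exit
    # code explicitly; otherwise a warning (exit code 1) aborts the step.
    exit_code=0
    stella doctor --format json --severity fail,warn > doctor-report.json || exit_code=$?
    # Exit code 2 means failed checks; 3-5 are engine, argument, or timeout errors.
    if [ "$exit_code" -ge 2 ]; then
      echo "::error::Doctor checks failed (exit code $exit_code)"
      cat doctor-report.json
      exit 1
    fi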
```
### GitLab CI
```yaml
doctor:
  stage: validate
  script:
    - stella doctor --format json > doctor-report.json
  artifacts:
    when: always
    paths:
      - doctor-report.json
  allow_failure:
    exit_codes:
      - 1  # Allow warnings
```
### Jenkins
```groovy
stage('Health Check') {
    steps {
        script {
            def result = sh(
                script: 'stella doctor --format json',
                returnStatus: true
            )
            if (result == 2) {
                error "Doctor checks failed"
            }
        }
    }
}
```
## Output Formats
### Text Format (Default)
Human-readable console output with colors and formatting.
```
Stella Ops Doctor
=================

Running 47 checks across 8 plugins...

[PASS] check.config.required
  All required configuration values are present

[FAIL] check.database.migrations.pending
  Diagnosis: 3 pending migrations in schema 'auth'
  Fix Steps:
    # Apply migrations
    stella system migrations-run --module Authority

--------------------------------------------------------------------------------
Summary: 46 passed, 0 warnings, 1 failed (47 total)
Duration: 8.3s
--------------------------------------------------------------------------------
```
### JSON Format
Machine-readable format for automation:
```json
{
  "summary": {
    "total": 47,
    "passed": 46,
    "warnings": 0,
    "failures": 1,
    "skipped": 0,
    "duration": "PT8.3S"
  },
  "executedAt": "2026-01-12T14:30:00Z",
  "checks": [
    {
      "checkId": "check.config.required",
      "pluginId": "stellaops.doctor.core",
      "category": "Core",
      "severity": "Pass",
      "diagnosis": "All required configuration values are present",
      "evidence": {
        "description": "Configuration validated",
        "data": {
          "configSource": "appsettings.json",
          "keysChecked": "42"
        }
      },
      "duration": "PT0.012S"
    },
    {
      "checkId": "check.database.migrations.pending",
      "pluginId": "stellaops.doctor.database",
      "category": "Database",
      "severity": "Fail",
      "diagnosis": "3 pending migrations in schema 'auth'",
      "evidence": {
        "description": "Migration status",
        "data": {
          "schema": "auth",
          "pendingCount": "3"
        }
      },
      "remediation": {
        "steps": [
          {
            "order": 1,
            "description": "Apply pending migrations",
            "command": "stella system migrations-run --module Authority",
            "commandType": "Shell"
          }
        ]
      },
      "duration": "PT0.234S"
    }
  ]
}
```
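For quick triage it is often enough to list the IDs of checks with a given severity. The sketch below does that with `awk` alone. It assumes the report is pretty-printed with one key per line and that `checkId` precedes `severity` within each check, as in the sample above. The helper name `checks_with_severity` is hypothetical, and a real JSON parser such as `jq` is the more robust choice for production pipelines.

```bash
# Hypothetical helper: print the IDs of checks with the given severity
# (default: Fail). Assumes one JSON key per line, with "checkId" appearing
# before "severity" inside each check object.
checks_with_severity() {
  local report="$1" severity="${2:-Fail}"
  awk -F'"' -v want="$severity" '
    /"checkId" *:/  { id = $4 }                      # remember the current check
    /"severity" *:/ { if ($4 == want) print id }     # emit it on a match
  ' "$report"
}

# Usage:
#   stella doctor --format json > report.json
#   checks_with_severity report.json Fail
```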
### Markdown Format
Formatted for documentation and reports:
````markdown
# Stella Ops Doctor Report
**Generated:** 2026-01-12T14:30:00Z
**Duration:** 8.3s
## Summary
| Status | Count |
|--------|-------|
| Passed | 46 |
| Warnings | 0 |
| Failures | 1 |
| Skipped | 0 |
| **Total** | **47** |
## Failed Checks
### check.database.migrations.pending
**Status:** FAIL
**Plugin:** stellaops.doctor.database
**Category:** Database
**Diagnosis:** 3 pending migrations in schema 'auth'
**Evidence:**
- Schema: auth
- Pending count: 3
**Fix Steps:**
1. Apply pending migrations
   ```bash
   stella system migrations-run --module Authority
   ```
## Passed Checks
- check.config.required
- check.database.connectivity
- ... (44 more)
````
## Environment Variables
| Variable | Description |
|----------|-------------|
| `STELLAOPS_DOCTOR_TIMEOUT` | Default per-check timeout |
| `STELLAOPS_DOCTOR_PARALLEL` | Default parallelism |
| `STELLAOPS_CONFIG_PATH` | Configuration file path |
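The table does not state how these variables interact with command-line flags. The sketch below shows one plausible resolution order, assuming an explicit value wins over the environment variable, which in turn wins over the documented defaults (`30s` and `4`). Both that precedence and the value format (`60s`, `2`) are assumptions, and `doctor_effective_settings` is a hypothetical helper rather than part of the CLI.

```bash
# Hypothetical helper: resolve the timeout and parallelism a run would use.
# Assumed precedence: explicit argument > environment variable > default.
doctor_effective_settings() {
  local timeout="${1:-${STELLAOPS_DOCTOR_TIMEOUT:-30s}}"
  local parallel="${2:-${STELLAOPS_DOCTOR_PARALLEL:-4}}"
  echo "--timeout ${timeout} --parallel ${parallel}"
}

# Usage (word splitting of the result is intentional here):
#   export STELLAOPS_DOCTOR_TIMEOUT=60s
#   stella doctor $(doctor_effective_settings)
```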
## See Also
- [Doctor Overview](./README.md)
- [Doctor Capabilities Specification](./doctor-capabilities.md)