git.stella-ops.org/docs/doctor
master c58a236d70 Doctor plugin checks: implement health check classes and documentation
Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 12:28:00 +02:00

Stella Ops Doctor

Self-service diagnostics for Stella Ops deployments

Overview

The Doctor system provides comprehensive diagnostics for Stella Ops deployments, enabling operators, DevOps engineers, and developers to:

  • Diagnose what is working and what is not
  • Understand why failures occur with collected evidence
  • Remediate issues with copy/paste commands
  • Verify fixes with re-runnable checks

Quick Start

CLI

# Quick health check
stella doctor

# Full diagnostic with all checks
stella doctor --full

# Check specific category
stella doctor --category database

# Export report for support
stella doctor export --output diagnostic-bundle.zip

# Apply safe fixes from a report (dry-run unless --apply is passed)
stella doctor fix --from doctor-report.json --apply

UI

Navigate to /ops/doctor in the Stella Ops console to access the interactive Doctor Dashboard. Fix actions are exposed in the UI and mirror CLI commands; destructive steps are never executed by Doctor.

API

# Run diagnostics
POST /api/v1/doctor/run

# Generate AI-assisted diagnosis from Doctor report payloads or stored runs
POST /api/v1/doctor/diagnosis

# Get available checks
GET /api/v1/doctor/checks

# Stream results
WebSocket /api/v1/doctor/stream

# Manage scheduled doctor runs
GET/POST /api/v1/doctor/scheduler/schedules
PUT/DELETE /api/v1/doctor/scheduler/schedules/{scheduleId}

# Query scheduler trend data
GET /api/v1/doctor/scheduler/trends
GET /api/v1/doctor/scheduler/trends/checks/{checkId}
GET /api/v1/doctor/scheduler/trends/degrading
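
The read-only endpoints above can be called from a script. The following is a minimal sketch, not an official client: the `DOCTOR_BASE_URL` and `DOCTOR_TOKEN` variables, the bearer-token header, and the helper names `doctor_url` and `doctor_get` are all assumptions made for illustration. Check your deployment's authentication scheme before relying on it.

```shell
# Assumptions (not taken from the Doctor docs): DOCTOR_BASE_URL points at your
# deployment and DOCTOR_TOKEN holds a bearer token the API accepts.
DOCTOR_BASE_URL=${DOCTOR_BASE_URL:-https://stellaops.example.com}

# Build the full URL for a Doctor API path such as "checks" or
# "scheduler/trends/degrading". Leading/trailing slashes are normalized.
doctor_url() {
    echo "${DOCTOR_BASE_URL%/}/api/v1/doctor/${1#/}"
}

# GET a Doctor endpoint. -f makes HTTP errors a non-zero exit;
# -sS hides the progress meter but still prints errors.
doctor_get() {
    curl -fsS -H "Authorization: Bearer ${DOCTOR_TOKEN:-}" "$(doctor_url "$1")"
}
```

For example, `doctor_get checks` would request the list of available checks, and `doctor_get scheduler/trends/degrading` the degrading-trend data.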

Available Checks

The Doctor system includes more than 80 diagnostic checks across 10 plugins and plugin families:

Plugin                          Category              Checks  Description
stellaops.doctor.core           Core                  9       Configuration, runtime, disk, memory, time, crypto
stellaops.doctor.database       Database              8       Connectivity, migrations, schema, connection pool
stellaops.doctor.servicegraph   ServiceGraph          6       Gateway, routing, service health
stellaops.doctor.security       Security              9       OIDC, LDAP, TLS, Vault
stellaops.doctor.attestation    Security              4       Rekor connectivity, Cosign keys, clock skew, offline bundle
stellaops.doctor.verification   Security              5       Artifact pull, signatures, SBOM, VEX, policy engine
stellaops.doctor.scm.*          Integration.SCM       8       GitHub, GitLab connectivity/auth/permissions
stellaops.doctor.registry.*     Integration.Registry  6       Harbor, ECR connectivity/auth/pull
stellaops.doctor.observability  Observability         4       OTLP, logs, metrics
stellaops.doctor.timestamping   Security              22      RFC-3161 and eIDAS timestamping health

Setup Wizard Essential Checks

The following checks are mandatory for the setup wizard to validate a new installation:

  1. DB connectivity + schema version (stellaops.doctor.database)
    • check.db.connection - Database is reachable
    • check.db.schema.version - Schema version matches expected

  2. Attestation store availability (stellaops.doctor.attestation)
    • check.attestation.rekor.connectivity - Rekor transparency log reachable
    • check.attestation.cosign.keymaterial - Signing keys available (file/KMS/keyless)
    • check.attestation.clock.skew - System clock synchronized (<5s skew)

  3. Artifact verification pipeline (stellaops.doctor.verification)
    • check.verification.artifact.pull - Test artifact accessible by digest
    • check.verification.signature - DSSE signatures verifiable
    • check.verification.sbom.validation - SBOM (CycloneDX/SPDX) valid
    • check.verification.vex.validation - VEX document valid
    • check.verification.policy.engine - Policy evaluation passes
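
Outside the setup wizard, the same checks can be run one at a time with the `--check` flag shown under Common Commands. The sketch below is illustrative only: the function name `run_essential_checks` and the "highest exit code wins" policy are assumptions, and the check IDs are copied from the list above.

```shell
# Check IDs copied from the setup wizard list above.
ESSENTIAL_CHECKS="
check.db.connection
check.db.schema.version
check.attestation.rekor.connectivity
check.attestation.cosign.keymaterial
check.attestation.clock.skew
check.verification.artifact.pull
check.verification.signature
check.verification.sbom.validation
check.verification.vex.validation
check.verification.policy.engine
"

# Run each essential check individually, print its exit code, and
# return the highest exit code seen (0 means every check passed).
run_essential_checks() {
    worst=0
    for id in $ESSENTIAL_CHECKS; do
        rc=0
        stella doctor --check "$id" || rc=$?
        echo "$id -> exit $rc"
        if [ "$rc" -gt "$worst" ]; then
            worst=$rc
        fi
    done
    return "$worst"
}
```

Running every check even after a failure gives a complete picture of a new installation in one pass, at the cost of a slower run than stopping at the first failure.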

Check ID Convention

check.{category}.{subcategory}.{specific}

Examples:

  • check.config.required
  • check.database.migrations.pending
  • check.services.gateway.routing
  • check.integration.scm.github.auth
  • check.attestation.rekor.connectivity
  • check.verification.sbom.validation
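
Scripts that group or filter results may want the category segment of a check ID. The helpers below are a rough sketch: the names `is_check_id` and `check_id_category` are hypothetical, and the accepted character set (lowercase letters, digits, dots) is an assumption inferred from the examples above, which range from two to four segments after `check.`.

```shell
# Succeed if the ID looks like check.<segment>[.<segment>...].
# Assumed alphabet: lowercase letters, digits, and dots (inferred from the examples).
is_check_id() {
    case "$1" in
        check.*) ;;
        *) return 1 ;;
    esac
    # Reject empty segments, a trailing dot, and unexpected characters.
    case "$1" in
        *..*|*.|*[!a-z0-9.]*) return 1 ;;
    esac
    return 0
}

# Print the category, i.e. the segment immediately after "check.".
check_id_category() {
    is_check_id "$1" || return 1
    rest=${1#check.}
    echo "${rest%%.*}"
}
```

With these definitions, `check_id_category check.database.migrations.pending` prints `database`, and an ID such as `check..oops` is rejected.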

CLI Reference

See CLI Reference for complete command documentation.

Common Commands

# Quick health check (tagged 'quick' checks only)
stella doctor --quick

# Full diagnostic with all checks
stella doctor --full

# Filter by category
stella doctor --category database
stella doctor --category security

# Filter by plugin
stella doctor --plugin scm.github

# Run single check
stella doctor --check check.database.migrations.pending

# Output formats
stella doctor --format json
stella doctor --format markdown
stella doctor --format text

# Filter output by severity
stella doctor --severity fail,warn

# Export diagnostic bundle
stella doctor export --output diagnostic.zip
stella doctor export --include-logs --log-duration 4h

Exit Codes

Code  Meaning
0     All checks passed
1     One or more warnings
2     One or more failures
3     Doctor engine error
4     Invalid arguments
5     Timeout exceeded
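
In CI it can help to translate these codes into a message and a pass/fail decision. The following sketch mirrors the table above; the function names and the gating policy (warnings tolerated, everything else blocks) are assumptions for illustration, not Doctor behavior.

```shell
# Map a `stella doctor` exit code to the meaning listed in the table above.
doctor_exit_meaning() {
    case "$1" in
        0) echo "All checks passed" ;;
        1) echo "One or more warnings" ;;
        2) echo "One or more failures" ;;
        3) echo "Doctor engine error" ;;
        4) echo "Invalid arguments" ;;
        5) echo "Timeout exceeded" ;;
        *) echo "Unknown exit code: $1" ;;
    esac
}

# Hypothetical CI policy: accept a clean run or warnings (0, 1),
# block the pipeline on failures, engine errors, bad arguments, or timeouts.
doctor_gate() {
    case "$1" in
        0|1) return 0 ;;
        *)   return 1 ;;
    esac
}
```

A pipeline step might capture the code with `stella doctor --quick; code=$?`, log `doctor_exit_meaning "$code"`, and then call `doctor_gate "$code"` to decide whether to continue.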

Output Example

Stella Ops Doctor
=================

Running 47 checks across 8 plugins...

[PASS] check.config.required
  All required configuration values are present

[PASS] check.database.connectivity
  PostgreSQL connection successful (latency: 12ms)

[WARN] check.tls.certificates.expiry
  Diagnosis: TLS certificate expires in 14 days

  Evidence:
    Certificate: /etc/ssl/certs/stellaops.crt
    Subject: CN=stellaops.example.com
    Expires: 2026-01-26T00:00:00Z
    Days remaining: 14

  Likely Causes:
    1. Certificate renewal not scheduled
    2. ACME/Let's Encrypt automation not configured

  Fix Steps:
    # 1. Check current certificate
    openssl x509 -in /etc/ssl/certs/stellaops.crt -noout -dates

    # 2. Renew certificate (if using certbot)
    sudo certbot renew --cert-name stellaops.example.com

    # 3. Restart services to pick up new certificate
    sudo systemctl restart stellaops-gateway

  Runbook:
    https://docs.stella-ops.org/runbooks/tls-certificate-renewal

  Verification:
    stella doctor --check check.tls.certificates.expiry

[FAIL] check.database.migrations.pending
  Diagnosis: 3 pending release migrations detected in schema 'auth'

  Evidence:
    Schema: auth
    Current version: 099_add_dpop_thumbprints
    Pending migrations:
      - 100_add_tenant_quotas
      - 101_add_audit_retention
      - 102_add_session_revocation

  Likely Causes:
    1. Release migrations not applied before deployment
    2. Migration files added after last deployment

  Fix Steps:
    # 1. Backup database first (RECOMMENDED)
    pg_dump -h localhost -U stella_admin -d stellaops -F c \
      -f stellaops_backup_$(date +%Y%m%d_%H%M%S).dump

    # 2. Apply pending release migrations
    stella system migrations-run --module Authority --category release

    # 3. Verify migrations applied
    stella system migrations-status --module Authority

  Verification:
    stella doctor --check check.database.migrations.pending

--------------------------------------------------------------------------------
Summary: 44 passed, 2 warnings, 1 failed (47 total)
Duration: 8.3s
--------------------------------------------------------------------------------

Export Bundle

The Doctor export feature creates a diagnostic bundle for support escalation:

stella doctor export --output diagnostic-bundle.zip

The bundle contains:

  • doctor-report.json - Full diagnostic report
  • doctor-report.md - Human-readable report
  • environment.json - Environment information
  • system-info.json - System details (OS, runtime, memory)
  • config-sanitized.json - Sanitized configuration (secrets redacted)
  • logs/ - Recent log files (optional)
  • README.md - Bundle contents guide
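
Before attaching a bundle to a support ticket, a quick completeness check can catch a truncated export. This is a sketch under stated assumptions: the function name `missing_bundle_entries` is hypothetical, and it assumes the files listed above sit at the root of the archive. `logs/` is optional and therefore not required here.

```shell
# Read archive entry names on stdin (one per line) and print every
# expected entry from the list above that is absent.
# Assumption: entries are stored at the archive root, without a directory prefix.
missing_bundle_entries() {
    names=$(cat)
    for want in doctor-report.json doctor-report.md environment.json \
                system-info.json config-sanitized.json README.md; do
        printf '%s\n' "$names" | grep -qxF "$want" || echo "$want"
    done
}
```

One way to feed it is `unzip -Z1 diagnostic-bundle.zip | missing_bundle_entries`; empty output means nothing is missing. A bundle created with `--no-config` would be expected to report `config-sanitized.json`.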

Export Options

# Include logs from last 4 hours
stella doctor export --include-logs --log-duration 4h

# Exclude configuration
stella doctor export --no-config

# Custom output path
stella doctor export --output /tmp/support-bundle.zip

Security

Secret Redaction

All evidence output is sanitized. Sensitive values (passwords, tokens, connection strings) are replaced with ***REDACTED*** in:

  • Console output
  • JSON exports
  • Diagnostic bundles
  • Log files
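
To make the idea concrete, the filter below applies the same marker to text you collect yourself, for example extra logs pasted into a ticket. It is an illustration only and not Doctor's actual sanitizer: the function name `redact_secrets` and the key list are assumptions, and it handles only lowercase `key=value` and `key: value` forms (not quoted JSON keys).

```shell
# Replace the value following a lowercase password/token/secret key with
# ***REDACTED***. Values end at whitespace or ';' so that
# connection-string style input keeps its other fields.
redact_secrets() {
    sed -E 's/((password|token|secret)[[:space:]]*[=:][[:space:]]*)[^;[:space:]]+/\1***REDACTED***/g'
}
```

For instance, piping `Host=db;password=hunter2;Port=5432` through `redact_secrets` leaves the host and port intact and replaces only the password value.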

RBAC Permissions

Scope            Description
doctor:run       Execute doctor checks
doctor:run:full  Execute all checks including sensitive
doctor:export    Export diagnostic reports
admin:system     Access system-level checks

Plugin Development

To create a custom Doctor plugin, implement IDoctorPlugin:

public class MyCustomPlugin : IDoctorPlugin
{
    public string PluginId => "stellaops.doctor.custom";
    public string DisplayName => "Custom Checks";
    public Version Version => new(1, 0, 0);
    public DoctorCategory Category => DoctorCategory.Integration;

    public bool IsAvailable(IServiceProvider services) => true;

    public IReadOnlyList<IDoctorCheck> GetChecks(DoctorPluginContext context)
    {
        return new IDoctorCheck[]
        {
            new MyCustomCheck()
        };
    }

    public Task InitializeAsync(DoctorPluginContext context, CancellationToken ct)
        => Task.CompletedTask;
}

Implement checks using IDoctorCheck:

public class MyCustomCheck : IDoctorCheck
{
    public string CheckId => "check.custom.mycheck";
    public string Name => "My Custom Check";
    public string Description => "Validates custom configuration";
    public DoctorSeverity DefaultSeverity => DoctorSeverity.Fail;
    public IReadOnlyList<string> Tags => new[] { "custom", "quick" };
    public TimeSpan EstimatedDuration => TimeSpan.FromSeconds(2);

    public bool CanRun(DoctorPluginContext context) => true;

    public async Task<DoctorCheckResult> RunAsync(
        DoctorPluginContext context,
        CancellationToken ct)
    {
        // Perform check logic
        var isValid = await ValidateAsync(ct);

        if (isValid)
        {
            return DoctorCheckResult.Pass(
                checkId: CheckId,
                diagnosis: "Custom configuration is valid",
                evidence: new Evidence
                {
                    Description = "Validation passed",
                    Data = new Dictionary<string, string>
                    {
                        ["validated_at"] = context.TimeProvider.GetUtcNow().ToString("O")
                    }
                });
        }

        return DoctorCheckResult.Fail(
            checkId: CheckId,
            diagnosis: "Custom configuration is invalid",
            evidence: new Evidence
            {
                Description = "Validation failed",
                Data = new Dictionary<string, string>
                {
                    ["error"] = "Configuration file missing"
                }
            },
            remediation: new Remediation
            {
                Steps = new[]
                {
                    new RemediationStep
                    {
                        Order = 1,
                        Description = "Create configuration file",
                        Command = "cp /etc/stellaops/custom.yaml.sample /etc/stellaops/custom.yaml",
                        CommandType = CommandType.Shell
                    }
                }
            });
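    }

    // Illustrative placeholder (not part of IDoctorCheck): swap in your own validation.
    // Here it only tests for the file that the remediation step creates.
    private static Task<bool> ValidateAsync(CancellationToken ct)
        => Task.FromResult(System.IO.File.Exists("/etc/stellaops/custom.yaml"));
}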

Register the plugin in DI:

services.AddSingleton<IDoctorPlugin, MyCustomPlugin>();

Architecture

+------------------+     +------------------+     +------------------+
|       CLI        |     |        UI        |     |    External      |
|  stella doctor   |     |   /ops/doctor    |     |   Monitoring     |
+--------+---------+     +--------+---------+     +--------+---------+
         |                        |                        |
         v                        v                        v
+------------------------------------------------------------------------+
|                         Doctor API Layer                                |
|  POST /api/v1/doctor/run    GET /api/v1/doctor/checks                  |
|  GET /api/v1/doctor/report  WebSocket /api/v1/doctor/stream            |
+------------------------------------------------------------------------+
         |
         v
+------------------------------------------------------------------------+
|                      Doctor Engine (Core)                               |
|  +------------------+  +------------------+  +------------------+       |
|  | Check Registry   |  | Check Executor   |  | Report Generator |       |
|  | - Discovery      |  | - Parallel exec  |  | - JSON/MD/Text   |       |
|  | - Filtering      |  | - Timeout mgmt   |  | - Remediation    |       |
|  +------------------+  +------------------+  +------------------+       |
+------------------------------------------------------------------------+
         |
         v
+------------------------------------------------------------------------+
|                        Plugin System                                    |
+--------+---------+---------+---------+---------+---------+-------------+
         |         |         |         |         |         |
         v         v         v         v         v         v
+--------+  +------+  +------+  +------+  +------+  +------+  +----------+
| Core   |  | DB & |  |Service|  | SCM  |  |Regis-|  |Observ-|  |Security |
| Plugin |  |Migra-|  | Graph |  |Plugin|  | try  |  |ability|  | Plugin  |
|        |  | tions|  |Plugin |  |      |  |Plugin|  |Plugin |  |         |
+--------+  +------+  +------+  +------+  +------+  +------+  +----------+

Troubleshooting

Doctor Engine Error (Exit Code 3)

If stella doctor returns exit code 3:

  1. Check the error message for details
  2. Verify required services are running
  3. Check connectivity to databases
  4. Review logs at /var/log/stellaops/doctor.log

Timeout Exceeded (Exit Code 5)

If checks are timing out:

# Increase per-check timeout
stella doctor --timeout 60s

# Run with reduced parallelism
stella doctor --parallel 2
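
The two adjustments can be combined into an automatic retry. The wrapper below is a sketch: the name `doctor_with_timeout_retry` is hypothetical, and it assumes the caller has not already passed `--timeout` or `--parallel`. It uses only the flags and the exit code 5 documented above.

```shell
# Run Doctor once with the caller's arguments. If it exits with code 5
# (timeout exceeded), retry once with a longer per-check timeout and
# reduced parallelism, then return the final exit code.
doctor_with_timeout_retry() {
    rc=0
    stella doctor "$@" || rc=$?
    if [ "$rc" -eq 5 ]; then
        echo "Doctor timed out; retrying with --timeout 60s --parallel 2" >&2
        rc=0
        stella doctor "$@" --timeout 60s --parallel 2 || rc=$?
    fi
    return "$rc"
}
```

Called as `doctor_with_timeout_retry --category database`, it behaves like a plain `stella doctor --category database` unless the first attempt times out.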

Checks Not Found

If expected checks are not appearing:

  1. Verify plugin is registered in DI
  2. Check CanRun() returns true for your environment
  3. Review plugin initialization logs