Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Branch: master
Date: 2026-03-27 12:28:00 +02:00
Parent: fbd24e71de
Commit: c58a236d70
326 changed files with 18500 additions and 463 deletions


@@ -0,0 +1,84 @@
---
checkId: check.environment.capacity
plugin: stellaops.doctor.environment
severity: warn
tags: [environment, capacity, resources, cpu, memory, storage]
---
# Environment Capacity
## What It Checks
Queries the Release Orchestrator API (`/api/v1/environments/capacity`) and evaluates CPU, memory, storage, and deployment slot usage for every configured environment. Each resource is compared against two thresholds:
- **Warn** when usage >= 75%
- **Fail** when usage >= 90%
Deployment slot utilization is calculated as `activeDeployments / maxConcurrentDeployments * 100`. If no environments exist, the check passes with a note. If the orchestrator is unreachable, the check returns a warning.
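The threshold logic above can be sketched as a small shell helper (`classify_usage` is a hypothetical name; the thresholds and the slot formula are the ones documented here):

```bash
# Hypothetical helper mirroring the documented thresholds:
# warn at >= 75%, fail at >= 90%.
classify_usage() {
  local pct=$1
  if [ "$pct" -ge 90 ]; then
    echo "fail"
  elif [ "$pct" -ge 75 ]; then
    echo "warn"
  else
    echo "pass"
  fi
}

# Slot utilization: activeDeployments / maxConcurrentDeployments * 100
active=8
max=10
slot_pct=$(( active * 100 / max ))
echo "slots: ${slot_pct}% -> $(classify_usage "$slot_pct")"
```

The same helper applies unchanged to CPU, memory, and storage percentages, since the check uses one threshold pair for every resource dimension.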
## Why It Matters
Resource exhaustion in a target environment blocks deployments and can cause running services to crash or degrade. Detecting capacity pressure early gives operators time to scale up, clean up unused deployments, or redistribute workloads before an outage occurs. In production environments, exceeding 90% on any resource dimension is a leading indicator of imminent service disruption.
## Common Causes
- Gradual organic growth without corresponding resource scaling
- Runaway or leaked processes consuming CPU/memory
- Accumulated old deployments that were never cleaned up
- Resource limits set too tightly relative to actual workload
- Unexpected traffic spike or batch job saturating storage
## How to Fix
### Docker Compose
```bash
# Check current resource usage on the host
docker stats --no-stream
# Increase resource limits in docker-compose.stella-ops.yml
# Edit the target service under deploy.resources.limits:
# cpus: '4.0'
# memory: 8G
# Remove stopped containers to free deployment slots
docker container prune -f
# Restart with updated limits
docker compose -f docker-compose.stella-ops.yml up -d
```
### Bare Metal / systemd
```bash
# Check system resource usage
free -h && df -h && top -bn1 | head -20
# Increase memory/CPU limits in systemd unit overrides
sudo systemctl edit stellaops-environment-agent.service
# Add under [Service]:
# MemoryMax=8G
# CPUQuota=400%
sudo systemctl daemon-reload && sudo systemctl restart stellaops-environment-agent.service
# Clean up old deployments
stella env cleanup <environment-name>
```
### Kubernetes / Helm
```bash
# Check node resource usage
kubectl top nodes
kubectl top pods -n stellaops
# Scale up resources via Helm values
helm upgrade stellaops stellaops/stellaops \
--set environments.resources.limits.cpu=4 \
--set environments.resources.limits.memory=8Gi \
--set environments.maxConcurrentDeployments=20
# Or add more nodes to the cluster for horizontal scaling
```
## Verification
```bash
stella doctor run --check check.environment.capacity
```
## Related Checks
- `check.environment.deployments` - checks deployed service health, which may degrade under capacity pressure
- `check.environment.connectivity` - verifies agents are reachable, which capacity exhaustion can prevent


@@ -0,0 +1,98 @@
---
checkId: check.environment.connectivity
plugin: stellaops.doctor.environment
severity: warn
tags: [environment, connectivity, agent, network]
---
# Environment Connectivity
## What It Checks
Retrieves the list of environments from the Release Orchestrator (`/api/v1/environments`), then probes each environment agent's `/health` endpoint. For each agent the check measures:
- **Reachability** -- whether the health endpoint returns a success status code
- **Latency** -- warns if the response takes more than 500ms
- **TLS certificate validity** -- warns if the agent's TLS certificate expires within 30 days
- **Authentication** -- detects 401/403 responses indicating credential issues
If any agent is unreachable, the check fails. High latency or an expiring certificate produces a warning.
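The 30-day certificate rule can be reproduced locally with `openssl x509 -checkend` (the throwaway certificate below is generated purely for illustration; the real check inspects the agent's live TLS endpoint):

```bash
# Generate a throwaway self-signed cert valid for only 10 days,
# so it trips the 30-day expiry warning described above.
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /tmp/agent-key.pem -out /tmp/agent.pem \
  -days 10 -subj "/CN=agent.local" 2>/dev/null

# -checkend takes seconds: 30 days = 2592000 s.
# Exit status 1 means the certificate will expire within that window.
if ! openssl x509 -in /tmp/agent.pem -noout -checkend 2592000 >/dev/null; then
  echo "warn: certificate expires within 30 days"
fi
```

Pointing the same `-checkend` test at the certificate served by a live agent (via `openssl s_client`) gives a quick manual version of what the check automates.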
## Why It Matters
Environment agents are the control surface through which Stella Ops manages deployments, collects telemetry, and enforces policy. An unreachable agent means the platform cannot deploy to, monitor, or roll back services in that environment. TLS certificate expiry causes hard connectivity failures with no graceful degradation. High latency slows deployment pipelines and can cause timeouts in approval workflows.
## Common Causes
- Environment agent service is stopped or crashed
- Firewall rule change blocking the agent port
- Network partition between Stella Ops control plane and target environment
- TLS certificate not renewed before expiry
- Agent authentication credentials rotated without updating Stella Ops configuration
- DNS resolution failure for the agent hostname
## How to Fix
### Docker Compose
```bash
# Check if the environment agent container is running
docker ps --filter "name=environment-agent"
# View agent logs for errors
docker logs stellaops-environment-agent --tail 100
# Restart the agent
docker compose -f docker-compose.stella-ops.yml restart environment-agent
# If TLS cert is expiring, replace the certificate files
# mounted into the agent container and restart
cp /path/to/new/cert.pem devops/compose/certs/agent.pem
cp /path/to/new/key.pem devops/compose/certs/agent-key.pem
docker compose -f docker-compose.stella-ops.yml restart environment-agent
```
### Bare Metal / systemd
```bash
# Check agent service status
sudo systemctl status stellaops-environment-agent
# View logs
sudo journalctl -u stellaops-environment-agent --since "1 hour ago"
# Restart agent
sudo systemctl restart stellaops-environment-agent
# Renew TLS certificate
sudo cp /path/to/new/cert.pem /etc/stellaops/certs/agent.pem
sudo cp /path/to/new/key.pem /etc/stellaops/certs/agent-key.pem
sudo systemctl restart stellaops-environment-agent
# Test network connectivity from control plane
curl -v https://<agent-host>:<agent-port>/health
```
### Kubernetes / Helm
```bash
# Check agent pod status
kubectl get pods -n stellaops -l app=environment-agent
# View agent logs
kubectl logs -n stellaops -l app=environment-agent --tail=100
# Restart agent pods
kubectl rollout restart deployment/environment-agent -n stellaops
# Renew TLS certificate via cert-manager or manual secret update
kubectl create secret tls agent-tls \
--cert=/path/to/cert.pem \
--key=/path/to/key.pem \
-n stellaops --dry-run=client -o yaml | kubectl apply -f -
# Check network policies
kubectl get networkpolicies -n stellaops
```
## Verification
```bash
stella doctor run --check check.environment.connectivity
```
## Related Checks
- `check.environment.deployments` - checks health of services deployed via agents
- `check.environment.network.policy` - verifies network policies that may block agent connectivity
- `check.environment.secrets` - agent credentials may need rotation


@@ -0,0 +1,90 @@
---
checkId: check.environment.deployments
plugin: stellaops.doctor.environment
severity: warn
tags: [environment, deployment, services, health]
---
# Environment Deployment Health
## What It Checks
Queries the Release Orchestrator (`/api/v1/environments/deployments`) for all deployed services across all environments. Each service is evaluated for:
- **Status** -- `failed`, `stopped`, `degraded`, or healthy
- **Replica health** -- compares `healthyReplicas` against total `replicas`; partial health triggers degraded status
Severity escalation:
- **Fail** if any service, production or otherwise, has status `failed` (production environments are detected by an environment name containing "prod" and are called out explicitly)
- **Warn** if services are `degraded` (partial replica health)
- **Warn** if services are `stopped`
- **Pass** if all services are healthy
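The escalation rules above amount to a small decision table; a shell sketch (`classify_service` is a hypothetical name, not part of the actual check code):

```bash
# Hypothetical helper implementing the documented severity rules.
classify_service() {
  local status=$1 replicas=$2 healthy=$3
  if [ "$status" = "failed" ]; then
    echo "fail"
  elif [ "$status" = "stopped" ]; then
    echo "warn: stopped"
  elif [ "$healthy" -lt "$replicas" ]; then
    echo "warn: degraded ($healthy/$replicas replicas healthy)"
  else
    echo "pass"
  fi
}

classify_service running 3 2   # partial replica health -> degraded
classify_service failed  3 3   # failed status always wins -> fail
```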
## Why It Matters
Failed services in production directly impact end users and violate SLA commitments. Degraded services with partial replica health reduce fault tolerance and can cascade into full outages under load. Stopped services may indicate incomplete deployments or maintenance windows that were never closed. This check provides the earliest signal that a deployment rollout needs intervention.
## Common Causes
- Service crashed due to unhandled exception or OOM kill
- Deployment rolled out a bad image version
- Dependency (database, cache, message broker) became unavailable
- Resource exhaustion preventing replicas from starting
- Health check endpoint misconfigured, causing false failures
- Node failure taking down co-located replicas
## How to Fix
### Docker Compose
```bash
# Identify failed containers
docker ps -a --filter "status=exited" --filter "status=dead"
# View logs for the failed service
docker logs <container-name> --tail 200
# Restart the failed service
docker compose -f docker-compose.stella-ops.yml restart <service-name>
# If the image is bad, roll back to previous version
# Edit docker-compose.stella-ops.yml to pin the previous image tag
docker compose -f docker-compose.stella-ops.yml up -d <service-name>
```
### Bare Metal / systemd
```bash
# Check service status
sudo systemctl status stellaops-<service-name>
# View logs for crash details
sudo journalctl -u stellaops-<service-name> --since "30 minutes ago" --no-pager
# Restart the service
sudo systemctl restart stellaops-<service-name>
# Roll back to previous binary
sudo cp /opt/stellaops/backup/<service-name> /opt/stellaops/bin/<service-name>
sudo systemctl restart stellaops-<service-name>
```
### Kubernetes / Helm
```bash
# Check pod status across environments
kubectl get pods -n stellaops-<env> --field-selector=status.phase!=Running
# View events and logs for failing pods
kubectl describe pod <pod-name> -n stellaops-<env>
kubectl logs <pod-name> -n stellaops-<env> --previous
# Rollback a deployment
kubectl rollout undo deployment/<service-name> -n stellaops-<env>
# Or via Helm
helm rollback stellaops <previous-revision> -n stellaops-<env>
```
## Verification
```bash
stella doctor run --check check.environment.deployments
```
## Related Checks
- `check.environment.capacity` - resource exhaustion can cause deployment failures
- `check.environment.connectivity` - agent must be reachable to report deployment health
- `check.environment.drift` - configuration drift can cause services to fail after redeployment


@@ -0,0 +1,86 @@
---
checkId: check.environment.drift
plugin: stellaops.doctor.environment
severity: warn
tags: [environment, drift, configuration, consistency]
---
# Environment Drift Detection
## What It Checks
Queries the Release Orchestrator drift report API (`/api/v1/environments/drift`) and compares configuration snapshots across environments. The check requires at least 2 environments to perform comparison. Each drift item carries a severity classification:
- **Fail** if any drift is classified as `critical` (e.g., security-relevant configuration differences between staging and production)
- **Warn** if drifts exist but none are critical
- **Pass** if no configuration drift is detected between environments
Evidence includes the specific configuration keys that drifted and which environments are affected.
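The comparison the check performs can be approximated offline by diffing two configuration snapshots (the sample files and keys below are invented for illustration; the real report comes from the orchestrator API):

```bash
# Two invented environment snapshots with one drifted key.
cat > /tmp/staging.env <<'EOF'
LOG_LEVEL=info
TLS_MIN_VERSION=1.3
EOF
cat > /tmp/prod.env <<'EOF'
LOG_LEVEL=info
TLS_MIN_VERSION=1.2
EOF

# diff exits non-zero when the files differ, so guard with || true.
drift=$(diff /tmp/staging.env /tmp/prod.env || true)
if [ -n "$drift" ]; then
  echo "drift detected:"
  echo "$drift"
fi
```

A TLS-version mismatch like this one is exactly the kind of security-relevant difference the check would classify as `critical`.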
## Why It Matters
Configuration drift between environments undermines the core promise of promotion-based releases: that what you test in staging is what runs in production. Drift can cause subtle behavioral differences that only manifest under production load, making bugs nearly impossible to reproduce. Critical drift in security-related configuration (TLS settings, authentication, network policies) can create compliance violations and security exposures.
## Common Causes
- Manual configuration changes applied directly to one environment (bypassing the release pipeline)
- Failed deployment that left partial configuration in one environment
- Configuration sync job that did not propagate to all environments
- Environment restored from an outdated backup
- Intentional per-environment overrides that were not tracked as accepted exceptions
## How to Fix
### Docker Compose
```bash
# View the current drift report
stella env drift show
# Compare specific configuration between environments
diff <(docker exec stellaops-staging cat /app/appsettings.json) \
<(docker exec stellaops-prod cat /app/appsettings.json)
# Reconcile by redeploying from the canonical source
docker compose -f docker-compose.stella-ops.yml up -d --force-recreate <service>
# If drift is intentional, mark it as accepted
stella env drift accept <config-key>
```
### Bare Metal / systemd
```bash
# View drift report
stella env drift show
# Compare config files between environments
diff /etc/stellaops/staging/appsettings.json /etc/stellaops/prod/appsettings.json
# Reconcile by copying from source of truth
sudo cp /etc/stellaops/staging/appsettings.json /etc/stellaops/prod/appsettings.json
sudo systemctl restart stellaops-<service>
# Or accept drift as intentional
stella env drift accept <config-key>
```
### Kubernetes / Helm
```bash
# View drift between environments
stella env drift show
# Compare Helm values between environments
diff <(helm get values stellaops -n stellaops-staging -o yaml) \
<(helm get values stellaops -n stellaops-prod -o yaml)
# Reconcile by redeploying with consistent values
helm upgrade stellaops stellaops/stellaops -n stellaops-prod \
-f values-prod.yaml
# Compare ConfigMaps
kubectl diff -f configmap.yaml -n stellaops-prod
```
## Verification
```bash
stella doctor run --check check.environment.drift
```
## Related Checks
- `check.environment.deployments` - drift can cause service failures after redeployment
- `check.environment.secrets` - secret configuration differences between environments
- `check.environment.network.policy` - network policy drift is a security concern


@@ -0,0 +1,107 @@
---
checkId: check.environment.network.policy
plugin: stellaops.doctor.environment
severity: warn
tags: [environment, network, policy, security, isolation]
---
# Environment Network Policy
## What It Checks
Retrieves network policies from the Release Orchestrator (`/api/v1/environments/network-policies`) and evaluates isolation posture for each environment. The check enforces these rules:
- **Production environments must not allow ingress from dev** -- detected as critical violation
- **Production environments should use default-deny policies** -- missing default-deny is a warning
- **No environment should have wildcard ingress** (`*` or `0.0.0.0/0`) -- critical for production, warning for others
- **Wildcard egress** (`*` or `0.0.0.0/0`) is flagged as informational
Severity:
- **Fail** if any critical violations exist (prod ingress from dev, wildcard ingress on prod)
- **Warn** if only warning-level violations exist (missing default-deny, wildcard ingress on non-prod)
- **Warn** if no network policies are configured at all
- **Pass** if all policies are correctly configured
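The wildcard-ingress and default-deny rules can be sketched against a policy document (the JSON shape below is invented; the real check consumes the orchestrator's own policy schema):

```bash
# Invented policy JSON with two violations: wildcard ingress and no default-deny.
policy='{"environment":"prod","ingress":["0.0.0.0/0"],"defaultDeny":false}'

# Wildcard ingress (* or 0.0.0.0/0) is critical for production.
if echo "$policy" | grep -Eq '"ingress":\[[^]]*"(\*|0\.0\.0\.0/0)"'; then
  echo "critical: wildcard ingress on prod"
fi
# Missing default-deny is a warning.
if echo "$policy" | grep -q '"defaultDeny":false'; then
  echo "warn: default-deny not enabled"
fi
```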
## Why It Matters
Network isolation between environments is a fundamental security control. Allowing dev-to-production ingress means compromised development infrastructure can directly attack production services. Missing default-deny policies mean any new service added to the environment is implicitly network-accessible. Wildcard ingress exposes services to the entire network or internet. These misconfigurations are common audit findings that can block compliance certifications.
## Common Causes
- Network policies not yet defined for a new environment
- Legacy policy left in place from initial setup
- Production policy copied from dev without tightening rules
- Manual firewall rule change not reflected in Stella Ops policy
- Policy update deployed to staging but not promoted to production
## How to Fix
### Docker Compose
```bash
# Review current network policies
stella env network-policy list
# Create a default-deny policy for production
stella env network-policy create prod --default-deny
# Allow only staging ingress to production
stella env network-policy update prod --default-deny --allow-from staging
# Restrict egress to specific destinations
stella env network-policy update prod --egress-allow "10.0.0.0/8,registry.internal"
# In Docker Compose, use network isolation
# Define separate networks in docker-compose.stella-ops.yml:
# networks:
# prod-internal:
# internal: true
# staging-internal:
# internal: true
```
### Bare Metal / systemd
```bash
# Review current iptables/nftables rules
sudo iptables -L -n -v
# or
sudo nft list ruleset
# Apply default-deny for production network interface
sudo iptables -A INPUT -i prod0 -j DROP
sudo iptables -I INPUT -i prod0 -s <staging-subnet> -j ACCEPT
# Or configure via stellaops policy
stella env network-policy update prod --default-deny --allow-from staging
# Persist firewall rules
sudo netfilter-persistent save
```
### Kubernetes / Helm
```bash
# Review existing network policies
kubectl get networkpolicies -n stellaops-prod
# Apply default-deny via Helm
helm upgrade stellaops stellaops/stellaops \
--set environments.prod.networkPolicy.defaultDeny=true \
  --set 'environments.prod.networkPolicy.allowFrom[0]=stellaops-staging'
# Or apply a NetworkPolicy manifest directly
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-ingress
namespace: stellaops-prod
spec:
podSelector: {}
policyTypes:
- Ingress
EOF
```
## Verification
```bash
stella doctor run --check check.environment.network.policy
```
## Related Checks
- `check.environment.connectivity` - network policies can block agent connectivity if misconfigured
- `check.environment.drift` - network policy differences between environments are a form of drift
- `check.environment.secrets` - network isolation protects secret transmission


@@ -0,0 +1,94 @@
---
checkId: check.environment.secrets
plugin: stellaops.doctor.environment
severity: warn
tags: [environment, secrets, security, rotation, expiry]
---
# Environment Secret Health
## What It Checks
Queries the Release Orchestrator secrets status API (`/api/v1/environments/secrets/status`) for metadata about all configured secrets (no actual secret values are retrieved). Each secret is evaluated for:
- **Expiry** -- secrets already expired, expiring within 7 days (critical), or expiring within 30 days (warning)
- **Rotation compliance** -- if a rotation policy is defined, checks whether the time since `lastRotated` exceeds the policy interval by more than a 10% grace margin
Severity escalation:
- **Fail** if any secret has expired, or any production secret is expiring within 7 days
- **Warn** if secrets are expiring within 30 days or rotation is overdue
- **Pass** if all secrets are healthy
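The 10% rotation grace described above is simple arithmetic; a sketch with illustrative numbers (a 90-day policy, last rotated 105 days ago):

```bash
# Illustrative values: a 90-day rotation policy with a 10% grace margin.
interval_days=90
grace_days=$(( interval_days / 10 ))          # 9 days of grace
days_since_rotation=105

if [ "$days_since_rotation" -gt $(( interval_days + grace_days )) ]; then
  overdue=$(( days_since_rotation - interval_days - grace_days ))
  echo "warn: rotation overdue by ${overdue} days past grace"
fi
```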
## Why It Matters
Expired secrets cause immediate authentication and authorization failures. Services that depend on expired credentials will fail to connect to databases, registries, external APIs, and other integrations. In production, this means outages. Secrets expiring within 7 days require urgent rotation to prevent imminent failures. Overdue rotation violates security policies and increases the blast radius of a credential compromise.
## Common Causes
- Secret expired without automated rotation being configured
- Rotation job failed silently (scheduler down, permissions changed)
- Secret provider (Vault, Key Vault) connection lost during rotation window
- Manual secret set with fixed expiry and no follow-up rotation
- Rotation policy interval shorter than actual rotation cadence
## How to Fix
### Docker Compose
```bash
# List secrets with expiry status
stella env secrets list --expiring
# Rotate an expired or expiring secret immediately
stella env secrets rotate <environment> <secret-name>
# Check secret provider connectivity
stella secrets provider status
# Update secret in .env file for compose deployments
# Edit devops/compose/.env with the new secret value
# Then restart affected services
docker compose -f docker-compose.stella-ops.yml restart <service>
```
### Bare Metal / systemd
```bash
# List secrets with expiry details
stella env secrets list --expiring
# Rotate expired secret
stella env secrets rotate <environment> <secret-name>
# If using file-based secrets, update the file
sudo vi /etc/stellaops/secrets/<secret-name>
sudo chmod 600 /etc/stellaops/secrets/<secret-name>
sudo systemctl restart stellaops-<service>
# Schedule automated rotation
stella env secrets rotate-scheduled --days 7
```
### Kubernetes / Helm
```bash
# List expiring secrets
stella env secrets list --expiring
# Rotate secret and update Kubernetes secret
stella env secrets rotate <environment> <secret-name>
# Or update manually
kubectl create secret generic <secret-name> \
--from-literal=value=<new-value> \
-n stellaops-<env> --dry-run=client -o yaml | kubectl apply -f -
# Restart pods to pick up new secret
kubectl rollout restart deployment/<service> -n stellaops-<env>
# For external-secrets-operator, trigger a refresh
kubectl annotate externalsecret <name> -n stellaops force-sync=$(date +%s) --overwrite
```
## Verification
```bash
stella doctor run --check check.environment.secrets
```
## Related Checks
- `check.environment.connectivity` - expired agent credentials cause connectivity failures
- `check.environment.deployments` - services fail when their secrets expire
- `check.integration.secrets.manager` - verifies the secrets manager itself is healthy