Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Branch: master
Date: 2026-03-27 12:28:00 +02:00
Parent: fbd24e71de
Commit: c58a236d70
326 changed files with 18500 additions and 463 deletions


@@ -0,0 +1,84 @@
---
checkId: check.environment.capacity
plugin: stellaops.doctor.environment
severity: warn
tags: [environment, capacity, resources, cpu, memory, storage]
---
# Environment Capacity
## What It Checks
Queries the Release Orchestrator API (`/api/v1/environments/capacity`) and evaluates CPU, memory, storage, and deployment slot usage for every configured environment. Each resource is compared against two thresholds:
- **Warn** when usage >= 75%
- **Fail** when usage >= 90%
Deployment slot utilization is calculated as `activeDeployments / maxConcurrentDeployments * 100`. If no environments exist, the check passes with a note. If the orchestrator is unreachable, the check returns a warning.
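The threshold logic above can be sketched as a small shell helper (`classify_usage` is a hypothetical name; the thresholds and the slot formula are the ones documented here):

```bash
# Hypothetical helper mirroring the documented thresholds:
# warn at >= 75%, fail at >= 90%.
classify_usage() {
  local pct=$1
  if [ "$pct" -ge 90 ]; then
    echo "fail"
  elif [ "$pct" -ge 75 ]; then
    echo "warn"
  else
    echo "pass"
  fi
}

# Slot utilization: activeDeployments / maxConcurrentDeployments * 100
active=8
max=10
slot_pct=$(( active * 100 / max ))
echo "slots: ${slot_pct}% -> $(classify_usage "$slot_pct")"
```

The same helper applies unchanged to CPU, memory, and storage percentages, since the check uses one threshold pair for every resource dimension.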
## Why It Matters
Resource exhaustion in a target environment blocks deployments and can cause running services to crash or degrade. Detecting capacity pressure early gives operators time to scale up, clean up unused deployments, or redistribute workloads before an outage occurs. In production environments, exceeding 90% on any resource dimension is a leading indicator of imminent service disruption.
## Common Causes
- Gradual organic growth without corresponding resource scaling
- Runaway or leaked processes consuming CPU/memory
- Accumulated old deployments that were never cleaned up
- Resource limits set too tightly relative to actual workload
- Unexpected traffic spike or batch job saturating storage
## How to Fix
### Docker Compose
```bash
# Check current resource usage on the host
docker stats --no-stream
# Increase resource limits in docker-compose.stella-ops.yml
# Edit the target service under deploy.resources.limits:
# cpus: '4.0'
# memory: 8G
# Remove stopped containers to free deployment slots
docker container prune -f
# Restart with updated limits
docker compose -f docker-compose.stella-ops.yml up -d
```
### Bare Metal / systemd
```bash
# Check system resource usage
free -h && df -h && top -bn1 | head -20
# Increase memory/CPU limits in systemd unit overrides
sudo systemctl edit stellaops-environment-agent.service
# Add under [Service]:
# MemoryMax=8G
# CPUQuota=400%
sudo systemctl daemon-reload && sudo systemctl restart stellaops-environment-agent.service
# Clean up old deployments
stella env cleanup <environment-name>
```
### Kubernetes / Helm
```bash
# Check node resource usage
kubectl top nodes
kubectl top pods -n stellaops
# Scale up resources via Helm values
helm upgrade stellaops stellaops/stellaops \
--set environments.resources.limits.cpu=4 \
--set environments.resources.limits.memory=8Gi \
--set environments.maxConcurrentDeployments=20
# Or add more nodes to the cluster for horizontal scaling
```
## Verification
```bash
stella doctor run --check check.environment.capacity
```
## Related Checks
- `check.environment.deployments` - checks deployed service health, which may degrade under capacity pressure
- `check.environment.connectivity` - verifies agents are reachable, which capacity exhaustion can prevent


@@ -0,0 +1,98 @@
---
checkId: check.environment.connectivity
plugin: stellaops.doctor.environment
severity: warn
tags: [environment, connectivity, agent, network]
---
# Environment Connectivity
## What It Checks
Retrieves the list of environments from the Release Orchestrator (`/api/v1/environments`), then probes each environment agent's `/health` endpoint. For each agent the check measures:
- **Reachability** -- whether the health endpoint returns a success status code
- **Latency** -- warns if the response takes more than 500ms
- **TLS certificate validity** -- warns if the agent's TLS certificate expires within 30 days
- **Authentication** -- detects 401/403 responses indicating credential issues
If any agent is unreachable, the check fails. High latency or an expiring certificate produces a warning.
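The 30-day certificate rule can be reproduced locally with `openssl x509 -checkend` (the throwaway certificate below is generated purely for illustration; the real check inspects the agent's live TLS endpoint):

```bash
# Generate a throwaway self-signed cert valid for only 10 days,
# so it trips the 30-day expiry warning described above.
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /tmp/agent-key.pem -out /tmp/agent.pem \
  -days 10 -subj "/CN=agent.local" 2>/dev/null

# -checkend takes seconds: 30 days = 2592000 s.
# Exit status 1 means the certificate will expire within that window.
if ! openssl x509 -in /tmp/agent.pem -noout -checkend 2592000 >/dev/null; then
  echo "warn: certificate expires within 30 days"
fi
```

Pointing the same `-checkend` test at the certificate served by a live agent (via `openssl s_client`) gives a quick manual version of what the check automates.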
## Why It Matters
Environment agents are the control surface through which Stella Ops manages deployments, collects telemetry, and enforces policy. An unreachable agent means the platform cannot deploy to, monitor, or roll back services in that environment. TLS certificate expiry causes hard connectivity failures with no graceful degradation. High latency slows deployment pipelines and can cause timeouts in approval workflows.
## Common Causes
- Environment agent service is stopped or crashed
- Firewall rule change blocking the agent port
- Network partition between Stella Ops control plane and target environment
- TLS certificate not renewed before expiry
- Agent authentication credentials rotated without updating Stella Ops configuration
- DNS resolution failure for the agent hostname
## How to Fix
### Docker Compose
```bash
# Check if the environment agent container is running
docker ps --filter "name=environment-agent"
# View agent logs for errors
docker logs stellaops-environment-agent --tail 100
# Restart the agent
docker compose -f docker-compose.stella-ops.yml restart environment-agent
# If TLS cert is expiring, replace the certificate files
# mounted into the agent container and restart
cp /path/to/new/cert.pem devops/compose/certs/agent.pem
cp /path/to/new/key.pem devops/compose/certs/agent-key.pem
docker compose -f docker-compose.stella-ops.yml restart environment-agent
```
### Bare Metal / systemd
```bash
# Check agent service status
sudo systemctl status stellaops-environment-agent
# View logs
sudo journalctl -u stellaops-environment-agent --since "1 hour ago"
# Restart agent
sudo systemctl restart stellaops-environment-agent
# Renew TLS certificate
sudo cp /path/to/new/cert.pem /etc/stellaops/certs/agent.pem
sudo cp /path/to/new/key.pem /etc/stellaops/certs/agent-key.pem
sudo systemctl restart stellaops-environment-agent
# Test network connectivity from control plane
curl -v https://<agent-host>:<agent-port>/health
```
### Kubernetes / Helm
```bash
# Check agent pod status
kubectl get pods -n stellaops -l app=environment-agent
# View agent logs
kubectl logs -n stellaops -l app=environment-agent --tail=100
# Restart agent pods
kubectl rollout restart deployment/environment-agent -n stellaops
# Renew TLS certificate via cert-manager or manual secret update
kubectl create secret tls agent-tls \
--cert=/path/to/cert.pem \
--key=/path/to/key.pem \
-n stellaops --dry-run=client -o yaml | kubectl apply -f -
# Check network policies
kubectl get networkpolicies -n stellaops
```
## Verification
```bash
stella doctor run --check check.environment.connectivity
```
## Related Checks
- `check.environment.deployments` - checks health of services deployed via agents
- `check.environment.network.policy` - verifies network policies that may block agent connectivity
- `check.environment.secrets` - agent credentials may need rotation


@@ -0,0 +1,90 @@
---
checkId: check.environment.deployments
plugin: stellaops.doctor.environment
severity: warn
tags: [environment, deployment, services, health]
---
# Environment Deployment Health
## What It Checks
Queries the Release Orchestrator (`/api/v1/environments/deployments`) for all deployed services across all environments. Each service is evaluated for:
- **Status** -- `failed`, `stopped`, `degraded`, or healthy
- **Replica health** -- compares `healthyReplicas` against total `replicas`; partial health triggers degraded status
Severity escalation:
- **Fail** if any service, production or otherwise, has status `failed` (production environments are detected by an environment name containing "prod" and are called out explicitly)
- **Warn** if services are `degraded` (partial replica health)
- **Warn** if services are `stopped`
- **Pass** if all services are healthy
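The escalation rules above amount to a small decision table; a shell sketch (`classify_service` is a hypothetical name, not part of the actual check code):

```bash
# Hypothetical helper implementing the documented severity rules.
classify_service() {
  local status=$1 replicas=$2 healthy=$3
  if [ "$status" = "failed" ]; then
    echo "fail"
  elif [ "$status" = "stopped" ]; then
    echo "warn: stopped"
  elif [ "$healthy" -lt "$replicas" ]; then
    echo "warn: degraded ($healthy/$replicas replicas healthy)"
  else
    echo "pass"
  fi
}

classify_service running 3 2   # partial replica health -> degraded
classify_service failed  3 3   # failed status always wins -> fail
```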
## Why It Matters
Failed services in production directly impact end users and violate SLA commitments. Degraded services with partial replica health reduce fault tolerance and can cascade into full outages under load. Stopped services may indicate incomplete deployments or maintenance windows that were never closed. This check provides the earliest signal that a deployment rollout needs intervention.
## Common Causes
- Service crashed due to unhandled exception or OOM kill
- Deployment rolled out a bad image version
- Dependency (database, cache, message broker) became unavailable
- Resource exhaustion preventing replicas from starting
- Health check endpoint misconfigured, causing false failures
- Node failure taking down co-located replicas
## How to Fix
### Docker Compose
```bash
# Identify failed containers
docker ps -a --filter "status=exited" --filter "status=dead"
# View logs for the failed service
docker logs <container-name> --tail 200
# Restart the failed service
docker compose -f docker-compose.stella-ops.yml restart <service-name>
# If the image is bad, roll back to previous version
# Edit docker-compose.stella-ops.yml to pin the previous image tag
docker compose -f docker-compose.stella-ops.yml up -d <service-name>
```
### Bare Metal / systemd
```bash
# Check service status
sudo systemctl status stellaops-<service-name>
# View logs for crash details
sudo journalctl -u stellaops-<service-name> --since "30 minutes ago" --no-pager
# Restart the service
sudo systemctl restart stellaops-<service-name>
# Roll back to previous binary
sudo cp /opt/stellaops/backup/<service-name> /opt/stellaops/bin/<service-name>
sudo systemctl restart stellaops-<service-name>
```
### Kubernetes / Helm
```bash
# Check pod status across environments
kubectl get pods -n stellaops-<env> --field-selector=status.phase!=Running
# View events and logs for failing pods
kubectl describe pod <pod-name> -n stellaops-<env>
kubectl logs <pod-name> -n stellaops-<env> --previous
# Rollback a deployment
kubectl rollout undo deployment/<service-name> -n stellaops-<env>
# Or via Helm
helm rollback stellaops <previous-revision> -n stellaops-<env>
```
## Verification
```bash
stella doctor run --check check.environment.deployments
```
## Related Checks
- `check.environment.capacity` - resource exhaustion can cause deployment failures
- `check.environment.connectivity` - agent must be reachable to report deployment health
- `check.environment.drift` - configuration drift can cause services to fail after redeployment


@@ -0,0 +1,86 @@
---
checkId: check.environment.drift
plugin: stellaops.doctor.environment
severity: warn
tags: [environment, drift, configuration, consistency]
---
# Environment Drift Detection
## What It Checks
Queries the Release Orchestrator drift report API (`/api/v1/environments/drift`) and compares configuration snapshots across environments. The check requires at least 2 environments to perform comparison. Each drift item carries a severity classification:
- **Fail** if any drift is classified as `critical` (e.g., security-relevant configuration differences between staging and production)
- **Warn** if drifts exist but none are critical
- **Pass** if no configuration drift is detected between environments
Evidence includes the specific configuration keys that drifted and which environments are affected.
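The comparison the check performs can be approximated offline by diffing two configuration snapshots (the sample files and keys below are invented for illustration; the real report comes from the orchestrator API):

```bash
# Two invented environment snapshots with one drifted key.
cat > /tmp/staging.env <<'EOF'
LOG_LEVEL=info
TLS_MIN_VERSION=1.3
EOF
cat > /tmp/prod.env <<'EOF'
LOG_LEVEL=info
TLS_MIN_VERSION=1.2
EOF

# diff exits non-zero when the files differ, so guard with || true.
drift=$(diff /tmp/staging.env /tmp/prod.env || true)
if [ -n "$drift" ]; then
  echo "drift detected:"
  echo "$drift"
fi
```

A TLS-version mismatch like this one is exactly the kind of security-relevant difference the check would classify as `critical`.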
## Why It Matters
Configuration drift between environments undermines the core promise of promotion-based releases: that what you test in staging is what runs in production. Drift can cause subtle behavioral differences that only manifest under production load, making bugs nearly impossible to reproduce. Critical drift in security-related configuration (TLS settings, authentication, network policies) can create compliance violations and security exposures.
## Common Causes
- Manual configuration changes applied directly to one environment (bypassing the release pipeline)
- Failed deployment that left partial configuration in one environment
- Configuration sync job that did not propagate to all environments
- Environment restored from an outdated backup
- Intentional per-environment overrides that were not tracked as accepted exceptions
## How to Fix
### Docker Compose
```bash
# View the current drift report
stella env drift show
# Compare specific configuration between environments
diff <(docker exec stellaops-staging cat /app/appsettings.json) \
<(docker exec stellaops-prod cat /app/appsettings.json)
# Reconcile by redeploying from the canonical source
docker compose -f docker-compose.stella-ops.yml up -d --force-recreate <service>
# If drift is intentional, mark it as accepted
stella env drift accept <config-key>
```
### Bare Metal / systemd
```bash
# View drift report
stella env drift show
# Compare config files between environments
diff /etc/stellaops/staging/appsettings.json /etc/stellaops/prod/appsettings.json
# Reconcile by copying from source of truth
sudo cp /etc/stellaops/staging/appsettings.json /etc/stellaops/prod/appsettings.json
sudo systemctl restart stellaops-<service>
# Or accept drift as intentional
stella env drift accept <config-key>
```
### Kubernetes / Helm
```bash
# View drift between environments
stella env drift show
# Compare Helm values between environments
diff <(helm get values stellaops -n stellaops-staging -o yaml) \
<(helm get values stellaops -n stellaops-prod -o yaml)
# Reconcile by redeploying with consistent values
helm upgrade stellaops stellaops/stellaops -n stellaops-prod \
-f values-prod.yaml
# Compare ConfigMaps
kubectl diff -f configmap.yaml -n stellaops-prod
```
## Verification
```bash
stella doctor run --check check.environment.drift
```
## Related Checks
- `check.environment.deployments` - drift can cause service failures after redeployment
- `check.environment.secrets` - secret configuration differences between environments
- `check.environment.network.policy` - network policy drift is a security concern


@@ -0,0 +1,107 @@
---
checkId: check.environment.network.policy
plugin: stellaops.doctor.environment
severity: warn
tags: [environment, network, policy, security, isolation]
---
# Environment Network Policy
## What It Checks
Retrieves network policies from the Release Orchestrator (`/api/v1/environments/network-policies`) and evaluates isolation posture for each environment. The check enforces these rules:
- **Production environments must not allow ingress from dev** -- detected as critical violation
- **Production environments should use default-deny policies** -- missing default-deny is a warning
- **No environment should have wildcard ingress** (`*` or `0.0.0.0/0`) -- critical for production, warning for others
- **Wildcard egress** (`*` or `0.0.0.0/0`) is flagged as informational
Severity:
- **Fail** if any critical violations exist (prod ingress from dev, wildcard ingress on prod)
- **Warn** if only warning-level violations exist (missing default-deny, wildcard ingress on non-prod)
- **Warn** if no network policies are configured at all
- **Pass** if all policies are correctly configured
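The wildcard-ingress and default-deny rules can be sketched against a policy document (the JSON shape below is invented; the real check consumes the orchestrator's own policy schema):

```bash
# Invented policy JSON with two violations: wildcard ingress and no default-deny.
policy='{"environment":"prod","ingress":["0.0.0.0/0"],"defaultDeny":false}'

# Wildcard ingress (* or 0.0.0.0/0) is critical for production.
if echo "$policy" | grep -Eq '"ingress":\[[^]]*"(\*|0\.0\.0\.0/0)"'; then
  echo "critical: wildcard ingress on prod"
fi
# Missing default-deny is a warning.
if echo "$policy" | grep -q '"defaultDeny":false'; then
  echo "warn: default-deny not enabled"
fi
```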
## Why It Matters
Network isolation between environments is a fundamental security control. Allowing dev-to-production ingress means compromised development infrastructure can directly attack production services. Missing default-deny policies mean any new service added to the environment is implicitly network-accessible. Wildcard ingress exposes services to the entire network or internet. These misconfigurations are common audit findings that can block compliance certifications.
## Common Causes
- Network policies not yet defined for a new environment
- Legacy policy left in place from initial setup
- Production policy copied from dev without tightening rules
- Manual firewall rule change not reflected in Stella Ops policy
- Policy update deployed to staging but not promoted to production
## How to Fix
### Docker Compose
```bash
# Review current network policies
stella env network-policy list
# Create a default-deny policy for production
stella env network-policy create prod --default-deny
# Allow only staging ingress to production
stella env network-policy update prod --default-deny --allow-from staging
# Restrict egress to specific destinations
stella env network-policy update prod --egress-allow "10.0.0.0/8,registry.internal"
# In Docker Compose, use network isolation
# Define separate networks in docker-compose.stella-ops.yml:
# networks:
# prod-internal:
# internal: true
# staging-internal:
# internal: true
```
### Bare Metal / systemd
```bash
# Review current iptables/nftables rules
sudo iptables -L -n -v
# or
sudo nft list ruleset
# Apply default-deny for production network interface
sudo iptables -A INPUT -i prod0 -j DROP
sudo iptables -I INPUT -i prod0 -s <staging-subnet> -j ACCEPT
# Or configure via stellaops policy
stella env network-policy update prod --default-deny --allow-from staging
# Persist firewall rules
sudo netfilter-persistent save
```
### Kubernetes / Helm
```bash
# Review existing network policies
kubectl get networkpolicies -n stellaops-prod
# Apply default-deny via Helm
helm upgrade stellaops stellaops/stellaops \
--set environments.prod.networkPolicy.defaultDeny=true \
  --set 'environments.prod.networkPolicy.allowFrom[0]=stellaops-staging'
# Or apply a NetworkPolicy manifest directly
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-ingress
namespace: stellaops-prod
spec:
podSelector: {}
policyTypes:
- Ingress
EOF
```
## Verification
```bash
stella doctor run --check check.environment.network.policy
```
## Related Checks
- `check.environment.connectivity` - network policies can block agent connectivity if misconfigured
- `check.environment.drift` - network policy differences between environments are a form of drift
- `check.environment.secrets` - network isolation protects secret transmission


@@ -0,0 +1,94 @@
---
checkId: check.environment.secrets
plugin: stellaops.doctor.environment
severity: warn
tags: [environment, secrets, security, rotation, expiry]
---
# Environment Secret Health
## What It Checks
Queries the Release Orchestrator secrets status API (`/api/v1/environments/secrets/status`) for metadata about all configured secrets (no actual secret values are retrieved). Each secret is evaluated for:
- **Expiry** -- secrets already expired, expiring within 7 days (critical), or expiring within 30 days (warning)
- **Rotation compliance** -- if a rotation policy is defined, checks whether the time since `lastRotated` exceeds the policy interval by more than a 10% grace margin
Severity escalation:
- **Fail** if any secret has expired, or any production secret is expiring within 7 days
- **Warn** if secrets are expiring within 30 days or rotation is overdue
- **Pass** if all secrets are healthy
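The 10% rotation grace described above is simple arithmetic; a sketch with illustrative numbers (a 90-day policy, last rotated 105 days ago):

```bash
# Illustrative values: a 90-day rotation policy with a 10% grace margin.
interval_days=90
grace_days=$(( interval_days / 10 ))          # 9 days of grace
days_since_rotation=105

if [ "$days_since_rotation" -gt $(( interval_days + grace_days )) ]; then
  overdue=$(( days_since_rotation - interval_days - grace_days ))
  echo "warn: rotation overdue by ${overdue} days past grace"
fi
```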
## Why It Matters
Expired secrets cause immediate authentication and authorization failures. Services that depend on expired credentials will fail to connect to databases, registries, external APIs, and other integrations. In production, this means outages. Secrets expiring within 7 days require urgent rotation to prevent imminent failures. Overdue rotation violates security policies and increases the blast radius of a credential compromise.
## Common Causes
- Secret expired without automated rotation being configured
- Rotation job failed silently (scheduler down, permissions changed)
- Secret provider (Vault, Key Vault) connection lost during rotation window
- Manual secret set with fixed expiry and no follow-up rotation
- Rotation policy interval shorter than actual rotation cadence
## How to Fix
### Docker Compose
```bash
# List secrets with expiry status
stella env secrets list --expiring
# Rotate an expired or expiring secret immediately
stella env secrets rotate <environment> <secret-name>
# Check secret provider connectivity
stella secrets provider status
# Update secret in .env file for compose deployments
# Edit devops/compose/.env with the new secret value
# Then restart affected services
docker compose -f docker-compose.stella-ops.yml restart <service>
```
### Bare Metal / systemd
```bash
# List secrets with expiry details
stella env secrets list --expiring
# Rotate expired secret
stella env secrets rotate <environment> <secret-name>
# If using file-based secrets, update the file
sudo vi /etc/stellaops/secrets/<secret-name>
sudo chmod 600 /etc/stellaops/secrets/<secret-name>
sudo systemctl restart stellaops-<service>
# Schedule automated rotation
stella env secrets rotate-scheduled --days 7
```
### Kubernetes / Helm
```bash
# List expiring secrets
stella env secrets list --expiring
# Rotate secret and update Kubernetes secret
stella env secrets rotate <environment> <secret-name>
# Or update manually
kubectl create secret generic <secret-name> \
--from-literal=value=<new-value> \
-n stellaops-<env> --dry-run=client -o yaml | kubectl apply -f -
# Restart pods to pick up new secret
kubectl rollout restart deployment/<service> -n stellaops-<env>
# For external-secrets-operator, trigger a refresh
kubectl annotate externalsecret <name> -n stellaops force-sync=$(date +%s) --overwrite
```
## Verification
```bash
stella doctor run --check check.environment.secrets
```
## Related Checks
- `check.environment.connectivity` - expired agent credentials cause connectivity failures
- `check.environment.deployments` - services fail when their secrets expire
- `check.integration.secrets.manager` - verifies the secrets manager itself is healthy