Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
master
2026-03-27 12:28:00 +02:00
parent fbd24e71de
commit c58a236d70
326 changed files with 18500 additions and 463 deletions

View File

@@ -0,0 +1,98 @@
---
checkId: check.environment.connectivity
plugin: stellaops.doctor.environment
severity: warn
tags: [environment, connectivity, agent, network]
---
# Environment Connectivity
## What It Checks
Retrieves the list of environments from the Release Orchestrator (`/api/v1/environments`), then probes each environment agent's `/health` endpoint. For each agent the check measures:
- **Reachability** -- whether the health endpoint returns a success status code
- **Latency** -- fails warn if response takes more than 500ms
- **TLS certificate validity** -- warns if the agent's TLS certificate expires within 30 days
- **Authentication** -- detects 401/403 responses indicating credential issues
If any agent is unreachable, the check fails. High latency or expiring certificates produce a warn.
## Why It Matters
Environment agents are the control surface through which Stella Ops manages deployments, collects telemetry, and enforces policy. An unreachable agent means the platform cannot deploy to, monitor, or roll back services in that environment. TLS certificate expiry causes hard connectivity failures with no graceful degradation. High latency slows deployment pipelines and can cause timeouts in approval workflows.
## Common Causes
- Environment agent service is stopped or crashed
- Firewall rule change blocking the agent port
- Network partition between Stella Ops control plane and target environment
- TLS certificate not renewed before expiry
- Agent authentication credentials rotated without updating Stella Ops configuration
- DNS resolution failure for the agent hostname
## How to Fix
### Docker Compose
```bash
# Check if the environment agent container is running
docker ps --filter "name=environment-agent"
# View agent logs for errors
docker logs stellaops-environment-agent --tail 100
# Restart the agent
docker compose -f docker-compose.stella-ops.yml restart environment-agent
# If TLS cert is expiring, replace the certificate files
# mounted into the agent container and restart
cp /path/to/new/cert.pem devops/compose/certs/agent.pem
cp /path/to/new/key.pem devops/compose/certs/agent-key.pem
docker compose -f docker-compose.stella-ops.yml restart environment-agent
```
### Bare Metal / systemd
```bash
# Check agent service status
sudo systemctl status stellaops-environment-agent
# View logs
sudo journalctl -u stellaops-environment-agent --since "1 hour ago"
# Restart agent
sudo systemctl restart stellaops-environment-agent
# Renew TLS certificate
sudo cp /path/to/new/cert.pem /etc/stellaops/certs/agent.pem
sudo cp /path/to/new/key.pem /etc/stellaops/certs/agent-key.pem
sudo systemctl restart stellaops-environment-agent
# Test network connectivity from control plane
curl -v https://<agent-host>:<agent-port>/health
```
### Kubernetes / Helm
```bash
# Check agent pod status
kubectl get pods -n stellaops -l app=environment-agent
# View agent logs
kubectl logs -n stellaops -l app=environment-agent --tail=100
# Restart agent pods
kubectl rollout restart deployment/environment-agent -n stellaops
# Renew TLS certificate via cert-manager or manual secret update
kubectl create secret tls agent-tls \
--cert=/path/to/cert.pem \
--key=/path/to/key.pem \
-n stellaops --dry-run=client -o yaml | kubectl apply -f -
# Check network policies
kubectl get networkpolicies -n stellaops
```
## Verification
```bash
stella doctor run --check check.environment.connectivity
```
## Related Checks
- `check.environment.deployments` - checks health of services deployed via agents
- `check.environment.network.policy` - verifies network policies that may block agent connectivity
- `check.environment.secrets` - agent credentials may need rotation