Doctor plugin checks: implement health check classes and documentation
Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification). Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
98
docs/doctor/articles/environment/environment-connectivity.md
Normal file
@@ -0,0 +1,98 @@
---
checkId: check.environment.connectivity
plugin: stellaops.doctor.environment
severity: warn
tags: [environment, connectivity, agent, network]
---
# Environment Connectivity

## What It Checks
Retrieves the list of environments from the Release Orchestrator (`/api/v1/environments`), then probes each environment agent's `/health` endpoint. For each agent the check measures:
- **Reachability** -- whether the health endpoint returns a success status code
- **Latency** -- warns if the response takes more than 500 ms
- **TLS certificate validity** -- warns if the agent's TLS certificate expires within 30 days
- **Authentication** -- detects 401/403 responses indicating credential issues

If any agent is unreachable, the check fails. High latency or an expiring certificate produces a warning.
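
The thresholds above can be reproduced by hand when triaging a single agent. The following is a minimal sketch, not the check's actual implementation: `classify_probe` and `probe_agent` are hypothetical helper names, and the verdict labels are invented for illustration. The 500 ms threshold and the 401/403 handling come from the description above; reporting any other non-success status as unreachable is an assumption based on the reachability definition.

```bash
#!/usr/bin/env bash
# Hypothetical triage helpers; they are not part of the stella CLI.

# classify_probe STATUS LATENCY_MS
# Maps an HTTP status code and a latency in milliseconds to a verdict.
classify_probe() {
  local status="$1" latency_ms="$2"
  case "$status" in
    401|403) echo "fail:auth"; return ;;        # credential problem
    2??) ;;                                     # success, go on to latency
    *) echo "fail:unreachable"; return ;;       # no success status (assumption)
  esac
  if [ "$latency_ms" -gt 500 ]; then
    echo "warn:latency"
  else
    echo "pass"
  fi
}

# probe_agent URL
# Requests URL with curl and classifies the result.
probe_agent() {
  local url="$1" out status seconds latency_ms
  # -w prints "<http_code> <time_total>"; http_code is 000 when no response arrives.
  out="$(curl -s -o /dev/null --max-time 10 -w '%{http_code} %{time_total}' "$url")"
  status="${out%% *}"
  seconds="${out##* }"
  # Convert fractional seconds to whole milliseconds.
  latency_ms="$(awk -v s="$seconds" 'BEGIN { printf "%d", s * 1000 }')"
  classify_probe "$status" "$latency_ms"
}
```

Running `probe_agent https://<agent-host>:<agent-port>/health` from the control plane host prints one verdict per call. This sketch does not inspect the certificate expiry date.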

## Why It Matters
Environment agents are the control surface through which Stella Ops manages deployments, collects telemetry, and enforces policy. An unreachable agent means the platform cannot deploy to, monitor, or roll back services in that environment. TLS certificate expiry causes hard connectivity failures with no graceful degradation. High latency slows deployment pipelines and can cause timeouts in approval workflows.

## Common Causes
- Environment agent service is stopped or crashed
- Firewall rule change blocking the agent port
- Network partition between Stella Ops control plane and target environment
- TLS certificate not renewed before expiry
- Agent authentication credentials rotated without updating Stella Ops configuration
- DNS resolution failure for the agent hostname
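
When probing an agent manually, curl's exit code often narrows down which of these causes applies. The sketch below maps a few documented curl exit codes to the causes listed above. `explain_curl_exit` is a hypothetical helper, and the mapping from exit code to cause is a heuristic, not something the Doctor check itself reports.

```bash
#!/usr/bin/env bash
# Hypothetical helper: translate a curl exit code into a likely cause.
# Exit codes are those documented in the EXIT CODES section of curl(1).
explain_curl_exit() {
  case "$1" in
    0)  echo "transport ok: inspect the HTTP status for 401/403" ;;
    6)  echo "dns: agent hostname did not resolve" ;;
    7)  echo "connect: agent stopped or port blocked" ;;
    28) echo "timeout: possible network partition" ;;
    35) echo "tls: handshake failed" ;;
    60) echo "tls: certificate not trusted or expired" ;;
    *)  echo "other: see curl(1) EXIT CODES" ;;
  esac
}

# Usage (placeholders as in the rest of this article):
#   curl -s -o /dev/null --max-time 10 https://<agent-host>:<agent-port>/health
#   explain_curl_exit $?
```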

## How to Fix

### Docker Compose
```bash
# Check if the environment agent container is running
docker ps --filter "name=environment-agent"

# View agent logs for errors
docker logs stellaops-environment-agent --tail 100

# Restart the agent
docker compose -f docker-compose.stella-ops.yml restart environment-agent

# If TLS cert is expiring, replace the certificate files
# mounted into the agent container and restart
cp /path/to/new/cert.pem devops/compose/certs/agent.pem
cp /path/to/new/key.pem devops/compose/certs/agent-key.pem
docker compose -f docker-compose.stella-ops.yml restart environment-agent
```

### Bare Metal / systemd
```bash
# Check agent service status
sudo systemctl status stellaops-environment-agent

# View logs
sudo journalctl -u stellaops-environment-agent --since "1 hour ago"

# Restart agent
sudo systemctl restart stellaops-environment-agent

# Renew TLS certificate
sudo cp /path/to/new/cert.pem /etc/stellaops/certs/agent.pem
sudo cp /path/to/new/key.pem /etc/stellaops/certs/agent-key.pem
sudo systemctl restart stellaops-environment-agent

# Test network connectivity from control plane
curl -v https://<agent-host>:<agent-port>/health
```
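
Before or after copying a certificate into place, it can help to confirm how long it remains valid. The following sketch assumes `openssl` is installed on the host. `cert_status` is a hypothetical helper, and its 30-day default mirrors the warning window described under "What It Checks".

```bash
#!/usr/bin/env bash
# Hypothetical helper: report whether a PEM certificate expires soon.

# cert_status CERT_FILE [DAYS]
# Prints "ok", "expiring", or "missing". DAYS defaults to 30.
cert_status() {
  local cert="$1" days="${2:-30}"
  if [ ! -r "$cert" ]; then
    echo "missing"
    return 1
  fi
  # -checkend N exits 0 only if the certificate is still valid N seconds from now.
  if openssl x509 -in "$cert" -noout -checkend "$(( days * 86400 ))" >/dev/null 2>&1; then
    echo "ok"
  else
    echo "expiring"
  fi
}
```

For example, `cert_status /etc/stellaops/certs/agent.pem` prints `expiring` when the installed certificate is inside the 30-day window or has already expired.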

### Kubernetes / Helm
```bash
# Check agent pod status
kubectl get pods -n stellaops -l app=environment-agent

# View agent logs
kubectl logs -n stellaops -l app=environment-agent --tail=100

# Restart agent pods
kubectl rollout restart deployment/environment-agent -n stellaops

# Renew TLS certificate via cert-manager or manual secret update
kubectl create secret tls agent-tls \
  --cert=/path/to/cert.pem \
  --key=/path/to/key.pem \
  -n stellaops --dry-run=client -o yaml | kubectl apply -f -

# Check network policies
kubectl get networkpolicies -n stellaops
```

## Verification
```bash
stella doctor run --check check.environment.connectivity
```

## Related Checks
- `check.environment.deployments` - checks health of services deployed via agents
- `check.environment.network.policy` - verifies network policies that may block agent connectivity
- `check.environment.secrets` - agent credentials may need rotation