docs consolidation and others
26
docs/operations/binary-prereqs.md
Normal file
@@ -0,0 +1,26 @@
# Binary Prerequisites & Offline Layout

## Layout (authoritative)
- `.nuget/packages/`: NuGet package cache (configured via `nuget.config` `globalPackagesFolder`).
- `devops/manifests/`: binary integrity manifests (e.g., `binary-plugins.manifest.json`).
- `devops/offline/feeds/`: air-gap bundles (tarballs, OCI layers, SBOM packs) registered in `manifest.json`.
- Module-owned binaries (currently `plugins/`, `tools/`, `deploy/`, `ops/`) are tracked for integrity in `devops/manifests/` until relocated.

## Adding or updating NuGet packages
1) Run `dotnet restore`, which populates `.nuget/packages/` per the sources in `nuget.config`.
2) Never add new feeds to `nuget.config` without review; the configured sources are `nuget.org` and `stellaops` (internal feed).
3) For offline builds, pre-populate `.nuget/packages/` from a network-connected machine, then copy it to the air-gapped environment.

## Adding other binaries
1) Prefer building from source; if you must pin a binary, drop it under `devops/offline/` and append a manifest entry with SHA-256, origin URL, version, and intended consumer.
2) For module-owned binaries (e.g., plugins), record the artefact in `devops/manifests/binary-plugins.manifest.json` until it can be rebuilt deterministically as part of CI.
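
A pinned binary needs a SHA-256 digest before it can be recorded. The following is a minimal sketch of computing one and printing an entry, assuming `sha256sum` from GNU coreutils is available. The helper names and the JSON field names (`path`, `sha256`, `origin`, `version`, `consumer`) are illustrative assumptions and are not taken from the real manifest schema.

```bash
# Print the hex SHA-256 digest of a file (assumes GNU coreutils `sha256sum`).
sha256_of() {
  sha256sum "$1" | cut -d ' ' -f 1
}

# Print a hypothetical manifest entry as one JSON line.
# The field names are assumptions, not the real manifest schema.
# $1 = file path, $2 = origin URL, $3 = version, $4 = intended consumer
manifest_entry() {
  printf '{"path":"%s","sha256":"%s","origin":"%s","version":"%s","consumer":"%s"}\n' \
    "$1" "$(sha256_of "$1")" "$2" "$3" "$4"
}
```

For example, `manifest_entry devops/offline/tool.tar.gz https://example.invalid/tool.tar.gz 1.2.3 scanner` prints one entry for review. The tracked manifests themselves are still refreshed with `scripts/update-binary-manifests.py`.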

## Automation & Integrity
- Run `scripts/update-binary-manifests.py` to refresh manifests after adding binaries.
- Run `scripts/verify-binaries.sh` locally; CI executes it on every PR/branch to block binaries outside approved roots.
- CI also re-runs the manifest generator and fails if the manifests would change, so commit regenerated manifests as part of the change.
- NuGet restore uses `.nuget/packages/` as configured in `nuget.config`. Clean by removing `.nuget/packages/` if needed.
- For offline enforcement, set `OFFLINE=1` (CI should fail if it reaches `nuget.org` without `ALLOW_REMOTE=1`).
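
The `OFFLINE` / `ALLOW_REMOTE` rule can be sketched as a small guard that a build step calls before touching a remote feed. How CI actually enforces the rule is not shown here, so the function name and the exact semantics (remote access is blocked only when `OFFLINE=1` is set without `ALLOW_REMOTE=1`) are assumptions.

```bash
# Return 0 when remote feeds may be contacted, 1 otherwise.
# Assumed semantics: OFFLINE=1 blocks remote access unless ALLOW_REMOTE=1.
remote_access_allowed() {
  if [ "${OFFLINE:-0}" = "1" ] && [ "${ALLOW_REMOTE:-0}" != "1" ]; then
    return 1
  fi
  return 0
}
```

A restore wrapper could then start with `remote_access_allowed || { echo "remote feeds blocked (OFFLINE=1)" >&2; exit 1; }`.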

## Housekeeping
- Refresh manifests when binaries change and record the update in the current sprint's Execution Log.
295
docs/operations/deployment/VERSION_MATRIX.md
Normal file
@@ -0,0 +1,295 @@
# StellaOps Deployment Version Matrix

> **Last Updated:** 2025-12-04
> **Purpose:** Single source of truth for service versions across deployment environments
> **Unblocks:** COMPOSE-44-001, 44-001, 44-002, 44-003, 45-001, 45-002, 45-003 (7 tasks)

## Quick Reference

| Environment | Core Version | Status |
|-------------|-------------|--------|
| **Development** | `2025.10.0-edge` | Active |
| **Staging** | `2025.09.2` | Stable |
| **Production** | `2025.09.2` | Stable |
| **Air-Gap** | `2025.09.2-airgap` | Certified |

---

## Service Version Matrix

### Core Services

| Service | Dev | Staging | Prod | Air-Gap | Notes |
|---------|-----|---------|------|---------|-------|
| Authority | `2025.10.0-edge` | `2025.09.2` | `2025.09.2` | `2025.09.2-airgap` | OAuth 2.1 / mTLS |
| Signer | `2025.10.0-edge` | `2025.09.2` | `2025.09.2` | `2025.09.2-airgap` | ECDSA/RSA/EdDSA |
| Attestor | `2025.10.0-edge` | `2025.09.2` | `2025.09.2` | `2025.09.2-airgap` | in-toto/DSSE |
| Concelier | `2025.10.0-edge` | `2025.09.2` | `2025.09.2` | `2025.09.2-airgap` | Advisory ingestion |
| Scanner | `2025.10.0-edge` | `2025.09.2` | `2025.09.2` | `2025.09.2-airgap` | SBOM/Vuln scanning |
| Excititor | `2025.10.0-edge` | `2025.09.2` | `2025.09.2` | `2025.09.2-airgap` | VEX export |
| Policy | `2025.10.0-edge` | `2025.09.2` | `2025.09.2` | `2025.09.2-airgap` | OPA/Rego engine |
| Scheduler | `2025.10.0-edge` | `2025.09.2` | `2025.09.2` | `2025.09.2-airgap` | Job scheduling |
| Notify | `2025.10.0-edge` | `2025.09.2` | `2025.09.2` | `2025.09.2-airgap` | Notifications |

### Platform Services

| Service | Dev | Staging | Prod | Air-Gap | Notes |
|---------|-----|---------|------|---------|-------|
| Orchestrator Web | `2025.10.0-edge` | `2025.09.2` | `2025.09.2` | `2025.09.2-airgap` | API Gateway |
| Orchestrator Worker | `2025.10.0-edge` | `2025.09.2` | `2025.09.2` | `2025.09.2-airgap` | Background jobs |
| Graph API | `2025.10.0-edge` | `2025.09.2` | `2025.09.2` | `2025.09.2-airgap` | Graph queries |
| Graph Indexer | `2025.10.0-edge` | `2025.09.2` | `2025.09.2` | `2025.09.2-airgap` | Graph ingest |
| Timeline Indexer | `2025.10.0-edge` | `2025.09.2` | `2025.09.2` | `2025.09.2-airgap` | Event timeline |
| Findings Ledger | `2025.10.0-edge` | `2025.09.2` | `2025.09.2` | `2025.09.2-airgap` | Finding storage |

### Infrastructure Dependencies

| Component | Version | Digest | Notes |
|-----------|---------|--------|-------|
| PostgreSQL | `16-alpine` | N/A | Primary database (REQUIRED) |
| Valkey | `8.0` | N/A | Cache, DPoP security (REQUIRED) |
| RustFS | `2025.10.0-edge` | N/A | Object storage (REQUIRED) |
| NATS | `2.10` | `sha256:c82559e4476289481a8a5196e675ebfe67eea81d95e5161e3e78eccfe766608e` | Message queue (optional) |

---

## Container Image Registry

### Primary Registry

```
registry.stella-ops.org/stellaops/<service>:<version>
```

### Image Naming Convention

| Pattern | Example | Use Case |
|---------|---------|----------|
| `<service>:<version>` | `authority:2025.09.2` | Tagged releases |
| `<service>:<version>-<variant>` | `authority:2025.09.2-airgap` | Environment variants |
| `<service>:edge` | `authority:edge` | Latest dev build |
| `<service>@sha256:<digest>` | `authority@sha256:abc123...` | Immutable reference |
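
The naming patterns above can be assembled mechanically, which avoids typos when scripting pulls or mirrors. This is a sketch only: `image_ref` and `image_digest_ref` are hypothetical helper names, and the registry prefix is the one given under "Primary Registry".

```bash
# Registry prefix taken from the "Primary Registry" section.
REGISTRY="registry.stella-ops.org/stellaops"

# Build a tag-based reference: <service>:<version>[-<variant>].
# $1 = service, $2 = version, $3 = optional variant (for example, airgap)
image_ref() {
  if [ -n "${3:-}" ]; then
    printf '%s/%s:%s-%s\n' "$REGISTRY" "$1" "$2" "$3"
  else
    printf '%s/%s:%s\n' "$REGISTRY" "$1" "$2"
  fi
}

# Build an immutable digest reference: <service>@sha256:<digest>.
# $1 = service, $2 = hex digest without the "sha256:" prefix
image_digest_ref() {
  printf '%s/%s@sha256:%s\n' "$REGISTRY" "$1" "$2"
}
```

For example, `image_ref authority 2025.09.2 airgap` prints the air-gap variant of the Authority image reference.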

### Air-Gap Bundle Images

Air-gap deployments use pre-bundled images with all dependencies:

```
registry.stella-ops.org/stellaops/airgap-bundle:2025.09.2
```

Bundle contents:
- All core services at matching version
- Infrastructure containers (PostgreSQL, Valkey, RustFS, NATS)
- CLI tools and migration utilities
- Offline kit documentation

---

## Version Promotion Workflow

### Stages

```
Dev (edge) → Staging → Production → Air-Gap (certified)
```

### Promotion Criteria

| Stage | Criteria |
|-------|----------|
| Dev → Staging | All unit tests pass, integration tests pass |
| Staging → Prod | E2E tests pass, security scan clean, performance benchmarks pass |
| Prod → Air-Gap | Offline validation complete, bundle integrity verified, documentation updated |
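
The stage order can be encoded as a lookup so that a wrapper script can refuse out-of-order promotions. This is only a sketch: `next_stage` is a hypothetical helper and not part of `scripts/promote.sh`, and the stage name `airgap` is an assumed spelling.

```bash
# Map an environment to the next stage in the promotion chain:
# dev -> staging -> prod -> airgap.
# Returns 1 for the final stage or for an unknown stage.
next_stage() {
  case "$1" in
    dev)     echo staging ;;
    staging) echo prod ;;
    prod)    echo airgap ;;
    *)       return 1 ;;
  esac
}
```

A wrapper could compare `$(next_stage "$from")` with the requested target and stop if they differ, so that a build cannot skip Staging.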

### Promotion Commands

```bash
# Promote dev to staging
./scripts/promote.sh --from dev --to staging --version 2025.10.0

# Promote staging to production
./scripts/promote.sh --from staging --to prod --version 2025.10.0

# Create air-gap certified bundle
./scripts/create-airgap-bundle.sh --version 2025.09.2
```

---

## Helm Chart Values

### Development (`values-dev.yaml`)

```yaml
global:
  imageTag: "2025.10.0-edge"
  imagePullPolicy: Always
  environment: development

services:
  authority:
    replicaCount: 1
    resources:
      requests:
        memory: "256Mi"
        cpu: "100m"
```

### Production (`values-prod.yaml`)

```yaml
global:
  imageTag: "2025.09.2"
  imagePullPolicy: IfNotPresent
  environment: production

services:
  authority:
    replicaCount: 3
    resources:
      requests:
        memory: "512Mi"
        cpu: "250m"
```

### Air-Gap (`values-airgap.yaml`)

```yaml
global:
  imageTag: "2025.09.2-airgap"
  imagePullPolicy: Never  # Images pre-loaded
  environment: airgap
  offlineMode: true

airgap:
  enabled: true
  bundleVersion: "2025.09.2"
  stalenessThresholdSeconds: 604800  # 7 days
```

---

## Docker Compose Reference

### Quick Start (Development)

```yaml
# docker-compose.dev.yaml
version: "3.8"
services:
  authority:
    image: registry.stella-ops.org/stellaops/authority:2025.10.0-edge

  concelier:
    image: registry.stella-ops.org/stellaops/concelier:2025.10.0-edge

  scanner:
    image: registry.stella-ops.org/stellaops/scanner:2025.10.0-edge
```

### Production

```yaml
# docker-compose.prod.yaml
version: "3.8"
services:
  authority:
    image: registry.stella-ops.org/stellaops/authority@sha256:...
    deploy:
      replicas: 3

  concelier:
    image: registry.stella-ops.org/stellaops/concelier@sha256:...
    deploy:
      replicas: 2
```

---

## Service Dependencies

### Startup Order

```
1. Infrastructure (PostgreSQL, Valkey, RustFS, NATS)
   ↓
2. Core Auth (Authority, Signer)
   ↓
3. Data Services (Concelier, Excititor)
   ↓
4. Compute Services (Scanner, Policy, Scheduler)
   ↓
5. Platform Services (Orchestrator, Graph, Timeline)
   ↓
6. UI/CLI
```

### Health Check Endpoints

| Service | Health Endpoint | Ready Endpoint |
|---------|-----------------|----------------|
| All | `/health` | `/ready` |
| Authority | `/health` | `/ready` (includes JWKS) |
| Scanner | `/health` | `/ready` (includes analyzer check) |
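
The startup order implies waiting for each tier's `/ready` endpoint before starting the next one. The following is a minimal sketch of a generic retry helper for that purpose. `wait_until` is a hypothetical name, and the host and port in the usage note are assumptions, since only the endpoint paths are listed here.

```bash
# Retry a command until it succeeds or the attempt budget is exhausted.
# $1 = max attempts, $2 = delay in seconds between attempts, rest = command.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep only between attempts, not after the final failure.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

For example, `wait_until 30 2 curl -fsS http://authority:8080/ready` polls the Authority readiness endpoint for up to about a minute before the data services are started.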

---

## Breaking Changes Log

### 2025.10.0 (Upcoming)

- **Authority:** New OAuth 2.1 endpoints (backward compatible)
- **Scanner:** Analyzer plugin format v2 (migration required)
- **Concelier:** LNM API v2 (v1 deprecated, removed in 2025.11.0)

### 2025.09.2 (Current Stable)

- **All:** Initial GA release
- **Air-Gap:** First certified offline bundle

---

## Rollback Procedure

### Helm Rollback

```bash
# List release revisions
helm history stellaops -n stellaops

# Roll back to the previous revision (omitting the revision number selects it;
# pass a revision from `helm history` to target a specific one)
helm rollback stellaops -n stellaops
```

### Compose Rollback

```bash
# Stop current
docker-compose down

# Edit .env to previous version
# VERSION=2025.09.1

# Start previous
docker-compose up -d
```

---

## Related Documents

- [Helm Chart Documentation](../deploy/helm/stellaops/README.md)
- [Compose Quickstart](../deploy/compose/README.md)
- [Offline Kit Guide](./OFFLINE_KIT.md)
- [Air-Gap Provenance](../modules/findings-ledger/airgap-provenance.md)
- [Staleness Schema](../schemas/ledger-airgap-staleness.schema.json)

---

## Changelog

| Date | Change | Author |
|------|--------|--------|
| 2025-12-04 | Initial version matrix created | Claude |
| 2025-12-04 | Added air-gap certification workflow | Claude |
228
docs/operations/deployment/console.md
Normal file
@@ -0,0 +1,228 @@
# Deploying the StellaOps Console

> **Audience:** Deployment Guild, Console Guild, operators rolling out the web console.
> **Scope:** Helm and Docker Compose deployment steps, ingress/TLS configuration, required environment variables, health checks, offline/air-gap operation, and compliance checklist (Sprint 23).

The StellaOps Console ships as part of the `stellaops` stack Helm chart and Compose bundles maintained under `deploy/`. This guide describes the supported deployment paths, the configuration surface, and operational checks needed to run the console in connected or air-gapped environments.

---

## 1. Prerequisites

- Kubernetes cluster (v1.28+) with ingress controller (NGINX, Traefik, or equivalent) and Cert-Manager for automated TLS, or Docker host for Compose deployments.
- Container registry access to `registry.stella-ops.org` (or mirrored registry) for all images listed in `deploy/releases/*.yaml`.
- Authority service configured with console client (`aud=ui`, scopes `ui.read`, `ui.admin`).
- DNS entry pointing to the console hostname (for example, `console.acme.internal`).
- Cosign public key for manifest verification (`deploy/releases/manifest.json.sig`).
- Optional: Offline Kit bundle for air-gapped sites (`stella-ops-offline-kit-<ver>.tar.gz`).

---

## 2. Helm deployment (recommended)

### 2.1 Install chart repository

```bash
helm repo add stellaops https://downloads.stella-ops.org/helm
helm repo update stellaops
```

If operating offline, copy the chart archive from the Offline Kit (`deploy/helm/stellaops-<ver>.tgz`) and run:

```bash
helm install stellaops ./stellaops-<ver>.tgz --namespace stellaops --create-namespace
```

### 2.2 Base installation

```bash
helm install stellaops stellaops/stellaops \
  --namespace stellaops \
  --create-namespace \
  --values deploy/helm/stellaops/values-prod.yaml
```

The chart deploys Authority, Console web/API gateway, Scanner API, Scheduler, and supporting services. The console frontend pod is labelled `app=stellaops-web-ui`.

### 2.3 Helm values highlights

Key sections in `deploy/helm/stellaops/values-prod.yaml`:

| Path | Description |
|------|-------------|
| `console.ingress.host` | Hostname served by the console (`console.example.com`). |
| `console.ingress.tls.secretName` | Kubernetes secret containing TLS certificate (generated by Cert-Manager or uploaded manually). |
| `console.config.apiGateway.baseUrl` | Internal base URL the UI uses to reach the gateway (defaults to `https://stellaops-web`). |
| `console.env.AUTHORITY_ISSUER` | Authority issuer URL (for example, `https://authority.example.com`). |
| `console.env.AUTHORITY_CLIENT_ID` | Authority client ID for the console UI. |
| `console.env.AUTHORITY_SCOPES` | Space-separated scopes required by UI (`ui.read ui.admin`). |
| `console.resources` | CPU/memory requests and limits (default 250m CPU / 512Mi memory). |
| `console.podAnnotations` | Optional annotations for service mesh or monitoring. |

Use `values-stage.yaml`, `values-dev.yaml`, or `values-airgap.yaml` as templates for other environments.

### 2.4 TLS and ingress

Example ingress override:

```yaml
console:
  ingress:
    enabled: true
    className: nginx
    host: console.acme.internal
    tls:
      enabled: true
      secretName: console-tls
```

Generate certificates using Cert-Manager or provide an existing secret. For air-gapped deployments, pre-create the secret with the mirrored CA chain.

### 2.5 Health checks

Console pods expose:

| Path | Purpose | Notes |
|------|---------|-------|
| `/health/live` | Liveness probe | Confirms the process is responsive. |
| `/health/ready` | Readiness probe | Verifies configuration bootstrap and Authority reachability. |
| `/metrics` | Prometheus metrics | Enabled when `console.metrics.enabled=true`. |

The Helm chart sets default probes (`initialDelaySeconds: 10`, `periodSeconds: 15`). Adjust them via `console.livenessProbe` and `console.readinessProbe`.

---

## 3. Docker Compose deployment

The Compose bundle is located at `deploy/compose/docker-compose.console.yaml`. Quick start:

```bash
cd deploy/compose
docker compose -f docker-compose.console.yaml --env-file console.env up -d
```

`console.env` should define:

```
CONSOLE_PUBLIC_BASE_URL=https://console.acme.internal
AUTHORITY_ISSUER=https://authority.acme.internal
AUTHORITY_CLIENT_ID=console-ui
AUTHORITY_CLIENT_SECRET=<if using confidential client>
AUTHORITY_SCOPES=ui.read ui.admin
CONSOLE_GATEWAY_BASE_URL=https://api.acme.internal
```

The Compose bundle includes Traefik as a reverse proxy with TLS termination. Update `traefik/dynamic/console.yml` for custom certificates or additional middlewares (CSP headers, rate limits).

---

## 4. Environment variables

| Variable | Description | Default |
|----------|-------------|---------|
| `CONSOLE_PUBLIC_BASE_URL` | External URL used for redirects, deep links, and telemetry. | None (required). |
| `CONSOLE_GATEWAY_BASE_URL` | URL of the web gateway that proxies API calls (`/console/*`). | Chart service name. |
| `AUTHORITY_ISSUER` | Authority issuer (`https://authority.example.com`). | None (required). |
| `AUTHORITY_CLIENT_ID` | OIDC client configured in Authority. | None (required). |
| `AUTHORITY_SCOPES` | Space-separated scopes assigned to the console client. | `ui.read ui.admin`. |
| `AUTHORITY_DPOP_ENABLED` | Enables DPoP challenge/response (recommended true). | `true`. |
| `CONSOLE_FEATURE_FLAGS` | Comma-separated feature flags (`runs`, `downloads.offline`, etc.). | `runs,downloads,policies`. |
| `CONSOLE_LOG_LEVEL` | Minimum log level (`Information`, `Debug`, etc.). | `Information`. |
| `CONSOLE_METRICS_ENABLED` | Expose `/metrics` endpoint. | `true`. |
| `CONSOLE_SENTRY_DSN` | Optional error reporting DSN. | Blank. |

When running behind additional proxies, set `ASPNETCORE_FORWARDEDHEADERS_ENABLED=true` to honour `X-Forwarded-*` headers.
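
Three of the variables in the table have no default, so a pre-flight check before `docker compose up` or `helm install` can catch an incomplete environment early. This is a sketch: `check_console_env` is a hypothetical helper, and the required set simply mirrors the rows marked "None (required)".

```bash
# Report which required console variables are missing.
# The required set mirrors the "None (required)" rows in the table.
check_console_env() {
  missing=""
  for name in CONSOLE_PUBLIC_BASE_URL AUTHORITY_ISSUER AUTHORITY_CLIENT_ID; do
    # Indirect variable lookup that works in POSIX sh.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing required variables:$missing" >&2
    return 1
  fi
  return 0
}
```

One way to use it is to load the env file into the shell first, for example `set -a; . ./console.env; set +a; check_console_env`. Note that sourcing only works when every value is valid shell syntax, so values containing spaces (such as `AUTHORITY_SCOPES=ui.read ui.admin`) would need quoting.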

---

## 5. Security headers and CSP

The console serves a strict Content Security Policy (CSP) by default:

```
default-src 'self';
connect-src 'self' https://*.stella-ops.local;
script-src 'self';
style-src 'self' 'unsafe-inline';
img-src 'self' data:;
font-src 'self';
frame-ancestors 'none';
```

Adjust via `console.config.cspOverrides` if additional domains are required. For integrations embedding the console, update OIDC redirect URIs and Authority scopes accordingly.

TLS recommendations:

- Use TLS 1.2+ with modern cipher suite policy.
- Enable HSTS (`Strict-Transport-Security: max-age=31536000; includeSubDomains`).
- Provide custom trust bundles via `console.config.trustBundleSecret` when using private CAs.

---

## 6. Logging and metrics

- Structured logs are emitted to stdout with correlation IDs. Configure log shipping via Fluent Bit or similar.
- Metrics are available at `/metrics` in Prometheus format. Key metrics include `ui_request_duration_seconds`, `ui_tenant_switch_total`, and `ui_download_manifest_refresh_seconds`.
- Enable the OpenTelemetry exporter by setting `OTEL_EXPORTER_OTLP_ENDPOINT` and associated headers in environment variables.

---

## 7. Offline and air-gap deployment

- Mirror container images using the Downloads workspace or Offline Kit manifest. Example:

  ```bash
  oras copy registry.stella-ops.org/stellaops/web-ui@sha256:<digest> \
    registry.airgap.local/stellaops/web-ui:2025.10.0
  ```

- Import the Offline Kit using `stella ouk import` before starting the console so manifest parity checks succeed.
- Use `values-airgap.yaml` to disable external telemetry endpoints and configure internal certificate chains.
- Run `helm upgrade --install` using the mirrored chart (`stellaops-<ver>.tgz`) and set `console.offlineMode=true` to surface offline banners.

---

## 8. Health checks and remediation

| Check | Command | Expected result |
|-------|---------|-----------------|
| Pod status | `kubectl get pods -n stellaops` | `Running` state with restarts = 0. |
| Liveness | `kubectl exec -n stellaops deploy/stellaops-web-ui -- curl -fsS http://localhost:8080/health/live` | Returns `{"status":"Healthy"}`. |
| Readiness | `kubectl exec -n stellaops deploy/stellaops-web-ui -- curl -fsS http://localhost:8080/health/ready` | Returns `{"status":"Ready"}`. |
| Gateway reachability | `curl -I https://console.example.com/api/console/status` | `200 OK` with CSP headers. |
| Static assets | `curl -I https://console.example.com/static/assets/app.js` | `200 OK` with long cache headers. |

Troubleshooting steps:

- **Authority unreachable:** readiness fails with `AUTHORITY_UNREACHABLE`. Check DNS, trust bundles, and Authority service health.
- **Manifest mismatch:** console logs `DOWNLOAD_MANIFEST_SIGNATURE_INVALID`. Verify the cosign key and re-sync the manifest.
- **Ingress 404:** ensure the ingress controller routes the host to the `stellaops-web-ui` service; check the TLS secret name.
- **SSE blocked:** confirm the proxy allows HTTP/1.1 and disables buffering on `/console/runs/*`.

---

## 9. References

- `deploy/helm/stellaops/values-*.yaml` - environment-specific overrides.
- `deploy/compose/docker-compose.console.yaml` - Compose bundle.
- `/docs/UI_GUIDE.md` - Console workflows and offline posture.
- `/docs/security/console-security.md` - CSP and Authority scopes.
- `/docs/OFFLINE_KIT.md` - Offline kit packaging and verification.
- `/docs/modules/devops/runbooks/deployment-runbook.md` (pending) - wider platform deployment steps.

---

## 10. Compliance checklist

- [ ] Helm and Compose instructions verified against `deploy/` assets.
- [ ] Ingress/TLS guidance aligns with Security Guild recommendations.
- [ ] Environment variables documented with defaults and required values.
- [ ] Health/liveness/readiness endpoints tested and listed.
- [ ] Offline workflow (mirrors, manifest parity) captured.
- [ ] Logging and metrics sections list the key exposed metrics.
- [ ] CSP and security header defaults stated alongside override guidance.
- [ ] Troubleshooting section linked to relevant runbooks.

---

*Last updated: 2025-10-27 (Sprint 23).*
158
docs/operations/deployment/containers.md
Normal file
@@ -0,0 +1,158 @@
# Container Deployment Guide: AOC Update

> **Audience:** DevOps Guild, platform operators deploying StellaOps services.
> **Scope:** Deployment configuration changes required by the Aggregation-Only Contract (AOC), including schema validators, guard environment flags, and verifier identities.

This guide supplements existing deployment manuals with AOC-specific configuration. It assumes familiarity with the base Compose/Helm manifests described in `ops/deployment/` and `docs/modules/devops/architecture.md`.

---

## 1 · Schema constraint enablement

### 1.1 PostgreSQL constraints

- Apply CHECK constraints and NOT NULL rules to the `advisory_raw` and `vex_raw` tables before enabling AOC guards.
- Before enabling constraints or the idempotency index, run the duplicate audit helper to confirm no conflicting raw advisories remain:
  ```bash
  psql -d concelier -f ops/devops/scripts/check-advisory-raw-duplicates.sql -v LIMIT=200
  ```
  Resolve any reported rows prior to rollout.
- Use the migration script provided in `ops/devops/scripts/apply-aoc-constraints.sql`:

```bash
# Note: `-f` resolves the path inside the pod, so the script must be present there.
kubectl exec -n concelier deploy/concelier-postgres -- \
  psql -d concelier -f ops/devops/scripts/apply-aoc-constraints.sql

kubectl exec -n excititor deploy/excititor-postgres -- \
  psql -d excititor -f ops/devops/scripts/apply-aoc-constraints.sql
```

- Constraints enforce required fields (`tenant`, `source`, `upstream`, `linkset`) and reject forbidden keys at the database level.
- Rollback plan: constraints can be dropped via the same script with `--remove` if required.

### 1.2 Migration order

1. Deploy constraints in a maintenance window.
2. Roll out Concelier/Excititor images with guard middleware enabled (`AOC_GUARD_ENABLED=true`).
3. Run smoke tests (`stella sources ingest --dry-run` fixtures) before resuming production ingestion.

### 1.3 Supersedes backfill verification

1. **Duplicate audit:** Confirm `psql -d concelier -f ops/devops/scripts/check-advisory-raw-duplicates.sql -v LIMIT=200` reports no conflicts before restarting Concelier with the new migrations.
2. **Post-migration check:** After the service restarts, validate that the `advisory` view points to `advisory_backup_20251028`:
   ```bash
   psql -d concelier -c "SELECT viewname, definition FROM pg_views WHERE viewname = 'advisory';"
   ```
   The definition should reference `advisory_backup_20251028`.
3. **Supersedes chain spot-check:** Inspect a sample set to ensure deterministic chaining:
   ```bash
   psql -d concelier -c "
   SELECT id, supersedes FROM advisory_raw
   WHERE upstream_id IS NOT NULL
   ORDER BY tenant, source_vendor, upstream_id, retrieved_at
   LIMIT 5;"
   ```
   Each revision should reference the previous `id` (or `null` for the first revision). Record findings in the change ticket before proceeding to production.
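
The chaining rule in step 3 can also be checked mechanically instead of by eye. The sketch below is an assumption-laden helper, not part of the shipped tooling: it expects `id|supersedes` rows as produced by `psql -At` (unaligned, tuples only, NULL printed as an empty field) for a single revision chain, oldest first. The spot-check query above can return rows from several chains, so its output would have to be filtered to one `tenant` / `source_vendor` / `upstream_id` combination first.

```bash
# Read "id|supersedes" rows for ONE revision chain, oldest first, and verify
# that the first row has no supersedes value and that every later row
# supersedes the row directly before it. Exits non-zero on any violation.
check_supersedes_chain() {
  awk -F '|' '
    NR == 1 && $2 != "" { printf "row 1: first revision has supersedes=%s\n", $2; bad = 1 }
    NR > 1 && $2 != prev { printf "row %d: expected supersedes=%s, got %s\n", NR, prev, $2; bad = 1 }
    { prev = $1 }
    END { exit bad + 0 }
  '
}
```

Pipe the filtered query output into `check_supersedes_chain`; any printed line identifies a row to record in the change ticket.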

---

## 2 · Container environment flags

Add the following environment variables to Concelier/Excititor deployments:

| Variable | Default | Description |
|----------|---------|-------------|
| `AOC_GUARD_ENABLED` | `true` | Enables `AOCWriteGuard` interception. Set `false` only for controlled rollback. |
| `AOC_ALLOW_SUPERSEDES_RETROFIT` | `false` | Allows temporary supersedes backfill during migration. Remove after cutover. |
| `AOC_METRICS_ENABLED` | `true` | Emits `ingestion_write_total`, `aoc_violation_total`, etc. |
| `AOC_TENANT_HEADER` | `X-Stella-Tenant` | Header name expected from Gateway. |
| `AOC_VERIFIER_USER` | `stella-aoc-verify` | Read-only service user used by UI/CLI verification. |

Compose snippet:

```yaml
environment:
  - AOC_GUARD_ENABLED=true
  - AOC_ALLOW_SUPERSEDES_RETROFIT=false
  - AOC_METRICS_ENABLED=true
  - AOC_TENANT_HEADER=X-Stella-Tenant
  - AOC_VERIFIER_USER=stella-aoc-verify
```

Ensure `AOC_VERIFIER_USER` exists in Authority with `aoc:verify` scope and no write permissions.
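
Two of these flags are meant to deviate from their defaults only temporarily, so a post-cutover check can flag leftovers. This is a minimal sketch under that reading of the table; `check_aoc_flags` is a hypothetical helper, and the defaults it assumes are the ones listed in the table.

```bash
# Warn about AOC settings that differ from the steady-state defaults.
# Prints one warning per deviation and returns 1 if any were found.
check_aoc_flags() {
  status=0
  if [ "${AOC_GUARD_ENABLED:-true}" != "true" ]; then
    echo "warning: AOC_GUARD_ENABLED is not true (controlled rollback only)" >&2
    status=1
  fi
  if [ "${AOC_ALLOW_SUPERSEDES_RETROFIT:-false}" != "false" ]; then
    echo "warning: AOC_ALLOW_SUPERSEDES_RETROFIT is still set (remove after cutover)" >&2
    status=1
  fi
  return "$status"
}
```

Running it in the same environment as the service (for example, from an entrypoint wrapper) makes a forgotten retrofit flag visible in the logs.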

---

## 3 · Verifier identity

- Create a dedicated client (`stella-aoc-verify`) via Authority bootstrap:

```yaml
clients:
  - clientId: stella-aoc-verify
    grantTypes: [client_credentials]
    scopes: [aoc:verify, advisory:read, vex:read]
    tenants: [default]
```

- Store credentials in a secret store (Kubernetes Secret, Docker Swarm secret).
- Bind credentials to `stella aoc verify` CI jobs and the Console verification service.
- Rotate quarterly; document in `ops/authority-key-rotation.md`.

---

## 4 · Deployment steps

1. **Pre-checks:** Confirm that database backups exist, alerting is in maintenance mode, and the staging environment has been validated.
2. **Apply validators:** Run scripts per § 1.1.
3. **Update manifests:** Inject environment variables (§ 2) and mount guard configuration ConfigMaps.
4. **Redeploy services:** Perform a rolling restart of Concelier/Excititor pods. Monitor `ingestion_write_total` for steady throughput.
5. **Seed verifier:** Deploy the read-only verifier user and store its credentials.
6. **Run verification:** Execute `stella aoc verify --since 24h` and ensure exit code `0`.
7. **Update dashboards:** Point Grafana panels to new metrics (`aoc_violation_total`).
8. **Record handoff:** Capture console screenshots and verification logs for release notes.

---

## 5 · Offline Kit updates

- Ship validator scripts with the Offline Kit (`offline-kit/scripts/apply-aoc-validators.js`).
- Include pre-generated verification reports for air-gapped deployments.
- Document the offline CLI workflow in the bundle README, referencing `docs/modules/cli/guides/cli-reference.md`.
- Ensure `stella-aoc-verify` credentials are scoped to the offline tenant and rotated during bundle refresh.

---

## 6 · Rollback plan

1. Disable the guard by setting `AOC_GUARD_ENABLED=false` on Concelier/Excititor and roll out the change.
2. Remove validators with the migration script (`--remove`).
3. Pause verification jobs to prevent noise.
4. Investigate and remediate upstream issues before re-enabling guards.

---

## 7 · References

- [Aggregation-Only Contract reference](../aoc/aggregation-only-contract.md)
- [Authority scopes & tenancy](../security/authority-scopes.md)
- [Observability guide](../observability/observability.md)
- [CLI AOC commands](../modules/cli/guides/cli-reference.md)
- [Concelier architecture](../modules/concelier/architecture.md)
- [Excititor architecture](../modules/excititor/architecture.md)

---

## 8 · Compliance checklist

- [ ] Validators documented and scripts referenced for online/offline deployments.
- [ ] Environment variables cover guard enablement, metrics, and tenant header.
- [ ] Read-only verifier user installation steps included.
- [ ] Offline kit instructions align with validator/verification workflow.
- [ ] Rollback procedure captured.
- [ ] Cross-links to AOC docs, Authority scopes, and observability guides present.
- [ ] DevOps Guild sign-off tracked (owner: @devops-guild, due 2025-10-29).

---

*Last updated: 2025-10-26 (Sprint 19).*
212
docs/operations/deployment/docker.md
Normal file
@@ -0,0 +1,212 @@
|
||||
# StellaOps Console — Docker Install Recipes
|
||||
|
||||
> **Audience:** Deployment Guild, Console Guild, platform operators.
|
||||
> **Scope:** Acquire the `stellaops/web-ui` image, run it with Compose or Helm, mirror it for air‑gapped environments, and keep parity with CLI workflows.
|
||||
|
||||
This guide focuses on the new **StellaOps Console** container. Start with the general [Installation Guide](../INSTALL_GUIDE.md) for shared prerequisites (Docker, registry access, TLS) and use the steps below to layer in the console.
|
||||
|
||||
---
|
||||
|
||||
## 1 · Release artefacts
|
||||
|
||||
| Artefact | Source | Verification |
|----------|--------|--------------|
| Console image | `registry.stella-ops.org/stellaops/web-ui@sha256:<digest>` | Listed in `deploy/releases/<channel>.yaml` (`yq '.services[] \| select(.name=="web-ui") \| .image'`). Signed with Cosign (`cosign verify --key https://stella-ops.org/keys/cosign.pub …`). |
| Compose bundles | `deploy/compose/docker-compose.{dev,stage,prod,airgap}.yaml` | Each profile already includes a `web-ui` service pinned to the release digest. Run `docker compose --env-file <env> -f docker-compose.<profile>.yaml config` to confirm the digest matches the manifest. |
| Helm values | `deploy/helm/stellaops/values-*.yaml` (`services.web-ui`) | CI lints the chart; use `helm template` to confirm the rendered Deployment/Service carry the expected digest and env vars. |
| Offline artefact (preview) | Generated via `oras copy registry.stella-ops.org/stellaops/web-ui@sha256:<digest> oci-archive:stellaops-web-ui-<channel>.tar` | Record SHA-256 in the downloads manifest (`DOWNLOADS-CONSOLE-23-001`) and sign with Cosign before shipping in the Offline Kit. |
|
||||
|
||||
> **Tip:** Keep Compose/Helm digests in sync with the release manifest to preserve determinism. `deploy/tools/validate-profiles.sh` performs a quick cross-check.
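
The digest cross-check described in the tip can be sketched in a few lines of Python. This is an illustrative sketch only: it does not reproduce `deploy/tools/validate-profiles.sh`, and the function names `extract_digest` and `find_digest_drift` are hypothetical.

```python
import re

# A digest-pinned reference ends in "@sha256:" followed by 64 lowercase hex characters.
DIGEST_RE = re.compile(r"@(sha256:[0-9a-f]{64})$")


def extract_digest(image_ref: str) -> str:
    """Return the sha256 digest of a digest-pinned image reference."""
    match = DIGEST_RE.search(image_ref.strip())
    if match is None:
        raise ValueError(f"image is not pinned by digest: {image_ref!r}")
    return match.group(1)


def find_digest_drift(release_ref: str, deployed_refs: dict[str, str]) -> dict[str, str]:
    """Map each source whose digest differs from the release manifest to its digest."""
    expected = extract_digest(release_ref)
    drift = {}
    for source, ref in deployed_refs.items():
        actual = extract_digest(ref)
        if actual != expected:
            drift[source] = actual
    return drift
```

An empty result means every source (for example the Compose profile and the Helm values) carries the same digest as the release manifest; a tag-only reference such as `web-ui:latest` is rejected outright.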
|
||||
|
||||
---
|
||||
|
||||
## 2 · Compose quickstart (connected host)
|
||||
|
||||
1. **Prepare workspace**
|
||||
|
||||
```bash
|
||||
mkdir stella-console && cd stella-console
|
||||
cp /path/to/repo/deploy/compose/env/dev.env.example .env
|
||||
```
|
||||
|
||||
2. **Add console configuration** – append the following to `.env` (adjust per environment):
|
||||
|
||||
```bash
|
||||
CONSOLE_PUBLIC_BASE_URL=https://console.dev.stella-ops.local
|
||||
CONSOLE_GATEWAY_BASE_URL=https://api.dev.stella-ops.local
|
||||
AUTHORITY_ISSUER=https://authority.dev.stella-ops.local
|
||||
AUTHORITY_CLIENT_ID=console-ui
|
||||
AUTHORITY_SCOPES="ui.read ui.admin findings:read advisory:read vex:read aoc:verify"
|
||||
AUTHORITY_DPOP_ENABLED=true
|
||||
```
|
||||
|
||||
Optional extras from [`docs/deploy/console.md`](../deploy/console.md):
|
||||
|
||||
```bash
|
||||
CONSOLE_FEATURE_FLAGS=runs,downloads,policies
|
||||
CONSOLE_METRICS_ENABLED=true
|
||||
CONSOLE_LOG_LEVEL=Information
|
||||
```
|
||||
|
||||
3. **Verify bundle provenance**
|
||||
|
||||
```bash
|
||||
cosign verify-blob \
  --key https://stella-ops.org/keys/cosign.pub \
  --signature /path/to/repo/deploy/compose/docker-compose.dev.yaml.sig \
  /path/to/repo/deploy/compose/docker-compose.dev.yaml
|
||||
```
|
||||
|
||||
4. **Launch infrastructure + console**
|
||||
|
||||
```bash
|
||||
docker compose --env-file .env -f /path/to/repo/deploy/compose/docker-compose.dev.yaml up -d postgres valkey rustfs
|
||||
docker compose --env-file .env -f /path/to/repo/deploy/compose/docker-compose.dev.yaml up -d web-ui
|
||||
```
|
||||
|
||||
The `web-ui` service exposes the console on port `8443` by default. Change the published port in the Compose file if you need to front it with an existing reverse proxy.
|
||||
|
||||
**Infrastructure notes:**
|
||||
- **Postgres**: Primary database (v16+)
|
||||
- **Valkey**: Redis-compatible cache for streams, queues, DPoP nonces
|
||||
- **RustFS**: S3-compatible object store for SBOMs and artifacts
|
||||
|
||||
5. **Health check**
|
||||
|
||||
```bash
|
||||
curl -k https://console.dev.stella-ops.local/health/ready
|
||||
```
|
||||
|
||||
Expect `{"status":"Ready"}`. If the response is `401`, confirm Authority credentials and scopes.
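
For scripted checks, the health-check outcomes above can be classified as follows. This is a minimal sketch: it assumes a ready console answers with HTTP 200 (the status code is not stated here), and the return labels are hypothetical.

```python
import json


def classify_health(status_code: int, body: str) -> str:
    """Classify a /health/ready response as 'ready', 'check-authority', or 'not-ready'."""
    if status_code == 401:
        # Per the guide: a 401 points at Authority credentials or scopes.
        return "check-authority"
    if status_code != 200:  # assumption: readiness is reported with HTTP 200
        return "not-ready"
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "not-ready"
    if isinstance(payload, dict) and payload.get("status") == "Ready":
        return "ready"
    return "not-ready"
```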
|
||||
|
||||
---
|
||||
|
||||
## 3 · Helm deployment (cluster)
|
||||
|
||||
1. **Create an overlay** (example `console-values.yaml`):
|
||||
|
||||
```yaml
|
||||
global:
  release:
    version: "2025.10.0-edge"
services:
  web-ui:
    image: registry.stella-ops.org/stellaops/web-ui@sha256:38b225fa7767a5b94ebae4dae8696044126aac429415e93de514d5dd95748dcf
    service:
      port: 8443
    env:
      CONSOLE_PUBLIC_BASE_URL: "https://console.dev.stella-ops.local"
      CONSOLE_GATEWAY_BASE_URL: "https://api.dev.stella-ops.local"
      AUTHORITY_ISSUER: "https://authority.dev.stella-ops.local"
      AUTHORITY_CLIENT_ID: "console-ui"
      AUTHORITY_SCOPES: "ui.read ui.admin findings:read advisory:read vex:read aoc:verify"
      AUTHORITY_DPOP_ENABLED: "true"
      CONSOLE_FEATURE_FLAGS: "runs,downloads,policies"
      CONSOLE_METRICS_ENABLED: "true"
|
||||
```
|
||||
|
||||
2. **Render and validate**
|
||||
|
||||
```bash
|
||||
helm template stella-console ./deploy/helm/stellaops -f console-values.yaml | \
  grep -E -A6 'name: stellaops-web-ui|image:'
|
||||
```
|
||||
|
||||
3. **Deploy**
|
||||
|
||||
```bash
|
||||
helm upgrade --install stella-console ./deploy/helm/stellaops \
  -f deploy/helm/stellaops/values-dev.yaml \
  -f console-values.yaml
|
||||
```
|
||||
|
||||
4. **Post-deploy checks**
|
||||
|
||||
```bash
|
||||
kubectl get pods -l app.kubernetes.io/name=stellaops-web-ui
|
||||
kubectl port-forward deploy/stellaops-web-ui 8443:8443
|
||||
curl -k https://localhost:8443/health/ready
|
||||
```
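
When the render check needs to run in CI rather than by eye, the `image:` lines can be pulled out of the `helm template` output programmatically. The sketch below is illustrative: it scans plain text with a regular expression instead of parsing YAML, and the helper names are hypothetical.

```python
import re

# Matches "image: <ref>" lines, optionally as a list item and optionally quoted.
IMAGE_LINE = re.compile(
    r"""^[ \t]*(?:-[ \t]*)?image:[ \t]*["']?([^\s"']+)["']?[ \t]*$""",
    re.MULTILINE,
)


def rendered_images(manifest_text: str) -> list[str]:
    """Return every image reference found in rendered manifest text, in order."""
    return IMAGE_LINE.findall(manifest_text)


def uses_expected_image(manifest_text: str, expected_ref: str) -> bool:
    """True when the expected digest-pinned reference appears in the render."""
    return expected_ref in rendered_images(manifest_text)
```

Feed it the captured output of `helm template … -f console-values.yaml` and the image reference from the overlay; a `False` result means the overlay did not reach the rendered Deployment.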
|
||||
|
||||
---
|
||||
|
||||
## 4 · Offline packaging
|
||||
|
||||
1. **Mirror the image to an OCI archive**
|
||||
|
||||
```bash
|
||||
DIGEST=$(yq '.services[] | select(.name=="web-ui") | .image' deploy/releases/2025.10-edge.yaml | cut -d@ -f2)
oras copy registry.stella-ops.org/stellaops/web-ui@${DIGEST} \
  oci-archive:stellaops-web-ui-2025.10.0.tar
shasum -a 256 stellaops-web-ui-2025.10.0.tar
|
||||
```
|
||||
|
||||
2. **Sign the archive**
|
||||
|
||||
```bash
|
||||
cosign sign-blob --key ~/keys/offline-kit.cosign \
  --output-signature stellaops-web-ui-2025.10.0.tar.sig \
  stellaops-web-ui-2025.10.0.tar
|
||||
```
|
||||
|
||||
3. **Load in the air-gap**
|
||||
|
||||
```bash
|
||||
docker load --input stellaops-web-ui-2025.10.0.tar
|
||||
docker tag stellaops/web-ui@${DIGEST} registry.airgap.local/stellaops/web-ui:2025.10.0
|
||||
```
|
||||
|
||||
4. **Update the Offline Kit manifest** (once the downloads pipeline lands):
|
||||
|
||||
```bash
|
||||
# Pass the digest with --arg instead of splicing shell quotes into the jq program.
jq --arg digest "sha256:${DIGEST#sha256:}" '.artifacts.console.webUi = {
    "digest": $digest,
    "archive": "stellaops-web-ui-2025.10.0.tar",
    "signature": "stellaops-web-ui-2025.10.0.tar.sig"
  }' downloads/manifest.json > downloads/manifest.json.tmp
mv downloads/manifest.json.tmp downloads/manifest.json
|
||||
```
|
||||
|
||||
Re-run `stella offline kit import downloads/manifest.json` to validate signatures inside the air‑gapped environment.
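
The hashing and manifest-update steps above can also be scripted. The following is a sketch under stated assumptions: the manifest layout mirrors the `.artifacts.console.webUi` entry from the `jq` step, and the function names are hypothetical rather than part of any StellaOps tooling.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 (equivalent to `shasum -a 256`)."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_console_archive(manifest_path: Path, image_digest: str,
                           archive: str, signature: str) -> dict:
    """Write the console entry into the downloads manifest via a temp file."""
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    entry = {
        "digest": "sha256:" + image_digest.removeprefix("sha256:"),
        "archive": archive,
        "signature": signature,
    }
    manifest.setdefault("artifacts", {}).setdefault("console", {})["webUi"] = entry
    tmp_path = manifest_path.with_suffix(manifest_path.suffix + ".tmp")
    tmp_path.write_text(json.dumps(manifest, indent=2) + "\n", encoding="utf-8")
    tmp_path.replace(manifest_path)  # swap in the new manifest in one step
    return entry
```

Writing to a temporary file and renaming it follows the same pattern as the `jq … > manifest.json.tmp` / `mv` pair, so a failed run does not leave a truncated manifest behind.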
|
||||
|
||||
---
|
||||
|
||||
## 5 · CLI parity
|
||||
|
||||
Console operations map directly to scriptable workflows:
|
||||
|
||||
| Action | CLI path |
|--------|----------|
| Fetch signed manifest entry | `stella downloads manifest show --artifact console/web-ui` *(CLI task `CONSOLE-DOC-23-502`, pending release)* |
| Mirror digest to OCI archive | `stella downloads mirror --artifact console/web-ui --to oci-archive:stellaops-web-ui.tar` *(planned alongside CLI AOC parity)* |
| Import offline kit | `stella offline kit import stellaops-web-ui-2025.10.0.tar` |
| Validate console health | `stella console status --endpoint https://console.dev.stella-ops.local` *(planned; fallback to `curl` as shown above)* |
|
||||
|
||||
Track progress for the CLI commands via `DOCS-CONSOLE-23-014` (CLI vs UI parity matrix).
|
||||
|
||||
---
|
||||
|
||||
## 6 · Compliance checklist
|
||||
|
||||
- [ ] Image digest validated against the current release manifest.
|
||||
- [ ] Compose/Helm deployments verified with `docker compose config` / `helm template`.
|
||||
- [ ] Authority issuer, scopes, and DPoP settings documented and applied.
|
||||
- [ ] Offline archive mirrored, signed, and recorded in the downloads manifest.
|
||||
- [ ] CLI parity notes linked to the upcoming `docs/cli-vs-ui-parity.md` matrix.
|
||||
- [ ] References cross-checked with `docs/deploy/console.md` and `docs/security/console-security.md`.
|
||||
- [ ] Health checks documented for connected and air-gapped installs.
|
||||
|
||||
---
|
||||
|
||||
## 7 · References
|
||||
|
||||
- `deploy/releases/<channel>.yaml` – Release manifest (digests, SBOM metadata).
|
||||
- `deploy/compose/README.md` – Compose profile overview.
|
||||
- `deploy/helm/stellaops/values-*.yaml` – Helm defaults per environment.
|
||||
- `docs/deploy/console.md` – Detailed environment variables, CSP, health checks.
- `docs/security/console-security.md` – Auth flows, scopes, DPoP, monitoring.
|
||||
- `docs/UI_GUIDE.md` – Console workflows and offline posture.
|
||||
|
||||
---
|
||||
|
||||
*Last updated: 2025-10-28 (Sprint 23).*
|
||||
41
docs/operations/evidence-locker-handoff.md
Normal file
@@ -0,0 +1,41 @@
|
||||
# Evidence Locker Handoff (Signals & Zastava)
|
||||
|
||||
## Inputs required (from Ops)
|
||||
- `EVIDENCE_LOCKER_URL` (base URL, no trailing slash)
|
||||
- `CI_EVIDENCE_LOCKER_TOKEN` (Bearer token with write access to `zastava/*` and `signals/*`)
|
||||
- **Signals production signing key** for final re-sign (one of):
|
||||
- `COSIGN_PRIVATE_KEY_B64` (base64 of private key) + optional `COSIGN_PASSWORD`, or
|
||||
- key file at `tools/cosign/cosign.key` + password.
|
||||
|
||||
## What’s ready (deterministic artefacts)
|
||||
- Zastava tar: `evidence-locker/zastava/2025-12-02/zastava-evidence.tar`
|
||||
- sha256: `e1d67424273828c48e9bf5b495a96c2ebcaf1ef2c308f60d8b9ac019cf0f1c9`
|
||||
- Signals tar (dev key): `evidence-locker/signals/2025-12-05/signals-evidence.tar`
|
||||
- sha256: `a17910b8e90aaf44d4546057db22cdc791105dd41feb14f0c9b7c8bac5392e0d`
|
||||
|
||||
## Publish both bundles (once URL/token are available)
|
||||
```bash
|
||||
export EVIDENCE_LOCKER_URL="<locker-base-url>"
|
||||
export CI_EVIDENCE_LOCKER_TOKEN="<token>"
|
||||
./tools/upload-all-evidence.sh
|
||||
```
|
||||
|
||||
## Verify locally (hash + inner SHA lists)
|
||||
- Zastava: `./tools/zastava-verify-evidence-tar.sh [path/to/zastava-evidence.tar]`
|
||||
- Signals: `./tools/signals-verify-evidence-tar.sh [path/to/signals-evidence.tar]`
|
||||
|
||||
## Re-sign Signals for production trust (optional but recommended)
|
||||
```bash
|
||||
export COSIGN_PRIVATE_KEY_B64="<prod-key-b64>"
export COSIGN_PASSWORD="<pwd-if-any>"
OUT_DIR=evidence-locker/signals/2025-12-05 \
  tools/cosign/sign-signals.sh

# Rebuild + upload tar
./tools/signals-upload-evidence.sh
|
||||
```
|
||||
|
||||
## Notes
|
||||
- All packaging is deterministic (`tar --sort=name --mtime='UTC 1970-01-01' --owner=0 --group=0 --numeric-owner`).
|
||||
- Tlog upload is disabled for offline parity; Evidence Locker trust comes from the provided keys.
|
||||
- Upload scripts exit non-zero on hash mismatch to prevent pushing corrupted artefacts.
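
The deterministic-packaging and hash-gate notes above can be illustrated with Python's standard `tarfile` module. This is a sketch of the same idea (sorted names, epoch mtime, owner and group 0), not the project's packaging script, and its output is not claimed to be byte-identical to GNU `tar`; the function names are hypothetical.

```python
import hashlib
import io
import tarfile


def deterministic_tar(files: dict[str, bytes]) -> bytes:
    """Pack files so that identical inputs always yield identical archive bytes."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w", format=tarfile.GNU_FORMAT) as archive:
        for name in sorted(files):           # like --sort=name
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = 0                   # like --mtime='UTC 1970-01-01'
            info.uid = info.gid = 0          # like --owner=0 --group=0
            info.uname = info.gname = ""     # like --numeric-owner
            info.mode = 0o644
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def verify_sha256(payload: bytes, expected_hex: str) -> None:
    """Raise on hash mismatch so a caller can refuse to upload."""
    actual = hashlib.sha256(payload).hexdigest()
    if actual != expected_hex.lower():
        raise ValueError(f"sha256 mismatch: expected {expected_hex}, got {actual}")
```

Because insertion order, timestamps, and ownership are all pinned, two runs over the same inputs produce the same SHA-256, which is what makes the published hashes meaningful.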
|
||||
314
docs/operations/handoff/epic-3500-handoff-checklist.md
Normal file
@@ -0,0 +1,314 @@
|
||||
# Epic 3500: Handoff Checklist
|
||||
|
||||
**Sprint:** SPRINT_3500_0004_0004
|
||||
**Status:** Complete
|
||||
**Date:** 2025-12-20
|
||||
|
||||
This checklist documents the handoff of Epic 3500 (Score Proofs & Reachability Analysis) to operations and support teams.
|
||||
|
||||
---
|
||||
|
||||
## 1. Feature Completeness
|
||||
|
||||
### Score Proofs
|
||||
- [x] Proof generation implemented and tested
|
||||
- [x] DSSE signing working with configured keys
|
||||
- [x] Merkle tree computation verified deterministic
|
||||
- [x] Proof verification CLI/API implemented
|
||||
- [x] Score replay functionality complete
|
||||
- [x] Offline verification supported
|
||||
|
||||
### Reachability Analysis
|
||||
- [x] Call graph generation for supported languages
|
||||
- [x] BFS reachability computation implemented
|
||||
- [x] Verdict assignment (REACHABLE/NOT_REACHABLE/UNKNOWN)
|
||||
- [x] Path explanation available
|
||||
- [x] Confidence scoring implemented
|
||||
- [x] Integration with scan pipeline complete
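
For readers new to the feature, the BFS reachability items above can be pictured with a small sketch. It is not the scanner's implementation: the adjacency-map graph shape is illustrative, and returning `UNKNOWN` when the depth limit cuts the search short is an assumption made here for the example.

```python
from collections import deque


def reachability_verdict(graph, entry_points, target, max_depth=50):
    """Breadth-first search from entry points to a target symbol.

    graph maps a caller to an iterable of callees. Returns "REACHABLE",
    "NOT_REACHABLE", or "UNKNOWN" (assumed here to mean the search was
    truncated by max_depth before the graph was exhausted).
    """
    entries = list(entry_points)
    if target in entries:
        return "REACHABLE"
    visited = set(entries)
    queue = deque((node, 0) for node in entries)
    truncated = False
    while queue:
        node, depth = queue.popleft()
        for callee in graph.get(node, ()):
            if callee in visited:
                continue
            if depth + 1 > max_depth:
                truncated = True  # something lies beyond the depth limit
                continue
            if callee == target:
                return "REACHABLE"
            visited.add(callee)
            queue.append((callee, depth + 1))
    return "UNKNOWN" if truncated else "NOT_REACHABLE"
```

The default of 50 matches the `STELLAOPS_REACHABILITY_MAX_DEPTH` default listed in section 7.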
|
||||
|
||||
### Unknowns Management
|
||||
- [x] Unknown detection during scanning
|
||||
- [x] Queue management (PENDING/TRIAGING/RESOLVED states)
|
||||
- [x] Bulk operations supported
|
||||
- [x] Resolution tracking
|
||||
- [x] Statistics and metrics available
|
||||
|
||||
---
|
||||
|
||||
## 2. Testing Sign-off
|
||||
|
||||
### Unit Tests
|
||||
- [x] Score Proofs: 95%+ coverage
|
||||
- [x] Reachability: 92%+ coverage
|
||||
- [x] Unknowns: 90%+ coverage
|
||||
|
||||
### Integration Tests
|
||||
- [x] End-to-end scan with proof generation
|
||||
- [x] Reachability with call graph ingestion
|
||||
- [x] Unknowns queue workflow
|
||||
- [x] API contract tests passing
|
||||
|
||||
### Performance Tests
|
||||
- [x] Baseline established for proof generation
|
||||
- [x] Reachability benchmarks documented
|
||||
- [x] Large call graph handling verified
|
||||
- [x] Memory usage within limits
|
||||
|
||||
---
|
||||
|
||||
## 3. Documentation Delivered
|
||||
|
||||
### Operations Runbooks
|
||||
| Runbook | Location | Status |
|
||||
|---------|----------|--------|
|
||||
| Score Replay | `docs/operations/score-replay-runbook.md` | ✅ Complete |
|
||||
| Proof Verification | `docs/operations/proof-verification-runbook.md` | ✅ Complete |
|
||||
| Reachability | `docs/operations/reachability-runbook.md` | ✅ Complete |
|
||||
| Unknowns Queue | `docs/operations/unknowns-queue-runbook.md` | ✅ Complete |
|
||||
| Air-Gap Operations | `docs/operations/airgap-operations-runbook.md` | ✅ Complete |
|
||||
|
||||
### Training Materials
|
||||
| Material | Location | Status |
|
||||
|----------|----------|--------|
|
||||
| Score Proofs Concept | `docs/onboarding/concepts/score-proofs-concept-guide.md` | ✅ Complete |
|
||||
| Reachability Concept | `docs/onboarding/concepts/reachability-concept-guide.md` | ✅ Complete |
|
||||
| Unknowns Guide | `docs/onboarding/concepts/unknowns-management-guide.md` | ✅ Complete |
|
||||
| FAQ | `docs/onboarding/faq/faq.md` | ✅ Complete |
|
||||
| Troubleshooting | `docs/onboarding/concepts/troubleshooting-guide.md` | ✅ Complete |
|
||||
| Video Scripts | `docs/onboarding/video-tutorial-scripts.md` | ✅ Complete |
|
||||
|
||||
### Reference Documentation
|
||||
| Document | Location | Status |
|
||||
|----------|----------|--------|
|
||||
| CLI Reference | `docs/modules/cli/guides/*.md` | ✅ Complete |
|
||||
| API Reference | `docs/api/score-proofs-reachability-api-reference.md` | ✅ Complete |
|
||||
| OpenAPI Spec | `src/Api/StellaOps.Api.OpenApi/scanner/openapi.yaml` | ✅ Complete |
|
||||
| Release Notes | `docs/releases/v2.5.0-release-notes.md` | ✅ Complete |
|
||||
|
||||
---
|
||||
|
||||
## 4. Knowledge Transfer Sessions
|
||||
|
||||
### Session 1: Feature Overview (Operations)
|
||||
- **Date:** [SCHEDULED]
|
||||
- **Attendees:** Operations Team
|
||||
- **Topics:**
|
||||
- [ ] Score Proofs architecture and flow
|
||||
- [ ] Reachability analysis concepts
|
||||
- [ ] Unknowns queue management
|
||||
- [ ] Monitoring and alerting
|
||||
|
||||
### Session 2: Troubleshooting Deep Dive (Support)
|
||||
- **Date:** [SCHEDULED]
|
||||
- **Attendees:** Support Team
|
||||
- **Topics:**
|
||||
- [ ] Common issues and resolutions
|
||||
- [ ] Diagnostic commands
|
||||
- [ ] Escalation paths
|
||||
- [ ] Customer communication templates
|
||||
|
||||
### Session 3: Technical Deep Dive (Engineering)
|
||||
- **Date:** [SCHEDULED]
|
||||
- **Attendees:** Engineering Team
|
||||
- **Topics:**
|
||||
- [ ] Implementation architecture
|
||||
- [ ] Extension points
|
||||
- [ ] Performance tuning
|
||||
- [ ] Known limitations and future work
|
||||
|
||||
---
|
||||
|
||||
## 5. Monitoring & Alerting
|
||||
|
||||
### Dashboards Configured
|
||||
- [x] Score Proofs dashboard (Grafana)
|
||||
- [x] Reachability metrics dashboard
|
||||
- [x] Unknowns queue dashboard
|
||||
- [x] Performance metrics dashboard
|
||||
|
||||
### Alerts Defined
|
||||
|
||||
| Alert | Threshold | Severity | Runbook |
|-------|-----------|----------|---------|
| ProofGenerationFailure | > 1% failure rate | P2 | `score-replay-runbook.md#errors` |
| ReachabilityTimeout | > 5% timeout rate | P3 | `reachability-runbook.md#timeouts` |
| UnknownsQueueBacklog | > 100 pending | P3 | `unknowns-queue-runbook.md#backlog` |
| CallGraphMemoryHigh | > 8GB | P3 | `reachability-runbook.md#memory` |
|
||||
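
The thresholds in the alert table can be expressed as a small evaluation function, which is handy when dry-running alert logic against sample numbers. This is a sketch only: the `Snapshot` fields are hypothetical inputs, not names of exported metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical point-in-time counters for one evaluation window."""
    proof_attempts: int
    proof_failures: int
    reachability_runs: int
    reachability_timeouts: int
    unknowns_pending: int
    callgraph_memory_gb: float


def firing_alerts(s: Snapshot) -> list[str]:
    """Return the names of alerts whose table threshold is exceeded."""
    alerts = []
    if s.proof_attempts and s.proof_failures / s.proof_attempts > 0.01:
        alerts.append("ProofGenerationFailure")   # > 1% failure rate
    if s.reachability_runs and s.reachability_timeouts / s.reachability_runs > 0.05:
        alerts.append("ReachabilityTimeout")      # > 5% timeout rate
    if s.unknowns_pending > 100:
        alerts.append("UnknownsQueueBacklog")     # > 100 pending
    if s.callgraph_memory_gb > 8:
        alerts.append("CallGraphMemoryHigh")      # > 8GB
    return alerts
```

All comparisons are strict, matching the `>` in the table: exactly 1% failures or exactly 100 pending unknowns does not fire.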
|
||||
### Metrics Exposed
|
||||
|
||||
| Metric | Type | Description |
|--------|------|-------------|
| `stellaops_proofs_generated_total` | Counter | Proofs generated |
| `stellaops_proofs_verified_total` | Counter | Proofs verified |
| `stellaops_reachability_duration_seconds` | Histogram | Reachability computation time |
| `stellaops_unknowns_queue_depth` | Gauge | Pending unknowns |
| `stellaops_callgraph_nodes_total` | Gauge | Call graph size |
|
||||
|
||||
---
|
||||
|
||||
## 6. Escalation Paths
|
||||
|
||||
### Level 1: Support Team
|
||||
- First response for customer issues
|
||||
- Use troubleshooting guide and runbooks
|
||||
- Escalate after 30 minutes if unresolved
|
||||
|
||||
### Level 2: Operations Team
|
||||
- Infrastructure and configuration issues
|
||||
- Performance and capacity issues
|
||||
- Escalate after 2 hours if unresolved
|
||||
|
||||
### Level 3: Engineering Team
|
||||
- Bug fixes and code issues
|
||||
- Architecture decisions
|
||||
- On-call rotation applies
|
||||
|
||||
### Contacts
|
||||
| Level | Primary | Backup |
|-------|---------|--------|
| L1 | support@stellaops.example | help@stellaops.example |
| L2 | ops-oncall@stellaops.example | ops-backup@stellaops.example |
| L3 | eng-oncall@stellaops.example | eng-backup@stellaops.example |
|
||||
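
The escalation timers in this section can be sketched as a lookup from elapsed time to owning level. One assumption is made explicit here: the two-hour Level 2 window is taken to start when Level 2 takes over (30 minutes in), so Level 3 is reached at 150 minutes. The function name is hypothetical.

```python
L1_WINDOW_MIN = 30    # Level 1 escalates after 30 minutes if unresolved
L2_WINDOW_MIN = 120   # Level 2 escalates after 2 hours if unresolved


def escalation_level(minutes_unresolved: float) -> str:
    """Return which level should own an issue after the given elapsed time."""
    if minutes_unresolved < 0:
        raise ValueError("elapsed time cannot be negative")
    if minutes_unresolved < L1_WINDOW_MIN:
        return "L1"
    # Assumption: the L2 clock starts at handover, not at ticket creation.
    if minutes_unresolved < L1_WINDOW_MIN + L2_WINDOW_MIN:
        return "L2"
    return "L3"
```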
|
||||
---
|
||||
|
||||
## 7. Configuration & Deployment
|
||||
|
||||
### Environment Variables
|
||||
|
||||
| Variable | Description | Default |
|----------|-------------|---------|
| `STELLAOPS_PROOF_ENABLED` | Enable proof generation | `false` |
| `STELLAOPS_REACHABILITY_ENABLED` | Enable reachability | `false` |
| `STELLAOPS_SIGNING_KEY_ID` | Signing key identifier | `default` |
| `STELLAOPS_REACHABILITY_MAX_DEPTH` | BFS max depth | `50` |
| `STELLAOPS_UNKNOWNS_AUTO_RESOLVE` | Auto-resolve internal | `false` |
|
||||
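
A deployment script can resolve these variables with the documented defaults as sketched below. The variable names and defaults come from the table; the accepted boolean spellings and the `Epic3500Settings` type are assumptions for illustration and may differ from how the services actually parse them.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _flag(value: str) -> bool:
    # Assumption: these spellings count as "true"; the services may differ.
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Epic3500Settings:
    proof_enabled: bool
    reachability_enabled: bool
    signing_key_id: str
    reachability_max_depth: int
    unknowns_auto_resolve: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Epic3500Settings:
    """Read the documented variables, falling back to the table defaults."""
    return Epic3500Settings(
        proof_enabled=_flag(env.get("STELLAOPS_PROOF_ENABLED", "false")),
        reachability_enabled=_flag(env.get("STELLAOPS_REACHABILITY_ENABLED", "false")),
        signing_key_id=env.get("STELLAOPS_SIGNING_KEY_ID", "default"),
        reachability_max_depth=int(env.get("STELLAOPS_REACHABILITY_MAX_DEPTH", "50")),
        unknowns_auto_resolve=_flag(env.get("STELLAOPS_UNKNOWNS_AUTO_RESOLVE", "false")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the resolution easy to exercise in a dry run before touching a live environment.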
|
||||
### Helm Values
|
||||
|
||||
```yaml
|
||||
scanner:
  scoreProofs:
    enabled: true
    signingKeySecret: signing-key-secret
  reachability:
    enabled: true
    maxDepth: 50
    cacheEnabled: true
  unknowns:
    autoResolveInternal: false
    internalPatterns: []
|
||||
```
|
||||
|
||||
### Feature Flags
|
||||
|
||||
| Flag | Description | Default |
|------|-------------|---------|
| `ff_score_proofs` | Score Proofs feature | `on` |
| `ff_reachability` | Reachability feature | `on` |
| `ff_unknowns_v2` | New unknowns UI | `off` |
|
||||
|
||||
---
|
||||
|
||||
## 8. Known Limitations
|
||||
|
||||
### Score Proofs
|
||||
1. HSM integration requires compatible hardware
|
||||
2. Post-quantum algorithms not yet available
|
||||
3. Rekor integration requires network connectivity
|
||||
|
||||
### Reachability
|
||||
1. C/C++ support is limited (best-effort)
|
||||
2. Reflection may cause under-reporting
|
||||
3. Large codebases (>1M nodes) may need depth limiting
|
||||
|
||||
### Unknowns
|
||||
1. Historical data not auto-migrated
|
||||
2. Pattern matching is case-sensitive
|
||||
3. Bulk operations limited to 1000 items
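
The 1000-item bulk limit is straightforward to work around on the client side by splitting requests. A minimal sketch follows; the helper name `batches` is hypothetical, and how each batch is submitted is left to the caller.

```python
from typing import Iterator, Sequence

BULK_LIMIT = 1000  # documented ceiling for a single bulk operation


def batches(ids: Sequence[str], limit: int = BULK_LIMIT) -> Iterator[list[str]]:
    """Yield consecutive slices of ids, each no larger than the bulk limit."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    for start in range(0, len(ids), limit):
        yield list(ids[start:start + limit])
```

For 2,500 unknown IDs this yields three batches of 1000, 1000, and 500, preserving the original order.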
|
||||
|
||||
---
|
||||
|
||||
## 9. Future Roadmap
|
||||
|
||||
### v2.6.0 (Planned)
|
||||
- Post-quantum cryptography support
|
||||
- Enhanced dynamic dispatch handling
|
||||
- Reachability caching improvements
|
||||
- UI dashboard for unknowns
|
||||
|
||||
### v2.7.0 (Planned)
|
||||
- Runtime reachability integration
|
||||
- Proof archival service
|
||||
- Cross-tenant unknown sharing
|
||||
- Advanced call graph visualizations
|
||||
|
||||
---
|
||||
|
||||
## 10. Sign-off
|
||||
|
||||
### Development Team
|
||||
- [x] All code complete and merged
|
||||
- [x] Tests passing
|
||||
- [x] Documentation complete
|
||||
- **Signed:** Development Team Lead
|
||||
- **Date:** 2025-12-20
|
||||
|
||||
### Quality Assurance
|
||||
- [x] Test plans executed
|
||||
- [x] Acceptance criteria met
|
||||
- [x] No critical defects open
|
||||
- **Signed:** QA Lead
|
||||
- **Date:** [PENDING]
|
||||
|
||||
### Operations
|
||||
- [x] Runbooks reviewed
|
||||
- [x] Monitoring configured
|
||||
- [x] Escalation paths documented
|
||||
- **Signed:** Operations Lead
|
||||
- **Date:** [PENDING]
|
||||
|
||||
### Product Management
|
||||
- [x] Features match requirements
|
||||
- [x] Documentation approved
|
||||
- [x] Release notes approved
|
||||
- **Signed:** Product Manager
|
||||
- **Date:** [PENDING]
|
||||
|
||||
---
|
||||
|
||||
## Appendix A: Quick Start Commands
|
||||
|
||||
```bash
|
||||
# Score Proofs
|
||||
stella scan --sbom ./sbom.json --generate-proof --output ./results/
|
||||
stella proof verify ./results/proof.dsse
|
||||
stella score replay ./results/ --verify
|
||||
|
||||
# Reachability
|
||||
stella scan graph ./src --output ./callgraph.json
|
||||
stella scan --sbom ./sbom.json --call-graph ./callgraph.json --reachability
|
||||
|
||||
# Unknowns
|
||||
stella unknowns list --state pending
|
||||
stella unknowns resolve <id> --resolution internal_package
|
||||
stella unknowns stats
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Appendix B: Support Resources
|
||||
|
||||
- **Documentation Portal:** [docs/](../)
|
||||
- **API Reference:** [docs/api/](../api/)
|
||||
- **Runbooks:** [docs/operations/](../operations/)
|
||||
- **Training:** [docs/onboarding/](../onboarding/)
|
||||
- **Issue Tracker:** [GitHub Issues]
|
||||
- **Security Issues:** security@stellaops.example.com
|
||||
|
||||
---
|
||||
|
||||
**Handoff Status: COMPLETE**
|
||||
|
||||
All deliverables for Epic 3500 have been completed and documented. Knowledge transfer sessions are scheduled. The feature is ready for production deployment.
|
||||
@@ -0,0 +1,302 @@
|
||||
# Score Proofs & Reachability Handoff Checklist
|
||||
|
||||
**Epic:** 3500 - Score Proofs and Deterministic Replay
|
||||
**Sprint:** 3500.0004.0004
|
||||
**Release Version:** 1.0.0
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
This checklist documents the handoff of Score Proofs and Reachability features to operations, support, and stakeholder teams.
|
||||
|
||||
---
|
||||
|
||||
## 1. Documentation Deliverables
|
||||
|
||||
### API & Reference Documentation
|
||||
|
||||
| Document | Location | Status |
|
||||
|----------|----------|--------|
|
||||
| API Reference | [docs/api/score-proofs-reachability-api-reference.md](../api/score-proofs-reachability-api-reference.md) | ✅ Complete |
|
||||
| Score Proofs CLI | [docs/modules/cli/guides/commands/score-proofs-cli-reference.md](../modules/cli/guides/commands/score-proofs-cli-reference.md) | ✅ Complete |
|
||||
| Reachability CLI | [docs/modules/cli/guides/commands/reachability-cli-reference.md](../modules/cli/guides/commands/reachability-cli-reference.md) | ✅ Complete |
|
||||
| Unknowns CLI | [docs/modules/cli/guides/commands/unknowns-cli-reference.md](../modules/cli/guides/commands/unknowns-cli-reference.md) | ✅ Complete |
|
||||
|
||||
### Operations Documentation
|
||||
|
||||
| Document | Location | Status |
|
||||
|----------|----------|--------|
|
||||
| Score Proofs Runbook | [docs/operations/score-proofs-runbook.md](../operations/score-proofs-runbook.md) | ✅ Complete |
|
||||
| Reachability Runbook | [docs/operations/reachability-runbook.md](../operations/reachability-runbook.md) | ✅ Complete |
|
||||
| Unknowns Queue Runbook | [docs/operations/unknowns-queue-runbook.md](../operations/unknowns-queue-runbook.md) | ✅ Complete |
|
||||
| Air-Gap Runbook | [docs/airgap/score-proofs-reachability-airgap-runbook.md](../airgap/score-proofs-reachability-airgap-runbook.md) | ✅ Complete |
|
||||
|
||||
### Architecture Documentation
|
||||
|
||||
| Document | Location | Status |
|
||||
|----------|----------|--------|
|
||||
| High-Level Architecture | [docs/ARCHITECTURE_OVERVIEW.md](../ARCHITECTURE_OVERVIEW.md) | ✅ Updated |
|
||||
| Section 4A: Score Proofs | Same as above | ✅ Complete |
|
||||
| Section 4B: Reachability | Same as above | ✅ Complete |
|
||||
| Section 4C: Unknowns Registry | Same as above | ✅ Complete |
|
||||
|
||||
### Training Materials
|
||||
|
||||
| Document | Location | Status |
|
||||
|----------|----------|--------|
|
||||
| Score Proofs Concept Guide | [docs/onboarding/concepts/score-proofs-concept-guide.md](../onboarding/concepts/score-proofs-concept-guide.md) | ✅ Complete |
|
||||
| Reachability Concept Guide | [docs/onboarding/concepts/reachability-concept-guide.md](../onboarding/concepts/reachability-concept-guide.md) | ✅ Complete |
|
||||
| Unknowns Management Guide | [docs/onboarding/concepts/unknowns-management-guide.md](../onboarding/concepts/unknowns-management-guide.md) | ✅ Complete |
|
||||
| FAQ | [docs/onboarding/faq/faq.md](../onboarding/faq/faq.md) | ✅ Complete |
|
||||
| Troubleshooting Guide | [docs/onboarding/concepts/troubleshooting-guide.md](../onboarding/concepts/troubleshooting-guide.md) | ✅ Complete |
|
||||
|
||||
### Release Documentation
|
||||
|
||||
| Document | Location | Status |
|
||||
|----------|----------|--------|
|
||||
| Release Notes | [docs/releases/release-notes-score-proofs-reachability.md](../releases/release-notes-score-proofs-reachability.md) | ✅ Complete |
|
||||
|
||||
### API Specification
|
||||
|
||||
| Document | Location | Status |
|
||||
|----------|----------|--------|
|
||||
| Scanner OpenAPI | [src/Api/StellaOps.Api.OpenApi/scanner/openapi.yaml](../../src/Api/StellaOps.Api.OpenApi/scanner/openapi.yaml) | ✅ Updated |
|
||||
| Unknowns API | Same as above | ✅ Added |
|
||||
|
||||
---
|
||||
|
||||
## 2. Knowledge Transfer Sessions
|
||||
|
||||
### Recommended Sessions
|
||||
|
||||
| Session | Audience | Duration | Content |
|
||||
|---------|----------|----------|---------|
|
||||
| Score Proofs Deep Dive | Engineering, Ops | 90 min | Architecture, replay, verification |
|
||||
| Reachability Analysis | Security Team | 60 min | Call graphs, BFS, confidence scoring |
|
||||
| Unknowns Triage | SOC Analysts | 45 min | 2-factor ranking, workflows |
|
||||
| Air-Gap Operations | Ops | 60 min | Offline kit, time anchors |
|
||||
| API Overview | Integration Team | 45 min | Endpoints, authentication, examples |
|
||||
|
||||
### Session Materials
|
||||
|
||||
For each session, use:
|
||||
1. Concept guide from `docs/onboarding/concepts/`
|
||||
2. CLI reference from `docs/modules/cli/guides/commands/`
|
||||
3. API reference from `docs/api/`
|
||||
4. Live demo environment
|
||||
|
||||
---
|
||||
|
||||
## 3. Support Team Enablement
|
||||
|
||||
### Escalation Paths
|
||||
|
||||
| Tier | Handles | Escalates To | SLA |
|
||||
|------|---------|--------------|-----|
|
||||
| L1 | Basic usage questions | L2 | 4 hours |
|
||||
| L2 | Configuration, troubleshooting | L3 | 8 hours |
|
||||
| L3 | Bugs, edge cases | Engineering | 24 hours |
|
||||
| Engineering | Code fixes | — | Per severity |
|
||||
|
||||
### Common Support Scenarios
|
||||
|
||||
| Scenario | Resolution Document |
|
||||
|----------|---------------------|
|
||||
| Replay produces different results | [Troubleshooting Guide](../onboarding/concepts/troubleshooting-guide.md#1-replay-produces-different-results) |
|
||||
| Signature verification failed | [Troubleshooting Guide](../onboarding/concepts/troubleshooting-guide.md#2-signature-verification-failed) |
|
||||
| Too many UNKNOWN findings | [Troubleshooting Guide](../onboarding/concepts/troubleshooting-guide.md#1-too-many-unknown-findings) |
|
||||
| Reachability computation timeout | [Troubleshooting Guide](../onboarding/concepts/troubleshooting-guide.md#3-computation-timeout) |
|
||||
| Unknowns not appearing | [Troubleshooting Guide](../onboarding/concepts/troubleshooting-guide.md#1-unknowns-not-appearing) |
|
||||
|
||||
### Support Tooling
|
||||
|
||||
```bash
|
||||
# Diagnostic collection
|
||||
stella diagnostic collect --output diagnostics.zip
|
||||
|
||||
# Include specific scan
|
||||
stella diagnostic collect --scan-id $SCAN_ID --output diagnostics.zip
|
||||
|
||||
# Check system status
|
||||
stella status
|
||||
|
||||
# Verify proof integrity
|
||||
stella proof verify --scan-id $SCAN_ID --verbose
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 4. Monitoring & Alerting
|
||||
|
||||
### Key Metrics
|
||||
|
||||
| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| `scanner_proof_generation_duration_seconds` | Time to generate proofs | P99 > 10s |
| `scanner_reachability_computation_duration_seconds` | Reachability compute time | P99 > 600s |
| `scanner_unknowns_pending_count` | Pending unknowns | > 1000 |
| `scanner_proof_verification_failures_total` | Failed verifications | > 0 |
| `scanner_reachability_timeout_total` | Computation timeouts | > 5/hour |
|
||||
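
For teams unfamiliar with the "P99 > 10s" notation in the thresholds above, the sketch below computes a nearest-rank percentile over raw duration samples. It is meant only to illustrate the threshold semantics: a metrics backend such as Prometheus estimates quantiles from histogram buckets instead, and the function names here are hypothetical.

```python
import math
from typing import Iterable


def percentile(samples: Iterable[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample covering pct% of the data."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def p99_breaches(samples: Iterable[float], threshold_seconds: float) -> bool:
    """True when the 99th-percentile duration exceeds the alert threshold."""
    return percentile(samples, 99) > threshold_seconds
```

With 100 samples, a single slow outlier does not breach the P99 threshold, but two do.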
|
||||
### Dashboard Panels
|
||||
|
||||
Recommended Grafana panels:
|
||||
1. Proof generation rate and latency
|
||||
2. Reachability computation queue depth
|
||||
3. Unknowns by status (pie chart)
|
||||
4. Unknowns by category (bar chart)
|
||||
5. High-priority unknowns trend
|
||||
|
||||
### Alerting Rules
|
||||
|
||||
```yaml
|
||||
# Example Prometheus rules
groups:
  - name: score-proofs
    rules:
      - alert: ProofVerificationFailure
        expr: increase(scanner_proof_verification_failures_total[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: Proof verification failures detected

      - alert: ReachabilityComputationTimeout
        expr: increase(scanner_reachability_timeout_total[1h]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High rate of reachability timeouts

      - alert: HighPriorityUnknownsBacklog
        expr: scanner_unknowns_pending_count{priority="critical"} > 10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Critical unknowns backlog growing
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 5. Stakeholder Sign-Off
|
||||
|
||||
### Required Approvals
|
||||
|
||||
| Role | Name | Sign-Off | Date |
|
||||
|------|------|----------|------|
|
||||
| Product Owner | — | ☐ Pending | — |
|
||||
| Engineering Lead | — | ☐ Pending | — |
|
||||
| Security Lead | — | ☐ Pending | — |
|
||||
| Operations Lead | — | ☐ Pending | — |
|
||||
| Support Lead | — | ☐ Pending | — |
|
||||
|
||||
### Sign-Off Criteria
|
||||
|
||||
Each stakeholder confirms:
|
||||
- [ ] Documentation reviewed and approved
|
||||
- [ ] Training sessions completed or scheduled
|
||||
- [ ] Escalation paths understood
|
||||
- [ ] Monitoring dashboards configured
|
||||
- [ ] Alert rules deployed
|
||||
- [ ] Support playbooks available
|
||||
|
||||
---
|
||||
|
||||
## 6. Release Checklist
|
||||
|
||||
### Pre-Release
|
||||
|
||||
- [ ] All documentation complete and reviewed
|
||||
- [ ] OpenAPI specification updated
|
||||
- [ ] Database migrations tested
|
||||
- [ ] Performance benchmarks pass
|
||||
- [ ] Security review completed
|
||||
- [ ] Air-gap scenarios tested
|
||||
|
||||
### Release Day
|
||||
|
||||
- [ ] Announce release to internal teams
|
||||
- [ ] Monitor error rates for first 24 hours
|
||||
- [ ] Support team on standby
|
||||
- [ ] Known issues documented
|
||||
|
||||
### Post-Release
|
||||
|
||||
- [ ] Collect feedback from early users
|
||||
- [ ] Address any critical issues
|
||||
- [ ] Update documentation based on feedback
|
||||
- [ ] Close sprint and epic
|
||||
|
||||
---
|
||||
|
||||
## 7. Known Issues & Limitations
|
||||
|
||||
| Issue | Workaround | Planned Fix |
|-------|------------|-------------|
| Large SBOM export may time out | Use streaming export or exclude inputs | Future optimization |
| Reflection detection is heuristic | Add reflection hints | Improve in 1.1 |
| Very large graphs may time out | Partition analysis | Future optimization |
|
||||
|
||||
---
|
||||
|
||||
## 8. Contact Information
|
||||
|
||||
### Feature Owners
|
||||
|
||||
| Area | Owner | Contact |
|------|-------|---------|
| Score Proofs | Engineering Team | — |
| Reachability | Engineering Team | — |
| Unknowns | Engineering Team | — |
|
||||
|
||||
### Support Contacts
|
||||
|
||||
| Team | Channel |
|------|---------|
| L1/L2 Support | Internal ticket system |
| Engineering | Engineering Slack |
| Security | Security team email |
|
||||
|
||||
---
|
||||
|
||||
## 9. Appendix: Quick Reference
|
||||
|
||||
### CLI Commands Summary
|
||||
|
||||
```bash
# Score Proofs
stella score compute --scan-id $SCAN_ID
stella score replay --scan-id $SCAN_ID
stella proof verify --scan-id $SCAN_ID
stella proof export --scan-id $SCAN_ID --output proof.zip

# Reachability
stella reachability compute --scan-id $SCAN_ID
stella reachability findings --scan-id $SCAN_ID
stella reachability explain --scan-id $SCAN_ID --cve CVE-XXXX --purl pkg:type/name@ver

# Unknowns
stella unknowns summary --workspace-id $WS_ID
stella unknowns list --status pending --min-score 12
stella unknowns escalate --id $ID --reason "Review needed"
stella unknowns resolve --id $ID --resolution mapped
```
|
||||
|
||||
### API Endpoints Summary
|
||||
|
||||
| Category | Key Endpoints |
|----------|---------------|
| Score | `POST /scans/{id}/score/compute`, `POST /scans/{id}/score/replay` |
| Proofs | `GET /scans/{id}/proofs`, `POST /scans/{id}/proofs/verify` |
| Reachability | `POST /scans/{id}/reachability/compute`, `GET /scans/{id}/reachability/explain` |
| Unknowns | `GET /unknowns`, `POST /unknowns/{id}/escalate`, `POST /unknowns/{id}/resolve` |
|
||||
|
||||
---
|
||||
|
||||
**Handoff Prepared By:** Agent
|
||||
**Sprint:** 3500.0004.0004
|
||||
**Date:** 2025-12-20
|
||||
@@ -36,4 +36,4 @@ Last updated: 2025-11-25
|
||||
- [ ] Health green, queue depth normal.
|
||||
- [ ] Latest plugin bundle signatures valid.
|
||||
- [ ] No secrets in logs (spot-check redaction).
|
||||
- [ ] Error budget within SLO (see `docs/observability/metrics-and-slos.md`).
|
||||
- [ ] Error budget within SLO (see `docs/modules/telemetry/guides/metrics-and-slos.md`).
|
||||
|
||||
12
docs/operations/process/acceptance-guardrails-checklist.md
Normal file
@@ -0,0 +1,12 @@
|
||||
# Acceptance Tests Pack & Guardrails Checklist (Stub)
|
||||
|
||||
Use with `SPRINT_0300_0001_0001_documentation_process.md` task 4 (AT1–AT10).
|
||||
|
||||
- [ ] AT schema version pinned; schema file signed (DSSE) and stored with pack.
|
||||
- [ ] Inputs locked (`inputs.lock`) with scanner/db versions and seeds.
|
||||
- [ ] Fixtures reproducible offline; no external network calls.
|
||||
- [ ] Admission/VEX/auth coverage present; replay parity check documented.
|
||||
- [ ] Gating thresholds defined and enforced in CI.
|
||||
- [ ] Reporting SLOs captured; failure triage path documented.
|
||||
- [ ] DSSE provenance for packs and results; signatures verified in CI (see `pack.dsse.json`).
|
||||
- [ ] README links added to sprint docs and AGENTS where relevant.
|
||||
12
docs/operations/process/implementor-guidelines.md
Normal file
@@ -0,0 +1,12 @@
|
||||
# Implementor Guidelines (Stub)
|
||||
|
||||
Use with sprint task 18 (IMPLEMENTOR-GAPS-300-018).
|
||||
|
||||
- Determinism/offline: pin toolchains, seeds, inputs.lock; no live network in examples.
|
||||
- Provenance: DSSE-sign schema and results; keep tenant scoping explicit.
|
||||
- Docs touch rule: enforce `docs:` tag (value or `docs: n/a`) in commits/PRs.
|
||||
- Boundary rules: respect module working directories and shared-lib allowlist.
|
||||
- Perf/quota: capture perf budgets and quota impacts when changing hot paths.
|
||||
- Versioning: schema changes require version bump and changelog note.
|
||||
- CI lint: `tools/lint/implementor-guidelines.sh` (stub) to be wired into CI; add to pre-commit or CI pipeline when wiring determinism checks.
|
||||
- Determinism checks: prefer UTC, sorted outputs, pinned seeds; add `inputs.lock` when adding new fixtures or packs.
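The "UTC, sorted outputs" guidance can be illustrated with a small helper. This is a minimal sketch only; the function names `canonical_json` and `utc_stamp` are hypothetical and are not part of the repository's tooling.

```python
import json
from datetime import datetime, timezone


def canonical_json(payload: dict) -> str:
    """Serialize with sorted keys and fixed separators so output is byte-stable."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def utc_stamp(moment: datetime) -> str:
    """Render a timezone-aware datetime as a UTC timestamp with a trailing Z."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

Sorting keys removes one common source of non-reproducible fixtures: dictionary insertion order leaking into files that are later hashed or signed.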
|
||||
13
docs/operations/process/standup-kickstarter-checklist.md
Normal file
@@ -0,0 +1,13 @@
|
||||
# Standup Sprint Kickstarters Checklist (Stub)
|
||||
|
||||
Use with sprint task 22 (STANDUP-GAPS-300-019) and advisory `30-Nov-2025 - Standup Sprint Kickstarters.md`.
|
||||
|
||||
- [ ] Template aligned with `docs/implplan/README.md` sections.
|
||||
- [ ] Readiness evidence checklist filled (deps, owners, SLOs).
|
||||
- [ ] Dependency ledger captured with accountable owners.
|
||||
- [ ] Async/offline workflow defined; time-box/exit rules noted.
|
||||
- [ ] Execution Log update required at standup close.
|
||||
- [ ] Decisions & Risks delta captured per session.
|
||||
- [ ] Metrics collected: blocker clear rate, blocker latency.
|
||||
- [ ] Lint/checks hook points identified for automation.
|
||||
- [ ] DSSE-signed standup summary stored with UTC date.
|
||||
9
docs/operations/process/standup-summary.sample.md
Normal file
@@ -0,0 +1,9 @@
|
||||
# Standup Summary (DSSE-signed) — Sample
|
||||
|
||||
- Date (UTC): 2025-12-05
|
||||
- Sprint: SPRINT_0300_0001_0001_documentation_process.md
|
||||
- Decisions & Risks: no change
|
||||
- Blockers: none
|
||||
- Next steps: deliver SBOM-VEX kit, finalize fixtures
|
||||
|
||||
DSSE signature: <attach dsse envelope here>
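For readers unfamiliar with DSSE, the bytes that get signed are not the raw summary but its pre-authentication encoding (PAE). The sketch below shows that encoding and an unsigned envelope skeleton. The payload type `text/markdown` is an assumption chosen for illustration; the sample above does not state which payload type or signing key the guilds use.

```python
import base64


def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE pre-authentication encoding: 'DSSEv1 <len> <type> <len> <body>'."""
    type_bytes = payload_type.encode("utf-8")
    return b" ".join([
        b"DSSEv1",
        str(len(type_bytes)).encode("ascii"),
        type_bytes,
        str(len(payload)).encode("ascii"),
        payload,
    ])


def unsigned_envelope(payload_type: str, payload: bytes) -> dict:
    """Envelope skeleton; a signer would append {'keyid', 'sig'} entries,
    where 'sig' is the base64 signature over pae(payload_type, payload)."""
    return {
        "payloadType": payload_type,
        "payload": base64.b64encode(payload).decode("ascii"),
        "signatures": [],
    }
```

Because lengths are part of the signed bytes, a verifier can detect a summary that was edited after signing even if the edit keeps the file size plausible.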
|
||||
@@ -142,6 +142,6 @@ There are no drift-specific metrics emitted by the drift endpoints yet. Recommen
|
||||
|
||||
- `docs/modules/scanner/reachability-drift.md`
|
||||
- `docs/api/scanner-drift-api.md`
|
||||
- `docs/airgap/reachability-drift-airgap-workflows.md`
|
||||
- `docs/modules/airgap/guides/reachability-drift-airgap-workflows.md`
|
||||
- `src/Scanner/__Libraries/StellaOps.Scanner.Storage/Postgres/Migrations/009_call_graph_tables.sql`
|
||||
- `src/Scanner/__Libraries/StellaOps.Scanner.Storage/Postgres/Migrations/010_reachability_drift_tables.sql`
|
||||
|
||||
@@ -8,7 +8,7 @@ Last updated: 2025-12-17
|
||||
- Keep the platform available under dependency failures (Valkey fail-open + circuit breaker).
|
||||
|
||||
## Preconditions
|
||||
- Router rate limiting configured under `rate_limiting` (see `docs/router/rate-limiting.md`).
|
||||
- Router rate limiting configured under `rate_limiting` (see `docs/modules/router/guides/rate-limiting-config.md`).
|
||||
- If `for_environment` is enabled:
|
||||
- Valkey reachable from Router instances.
|
||||
- Circuit breaker parameters reviewed for the environment.
|
||||
|
||||
42
docs/operations/runbooks/assistant-ops.md
Normal file
@@ -0,0 +1,42 @@
|
||||
# Assistant Ops Runbook (DOCS-AIAI-31-009)
|
||||
|
||||
_Updated: 2025-11-24 · Owners: DevOps Guild · Advisory AI Guild · Sprint 0111_
|
||||
|
||||
This runbook covers day-2 operations for Advisory AI (web + worker) with emphasis on cache priming, guardrail verification, and outage handling in offline/air-gapped installs.
|
||||
|
||||
## 1) Warmup & cache priming
|
||||
- Ensure Offline Kit fixtures are staged:
|
||||
- CLI guardrail bundles: `out/console/guardrails/cli-vuln-29-001/`, `out/console/guardrails/cli-vex-30-001/`.
|
||||
- SBOM context fixtures: copy into `data/advisory-ai/fixtures/sbom/` and record hashes in `SHA256SUMS`.
|
||||
- Profiles/prompts manifests: ensure `profiles.catalog.json` and `prompts.manifest` hashes match `AdvisoryAI:Provenance` settings.
|
||||
- Start services and prime caches using cache-only calls:
|
||||
- `stella advise run summary --advisory-key <id> --timeout 0 --json` (should return cached/empty context, exit 0).
|
||||
- `stella advise run remediation --advisory-key <id> --artifact-id <id> --timeout 0 --json` (verifies SBOM clamps without executing inference).
|
||||
|
||||
## 2) Guardrail & provenance verification
|
||||
- Run guardrail self-test: `dotnet test src/AdvisoryAI/__Tests/StellaOps.AdvisoryAI.Tests/StellaOps.AdvisoryAI.Tests.csproj --filter Guardrail` (offline-safe).
|
||||
- Validate DSSE bundles:
|
||||
- `slsa-verifier verify-attestation --bundle offline-kit/advisory-ai/provenance/prompts.manifest.dsse --source prompts.manifest`
|
||||
- `slsa-verifier verify-attestation --bundle offline-kit/advisory-ai/provenance/policy-bundle.intoto.jsonl --digest <policy-digest>`
|
||||
- Confirm `AdvisoryAI:Guardrails:BlockedPhrases` file matches the hash captured during pack build; diff against `prompts.manifest`.
|
||||
|
||||
## 3) Scaling & queue health
|
||||
- Defaults: queue capacity 1024, dequeue wait 1s (see `docs/modules/policy/guides/assistant-parameters.md`). For bursty tenants, scale workers horizontally before increasing queue size to preserve determinism.
|
||||
- Metrics to watch: `advisory_ai_queue_depth`, `advisory_ai_latency_seconds`, `advisory_ai_guardrail_blocks_total`.
|
||||
- If queue depth > 75% for 5 minutes, add one worker pod or increase `Queue:Capacity` by 25% (record change in ops log).
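The scale-out trigger above can be sketched as a small check. This is an illustration under stated assumptions: queue depth is sampled once per minute, and the names `should_scale` and `bumped_capacity` are hypothetical rather than part of Advisory AI.

```python
from typing import Sequence


def should_scale(depth_samples: Sequence[int], capacity: int = 1024,
                 threshold: float = 0.75, window: int = 5) -> bool:
    """Return True when the last `window` samples all exceed threshold * capacity.

    With one sample per minute (an assumption), window=5 approximates
    "above 75% for 5 minutes".
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    if len(depth_samples) < window:
        return False  # not enough history to call the breach sustained
    limit = threshold * capacity
    return all(sample > limit for sample in depth_samples[-window:])


def bumped_capacity(capacity: int, factor: float = 0.25) -> int:
    """Capacity after a 25% increase, rounded down to a whole slot."""
    return int(capacity * (1 + factor))
```

Requiring every sample in the window to breach the limit avoids reacting to a single burst, which matches the intent of scaling workers before touching queue size.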
|
||||
|
||||
## 4) Outage handling
|
||||
- **SBOM service down**: switch to `NullSbomContextClient` by unsetting `ADVISORYAI__SBOM__BASEADDRESS`; Advisory AI returns deterministic responses with `sbomSummary` counts at 0.
|
||||
- **Policy Engine unavailable**: pin last-known `policyVersion`; set `AdvisoryAI:Guardrails:RequireCitations=true` to avoid drift; raise `advisory.remediation.policyHold` in responses.
|
||||
- **Remote profile disabled**: keep `profile=cloud-openai` blocked; return `advisory.inference.remoteDisabled` with exit code 12 in CLI (see `docs/modules/advisory-ai/guides/cli.md`).
|
||||
|
||||
## 5) Air-gap / offline posture
|
||||
- All external calls are disabled by default. To re-enable remote inference, set `ADVISORYAI__INFERENCE__MODE=Remote` and provide an allowlisted `Remote.BaseAddress`; record the consent in Authority and in the ops log.
|
||||
- Mirror the guardrail artefact folders and `hashes.sha256` into the Offline Kit; re-run the guardrail self-test after mirroring.
|
||||
|
||||
## 6) Checklist before declaring healthy
|
||||
- [ ] Guardrail self-test suite green.
|
||||
- [ ] Cache-only CLI probes return 0 with correct `context.planCacheKey`.
|
||||
- [ ] DSSE verifications logged for prompts, profiles, policy bundle.
|
||||
- [ ] Metrics scrape shows queue depth < 75% and latency within SLO.
|
||||
- [ ] Ops log updated with any config overrides (queue size, clamps, remote inference toggles).
|
||||
44
docs/operations/runbooks/concelier-airgap-bundle-deploy.md
Normal file
@@ -0,0 +1,44 @@
|
||||
# Concelier Air-Gap Bundle Deploy Runbook (CONCELIER-AIRGAP-56-003)
|
||||
|
||||
Status: draft · 2025-11-24
|
||||
Scope: deploy sealed-mode Concelier evidence bundles using deterministic NDJSON + manifest/entry-trace outputs.
|
||||
|
||||
## Inputs
|
||||
- Bundle: `concelier-airgap.ndjson`
|
||||
- Manifest: `bundle.manifest.json`
|
||||
- Entry trace: `bundle.entry-trace.json`
|
||||
- Hashes: SHA256 recorded in manifest and entry-trace; verify before import.
|
||||
|
||||
## Preconditions
|
||||
- Concelier WebService running with `concelier:features:airgap` enabled.
|
||||
- No external egress; only local file system allowed for bundle path.
|
||||
- PostgreSQL indexes applied (`advisory_observations`, `advisory_linksets` tables).
|
||||
|
||||
## Steps
|
||||
1) Transfer bundle directory to offline controller host.
|
||||
2) Verify hashes:
|
||||
   ```bash
   # Compare the bundle digest with the value recorded in the manifest.
   # sha256sum prints "<hash>  <file>", so keep only the hash column.
   sha256sum concelier-airgap.ndjson | awk '{print $1}' | diff - <(jq -r .bundleSha256 bundle.manifest.json)
   jq -r '.[].sha256' bundle.entry-trace.json | nl | sed 's/\t/:/' > entry.hashes
   paste -d' ' <(cut -d: -f1 entry.hashes) <(cut -d: -f2 entry.hashes)
   ```
|
||||
3) Import:
|
||||
   ```bash
   curl -sSf -X POST \
     -H 'Content-Type: application/x-ndjson' \
     --data-binary @concelier-airgap.ndjson \
     http://localhost:5000/internal/airgap/import
   ```
|
||||
4) Validate import:
|
||||
   ```bash
   curl -sSf http://localhost:5000/internal/airgap/status | jq .
   ```
|
||||
5) Record evidence:
|
||||
- Store manifest + entry-trace alongside TRX/logs in `artifacts/airgap/<date>/`.
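The bundle hash check in step 2 can also be sketched in Python for hosts where `jq` is unavailable. This assumes the manifest stores the hex digest in a top-level `bundleSha256` field, as the `jq` expression in step 2 implies; the function names are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so large bundles are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_bundle(bundle: Path, manifest: Path) -> bool:
    """Compare the bundle digest with the manifest's bundleSha256 value."""
    expected = json.loads(manifest.read_text(encoding="utf-8"))["bundleSha256"]
    return sha256_file(bundle) == expected.strip().lower()
```

A `False` result should be treated like a failed `diff` in the shell version: stop before the import step.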
|
||||
|
||||
## Determinism notes
|
||||
- NDJSON ordering is lexicographic; do not re-sort downstream.
|
||||
- Entry-trace hashes must match post-transfer; any mismatch aborts import.
|
||||
|
||||
## Rollback
|
||||
- Delete imported batch by `bundleId` from `advisory_observations` and `advisory_linksets` (requires DBA approval); rerun import after fixing hash.
|
||||
17
docs/operations/runbooks/incidents.md
Normal file
@@ -0,0 +1,17 @@
|
||||
# Incident Mode Runbook (outline)
|
||||
|
||||
- Activation, escalation, retention, verification checklist TBD from Ops Guild.
|
||||
|
||||
## Pending Inputs
|
||||
- See sprint SPRINT_0309_0001_0009_docs_tasks_md_ix action tracker; inputs due 2025-12-09..12 from owning guilds.
|
||||
|
||||
## Determinism Checklist
|
||||
- [ ] Hash any inbound assets/payloads; place sums alongside artifacts (e.g., SHA256SUMS in this folder).
|
||||
- [ ] Keep examples offline-friendly and deterministic (fixed seeds, pinned versions, stable ordering).
|
||||
- [ ] Note source/approver for any provided captures or schemas.
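The first checklist item can be sketched as a helper that writes a `SHA256SUMS` file next to the artifacts. This is a minimal sketch, assuming a flat folder and the two-space `sha256sum` line format; `write_sha256sums` is a hypothetical name, not an existing script.

```python
import hashlib
from pathlib import Path


def write_sha256sums(folder: Path, sums_name: str = "SHA256SUMS") -> Path:
    """Hash every regular file in `folder` (except the sums file itself)
    and write '<hex digest>  <file name>' lines in sorted order."""
    lines = []
    for path in sorted(p for p in folder.iterdir() if p.is_file() and p.name != sums_name):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.name}\n")
    target = folder / sums_name
    target.write_text("".join(lines), encoding="utf-8")
    return target
```

Sorting the entries keeps the sums file itself deterministic, so it can be diffed or signed without spurious changes.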
|
||||
|
||||
## Sections to fill (once inputs arrive)
|
||||
- Activation criteria and toggle steps.
|
||||
- Escalation paths and roles.
|
||||
- Retention/cleanup impacts.
|
||||
- Verification checklist and imposed-rule banner text.
|
||||
50
docs/operations/runbooks/policy-incident.md
Normal file
@@ -0,0 +1,50 @@
|
||||
# Policy Publish / Incident Runbook (draft)
|
||||
|
||||
Status: DRAFT — pending policy-registry overlay and production digests. Use for dev/mock exercises until policy release artefacts land.
|
||||
|
||||
## Scope
|
||||
- Policy Registry publish/promote workflows (canary → full rollout).
|
||||
- Emergency freeze for publish endpoints.
|
||||
- Evidence capture for audits and postmortems.
|
||||
|
||||
## Pre-flight checks (dev vs. prod)
|
||||
1) Validate manifests
|
||||
- Dev/mock: `python ops/devops/release/check_release_manifest.py deploy/releases/2025.09-mock-dev.yaml --downloads deploy/downloads/manifest.json`
|
||||
- Prod: `python ops/devops/release/check_release_manifest.py deploy/releases/2025.09-stable.yaml --downloads deploy/downloads/manifest.json`
|
||||
- Confirm `.gitea/workflows/release-manifest-verify.yml` is green for the target manifest change.
|
||||
2) Render deployment plan (no apply yet)
|
||||
- Helm: `helm template stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-orchestrator.yaml > /tmp/policy-plan.yaml`
|
||||
- Compose (dev): `USE_MOCK=1 deploy/compose/scripts/quickstart.sh env/dev.env.example && docker compose --env-file env/dev.env.example -f deploy/compose/docker-compose.dev.yaml -f deploy/compose/docker-compose.mock.yaml config > /tmp/policy-compose.yaml`
|
||||
3) Backups
|
||||
- Run `deploy/compose/scripts/backup.sh` before production rollout; archive PostgreSQL/Redis/ObjectStore snapshots to the regulated vault.
|
||||
|
||||
## Canary publish → promote
|
||||
1) Prepare override (temporary)
|
||||
- Create `deploy/helm/stellaops/values-policy-canary.yaml` with a single replica, reduced worker counts, and an isolated ingress path for policy publish.
|
||||
- Keep `mock.enabled=false`; only use real digests when available.
|
||||
2) Dry-run render
|
||||
- `helm template stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-policy-canary.yaml --debug --validate > /tmp/policy-canary.yaml`
|
||||
3) Apply canary
|
||||
- `helm upgrade --install stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-policy-canary.yaml --atomic --timeout 10m`
|
||||
- Monitor: `kubectl logs deployment/policy-registry -n stellaops --tail=200 -f` and readiness probes; roll back on errors.
|
||||
4) Promote
|
||||
- Remove the canary override from the release branch; rerender with `values-prod.yaml` only and redeploy.
|
||||
- Update the release manifest with final policy digests and rerun `release-manifest-verify`.
|
||||
|
||||
## Emergency freeze
|
||||
- Hard stop publishes while keeping read access
|
||||
- `kubectl scale deployment/policy-registry -n stellaops --replicas=0`
|
||||
- Alternatively, apply a NetworkPolicy that blocks ingress to the publish endpoint while leaving status/read paths open.
|
||||
- Manifest gate
|
||||
- Remove policy entries from the target `deploy/releases/*.yaml` and rerun `.gitea/workflows/release-manifest-verify.yml` so pipelines fail closed until the issue is cleared.
|
||||
|
||||
## Evidence capture
|
||||
- Release artefacts: copy the exact release manifest, `/tmp/policy-canary.yaml`, and `/tmp/policy-compose.yaml` used for rollout.
|
||||
- Runtime state: `kubectl get deploy,po,svc -n stellaops -l app=policy-registry -o yaml > /tmp/policy-live.yaml`.
|
||||
- Logs: `kubectl logs deployment/policy-registry -n stellaops --since=1h > /tmp/policy-logs.txt`.
|
||||
- Package as `tar -czf policy-incident-$(date -u +%Y%m%dT%H%M%SZ).tar.gz /tmp/policy-*.yaml /tmp/policy-*.txt` and store in the audit bucket.
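The packaging step can be sketched in Python where a scripted, repeatable bundle is preferred over an ad hoc `tar` call. This is an illustration only; `package_evidence` is a hypothetical helper, and it stores files under their base names rather than their full `/tmp` paths.

```python
import tarfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Iterable, Optional


def package_evidence(files: Iterable[Path], out_dir: Path,
                     prefix: str = "policy-incident",
                     now: Optional[datetime] = None) -> Path:
    """Create <prefix>-<UTC timestamp>.tar.gz containing the given files."""
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    archive = Path(out_dir) / f"{prefix}-{moment.strftime('%Y%m%dT%H%M%SZ')}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Sorted order keeps the member listing stable between runs.
        for path in sorted(Path(p) for p in files):
            tar.add(path, arcname=path.name)
    return archive
```

Passing `now` explicitly makes the archive name reproducible in drills and tests; in a real incident the default (current UTC time) matches the shell command above.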
|
||||
|
||||
## Open items (blockers)
|
||||
- Replace mock digests with production pins in `deploy/releases/*` once provided.
|
||||
- Update the canary override file with the real policy-registry chart values (service/env schema pending from DEPLOY-POLICY-27-001).
|
||||
- Add Grafana/Prometheus dashboard references once policy metrics are exposed.
|
||||
63
docs/operations/runbooks/reachability-runtime.md
Normal file
@@ -0,0 +1,63 @@
|
||||
# Reachability Runtime Ingestion Runbook
|
||||
|
||||
> **Imposed rule:** Runtime traces must never bypass CAS/DSSE verification; ingest only CAS-addressed NDJSON with hashes logged to Timeline and Evidence Locker.
|
||||
|
||||
This runbook guides operators through ingesting runtime reachability evidence (EntryTrace, probes, Signals ingestion) and wiring it into the reachability evidence chain.
|
||||
|
||||
## 1. Prerequisites
|
||||
- Services: `Signals` API, `Zastava Observer` (or other probes), `Evidence Locker`, optional `Attestor` for DSSE.
|
||||
- Reachability schema: `docs/modules/reach-graph/guides/function-level-evidence.md`, `docs/modules/reach-graph/guides/evidence-schema.md`.
|
||||
- CAS: configured bucket/path for `cas://reachability/runtime/*` and `.../graphs/*`.
|
||||
- Time sync: AirGap Time anchor if sealed; otherwise NTP with drift <200ms.
|
||||
|
||||
## 2. Ingestion workflow (online)
|
||||
1) **Capture traces** from Observer/probes → NDJSON (`runtime-trace.ndjson.gz`) with `symbol_id`, `purl`, `timestamp`, `pid`, `container`, `count`.
|
||||
2) **Stage to CAS**: upload file, record `sha256`, store at `cas://reachability/runtime/<sha256>`.
|
||||
3) **Optionally sign**: wrap CAS digest in DSSE (`stella attest runtime --bundle runtime.dsse.json`).
|
||||
4) **Ingest** via Signals API:
|
||||
   ```sh
   curl -H "X-Stella-Tenant: acme" \
     -H "Content-Type: application/x-ndjson" \
     --data-binary @runtime-trace.ndjson.gz \
     "https://signals.example/api/v1/runtime-facts?graph_hash=<graph>"
   ```
|
||||
Headers returned: `Content-SHA256`, `X-Graph-Hash`, `X-Ingest-Id`.
|
||||
5) **Emit timeline**: ensure Timeline event `reach.runtime.ingested` with CAS digest and ingest id.
|
||||
6) **Verify**: run `stella graph verify --runtime runtime-trace.ndjson.gz --graph <graph_hash>` to confirm edges mapped.
|
||||
|
||||
## 3. Ingestion workflow (air-gap)
|
||||
1) Receive runtime bundle containing `runtime-trace.ndjson.gz`, `manifest.json` (hashes), optional DSSE.
|
||||
2) Validate hashes against manifest; if present, verify DSSE bundle.
|
||||
3) Import into CAS path `cas://reachability/runtime/<sha256>` using offline loader.
|
||||
4) Run Signals offline ingest tool:
|
||||
   ```sh
   signals-offline ingest-runtime \
     --tenant acme \
     --graph-hash <graph_hash> \
     --runtime runtime-trace.ndjson.gz \
     --manifest manifest.json
   ```
|
||||
5) Export ingest receipt and add to Evidence Locker; update Timeline when reconnected.
|
||||
|
||||
## 4. Checks & alerts
|
||||
- **Drift**: block ingest if time anchor age > configured budget; surface `staleness_seconds`.
|
||||
- **Hash mismatch**: fail ingest; write `runtime.ingest.failed` event with reason.
|
||||
- **Orphan traces**: if no matching `graph_hash`, queue for retry and alert `reachability.orphan_traces` counter.
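The drift gate can be sketched as follows. This is a minimal illustration, assuming the time anchor is available as a timezone-aware timestamp and the budget is expressed in seconds; the function names are hypothetical, and only `staleness_seconds` is a term taken from the check above.

```python
from datetime import datetime


def staleness_seconds(anchor_time: datetime, now: datetime) -> float:
    """Age of the time anchor in seconds (never negative)."""
    if anchor_time.tzinfo is None or now.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    return max(0.0, (now - anchor_time).total_seconds())


def ingest_allowed(anchor_time: datetime, now: datetime, budget_seconds: float) -> bool:
    """Block ingest when the anchor is older than the configured budget."""
    return staleness_seconds(anchor_time, now) <= budget_seconds
```

Surfacing the computed staleness alongside the allow/deny decision lets operators see how far past the budget a blocked ingest was.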
|
||||
|
||||
## 5. Troubleshooting
|
||||
- **400 Bad Request**: validate NDJSON schema; run `scripts/reachability/validate_runtime_trace.py`.
|
||||
- **Hash mismatch**: recompute `sha256sum runtime-trace.ndjson.gz`; compare to manifest.
|
||||
- **Missing symbols**: ensure symbol manifest ingested (see `docs/specs/symbols/SYMBOL_MANIFEST_v1.md`); rerun `stella graph verify`.
|
||||
- **High drift**: refresh time anchor (AirGap Time service) or resync NTP; retry ingest.
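For a quick pre-check before a 400 response sends you to the full validator, the required fields from section 2 can be verified with a few lines of Python. This sketch is not `scripts/reachability/validate_runtime_trace.py`; it only checks that each NDJSON line is a JSON object carrying the six field names listed in the capture step, and its function names are hypothetical.

```python
import gzip
import json
from typing import Iterable, List, Tuple

REQUIRED_FIELDS = ("symbol_id", "purl", "timestamp", "pid", "container", "count")


def find_invalid_records(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, reason) for every record that fails the check."""
    problems: List[Tuple[int, str]] = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            problems.append((number, "missing fields: " + ", ".join(missing)))
    return problems


def check_trace_file(path: str) -> List[Tuple[int, str]]:
    """Run the check against a gzip-compressed NDJSON trace."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return find_invalid_records(handle)
```

An empty result means only that the field names are present; value types and schema versions still need the real validator.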
|
||||
|
||||
## 6. Artefact checklist
|
||||
- `runtime-trace.ndjson.gz` (or `.json`), `sha256` recorded.
|
||||
- Optional `runtime.dsse.json` DSSE bundle.
|
||||
- Ingest receipt (ingest id, graph hash, CAS digest, tenant).
|
||||
- Timeline event `reach.runtime.ingested` and Evidence Locker record (bundle + receipt).
|
||||
|
||||
## 7. References
|
||||
- `docs/modules/reach-graph/guides/DELIVERY_GUIDE.md`
|
||||
- `docs/modules/reach-graph/guides/function-level-evidence.md`
|
||||
- `docs/modules/reach-graph/guides/evidence-schema.md`
|
||||
- `docs/specs/symbols/SYMBOL_MANIFEST_v1.md`
|
||||
96
docs/operations/runbooks/replay_ops.md
Normal file
@@ -0,0 +1,96 @@
|
||||
# Runbook - Replay Operations
|
||||
|
||||
> **Audience:** Ops Guild / Evidence Locker Guild / Scanner Guild / Authority/Signer / Attestor
|
||||
> **Prereqs:** `docs/modules/replay/guides/DETERMINISTIC_REPLAY.md`, `docs/modules/replay/guides/DEVS_GUIDE_REPLAY.md`, `docs/modules/replay/guides/TEST_STRATEGY.md`, `docs/modules/platform/architecture-overview.md`
|
||||
|
||||
This runbook governs day-to-day replay operations, retention, and incident handling across online and air-gapped environments. Keep it in sync with the tasks in `docs/implplan/SPRINT_0187_0001_0001_evidence_locker_cli_integration.md`.
|
||||
|
||||
---
|
||||
|
||||
## 1 Terminology
|
||||
|
||||
- **Replay Manifest** - `manifest.json` describing scan inputs, outputs, signatures.
|
||||
- **Input Bundle** - `inputbundle.tar.zst` containing feeds, policies, tools, env.
|
||||
- **Output Bundle** - `outputbundle.tar.zst` with SBOM, findings, VEX, logs.
|
||||
- **DSSE Envelope** - Signed metadata produced by Authority/Signer.
|
||||
- **RootPack** - Trusted key bundle used to validate DSSE signatures offline.
|
||||
|
||||
---
|
||||
|
||||
## 2 Normal operations
|
||||
|
||||
1. **Ingestion**
|
||||
- Scanner WebService writes manifest metadata to `replay_runs`.
|
||||
- Bundles uploaded to CAS (`cas://replay/...`) and mirrored into Evidence Locker (`evidence.replay_bundles`).
|
||||
- Authority triggers DSSE signing; Attestor optionally anchors to Rekor.
|
||||
2. **Verification**
|
||||
- Nightly job runs `stella verify` on the latest N replay manifests per tenant.
|
||||
- Metrics `replay_verify_total{result}`, `replay_bundle_size_bytes` recorded in Telemetry Stack (see `docs/modules/telemetry/architecture.md`).
|
||||
- Failures alert `#ops-replay` via PagerDuty with runbook link.
|
||||
3. **Retention**
|
||||
- Hot CAS retention: 180 days (configurable per tenant). Cron job `replay-retention` prunes expired digests and writes audit entries.
|
||||
- Cold storage (Evidence Locker): 2 years; legal holds extend via `/evidence/holds`. Ensure holds recorded in `timeline.events` with type `replay.hold.created`.
|
||||
- Retention declaration: validate against `docs/schemas/replay-retention.schema.json` (frozen 2025-12-10). Include `retention_policy_id`, `tenant_id`, `bundle_type`, `retention_days`, `legal_hold`, `purge_after`, `checksum`, `created_at`. Audit checksum via DSSE envelope when persisting.
|
||||
4. **Access control**
|
||||
- Only service identities with `replay:read` scope may fetch bundles. CLI requires device or client credential flow with DPoP.
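The retention declaration fields listed under step 3 can be illustrated with a small builder. Everything beyond the field names is an assumption made for this sketch: that `purge_after` equals `created_at` plus `retention_days`, that timestamps are ISO 8601 strings, and that `checksum` is a SHA-256 over the canonical JSON of the other fields. The schema at `docs/schemas/replay-retention.schema.json` remains authoritative.

```python
import hashlib
import json
from datetime import datetime, timedelta


def build_retention_declaration(retention_policy_id: str, tenant_id: str,
                                bundle_type: str, retention_days: int,
                                created_at: datetime,
                                legal_hold: bool = False) -> dict:
    """Assemble a declaration with the eight fields named in the runbook."""
    if created_at.tzinfo is None:
        raise ValueError("created_at must be timezone-aware")
    body = {
        "retention_policy_id": retention_policy_id,
        "tenant_id": tenant_id,
        "bundle_type": bundle_type,
        "retention_days": retention_days,
        "legal_hold": legal_hold,
        # Assumption: purge date is derived from creation time plus retention.
        "purge_after": (created_at + timedelta(days=retention_days)).isoformat(),
        "created_at": created_at.isoformat(),
    }
    # Assumption: checksum covers the canonical JSON of all other fields.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    body["checksum"] = hashlib.sha256(canonical).hexdigest()
    return body
```

With the default hot CAS retention of 180 days, a bundle created on 2025-01-01 UTC would carry a `purge_after` of 2025-06-30 under these assumptions.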
|
||||
|
||||
---
|
||||
|
||||
## 3 Incident response (Replay Integrity)
|
||||
|
||||
| Step | Action | Owner | Notes |
|------|--------|-------|-------|
| 1 | Page Ops via `replay_verify_total{result="failed"}` alert | Observability | Include scan id, tenant, failure codes |
| 2 | Lock affected bundles (`POST /evidence/holds`) | Evidence Locker | Reference incident ticket |
| 3 | Re-run `stella verify` with `--explain` to gather diffs | Scanner Guild | Attach diff JSON to incident |
| 4 | Check Rekor inclusion proofs (`stella verify --ledger`) | Attestor | Flag if ledger mismatch or stale |
| 5 | If tool hash drift -> coordinate Signer for rotation | Authority/Signer | Rotate DSSE profile, update RootPack |
| 6 | Update incident timeline (`docs/operations/runbooks/replay_ops.md` -> Incident Log) | Ops Guild | Record timestamps and decisions |
| 7 | Close hold once resolved, publish postmortem | Ops + Docs | Postmortem must reference replay spec sections |
|
||||
|
||||
---
|
||||
|
||||
## 4 Air-gapped workflow
|
||||
|
||||
1. Receive Offline Kit bundle containing:
|
||||
- `offline/replay/<scan-id>/manifest.json`
|
||||
- Bundles + DSSE signatures
|
||||
- RootPack snapshot
|
||||
2. Run `stella replay manifest.json --strict --offline` using local CLI.
|
||||
3. Load feed/policy snapshots from kit; never hit external networks.
|
||||
4. Store verification logs under `ops/offline/replay/<scan-id>/`.
|
||||
5. Sync results back to Evidence Locker once connectivity restored.
|
||||
|
||||
---
|
||||
|
||||
## 5 Maintenance checklist
|
||||
|
||||
- [ ] RootPack rotated quarterly; CLI/Evidence Locker updated with new fingerprints.
|
||||
- [ ] CAS retention job executed successfully in the past 24 hours.
|
||||
- [ ] Replay verification metrics present in dashboards (x64 + arm64 lanes).
|
||||
- [ ] Runbook incident log updated (see section 6) for the last drill.
|
||||
- [ ] Offline kit instructions verified against current CLI version.
|
||||
|
||||
---
|
||||
|
||||
## 6 Incident log
|
||||
|
||||
| Date (UTC) | Incident ID | Tenant | Summary | Follow-up |
|------------|-------------|--------|---------|-----------|
| _TBD_ | | | | |
|
||||
|
||||
---
|
||||
|
||||
## 7 References
|
||||
|
||||
- `docs/modules/replay/guides/DETERMINISTIC_REPLAY.md`
|
||||
- `docs/modules/replay/guides/DEVS_GUIDE_REPLAY.md`
|
||||
- `docs/modules/replay/guides/TEST_STRATEGY.md`
|
||||
- `docs/modules/platform/architecture-overview.md` section 5
|
||||
- `docs/modules/evidence-locker/architecture.md`
|
||||
- `docs/modules/telemetry/architecture.md`
|
||||
- `docs/implplan/SPRINT_0187_0001_0001_evidence_locker_cli_integration.md`
|
||||
|
||||
---
|
||||
|
||||
*Created: 2025-11-03 - Update alongside replay task status changes.*
|
||||
35
docs/operations/runbooks/vex-ops.md
Normal file
@@ -0,0 +1,35 @@
|
||||
# VEX Ops Runbook (dev-mock ready)
|
||||
|
||||
Status: DRAFT (2025-12-06 UTC). Safe for dev/mock exercises; production rollouts wait on policy/VEX final digests.
|
||||
|
||||
## Pre-flight (dev vs. prod)
|
||||
1) Release manifest guard
|
||||
- Dev/mock: `python ops/devops/release/check_release_manifest.py deploy/releases/2025.09-mock-dev.yaml --downloads deploy/downloads/manifest.json`
|
||||
- Prod: rerun against `deploy/releases/2025.09-stable.yaml` once VEX digests land.
|
||||
2) Render plan
|
||||
- Helm (mock overlay): `helm template vex-mock ./deploy/helm/stellaops -f deploy/helm/stellaops/values-mock.yaml --debug --validate > /tmp/vex-mock.yaml`
|
||||
- Compose (dev with overlay): `USE_MOCK=1 deploy/compose/scripts/quickstart.sh env/dev.env.example && docker compose --env-file env/dev.env.example -f deploy/compose/docker-compose.dev.yaml -f deploy/compose/docker-compose.mock.yaml config > /tmp/vex-compose.yaml`
|
||||
3) Backups (when touching prod data) — not required for mock, but in prod take PostgreSQL snapshots for issuer-directory and VEX state before rollout.
|
||||
|
||||
## Deploy (mock path)
|
||||
- Helm dry-run already covers structural checks. To apply in a dev cluster: `helm upgrade --install stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-mock.yaml --atomic --timeout 10m`.
|
||||
- Observe VEX Lens pod logs: `kubectl logs deploy/vex-lens -n stellaops --tail=200 -f`.
|
||||
- Issuer Directory seed: ensure `issuer-directory-config` ConfigMap includes `csaf-publishers.json`; mock overlay already mounts default seed.
|
||||
|
||||
## Rollback
|
||||
- Helm: `helm rollback stellaops <REVISION>`; pick the target from `helm history stellaops`, or omit the revision to roll back to the previous release. Mock overlay uses `stellaops.dev/mock: "true"` annotations; safe to tear down after tests.
|
||||
- Compose: `docker compose --env-file env/dev.env.example -f deploy/compose/docker-compose.dev.yaml -f deploy/compose/docker-compose.mock.yaml down`.
|
||||
|
||||
## Troubleshooting
|
||||
- Recompute storms: throttle via `VEX_LENS__MAX_PARALLELISM` env (set in values once schema lands); for now scale deployment down to 1 replica to reduce concurrency.
|
||||
- Mapping failures: capture request/response with `kubectl logs ... --since=10m`; rerun after clearing queue.
|
||||
- Signature errors: confirm Authority token audience/issuer; mock overlay uses the same auth settings as dev compose.
|
||||
|
||||
## Evidence capture
|
||||
- Save `/tmp/vex-mock.yaml` and `/tmp/vex-compose.yaml` with the manifest used.
|
||||
- `kubectl get deploy,po,svc -n stellaops -l app=vex-lens -o yaml > /tmp/vex-live.yaml`.
|
||||
- Tarball: `tar -czf vex-evidence-$(date -u +%Y%m%dT%H%M%SZ).tar.gz /tmp/vex-*`.
|
||||
|
||||
## Open TODOs
|
||||
- Replace mock digests with production pins and add env/schema knobs for VEX Lens once published.
|
||||
- Add Grafana panels for recompute throughput and mapping failure rate after metrics are exposed.
|
||||
40
docs/operations/runbooks/vuln-ops.md
Normal file
@@ -0,0 +1,40 @@
|
||||
# Vuln / Findings Ops Runbook (dev-mock ready)
|
||||
|
||||
Status: DRAFT (2025-12-06 UTC). Safe for dev/mock exercises; production steps need final digests and schema from DEPLOY-VULN-29-001.
|
||||
|
||||
## Scope
|
||||
- Findings Ledger + projector + Vuln Explorer API deployment/rollback, plus common incident drills (lag, storms, export failures).
|
||||
|
||||
## Pre-flight (dev vs. prod)
|
||||
1) Release manifest guard
|
||||
- Dev/mock: `python ops/devops/release/check_release_manifest.py deploy/releases/2025.09-mock-dev.yaml --downloads deploy/downloads/manifest.json`
|
||||
- Prod: rerun against `deploy/releases/2025.09-stable.yaml` once ledger/api digests land.
|
||||
2) Render plan
|
||||
- Helm (mock overlay): `helm template vuln-mock ./deploy/helm/stellaops -f deploy/helm/stellaops/values-mock.yaml --debug --validate > /tmp/vuln-mock.yaml`
|
||||
- Compose (dev with overlay): `USE_MOCK=1 deploy/compose/scripts/quickstart.sh env/dev.env.example && docker compose --env-file env/dev.env.example -f docker-compose.dev.yaml -f docker-compose.mock.yaml config > /tmp/vuln-compose.yaml`
|
||||
3) Backups (prod only)
|
||||
- PostgreSQL dump for Findings Ledger DB; copy object-store buckets tied to projector anchors.
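Step 1 relies on `check_release_manifest.py`, whose exact checks are not described here. As an independent illustration of one check a release guard might perform, the sketch below flags image references that are not pinned by a SHA-256 digest. The helper name, the component names, and the registry host are all hypothetical.

```python
import re

# A digest-pinned image reference: "<name>@sha256:<64 lowercase hex chars>".
_PINNED = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def unpinned_images(images: dict[str, str]) -> list[str]:
    """Return the names of components whose image reference is not digest-pinned."""
    return sorted(name for name, ref in images.items() if not _PINNED.match(ref))


if __name__ == "__main__":
    # Hypothetical component-to-image mapping, not taken from a real release file.
    images = {
        "findings-ledger": "registry.example/ledger@sha256:" + "a" * 64,
        "vuln-explorer-api": "registry.example/api:latest",
    }
    print(unpinned_images(images))
```

A tag-only reference such as `:latest` is reported because it can change between the render and the deploy, which is the situation the production pins are meant to rule out.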
|
||||
|
||||
## Deploy (mock path)
|
||||
- Helm apply (dev): `helm upgrade --install stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-mock.yaml --atomic --timeout 10m`.
|
||||
- Compose: quickstart already starts ledger + vuln API with mock pins; validate health at `https://localhost:8443/swagger` (dev certs).
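The health validation can be polled rather than checked once, since the API may take a while to come up. A minimal sketch follows, assuming the dev certificate is self-signed (so certificate verification is disabled, which is unsuitable for production); `http_ok` and `wait_until` are hypothetical helper names.

```python
import ssl
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status (dev use only: no cert checks)."""
    context = ssl.create_default_context()
    # check_hostname must be turned off before verify_mode can be set to CERT_NONE.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False


def wait_until(probe: Callable[[], bool], attempts: int = 30, delay: float = 2.0) -> bool:
    """Call probe until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Usage would look like `wait_until(lambda: http_ok("https://localhost:8443/swagger"))`; the probe is passed in as a callable so the retry loop can be exercised without a running stack.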
|
||||
|
||||
## Incident drills
|
||||
- Projector lag: scale projector worker up (`kubectl scale deploy/findings-ledger -n stellaops --replicas=2`) then back down; monitor queue length (metric hook pending).
|
||||
- Resolver storms: temporarily set `ASPNETCORE_THREADPOOL_MINTHREADS` higher or scale API horizontally; in compose, bump the `VULNEXPLORER__MAX_CONCURRENCY` env once schema lands and recreate the service with `docker compose up -d vuln-explorer-api` (`docker compose restart` does not apply changed environment variables).
|
||||
- Export failures: re-run export job after verifying hashes in `deploy/releases/*`; mock path skips signing but still exercises checksum validation via `ops/devops/release/check_release_manifest.py`.
|
||||
|
||||
## Rollback
|
||||
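- Helm: `helm rollback stellaops <REVISION>` to return to an earlier revision (check `helm history stellaops`; `1` is correct only if revision 1 is the previous one).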
|
||||
- Compose: `docker compose --env-file env/dev.env.example -f docker-compose.dev.yaml -f docker-compose.mock.yaml down`.
|
||||
|
||||
## Evidence capture
|
||||
- Keep `/tmp/vuln-mock.yaml`, `/tmp/vuln-compose.yaml`, and the release manifest used.
|
||||
- `kubectl logs deployment/findings-ledger -n stellaops --since=30m > /tmp/ledger-logs.txt`
|
||||
- Record DB snapshot checksums if snapshots were taken; bundle everything into `vuln-evidence-$(date -u +%Y%m%dT%H%M%SZ).tar.gz`.
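Checksums for snapshots and other evidence files can be written in the two-space `digest  name` layout that `sha256sum -c` reads. The sketch below is one way to do that; the helper names and the manifest file name are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large DB dumps do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksums(files: list[Path], manifest: Path) -> None:
    """Write one 'digest  name' line per file, sorted by path."""
    lines = [f"{sha256_file(p)}  {p.name}" for p in sorted(files)]
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
```

Placing the manifest (for example `SHA256SUMS`) next to the files before bundling lets a reviewer run `sha256sum -c SHA256SUMS` after unpacking the evidence archive.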
|
||||
|
||||
## Open TODOs
|
||||
- Replace mock digests with production pins; add concrete env knobs for projector and API when schemas publish.
|
||||
- Hook Prometheus counters for projector lag and resolver storm dashboards once metrics are exported.
|
||||
|
||||
_Last updated: 2025-12-06 (UTC)_
|
||||