# Assistant Ops Runbook (DOCS-AIAI-31-009)
Updated: 2025-11-24 · Owners: DevOps Guild · Advisory AI Guild · Sprint 0111
This runbook covers day-2 operations for Advisory AI (web + worker) with emphasis on cache priming, guardrail verification, and outage handling in offline/air-gapped installs.
## 1) Warmup & cache priming
- Ensure Offline Kit fixtures are staged:
  - CLI guardrail bundles: `out/console/guardrails/cli-vuln-29-001/`, `out/console/guardrails/cli-vex-30-001/`.
  - SBOM context fixtures: copy into `data/advisory-ai/fixtures/sbom/` and record hashes in `SHA256SUMS`.
  - Profiles/prompts manifests: ensure `profiles.catalog.json` and `prompts.manifest` hashes match the `AdvisoryAI:Provenance` settings.
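The fixture-staging step can be spot-checked with a short script. This is a sketch, not part of the Offline Kit; the path mirrors the runbook layout, and the guard keeps it safe to run before staging is complete:

```shell
#!/usr/bin/env sh
# Verify staged SBOM fixture hashes before warmup.
# FIXTURES follows the runbook layout; adjust for your install root.
FIXTURES="data/advisory-ai/fixtures/sbom"
if [ -f "$FIXTURES/SHA256SUMS" ]; then
  # sha256sum -c exits non-zero on any mismatch, which is what we want here.
  (cd "$FIXTURES" && sha256sum -c SHA256SUMS)
else
  echo "SHA256SUMS not staged yet; skipping fixture check"
fi
```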
- Start services and prime caches using cache-only calls:
  - `stella advise run summary --advisory-key <id> --timeout 0 --json` (should return cached/empty context, exit 0).
  - `stella advise run remediation --advisory-key <id> --artifact-id <id> --timeout 0 --json` (verifies SBOM clamps without executing inference).
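For batch priming, the summary probe can be looped over a set of advisory keys. A minimal sketch, assuming the `stella` CLI is on `PATH`; the keys below are placeholders:

```shell
#!/usr/bin/env sh
# Prime caches for a batch of advisories; keys are placeholders.
# --timeout 0 keeps the calls cache-only, so a non-zero exit flags a real problem.
probe_keys="CVE-2025-0001 CVE-2025-0002"
if command -v stella >/dev/null 2>&1; then
  for key in $probe_keys; do
    stella advise run summary --advisory-key "$key" --timeout 0 --json >/dev/null \
      || echo "warmup probe failed for $key"
  done
else
  echo "stella CLI not on PATH; skipping warmup probes"
fi
```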
## 2) Guardrail & provenance verification
- Run the guardrail self-test (offline-safe):
  `dotnet test src/AdvisoryAI/__Tests/StellaOps.AdvisoryAI.Tests/StellaOps.AdvisoryAI.Tests.csproj --filter Guardrail`
- Validate DSSE bundles:
  - `slsa-verifier verify-attestation --bundle offline-kit/advisory-ai/provenance/prompts.manifest.dsse --source prompts.manifest`
  - `slsa-verifier verify-attestation --bundle offline-kit/advisory-ai/provenance/policy-bundle.intoto.jsonl --digest <policy-digest>`
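The DSSE check can be wrapped so a missing verifier binary fails soft during air-gapped smoke runs. A sketch reusing the first command above verbatim:

```shell
#!/usr/bin/env sh
# Run the prompts.manifest DSSE check; skip cleanly when slsa-verifier is absent.
PROV="offline-kit/advisory-ai/provenance"
if command -v slsa-verifier >/dev/null 2>&1; then
  slsa-verifier verify-attestation --bundle "$PROV/prompts.manifest.dsse" \
    --source prompts.manifest || echo "prompts.manifest DSSE check FAILED"
else
  echo "slsa-verifier not on PATH; skipping DSSE checks"
fi
```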
- Confirm the `AdvisoryAI:Guardrails:BlockedPhrases` file matches the hash captured during pack build; diff against `prompts.manifest`.
## 3) Scaling & queue health
- Defaults: queue capacity 1024, dequeue wait 1s (see `docs/policy/assistant-parameters.md`). For bursty tenants, scale workers horizontally before increasing queue size to preserve determinism.
- Metrics to watch: `advisory_ai_queue_depth`, `advisory_ai_latency_seconds`, `advisory_ai_guardrail_blocks_total`.
- If queue depth stays above 75% of capacity for 5 minutes, add one worker pod or increase `Queue:Capacity` by 25% (record the change in the ops log).
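The 25% bump is simple integer arithmetic; starting from the default capacity of 1024:

```shell
# Compute a 25% Queue:Capacity bump from the current value (default 1024).
current=1024
new=$(( current + current / 4 ))
echo "Queue:Capacity ${current} -> ${new}"   # prints: Queue:Capacity 1024 -> 1280
```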
## 4) Outage handling
- SBOM service down: switch to `NullSbomContextClient` by unsetting `ADVISORYAI__SBOM__BASEADDRESS`; Advisory AI then returns deterministic responses with `sbomSummary` counts at 0.
- Policy Engine unavailable: pin the last-known `policyVersion`; set `AdvisoryAI:Guardrails:RequireCitations=true` to avoid drift; raise `advisory.remediation.policyHold` in responses.
- Remote profile disabled: keep `profile=cloud-openai` blocked; return `advisory.inference.remoteDisabled` with exit code 12 in the CLI (see `docs/advisory-ai/cli.md`).
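The SBOM failover is a one-line config change. A sketch; whether the pods pick the change up live or need a restart depends on your deployment, so confirm against your install's reload behaviour:

```shell
# Fail over to NullSbomContextClient during an SBOM service outage.
unset ADVISORYAI__SBOM__BASEADDRESS
echo "SBOM base address cleared; Advisory AI will use NullSbomContextClient"
```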
## 5) Air-gap / offline posture
- All external calls are disabled by default. To re-enable remote inference, set `ADVISORYAI__INFERENCE__MODE=Remote` and provide an allowlisted `Remote.BaseAddress`; record the consent in Authority and in the ops log.
- Mirror the guardrail artefact folders and `hashes.sha256` into the Offline Kit; re-run the guardrail self-test after mirroring.
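The re-enable step can be scripted after consent is recorded. A sketch with two assumptions: the double-underscore env-var name for `Remote.BaseAddress` follows .NET configuration conventions, and the base address is a placeholder for your allowlisted endpoint:

```shell
# Re-enable remote inference after consent is recorded in Authority.
# Env-var mapping for Remote.BaseAddress is assumed per .NET config conventions;
# the URL below is a placeholder.
export ADVISORYAI__INFERENCE__MODE=Remote
export ADVISORYAI__INFERENCE__REMOTE__BASEADDRESS="https://inference.example.internal"
```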
## 6) Checklist before declaring healthy
- Guardrail self-test suite green.
- Cache-only CLI probes return 0 with the correct `context.planCacheKey`.
- DSSE verifications logged for prompts, profiles, and the policy bundle.
- Metrics scrape shows queue depth < 75% and latency within SLO.
- Ops log updated with any config overrides (queue size, clamps, remote inference toggles).
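An ops-log entry for a config override can be appended mechanically so the timestamp format stays consistent. A sketch; the log path is a placeholder for wherever your site keeps the ops log:

```shell
# Append a config-override entry to the ops log; OPS_LOG path is a placeholder.
OPS_LOG="${OPS_LOG:-/tmp/advisory-ai-ops.log}"
printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  "Queue:Capacity 1024 -> 1280 (approved by on-call)" >> "$OPS_LOG"
```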