Assistant Ops Runbook (DOCS-AIAI-31-009)

Updated: 2025-11-24 · Owners: DevOps Guild · Advisory AI Guild · Sprint 0111

This runbook covers day-2 operations for Advisory AI (web + worker) with emphasis on cache priming, guardrail verification, and outage handling in offline/air-gapped installs.

1) Warmup & cache priming

  • Ensure Offline Kit fixtures are staged:
    • CLI guardrail bundles: out/console/guardrails/cli-vuln-29-001/, out/console/guardrails/cli-vex-30-001/.
    • SBOM context fixtures: copy into data/advisory-ai/fixtures/sbom/ and record hashes in SHA256SUMS.
    • Profiles/prompts manifests: ensure profiles.catalog.json and prompts.manifest hashes match AdvisoryAI:Provenance settings.
  • Start services and prime caches using cache-only calls (see the scripted warmup sketch after this list):
    • stella advise run summary --advisory-key <id> --timeout 0 --json (should return cached/empty context, exit 0).
    • stella advise run remediation --advisory-key <id> --artifact-id <id> --timeout 0 --json (verifies SBOM clamps without executing inference).
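
A minimal warmup sketch, assuming bash, jq, and the stella CLI are on PATH; the input file warmup-advisories.txt and the loop are illustrative, while the probe command and the context.planCacheKey field come from the steps above and the checklist in section 6:

```bash
#!/usr/bin/env bash
# Cache-only warmup: --timeout 0 exercises the cache/context path without running inference.
# warmup-advisories.txt (one advisory key per line) is an illustrative input file.
set -euo pipefail

while read -r advisory_key; do
  stella advise run summary --advisory-key "$advisory_key" --timeout 0 --json \
    | jq -e '.context.planCacheKey' >/dev/null \
    || echo "WARN: no planCacheKey for $advisory_key"
done < warmup-advisories.txt

# Repeat with 'stella advise run remediation --advisory-key <id> --artifact-id <id> --timeout 0 --json'
# for artefacts whose SBOM clamps you want to verify.
```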

2) Guardrail & provenance verification

  • Run guardrail self-test: dotnet test src/AdvisoryAI/__Tests/StellaOps.AdvisoryAI.Tests/StellaOps.AdvisoryAI.Tests.csproj --filter Guardrail (offline-safe).
  • Validate DSSE bundles:
    • slsa-verifier verify-attestation --bundle offline-kit/advisory-ai/provenance/prompts.manifest.dsse --source prompts.manifest
    • slsa-verifier verify-attestation --bundle offline-kit/advisory-ai/provenance/policy-bundle.intoto.jsonl --digest <policy-digest>
  • Confirm AdvisoryAI:Guardrails:BlockedPhrases file matches the hash captured during pack build; diff against prompts.manifest.
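
A hash-check sketch for the blocked-phrases file; both the file path and the manifest key below are assumptions, so adjust them to your pack layout:

```bash
# Compare the live BlockedPhrases file against the hash recorded at pack build time.
# The file path and the .guardrails.blockedPhrases.sha256 key are illustrative.
actual="$(sha256sum etc/advisory-ai/guardrails/blocked-phrases.txt | cut -d' ' -f1)"
expected="$(jq -r '.guardrails.blockedPhrases.sha256' offline-kit/advisory-ai/provenance/prompts.manifest)"
if [ "$actual" = "$expected" ]; then
  echo "BlockedPhrases hash OK"
else
  echo "BlockedPhrases hash MISMATCH: $actual != $expected" >&2
fi
```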

3) Scaling & queue health

  • Defaults: queue capacity 1024, dequeue wait 1s (see docs/policy/assistant-parameters.md). For bursty tenants, scale workers horizontally before increasing queue size to preserve determinism.
  • Metrics to watch: advisory_ai_queue_depth, advisory_ai_latency_seconds, advisory_ai_guardrail_blocks_total.
  • If queue depth stays above 75% of capacity for 5 minutes, add one worker pod or increase Queue:Capacity by 25% (record the change in the ops log).
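
If the metrics land in Prometheus, a one-off check of the 75% condition can look like the sketch below; the Prometheus URL is an assumption, and 1024 is the default Queue:Capacity from above (substitute your override if you changed it):

```bash
# Queue depth as a fraction of the configured capacity (default 1024).
# Wrap the same expression in an alert rule with 'for: 5m' to match the runbook threshold.
curl -sG http://prometheus.internal:9090/api/v1/query \
  --data-urlencode 'query=advisory_ai_queue_depth / 1024 > 0.75'
```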

4) Outage handling

  • SBOM service down: switch to NullSbomContextClient by unsetting ADVISORYAI__SBOM__BASEADDRESS; Advisory AI returns deterministic responses with sbomSummary counts at 0 (see the fallback sketch after this list).
  • Policy Engine unavailable: pin last-known policyVersion; set AdvisoryAI:Guardrails:RequireCitations=true to avoid drift; raise advisory.remediation.policyHold in responses.
  • Remote profile disabled: keep profile=cloud-openai blocked; the CLI returns advisory.inference.remoteDisabled with exit code 12 (see docs/advisory-ai/cli.md).
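
A fallback sketch for the SBOM outage case, assuming a Kubernetes install with a worker deployment named advisory-ai-worker in namespace stellaops (both names are illustrative):

```bash
# Remove the SBOM base address so the worker falls back to NullSbomContextClient,
# then wait for the rollout to settle. Deployment and namespace names are illustrative.
kubectl -n stellaops set env deployment/advisory-ai-worker ADVISORYAI__SBOM__BASEADDRESS-
kubectl -n stellaops rollout status deployment/advisory-ai-worker
```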

5) Air-gap / offline posture

  • All external calls are disabled by default. To re-enable remote inference, set ADVISORYAI__INFERENCE__MODE=Remote and provide an allowlisted Remote.BaseAddress; record the consent in Authority and in the ops log.
  • Mirror the guardrail artefact folders and hashes.sha256 into the Offline Kit; re-run the guardrail self-test after mirroring.
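
A mirroring sketch, assuming the guardrail bundles live under out/console/guardrails/ (per section 1) and that hashes.sha256 sits alongside them in the Offline Kit; the destination path is illustrative:

```bash
# Mirror guardrail artefacts into the Offline Kit and verify the recorded hashes,
# then re-run the guardrail self-test from section 2.
rsync -a out/console/guardrails/ offline-kit/advisory-ai/guardrails/
(cd offline-kit/advisory-ai/guardrails && sha256sum -c hashes.sha256)
```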

6) Checklist before declaring healthy

  • Guardrail self-test suite green.
  • Cache-only CLI probes return 0 with correct context.planCacheKey.
  • DSSE verifications logged for prompts, profiles, policy bundle.
  • Metrics scrape shows queue depth below 75% of capacity and latency within SLO.
  • Ops log updated with any config overrides (queue size, clamps, remote inference toggles).