# Sovereign/Offline AI Inference with Signed Model Bundles

## Module

AdvisoryAI

## Status

IMPLEMENTED

## Description

Local LLM inference for air-gapped environments via a pluggable provider architecture supporting llama.cpp server, Ollama, OpenAI, Claude, and Gemini. DSSE-signed model bundle management with regional crypto support (eIDAS/FIPS/GOST/SM), digest verification at load time, deterministic output configuration (temperature=0, fixed seed), inference caching, a benchmarking harness, and offline replay verification.

## Implementation Details

- **Modules**: `src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/`
- **Key Classes**:
  - `SignedModelBundleManager` (`src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/SignedModelBundleManager.cs`) - manages DSSE-signed model bundles with digest verification at load time
  - `ModelBundle` (`src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/ModelBundle.cs`) - model bundle metadata, including hash, signature, and regional crypto info
  - `LlamaCppRuntime` (`src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/LlamaCppRuntime.cs`) - llama.cpp local inference runtime
  - `OnnxRuntime` (`src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/OnnxRuntime.cs`) - ONNX runtime for local model inference
  - `AdvisoryInferenceClient` (`src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/AdvisoryInferenceClient.cs`) - main inference client with provider routing
  - `ProviderBasedAdvisoryInferenceClient` (`src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/ProviderBasedAdvisoryInferenceClient.cs`) - provider-based inference with caching
  - `LlmBenchmark` (`src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/LlmBenchmark.cs`) - benchmarking harness for inference performance
  - `LocalInferenceOptions` (`src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/LocalInferenceOptions.cs`) - configuration for local inference (temperature, seed, context size)
  - `LocalLlmConfig` (`src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/LocalLlmConfig.cs`) - local LLM configuration (model path, quantization, GPU layers)
  - `LocalChatInferenceClient` (`src/AdvisoryAi/StellaOps.AdvisoryAI/Chat/Services/LocalChatInferenceClient.cs`) - chat-specific local inference client
- **Interfaces**: `ILocalLlmRuntime`
- **Source**: SPRINT_20251226_019_AI_offline_inference.md

## E2E Test Plan

- [ ] Load a signed model bundle via `SignedModelBundleManager` and verify that the DSSE signature and digest are validated
- [ ] Verify that `SignedModelBundleManager` rejects a model bundle with a tampered digest
- [ ] Run inference through `LlamaCppRuntime` with temperature=0 and a fixed seed and verify deterministic output
- [ ] Run `LlmBenchmark` and verify that it measures tokens/second and latency metrics
- [ ] Verify that `OnnxRuntime` loads and runs inference with an ONNX model
- [ ] Configure `LocalInferenceOptions` with air-gap settings and verify that no external network calls are made
- [ ] Verify that `ProviderBasedAdvisoryInferenceClient` caches deterministic responses and returns cached results on repeat queries
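For background on the DSSE signing mentioned above: DSSE verifiers do not sign the raw payload but its pre-authentication encoding (PAE), which binds the payload type to the payload body, per the DSSE v1 specification. The sketch below (Python, for illustration only; the actual implementation is C# and the regional signature algorithms are out of scope here) shows how that byte string is constructed:

```python
def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE v1 pre-authentication encoding: the exact byte string that is
    signed and verified. Format per the DSSE spec:
    "DSSEv1" SP LEN(type) SP type SP LEN(body) SP body
    """
    type_bytes = payload_type.encode("utf-8")
    return b"DSSEv1 %d %b %d %b" % (len(type_bytes), type_bytes, len(payload), payload)
```

Because the payload type is length-prefixed inside the signed bytes, an attacker cannot shift bytes between the type and the body without invalidating the signature.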
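The load-time digest check exercised by the first two E2E tests (recompute the model artifact's hash and compare it against the digest recorded in the signed bundle metadata) can be sketched as follows. This is an illustrative Python sketch, not the `SignedModelBundleManager` API; the `sha256:` digest prefix is an assumption:

```python
import hashlib
import hmac

def compute_sha256(path: str) -> str:
    """Stream the file through SHA-256 in chunks so large GGUF/ONNX
    weights never need to fit in memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()

def verify_bundle_digest(model_path: str, expected_digest: str) -> bool:
    """Recompute the artifact digest and compare in constant time.
    A mismatch means the bundle was tampered with and must be
    rejected before the model is loaded into the runtime."""
    actual = compute_sha256(model_path)
    return hmac.compare_digest(actual, expected_digest)
```

Rejecting on mismatch before load is what makes the "tampered digest" test meaningful: the runtime never sees unverified weights.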
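Deterministic output configuration (temperature=0, fixed seed) is what makes the response caching in the last E2E test sound: identical requests against the same model always yield the same completion, so a cache key derived from the model digest, prompt, and options can safely serve repeats. A minimal sketch, assuming a content-addressed key scheme (the real `ProviderBasedAdvisoryInferenceClient` cache layout may differ):

```python
import hashlib
import json

def cache_key(model_digest: str, prompt: str, options: dict) -> str:
    """Canonical JSON (sorted keys, no whitespace) keeps the key stable
    regardless of the order options were supplied in."""
    payload = json.dumps(
        {"model": model_digest, "prompt": prompt, "options": options},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class InferenceCache:
    """In-memory cache: compute on miss, serve the stored result on repeat."""
    def __init__(self):
        self._store = {}

    def get_or_compute(self, key: str, compute):
        if key not in self._store:
            self._store[key] = compute()
        return self._store[key]
```

Including the model digest in the key also means a re-signed or updated bundle automatically invalidates stale entries.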
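The tokens/second and latency metrics the `LlmBenchmark` E2E test checks reduce to timing a run and dividing token count by elapsed time. A hedged sketch of that arithmetic (illustrative only; `run_inference` is a hypothetical callable, not the actual harness API):

```python
import time
from dataclasses import dataclass

@dataclass
class BenchmarkResult:
    tokens_generated: int
    latency_seconds: float

    @property
    def tokens_per_second(self) -> float:
        # Guard against division by zero on pathologically fast runs.
        if self.latency_seconds <= 0:
            return 0.0
        return self.tokens_generated / self.latency_seconds

def benchmark(run_inference, prompt: str) -> BenchmarkResult:
    """Time one inference call with a monotonic clock and report
    throughput; run_inference returns the number of tokens generated."""
    start = time.perf_counter()
    tokens = run_inference(prompt)
    latency = time.perf_counter() - start
    return BenchmarkResult(tokens_generated=tokens, latency_seconds=latency)
```

Using `time.perf_counter()` rather than wall-clock time avoids skew from system clock adjustments during a run.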