# Offline AI Model Bundles

> **Sprint:** SPRINT_20251226_019_AI_offline_inference
> **Task:** OFFLINE-23, OFFLINE-26

This guide covers transferring and configuring AI model bundles for air-gapped deployments.

## Overview

Local LLM inference in air-gapped environments requires model weight bundles to be transferred via sneakernet (USB, portable media, or internal package servers). The AdvisoryAI module supports deterministic local inference with signed model bundles.

## Model Bundle Format

```
/offline/models/<bundle-id>/
├── manifest.json             # Bundle metadata + file digests
├── signature.dsse            # DSSE envelope with model signature
├── weights/
│   ├── model.gguf            # Quantized weights (llama.cpp format)
│   └── model.gguf.sha256     # SHA-256 digest
├── tokenizer/
│   ├── tokenizer.json        # Tokenizer config
│   └── special_tokens.json   # Special tokens map
└── config/
    ├── model_config.json     # Model architecture config
    └── inference.json        # Recommended inference settings
```

## Manifest Schema

```json
{
  "bundle_id": "llama3-8b-q4km-v1",
  "model_family": "llama3",
  "model_size": "8B",
  "quantization": "Q4_K_M",
  "license": "Apache-2.0",
  "created_at": "2025-12-26T00:00:00Z",
  "files": [
    {
      "path": "weights/model.gguf",
      "digest": "sha256:a1b2c3d4e5f6...",
      "size": 4893456789
    },
    {
      "path": "tokenizer/tokenizer.json",
      "digest": "sha256:1a2b3c4d5e6f...",
      "size": 1842
    }
  ],
  "crypto_scheme": "ed25519",
  "signature_id": "ed25519-20251226-a1b2c3d4"
}
```

## Transfer Workflow

### 1. Export on Connected Machine

```bash
# Pull model from registry and create signed bundle
stella model pull llama3-8b-q4km --offline --output /mnt/usb/models/

# Verify bundle before transfer
stella model verify /mnt/usb/models/llama3-8b-q4km/ --verbose
```

### 2. Transfer Verification

Before physically transferring the media, verify the bundle's integrity:

```bash
# Generate transfer manifest with all digests
stella model export-manifest /mnt/usb/models/ --output transfer-manifest.json

# Print weights digest for phone/radio verification
sha256sum /mnt/usb/models/llama3-8b-q4km/weights/model.gguf
# Example output: a1b2c3d4...  model.gguf

# Cross-check against manifest
jq '.files[] | select(.path | contains("model.gguf")) | .digest' manifest.json
```
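The `jq` cross-check above covers a single file. To sweep every entry in the manifest before the media leaves the building, a short shell loop can recompute and compare each digest. This is a sketch, not part of the `stella` CLI; it assumes `jq` and `sha256sum` are available and that digests use the `sha256:<hex>` form shown in the manifest schema:

```bash
#!/usr/bin/env bash
# Hypothetical helper: recompute and compare every digest listed in a
# bundle's manifest.json (assumes the sha256:<hex> format shown above).
set -euo pipefail

BUNDLE_DIR="${1:?usage: verify-digests.sh <bundle-dir>}"

# Emit "digest path" pairs from the manifest, then check each file
jq -r '.files[] | "\(.digest) \(.path)"' "$BUNDLE_DIR/manifest.json" |
while read -r digest path; do
  expected="${digest#sha256:}"
  actual="$(sha256sum "$BUNDLE_DIR/$path" | awk '{print $1}')"
  if [[ "$expected" == "$actual" ]]; then
    echo "OK   $path"
  else
    echo "FAIL $path (expected $expected, got $actual)" >&2
    exit 1
  fi
done
```

The loop exits non-zero on the first mismatch, so it can gate an export pipeline; the DSSE signature itself is still checked separately by `stella model verify`.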
### 3. Import on Air-Gapped Host

```bash
# Import with signature verification
stella model import /mnt/usb/models/llama3-8b-q4km/ \
  --verify-signature \
  --destination /var/lib/stellaops/models/

# Verify loaded model matches expected digest
stella model info llama3-8b-q4km --verify

# List all installed models
stella model list
```

## CLI Model Commands

| Command | Description |
|---------|-------------|
| `stella model list` | List installed model bundles |
| `stella model pull --offline` | Download bundle to local path for transfer |
| `stella model verify <path>` | Verify bundle integrity and signature |
| `stella model import <path>` | Import bundle from external media |
| `stella model info <bundle-id>` | Display bundle details and verification status |
| `stella model remove <bundle-id>` | Remove installed model bundle |

### Command Examples

```bash
# List models with details
stella model list --verbose

# Pull specific model variant
stella model pull llama3-8b --quantization Q4_K_M --offline --output ./bundle/

# Verify all installed bundles
stella model verify --all

# Get model info including signature status
stella model info llama3-8b-q4km --show-signature

# Remove model bundle
stella model remove llama3-8b-q4km --force
```

## Configuration

### Local Inference Configuration

Configure in `etc/advisory-ai.yaml`:

```yaml
advisoryAi:
  inference:
    mode: Local              # Local | Remote
    local:
      bundlePath: /var/lib/stellaops/models/llama3-8b-q4km
      requiredDigest: "sha256:a1b2c3d4e5f6..."
      verifySignature: true
      deviceType: CPU        # CPU | GPU | NPU

      # Determinism settings (required for replay)
      contextLength: 4096
      temperature: 0.0
      seed: 42

      # Performance tuning
      threads: 4
      batchSize: 512
      gpuLayers: 0           # 0 = CPU only
```

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `ADVISORYAI_INFERENCE_MODE` | `Local` or `Remote` | `Local` |
| `ADVISORYAI_MODEL_PATH` | Path to model bundle | `/var/lib/stellaops/models` |
| `ADVISORYAI_MODEL_VERIFY` | Verify signature on load | `true` |
| `ADVISORYAI_INFERENCE_THREADS` | CPU threads for inference | `4` |

## Hardware Requirements

| Model Size | Quantization | RAM Required | GPU VRAM | Inference Speed |
|------------|--------------|--------------|----------|-----------------|
| 7-8B | Q4_K_M | 8 GB | N/A (CPU) | ~10 tokens/sec |
| 7-8B | FP16 | 16 GB | 8 GB | ~50 tokens/sec |
| 13B | Q4_K_M | 16 GB | N/A (CPU) | ~5 tokens/sec |
| 13B | FP16 | 32 GB | 16 GB | ~30 tokens/sec |

### Recommended Configurations

**Minimal (CPU-only, 8 GB RAM):**
- Model: Llama 3 8B Q4_K_M
- Settings: `threads: 4`, `batchSize: 256`
- Expected: ~10 tokens/sec

**Standard (CPU, 16 GB RAM):**
- Model: Llama 3 8B Q4_K_M or 13B Q4_K_M
- Settings: `threads: 8`, `batchSize: 512`
- Expected: ~15-20 tokens/sec (8B), ~5-8 tokens/sec (13B)

**GPU-Accelerated (8 GB VRAM):**
- Model: Llama 3 8B FP16
- Settings: `gpuLayers: 35`, `batchSize: 512`
- Expected: ~50 tokens/sec

## Signing and Verification

### Model Bundle Signing

Bundles are signed using the DSSE (Dead Simple Signing Envelope) format:

```json
{
  "payloadType": "application/vnd.stellaops.model-bundle+json",
  "payload": "<base64-encoded payload>",
  "signatures": [
    {
      "keyId": "stellaops-model-signer-2025",
      "sig": "<base64-encoded signature>"
    }
  ]
}
```
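When auditing an envelope outside the `stella` tooling, note that the DSSE signature is computed over the pre-authentication encoding (PAE) of the payload type and payload, not over the raw payload bytes. The following is a minimal sketch for the default Ed25519 scheme, assuming OpenSSL 1.1.1+, `jq`, and a PEM-encoded signer public key (the `model-signer.pub.pem` filename is hypothetical):

```bash
#!/usr/bin/env bash
# Sketch: verify a DSSE envelope with OpenSSL (default Ed25519 scheme).
# model-signer.pub.pem is a hypothetical key file, not part of the bundle.
set -euo pipefail
export LC_ALL=C   # ${#var} must count bytes, as DSSE length fields do

PAYLOAD_TYPE="$(jq -r '.payloadType' signature.dsse)"
PAYLOAD="$(jq -r '.payload' signature.dsse | base64 -d)"
# Caveat: $() strips trailing newlines; fine for payloads without them.

# PAE(type, body) = "DSSEv1" SP LEN(type) SP type SP LEN(body) SP body
printf 'DSSEv1 %d %s %d %s' \
  "${#PAYLOAD_TYPE}" "$PAYLOAD_TYPE" "${#PAYLOAD}" "$PAYLOAD" > pae.bin

jq -r '.signatures[0].sig' signature.dsse | base64 -d > sig.bin

# Prints "Signature Verified Successfully" on success
openssl pkeyutl -verify -pubin -inkey model-signer.pub.pem \
  -rawin -in pae.bin -sigfile sig.bin
```

The same PAE construction applies under the regional schemes listed below; only the verification primitive changes.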
### Regional Crypto Support

| Region | Algorithm | Key Type |
|--------|-----------|----------|
| Default | Ed25519 | Ed25519 |
| FIPS (US) | ECDSA-P256 | NIST P-256 |
| GOST (RU) | GOST R 34.10-2012 | GOST R 34.10-2012 |
| SM (CN) | SM2 | SM2 |

### Verification at Load Time

When a model is loaded, the following checks occur:

1. **Signature verification**: the DSSE envelope is verified against known keys
2. **Manifest integrity**: all file digests are recalculated and compared
3. **Bundle completeness**: all required files are present
4. **Configuration validation**: inference settings are within safe bounds

## Deterministic Inference

For reproducible AI outputs (required for attestation replay):

```yaml
advisoryAi:
  inference:
    local:
      # CRITICAL: These settings ensure deterministic output
      temperature: 0.0
      seed: 42
      topK: 1
      topP: 1.0
```

With these settings, the same prompt will produce identical output across runs, enabling:

- AI artifact replay for compliance audits
- Divergence detection between environments
- Attestation verification

## Benchmarking

Run local inference benchmarks:

```bash
# Run standard benchmark suite
stella model benchmark llama3-8b-q4km --iterations 10

# Output includes:
# - Latency: mean, median, p95, p99, TTFT
# - Throughput: tokens/sec, requests/min
# - Resource usage: peak memory, CPU utilization
```

## Troubleshooting

| Symptom | Cause | Resolution |
|---------|-------|------------|
| `signature verification failed` | Bundle tampered with or wrong key | Re-download bundle, verify chain of custody |
| `digest mismatch` | Corruption during transfer | Re-copy from source, verify SHA-256 |
| `model not found` | Wrong bundle path | Check `bundlePath` in config |
| `out of memory` | Model too large for available RAM | Use a smaller quantization (Q4_K_M) |
| `inference timeout` | CPU too slow | Increase timeout or enable GPU |
| `non-deterministic output` | Wrong sampling settings | Set `temperature: 0.0`, `seed: 42` |

## Related Documentation

- [Advisory AI Architecture](../architecture.md)
- [Offline Kit Overview](../../../24_OFFLINE_KIT.md)
- [AI Attestations](../../../implplan/SPRINT_20251226_018_AI_attestations.md)
- [Replay Semantics](./replay-semantics.md)