
# Offline AI Model Bundles

> Sprint: SPRINT_20251226_019_AI_offline_inference · Task: OFFLINE-23, OFFLINE-26

This guide covers transferring and configuring AI model bundles for air-gapped deployments.

## Overview

Local LLM inference in air-gapped environments requires model weight bundles to be transferred by sneakernet (USB or other portable media) or through internal package servers. The AdvisoryAI module supports deterministic local inference with signed model bundles.

## Model Bundle Format

```text
/offline/models/<model-id>/
  ├── manifest.json           # Bundle metadata + file digests
  ├── signature.dsse          # DSSE envelope with model signature
  ├── weights/
  │   ├── model.gguf          # Quantized weights (llama.cpp format)
  │   └── model.gguf.sha256   # SHA-256 digest
  ├── tokenizer/
  │   ├── tokenizer.json      # Tokenizer config
  │   └── special_tokens.json # Special tokens map
  └── config/
      ├── model_config.json   # Model architecture config
      └── inference.json      # Recommended inference settings
```
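
Before running the full verification, a quick completeness check can catch an interrupted copy early. A minimal sketch, assuming the layout above (the bundle path is an example):

```bash
#!/usr/bin/env bash
# Minimal completeness check against the layout above.
# The bundle path is an example; adjust for your media.
BUNDLE=/offline/models/llama3-8b-q4km
required=(
  manifest.json signature.dsse
  weights/model.gguf weights/model.gguf.sha256
  tokenizer/tokenizer.json tokenizer/special_tokens.json
  config/model_config.json config/inference.json
)
for f in "${required[@]}"; do
  [ -f "$BUNDLE/$f" ] || { echo "missing: $f" >&2; exit 1; }
done
echo "layout OK: $BUNDLE"
```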

### Manifest Schema

```json
{
  "bundle_id": "llama3-8b-q4km-v1",
  "model_family": "llama3",
  "model_size": "8B",
  "quantization": "Q4_K_M",
  "license": "Apache-2.0",
  "created_at": "2025-12-26T00:00:00Z",
  "files": [
    {
      "path": "weights/model.gguf",
      "digest": "sha256:a1b2c3d4e5f6...",
      "size": 4893456789
    },
    {
      "path": "tokenizer/tokenizer.json",
      "digest": "sha256:1a2b3c4d5e6f...",
      "size": 1842
    }
  ],
  "crypto_scheme": "ed25519",
  "signature_id": "ed25519-20251226-a1b2c3d4"
}
```
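
`stella model verify` handles this end to end, but since the manifest is plain JSON the digest entries can also be cross-checked by hand. A sketch with `jq` and `sha256sum`, assuming the `sha256:<hex>` digest form shown above (bundle path is an example):

```bash
#!/usr/bin/env bash
# Recompute each file digest listed in manifest.json and compare.
BUNDLE=/offline/models/llama3-8b-q4km
jq -r '.files[] | "\(.digest) \(.path)"' "$BUNDLE/manifest.json" |
while read -r expected path; do
  actual="sha256:$(sha256sum "$BUNDLE/$path" | cut -d' ' -f1)"
  if [ "$actual" = "$expected" ]; then
    echo "OK   $path"
  else
    echo "FAIL $path (expected $expected, got $actual)" >&2
    exit 1
  fi
done
```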

## Transfer Workflow

### 1. Export on Connected Machine

```bash
# Pull model from registry and create signed bundle
stella model pull llama3-8b-q4km --offline --output /mnt/usb/models/

# Verify bundle before transfer
stella model verify /mnt/usb/models/llama3-8b-q4km/ --verbose
```

### 2. Transfer Verification

Before physically transferring the media, verify the bundle integrity:

```bash
# Generate transfer manifest with all digests
stella model export-manifest /mnt/usb/models/ --output transfer-manifest.json

# Print weights digest for phone/radio verification
sha256sum /mnt/usb/models/llama3-8b-q4km/weights/model.gguf
# Example output: a1b2c3d4... model.gguf

# Cross-check against the bundle manifest
jq '.files[] | select(.path | contains("model.gguf")) | .digest' \
  /mnt/usb/models/llama3-8b-q4km/manifest.json
```

### 3. Import on Air-Gapped Host

```bash
# Import with signature verification
stella model import /mnt/usb/models/llama3-8b-q4km/ \
  --verify-signature \
  --destination /var/lib/stellaops/models/

# Verify loaded model matches expected digest
stella model info llama3-8b-q4km --verify

# List all installed models
stella model list
```

## CLI Model Commands

| Command | Description |
| --- | --- |
| `stella model list` | List installed model bundles |
| `stella model pull --offline` | Download bundle to a local path for transfer |
| `stella model verify <path>` | Verify bundle integrity and signature |
| `stella model import <path>` | Import bundle from external media |
| `stella model info <model-id>` | Display bundle details and verification status |
| `stella model remove <model-id>` | Remove installed model bundle |

### Command Examples

```bash
# List models with details
stella model list --verbose

# Pull specific model variant
stella model pull llama3-8b --quantization Q4_K_M --offline --output ./bundle/

# Verify all installed bundles
stella model verify --all

# Get model info including signature status
stella model info llama3-8b-q4km --show-signature

# Remove model bundle
stella model remove llama3-8b-q4km --force
```

## Configuration

### Local Inference Configuration

Configure in `etc/advisory-ai.yaml`:

```yaml
advisoryAi:
  inference:
    mode: Local  # Local | Remote
    local:
      bundlePath: /var/lib/stellaops/models/llama3-8b-q4km
      requiredDigest: "sha256:a1b2c3d4e5f6..."
      verifySignature: true
      deviceType: CPU  # CPU | GPU | NPU

      # Determinism settings (required for replay)
      contextLength: 4096
      temperature: 0.0
      seed: 42

      # Performance tuning
      threads: 4
      batchSize: 512
      gpuLayers: 0  # 0 = CPU only
```
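
The `requiredDigest` value pins the exact weights file the service will accept. One way to produce it, assuming the install path above, is to hash the GGUF file and prepend the scheme:

```bash
# Produce the requiredDigest value from the installed weights file.
printf 'sha256:%s\n' "$(sha256sum \
  /var/lib/stellaops/models/llama3-8b-q4km/weights/model.gguf | cut -d' ' -f1)"
```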

### Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| `ADVISORYAI_INFERENCE_MODE` | `Local` or `Remote` | `Local` |
| `ADVISORYAI_MODEL_PATH` | Path to model bundle | `/var/lib/stellaops/models` |
| `ADVISORYAI_MODEL_VERIFY` | Verify signature on load | `true` |
| `ADVISORYAI_INFERENCE_THREADS` | CPU threads for inference | `4` |
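
For example, to set these for a shell session before starting the service (values are illustrative; this assumes the usual precedence of environment variables over file configuration):

```bash
# Illustrative overrides; adjust paths and thread count for your host.
export ADVISORYAI_INFERENCE_MODE=Local
export ADVISORYAI_MODEL_PATH=/var/lib/stellaops/models/llama3-8b-q4km
export ADVISORYAI_MODEL_VERIFY=true
export ADVISORYAI_INFERENCE_THREADS=8
```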

## Hardware Requirements

| Model Size | Quantization | RAM Required | GPU VRAM | Inference Speed |
| --- | --- | --- | --- | --- |
| 7-8B | Q4_K_M | 8 GB | N/A (CPU) | ~10 tokens/sec |
| 7-8B | FP16 | 16 GB | 8 GB | ~50 tokens/sec |
| 13B | Q4_K_M | 16 GB | N/A (CPU) | ~5 tokens/sec |
| 13B | FP16 | 32 GB | 16 GB | ~30 tokens/sec |

**Minimal (CPU-only, 8 GB RAM):**

- Model: Llama 3 8B Q4_K_M
- Settings: `threads: 4`, `batchSize: 256`
- Expected: ~10 tokens/sec

**Standard (CPU, 16 GB RAM):**

- Model: Llama 3 8B Q4_K_M or 13B Q4_K_M
- Settings: `threads: 8`, `batchSize: 512`
- Expected: ~15-20 tokens/sec (8B), ~5-8 tokens/sec (13B)

**GPU-Accelerated (8 GB VRAM):**

- Model: Llama 3 8B FP16
- Settings: `gpuLayers: 35`, `batchSize: 512`
- Expected: ~50 tokens/sec
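
A quick preflight before picking a profile is to compare the weights file against available memory. A rough sketch, assuming a Linux host and roughly 50% headroom for the KV cache and runtime (actual overhead varies with `contextLength`):

```bash
#!/usr/bin/env bash
# Rough RAM preflight: weights size plus ~50% headroom vs. MemAvailable.
MODEL=/var/lib/stellaops/models/llama3-8b-q4km/weights/model.gguf
need_kb=$(( $(stat -c%s "$MODEL") / 1024 * 3 / 2 ))
avail_kb=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
if [ "$avail_kb" -lt "$need_kb" ]; then
  echo "insufficient RAM: need ~$((need_kb / 1024)) MiB, have $((avail_kb / 1024)) MiB" >&2
  exit 1
fi
echo "RAM OK: $((avail_kb / 1024)) MiB available"
```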

## Signing and Verification

### Model Bundle Signing

Bundles are signed using the DSSE (Dead Simple Signing Envelope) format:

```json
{
  "payloadType": "application/vnd.stellaops.model-bundle+json",
  "payload": "<base64-encoded-manifest-digest>",
  "signatures": [
    {
      "keyId": "stellaops-model-signer-2025",
      "sig": "<base64-signature>"
    }
  ]
}
```
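
For the default Ed25519 scheme, the envelope can also be checked with standard tools; `stella model verify` remains the supported path. A sketch following the DSSE spec's PAE encoding; the public-key path is an assumption:

```bash
#!/usr/bin/env bash
# Verify the first signature of a DSSE envelope with OpenSSL (Ed25519).
# PAE(type, body) = "DSSEv1 " || len(type) || " " || type || " " || len(body) || " " || body
ENVELOPE=signature.dsse
PUBKEY=model-signer.pub.pem   # hypothetical path to the trusted signer key

ptype=$(jq -r '.payloadType' "$ENVELOPE")
jq -r '.payload' "$ENVELOPE" | base64 -d > payload.bin
jq -r '.signatures[0].sig' "$ENVELOPE" | base64 -d > sig.bin

{ printf 'DSSEv1 %d %s %d ' "${#ptype}" "$ptype" "$(wc -c < payload.bin)"
  cat payload.bin; } > pae.bin

openssl pkeyutl -verify -pubin -inkey "$PUBKEY" -rawin \
  -in pae.bin -sigfile sig.bin
```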

### Regional Crypto Support

| Region | Algorithm | Key Type |
| --- | --- | --- |
| Default | Ed25519 | Ed25519 |
| FIPS (US) | ECDSA-P256 | NIST P-256 |
| GOST (RU) | GOST R 34.10-2012 | GOST R 34.10-2012 |
| SM (CN) | SM2 | SM2 |

### Verification at Load Time

When a model is loaded, the following checks occur:

1. **Signature verification**: the DSSE envelope is verified against known keys
2. **Manifest integrity**: all file digests are recalculated and compared
3. **Bundle completeness**: all required files are present
4. **Configuration validation**: inference settings are within safe bounds

## Deterministic Inference

For reproducible AI outputs (required for attestation replay):

```yaml
advisoryAi:
  inference:
    local:
      # CRITICAL: These settings ensure deterministic output
      temperature: 0.0
      seed: 42
      topK: 1
      topP: 1.0
```

With these settings, the same prompt will produce identical output across runs, enabling:

- AI artifact replay for compliance audits
- Divergence detection between environments
- Attestation verification
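
Because the weights ship in llama.cpp's GGUF format, determinism can also be spot-checked outside the service by running the same prompt twice with `llama-cli` (llama.cpp's CLI) and diffing; the paths and prompt are illustrative:

```bash
# Identical settings must yield byte-identical output on the same host.
MODEL=/var/lib/stellaops/models/llama3-8b-q4km/weights/model.gguf
ARGS=(-m "$MODEL" --temp 0.0 --seed 42 --top-k 1 --top-p 1.0 -n 64 -p "Say hello.")
llama-cli "${ARGS[@]}" > run1.txt 2>/dev/null
llama-cli "${ARGS[@]}" > run2.txt 2>/dev/null
diff run1.txt run2.txt && echo "deterministic"
```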

## Benchmarking

Run local inference benchmarks:

```bash
# Run standard benchmark suite
stella model benchmark llama3-8b-q4km --iterations 10

# Output includes:
# - Latency: mean, median, p95, p99, TTFT
# - Throughput: tokens/sec, requests/min
# - Resource usage: peak memory, CPU utilization
```

## Troubleshooting

| Symptom | Cause | Resolution |
| --- | --- | --- |
| `signature verification failed` | Bundle tampered with or wrong key | Re-download the bundle; verify chain of custody |
| `digest mismatch` | Corruption during transfer | Re-copy from source; verify SHA-256 |
| `model not found` | Wrong bundle path | Check `bundlePath` in the config |
| `out of memory` | Model too large for host | Use a smaller quantization (Q4_K_M) |
| `inference timeout` | CPU too slow | Increase the timeout or enable GPU offload |
| Non-deterministic output | Wrong sampling settings | Set `temperature: 0.0`, `seed: 42` |