# Offline AI Model Bundles

> **Sprint:** SPRINT_20251226_019_AI_offline_inference
> **Task:** OFFLINE-23, OFFLINE-26

This guide covers transferring and configuring AI model bundles for air-gapped deployments.

## Overview

Local LLM inference in air-gapped environments requires model weight bundles to be transferred via sneakernet (USB, portable media, or internal package servers). The AdvisoryAI module supports deterministic local inference with signed model bundles.

## Model Bundle Format

```
/offline/models/<bundle-id>/
├── manifest.json             # Bundle metadata + file digests
├── signature.dsse            # DSSE envelope with model signature
├── weights/
│   ├── model.gguf            # Quantized weights (llama.cpp format)
│   └── model.gguf.sha256     # SHA-256 digest
├── tokenizer/
│   ├── tokenizer.json        # Tokenizer config
│   └── special_tokens.json   # Special tokens map
└── config/
    ├── model_config.json     # Model architecture config
    └── inference.json        # Recommended inference settings
```

## Manifest Schema

```json
{
  "bundle_id": "llama3-8b-q4km-v1",
  "model_family": "llama3",
  "model_size": "8B",
  "quantization": "Q4_K_M",
  "license": "Apache-2.0",
  "created_at": "2025-12-26T00:00:00Z",
  "files": [
    {
      "path": "weights/model.gguf",
      "digest": "sha256:a1b2c3d4e5f6...",
      "size": 4893456789
    },
    {
      "path": "tokenizer/tokenizer.json",
      "digest": "sha256:1a2b3c4d5e6f...",
      "size": 1842
    }
  ],
  "crypto_scheme": "ed25519",
  "signature_id": "ed25519-20251226-a1b2c3d4"
}
```

## Transfer Workflow

### 1. Export on Connected Machine

```bash
# Pull model from registry and create signed bundle
stella model pull llama3-8b-q4km --offline --output /mnt/usb/models/

# Verify bundle before transfer
stella model verify /mnt/usb/models/llama3-8b-q4km/ --verbose
```

### 2. Transfer Verification

Before physically transferring the media, verify the bundle's integrity:

```bash
# Generate transfer manifest with all digests
stella model export-manifest /mnt/usb/models/ --output transfer-manifest.json

# Print weights digest for phone/radio verification
sha256sum /mnt/usb/models/llama3-8b-q4km/weights/model.gguf
# Example output: a1b2c3d4...  model.gguf

# Cross-check against manifest
jq '.files[] | select(.path | contains("model.gguf")) | .digest' manifest.json
```
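The `jq` cross-check above covers a single file. To sweep every entry in the manifest before the media leaves the building, a short shell loop can recompute and compare each digest. This is a sketch, not part of the `stella` CLI; it assumes `jq` and `sha256sum` are available and that digests use the `sha256:<hex>` form shown in the manifest schema:

```bash
#!/usr/bin/env bash
# Hypothetical helper: recompute and compare every digest listed in a
# bundle's manifest.json (assumes the sha256:<hex> format shown above).
set -euo pipefail

BUNDLE_DIR="${1:?usage: verify-digests.sh <bundle-dir>}"

# Emit "digest path" pairs from the manifest, then check each file
jq -r '.files[] | "\(.digest) \(.path)"' "$BUNDLE_DIR/manifest.json" |
while read -r digest path; do
  expected="${digest#sha256:}"
  actual="$(sha256sum "$BUNDLE_DIR/$path" | awk '{print $1}')"
  if [[ "$expected" == "$actual" ]]; then
    echo "OK   $path"
  else
    echo "FAIL $path (expected $expected, got $actual)" >&2
    exit 1
  fi
done
```

The loop exits non-zero on the first mismatch, so it can gate an export pipeline; the DSSE signature itself is still checked separately by `stella model verify`.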
### 3. Import on Air-Gapped Host

```bash
# Import with signature verification
stella model import /mnt/usb/models/llama3-8b-q4km/ \
  --verify-signature \
  --destination /var/lib/stellaops/models/

# Verify loaded model matches expected digest
stella model info llama3-8b-q4km --verify

# List all installed models
stella model list
```

## CLI Model Commands

| Command | Description |
|---------|-------------|
| `stella model list` | List installed model bundles |
| `stella model pull --offline` | Download bundle to local path for transfer |
| `stella model verify <path>` | Verify bundle integrity and signature |
| `stella model import <path>` | Import bundle from external media |
| `stella model info <bundle-id>` | Display bundle details and verification status |
| `stella model remove <bundle-id>` | Remove installed model bundle |

### Command Examples

```bash
# List models with details
stella model list --verbose

# Pull specific model variant
stella model pull llama3-8b --quantization Q4_K_M --offline --output ./bundle/

# Verify all installed bundles
stella model verify --all

# Get model info including signature status
stella model info llama3-8b-q4km --show-signature

# Remove model bundle
stella model remove llama3-8b-q4km --force
```

## Configuration

### Local Inference Configuration

Configure in `etc/advisory-ai.yaml`:

```yaml
advisoryAi:
  inference:
    mode: Local              # Local | Remote
    local:
      bundlePath: /var/lib/stellaops/models/llama3-8b-q4km
      requiredDigest: "sha256:a1b2c3d4e5f6..."
      verifySignature: true
      deviceType: CPU        # CPU | GPU | NPU

      # Determinism settings (required for replay)
      contextLength: 4096
      temperature: 0.0
      seed: 42

      # Performance tuning
      threads: 4
      batchSize: 512
      gpuLayers: 0           # 0 = CPU only
```

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `ADVISORYAI_INFERENCE_MODE` | `Local` or `Remote` | `Local` |
| `ADVISORYAI_MODEL_PATH` | Path to model bundle | `/var/lib/stellaops/models` |
| `ADVISORYAI_MODEL_VERIFY` | Verify signature on load | `true` |
| `ADVISORYAI_INFERENCE_THREADS` | CPU threads for inference | `4` |

## Hardware Requirements

| Model Size | Quantization | RAM Required | GPU VRAM | Inference Speed |
|------------|--------------|--------------|----------|-----------------|
| 7-8B | Q4_K_M | 8 GB | N/A (CPU) | ~10 tokens/sec |
| 7-8B | FP16 | 16 GB | 8 GB | ~50 tokens/sec |
| 13B | Q4_K_M | 16 GB | N/A (CPU) | ~5 tokens/sec |
| 13B | FP16 | 32 GB | 16 GB | ~30 tokens/sec |

### Recommended Configurations

**Minimal (CPU-only, 8 GB RAM):**
- Model: Llama 3 8B Q4_K_M
- Settings: `threads: 4`, `batchSize: 256`
- Expected: ~10 tokens/sec

**Standard (CPU, 16 GB RAM):**
- Model: Llama 3 8B Q4_K_M or 13B Q4_K_M
- Settings: `threads: 8`, `batchSize: 512`
- Expected: ~15-20 tokens/sec (8B), ~5-8 tokens/sec (13B)

**GPU-Accelerated (8 GB VRAM):**
- Model: Llama 3 8B FP16
- Settings: `gpuLayers: 35`, `batchSize: 512`
- Expected: ~50 tokens/sec

## Signing and Verification

### Model Bundle Signing

Bundles are signed using the DSSE (Dead Simple Signing Envelope) format:

```json
{
  "payloadType": "application/vnd.stellaops.model-bundle+json",
  "payload": "<base64-encoded payload>",
  "signatures": [
    {
      "keyId": "stellaops-model-signer-2025",
      "sig": "<base64-encoded signature>"
    }
  ]
}
```
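When auditing an envelope outside the `stella` tooling, note that the DSSE signature is computed over the pre-authentication encoding (PAE) of the payload type and payload, not over the raw payload bytes. The following is a minimal sketch for the default Ed25519 scheme, assuming OpenSSL 1.1.1+, `jq`, and a PEM-encoded signer public key (the `model-signer.pub.pem` filename is hypothetical):

```bash
#!/usr/bin/env bash
# Sketch: verify a DSSE envelope with OpenSSL (default Ed25519 scheme).
# model-signer.pub.pem is a hypothetical key file, not part of the bundle.
set -euo pipefail
export LC_ALL=C   # ${#var} must count bytes, as DSSE length fields do

PAYLOAD_TYPE="$(jq -r '.payloadType' signature.dsse)"
PAYLOAD="$(jq -r '.payload' signature.dsse | base64 -d)"
# Caveat: $() strips trailing newlines; fine for payloads without them.

# PAE(type, body) = "DSSEv1" SP LEN(type) SP type SP LEN(body) SP body
printf 'DSSEv1 %d %s %d %s' \
  "${#PAYLOAD_TYPE}" "$PAYLOAD_TYPE" "${#PAYLOAD}" "$PAYLOAD" > pae.bin

jq -r '.signatures[0].sig' signature.dsse | base64 -d > sig.bin

# Prints "Signature Verified Successfully" on success
openssl pkeyutl -verify -pubin -inkey model-signer.pub.pem \
  -rawin -in pae.bin -sigfile sig.bin
```

The same PAE construction applies under the regional schemes listed below; only the verification primitive changes.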
### Regional Crypto Support

| Region | Algorithm | Key Type |
|--------|-----------|----------|
| Default | Ed25519 | Ed25519 |
| FIPS (US) | ECDSA-P256 | NIST P-256 |
| GOST (RU) | GOST R 34.10-2012 | GOST R 34.10-2012 |
| SM (CN) | SM2 | SM2 |

### Verification at Load Time

When a model is loaded, the following checks occur:

1. **Signature verification**: the DSSE envelope is verified against known keys
2. **Manifest integrity**: all file digests are recalculated and compared
3. **Bundle completeness**: all required files are present
4. **Configuration validation**: inference settings are within safe bounds

## Deterministic Inference

For reproducible AI outputs (required for attestation replay):

```yaml
advisoryAi:
  inference:
    local:
      # CRITICAL: These settings ensure deterministic output
      temperature: 0.0
      seed: 42
      topK: 1
      topP: 1.0
```

With these settings, the same prompt will produce identical output across runs, enabling:

- AI artifact replay for compliance audits
- Divergence detection between environments
- Attestation verification

## Benchmarking

Run local inference benchmarks:

```bash
# Run standard benchmark suite
stella model benchmark llama3-8b-q4km --iterations 10

# Output includes:
# - Latency: mean, median, p95, p99, TTFT
# - Throughput: tokens/sec, requests/min
# - Resource usage: peak memory, CPU utilization
```

## Troubleshooting

| Symptom | Cause | Resolution |
|---------|-------|------------|
| `signature verification failed` | Bundle tampered with or wrong key | Re-download bundle, verify chain of custody |
| `digest mismatch` | Corruption during transfer | Re-copy from source, verify SHA-256 |
| `model not found` | Wrong bundle path | Check `bundlePath` in config |
| `out of memory` | Model too large for available RAM | Use a smaller quantization (Q4_K_M) |
| `inference timeout` | CPU too slow | Increase timeout or enable GPU |
| `non-deterministic output` | Wrong sampling settings | Set `temperature: 0.0`, `seed: 42` |

## Related Documentation

- [Advisory AI Architecture](../architecture.md)
- [Offline Kit Overview](../../../24_OFFLINE_KIT.md)
- [AI Attestations](../../../implplan/SPRINT_20251226_018_AI_attestations.md)
- [Replay Semantics](./replay-semantics.md)