
# Offline AI Model Bundles

> Sprint: SPRINT_20251226_019_AI_offline_inference · Task: OFFLINE-23, OFFLINE-26

This guide covers transferring and configuring AI model bundles for air-gapped deployments.

## Overview

Local LLM inference in air-gapped environments requires model weight bundles to be transferred by sneakernet (USB or other portable media) or through internal package servers. The AdvisoryAI module supports deterministic local inference with signed model bundles.

## Model Bundle Format

```text
/offline/models/<model-id>/
  ├── manifest.json           # Bundle metadata + file digests
  ├── signature.dsse          # DSSE envelope with model signature
  ├── weights/
  │   ├── model.gguf          # Quantized weights (llama.cpp format)
  │   └── model.gguf.sha256   # SHA-256 digest
  ├── tokenizer/
  │   ├── tokenizer.json      # Tokenizer config
  │   └── special_tokens.json # Special tokens map
  └── config/
      ├── model_config.json   # Model architecture config
      └── inference.json      # Recommended inference settings
```
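
Before running the full verification, a quick completeness check can catch an interrupted copy early. A minimal sketch, assuming the layout above (the bundle path is an example):

```bash
#!/usr/bin/env bash
# Minimal completeness check against the layout above.
# The bundle path is an example; adjust for your media.
BUNDLE=/offline/models/llama3-8b-q4km
required=(
  manifest.json signature.dsse
  weights/model.gguf weights/model.gguf.sha256
  tokenizer/tokenizer.json tokenizer/special_tokens.json
  config/model_config.json config/inference.json
)
for f in "${required[@]}"; do
  [ -f "$BUNDLE/$f" ] || { echo "missing: $f" >&2; exit 1; }
done
echo "layout OK: $BUNDLE"
```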

### Manifest Schema

```json
{
  "bundle_id": "llama3-8b-q4km-v1",
  "model_family": "llama3",
  "model_size": "8B",
  "quantization": "Q4_K_M",
  "license": "Apache-2.0",
  "created_at": "2025-12-26T00:00:00Z",
  "files": [
    {
      "path": "weights/model.gguf",
      "digest": "sha256:a1b2c3d4e5f6...",
      "size": 4893456789
    },
    {
      "path": "tokenizer/tokenizer.json",
      "digest": "sha256:1a2b3c4d5e6f...",
      "size": 1842
    }
  ],
  "crypto_scheme": "ed25519",
  "signature_id": "ed25519-20251226-a1b2c3d4"
}
```
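
`stella model verify` handles this end to end, but since the manifest is plain JSON the digest entries can also be cross-checked by hand. A sketch with `jq` and `sha256sum`, assuming the `sha256:<hex>` digest form shown above (bundle path is an example):

```bash
#!/usr/bin/env bash
# Recompute each file digest listed in manifest.json and compare.
BUNDLE=/offline/models/llama3-8b-q4km
jq -r '.files[] | "\(.digest) \(.path)"' "$BUNDLE/manifest.json" |
while read -r expected path; do
  actual="sha256:$(sha256sum "$BUNDLE/$path" | cut -d' ' -f1)"
  if [ "$actual" = "$expected" ]; then
    echo "OK   $path"
  else
    echo "FAIL $path (expected $expected, got $actual)" >&2
    exit 1
  fi
done
```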

## Transfer Workflow

### 1. Export on Connected Machine

```bash
# Pull model from registry and create signed bundle
stella model pull llama3-8b-q4km --offline --output /mnt/usb/models/

# Verify bundle before transfer
stella model verify /mnt/usb/models/llama3-8b-q4km/ --verbose
```

### 2. Transfer Verification

Before physically transferring the media, verify the bundle integrity:

```bash
# Generate transfer manifest with all digests
stella model export-manifest /mnt/usb/models/ --output transfer-manifest.json

# Print weights digest for phone/radio verification
sha256sum /mnt/usb/models/llama3-8b-q4km/weights/model.gguf
# Example output: a1b2c3d4... model.gguf

# Cross-check against the bundle manifest
jq '.files[] | select(.path | contains("model.gguf")) | .digest' \
  /mnt/usb/models/llama3-8b-q4km/manifest.json
```

### 3. Import on Air-Gapped Host

```bash
# Import with signature verification
stella model import /mnt/usb/models/llama3-8b-q4km/ \
  --verify-signature \
  --destination /var/lib/stellaops/models/

# Verify loaded model matches expected digest
stella model info llama3-8b-q4km --verify

# List all installed models
stella model list
```

## CLI Model Commands

| Command | Description |
| --- | --- |
| `stella model list` | List installed model bundles |
| `stella model pull --offline` | Download bundle to a local path for transfer |
| `stella model verify <path>` | Verify bundle integrity and signature |
| `stella model import <path>` | Import bundle from external media |
| `stella model info <model-id>` | Display bundle details and verification status |
| `stella model remove <model-id>` | Remove installed model bundle |

### Command Examples

```bash
# List models with details
stella model list --verbose

# Pull specific model variant
stella model pull llama3-8b --quantization Q4_K_M --offline --output ./bundle/

# Verify all installed bundles
stella model verify --all

# Get model info including signature status
stella model info llama3-8b-q4km --show-signature

# Remove model bundle
stella model remove llama3-8b-q4km --force
```

## Configuration

### Local Inference Configuration

Configure in `etc/advisory-ai.yaml`:

```yaml
advisoryAi:
  inference:
    mode: Local  # Local | Remote
    local:
      bundlePath: /var/lib/stellaops/models/llama3-8b-q4km
      requiredDigest: "sha256:a1b2c3d4e5f6..."
      verifySignature: true
      deviceType: CPU  # CPU | GPU | NPU

      # Determinism settings (required for replay)
      contextLength: 4096
      temperature: 0.0
      seed: 42

      # Performance tuning
      threads: 4
      batchSize: 512
      gpuLayers: 0  # 0 = CPU only
```
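
The `requiredDigest` value pins the exact weights file the service will accept. One way to produce it, assuming the install path above, is to hash the GGUF file and prepend the scheme:

```bash
# Produce the requiredDigest value from the installed weights file.
printf 'sha256:%s\n' "$(sha256sum \
  /var/lib/stellaops/models/llama3-8b-q4km/weights/model.gguf | cut -d' ' -f1)"
```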

### Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| `ADVISORYAI_INFERENCE_MODE` | `Local` or `Remote` | `Local` |
| `ADVISORYAI_MODEL_PATH` | Path to model bundle | `/var/lib/stellaops/models` |
| `ADVISORYAI_MODEL_VERIFY` | Verify signature on load | `true` |
| `ADVISORYAI_INFERENCE_THREADS` | CPU threads for inference | `4` |
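
For example, to set these for a shell session before starting the service (values are illustrative; this assumes the usual precedence of environment variables over file configuration):

```bash
# Illustrative overrides; adjust paths and thread count for your host.
export ADVISORYAI_INFERENCE_MODE=Local
export ADVISORYAI_MODEL_PATH=/var/lib/stellaops/models/llama3-8b-q4km
export ADVISORYAI_MODEL_VERIFY=true
export ADVISORYAI_INFERENCE_THREADS=8
```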

## Hardware Requirements

| Model Size | Quantization | RAM Required | GPU VRAM | Inference Speed |
| --- | --- | --- | --- | --- |
| 7-8B | Q4_K_M | 8 GB | N/A (CPU) | ~10 tokens/sec |
| 7-8B | FP16 | 16 GB | 8 GB | ~50 tokens/sec |
| 13B | Q4_K_M | 16 GB | N/A (CPU) | ~5 tokens/sec |
| 13B | FP16 | 32 GB | 16 GB | ~30 tokens/sec |

**Minimal (CPU-only, 8 GB RAM):**

- Model: Llama 3 8B Q4_K_M
- Settings: `threads: 4`, `batchSize: 256`
- Expected: ~10 tokens/sec

**Standard (CPU, 16 GB RAM):**

- Model: Llama 3 8B Q4_K_M or 13B Q4_K_M
- Settings: `threads: 8`, `batchSize: 512`
- Expected: ~15-20 tokens/sec (8B), ~5-8 tokens/sec (13B)

**GPU-Accelerated (8 GB VRAM):**

- Model: Llama 3 8B FP16
- Settings: `gpuLayers: 35`, `batchSize: 512`
- Expected: ~50 tokens/sec
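
A quick preflight before picking a profile is to compare the weights file against available memory. A rough sketch, assuming a Linux host and roughly 50% headroom for the KV cache and runtime (actual overhead varies with `contextLength`):

```bash
#!/usr/bin/env bash
# Rough RAM preflight: weights size plus ~50% headroom vs. MemAvailable.
MODEL=/var/lib/stellaops/models/llama3-8b-q4km/weights/model.gguf
need_kb=$(( $(stat -c%s "$MODEL") / 1024 * 3 / 2 ))
avail_kb=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
if [ "$avail_kb" -lt "$need_kb" ]; then
  echo "insufficient RAM: need ~$((need_kb / 1024)) MiB, have $((avail_kb / 1024)) MiB" >&2
  exit 1
fi
echo "RAM OK: $((avail_kb / 1024)) MiB available"
```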

## Signing and Verification

### Model Bundle Signing

Bundles are signed using the DSSE (Dead Simple Signing Envelope) format:

```json
{
  "payloadType": "application/vnd.stellaops.model-bundle+json",
  "payload": "<base64-encoded-manifest-digest>",
  "signatures": [
    {
      "keyId": "stellaops-model-signer-2025",
      "sig": "<base64-signature>"
    }
  ]
}
```
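
For the default Ed25519 scheme, the envelope can also be checked with standard tools; `stella model verify` remains the supported path. A sketch following the DSSE spec's PAE encoding; the public-key path is an assumption:

```bash
#!/usr/bin/env bash
# Verify the first signature of a DSSE envelope with OpenSSL (Ed25519).
# PAE(type, body) = "DSSEv1 " || len(type) || " " || type || " " || len(body) || " " || body
ENVELOPE=signature.dsse
PUBKEY=model-signer.pub.pem   # hypothetical path to the trusted signer key

ptype=$(jq -r '.payloadType' "$ENVELOPE")
jq -r '.payload' "$ENVELOPE" | base64 -d > payload.bin
jq -r '.signatures[0].sig' "$ENVELOPE" | base64 -d > sig.bin

{ printf 'DSSEv1 %d %s %d ' "${#ptype}" "$ptype" "$(wc -c < payload.bin)"
  cat payload.bin; } > pae.bin

openssl pkeyutl -verify -pubin -inkey "$PUBKEY" -rawin \
  -in pae.bin -sigfile sig.bin
```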

### Regional Crypto Support

| Region | Algorithm | Key Type |
| --- | --- | --- |
| Default | Ed25519 | Ed25519 |
| FIPS (US) | ECDSA-P256 | NIST P-256 |
| GOST (RU) | GOST R 34.10-2012 | GOST R 34.10-2012 |
| SM (CN) | SM2 | SM2 |

### Verification at Load Time

When a model is loaded, the following checks occur:

1. **Signature verification**: the DSSE envelope is verified against known keys
2. **Manifest integrity**: all file digests are recalculated and compared
3. **Bundle completeness**: all required files are present
4. **Configuration validation**: inference settings are within safe bounds

## Deterministic Inference

For reproducible AI outputs (required for attestation replay):

```yaml
advisoryAi:
  inference:
    local:
      # CRITICAL: These settings ensure deterministic output
      temperature: 0.0
      seed: 42
      topK: 1
      topP: 1.0
```

With these settings, the same prompt will produce identical output across runs, enabling:

- AI artifact replay for compliance audits
- Divergence detection between environments
- Attestation verification
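
Because the weights ship in llama.cpp's GGUF format, determinism can also be spot-checked outside the service by running the same prompt twice with `llama-cli` (llama.cpp's CLI) and diffing; the paths and prompt are illustrative:

```bash
# Identical settings must yield byte-identical output on the same host.
MODEL=/var/lib/stellaops/models/llama3-8b-q4km/weights/model.gguf
ARGS=(-m "$MODEL" --temp 0.0 --seed 42 --top-k 1 --top-p 1.0 -n 64 -p "Say hello.")
llama-cli "${ARGS[@]}" > run1.txt 2>/dev/null
llama-cli "${ARGS[@]}" > run2.txt 2>/dev/null
diff run1.txt run2.txt && echo "deterministic"
```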

## Benchmarking

Run local inference benchmarks:

```bash
# Run standard benchmark suite
stella model benchmark llama3-8b-q4km --iterations 10

# Output includes:
# - Latency: mean, median, p95, p99, TTFT
# - Throughput: tokens/sec, requests/min
# - Resource usage: peak memory, CPU utilization
```

## Troubleshooting

| Symptom | Cause | Resolution |
| --- | --- | --- |
| `signature verification failed` | Bundle tampered with or wrong key | Re-download the bundle; verify chain of custody |
| `digest mismatch` | Corruption during transfer | Re-copy from source; verify SHA-256 |
| `model not found` | Wrong bundle path | Check `bundlePath` in the config |
| `out of memory` | Model too large for host | Use a smaller quantization (Q4_K_M) |
| `inference timeout` | CPU too slow | Increase the timeout or enable GPU offload |
| Non-deterministic output | Wrong sampling settings | Set `temperature: 0.0`, `seed: 42` |