# LLM Provider Plugins

> **Sprint:** SPRINT_20251226_019_AI_offline_inference
> **Tasks:** OFFLINE-07, OFFLINE-08, OFFLINE-09

This guide documents the LLM (Large Language Model) provider plugin architecture for AI-powered advisory analysis, explanations, and remediation planning.

## Overview

StellaOps supports multiple LLM backends through a unified plugin architecture:

| Provider | Type | Use Case | Priority |
|----------|------|----------|----------|
| **llama-server** | Local | Airgap/Offline deployment | 10 (highest) |
| **ollama** | Local | Development, edge deployment | 20 |
| **openai** | Cloud | GPT-4o for high-quality output | 100 |
| **claude** | Cloud | Claude Sonnet for complex reasoning | 100 |

## Architecture

### Plugin Interface

```csharp
public interface ILlmProviderPlugin : IAvailabilityPlugin
{
    string ProviderId { get; }              // "openai", "claude", "llama-server", "ollama"
    string DisplayName { get; }              // Human-readable name
    string Description { get; }              // Provider description
    string DefaultConfigFileName { get; }    // "openai.yaml", etc.

    ILlmProvider Create(IServiceProvider services, IConfiguration configuration);
    LlmProviderConfigValidation ValidateConfiguration(IConfiguration configuration);
}
```

### Provider Interface

```csharp
public interface ILlmProvider : IDisposable
{
    string ProviderId { get; }

    Task<bool> IsAvailableAsync(CancellationToken cancellationToken = default);

    Task<LlmCompletionResult> CompleteAsync(
        LlmCompletionRequest request,
        CancellationToken cancellationToken = default);

    IAsyncEnumerable<LlmStreamChunk> CompleteStreamAsync(
        LlmCompletionRequest request,
        CancellationToken cancellationToken = default);
}
```

### Request and Response

```csharp
public record LlmCompletionRequest
{
    string? SystemPrompt { get; init; }
    required string UserPrompt { get; init; }
    string? Model { get; init; }
    double Temperature { get; init; } = 0;      // 0 = deterministic
    int MaxTokens { get; init; } = 4096;
    int? Seed { get; init; }                     // For reproducibility
    IReadOnlyList<string>? StopSequences { get; init; }
    string? RequestId { get; init; }
}

public record LlmCompletionResult
{
    required string Content { get; init; }
    required string ModelId { get; init; }
    required string ProviderId { get; init; }
    int? InputTokens { get; init; }
    int? OutputTokens { get; init; }
    long? TotalTimeMs { get; init; }
    string? FinishReason { get; init; }
    bool Deterministic { get; init; }
}
```

## Configuration

### Directory Structure

```
etc/
  llm-providers/
    openai.yaml          # OpenAI configuration
    claude.yaml          # Claude/Anthropic configuration
    llama-server.yaml    # llama.cpp server configuration
    ollama.yaml          # Ollama configuration
```

### Environment Variables

| Variable | Provider | Description |
|----------|----------|-------------|
| `OPENAI_API_KEY` | OpenAI | API key for OpenAI |
| `ANTHROPIC_API_KEY` | Claude | API key for Anthropic |

### Priority System

Providers are selected by priority (lower = higher preference):

```yaml
# llama-server.yaml - highest priority for offline
priority: 10

# ollama.yaml - second priority for local
priority: 20

# openai.yaml / claude.yaml - cloud fallback
priority: 100
```

## Provider Details

### OpenAI Provider

Supports OpenAI API and Azure OpenAI Service.

```yaml
# etc/llm-providers/openai.yaml
enabled: true
priority: 100

api:
  apiKey: "${OPENAI_API_KEY}"
  baseUrl: "https://api.openai.com/v1"
  organizationId: ""
  apiVersion: ""  # Required for Azure OpenAI

model:
  name: "gpt-4o"
  fallbacks:
    - "gpt-4o-mini"

inference:
  temperature: 0.0
  maxTokens: 4096
  seed: 42
  topP: 1.0
  frequencyPenalty: 0.0
  presencePenalty: 0.0

request:
  timeout: "00:02:00"
  maxRetries: 3
```

**Azure OpenAI Configuration:**

```yaml
api:
  baseUrl: "https://{resource}.openai.azure.com/openai/deployments/{deployment}"
  apiKey: "${AZURE_OPENAI_KEY}"
  apiVersion: "2024-02-15-preview"
```

### Claude Provider

Supports Anthropic Claude API.

```yaml
# etc/llm-providers/claude.yaml
enabled: true
priority: 100

api:
  apiKey: "${ANTHROPIC_API_KEY}"
  baseUrl: "https://api.anthropic.com"
  apiVersion: "2023-06-01"

model:
  name: "claude-sonnet-4-20250514"
  fallbacks:
    - "claude-3-5-sonnet-20241022"

inference:
  temperature: 0.0
  maxTokens: 4096
  topP: 1.0
  topK: 0

thinking:
  enabled: false
  budgetTokens: 10000

request:
  timeout: "00:02:00"
  maxRetries: 3
```

### llama.cpp Server Provider

**Primary provider for airgap/offline deployments.**

```yaml
# etc/llm-providers/llama-server.yaml
enabled: true
priority: 10  # Highest priority

server:
  baseUrl: "http://localhost:8080"
  apiKey: ""
  healthEndpoint: "/health"

model:
  name: "llama3-8b-q4km"
  modelPath: "/models/llama-3-8b-instruct.Q4_K_M.gguf"
  expectedDigest: "sha256:..."  # For airgap verification

inference:
  temperature: 0.0
  maxTokens: 4096
  seed: 42
  topP: 1.0
  topK: 40
  repeatPenalty: 1.1
  contextLength: 4096

bundle:
  bundlePath: "/bundles/llama3-8b.stellaops-model"
  verifySignature: true
  cryptoScheme: "ed25519"

request:
  timeout: "00:05:00"
  maxRetries: 2
```

**Starting llama.cpp server:**

```bash
# Basic server
llama-server -m model.gguf --host 0.0.0.0 --port 8080

# With GPU acceleration
llama-server -m model.gguf --host 0.0.0.0 --port 8080 -ngl 35

# With API key authentication
llama-server -m model.gguf --host 0.0.0.0 --port 8080 --api-key "your-key"
```

### Ollama Provider

For local development and edge deployments.

```yaml
# etc/llm-providers/ollama.yaml
enabled: true
priority: 20

server:
  baseUrl: "http://localhost:11434"
  healthEndpoint: "/api/tags"

model:
  name: "llama3:8b"
  fallbacks:
    - "mistral:7b"
  keepAlive: "5m"

inference:
  temperature: 0.0
  maxTokens: 4096
  seed: 42
  topP: 1.0
  topK: 40
  repeatPenalty: 1.1
  numCtx: 4096

gpu:
  numGpu: 0  # 0 = CPU only, -1 = all layers on GPU

management:
  autoPull: false  # Disable for airgap
  verifyPull: true

request:
  timeout: "00:05:00"
  maxRetries: 2
```

## Usage

### Dependency Injection

```csharp
// Program.cs or Startup.cs
services.AddLlmProviderPlugins("etc/llm-providers");

// Or with explicit configuration
services.AddLlmProviderPlugins(catalog =>
{
    catalog.LoadConfigurationsFromDirectory("etc/llm-providers");
    // Optionally register custom plugins
    catalog.RegisterPlugin(new CustomLlmProviderPlugin());
});
```

### Using the Provider Factory

```csharp
public class AdvisoryExplanationService
{
    private readonly ILlmProviderFactory _providerFactory;

    public async Task<string> GenerateExplanationAsync(
        string vulnerabilityId,
        CancellationToken cancellationToken)
    {
        // Get the default (highest priority available) provider
        var provider = _providerFactory.GetDefaultProvider();

        var request = new LlmCompletionRequest
        {
            SystemPrompt = "You are a security analyst explaining vulnerabilities.",
            UserPrompt = $"Explain {vulnerabilityId} in plain language.",
            Temperature = 0,  // Deterministic
            Seed = 42,        // Reproducible
            MaxTokens = 2048
        };

        var result = await provider.CompleteAsync(request, cancellationToken);
        return result.Content;
    }
}
```

### Provider Selection

```csharp
// Get specific provider
var openaiProvider = _providerFactory.GetProvider("openai");
var claudeProvider = _providerFactory.GetProvider("claude");
var llamaProvider = _providerFactory.GetProvider("llama-server");

// List available providers
var available = _providerFactory.AvailableProviders;
// Returns: ["llama-server", "ollama", "openai", "claude"]
```

### Automatic Fallback

```csharp
// Create a fallback provider that tries providers in order
var fallbackProvider = new FallbackLlmProvider(
    _providerFactory,
    providerOrder: ["llama-server", "ollama", "openai", "claude"],
    _logger);

// Uses first available provider, falls back on failure
var result = await fallbackProvider.CompleteAsync(request, cancellationToken);
```

### Streaming Responses

```csharp
var provider = _providerFactory.GetDefaultProvider();

await foreach (var chunk in provider.CompleteStreamAsync(request, cancellationToken))
{
    Console.Write(chunk.Content);

    if (chunk.IsFinal)
    {
        Console.WriteLine($"\n[Finished: {chunk.FinishReason}]");
    }
}
```

## Determinism Requirements

For reproducible AI outputs (required for attestations):

| Setting | Value | Purpose |
|---------|-------|---------|
| `temperature` | `0.0` | No randomness in token selection |
| `seed` | `42` | Fixed random seed |
| `topK` | `1` | Single token selection (optional) |

```yaml
inference:
  temperature: 0.0
  seed: 42
  topK: 1  # Most deterministic
```

**Verification:**

```csharp
var result = await provider.CompleteAsync(request, cancellationToken);

if (!result.Deterministic)
{
    _logger.LogWarning("Output may not be reproducible");
}
```

## Offline/Airgap Deployment

### Recommended Configuration

```
etc/llm-providers/
  llama-server.yaml    # Primary - enabled, priority: 10
  ollama.yaml          # Backup - enabled, priority: 20
  openai.yaml          # Disabled or missing
  claude.yaml          # Disabled or missing
```

### Model Bundle Verification

For airgap environments, use signed model bundles:

```yaml
# llama-server.yaml
bundle:
  bundlePath: "/bundles/llama3-8b.stellaops-model"
  verifySignature: true
  cryptoScheme: "ed25519"

model:
  expectedDigest: "sha256:abc123..."
```

**Creating a model bundle:**

```bash
# Create signed bundle
stella model bundle \
  --model /models/llama-3-8b-instruct.Q4_K_M.gguf \
  --sign \
  --output /bundles/llama3-8b.stellaops-model

# Verify bundle
stella model verify /bundles/llama3-8b.stellaops-model
```

## Custom Plugins

To add support for a new LLM provider:

```csharp
public sealed class CustomLlmProviderPlugin : ILlmProviderPlugin
{
    public string Name => "Custom LLM Provider";
    public string ProviderId => "custom";
    public string DisplayName => "Custom LLM";
    public string Description => "Custom LLM backend";
    public string DefaultConfigFileName => "custom.yaml";

    public bool IsAvailable(IServiceProvider services) => true;

    public ILlmProvider Create(IServiceProvider services, IConfiguration configuration)
    {
        var config = CustomConfig.FromConfiguration(configuration);
        var httpClientFactory = services.GetRequiredService<IHttpClientFactory>();
        var logger = services.GetRequiredService<ILogger<CustomLlmProvider>>();
        return new CustomLlmProvider(httpClientFactory.CreateClient(), config, logger);
    }

    public LlmProviderConfigValidation ValidateConfiguration(IConfiguration configuration)
    {
        // Validate configuration
        return LlmProviderConfigValidation.Success();
    }
}
```

Register the custom plugin:

```csharp
services.AddLlmProviderPlugins(catalog =>
{
    catalog.RegisterPlugin(new CustomLlmProviderPlugin());
    catalog.LoadConfigurationsFromDirectory("etc/llm-providers");
});
```

## Telemetry

LLM operations emit structured logs:

```json
{
  "timestamp": "2025-12-26T10:30:00Z",
  "operation": "llm_completion",
  "providerId": "llama-server",
  "model": "llama3-8b-q4km",
  "inputTokens": 1234,
  "outputTokens": 567,
  "totalTimeMs": 2345,
  "deterministic": true,
  "finishReason": "stop"
}
```

## Performance Comparison

| Provider | Latency (TTFT) | Throughput | Cost | Offline |
|----------|---------------|------------|------|---------|
| **llama-server** | 50-200ms | 20-50 tok/s | Free | Yes |
| **ollama** | 100-500ms | 15-40 tok/s | Free | Yes |
| **openai (gpt-4o)** | 200-500ms | 50-100 tok/s | $$$ | No |
| **claude (sonnet)** | 300-600ms | 40-80 tok/s | $$$ | No |

*Note: Local performance depends heavily on hardware (GPU, RAM, CPU).*

## Troubleshooting

### Provider Not Available

```
InvalidOperationException: No LLM providers are available.
```

**Solutions:**
1. Check configuration files exist in `etc/llm-providers/`
2. Verify API keys are set (environment variables or config)
3. For local providers, ensure server is running:
   ```bash
   # llama-server
   curl http://localhost:8080/health

   # ollama
   curl http://localhost:11434/api/tags
   ```

### Non-Deterministic Output

```
Warning: Output may not be reproducible
```

**Solutions:**
1. Set `temperature: 0.0` in configuration
2. Set `seed: 42` (or any fixed value)
3. Use the same model version across environments

### Timeout Errors

```
TaskCanceledException: The request was canceled due to timeout.
```

**Solutions:**
1. Increase `request.timeout` in configuration
2. For local inference, ensure sufficient hardware resources
3. Reduce `maxTokens` if appropriate

## Related Documentation

- [AI Attestations](./ai-attestations.md)
- [Offline Model Bundles](./offline-model-bundles.md)
- [Advisory AI Architecture](../architecture.md)
- [Configuration Reference](../../../../etc/llm-providers/)