# LLM Inference Response Caching

## Module

AdvisoryAI

## Status

IMPLEMENTED

## Description

In-memory LLM inference cache that deduplicates identical prompt+model combinations. Reduces API costs and latency by caching deterministic responses keyed by content hash.

## Implementation Details

- **Modules**: `src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/LlmProviders/`
- **Key Classes**:
  - `LlmInferenceCache` (`src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/LlmProviders/LlmInferenceCache.cs`) - in-memory cache keyed by a content hash of prompt+model+parameters
  - `LlmProviderFactory` (`src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/LlmProviders/LlmProviderFactory.cs`) - factory that wraps providers with the caching layer
  - `LlmProviderOptions` (`src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/LlmProviders/LlmProviderOptions.cs`) - provider options, including cache TTL and size limits
  - `ProviderBasedAdvisoryInferenceClient` (`src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/ProviderBasedAdvisoryInferenceClient.cs`) - inference client that uses the caching layer
- **Interfaces**: `ILlmProvider`
- **Source**: SPRINT_20251226_019_AI_offline_inference.md

## E2E Test Plan

- [ ] Send identical prompts twice and verify `LlmInferenceCache` returns the cached response on the second call without hitting the LLM
- [ ] Verify cache keys include the model ID and parameters: the same prompt with a different temperature results in a cache miss
- [ ] Verify cache TTL: cached responses expire after the configured duration
- [ ] Verify cache size limits: when the maximum number of entries is reached, the oldest entries are evicted
- [ ] Verify cache bypass: non-deterministic requests (temperature > 0) are not cached
- [ ] Verify `ProviderBasedAdvisoryInferenceClient` correctly integrates caching with the provider pipeline
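To make the key-derivation behavior concrete, here is a minimal sketch in Python (the actual implementation lives in the C# `LlmInferenceCache`; the function names here are hypothetical, not the real API). It shows the two properties the test plan checks: the content hash covers prompt, model ID, and parameters, so any parameter change produces a different key, and only deterministic requests (temperature == 0) are eligible for caching.

```python
import hashlib
import json


def cache_key(prompt: str, model_id: str, parameters: dict) -> str:
    """Derive a content hash covering prompt, model, and parameters."""
    # sort_keys makes the serialization canonical, so logically identical
    # parameter dicts always hash to the same key.
    payload = json.dumps(
        {"prompt": prompt, "model": model_id, "parameters": parameters},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def is_cacheable(parameters: dict) -> bool:
    """Only deterministic requests (temperature == 0) are cached."""
    return parameters.get("temperature", 0.0) == 0.0
```

Because temperature is part of the hashed payload, the same prompt at a different temperature is a cache miss by construction, which matches the second item in the test plan.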
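The TTL and size-limit semantics described above can also be sketched. The following Python snippet is an illustrative stand-in for the C# cache, assuming drop-on-read expiry and oldest-first eviction (the class name and injectable clock are hypothetical): expired entries behave like misses, and once the entry limit is reached the oldest entry is evicted to make room.

```python
import time
from collections import OrderedDict


class InferenceCache:
    """Minimal TTL + size-bounded cache: expired entries are dropped on
    read, and the oldest entry is evicted when the size limit is hit."""

    def __init__(self, max_entries: int, ttl_seconds: float, clock=time.monotonic):
        # Maps key -> (stored_at, response); OrderedDict preserves
        # insertion order, so the first item is always the oldest.
        self._entries: "OrderedDict[str, tuple]" = OrderedDict()
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock  # injectable for deterministic tests

    def get(self, key: str):
        item = self._entries.get(key)
        if item is None:
            return None
        stored_at, response = item
        if self._clock() - stored_at > self._ttl:
            del self._entries[key]  # expired: treat as a miss
            return None
        return response

    def put(self, key: str, response: str) -> None:
        if key in self._entries:
            del self._entries[key]  # refresh position and timestamp
        elif len(self._entries) >= self._max_entries:
            self._entries.popitem(last=False)  # evict the oldest entry
        self._entries[key] = (self._clock(), response)
```

The injectable clock makes the TTL expiry and eviction items in the E2E test plan testable without real waiting.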
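Finally, the factory pattern (`LlmProviderFactory` wrapping providers with the caching layer) can be sketched as a decorator around the provider call. This Python sketch uses hypothetical names and a plain dict in place of the TTL/size-bounded cache; it only illustrates the control flow the last test-plan item exercises: consult the cache first, delegate to the wrapped provider on a miss, and store only deterministic responses.

```python
import hashlib
import json


class CachingProvider:
    """Decorator-style wrapper (hypothetical sketch): check the cache
    before delegating to the inner provider, store deterministic results."""

    def __init__(self, inner_complete):
        self._inner = inner_complete  # callable(prompt, model, params) -> str
        self._cache = {}              # stand-in for the TTL/size-bounded cache
        self.calls = 0                # real provider invocations, for testing

    def complete(self, prompt: str, model: str, params: dict) -> str:
        cacheable = params.get("temperature", 0.0) == 0.0
        key = hashlib.sha256(
            json.dumps(
                {"prompt": prompt, "model": model, "params": params},
                sort_keys=True,
            ).encode("utf-8")
        ).hexdigest()
        if cacheable and key in self._cache:
            return self._cache[key]  # cache hit: skip the LLM call entirely
        self.calls += 1
        response = self._inner(prompt, model, params)
        if cacheable:
            self._cache[key] = response  # temperature > 0 bypasses the cache
        return response
```

Wrapping at the factory keeps individual providers cache-agnostic: the client talks to one interface and the caching layer is composed in transparently.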