# LLM Inference Response Caching

## Module

AdvisoryAI

## Status

IMPLEMENTED

## Description

In-memory LLM inference cache that deduplicates identical prompt+model combinations. Reduces API costs and latency by caching deterministic responses keyed by content hash.

## Implementation Details

- **Modules**: `src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/LlmProviders/`
- **Key Classes**:
  - `LlmInferenceCache` (`src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/LlmProviders/LlmInferenceCache.cs`) - in-memory cache keyed by a content hash of prompt+model+parameters
  - `LlmProviderFactory` (`src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/LlmProviders/LlmProviderFactory.cs`) - factory that wraps providers with the caching layer
  - `LlmProviderOptions` (`src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/LlmProviders/LlmProviderOptions.cs`) - provider options, including cache TTL and size limits
  - `ProviderBasedAdvisoryInferenceClient` (`src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/ProviderBasedAdvisoryInferenceClient.cs`) - inference client that uses the caching layer
- **Interfaces**: `ILlmProvider`
- **Source**: SPRINT_20251226_019_AI_offline_inference.md

## E2E Test Plan

- [ ] Send identical prompts twice and verify `LlmInferenceCache` returns the cached response on the second call without hitting the LLM
- [ ] Verify cache keys include the model ID and parameters: the same prompt with a different temperature results in a cache miss
- [ ] Verify cache TTL: cached responses expire after the configured duration
- [ ] Verify cache size limits: when the maximum number of entries is reached, the oldest entries are evicted
- [ ] Verify cache bypass: non-deterministic requests (temperature > 0) are not cached
- [ ] Verify `ProviderBasedAdvisoryInferenceClient` correctly integrates caching with the provider pipeline
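To make the key-derivation behavior concrete, here is a minimal sketch in Python (the actual implementation lives in the C# `LlmInferenceCache`; the function names here are hypothetical, not the real API). It shows the two properties the test plan checks: the content hash covers prompt, model ID, and parameters, so any parameter change produces a different key, and only deterministic requests (temperature == 0) are eligible for caching.

```python
import hashlib
import json


def cache_key(prompt: str, model_id: str, parameters: dict) -> str:
    """Derive a content hash covering prompt, model, and parameters."""
    # sort_keys makes the serialization canonical, so logically identical
    # parameter dicts always hash to the same key.
    payload = json.dumps(
        {"prompt": prompt, "model": model_id, "parameters": parameters},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def is_cacheable(parameters: dict) -> bool:
    """Only deterministic requests (temperature == 0) are cached."""
    return parameters.get("temperature", 0.0) == 0.0
```

Because temperature is part of the hashed payload, the same prompt at a different temperature is a cache miss by construction, which matches the second item in the test plan.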
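The TTL and size-limit semantics described above can also be sketched. The following Python snippet is an illustrative stand-in for the C# cache, assuming drop-on-read expiry and oldest-first eviction (the class name and injectable clock are hypothetical): expired entries behave like misses, and once the entry limit is reached the oldest entry is evicted to make room.

```python
import time
from collections import OrderedDict


class InferenceCache:
    """Minimal TTL + size-bounded cache: expired entries are dropped on
    read, and the oldest entry is evicted when the size limit is hit."""

    def __init__(self, max_entries: int, ttl_seconds: float, clock=time.monotonic):
        # Maps key -> (stored_at, response); OrderedDict preserves
        # insertion order, so the first item is always the oldest.
        self._entries: "OrderedDict[str, tuple]" = OrderedDict()
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock  # injectable for deterministic tests

    def get(self, key: str):
        item = self._entries.get(key)
        if item is None:
            return None
        stored_at, response = item
        if self._clock() - stored_at > self._ttl:
            del self._entries[key]  # expired: treat as a miss
            return None
        return response

    def put(self, key: str, response: str) -> None:
        if key in self._entries:
            del self._entries[key]  # refresh position and timestamp
        elif len(self._entries) >= self._max_entries:
            self._entries.popitem(last=False)  # evict the oldest entry
        self._entries[key] = (self._clock(), response)
```

The injectable clock makes the TTL expiry and eviction items in the E2E test plan testable without real waiting.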
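Finally, the factory pattern (`LlmProviderFactory` wrapping providers with the caching layer) can be sketched as a decorator around the provider call. This Python sketch uses hypothetical names and a plain dict in place of the TTL/size-bounded cache; it only illustrates the control flow the last test-plan item exercises: consult the cache first, delegate to the wrapped provider on a miss, and store only deterministic responses.

```python
import hashlib
import json


class CachingProvider:
    """Decorator-style wrapper (hypothetical sketch): check the cache
    before delegating to the inner provider, store deterministic results."""

    def __init__(self, inner_complete):
        self._inner = inner_complete  # callable(prompt, model, params) -> str
        self._cache = {}              # stand-in for the TTL/size-bounded cache
        self.calls = 0                # real provider invocations, for testing

    def complete(self, prompt: str, model: str, params: dict) -> str:
        cacheable = params.get("temperature", 0.0) == 0.0
        key = hashlib.sha256(
            json.dumps(
                {"prompt": prompt, "model": model, "params": params},
                sort_keys=True,
            ).encode("utf-8")
        ).hexdigest()
        if cacheable and key in self._cache:
            return self._cache[key]  # cache hit: skip the LLM call entirely
        self.calls += 1
        response = self._inner(prompt, model, params)
        if cacheable:
            self._cache[key] = response  # temperature > 0 bypasses the cache
        return response
```

Wrapping at the factory keeps individual providers cache-agnostic: the client talks to one interface and the caching layer is composed in transparently.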