LLM Inference Response Caching
Module: AdvisoryAI
Status: IMPLEMENTED
Description
In-memory LLM inference cache that deduplicates identical prompt+model combinations. Reduces API costs and latency by caching deterministic responses keyed by content hash.
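The content-hash keying described above can be sketched as follows. This is a minimal Python illustration of the idea, not the actual C# implementation; the exact hash algorithm and serialization used by `LlmInferenceCache` are assumptions here.

```python
import hashlib
import json

def cache_key(prompt: str, model_id: str, parameters: dict) -> str:
    """Derive a deterministic cache key from prompt + model + parameters."""
    # Canonical JSON (sorted keys, no whitespace) so logically identical
    # parameter dicts always produce the same hash.
    payload = json.dumps(
        {"prompt": prompt, "model": model_id, "parameters": parameters},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the key covers the full parameter set, two requests that differ only in, say, temperature will hash to different keys and never collide in the cache.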
Implementation Details
- Modules: src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/LlmProviders/
- Key Classes:
  - LlmInferenceCache (src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/LlmProviders/LlmInferenceCache.cs) - in-memory cache keyed by content hash of prompt+model+parameters
  - LlmProviderFactory (src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/LlmProviders/LlmProviderFactory.cs) - factory that wraps providers with the caching layer
  - LlmProviderOptions (src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/LlmProviders/LlmProviderOptions.cs) - provider options, including cache TTL and size limits
  - ProviderBasedAdvisoryInferenceClient (src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/ProviderBasedAdvisoryInferenceClient.cs) - inference client that uses the caching layer
- Interfaces: ILlmProvider
- Source: SPRINT_20251226_019_AI_offline_inference.md
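The TTL and size-limit behavior the options control can be sketched like this. This is a hypothetical Python model of the semantics (expiry after a configured duration, oldest-entry eviction at the size cap), not the C# `LlmInferenceCache` itself; names and defaults are illustrative.

```python
import time
from collections import OrderedDict

class InferenceCache:
    """Illustrative in-memory cache with TTL expiry and oldest-entry eviction."""

    def __init__(self, max_entries=1024, ttl_seconds=3600.0):
        self._entries = OrderedDict()  # key -> (inserted_at, response)
        self._max_entries = max_entries
        self._ttl = ttl_seconds

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self._entries.get(key)
        if entry is None:
            return None
        inserted_at, response = entry
        if now - inserted_at > self._ttl:
            # Expired: drop the entry and report a miss.
            del self._entries[key]
            return None
        return response

    def put(self, key, response, now=None):
        now = time.monotonic() if now is None else now
        self._entries[key] = (now, response)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_entries:
            # Over the size limit: evict the oldest entry first.
            self._entries.popitem(last=False)
```

The `now` parameter exists only to make expiry testable without real waiting; a production cache would read the clock internally.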
E2E Test Plan
- Send identical prompts twice and verify LlmInferenceCache returns the cached response on the second call without hitting the LLM
- Verify cache keys include model ID and parameters: the same prompt with a different temperature results in a cache miss
- Verify cache TTL: cached responses expire after configured duration
- Verify cache size limits: when max entries are reached, oldest entries are evicted
- Verify cache bypass: non-deterministic requests (temperature > 0) are not cached
- Verify ProviderBasedAdvisoryInferenceClient correctly integrates caching with the provider pipeline
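The first and fifth checks in this plan can be sketched end to end in a few lines. This is Python pseudocode of the expected behavior; `StubProvider` and `CachingProvider` are hypothetical stand-ins for the real C# provider and caching wrapper, and the bypass condition (temperature > 0) mirrors the rule stated above.

```python
import hashlib
import json

class StubProvider:
    """Stand-in for a real LLM provider; counts how often it is invoked."""

    def __init__(self):
        self.calls = 0

    def infer(self, prompt, model_id, parameters):
        self.calls += 1
        return f"response:{prompt}"

class CachingProvider:
    """Caches deterministic requests; bypasses the cache when temperature > 0."""

    def __init__(self, inner):
        self._inner = inner
        self._cache = {}

    def infer(self, prompt, model_id, parameters):
        if parameters.get("temperature", 0.0) > 0.0:
            # Non-deterministic request: always go to the provider.
            return self._inner.infer(prompt, model_id, parameters)
        key = hashlib.sha256(
            json.dumps([prompt, model_id, parameters], sort_keys=True).encode()
        ).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._inner.infer(prompt, model_id, parameters)
        return self._cache[key]
```

Asserting on the stub's call count is what lets the test distinguish "cached response returned" from "provider silently invoked again".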