LLM Inference Response Caching
Module: AdvisoryAI
Status: IMPLEMENTED
Description
In-memory LLM inference cache that deduplicates identical prompt+model combinations. Reduces API costs and latency by caching deterministic responses keyed by content hash.
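The content-hash keying described above can be sketched as follows. This is a minimal Python illustration of the idea, not the actual C# implementation; the exact hash algorithm and serialization used by `LlmInferenceCache` are assumptions here.

```python
import hashlib
import json

def cache_key(prompt: str, model_id: str, parameters: dict) -> str:
    """Derive a deterministic cache key from prompt + model + parameters."""
    # Canonical JSON (sorted keys, no whitespace) so logically identical
    # parameter dicts always produce the same hash.
    payload = json.dumps(
        {"prompt": prompt, "model": model_id, "parameters": parameters},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the key covers the full parameter set, two requests that differ only in, say, temperature will hash to different keys and never collide in the cache.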
Implementation Details
- Modules: src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/LlmProviders/
- Key Classes:
  - LlmInferenceCache (src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/LlmProviders/LlmInferenceCache.cs) - in-memory cache keyed by content hash of prompt+model+parameters
  - LlmProviderFactory (src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/LlmProviders/LlmProviderFactory.cs) - factory that wraps providers with the caching layer
  - LlmProviderOptions (src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/LlmProviders/LlmProviderOptions.cs) - provider options, including cache TTL and size limits
  - ProviderBasedAdvisoryInferenceClient (src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/ProviderBasedAdvisoryInferenceClient.cs) - inference client that uses the caching layer
- Interfaces: ILlmProvider
- Source: SPRINT_20251226_019_AI_offline_inference.md
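The TTL and size-limit behavior the options control can be sketched like this. This is a hypothetical Python model of the semantics (expiry after a configured duration, oldest-entry eviction at the size cap), not the C# `LlmInferenceCache` itself; names and defaults are illustrative.

```python
import time
from collections import OrderedDict

class InferenceCache:
    """Illustrative in-memory cache with TTL expiry and oldest-entry eviction."""

    def __init__(self, max_entries=1024, ttl_seconds=3600.0):
        self._entries = OrderedDict()  # key -> (inserted_at, response)
        self._max_entries = max_entries
        self._ttl = ttl_seconds

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self._entries.get(key)
        if entry is None:
            return None
        inserted_at, response = entry
        if now - inserted_at > self._ttl:
            # Expired: drop the entry and report a miss.
            del self._entries[key]
            return None
        return response

    def put(self, key, response, now=None):
        now = time.monotonic() if now is None else now
        self._entries[key] = (now, response)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_entries:
            # Over the size limit: evict the oldest entry first.
            self._entries.popitem(last=False)
```

The `now` parameter exists only to make expiry testable without real waiting; a production cache would read the clock internally.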
E2E Test Plan
- Send identical prompts twice and verify LlmInferenceCache returns the cached response on the second call without hitting the LLM
- Verify cache keys include model ID and parameters: the same prompt with a different temperature results in a cache miss
- Verify cache TTL: cached responses expire after configured duration
- Verify cache size limits: when max entries are reached, oldest entries are evicted
- Verify cache bypass: non-deterministic requests (temperature > 0) are not cached
- Verify ProviderBasedAdvisoryInferenceClient correctly integrates caching with the provider pipeline
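The first and fifth checks in this plan can be sketched end to end in a few lines. This is Python pseudocode of the expected behavior; `StubProvider` and `CachingProvider` are hypothetical stand-ins for the real C# provider and caching wrapper, and the bypass condition (temperature > 0) mirrors the rule stated above.

```python
import hashlib
import json

class StubProvider:
    """Stand-in for a real LLM provider; counts how often it is invoked."""

    def __init__(self):
        self.calls = 0

    def infer(self, prompt, model_id, parameters):
        self.calls += 1
        return f"response:{prompt}"

class CachingProvider:
    """Caches deterministic requests; bypasses the cache when temperature > 0."""

    def __init__(self, inner):
        self._inner = inner
        self._cache = {}

    def infer(self, prompt, model_id, parameters):
        if parameters.get("temperature", 0.0) > 0.0:
            # Non-deterministic request: always go to the provider.
            return self._inner.infer(prompt, model_id, parameters)
        key = hashlib.sha256(
            json.dumps([prompt, model_id, parameters], sort_keys=True).encode()
        ).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._inner.infer(prompt, model_id, parameters)
        return self._cache[key]
```

Asserting on the stub's call count is what lets the test distinguish "cached response returned" from "provider silently invoked again".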