git.stella-ops.org/docs/features/unchecked/advisoryai/llm-inference-response-caching.md

# LLM Inference Response Caching

**Module:** AdvisoryAI
**Status:** IMPLEMENTED

## Description

In-memory LLM inference cache that deduplicates identical prompt + model combinations. It reduces API costs and latency by caching deterministic responses, keyed by a content hash of the prompt, model, and parameters.
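Concretely, a content-hash key of this kind can be derived by canonicalizing the prompt, model ID, and sampling parameters before hashing. The sketch below is illustrative Python, not the actual C# code in `LlmInferenceCache`; the SHA-256 choice and the canonical-JSON encoding are assumptions.

```python
import hashlib
import json

def cache_key(prompt: str, model: str, params: dict) -> str:
    """Derive a deterministic cache key from prompt + model + parameters.

    Canonical JSON (sorted keys, compact separators) guarantees that
    logically identical requests serialize, and therefore hash, the same way.
    """
    payload = json.dumps(
        {"prompt": prompt, "model": model, "params": params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

k1 = cache_key("Summarize CVE-2024-1234", "model-a", {"temperature": 0})
k2 = cache_key("Summarize CVE-2024-1234", "model-a", {"temperature": 0})
k3 = cache_key("Summarize CVE-2024-1234", "model-a", {"temperature": 0.7})
assert k1 == k2  # identical prompt+model+parameters share a key
assert k1 != k3  # changing any parameter yields a different key (cache miss)
```

Because the parameters are part of the hashed payload, a temperature change automatically produces a different key and therefore a cache miss.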

## Implementation Details

- **Modules:** `src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/LlmProviders/`
- **Key classes:**
  - `LlmInferenceCache` (`src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/LlmProviders/LlmInferenceCache.cs`): in-memory cache keyed by a content hash of prompt + model + parameters
  - `LlmProviderFactory` (`src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/LlmProviders/LlmProviderFactory.cs`): factory that wraps providers with the caching layer
  - `LlmProviderOptions` (`src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/LlmProviders/LlmProviderOptions.cs`): provider options, including cache TTL and size limits
  - `ProviderBasedAdvisoryInferenceClient` (`src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/ProviderBasedAdvisoryInferenceClient.cs`): inference client that uses the caching layer
- **Interfaces:** `ILlmProvider`
- **Source:** SPRINT_20251226_019_AI_offline_inference.md
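The factory-wraps-provider design can be pictured as a decorator around the raw provider call. The following is a minimal Python sketch of that pattern under stated assumptions (TTL in seconds, eviction of the oldest entry when full, and a `deterministic` flag standing in for the temperature > 0 bypass); it is not the StellaOps C# API.

```python
import time
from collections import OrderedDict

class CachingProvider:
    """Minimal sketch: wrap a provider callable with a TTL + size-bounded cache."""

    def __init__(self, inner, ttl_seconds=3600.0, max_entries=1024, clock=time.monotonic):
        self._inner = inner          # the real provider: prompt -> response
        self._ttl = ttl_seconds
        self._max = max_entries
        self._clock = clock
        self._cache = OrderedDict()  # key -> (expires_at, response)

    def complete(self, key, prompt, deterministic=True):
        now = self._clock()
        entry = self._cache.get(key)
        if entry is not None and entry[0] > now:
            self._cache.move_to_end(key)  # refresh recency on a hit
            return entry[1]
        response = self._inner(prompt)    # miss (or expired): call the provider
        if deterministic:                 # only deterministic responses are cached
            self._cache[key] = (now + self._ttl, response)
            self._cache.move_to_end(key)
            while len(self._cache) > self._max:
                self._cache.popitem(last=False)  # evict the oldest entry
        return response
```

Keeping the cache inside a wrapper means the inference client only ever sees the `ILlmProvider`-shaped surface, which matches the factory-based layering described above.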

## E2E Test Plan

- Send identical prompts twice and verify `LlmInferenceCache` returns the cached response on the second call without hitting the LLM.
- Verify cache keys include model ID and parameters: the same prompt with a different temperature results in a cache miss.
- Verify cache TTL: cached responses expire after the configured duration.
- Verify cache size limits: when the maximum number of entries is reached, the oldest entries are evicted.
- Verify cache bypass: non-deterministic requests (temperature > 0) are not cached.
- Verify `ProviderBasedAdvisoryInferenceClient` correctly integrates caching with the provider pipeline.
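The TTL check above is easiest to exercise with an injectable clock rather than real sleeps. This is a self-contained Python sketch of that testing technique; `FakeClock` and `TtlCache` are hypothetical stand-ins, not StellaOps types.

```python
class FakeClock:
    """Controllable time source so expiry can be tested without sleeping."""
    def __init__(self):
        self.now = 0.0
    def __call__(self):
        return self.now

class TtlCache:
    """Tiny stand-in for a TTL cache: entries expire after `ttl` clock units."""
    def __init__(self, ttl, clock):
        self._ttl, self._clock, self._store = ttl, clock, {}
    def get(self, key):
        hit = self._store.get(key)
        return hit[1] if hit and hit[0] > self._clock() else None
    def put(self, key, value):
        self._store[key] = (self._clock() + self._ttl, value)

clock = FakeClock()
cache = TtlCache(ttl=60, clock=clock)
cache.put("k", "cached response")
assert cache.get("k") == "cached response"  # within TTL: hit
clock.now = 61
assert cache.get("k") is None               # past TTL: expired
```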