Prompt Caching for LLM Pipelines: Fast Responses Without Stale Logic
Prompt caching can cut cost and latency, but stale cached outputs become dangerous when prompt templates evolve. You need semantic keys and strict invalidation rules.
Step 1: include template version in cache key
def cache_key(task_id, template_version, input_hash):
    return f"{task_id}:{template_version}:{input_hash}"
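A minimal sketch of producing the `input_hash` component, assuming SHA-256 over the raw input text (the helper name and truncation length are illustrative, not prescribed by the article):

```python
import hashlib

def input_hash(text: str) -> str:
    # Hash the raw input so the key stays short, uniform, and safe to store.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

def cache_key(task_id: str, template_version: str, input_hash: str) -> str:
    return f"{task_id}:{template_version}:{input_hash}"

# Bumping template_version (e.g. v12 -> v13) changes every key,
# so outputs from the old template can never be served by accident.
key = cache_key("draft_blog", "v12", input_hash("Write an intro about caching"))
```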
Step 2: add policy-driven TTL by task sensitivity
TTL = {
    "draft_blog": 3600,       # low sensitivity: 1 hour
    "security_summary": 300,  # high sensitivity: 5 minutes
}
Step 3: invalidate on taxonomy or style-rubric changes
{
    "event": "style_rules_updated",
    "invalidate_prefix": "draft_blog:v12:"
}
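A sketch of handling that event against an in-memory key-value store (the `handle_event` and `invalidate_prefix` helpers are illustrative; a real deployment would use its store's native prefix scan):

```python
def invalidate_prefix(store: dict, prefix: str) -> int:
    # Collect matching keys first, then delete, so we never
    # mutate the dict while iterating over it.
    doomed = [k for k in store if k.startswith(prefix)]
    for k in doomed:
        del store[k]
    return len(doomed)

def handle_event(store: dict, event: dict) -> int:
    # Only style-rubric updates trigger prefix invalidation here;
    # other event types pass through untouched.
    if event.get("event") == "style_rules_updated":
        return invalidate_prefix(store, event["invalidate_prefix"])
    return 0
```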
Pitfall
Caching by input text alone. An input-only key ignores policy, prompt-template, and model changes, so the cache keeps serving outputs that were acceptable under the old configuration but no longer are.
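One way to avoid this pitfall is to fold every output-affecting dimension into the key, not just the input. This extends the Step 1 key with model and policy components (the `full_cache_key` name and extra parameters are assumptions, shown for illustration):

```python
def full_cache_key(task_id: str, template_version: str, model_id: str,
                   policy_version: str, input_hash: str) -> str:
    # Any change to the template, model, or policy produces a new key,
    # so stale outputs become unreachable instead of being served.
    return ":".join([task_id, template_version, model_id, policy_version, input_hash])
```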