2026-05-07
Cache Hits in LLMs

This note was generated by Gemini.

Overview

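In the context of Large Language Models (LLMs), a Cache Hit occurs when the serving stack finds that data needed for the current request (tokens, KV pairs, or a full response) was already computed for a previous request and is still stored, so it can be reused instead of recomputed. A Cache Miss is the opposite case: nothing reusable is found and the work is done from scratch.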
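The simplest case, a hit on a full response, can be sketched as an exact-match lookup. This is a minimal illustration, not how any particular product works: `ResponseCache`, `get_or_compute`, and `fake_generate` are hypothetical names, and the "model" is a stub.

```python
import hashlib


class ResponseCache:
    """Hypothetical exact-match cache keyed by a hash of the prompt text."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, prompt):
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get_or_compute(self, prompt, generate):
        key = self._key(prompt)
        if key in self._store:
            # Cache hit: reuse the stored response, no model call.
            self.hits += 1
            return self._store[key]
        # Cache miss: call the (expensive) generator and remember the result.
        self.misses += 1
        response = generate(prompt)
        self._store[key] = response
        return response


def fake_generate(prompt):
    # Stub standing in for a real LLM call (assumption for this sketch).
    return f"answer to: {prompt}"


cache = ResponseCache()
first = cache.get_or_compute("What is a KV cache?", fake_generate)   # miss
second = cache.get_or_compute("What is a KV cache?", fake_generate)  # hit
```

An exact-match cache only helps when the prompt is byte-for-byte identical, which is why the two types below (prefix reuse and semantic matching) exist.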


🚀 Key Types of Cache Hits

1. KV (Key-Value) Cache Hit

  • Level: Inference Engine (e.g., vLLM, TensorRT-LLM).
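  • Mechanism: During the “Prefill” stage, the engine computes the Key and Value tensors of the attention mechanism for every prompt token and keeps them in memory.
  • The Hit: When a new request begins with the same token prefix (for example, a shared system prompt), the engine reuses the stored tensors and runs prefill only on the tokens that follow.
  • Result: Less prefill computation, which reduces Time to First Token (TTFT). The saving grows with the length of the shared prefix.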
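The prefix-reuse idea can be sketched with a toy model that only counts tokens. It stores no real tensors and does not reflect how vLLM or TensorRT-LLM organize their caches internally; `ToyPrefixCache` and `prefill` are hypothetical names. The sketch also assumes at least one token is always computed, so the model has something to produce next-token logits from.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class ToyPrefixCache:
    """Tracks which token sequences have 'stored KV' (toy model, no tensors)."""

    def __init__(self):
        self._cached = []

    def prefill(self, tokens):
        """Return (reused, computed) token counts for this request."""
        longest = max(
            (shared_prefix_len(tokens, c) for c in self._cached), default=0
        )
        # Assumption: always compute at least the last token.
        reused = min(longest, len(tokens) - 1) if tokens else 0
        computed = len(tokens) - reused
        self._cached.append(list(tokens))
        return reused, computed


system_prompt = [1, 2, 3, 4, 5, 6]       # made-up token ids
engine = ToyPrefixCache()
r1 = engine.prefill(system_prompt + [10, 11])      # cold cache
r2 = engine.prefill(system_prompt + [20, 21, 22])  # shares the system prompt
```

Here the second request reuses the six system-prompt tokens and computes only its three new ones, which is the source of the TTFT reduction.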

2. Semantic Cache Hit

  • Level: Application Layer.
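  • Mechanism: Embeds each query as a vector and compares it with the embeddings of previously answered queries.
  • The Hit: Returns the saved answer when the similarity exceeds a chosen threshold, i.e., when the two queries mean nearly the same thing.
  • Result: No LLM forward pass is needed. The remaining cost is the embedding computation and the vector lookup, so latency is typically far lower than for generation.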
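A semantic cache lookup can be sketched as follows. The embedding here is a bag-of-words count vector, used only as a stand-in so the example runs without external dependencies; a real system would use a learned embedding model and a vector index. `SemanticCache`, `toy_embed`, and the `0.8` threshold are all assumptions made for illustration.

```python
import math
from collections import Counter


def toy_embed(text):
    # Stand-in for a real embedding model: word counts of the lowercased text.
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold  # hypothetical similarity cutoff
        self._entries = []          # list of (embedding, answer)

    def put(self, query, answer):
        self._entries.append((toy_embed(query), answer))

    def get(self, query):
        """Return the best-matching saved answer, or None on a miss."""
        q = toy_embed(query)
        best_score, best_answer = 0.0, None
        for emb, answer in self._entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


sem = SemanticCache()
sem.put("how do i reset my password", "Use the reset link.")
hit = sem.get("how do i reset my password please")
miss = sem.get("what is the weather today")
```

The threshold is the main design choice: set too low, the cache returns answers to questions that only look similar; set too high, it rarely hits.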

📊 Comparison Table

Metric | Cache Hit | Cache Miss
Computation | Low (Retrieval) | High (Full Forward Pass)
Latency | Milliseconds | Seconds
Cost | Negligible | Full Token Usage

Tip

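Raising your Cache Hit Ratio (hits divided by total requests) is one of the most effective ways to cut latency and cost as an LLM app scales. For KV caching, place shared content such as the system prompt at the start of the prompt so that requests have identical prefixes.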
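The effect of the hit ratio on average latency can be worked through with a short calculation. The function names and all numbers below are made up for illustration; they are not measurements.

```python
def hit_ratio(hits, misses):
    """Fraction of requests served from cache."""
    total = hits + misses
    return hits / total if total else 0.0


def expected_latency_ms(ratio, hit_ms, miss_ms):
    """Average latency as a weighted mix of hit and miss latency."""
    return ratio * hit_ms + (1 - ratio) * miss_ms


# Illustrative values only: 75 hits, 25 misses, 20 ms per hit, 2000 ms per miss.
ratio = hit_ratio(75, 25)
avg_ms = expected_latency_ms(ratio, 20.0, 2000.0)
```

With these assumed numbers the ratio is 0.75 and the average latency is 515 ms, so the misses still dominate the average even though they are only a quarter of the traffic.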