2026-05-07
The note is generated by Gemini.
Overview
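In the context of Large Language Models (LLMs), a cache hit occurs when the serving stack (the inference engine, or the application layer in front of it) finds that data needed for the current request (KV tensors for prompt tokens, or a full response) was already computed for a previous request and is still stored, so it can be reused instead of recomputed.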
🚀 Key Types of Cache Hits
1. KV (Key-Value) Cache Hit
- Level: Inference Engine (e.g., vLLM, TensorRT-LLM).
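- Mechanism: During the “Prefill” stage, the engine computes the Key and Value tensors of the attention mechanism for every prompt token and keeps them in memory.
- The Hit: When a new request starts with a token prefix identical to a cached one, the engine reuses the stored tensors for that prefix and runs prefill only on the remaining tokens.
- Result: Drastically reduces Time to First Token (TTFT), especially for long shared prefixes such as system prompts. The decoding of new output tokens still runs as usual.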
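The prefix-matching idea can be illustrated with a toy sketch. This is not how vLLM or TensorRT-LLM are implemented; `ToyPrefixCache`, `BLOCK_SIZE`, and the string placeholders standing in for KV tensors are all hypothetical names chosen for illustration. It assumes the cache works at the granularity of fixed-size token blocks and that a block only counts as a hit when every token before it also matches.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # hypothetical block size, in tokens


class ToyPrefixCache:
    """Maps a token prefix (cut at block boundaries) to a KV placeholder."""

    def __init__(self) -> None:
        self._blocks: Dict[Tuple[int, ...], str] = {}

    def _block_prefixes(self, tokens: List[int]) -> List[Tuple[int, ...]]:
        # Each key is the whole prefix up to a block boundary, so a block
        # can only match if everything before it matches as well.
        n_full = len(tokens) // BLOCK_SIZE
        return [tuple(tokens[: (i + 1) * BLOCK_SIZE]) for i in range(n_full)]

    def store(self, tokens: List[int]) -> None:
        """Record that KV tensors exist for every full block of `tokens`."""
        for key in self._block_prefixes(tokens):
            self._blocks.setdefault(key, f"kv[{len(key)} tokens]")

    def match(self, tokens: List[int]) -> int:
        """Return how many leading tokens already have cached KV entries."""
        hit = 0
        for key in self._block_prefixes(tokens):
            if key not in self._blocks:
                break
            hit = len(key)
        return hit


cache = ToyPrefixCache()
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]   # shared prefix (token IDs)
cache.store(system_prompt + [9, 10])        # first request fills the cache

second_request = system_prompt + [11, 12, 13]
reused = cache.match(second_request)        # tokens served from the cache
to_prefill = len(second_request) - reused   # tokens that still need prefill
print(reused, to_prefill)
```

In this sketch the second request shares the eight-token system prompt with the first, so only its three new tokens would need to go through prefill.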
2. Semantic Cache Hit
- Level: Application Layer.
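- Mechanism: Embeds each incoming query as a vector and compares it with the embeddings of previously answered queries.
- The Hit: Returns the saved answer when the similarity to a stored query exceeds a chosen threshold, i.e. the two queries are judged to mean the same thing.
- Result: No LLM inference cost and very low latency, since only an embedding computation and a vector lookup are needed.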
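A semantic cache lookup can be sketched as follows. Everything here is an assumption made for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model (it captures word overlap, not meaning), `ToySemanticCache` is a hypothetical class, and the `0.8` threshold is arbitrary. A real deployment would use a proper embedding model and a vector index rather than a linear scan.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    # Stand-in for a real embedding model: lowercase word counts.
    return {word: float(n) for word, n in Counter(text.lower().split()).items()}


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToySemanticCache:
    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold  # assumed cutoff for "same meaning"
        self._entries: List[Tuple[Vector, str]] = []

    def put(self, query: str, answer: str) -> None:
        self._entries.append((toy_embed(query), answer))

    def get(self, query: str) -> Optional[str]:
        """Return the saved answer of the closest stored query, or None on a miss."""
        q = toy_embed(query)
        best_score, best_answer = 0.0, None
        for emb, answer in self._entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


semantic_cache = ToySemanticCache()
semantic_cache.put("what is a kv cache", "It stores attention keys and values.")

hit = semantic_cache.get("What is a KV cache")     # same words, different case
miss = semantic_cache.get("how do I bake bread")   # unrelated query
```

On a miss the application would call the LLM as usual and then `put` the new query and answer into the cache. The threshold trades hit rate against the risk of returning an answer to a question that only looks similar.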
📊 Comparison Table
| Metric | Cache Hit | Cache Miss |
|---|---|---|
| Computation | Low (Retrieval) | High (Full Forward Pass) |
| Latency | Milliseconds | Seconds |
| Cost | Negligible | Full Token Usage |