LLM

Cache Hits in LLMs

2026-05-07

This note was generated by Gemini.

Overview

In the context of Large Language Models (LLMs), a cache hit occurs when the serving stack finds that data needed for the current request (tokens, KV pairs, or full responses) was already computed for a previous request and is still stored, so it can be reused instead of recomputed.

🚀 Key Types of Cache Hits

1. KV (Key-Value) Cache Hit

Level: Inference Engine (e.g., vLLM, TensorRT-LLM).
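Mechanism: During the "Prefill" stage, the engine computes the Key and Value tensors of the attention mechanism for every prompt token and keeps them in memory.
The Hit: When a new request starts with the same token prefix as a cached one, the engine reuses the stored tensors and runs prefill only on the remaining tokens.
Result: Reduces Time to First Token (TTFT), roughly in proportion to the length of the shared prefix. Decoding the output tokens still costs the same.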
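
The prefix reuse described above can be sketched with a toy model. This is an illustration under stated assumptions, not how vLLM or TensorRT-LLM implement it: `ToyPrefixCache`, `BLOCK_SIZE`, and the string placeholders standing in for KV tensors are all hypothetical, and real engines manage GPU memory blocks and eviction.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # hypothetical block size, in tokens


class ToyPrefixCache:
    """Toy stand-in for a KV cache keyed by token-prefix blocks."""

    def __init__(self) -> None:
        # Maps a full token prefix to a placeholder for its KV tensors.
        self.blocks: Dict[Tuple[int, ...], str] = {}

    def prefill(self, tokens: List[int]) -> Tuple[int, int]:
        """Return (tokens_reused, tokens_computed) for this prompt."""
        n_full = len(tokens) // BLOCK_SIZE * BLOCK_SIZE
        reused = 0
        # Walk whole blocks from the start; stop at the first uncached prefix.
        for end in range(BLOCK_SIZE, n_full + 1, BLOCK_SIZE):
            if tuple(tokens[:end]) in self.blocks:
                reused = end
            else:
                break
        # "Compute" and store the remaining whole blocks for later requests.
        for end in range(reused + BLOCK_SIZE, n_full + 1, BLOCK_SIZE):
            self.blocks[tuple(tokens[:end])] = f"kv[{end - BLOCK_SIZE}:{end}]"
        return reused, len(tokens) - reused


if __name__ == "__main__":
    cache = ToyPrefixCache()
    system_prompt = list(range(8))  # shared prefix, e.g. a system prompt
    print(cache.prefill(system_prompt + [100, 101]))       # first request: miss
    print(cache.prefill(system_prompt + [200, 201, 202]))  # second request: prefix hit
```

In this sketch only an exact match from the first token counts as a hit, so a prompt that differs in its first block reuses nothing even if the rest is identical.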

2. Semantic Cache Hit

Level: Application Layer.
Mechanism: Uses Vector Embeddings to compare queries.
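The Hit: Returns a saved answer if the new query's embedding is close enough to a stored query's embedding (for example, cosine similarity above a chosen threshold).
Result: No LLM inference cost for that request. Latency drops to the time needed to embed the query and search the cache, with the risk of returning a wrong answer if the threshold is too loose.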
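
A minimal sketch of the lookup described above, using only the standard library. The names `toy_embed`, `ToySemanticCache`, and the `0.8` threshold are assumptions for illustration: the bag-of-words "embedding" only stands in for a learned embedding model, and a production cache would typically use a vector index rather than a linear scan.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToySemanticCache:
    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold  # assumed value; tune per application
        self.entries: List[Tuple[Counter, str]] = []

    def put(self, query: str, answer: str) -> None:
        self.entries.append((toy_embed(query), answer))

    def get(self, query: str) -> Optional[str]:
        """Return the best stored answer if it clears the threshold, else None."""
        q = toy_embed(query)
        best_score, best_answer = 0.0, None
        for emb, answer in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


if __name__ == "__main__":
    cache = ToySemanticCache()
    cache.put("what is a kv cache", "answer A")
    print(cache.get("what is a kv cache please"))  # similar wording: hit
    print(cache.get("how do i bake bread"))        # unrelated: miss
```

On a miss (`None`), the application would call the model and then `put` the new query and answer into the cache.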

📊 Comparison Table

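Metric	Cache Hit	Cache Miss
Computation	Low (retrieval, or prefill of the uncached suffix only)	High (full forward pass over the prompt)
Latency	Often milliseconds for a semantic hit; shorter TTFT for a KV hit	Often seconds
Cost	Low to negligible	Full token usage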

Tip

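Raising your cache hit ratio is one of the most effective ways to cut latency and cost as an LLM app scales. For KV caching, this means keeping stable content such as system prompts and shared context at the start of the prompt so that prefixes match across requests.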
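
To make the effect of the hit ratio concrete, here is a small sketch that computes it and the resulting average latency. The function names and the latency figures in the usage example are hypothetical placeholders, not measurements.

```python
def hit_ratio(hits: int, misses: int) -> float:
    """Fraction of requests served from cache."""
    total = hits + misses
    return hits / total if total else 0.0


def expected_latency_ms(ratio: float, hit_ms: float, miss_ms: float) -> float:
    """Average latency when a `ratio` share of requests are cache hits."""
    return ratio * hit_ms + (1.0 - ratio) * miss_ms


if __name__ == "__main__":
    ratio = hit_ratio(hits=3, misses=1)
    # Assumed latencies: 20 ms on a hit, 1000 ms on a miss.
    print(ratio, expected_latency_ms(ratio, hit_ms=20.0, miss_ms=1000.0))
```

Because the miss latency dominates the average, each additional point of hit ratio removes close to one percent of the miss cost.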
