The GPU KV Cache: Why Your LLM's Memory Matters More Than You Think
Every LLM interaction has a hidden cost: the KV Cache. It grows with every token, dictating speed, cost, and context limits. Here's how it works and why it matters.
Every time you send a prompt to an LLM — whether it's Claude, GPT, or a local Llama model — there's a hidden cost that most people never think about. It's not the model's intelligence. It's not the training data. It's the KV Cache, a chunk of GPU memory that grows with every token in your conversation and quietly dictates how fast, how expensive, and how long your interactions can be.
If you've ever wondered why long-context prompts are slower, why Claude Code can burn through tokens surprisingly fast, or why running a 70B model locally makes your GPU sweat — this post will connect the dots.
What Is the KV Cache?
Transformers generate text one token at a time. Each new token needs to "look back" at every previous token through a mechanism called attention. To compute attention, the model creates three things for each token:
- Query (Q): "What am I looking for?"
- Key (K): "What kind of information do I hold?"
- Value (V): "Here's my actual content."
The new token's Query is compared against every previous token's Key to figure out which tokens are relevant. Then the corresponding Values are blended together, weighted by relevance, to produce the output.
Token Generation: Attention Mechanism
─────────────────────────────────────
New Token ──► [Query Q]
│
▼
┌──────────────────────┐
│ Compare Q with all │
│ previous Keys (K) │
└──────────┬───────────┘
│
Attention Scores
┌────┬────┬────┐
│0.05│0.02│0.85│ ◄── relevance weights
└──┬─┴──┬─┴──┬─┘
▼ ▼ ▼
[V₁] [V₂] [V₃] ◄── Value vectors
\ │ /
\ │ /
▼ ▼ ▼
┌──────────────┐
│ Weighted Sum │
│ of Values │
└──────┬───────┘
▼
Output Token

Here's the problem: without caching, the model would recompute every previous token's Key and Value from scratch at every single generation step. For a sequence of 10,000 tokens, that means recomputing 9,999 K/V pairs when generating token 10,000 — an enormous waste, since those values never change.
The KV Cache solves this by storing all previously computed Key and Value vectors in GPU memory. Each new step only computes K/V for the new token, appends it to the cache, and moves on. Simple, effective, and absolutely essential for practical inference speed.
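As a sketch in toy Python (a single attention head, with made-up scalar "projections" standing in for the learned weight matrices a real model uses), the compute-append-attend loop looks like this:

```python
import math

def attend(q, keys, values):
    """Single-head attention: softmax of scaled dot products, then a weighted sum of values."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))
              for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))]

def project(x, scale):
    """Stand-in for a learned projection matrix: just scales the embedding."""
    return [xi * scale for xi in x]

k_cache, v_cache = [], []
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 2-dim embeddings

for x in tokens:
    # Only the NEW token's K and V are computed; earlier ones sit in the cache.
    k_cache.append(project(x, 0.5))
    v_cache.append(project(x, 2.0))
    out = attend(project(x, 1.0), k_cache, v_cache)

print(len(k_cache), len(out))  # 3 2: one cached K (and V) per token, 2-dim output
```

The key point is in the loop body: each step does O(1) new K/V work and reuses everything else from the cache.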
Without KV Cache (Wasteful) With KV Cache (Efficient)
─────────────────────────── ─────────────────────────
Step 1: Compute K,V for t₁ Step 1: Compute K,V for t₁
Step 2: Compute K,V for t₁,t₂ Step 2: Compute K,V for t₂ only
Step 3: Compute K,V for t₁,t₂,t₃ Step 3: Compute K,V for t₃ only
Step 4: Compute K,V for t₁..t₄ Step 4: Compute K,V for t₄ only
... ...
Step N: Compute K,V for t₁..tₙ Step N: Compute K,V for tₙ only
Total: N(N+1)/2 computations Total: N computations
───────────── O(N²) ──────────    ────────── O(N) ──────────

The Classroom Analogy
Think of a classroom where the teacher asks: "What did Ravi eat for lunch?"
Every student in the class represents a token. Each student has written two cards:
- A Key card describing what kind of information they hold — like a label on a folder.
- A Value card with the actual information inside.
Ravi's Key card says "Ravi, food, lunch." His Value card says "Ravi ate pasta for lunch." Priya's Key card says "Priya, homework, math." Amit's Key says "Amit, food, breakfast."
The teacher's question becomes a Query — a search magnet. It gets compared to every student's Key card. Ravi's card is a strong match (score: 95), Amit's is partial (score: 30, right topic but wrong person), and Priya's barely registers (score: 2).
Teacher's Query: "What did Ravi eat for lunch?"
────────────────────────────────────────────────
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Ravi │ │ Priya │ │ Amit │
├──────────┤ ├──────────┤ ├──────────┤
│ K: Ravi, │ │ K: Priya,│ │ K: Amit, │
│ food, │ │ homework,│ │ food, │
│ lunch │ │ math │ │ breakfast│
├──────────┤ ├──────────┤ ├──────────┤
│ V: "Ravi │ │ V: "Priya│ │ V: "Amit │
│ ate pasta│ │ solved │ │ had toast│
│ for │ │ equation │ │ for │
│ lunch" │ │ #5" │ │ bkfst" │
└────┬─────┘ └────┬─────┘ └────┬─────┘
│ │ │
Score: 95 Score: 2 Score: 30
████████░░ ░░░░░░░░░ ███░░░░░░░
│ │ │
▼ ▼ ▼
┌──────────────────────────────────────┐
│ Result: "Ravi ate pasta for lunch" │
│  (≈75% Ravi + 2% Priya + 23% Amit)   │
└──────────────────────────────────────┘

The model reads everyone's Value cards, but pays attention proportional to the match score. It focuses almost entirely on Ravi's value and produces: "Ravi ate pasta for lunch."
Now, without a KV Cache, every time the teacher asks a new question, all the students would have to rewrite their cards from scratch. With the cache, the cards stay on the desk. Only the new question (Query) needs to be formed. Much faster.
Those "cards" are actually vectors — lists of numbers like [0.9, 0.1, 0.8, 0.7]. "Matching" means computing how similar two lists of numbers are. The more similar, the higher the attention score.
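A minimal sketch of that matching step, using hypothetical 4-number "cards" (a real model uses vectors with 128 or more dimensions, and scales the dot products before the softmax):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return [e / sum(exps) for e in exps]

query = [0.9, 0.1, 0.8, 0.7]        # "What did Ravi eat for lunch?"
keys = {
    "Ravi":  [0.9, 0.0, 0.8, 0.7],  # Ravi, food, lunch: near-identical to the query
    "Priya": [0.1, 0.9, 0.0, 0.1],  # Priya, homework, math
    "Amit":  [0.0, 0.1, 0.8, 0.2],  # Amit, food, breakfast: partial overlap
}

scores = {name: dot(query, k) for name, k in keys.items()}
weights = softmax(list(scores.values()))
print(scores["Ravi"] > scores["Amit"] > scores["Priya"])  # True
```

The softmax turns raw scores into the relevance weights from the diagram: they are all positive and sum to 1.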
Scaling to Large Models
That classroom analogy works for intuition, but a real large model is less like a classroom and more like an 80-story school building.
A 70B parameter model like Llama 70B has 80 layers. The input passes through every layer sequentially, with each layer running its own attention computation using its own set of Keys and Values. Early layers learn basic patterns (grammar, entity recognition). Middle layers learn relationships ("Ravi is the one who ate"). Late layers form the actual answer.
Llama 70B: The 80-Layer Tower
─────────────────────────────
┌───────────────────────────────┐
│ Layer 80 Answer formation │ ◄── Late layers
├───────────────────────────────┤
│ Layer 79 ... │
│ ... │
│ Layer 61 ... │
├───────────────────────────────┤
│ Layer 60 Reasoning │ ◄── Middle layers
│ ... │
│ Layer 41 Relationships │
├───────────────────────────────┤
│ Layer 40 ... │
│ ... │
│ Layer 21 ... │
├───────────────────────────────┤
│ Layer 20 Entity recog. │ ◄── Early layers
│ ... │
│ Layer 1 Grammar/syntax │
└───────────────┬───────────────┘
│
Input: "What did
Ravi eat for lunch?"
Each layer has:
├── 64 Query heads
├── 8 KV heads (shared via GQA)
└── 128 dimensions per head

Each layer has multiple attention heads — think of them as different teachers on the same floor, each examining the question from a different angle. Llama 70B has 64 query heads and 8 KV heads per layer (using a technique called Grouped Query Attention, where multiple query heads share a set of KV heads to save memory).
The KV Cache must store Keys and Values for every token, across every head, across all 80 layers. Quick math for a 70B model with 50,000 tokens in context:
50,000 tokens × 8 KV heads × 128 dimensions × 80 layers × 2 (K+V) × 2 bytes (FP16) ≈ 16.4 GB

Push that to 1 million tokens and you're looking at north of 300 GB — just for the cache, on top of the ~140 GB the model weights already occupy.
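That arithmetic is easy to sanity-check. Here's a small calculator using the Llama 70B-style geometry described above (8 KV heads, 128 dims per head, 80 layers, FP16):

```python
def kv_cache_bytes(tokens, kv_heads, head_dim, layers, bytes_per_elem=2):
    """One K and one V vector per token, per KV head, per layer."""
    return tokens * kv_heads * head_dim * layers * 2 * bytes_per_elem

# Llama 70B-style geometry: 80 layers, 8 KV heads (GQA), 128 dims/head, FP16.
print(kv_cache_bytes(50_000, 8, 128, 80) / 1e9)     # 16.384, i.e. ~16.4 GB at 50K tokens
print(kv_cache_bytes(1_000_000, 8, 128, 80) / 1e9)  # 327.68, i.e. ~328 GB at 1M tokens
```

Because every factor is linear, doubling any one of them (context, heads, layers, precision) doubles the cache.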
KV Cache Memory vs. Context Length (Llama 70B, FP16)
─────────────────────────────────────────────────────
Memory
 (GB)
 350 ┤
     │                                             ╱
 300 ┤                                         ╱
     │                                     ╱
 250 ┤                                 ╱
     │                             ╱
 200 ┤                         ╱
     │                     ╱
 150 ┤                 ╱
     │             ╱
 100 ┤         ╱
     │     ╱
  50 ┤ ╱
 16.4┤╱
   0 ┼────┬────┬────┬────┬────┬────┬────┬────┬────┬───
     0   50K 100K 200K 300K 400K 500K 700K 850K  1M
                  Context Length (tokens)

Model weights alone: ~140 GB
At 50K context: ~16.4 GB cache + ~140 GB weights ≈ 156 GB
At 1M context:  ~328 GB cache + ~140 GB weights ≈ 468 GB

Longer Context = Slower and More Expensive
This is the part that catches people off guard. The KV Cache grows linearly with context length, but the *compute cost* of attention grows quadratically. Every new token must attend to all previous tokens.
At position 100,000 in a sequence, generating a single output token requires 100,000 dot products per head per layer. With 64 heads and 80 layers, that's over 500 million dot products for one token.
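The multiplication is worth seeing plainly:

```python
# One dot product per cached position, per query head, per layer,
# just to produce a single new token at position 100,000.
position, query_heads, layers = 100_000, 64, 80
dot_products = position * query_heads * layers
print(f"{dot_products:,}")  # 512,000,000
```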
The practical impact is straightforward:
- 32K context: Fast, efficient, affordable.
- 128K context: Noticeably slower, meaningful cost increase.
- 1M context: Slow, expensive, and justified only when you need holistic understanding across a massive document.
This is exactly why RAG (Retrieval-Augmented Generation) is so popular. Instead of stuffing a million tokens into context, you retrieve the 5–10 most relevant chunks, keep your context at 4–16K tokens, and get both speed and accuracy.
A Real Cost Example
Suppose you have 150K tokens of context and need to make 20 API calls. Each call sends the full context:
- Input: 150K × 20 = 3 million input tokens
- Output: ~1K × 20 = 20K output tokens
On a model priced at $3 per million input tokens, that's roughly $9–10 just in input costs for what might feel like a short task. With a pricier model, you could be looking at $45–50.
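A quick calculator for this, assuming a flat per-million-token input price and ignoring output-token costs (which are small here):

```python
def input_cost_usd(context_tokens, calls, price_per_million):
    """Input-side cost when every stateless call resends the full context."""
    return context_tokens * calls * price_per_million / 1_000_000

# 150K-token context, 20 calls, at two illustrative price points.
print(input_cost_usd(150_000, 20, 3.0))   # 9.0, i.e. $9 at $3/M input tokens
print(input_cost_usd(150_000, 20, 15.0))  # 45.0, i.e. $45 at $15/M input tokens
```

Notice the cost is linear in both context size and call count, so a long context inside a chatty agent loop multiplies quickly.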
Cost Scaling: 20 API Calls at Different Context Sizes
─────────────────────────────────────────────────────
Context Input Tokens Cost @ $3/M Cost @ $15/M
────────── ─────────────── ──────────── ────────────
4K 80,000 $0.24 $1.20
16K 320,000 $0.96 $4.80
32K 640,000 $1.92 $9.60
128K 2,560,000 $7.68 $38.40
150K 3,000,000 $9.00 $45.00
1M         20,000,000      $60.00       $300.00

This is also why tools like Claude Code can surprise you with token usage. Every interaction sends the system prompt, full conversation history, file contents, and tool outputs — and the context compounds with each turn. A deep debugging session can easily burn through millions of tokens.
Where the Real Bottleneck Lives
You might assume the bottleneck is raw computation — all those dot products. It usually isn't. The real bottleneck is memory bandwidth.
To generate each new token, the GPU must read the entire KV cache from VRAM. An NVIDIA H100 has about 3.35 TB/s of memory bandwidth. If your KV cache is 200 GB, just reading it takes roughly 60 milliseconds — before any math happens. This is why long-context generation feels sluggish toward the end of a sequence.
The Memory Bandwidth Wall
─────────────────────────
┌──────────────────────────────────────────────────┐
│ GPU Die │
│ │
│ ┌──────────────┐ ┌──────────────────────────┐ │
│ │ On-chip SRAM │ │ Compute Cores │ │
│ │ ~50 TB/s │◄──►│ (Tensor Cores, CUDA) │ │
│ │ (20 MB) │ │ │ │
│ └──────────────┘ └──────────────────────────┘ │
│ ▲ │
│ │ ◄── This is the bottleneck │
│ ▼ │
│ ┌──────────────────────────────────────────────┐ │
│ │ HBM3 VRAM (80 GB) │ │
│ │ 3.35 TB/s │ │
│ │ │ │
│ │ ┌────────────┐ ┌────────────────────┐ │ │
│ │ │Model Weights│ │ KV Cache │ │ │
│ │ │ ~140 GB │ │ 16-330 GB │ │ │
│ │ └────────────┘ └────────────────────┘ │ │
│ └──────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘
Time to read 200 GB KV cache: 200 / 3,350 ≈ 60 ms
That's 60 ms PER TOKEN, just for memory reads!

This bandwidth wall is also why GPU hardware is evolving toward higher memory bandwidth (HBM3, HBM3e) rather than just more compute cores.
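A back-of-the-envelope check, assuming H100-class bandwidth (~3,350 GB/s) and a hypothetical 200 GB cache:

```python
def decode_floor_ms(cache_gb, bandwidth_gb_per_s):
    """Per-token latency floor: the whole KV cache streams from VRAM once per token."""
    return cache_gb / bandwidth_gb_per_s * 1000.0

latency_ms = decode_floor_ms(200, 3350)  # 200 GB cache, H100-class HBM3
print(round(latency_ms))                 # 60 milliseconds per token, before any math
print(round(1000 / latency_ms))          # 17 tokens/sec ceiling from reads alone
```

This is only a lower bound: real decoding also reads model weights and does the actual arithmetic, so observed throughput is lower still.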
How the Industry Is Fighting Back
Several techniques exist to tame the KV Cache:
Grouped Query Attention (GQA) reduces the number of KV heads. Instead of each query head having its own KV pair, multiple query heads share a smaller set. Llama 70B uses 64 query heads but only 8 KV heads — an 8× reduction in cache size.
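A sketch of the head-sharing, assuming the common convention of mapping contiguous groups of query heads to one KV head:

```python
def kv_head_for(query_head, num_query_heads, num_kv_heads):
    """Map a query head to the KV head its group shares."""
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

# Llama 70B-style: 64 query heads share 8 KV heads, in groups of 8.
assignments = [kv_head_for(q, 64, 8) for q in range(64)]
print(assignments[:9])  # [0, 0, 0, 0, 0, 0, 0, 0, 1]: heads 0-7 share KV head 0
print(64 // 8)          # 8: the cache is 8x smaller than with per-head K/V
```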
Multi-Head Attention (MHA) Grouped Query Attention (GQA)
───────────────────────── ────────────────────────────
Q₁ ─► K₁,V₁ Q₁ ─┐
Q₂ ─► K₂,V₂ Q₂ ─┤
Q₃ ─► K₃,V₃ Q₃ ─┼─► K₁,V₁
Q₄ ─► K₄,V₄ Q₄ ─┘
Q₅ ─► K₅,V₅ Q₅ ─┐
Q₆ ─► K₆,V₆ Q₆ ─┤
Q₇ ─► K₇,V₇ Q₇ ─┼─► K₂,V₂
Q₈ ─► K₈,V₈ Q₈ ─┘
8 KV pairs (100%) 2 KV pairs (25%)
Full memory cost             4× memory savings

PagedAttention, used in serving frameworks like vLLM, manages KV cache memory the way an operating system manages RAM — in small pages rather than one continuous block. This eliminates memory fragmentation and dramatically improves batch throughput.
Flash Attention avoids reading and writing the full attention matrix to GPU VRAM. Instead, it processes attention in small tiles using the GPU's fast on-chip SRAM (~50 TB/s vs. 3.35 TB/s for main memory), fusing operations to minimize data movement.
KV Cache Quantization stores cached vectors in lower precision (FP8 or INT4 instead of FP16), cutting memory usage by 2–4× with minimal quality loss.
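A toy sketch of the idea using symmetric per-vector INT8 quantization (one scale per vector; production systems use more sophisticated schemes, often per-channel or per-block):

```python
def quantize_int8(vec):
    """Symmetric per-vector INT8: one byte per element plus a single FP scale."""
    scale = max(abs(x) for x in vec) / 127 or 1.0  # avoid zero scale for all-zero vectors
    return [round(x / scale) for x in vec], scale

def dequantize(quants, scale):
    return [q * scale for q in quants]

key = [0.9, -0.1, 0.8, 0.7]  # a cached Key vector (stored as FP16 in practice)
quants, scale = quantize_int8(key)
restored = dequantize(quants, scale)

err = max(abs(a - b) for a, b in zip(key, restored))
print(err < 0.01)  # True: small rounding error, at half the memory of FP16
```

One byte per element instead of two halves the cache; an INT4 variant would quarter it.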
Sliding Window Attention, used in models like Mistral, only caches the last *N* tokens, trading distant-context recall for dramatically lower memory usage.
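A minimal sketch of a sliding-window cache using a bounded deque (toy token data; a real cache holds per-layer, per-head tensors):

```python
from collections import deque

class SlidingWindowKV:
    """Caches K/V for only the most recent `window` tokens."""
    def __init__(self, window):
        self.keys = deque(maxlen=window)    # deque evicts the oldest automatically
        self.values = deque(maxlen=window)

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

cache = SlidingWindowKV(window=4)
for t in range(10):          # pretend each list is a token's K/V vector
    cache.append([t], [t])

print(list(cache.keys))  # [[6], [7], [8], [9]]: only the last 4 tokens survive
```

Memory is now bounded by the window size rather than the full sequence length, which is exactly the trade the text describes.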
Local Inference: A Different Game
If you run models locally with tools like Ollama, the KV Cache works differently than cloud APIs.
Cloud APIs are stateless. Every call rebuilds the KV cache from scratch. Your conversation history is sent as input tokens every time, processed, and the cache is discarded after the response.
Ollama is stateful. It keeps the KV cache in GPU memory between turns. Your first prompt is slow (building the cache), but follow-ups are faster because the cache is reused. The tradeoff is that your GPU VRAM stays occupied even when idle — Ollama is holding those KV vectors for you.
Cloud API (Stateless) Ollama (Stateful)
───────────────────── ──────────────────
Turn 1: Turn 1:
[Build cache] ─► Response [Build cache] ─► Response
[Discard cache] [Keep cache in VRAM]
Turn 2: Turn 2:
[Rebuild entire cache] ─► Resp [Append to cache] ─► Response
[Discard cache] [Keep cache in VRAM]
Turn 3: Turn 3:
[Rebuild entire cache] ─► Resp [Append to cache] ─► Response
[Discard cache] [Keep cache in VRAM]
Cost: Reprocess all tokens Cost: Only new tokens
every turn per turn
VRAM: Free between turns     VRAM: Occupied always

This is also why running multiple concurrent conversations locally is impractical. Each session needs its own KV cache, and your single GPU only has so much memory.
Multi-GPU and Multi-Server Serving
When a model is too large for a single GPU, you need to split it. There are two main strategies:
Tensor Parallelism splits each layer horizontally across multiple GPUs within the same server. All GPUs work on the same layer simultaneously, communicating via ultra-fast NVLink (~900 GB/s). This is the primary strategy and works well because the communication is fast and constant.
Pipeline Parallelism distributes groups of layers across different servers. Server 1 handles layers 1–20, Server 2 handles 21–40, and so on. The challenge is that without careful scheduling, only one server works at a time while the others wait. Micro-batching solves this by feeding multiple requests through the pipeline so all servers stay busy — like an assembly line processing multiple cars at once.
The hard constraint is always communication speed. The entire art of distributed inference is minimizing communication across the slowest links.
Communication Speed Hierarchy
─────────────────────────────
┌─────────────────────────────────────────────────┐
│ On-chip SRAM ~50,000 GB/s ████████████│
│ (within GPU core) │
├─────────────────────────────────────────────────┤
│ HBM3 VRAM ~3,350 GB/s █████████ │
│ (GPU ↔ memory) │
├─────────────────────────────────────────────────┤
│ NVLink ~900 GB/s ███████ │
│ (GPU ↔ GPU, same server) │
├─────────────────────────────────────────────────┤
│ InfiniBand ~50-100 GB/s ██ │
│ (server ↔ server) │
├─────────────────────────────────────────────────┤
│ Ethernet ~10-25 GB/s █ │
│ (datacenter network) │
└─────────────────────────────────────────────────┘
Each boundary crossing: ~10× speed drop

The Takeaway
The KV Cache is one of those invisible mechanisms that profoundly shapes the LLM experience. It determines how fast your responses arrive, how much your API calls cost, how long your context can be, and what hardware you need.
The practical lessons are simple:
- Keep context as short as possible. Use RAG or summarization instead of dumping everything into the prompt.
- Understand what you're paying for. Input tokens aren't just about sending data — they drive KV cache construction and attention computation.
- Respect the memory bandwidth wall. Longer context isn't just "more memory" — it's slower generation at every step.
- Choose the right tool for scale. Local inference with Ollama is great for experimentation, but production serving at scale requires techniques like PagedAttention, Flash Attention, and multi-GPU parallelism.
The next time you notice a long-context response slowing down toward the end, you'll know exactly what's happening: somewhere in a data center, a GPU is reading through millions of Key-Value vectors, one layer at a time, 80 floors deep, searching for the right answer to your question.