Determine GPU memory overhead for KV cache in LLM serving
Serving a single Llama-2-7B request at a context length of 4096 tokens in FP16 precision requires approximately 2 GB of VRAM for the Key-Value (KV) cache alone, about 2.5% of an NVIDIA H100-80GB. This overhead is unavoidable for autoregressive inference, and it is paid again for every concurrent request.

The Mechanics of KV Cache Scaling: Why Memory Bottlenecks Occur
The KV cache footprint is a dynamic variable determined by four primary factors: batch size, sequence length, number of layers, and the hidden dimension of the model. Unlike model weights, which remain static once loaded into VRAM, the KV cache expands linearly with every active request. This creates a volatile memory environment where a sudden spike in input sequence length can trigger an Out-of-Memory (OOM) error, even if the initial deployment appeared stable.
Memory allocation in standard inference engines traditionally relies on contiguous blocks. This approach is inefficient. When a request requires more memory than initially allocated, the engine must either pre-allocate a maximum-length buffer—leading to internal fragmentation—or re-allocate and move data, which spikes latency. The memory overhead is not merely the sum of the tensors; it includes the "slack" or wasted space reserved for potential sequence expansion.
Consider the concrete mechanics. Each transformer layer maintains its own Key and Value tensors. At generation step *t*, the model computes a new Key-Value pair and appends it to the running cache. The cache at step *t* holds *t* vectors per layer for both K and V. At step 1, you store two vectors per layer (one Key, one Value). At step 4096, you store 8192 vectors per layer (4096 Keys + 4096 Values). The memory consumption is predictable and linear, but its absolute size is what catches teams off guard during production scaling.
KV cache memory overhead scales linearly with sequence length and batch size, making it the primary constraint for high-throughput LLM deployments.
The second-order effect is throughput collapse. When the KV cache consumes the majority of available VRAM, the maximum batch size drops. Fewer concurrent requests means lower GPU utilization. The hardware sits idle between memory-bound operations, burning power without producing proportional output. This is the mechanism by which KV cache mismanagement translates directly into inflated cost-per-token.
Mathematical Framework for Estimating Per-Token VRAM Consumption
To determine GPU memory overhead for KV cache in LLM serving, the calculation must account for both Key and Value tensors across all layers of the transformer. The formula for calculating the bytes required per token is:
Memory per Token (Bytes) = 2 × n_layers × n_kv_heads × head_dim × precision_bytes
* 2: Represents two tensors (Key and Value).
* n_layers: The number of transformer blocks in the architecture.
* n_kv_heads: The number of Key/Value heads. For MHA this equals the number of query heads (n_heads); for GQA and MQA it is smaller.
* head_dim: The dimensionality of each head (typically hidden_size / n_heads, where n_heads is the number of query heads).
* precision_bytes: The size of the data type (4 for FP32, 2 for FP16/BF16, 1 for INT8 or FP8).
For a Llama-2-7B model:
* Layers: 32
* Query Heads: 32
* KV Heads: 32 (MHA, no head sharing)
* Head Dim: 128
* Precision: FP16 (2 bytes)
Calculation: 2 × 32 × 32 × 128 × 2 = 524,288 bytes per token (0.5 MB).
At a context length of 4096 tokens, the cache for a single request is 2,048 MB, or exactly 2 GB.
The formula above uses KV heads, not query heads. This distinction becomes critical for architectures employing Grouped-Query Attention (GQA), where n_kv_heads < n_heads. Failure to substitute the correct head count produces inflated estimates that mislead capacity planning.
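The per-token formula is easy to capture in a small helper. The following is a minimal sketch; the function names are illustrative and not taken from any serving framework, and "GB" is treated as 2^30 bytes to match the figures in the text.

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             precision_bytes: int = 2) -> int:
    """Bytes of KV cache added per token: Keys and Values across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * precision_bytes


def kv_cache_bytes_per_request(per_token_bytes: int, context_len: int) -> int:
    """Bytes of KV cache held by one request at the given context length."""
    return per_token_bytes * context_len


# Llama-2-7B figures from the text: 32 layers, 32 KV heads, head dim 128, FP16.
llama2_7b = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)
print(llama2_7b)                                             # 524288
print(kv_cache_bytes_per_request(llama2_7b, 4096) / 2**30)   # 2.0
```

Passing the KV head count rather than the query head count is the only change needed to cover GQA and MQA models.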
| Parameter | Llama-2-7B | Llama-2-70B | Mistral-7B-v0.1 |
|---|---|---|---|
| Attention Mechanism | MHA | GQA (8:1) | GQA (4:1) |
| Total Layers | 32 | 80 | 32 |
| Query Heads | 32 | 64 | 32 |
| KV Heads | 32 | 8 | 8 |
| Head Dimension | 128 | 128 | 128 |
| KV Cache per Token (FP16) | 512 KB | 320 KB | 128 KB |
| KV Cache for 4k Context | 2 GB | 1.25 GB | 0.5 GB |
The table demonstrates that architectural choices like Grouped-Query Attention significantly alter the memory profile, allowing larger models to maintain manageable cache sizes relative to their parameter count. Llama-2-70B, despite having 10× the parameters of its 7B sibling, consumes less than two-thirds the KV cache per request. The constraint shifts from cache size to weight loading, a fundamentally different optimization problem.
Batch Multiplier Effect
Single-request estimates are misleading. Production serving handles concurrent requests, and each carries its own KV cache. The total KV cache VRAM consumption at any instant is:
Total KV Cache (Bytes) = batch_size × sequence_length × per_token_bytes
If the Llama-2-7B example above is deployed with a batch size of 32 at full 4096 context, the KV cache alone consumes 64 GB—occupying the majority of an H100-80GB's memory before accounting for model weights, activations, or framework overhead. This arithmetic explains why naive deployments hit OOM within the first few minutes of load testing.
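The batch arithmetic can be checked, and inverted to ask how many requests fit in a given budget, with a short sketch. The 60 GB budget below is a hypothetical figure chosen for illustration, not a measured value, and the helper names are made up for this example.

```python
GIB = 2**30  # the text's "GB" figures are treated as 2**30 bytes


def total_kv_cache_bytes(batch_size: int, sequence_length: int,
                         per_token_bytes: int) -> int:
    """Total KV cache held by a batch of equally long requests."""
    return batch_size * sequence_length * per_token_bytes


def max_batch_size(kv_budget_bytes: int, sequence_length: int,
                   per_token_bytes: int) -> int:
    """Largest batch whose KV cache fits in the given VRAM budget."""
    return kv_budget_bytes // (sequence_length * per_token_bytes)


PER_TOKEN_LLAMA2_7B = 524_288  # bytes per token, from the calculation above

print(total_kv_cache_bytes(32, 4096, PER_TOKEN_LLAMA2_7B) / GIB)  # 64.0

# Hypothetical budget: VRAM left for KV cache after weights and overhead.
print(max_batch_size(60 * GIB, 4096, PER_TOKEN_LLAMA2_7B))        # 30
```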
Architectural Strategies to Mitigate Cache Bloat: MQA and GQA
Multi-Head Attention (MHA) is memory-expensive because it requires a unique Key and Value head for every Query head. Multi-Query Attention (MQA), introduced by Noam Shazeer in 2019, reduces this overhead by sharing a single Key and Value head across all Query heads within a layer. While MQA drastically cuts memory usage—by a factor equal to n_heads—it can degrade model quality and convergence stability, particularly on tasks requiring fine-grained attention patterns.
Grouped-Query Attention (GQA) serves as the industry-standard middle ground. It partitions Query heads into groups, with each group sharing one KV head. The reduction factor is calculated as (n_heads / n_kv_heads). In the Llama-3-70B architecture, GQA reduces the KV cache size by a factor of 8 compared to MHA. This reduction is critical for maintaining high batch sizes on hardware with limited VRAM, such as the NVIDIA A100-40GB.
The practical impact is measurable. A Llama-2-70B model using MHA (64 KV heads) would require 2.5 MB per token in FP16. With GQA (8 KV heads), the same model drops to 320 KB per token, an 8× reduction. At 4096 tokens and batch size 16, that is 160 GB versus 20 GB of KV cache, a difference of 140 GB. The delta alone exceeds the capacity of a single H100-80GB.
Without GQA, the memory pressure from the KV cache would prevent the use of large batch sizes, effectively capping the throughput of the inference server regardless of the GPU's compute (TFLOPS) capability. This is why most major model releases since 2023, including Llama-3, Mistral, and Qwen-2, ship with GQA by default. For cost-effective serving, the architecture is effectively a prerequisite rather than an option.
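To make the reduction factor concrete, the sketch below compares MHA, GQA, and MQA on the 7B-class shape from the table (32 layers, 32 query heads, head dimension 128, FP16). The helper names are illustrative only.

```python
def kv_reduction_factor(n_heads: int, n_kv_heads: int) -> int:
    """KV cache reduction relative to MHA: n_heads / n_kv_heads."""
    if n_heads % n_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    return n_heads // n_kv_heads


def per_token_kib(n_layers: int, n_kv_heads: int, head_dim: int,
                  precision_bytes: int = 2) -> float:
    """KV cache per token in KiB (Keys and Values, all layers)."""
    return 2 * n_layers * n_kv_heads * head_dim * precision_bytes / 1024


N_LAYERS, N_HEADS, HEAD_DIM = 32, 32, 128

for label, n_kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    factor = kv_reduction_factor(N_HEADS, n_kv_heads)
    size = per_token_kib(N_LAYERS, n_kv_heads, HEAD_DIM)
    print(f"{label}: {n_kv_heads} KV heads, {size} KiB/token, {factor}x smaller than MHA")
```

With 8 KV heads the per-token cost matches the Mistral-7B column of the table; MQA pushes the same shape down to 16 KiB per token.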
Dynamic Memory Management with PagedAttention and vLLM
Traditional inference engines allocate KV cache contiguously, mirroring how standard C++ or Python arrays are handled. This leads to two types of waste:
1. Reserved (Internal) Fragmentation: Memory allocated for the maximum possible sequence length (e.g., 8192) that is never fully used by shorter requests.
2. External Fragmentation: Small gaps between allocated blocks that are too small for new requests.
If you pre-allocate an 8192-token buffer for every incoming request, and the average request length is 512 tokens, you waste 93.75% of every allocated block. Across a batch of 64 requests, this amounts to gigabytes of dead VRAM that cannot serve any useful purpose.
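The waste figure follows directly from the ratio of used to reserved tokens. As a rough sketch, assuming a per-token cost of 128 KB (the GQA 7B-class figure from the table, chosen here only for illustration):

```python
GIB = 2**30


def preallocation_waste_fraction(reserved_tokens: int, used_tokens: int) -> float:
    """Fraction of a pre-allocated buffer that never holds real tokens."""
    return 1 - used_tokens / reserved_tokens


def wasted_bytes(batch_size: int, reserved_tokens: int, used_tokens: int,
                 per_token_bytes: int) -> int:
    """Reserved-but-unused KV cache bytes across a whole batch."""
    return batch_size * (reserved_tokens - used_tokens) * per_token_bytes


print(preallocation_waste_fraction(8192, 512))          # 0.9375

# Assumed per-token cost: 128 KiB (illustrative GQA 7B-class model).
print(wasted_bytes(64, 8192, 512, 128 * 1024) / GIB)    # 60.0
```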
PagedAttention, the core innovation of the vLLM project (2023), solves this by partitioning the KV cache into fixed-size blocks—typically 16 tokens each. These blocks do not need to be contiguous in physical VRAM. A block table maps logical blocks of the sequence to physical blocks in the GPU memory. This is architecturally identical to how virtual memory and paging function in modern operating systems.
The mechanics work as follows: when a request starts, the engine allocates a single block for the first 16 tokens. As the sequence grows, new blocks are allocated on demand from a global free list. When a request completes, its blocks return to the pool immediately. No compaction is required. No defragmentation pass runs. The block table indirection layer handles all address translation at negligible computational cost.
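The allocation steps above can be sketched as a toy block allocator. This is a simplified model for illustration, not vLLM's implementation: the class and method names are hypothetical, and it tracks only block ownership, not tensor data.

```python
class BlockAllocator:
    """Toy paged KV cache: fixed-size blocks, a free list, per-request block tables."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # global free list of physical block ids
        self.block_tables = {}                      # request id -> physical block ids, in logical order
        self.lengths = {}                           # request id -> tokens stored so far

    def append_token(self, request_id: str) -> None:
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:  # no block yet, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1

    def locate(self, request_id: str, token_index: int):
        """Translate a logical token position into (physical block, offset)."""
        block = self.block_tables[request_id][token_index // self.block_size]
        return block, token_index % self.block_size

    def free(self, request_id: str) -> None:
        """Return all of a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id))
        del self.lengths[request_id]


alloc = BlockAllocator(num_blocks=8)
for _ in range(16):
    alloc.append_token("a")   # fills request a's first block
for _ in range(5):
    alloc.append_token("b")   # request b takes the next free block
for _ in range(4):
    alloc.append_token("a")   # request a grows into a non-adjacent block

print(alloc.block_tables)      # {'a': [7, 5], 'b': [6]}
print(alloc.locate("a", 17))   # (5, 1)
alloc.free("a")
print(len(alloc.free_blocks))  # 7
```

Request "a" ends up with physical blocks 7 and 5, which are not adjacent; the block table hides that from the attention computation.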
With PagedAttention, memory waste within the cache drops to near zero: the only remaining waste occurs in the final block of a sequence (at most 15 tokens of unused space with 16-token blocks). Performance benchmarks indicate that PagedAttention can reclaim up to 90% of VRAM previously lost to fragmentation. In LLM serving, memory layout is therefore a first-order design decision: the choice between contiguous and non-contiguous blocks directly determines how many requests fit on a GPU.
PagedAttention eliminates the requirement for contiguous memory blocks, reducing fragmentation-related waste by up to 90%.
The secondary benefit is memory sharing during beam search and parallel sampling. When multiple candidate sequences share a common prefix, PagedAttention allows them to reference the same physical blocks for the shared portion. Copy-on-write semantics apply: a block is duplicated only when one branch diverges. In beam search with width 4, this can reduce KV cache memory by 50–70% depending on the overlap between beams.
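The sharing mechanism can be sketched with reference counts. The example below is a toy model under simplifying assumptions (hypothetical class and method names, ownership tracking only; a real engine would also copy the block's tensor contents when a branch diverges).

```python
class SharedBlockPool:
    """Toy copy-on-write pool: blocks carry reference counts instead of data."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.ref_counts = {}

    def allocate(self) -> int:
        block = self.free_blocks.pop()
        self.ref_counts[block] = 1
        return block

    def fork(self, block_table):
        """Create a new sequence that shares every block of its parent."""
        for block in block_table:
            self.ref_counts[block] += 1
        return list(block_table)

    def write_last_block(self, block_table) -> int:
        """Make the last block exclusively owned before writing to it."""
        block = block_table[-1]
        if self.ref_counts[block] > 1:           # shared: diverge via copy-on-write
            self.ref_counts[block] -= 1
            block_table[-1] = self.allocate()    # a real engine copies the data here
        return block_table[-1]


pool = SharedBlockPool(num_blocks=8)
prefix = [pool.allocate() for _ in range(3)]              # 3-block shared prompt
beams = [prefix] + [pool.fork(prefix) for _ in range(3)]  # beam width 4

for beam in beams:
    pool.write_last_block(beam)                           # each beam appends its own token

blocks_in_use = 8 - len(pool.free_blocks)
print(blocks_in_use)   # 6, versus 12 if each beam held a private copy
```

In this toy case the two full prefix blocks stay shared and only the partially filled last block is duplicated, which halves the block count for four beams.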
Quantization Tactics for Reducing KV Cache Footprint by 4x
Model weight quantization (e.g., 4-bit or 8-bit) is standard for fitting large models into consumer hardware. However, KV cache quantization is a separate optimization targeted specifically at inference runtime. Since the KV cache consists of activations rather than static weights, quantization must be performed dynamically or using calibrated scales.
The core challenge with KV cache quantization is outlier sensitivity. Transformer activations—particularly the Key tensors—contain outlier features with magnitudes 10–100× larger than the median. Naive per-tensor quantization collapses the representational range for the majority of values. Solutions include per-channel scaling, group-wise quantization, and outlier-aware encoding schemes.
* INT8 KV Cache: Reduces memory usage by 2× compared to FP16. This involves mapping the range of values in the Key and Value tensors to an 8-bit integer format. While memory is halved, a slight increase in perplexity is often observed unless per-token or per-channel scaling factors are used. The computational overhead of quantization and dequantization is minimal on modern GPUs with INT8 tensor cores.
* FP8 KV Cache: Uses the native FP8 support in NVIDIA Hopper (H100) and Blackwell (B200) architectures. FP8 provides the same 2× reduction relative to FP16 as INT8, typically with less precision loss, because the floating-point format better captures the distribution of activation values. E4M3 (4-bit exponent, 3-bit mantissa) is the preferred format for inference, offering sufficient dynamic range for attention computation.
* 4-bit Quantization: Experimental techniques can reduce the cache by 4× relative to FP16. However, 4-bit KV caches typically require complex dequantization kernels that can introduce latency overhead, potentially negating the benefits of increased batch sizes. Recent work such as KVQuant and KIVI reports that KV caches at 4 bits and below can maintain acceptable quality on models up to 70B parameters when Keys are quantized per channel and Values per token.
The trade-off is clear: lower precision allows for higher batch sizes and longer sequences but risks degrading the model's reasoning capabilities. In production, the decision matrix depends on the task. Retrieval-augmented generation and summarization tolerate aggressive quantization, down to INT4. Code generation and mathematical reasoning are more sensitive; FP8 is typically the safe lower bound.
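The outlier problem described above is easy to reproduce on a toy tensor. The sketch below uses plain Python and made-up values (four tokens, three channels, with the third channel acting as the outlier); it compares one shared scale against one scale per channel using simple symmetric absmax INT8 quantization, which stands in for the more elaborate schemes used in practice.

```python
def quantize_int8(values):
    """Symmetric absmax quantization: one scale shared by all given values."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    return [q * scale for q in quantized]


# Made-up Key values: rows are tokens, columns are channels; channel 2 is an outlier.
keys = [
    [0.10, -0.20, 50.0],
    [0.30, 0.05, -48.0],
    [-0.25, 0.15, 52.0],
    [0.20, -0.10, 49.0],
]


def small_channel_error(restored_rows):
    """Worst-case absolute error on the two non-outlier channels."""
    return max(abs(orig[c] - rest[c])
               for orig, rest in zip(keys, restored_rows) for c in (0, 1))


# Per-tensor: a single scale, dominated by the outlier channel.
flat = [v for row in keys for v in row]
q_flat, flat_scale = quantize_int8(flat)
restored_flat = dequantize(q_flat, flat_scale)
per_tensor_rows = [restored_flat[i:i + 3] for i in range(0, len(restored_flat), 3)]

# Per-channel: each column gets its own scale.
restored_cols = []
for column in zip(*keys):
    q_col, col_scale = quantize_int8(list(column))
    restored_cols.append(dequantize(q_col, col_scale))
per_channel_rows = [list(row) for row in zip(*restored_cols)]

per_tensor_err = small_channel_error(per_tensor_rows)
per_channel_err = small_channel_error(per_channel_rows)
print(f"per-tensor error:  {per_tensor_err:.4f}")
print(f"per-channel error: {per_channel_err:.4f}")
```

With a single scale, most of the small values round to zero; with per-channel scales, the error on those channels falls by roughly two orders of magnitude in this example.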
Quantization Impact on Effective Capacity
The arithmetic is straightforward. Consider Llama-3-70B with GQA and an FP16 KV cache on H100-80GB GPUs:
* Model weights (FP16): ~140 GB → requires 2 GPUs with tensor parallelism.
* KV cache per token: ~320 KB (2 × 80 layers × 8 KV heads × 128 head dim × 2 bytes).
* Remaining VRAM on a 2-GPU setup: ~20 GB for KV cache, before framework overhead.
* At 8k context, each request needs ~2.5 GB, so max batch ≈ 8 requests.
Switch to FP8 KV cache:
* KV cache per token: ~160 KB.
* Max batch ≈ 16 requests at the same context length.
* Throughput roughly doubles without any architectural changes.
This is the compounding effect: quantization, GQA, and PagedAttention stack multiplicatively. A system implementing all three can serve 5–10× more concurrent requests than a naive MHA + FP16 + contiguous allocation baseline on the same hardware.
Hardware Utilization and Throughput Dynamics
The relationship between KV cache and throughput is inverse. Larger KV caches per request mean fewer requests can fit into the GPU simultaneously. If a Llama-3-8B model is served on an A100-80GB:
1. Model weights (FP16) occupy ~16 GB.
2. Framework and activation overhead: ~5 GB.
3. Remaining VRAM for KV cache: ~59 GB.
4. If each request at 8k context requires ~1 GB of KV cache, the maximum batch size is ~59.
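The budget above can be reproduced in a few lines. The sketch assumes the Llama-3-8B shape is 32 layers, 8 KV heads, and head dimension 128 (not stated in the text, but it yields the ~1 GB per request figure at 8k context), and it treats "GB" as 2^30 bytes. Function names are illustrative.

```python
GIB = 2**30


def kv_budget_bytes(total_vram_gib: float, weights_gib: float,
                    overhead_gib: float) -> int:
    """VRAM left for KV cache after weights and framework/activation overhead."""
    return int((total_vram_gib - weights_gib - overhead_gib) * GIB)


def max_concurrent_requests(budget_bytes: int, context_len: int,
                            per_token_bytes: int) -> int:
    """How many full-context requests fit in the KV cache budget."""
    return budget_bytes // (context_len * per_token_bytes)


# Assumed Llama-3-8B shape: 32 layers, 8 KV heads, head dim 128, FP16 (2 bytes).
per_token_fp16 = 2 * 32 * 8 * 128 * 2     # 131072 bytes = 128 KiB
per_token_fp8 = per_token_fp16 // 2       # 1 byte per element

budget = kv_budget_bytes(total_vram_gib=80, weights_gib=16, overhead_gib=5)

print(max_concurrent_requests(budget, 8192, per_token_fp16))  # 59
print(max_concurrent_requests(budget, 8192, per_token_fp8))   # 118
```

The second line shows the effect of an FP8 KV cache on the same budget: the per-request footprint halves, so the ceiling on concurrent requests doubles.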
Pushing beyond this limit forces the engine to swap KV cache to CPU memory or reject incoming requests. CPU offloading introduces 10–50× latency penalties per token due to PCIe bandwidth constraints (roughly 64 GB/s per direction on PCIe 5.0 x16 versus 3.35 TB/s for HBM3 on an H100). Neither outcome is acceptable for latency-sensitive applications.
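A back-of-the-envelope estimate shows where the penalty comes from. The sketch below uses the peak bandwidth figures quoted above and a 1 GB per-request cache; real transfers achieve less than peak, so treat the result as a lower bound on the gap rather than a measurement.

```python
def transfer_ms(num_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to move a buffer once at the given peak bandwidth, in milliseconds."""
    return num_bytes / bandwidth_bytes_per_s * 1000


KV_CACHE_BYTES = 2**30   # one request's ~1 GB KV cache at 8k context
PCIE_5_BW = 64e9         # bytes/s, peak figure quoted in the text
HBM3_BW = 3.35e12        # bytes/s, peak figure quoted in the text

pcie_ms = transfer_ms(KV_CACHE_BYTES, PCIE_5_BW)
hbm_ms = transfer_ms(KV_CACHE_BYTES, HBM3_BW)

print(f"PCIe: {pcie_ms:.1f} ms, HBM: {hbm_ms:.2f} ms, ratio: {pcie_ms / hbm_ms:.0f}x")
```

Every decode step has to read the full cache, so a request whose cache lives in host memory pays the PCIe cost on each generated token.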
To increase throughput, the infrastructure lead must either reduce the per-request cache size (via quantization or GQA) or increase the available VRAM (larger GPUs or multi-GPU sharding). When the memory limit is reached, the inference engine becomes memory-bound. In this state, the GPU compute cores are underutilized (low achieved TFLOPS) because the system is waiting on memory I/O or simply cannot fit more requests into VRAM.
The diagnostic signal is clear: high memory bandwidth utilization (>80%) combined with low compute utilization (<30%) indicates a KV cache bottleneck. This pattern is visible in nvidia-smi and more granularly through tools like Nsight Systems. The fix is never "buy more compute"—it is "reduce memory pressure per request."
Final Performance-to-Cost Verdict
Determining GPU memory overhead for KV cache is a requirement for calculating the Total Cost of Ownership (TCO) of LLM infrastructure.
* MHA-based models (e.g., original Llama-7B) are inefficient for long-context serving and should be avoided in production environments favoring throughput.
* GQA-based models (e.g., Llama-3, Mistral) are mandatory for cost-effective scaling.
* PagedAttention is the baseline requirement for any inference stack; contiguous allocation is obsolete.
* FP8 Quantization is the recommended optimization path for H100 deployments, providing a 2× memory reduction with negligible impact on output quality.
Failure to accurately calculate and manage KV cache overhead results in sub-optimal GPU utilization, where expensive H100 clusters operate at 20–30% of their theoretical throughput due to memory-induced bottlenecks. The engineers who treat KV cache math as a first-class operational concern—running the numbers before deploying, not after—consistently extract 3–5× more value per GPU dollar. Precision in cache estimation is the difference between a viable product and an infrastructure deficit.