Estimate VRAM needed to run or fine-tune any LLM at any quantization, then see which GPUs fit
Example output: ~14.0 GB total VRAM for a 7B model · FP16 · 4K context · batch 1, with a per-device fit table (GPU/Device, VRAM, headroom, fit verdict).
GPU VRAM Calculator tells you whether a specific LLM will fit on specific hardware. Choose a model parameter count, quantization format, context length, and batch size — and get a VRAM breakdown (weights + KV cache + activations) with a fit/no-fit verdict against a curated hardware table covering consumer GPUs, datacenter cards, Apple Silicon, and edge devices.
VRAM for LLM inference has three components: model weights, KV cache, and activations.

Weights: params × bytes_per_param
| Quantization | Bytes/param | 7B model | 70B model |
|---|---|---|---|
| FP32 | 4.0 | 28 GB | 280 GB |
| FP16 / BF16 | 2.0 | 14 GB | 140 GB |
| FP8 | 1.0 | 7 GB | 70 GB |
| INT8 / Q8_0 | 1.0 | 7 GB | 70 GB |
| Q5_K_M | 0.6875 | 4.8 GB | 48 GB |
| Q4_K_M | 0.5625 | 3.9 GB | 39 GB |
| INT4 / Q4_0 | 0.5 | 3.5 GB | 35 GB |
| Q3_K_M | 0.4375 | 3.1 GB | 31 GB |
| Q2_K | 0.3125 | 2.2 GB | 22 GB |
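The weights row of the table above reduces to a one-line multiplication. A minimal sketch (function and dictionary names are my own, not from the tool):

```python
# Bytes per parameter for each quantization format, from the table above.
BYTES_PER_PARAM = {
    "FP32": 4.0, "FP16": 2.0, "FP8": 1.0, "INT8": 1.0,
    "Q5_K_M": 0.6875, "Q4_K_M": 0.5625, "INT4": 0.5,
    "Q3_K_M": 0.4375, "Q2_K": 0.3125,
}

def weights_gb(params_billions: float, quant: str) -> float:
    """VRAM for model weights in GB: params × bytes_per_param."""
    return params_billions * BYTES_PER_PARAM[quant]

print(weights_gb(7, "FP16"))     # 14.0
print(weights_gb(70, "Q4_K_M"))  # 39.375
```

Billions of parameters times bytes per parameter gives GB directly, which is why the table rows are easy to verify by hand.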
KV cache: 2 × layers × num_heads × head_dim × seq_len × batch × bytes_per_element

KV cache grows linearly with context length and batch size. For a 70B model at 32K context with batch 1 and FP16 KV, a GQA layout like Llama 3 70B's (80 layers, 8 KV heads, head_dim 128) needs approximately 10.7 GB on top of the weights; with standard MHA (64 KV heads) the figure is roughly 8× larger. Doubling context length doubles the KV cache. This is why long-context inference is significantly more expensive than short-context inference even for the same model weights.
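The KV cache formula translates directly to code. This sketch parameterizes by KV heads (for GQA models that is the number of shared KV heads, not query heads); the Llama 3 70B shape in the example is an assumption based on its published architecture:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int, seq_len: int,
                batch: int = 1, bytes_per_elem: int = 2) -> float:
    """KV cache in GB: 2 (K and V) × layers × kv_heads × head_dim
    × seq_len × batch × bytes_per_element."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

# Llama-3-70B-like shape: 80 layers, 8 KV heads (GQA), head_dim 128,
# 32K context, batch 1, FP16 KV cache.
print(kv_cache_gb(80, 8, 128, 32768))  # ≈ 10.74 GB
```

Doubling `seq_len` or `batch` doubles the result, matching the linear-growth behavior described above.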
Activations: approximately 10–15% of weights memory for inference (12% is used as the default). This covers the intermediate tensors produced during each forward pass.
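Putting the three components together, a total-VRAM estimate is a short sum. A sketch using the 12% activation default stated above (function name is mine):

```python
def total_inference_gb(weights_gb: float, kv_gb: float,
                       activation_frac: float = 0.12) -> float:
    """Total inference VRAM: weights + KV cache + activations,
    with activations estimated as a fraction of weights (12% default)."""
    return weights_gb + kv_gb + activation_frac * weights_gb

# e.g. 14 GB of FP16 weights plus ~2.1 GB of KV cache:
print(total_inference_gb(14.0, 2.1))  # ≈ 17.8 GB
```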
LoRA fine-tuning adds, on top of inference memory: the adapter weights, their gradients, optimizer states, and training-time activations.
This is the single most important thing to understand about LLM inference on consumer hardware: VRAM is the binding constraint, not compute (TFLOPS).
A 70B model at FP16 needs 140 GB of VRAM for the weights alone. An RTX 4090 has 24 GB. It doesn’t matter that the 4090 has 82.6 TFLOPS of FP16 performance — the model physically cannot fit. TFLOPS determines how fast tokens generate once the model is loaded; VRAM determines whether it can load at all.
This is why quantization exists and why it’s the first lever engineers reach for. Quantization converts FP16 (2 bytes per parameter) to lower precision (INT4 = 0.5 bytes per parameter) — a 4× reduction in memory footprint. The tradeoff is quality degradation that increases as precision decreases.
The quantization quality/size tradeoff in practice: Q8_0 is near-lossless; Q5_K_M and Q4_K_M are the usual sweet spot, trading minor quality loss for roughly 3–4× smaller weights; Q3_K_M shows noticeable degradation; Q2_K degrades heavily and is a last resort for fitting a model at all.
Practical example: can I run Llama 3 70B Q4 on an RTX 4090? No: Q4_K_M weights alone are about 39 GB (see the table above), well beyond the 4090's 24 GB.
Practical example: can I run Llama 3 8B Q4 on an RTX 4090? Yes, comfortably: Q4_K_M weights are about 4.5 GB, and with KV cache and activations the total stays well under 24 GB.
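Both questions reduce to comparing a total-VRAM estimate against the card's capacity. A worked sketch (the KV-cache and activation allowances for the 8B case are assumed round numbers, not the tool's exact figures):

```python
RTX_4090_VRAM_GB = 24.0  # device assumed for this example

def fits(total_gb: float, vram_gb: float = RTX_4090_VRAM_GB):
    """Return (fit verdict, headroom in GB) for a VRAM estimate."""
    headroom = vram_gb - total_gb
    return headroom >= 0, headroom

# 70B at Q4_K_M: 70 × 0.5625 ≈ 39.4 GB of weights alone -- no fit.
print(fits(39.4))
# 8B at Q4_K_M: 8 × 0.5625 = 4.5 GB weights, plus (assumed) ~1.1 GB
# KV cache and ~0.5 GB activations -- fits with ample headroom.
print(fits(4.5 + 1.1 + 0.5))
```

Note that the 70B case fails on weights alone, before KV cache or activations are even counted; that is the VRAM-as-binding-constraint point in action.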
Standard multi-head attention (MHA) allocates a separate KV cache entry per attention head. For a 70B model with 64 heads, this is expensive. Modern models (Llama 3+, Mistral, Gemma) use Grouped-Query Attention (GQA), which shares KV cache across groups of query heads. GQA reduces KV cache memory by 4–8×.
This calculator models standard MHA — if you’re running a GQA model (which most recent models are), the actual KV cache will be significantly lower than shown. Subtract roughly 75–87% from the KV cache figure for GQA models at group size 8. This means context length pressure is much less severe for modern GQA models than older MHA models.
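The MHA-vs-GQA gap is easy to quantify: with group size g, the number of KV heads drops by a factor of g, and the cache shrinks by the same factor. A sketch (function name and the 70B-like shape are assumptions):

```python
def kv_cache_gqa_gb(layers: int, num_heads: int, head_dim: int, seq_len: int,
                    batch: int = 1, bytes_per_elem: int = 2,
                    gqa_group_size: int = 1) -> float:
    """KV cache in GB; gqa_group_size=1 is standard MHA,
    larger values share KV heads across query-head groups."""
    kv_heads = num_heads / gqa_group_size
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

mha = kv_cache_gqa_gb(80, 64, 128, 32768)                    # ≈ 85.9 GB
gqa = kv_cache_gqa_gb(80, 64, 128, 32768, gqa_group_size=8)  # ≈ 10.7 GB
print(mha, gqa, 1 - gqa / mha)  # saving: 87.5% at group size 8
```

The 87.5% saving at group size 8 is where the "subtract roughly 75–87%" rule of thumb comes from (group size 4 gives 75%).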
LoRA (Low-Rank Adaptation) is the standard approach for fine-tuning large models on consumer hardware. Instead of updating all 70 billion parameters, LoRA inserts low-rank adapter matrices into each attention layer and trains only those. A rank-16 LoRA adapter for a 70B model has approximately 150M trainable parameters — about 0.2% of the total. This makes fine-tuning feasible on hardware that can barely run inference:
With 8-bit Adam and rank 16, a 13B model fine-tunes on approximately 16–20 GB of VRAM — achievable on an RTX 3090/4090.
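The trainable-parameter count for a LoRA adapter can be sketched from the shapes of the adapted matrices: each matrix of shape (d_out, d_in) gains rank × (d_in + d_out) parameters from its A and B adapters. The 70B-like shapes below are hypothetical (Llama-3-70B-style attention with GQA plus MLP projections), not figures from the tool:

```python
def lora_trainable_params(layers: int, dims, rank: int = 16) -> int:
    """Trainable LoRA params: for each adapted (d_out, d_in) matrix,
    the rank-r A and B adapters add r × (d_in + d_out) parameters."""
    return layers * sum(rank * (d_in + d_out) for d_out, d_in in dims)

# Hypothetical 70B-like shapes: q/o 8192×8192, k/v 1024×8192 (GQA),
# MLP gate/up 28672×8192 and down 8192×28672, across 80 layers.
dims = [(8192, 8192), (8192, 8192), (1024, 8192), (1024, 8192),
        (28672, 8192), (28672, 8192), (8192, 28672)]
print(lora_trainable_params(80, dims) / 1e6)  # ≈ 207 M
```

Adapting only the attention matrices gives roughly 65M parameters; adapting every linear layer gives roughly 207M, so the ~150M figure above is consistent in order of magnitude and depends on which matrices are adapted.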
The tool checks fit against NVIDIA consumer (RTX 3060 through 5090), NVIDIA datacenter (A100 40/80GB, H100, H200, A6000), Apple Silicon (M2 Ultra through M4 Max), and edge devices (Pi AI HAT 2, Jetson Orin).
For informational purposes only. Not financial, medical, or legal advice. You are solely responsible for how you use these tools.