What It Does

Choose your hardware (Pi 5, Mac M4 Pro, RTX 4090, H100, or custom) and your target model (Llama 4 Maverick, Qwen 3 32B, GPT-OSS 120B, or custom). The tool returns:

Estimated VRAM/RAM usage including KV cache at your context length
Recommended --gpu-layers (-ngl) for partial offload
Verdict: fits fully on GPU / partial offload / RAM-only / OOM
Complete llama-cli command with all flags
Equivalent Ollama Modelfile
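
The verdict and the recommended -ngl value come down to simple arithmetic. The sketch below is not the tool's actual implementation: it assumes a discrete GPU with separate VRAM and RAM, evenly sized layers, and a flat 1 GiB overhead per device. The names Plan and plan_offload are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    verdict: str      # one of the four verdicts listed above
    gpu_layers: int   # value to pass as -ngl


def plan_offload(weights_gib, kv_gib, n_layers, vram_gib, ram_gib, overhead_gib=1.0):
    """Classify a model/hardware pair and suggest a layer count for the GPU.

    Assumption: weights and KV cache are spread evenly across layers, and
    each device needs `overhead_gib` for buffers and runtime state.
    """
    need = weights_gib + kv_gib
    if vram_gib > 0 and need + overhead_gib <= vram_gib:
        return Plan("fits fully on GPU", n_layers)

    per_layer = need / n_layers
    # Largest layer count that fits in VRAM after overhead, clamped to
    # [0, n_layers - 1] because the full model is already known not to fit.
    fit = int((vram_gib - overhead_gib) / per_layer)
    gpu_layers = max(0, min(n_layers - 1, fit))

    # Whatever is not offloaded has to live in system RAM.
    cpu_share = need - gpu_layers * per_layer
    if cpu_share + overhead_gib > ram_gib:
        return Plan("OOM", 0)
    if gpu_layers == 0:
        return Plan("RAM-only", 0)
    return Plan("partial offload", gpu_layers)


# Illustrative numbers: 18 GiB of weights, 8 GiB of KV cache, 64 layers,
# a 24 GiB GPU and 64 GiB of system RAM.
print(plan_offload(18, 8, 64, vram_gib=24, ram_gib=64))
# Plan(verdict='partial offload', gpu_layers=56)
```

Unified-memory machines such as Apple Silicon share one pool between CPU and GPU, so the separate VRAM and RAM checks here would need to be merged for them.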

Why It Saves Time

For every new model you deploy locally, you tune the same flags:

--gpu-layers (binary-search until it fits)
--ctx-size (depends on KV cache memory)
--threads (depends on CPU)
--batch-size (depends on use case)
--cache-type-k / --cache-type-v (only matter if you’re VRAM-constrained)
--mlock (only on systems prone to swap)

This calculator does the math for all of them based on your hardware and the model’s dimensions.
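
Once the values are chosen, assembling the outputs is mechanical. The following is a minimal sketch, not the calculator's own code: the helper names llama_cli_command and ollama_modelfile are hypothetical, the default batch size of 512 is an assumption, and num_gpu is assumed to be the Ollama parameter that controls layer offload (check it against your Ollama version).

```python
import shlex


def llama_cli_command(model_path, gpu_layers, ctx_size, threads,
                      batch_size=512, cache_type="f16", mlock=False):
    """Build a llama-cli invocation from the flags listed above."""
    args = [
        "llama-cli", "-m", model_path,
        "-ngl", str(gpu_layers),
        "--ctx-size", str(ctx_size),
        "--threads", str(threads),
        "--batch-size", str(batch_size),
    ]
    # Only pass cache-type flags when moving away from the f16 default.
    if cache_type != "f16":
        args += ["--cache-type-k", cache_type, "--cache-type-v", cache_type]
    if mlock:
        args.append("--mlock")
    # shlex.join quotes any argument that contains shell-special characters.
    return shlex.join(args)


def ollama_modelfile(model_path, gpu_layers, ctx_size):
    """Build a small Modelfile carrying the same context and offload settings."""
    lines = [
        f"FROM {model_path}",
        f"PARAMETER num_ctx {ctx_size}",
        f"PARAMETER num_gpu {gpu_layers}",  # assumed offload parameter
    ]
    return "\n".join(lines) + "\n"


print(llama_cli_command("model.gguf", 56, 8192, 8, cache_type="q8_0", mlock=True))
print(ollama_modelfile("./model.gguf", 56, 8192))
```

Building the command as a list and joining it at the end keeps paths with spaces safe to paste into a shell.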

Quantization Reference

Type	Bits per weight	Quality	Use when
Q2_K	2.625	Poor	The model will not fit any other way
Q3_K_M	3.91	Acceptable	Aggressive VRAM constraints
Q4_K_M	4.83	Sweet spot	Default: best quality/size ratio
Q5_K_M	5.69	Very good	Have headroom, want better quality
Q6_K	6.56	Near-lossless	Have plenty of VRAM
Q8_0	8.5	Effectively lossless	Reference for evals
F16	16	Lossless	Full precision (rarely useful)
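
The bits-per-weight column converts directly into an approximate weight size: parameters times bits, divided by 8 for bytes. The sketch below uses the table's values; weight_size_gib is an illustrative helper name, and the 32-billion figure is a nominal parameter count, not an exact one. Real GGUF files differ slightly because some tensors are stored at other precisions.

```python
# Approximate bits per weight, copied from the table above.
BITS_PER_WEIGHT = {
    "Q2_K": 2.625,
    "Q3_K_M": 3.91,
    "Q4_K_M": 4.83,
    "Q5_K_M": 5.69,
    "Q6_K": 6.56,
    "Q8_0": 8.5,
    "F16": 16.0,
}


def weight_size_gib(n_params, quant):
    """Estimate the size of the weights in GiB for a quantization type."""
    total_bits = n_params * BITS_PER_WEIGHT[quant]
    return total_bits / 8 / 2**30


# A nominal 32-billion-parameter model at each quantization level.
for quant in BITS_PER_WEIGHT:
    print(f"{quant:7s} {weight_size_gib(32e9, quant):6.1f} GiB")
```

At Q4_K_M this comes out to roughly 18 GiB, before adding the KV cache.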

KV Cache Precision

Type	KV cache size (relative to f16)	Quality
f16 (default)	100%	No loss
q8_0	~53%	Tiny degradation
q4_0	~28%	Some degradation, fine for chat

Drop to q8_0 first if you’re tight on VRAM; q4_0 is the last resort.
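
To see what each cache type buys at a given context length, the usual estimate is two tensors (K and V) per layer, each holding n_kv_heads × head_dim values per token. The sketch below assumes a standard transformer with full attention in every layer; kv_cache_gib is an illustrative name and the model dimensions are example values. q8_0 and q4_0 store a 16-bit scale per 32-element block, so they cost 8.5 and 4.5 bits per element, a little more than a plain half or quarter of f16.

```python
# Bits per cached element, including the per-block scale for quantized types.
KV_BITS = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, cache_type="f16"):
    """Estimate KV cache size in GiB for one sequence of n_ctx tokens."""
    # Factor of 2 covers the separate K and V tensors in every layer.
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * KV_BITS[cache_type] / 8 / 2**30


# Example dimensions: 64 layers, 8 KV heads, head_dim 128, 32k context.
for cache_type in KV_BITS:
    size = kv_cache_gib(64, 8, 128, 32768, cache_type)
    print(f"{cache_type:5s} {size:5.2f} GiB")
```

For these dimensions the cache is 8 GiB at f16, 4.25 GiB at q8_0, and 2.25 GiB at q4_0.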

What’s NOT Included

vLLM, SGLang, TGI config (different ecosystem)
Quantization conversion (use llama.cpp’s llama-quantize tool)
Speculative decoding parameters
Multi-GPU sharding (--tensor-split)

Related Tools

GPU VRAM Calculator - inference VRAM only, more model variants
LoRA Memory Calculator - training memory
Reasoning Cost Calculator - when local isn’t worth it

Llama.cpp / Ollama Config Builder
