Pick hardware + model → get optimal CLI flags and an Ollama Modelfile. No more trial-and-error tuning.
As-is, no warranty. These apps are free under their listed license and run entirely in your browser. Use at your own risk — don't blame me if your PC catches fire, your dog runs away, or the math turns out wrong. Verify anything that actually matters. None of this is professional financial, medical, legal, or engineering advice.
Choose your hardware (Pi 5, Mac M4 Pro, RTX 4090, H100, custom) and your target model (Llama 4 Maverick, Qwen 3 32B, GPT-OSS 120B, custom). The tool returns:

- `--gpu-layers` (`-ngl`) for partial offload
- A `llama-cli` command with all flags

For every new model you deploy locally, you tune the same flags:

- `--gpu-layers` (binary-search until it fits)
- `--ctx-size` (depends on KV cache memory)
- `--threads` (depends on CPU)
- `--batch-size` (depends on use case)
- `--cache-type-k` / `--cache-type-v` (only matters if you’re VRAM-constrained)
- `--mlock` (only on systems prone to swap)

This calculator does the math for all of them based on hardware and model dimensions.
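As a rough illustration of replacing the binary search with arithmetic, the sketch below estimates `--gpu-layers` and assembles a `llama-cli` command. It is not the calculator's actual code: the function names, the equal-sized-layers assumption, and the 1 GiB overhead reserve are all hypothetical choices made for this example.

```python
import shlex

GIB = 1024**3

def estimate_gpu_layers(model_bytes, n_layers, vram_bytes,
                        kv_cache_bytes=0, overhead_bytes=1 * GIB):
    """Estimate how many layers fit in VRAM.

    Assumption: every layer is the same size, and a fixed overhead
    (compute buffers, driver) is reserved. Real layers vary in size.
    """
    per_layer = model_bytes / n_layers
    budget = vram_bytes - kv_cache_bytes - overhead_bytes
    if budget <= 0:
        return 0
    return min(n_layers, int(budget // per_layer))

def build_command(model_path, gpu_layers, ctx_size, threads, batch_size,
                  cache_type=None, mlock=False):
    """Assemble a llama-cli command line from the chosen values."""
    args = [
        "llama-cli", "-m", model_path,
        "--gpu-layers", str(gpu_layers),
        "--ctx-size", str(ctx_size),
        "--threads", str(threads),
        "--batch-size", str(batch_size),
    ]
    if cache_type:  # only needed when VRAM-constrained
        args += ["--cache-type-k", cache_type, "--cache-type-v", cache_type]
    if mlock:       # only on systems prone to swap
        args.append("--mlock")
    return shlex.join(args)

# Illustrative numbers: a 20 GiB model with 64 layers on a 16 GiB GPU,
# with 2 GiB set aside for the KV cache.
layers = estimate_gpu_layers(20 * GIB, 64, 16 * GIB, kv_cache_bytes=2 * GIB)
print(layers)  # 41
print(build_command("model.gguf", layers, 8192, 8, 512, cache_type="q8_0"))
```

Treat the result as a starting point: if the model fails to load or runs out of memory, lower the layer count by a few and retry.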
Weight quantization types:

| Type | Bits per weight | Quality | Use when |
|---|---|---|---|
| Q2_K | 2.625 | Poor | You really cannot fit anything larger |
| Q3_K_M | 3.91 | Acceptable | Aggressive VRAM constraints |
| Q4_K_M | 4.83 | Sweet spot | Default: best quality/size ratio |
| Q5_K_M | 5.69 | Very good | Have headroom, want better quality |
| Q6_K | 6.56 | Near-lossless | Have plenty of VRAM |
| Q8_0 | 8.5 | Effectively lossless | Reference for evals |
| F16 | 16 | Lossless | Full precision (rarely useful) |
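To make the bits-per-weight column concrete, here is a small sketch that turns it into an approximate weight size and picks a quantization type for a memory budget. The bits-per-weight values are copied from the table; the function names and the "largest type that fits" policy are assumptions for illustration, and real GGUF files differ somewhat because not every tensor uses the same type.

```python
# Bits per weight, as listed in the table above.
BITS_PER_WEIGHT = {
    "Q2_K": 2.625,
    "Q3_K_M": 3.91,
    "Q4_K_M": 4.83,
    "Q5_K_M": 5.69,
    "Q6_K": 6.56,
    "Q8_0": 8.5,
    "F16": 16.0,
}

def model_size_gb(n_params, quant):
    """Approximate weight size in GB (10^9 bytes) for a parameter count."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

def best_quant_that_fits(n_params, budget_gb):
    """Return the highest-precision type whose weights fit the budget.

    Hypothetical policy: try types from most to fewest bits per weight.
    Returns None if even the smallest type does not fit.
    """
    by_precision = sorted(BITS_PER_WEIGHT, key=BITS_PER_WEIGHT.get, reverse=True)
    for quant in by_precision:
        if model_size_gb(n_params, quant) <= budget_gb:
            return quant
    return None

# A 32B-parameter model at Q4_K_M: 32e9 * 4.83 / 8 bytes, about 19.32 GB.
print(round(model_size_gb(32e9, "Q4_K_M"), 2))  # 19.32
print(best_quant_that_fits(32e9, 24))           # Q5_K_M
```

The budget here covers weights only; leave room for the KV cache and runtime buffers before choosing a type.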
KV cache types (`--cache-type-k` / `--cache-type-v`):

| Cache type | KV cache VRAM (relative to f16) | Quality loss |
|---|---|---|
| f16 (default) | 100% | None |
| q8_0 | ~50% | Tiny degradation |
| q4_0 | ~25% | Some degradation, fine for chat |
Drop to q8_0 first if you’re tight on VRAM; q4_0 is the last resort.
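The "q8_0 first, q4_0 last" rule can be sketched with the standard KV cache size estimate: two tensors (K and V) per layer, each holding `n_kv_heads * head_dim` values per token. The relative sizes below are the approximate ratios from the table, and the model dimensions in the example are made-up round numbers, not those of any specific model.

```python
GIB = 1024**3

# Approximate cache size relative to f16, as listed in the table above.
CACHE_FACTOR = {"f16": 1.0, "q8_0": 0.5, "q4_0": 0.25}

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_size, cache_type="f16"):
    """Estimate KV cache size in bytes.

    f16 baseline: 2 tensors (K and V) * layers * KV heads * head dim
    * context tokens * 2 bytes per value, scaled by the cache type factor.
    """
    f16_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx_size * 2
    return f16_bytes * CACHE_FACTOR[cache_type]

def pick_cache_type(n_layers, n_kv_heads, head_dim, ctx_size, budget_bytes):
    """Prefer f16, then q8_0, and fall back to q4_0 only as a last resort."""
    for cache_type in ("f16", "q8_0", "q4_0"):
        size = kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_size, cache_type)
        if size <= budget_bytes:
            return cache_type
    return None  # nothing fits: reduce --ctx-size instead

# Hypothetical dimensions: 64 layers, 8 KV heads, head dim 128, 8192 tokens.
print(kv_cache_bytes(64, 8, 128, 8192) / GIB)          # 2.0
print(pick_cache_type(64, 8, 128, 8192, 1.5 * GIB))    # q8_0
```

When no cache type fits the budget, shrinking the context is the remaining lever, since the cache grows linearly with `--ctx-size`.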
- … quantize tool)
- … `--tensor-split`)