Llama.cpp / Ollama Config Builder

Pick hardware + model → get optimal CLI flags and an Ollama Modelfile. No more trial-and-error tuning.

Estimates are within ±10%. Real-world VRAM use also depends on model architecture (MoE vs dense), llama.cpp version, and OS overhead.

As-is, no warranty. These apps are free under their listed license and run entirely in your browser. Use at your own risk — don't blame me if your PC catches fire, your dog runs away, or the math turns out wrong. Verify anything that actually matters. None of this is professional financial, medical, legal, or engineering advice.

© 2026 Rohit Burani · MIT · Built at gekro.com · View source ↗

Guide

What It Does

Choose your hardware (Pi 5, Mac M4 Pro, RTX 4090, H100, custom) and your target model (Llama 4 Maverick, Qwen 3 32B, GPT-OSS 120B, custom). The tool returns:

  • Estimated VRAM/RAM usage including KV cache at your context length
  • Recommended --gpu-layers (-ngl) for partial offload
  • Verdict: fits fully on GPU / partial offload / RAM-only / OOM
  • Complete llama-cli command with all flags
  • Equivalent Ollama Modelfile
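
The verdict and -ngl recommendation in the list above could be derived roughly as follows. This is a minimal sketch, not the tool's actual code: the function name `recommend_offload`, the 1 GB overhead allowance, and the assumption that every layer costs the same amount of memory are all illustrative. It also treats VRAM and system RAM as separate pools, which does not hold on unified-memory machines.

```python
def recommend_offload(weights_gb, kv_gb, n_layers, vram_gb, ram_gb, overhead_gb=1.0):
    """Return (verdict, ngl) for a model of the given size on the given hardware.

    Assumptions (hypothetical, for illustration only):
    - overhead_gb covers compute buffers and runtime overhead on the GPU
    - weights and KV cache are spread evenly across n_layers
    """
    total_gb = weights_gb + kv_gb + overhead_gb

    # Everything fits in VRAM: offload every layer.
    if vram_gb > 0 and total_gb <= vram_gb:
        return "fits fully on GPU", n_layers

    # Not even VRAM plus RAM is enough.
    if total_gb > vram_gb + ram_gb:
        return "OOM", 0

    # No GPU at all: run from system RAM.
    if vram_gb <= 0:
        return "RAM-only", 0

    # Partial offload: count how many equal-sized layers fit in the usable VRAM.
    per_layer_gb = (weights_gb + kv_gb) / n_layers
    usable_gb = vram_gb - overhead_gb
    ngl = int(usable_gb // per_layer_gb) if usable_gb > 0 else 0
    ngl = max(0, min(ngl, n_layers))
    if ngl == 0:
        return "RAM-only", 0
    return "partial offload", ngl


if __name__ == "__main__":
    # Example: 20 GB of weights + 4 GB of KV cache on a 24 GB GPU with 64 GB RAM.
    print(recommend_offload(20, 4, 64, vram_gb=24, ram_gb=64))
```

A real build rarely has equal-sized layers, so treat the returned layer count as a starting point and adjust by a layer or two if loading fails.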

Why It Saves Time

For every new model you deploy locally, you tune the same flags:

  • --gpu-layers (binary-search until it fits)
  • --ctx-size (depends on KV cache memory)
  • --threads (depends on CPU)
  • --batch-size (depends on use case)
  • --cache-type-k / --cache-type-v (only matter if you’re VRAM-constrained)
  • --mlock (only on systems prone to swap)

This calculator does the math for all of them based on hardware + model dimensions.
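
To show how the flags listed above fit together once their values are chosen, here is a minimal sketch that assembles a llama-cli argument list and a matching Modelfile. The function names, the `./model.gguf` path, and the example values are hypothetical. The Modelfile parameters `num_gpu` and `num_thread` are assumed to be accepted by your Ollama version, and some llama.cpp builds also require flash attention for a quantized V cache, so check `llama-cli --help` before relying on the output.

```python
import shlex


def build_llama_cli_args(model_path, ngl, ctx_size, threads, batch_size,
                         cache_type="f16", mlock=False):
    """Assemble a llama-cli argument list from already-chosen settings."""
    args = [
        "llama-cli",
        "--model", model_path,
        "--gpu-layers", str(ngl),
        "--ctx-size", str(ctx_size),
        "--threads", str(threads),
        "--batch-size", str(batch_size),
    ]
    # Only pass cache-type flags when departing from the f16 default.
    if cache_type != "f16":
        args += ["--cache-type-k", cache_type, "--cache-type-v", cache_type]
    # --mlock keeps the model resident in RAM on systems prone to swapping.
    if mlock:
        args.append("--mlock")
    return args


def build_modelfile(model_path, ngl, ctx_size, threads):
    """Render a Modelfile with the same context, offload, and thread settings."""
    lines = [
        f"FROM {model_path}",
        f"PARAMETER num_ctx {ctx_size}",
        f"PARAMETER num_gpu {ngl}",        # assumed: number of layers sent to the GPU
        f"PARAMETER num_thread {threads}",  # assumed: CPU threads used for inference
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical settings for illustration.
    args = build_llama_cli_args("./model.gguf", 33, 8192, 8, 512, cache_type="q8_0")
    print(shlex.join(args))
    print(build_modelfile("./model.gguf", 33, 8192, 8))
```

Building the command as a list and joining it with `shlex.join` keeps paths containing spaces correctly quoted when the command is pasted into a shell.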

Quantization Reference

Type     Bits per weight  Quality               Use when
Q2_K     2.625            Bad                   The model will not fit any other way
Q3_K_M   3.91             Acceptable            Aggressive VRAM constraints
Q4_K_M   4.83             Sweet spot            Default; best quality/size ratio
Q5_K_M   5.69             Very good             Have headroom, want better quality
Q6_K     6.56             Near-lossless         Have plenty of VRAM
Q8_0     8.5              Effectively lossless  Reference for evals
F16      16               Lossless              Full precision (rarely useful)
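
The bits-per-weight column translates directly into a weight-size estimate: parameters × bits ÷ 8 gives bytes. The sketch below applies that arithmetic using the figures from the table. The function name `weights_gb` is hypothetical, sizes are in decimal GB, and the result ignores any tensors that a given GGUF file keeps at a different precision.

```python
# Bits per weight, taken from the quantization reference table.
BITS_PER_WEIGHT = {
    "Q2_K": 2.625,
    "Q3_K_M": 3.91,
    "Q4_K_M": 4.83,
    "Q5_K_M": 5.69,
    "Q6_K": 6.56,
    "Q8_0": 8.5,
    "F16": 16.0,
}


def weights_gb(n_params_billion, quant):
    """Estimate weight size in decimal GB for a model with the given parameter count."""
    bits = BITS_PER_WEIGHT[quant]
    total_bytes = n_params_billion * 1e9 * bits / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Compare every quantization type for a 32B-parameter dense model.
    for quant in BITS_PER_WEIGHT:
        print(f"{quant:8s} {weights_gb(32, quant):6.1f} GB")
```

For a 32B-parameter model this comes to about 19.3 GB at Q4_K_M and 64 GB at F16, which is why Q4_K_M is the usual starting point on a 24 GB card.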

KV Cache Precision

Type           KV cache VRAM (relative to f16)  Quality impact
f16 (default)  100%                             None lost
q8_0           ~50%                             Tiny degradation
q4_0           ~25%                             Some degradation, fine for chat

Drop to q8_0 first if you’re tight on VRAM; q4_0 is the last resort.
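
KV cache size grows linearly with context length, and the standard estimate is 2 (keys and values) × layers × KV heads × head dimension × context × bytes per value. The sketch below applies that formula. The function name and the model dimensions in the example are hypothetical, and it assumes a plain attention layout with no sliding-window or other architecture-specific savings. Because q8_0 and q4_0 store one f16 scale per block of 32 values, it uses 8.5 and 4.5 bits per value, slightly more than a clean half and quarter of f16.

```python
# Effective bits per cached value, including the per-block scale for quantized types.
KV_BITS_PER_VALUE = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_size, cache_type="f16"):
    """Estimate KV cache size in bytes for one sequence at the given context length."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    n_values = 2 * n_layers * n_kv_heads * head_dim * ctx_size
    return n_values * KV_BITS_PER_VALUE[cache_type] / 8


if __name__ == "__main__":
    # Hypothetical dimensions: 32 layers, 8 KV heads, head dimension 128, 8192-token context.
    for cache_type in KV_BITS_PER_VALUE:
        size = kv_cache_bytes(32, 8, 128, 8192, cache_type)
        print(f"{cache_type:5s} {size / 2**30:.2f} GiB")
```

With these example dimensions the f16 cache is exactly 1 GiB, so doubling the context to 16384 tokens would cost another 1 GiB unless the cache is quantized.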

What’s NOT Included

  • vLLM, SGLang, TGI config (different ecosystem)
  • Quantization conversion (use llama.cpp’s llama-quantize tool)
  • Speculative decoding parameters
  • Multi-GPU sharding (--tensor-split)
