Name: Local Model Browser
Author: Rohit Burani

What It Does

Pick your hardware on the left. The 54-model catalog filters on the right, with weights size and a throughput estimate against your selected quantization.

Hardware presets — Pi 5 16GB / 8GB, Mac M-series (Pro, Max, Ultra) up to 192 GB unified, RTX 3090/4090/5090, A100 40/80, H100, H200, 2× H100, plus Custom
Quantization picker — Q2_K through F16 (default Q4_K_M, the community sweet spot)
Task tabs — Any / Chat / Code / Vision / Reasoning / Agentic
Capability requirements — checkboxes for tool use, JSON mode, vision input, reasoning trace; only show models that support what you need
License filter — toggle Permissive (Apache, MIT), Restricted (Llama Community, Gemma TOU, MRL), Non-commercial (Cohere CC BY-NC) separately
Min context window — 8K / 32K / 128K / 1M
Fit-only toggle — by default hide models that won’t fit your hardware (you can turn this off to browse everything)
Sort — best fit (smallest first), total params, context window, release date, or vendor
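
The facets above amount to a chain of predicates over the catalog, followed by a sort. Here is a minimal sketch of that idea. It is not the tool's actual code: the `Model` fields, the `browse` function, and the three catalog entries are hypothetical names invented for illustration, and the fit check compares weights size against memory only, ignoring KV cache and runtime overhead.

```python
from dataclasses import dataclass

ALL_LICENSES = frozenset({"permissive", "restricted", "non-commercial"})


@dataclass(frozen=True)
class Model:
    # Hypothetical record shape; the real catalog schema is not shown here.
    name: str
    tasks: frozenset
    capabilities: frozenset
    license_class: str
    context_window: int  # tokens
    weights_gb: float    # size at the selected quantization


def browse(catalog, *, task="any", required_caps=frozenset(),
           licenses=ALL_LICENSES, min_context=0, memory_gb=None):
    """Apply every facet, then sort by best fit (smallest weights first).

    memory_gb=None stands in for turning the fit-only toggle off.
    """
    hits = [
        m for m in catalog
        if (task == "any" or task in m.tasks)
        and set(required_caps) <= m.capabilities  # all required capabilities present
        and m.license_class in licenses
        and m.context_window >= min_context
        and (memory_gb is None or m.weights_gb <= memory_gb)  # simplified fit check
    ]
    return sorted(hits, key=lambda m: m.weights_gb)


# Made-up entries, for illustration only.
catalog = [
    Model("big-vision", frozenset({"chat", "vision"}),
          frozenset({"tool", "vision"}), "restricted", 131072, 45.0),
    Model("tiny-chat", frozenset({"chat"}),
          frozenset({"json"}), "permissive", 32768, 2.0),
    Model("mid-coder", frozenset({"chat", "code"}),
          frozenset({"tool", "json"}), "permissive", 131072, 9.0),
]

code_models = browse(catalog, task="code", required_caps={"tool"}, memory_gb=24)
print([m.name for m in code_models])
```

Because each facet is an independent predicate, adding or removing a filter does not affect the others, which is what lets every control on the page be toggled separately.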

Each result card shows:

Vendor pill and any MoE marker (with active params)
Type tags (chat / code / vision / reasoning / agentic) and capability tags (tool / vision / json / reasoning)
Weights size at your chosen quantization
Fit verdict color-coded by tier: green (fits comfortably), yellow (tight or needs CPU offload), red (won’t fit)
Estimated throughput in tokens/sec (colored by tier — 30+ green, 8-30 yellow, <8 red)
Context window, release date, license
ollama pull command and HuggingFace link
Notes when present (e.g. “MLA cuts KV cache by ~93%”, “Tool-use trained”)
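
The two color codes on a card can be sketched as small classification functions. The throughput thresholds (30 and 8 tokens/sec) come from the list above; treating exactly 30 as green is an assumption. The fit thresholds are not stated on this page, so `fit_verdict` and its `comfortable_fraction` parameter are purely illustrative guesses, and the sketch does not model the CPU-offload case.

```python
def throughput_tier(tokens_per_sec: float) -> str:
    """Color tier for estimated throughput: 30+ green, 8-30 yellow, <8 red."""
    if tokens_per_sec >= 30:  # assumption: the boundary value 30 counts as green
        return "green"
    if tokens_per_sec >= 8:
        return "yellow"
    return "red"


def fit_verdict(weights_gb: float, memory_gb: float,
                comfortable_fraction: float = 0.8) -> str:
    """Hypothetical fit rule; the tool's real thresholds are not documented here.

    green  = weights use at most comfortable_fraction of memory
    yellow = weights fit, but with little headroom
    red    = weights exceed memory
    """
    if weights_gb <= comfortable_fraction * memory_gb:
        return "green"
    if weights_gb <= memory_gb:
        return "yellow"
    return "red"


print(throughput_tier(45.0), throughput_tier(12.0), throughput_tier(3.0))
print(fit_verdict(4.2, 24), fit_verdict(22, 24), fit_verdict(40, 24))
```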

How the throughput estimate works

This is the most honest version of model performance estimation I can give without measuring on your hardware:

weights_GB        = (params × bits_per_weight) / 8 / 1024³
active_weights_GB = same, but using params_active for MoE
tokens_per_sec    ≈ (memory_bandwidth_GB/s × 0.8) / active_weights_GB

The estimate assumes inference is bandwidth-bound, not FLOPS-bound: for typical LLM inference at batch 1, the GPU/CPU spends most of its time waiting for weights to stream from memory. The 0.8 fudge factor is a generous bandwidth utilization estimate. Real numbers will differ. They drop with longer contexts, because the growing KV cache also has to be read for every token, and they shift with kernel-level details such as attention optimizations (FlashAttention) and quantization-aware kernels, none of which the formula models.
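
As a worked example, the three formulas can be transcribed directly into code. This is a sketch rather than the tool's implementation: the function names are made up, and the parameter counts, bits per weight, and bandwidth figure are illustrative inputs, not measurements of any particular model or device.

```python
BYTES_PER_GB = 1024 ** 3  # the formula above divides by 1024^3


def weights_gb(params: float, bits_per_weight: float) -> float:
    """Weights size in GB: (params x bits_per_weight) / 8 / 1024^3."""
    return params * bits_per_weight / 8 / BYTES_PER_GB


def tokens_per_sec(bandwidth_gb_s: float, params_active: float,
                   bits_per_weight: float, utilization: float = 0.8) -> float:
    """Bandwidth-bound estimate: usable bandwidth divided by active weights size."""
    return bandwidth_gb_s * utilization / weights_gb(params_active, bits_per_weight)


# Illustrative dense model: 8e9 params at 4.5 bits/weight on 1000 GB/s memory.
dense_size = weights_gb(8e9, 4.5)
dense_rate = tokens_per_sec(1000, 8e9, 4.5)
print(f"dense: {dense_size:.2f} GB, ~{dense_rate:.0f} tokens/sec")

# Illustrative MoE: 40e9 total params must fit in memory,
# but only 10e9 active params are streamed per token.
moe_size = weights_gb(40e9, 4.5)
moe_rate = tokens_per_sec(1000, 10e9, 4.5)
print(f"moe:   {moe_size:.2f} GB, ~{moe_rate:.0f} tokens/sec")
```

The MoE case shows why the two sizes are kept separate: total parameters decide whether the model fits, while active parameters drive the throughput estimate.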

Don’t make purchase decisions from this number — run llama-bench on the actual hardware for that. Do use it for “is this even feasible?” first-cut filtering.

Why “Browser” not “Recommender”

The original spec was a recommender. We reframed during planning (2026-05-22): we don’t want to say “the best model for your Pi 5 is X.” Hardware-fit is just one dimension; quality on your specific task is another (and varies per benchmark, per prompt, per fine-tune). A facet browser lets you see all options at a glance and pick based on the tradeoffs you care about. That’s the honest framing.

What’s NOT Included

Quality benchmarks — no MMLU / HumanEval / GPQA / AIME scores. Reason: they go stale within weeks (new fine-tunes ship daily), and the leaderboard wars don’t tell you how a model behaves on YOUR task. Use HuggingFace Open LLM Leaderboard and LMArena for those
Actual tokens/sec measurements — the estimate is directional. For measured numbers, run llama-bench or Ollama’s bench
Cloud / API models — this is specifically open-weight models you can run locally. Hosted API models are tracked separately in reasoning-models.json and hyperscaler-pricing.json
Persistent state — reload starts fresh, by design

Related Tools

Llama.cpp / Ollama Config Builder — once you’ve picked a model here, generate the optimal CLI flags
LoRA / QLoRA Memory Calculator — VRAM math for fine-tuning these same models
GPU VRAM Calculator — manual VRAM math with full breakdown
Reasoning Token Cost Calculator — cloud equivalent for hosted reasoning models
Context Window Visualizer — see what fits in different context windows
