Local Model Browser

Filter 54+ open-weight LLMs by hardware + task + license + capability. See which fit your VRAM at your chosen quantization, with a clearly-labeled performance estimate.

📡 Filterable browser, not an oracle. Performance estimates are based on params × bits-per-weight and your hardware's memory bandwidth, so treat them as directional only. Real numbers come from running the model on your hardware. Catalog last refreshed 2026-06-01.

As-is, no warranty. These apps are free under their listed license and run entirely in your browser. Use at your own risk — don't blame me if your PC catches fire, your dog runs away, or the math turns out wrong. Verify anything that actually matters. None of this is professional financial, medical, legal, or engineering advice.

© 2026 Rohit Burani · MIT · Built at gekro.com

Guide

What It Does

Pick your hardware on the left. The 54-model catalog on the right filters to match, showing each model's weight size and a throughput estimate at your selected quantization.

  • Hardware presets — Pi 5 16GB / 8GB, Mac M-series (Pro, Max, Ultra) up to 192 GB unified, RTX 3090/4090/5090, A100 40/80, H100, H200, 2× H100, plus Custom
  • Quantization picker — Q2_K through F16 (default Q4_K_M, the community sweet spot)
  • Task tabs — Any / Chat / Code / Vision / Reasoning / Agentic
  • Capability requirements — checkboxes for tool use, JSON mode, vision input, reasoning trace; only show models that support what you need
  • License filter — toggle Permissive (Apache, MIT), Restricted (Llama Community, Gemma TOU, MRL), Non-commercial (Cohere CC BY-NC) separately
  • Min context window — 8K / 32K / 128K / 1M
  • Fit-only toggle: on by default, it hides models that won’t fit your hardware (turn it off to browse everything)
  • Sort — best fit (smallest first), total params, context window, release date, or vendor
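The filters above combine with AND semantics: a model is shown only if it passes every active facet. A minimal sketch of that logic follows. The `Model` fields, the `filter_models` function, and the example catalog entries are all hypothetical names invented for illustration; they are not the app's actual data model.

```python
from dataclasses import dataclass

ALL_LICENSES = frozenset({"permissive", "restricted", "non-commercial"})

@dataclass(frozen=True)
class Model:
    # Hypothetical record shape, not the app's real schema.
    name: str
    weights_gb: float          # weight size at the chosen quantization
    context_tokens: int
    license_class: str         # one of ALL_LICENSES
    tasks: frozenset           # e.g. {"chat", "code"}
    capabilities: frozenset    # e.g. {"tool", "json"}

def filter_models(models, *, memory_gb, task=None, must_support=(),
                  license_classes=ALL_LICENSES, min_context=0, fit_only=True):
    """Keep models that pass every active facet, sorted smallest first."""
    required = set(must_support)
    kept = []
    for m in models:
        if task is not None and task not in m.tasks:
            continue  # task tab ("Any" is modelled as task=None)
        if not required <= m.capabilities:
            continue  # every checked capability must be present
        if m.license_class not in license_classes:
            continue  # license class toggled off
        if m.context_tokens < min_context:
            continue  # context window too short
        if fit_only and m.weights_gb > memory_gb:
            continue  # weights alone exceed available memory
        kept.append(m)
    return sorted(kept, key=lambda m: m.weights_gb)  # "best fit" sort

# Made-up catalog entries, for illustration only.
catalog = [
    Model("example-large", 70.0, 131072, "permissive",
          frozenset({"chat", "reasoning"}), frozenset({"tool", "json", "reasoning"})),
    Model("example-small", 4.2, 131072, "permissive",
          frozenset({"chat"}), frozenset({"tool", "json"})),
    Model("example-coder", 18.0, 32768, "restricted",
          frozenset({"chat", "code"}), frozenset({"tool"})),
]

fits_24gb = filter_models(catalog, memory_gb=24, must_support={"tool"})
print([m.name for m in fits_24gb])
```

With 24 GB of memory and tool use required, the two smaller entries survive; turning off `fit_only` brings the largest one back into view.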

Each result card shows:

  • Vendor pill and any MoE marker (with active params)
  • Type tags (chat / code / vision / reasoning / agentic) and capability tags (tool / vision / json / reasoning)
  • Weights size at your chosen quantization
  • Fit verdict color-coded by tier: green (fits comfortably), yellow (tight or needs CPU offload), red (won’t fit)
  • Estimated throughput in tokens/sec (colored by tier — 30+ green, 8-30 yellow, <8 red)
  • Context window, release date, license
  • ollama pull command and HuggingFace link
  • Notes when present (e.g. “MLA cuts KV cache by ~93%”, “Tool-use trained”)
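The color tiers on each card can be sketched as two small classifiers. The throughput cutoffs (30 and 8 tokens/sec) come from the list above. The fit-verdict cutoffs are not stated on this page, so the `comfortable_fraction` value below is a placeholder assumption, and this sketch ignores CPU offload entirely. Both function names are hypothetical.

```python
def throughput_tier(tokens_per_sec: float) -> str:
    """Color tier for the throughput estimate: 30+ green, 8-30 yellow, <8 red."""
    if tokens_per_sec >= 30:
        return "green"
    if tokens_per_sec >= 8:
        return "yellow"
    return "red"

def fit_verdict(weights_gb: float, memory_gb: float,
                comfortable_fraction: float = 0.8) -> str:
    """Color tier for the fit verdict.

    ASSUMPTION: 'comfortable' means the weights use at most
    comfortable_fraction of memory. The real app's cutoff is not documented
    here and may differ (it also accounts for CPU offload, which this does not).
    """
    if weights_gb <= memory_gb * comfortable_fraction:
        return "green"   # fits with headroom for KV cache and activations
    if weights_gb <= memory_gb:
        return "yellow"  # tight
    return "red"         # weights alone exceed memory

print(throughput_tier(45.0), fit_verdict(4.2, 24.0))
```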

How the throughput estimate works

This is the most honest version of model performance estimation I can give without measuring on your hardware:

weights_GB        = (params × bits_per_weight) / 8 / 1024³
active_weights_GB = same, but using params_active for MoE
tokens_per_sec    ≈ (memory_bandwidth_GB/s × 0.8) / active_weights_GB

It’s bandwidth-bound, not FLOPS-bound: for typical LLM inference at batch 1, the GPU/CPU spends most of its time waiting for weights to stream from memory. The 0.8 fudge factor is a generous bandwidth utilization estimate. Real numbers will be lower at longer contexts (the KV cache grows and has to be read on every token), and they also shift with the attention implementation (e.g. FlashAttention) and with how well the kernels handle your chosen quantization.
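The three formulas above translate directly into code. This is a minimal sketch, not the app's source: the function names are made up, and the example inputs (an 8B dense model, a 40B-total / 10B-active MoE, 4.5 bits per weight, 1000 GB/s of bandwidth) are illustrative values rather than specs of any real model or card. Like the formulas, it divides by 1024³ for size while taking bandwidth in GB/s, which is fine for a directional number.

```python
GIB = 1024 ** 3  # the formulas divide by 1024^3

def weights_gb(params: float, bits_per_weight: float) -> float:
    """Size of the weights at a given quantization."""
    return params * bits_per_weight / 8 / GIB

def tokens_per_sec(bandwidth_gb_s: float, active_params: float,
                   bits_per_weight: float, utilization: float = 0.8) -> float:
    """Bandwidth-bound estimate: each token streams the active weights once."""
    return bandwidth_gb_s * utilization / weights_gb(active_params, bits_per_weight)

# Dense model: every parameter is read for every token.
dense_size = weights_gb(8e9, 4.5)
dense_tps = tokens_per_sec(1000, 8e9, 4.5)

# MoE model: all 40B parameters must fit in memory,
# but only the 10B active parameters are streamed per token.
moe_size = weights_gb(40e9, 4.5)
moe_tps = tokens_per_sec(1000, 10e9, 4.5)

print(f"dense: {dense_size:.1f} GB, ~{dense_tps:.0f} tok/s")
print(f"moe:   {moe_size:.1f} GB, ~{moe_tps:.0f} tok/s")
```

The MoE case shows why the card lists active params separately: the fit verdict depends on total weights, while the throughput estimate depends only on the active share.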

Don’t make purchase decisions from this number — run llama-bench on the actual hardware for that. Do use it for “is this even feasible?” first-cut filtering.

Why “Browser” not “Recommender”

The original spec was a recommender. We reframed during planning (2026-05-22): we don’t want to say “the best model for your Pi 5 is X.” Hardware fit is just one dimension; quality on your specific task is another (and it varies per benchmark, per prompt, per fine-tune). A faceted browser lets you see all options at a glance and pick based on the tradeoffs you care about. That’s the honest framing.

What’s NOT Included

  • Quality benchmarks — no MMLU / HumanEval / GPQA / AIME scores. Reason: they go stale within weeks (new fine-tunes ship daily), and the leaderboard wars don’t tell you how a model behaves on YOUR task. Use HuggingFace Open LLM Leaderboard and LMArena for those
  • Actual tokens/sec measurements: the estimate is directional. For measured numbers, run llama-bench or Ollama’s bench
  • Cloud / API models — this is specifically open-weight models you can run locally. Hosted API models are tracked separately in reasoning-models.json and hyperscaler-pricing.json
  • Persistent state — reload starts fresh, by design
