RAG Eval Toolkit

Paste a corpus + labelled queries → measure recall@K, precision@K, MRR, nDCG for BM25 (and optionally semantic) retrieval. Local, free, self-hosted alternative to RAGAS / TruLens.

BM25 retrieval · MRR · recall@K

🧪 Local eval harness. Corpus + queries never leave the browser. Honest about scope: this is intrinsic retrieval evaluation (does the retriever find the right docs?), not end-to-end answer evaluation (does the model's final answer match?).

Corpus


One document per line, or a JSON array of strings, or a JSON array of {"id":..., "text":...} objects. IDs default to the 0-indexed line number.

Test queries


One per line, format: query | expected_id_1, expected_id_2, .... IDs are 0-indexed (or match the id field if your corpus is JSON with ids).


As-is, no warranty. These apps are free under their listed license and run entirely in your browser. Use at your own risk — don't blame me if your PC catches fire, your dog runs away, or the math turns out wrong. Verify anything that actually matters. None of this is professional financial, medical, legal, or engineering advice.

© 2026 Rohit Burani · MIT · Built at gekro.com

Guide

What It Does

Three retrieval modes, five metrics, instant per-query breakdown:

  • BM25: classic lexical retrieval with k1=1.5, b=0.75, and a small English stopword list. Zero downloads, runs in microseconds per query.
  • Semantic: dense embeddings from all-MiniLM-L6-v2, run through Transformers.js (the Xenova/all-MiniLM-L6-v2 port). Lazy-loaded from jsDelivr on first use, then roughly 10-50 ms per query in your browser.
  • Hybrid: Reciprocal Rank Fusion of the BM25 and semantic rankings with RRF_K=60.
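
The lexical and hybrid modes above can be sketched in a few lines of Python. This is an illustration, not the app's actual source: the tokenizer, the stopword list, and the IDF variant (Lucene-style here) are assumptions, since the guide only states k1, b, and RRF_K. The `semantic` ranking is a made-up stand-in for what an embedding model would return.

```python
import math
import re
from collections import Counter

K1, B, RRF_K = 1.5, 0.75, 60  # parameter values quoted in the guide
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}  # hypothetical stand-in list


def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stopwords (assumed tokenizer)."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def bm25_rank(query, docs):
    """Return document indices sorted by descending BM25 score."""
    toks = [tokenize(d) for d in docs]
    n = len(docs)
    avgdl = sum(len(t) for t in toks) / n
    df = Counter(term for t in toks for term in set(t))  # document frequency per term
    scores = []
    for t in toks:
        tf = Counter(t)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Lucene-style IDF (assumed variant); never negative.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + K1 * (1 - B + B * len(t) / avgdl)
            score += idf * tf[term] * (K1 + 1) / norm
        scores.append(score)
    return sorted(range(n), key=lambda i: -scores[i])


def rrf_fuse(rankings, k=RRF_K):
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over all rankings, ranks start at 1."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=lambda d: -fused[d])


docs = ["the cat sat on the mat", "dogs chase cats", "stock markets fell sharply"]
lexical = bm25_rank("cat on mat", docs)
semantic = [1, 2, 0]  # hypothetical ranking from an embedding model
print(lexical, rrf_fuse([lexical, semantic]))
```

Because RRF only looks at ranks, it needs no score calibration between BM25 and cosine similarity, which is probably why it is a common default for hybrid retrieval.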

Metrics computed across the whole eval set:

Metric        Definition
MRR           Mean Reciprocal Rank: average of 1/rank-of-first-correct
Recall@K      fraction of expected docs found in top-K, averaged across queries
Precision@K   fraction of top-K that were expected, averaged across queries
Hit Rate      fraction of queries with at least one correct doc in top-K
nDCG@K        discounted cumulative gain normalized by ideal ordering
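
As a rough reference for these definitions, here is a minimal Python sketch that computes all five metrics with binary relevance. It makes assumptions the guide does not confirm: MRR is cut off at K, Precision@K always divides by K, and every query has at least one expected ID. The function name `evaluate` and the sample IDs are illustrative only.

```python
import math


def mean(values):
    return sum(values) / len(values)


def evaluate(results, k):
    """results: list of (ranked_ids, expected_ids) pairs, one per query."""
    rr, recall, precision, hits, ndcg = [], [], [], [], []
    for ranked, expected in results:
        expected = set(expected)
        # Binary gain per retrieved position: 1 if the doc was expected, else 0.
        gains = [1 if doc_id in expected else 0 for doc_id in ranked[:k]]
        # Reciprocal rank of the first correct doc; 0 if none is in the top-K.
        first = next((i for i, g in enumerate(gains, start=1) if g), None)
        rr.append(1 / first if first else 0.0)
        recall.append(sum(gains) / len(expected))
        precision.append(sum(gains) / k)
        hits.append(1.0 if any(gains) else 0.0)
        # DCG discounts each hit by log2(position + 1); the ideal ordering
        # places all expected docs (up to K of them) at the top.
        dcg = sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))
        ideal = sum(1 / math.log2(i + 1) for i in range(1, min(len(expected), k) + 1))
        ndcg.append(dcg / ideal)
    return {
        "mrr": mean(rr),
        "recall": mean(recall),
        "precision": mean(precision),
        "hit_rate": mean(hits),
        "ndcg": mean(ndcg),
    }


results = [
    (["a", "b", "c"], ["b"]),       # first hit at rank 2
    (["x", "y", "z"], ["x", "z"]),  # hits at ranks 1 and 3
]
metrics = evaluate(results, k=3)
print(metrics)
```

For this toy input, MRR is (1/2 + 1) / 2 = 0.75 and both queries reach full recall, while nDCG stays below 1 because neither ranking puts every expected doc first.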

Per-query breakdown shows the actual top-K retrieved with hit highlighting, plus an “only failed” toggle for fast debugging of queries with recall@K=0.

Input format

Corpus — three accepted forms:

  1. One doc per line (default) — IDs are the 0-indexed line number.
  2. JSON array of strings — same IDs (array index).
  3. JSON array of objects with {id, text} — your own IDs.

Queries — one per line: query text | expected_id_1, expected_id_2, ...

The IDs you put on the right of the | must match either the line-index of your corpus or the explicit id field in your JSON corpus.
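
To make the accepted formats concrete, here is a small parsing sketch in Python. It is not the app's parser: the function names, the choice to treat all IDs as strings, the fallback to line mode on invalid JSON, and the rule that the last | separates the query from its IDs are all assumptions.

```python
import json


def parse_corpus(raw):
    """Return {doc_id: text}. IDs are kept as strings so they compare cleanly with query labels."""
    raw = raw.strip()
    if raw.startswith("["):
        try:
            items = json.loads(raw)
        except json.JSONDecodeError:
            items = None  # not valid JSON: fall through to line mode
        if isinstance(items, list):
            corpus = {}
            for index, item in enumerate(items):
                if isinstance(item, dict):
                    corpus[str(item["id"])] = item["text"]  # explicit IDs
                else:
                    corpus[str(index)] = str(item)  # array index as ID
            return corpus
    # Line mode: the 0-indexed line number is the ID.
    return {str(index): line for index, line in enumerate(raw.splitlines())}


def parse_queries(raw):
    """Parse 'query | id1, id2' lines into (query, [expected_ids]) pairs."""
    queries = []
    for line in raw.splitlines():
        if "|" not in line:
            continue  # skip blank or unlabelled lines (assumed behaviour)
        text, _, ids = line.rpartition("|")
        expected = [part.strip() for part in ids.split(",") if part.strip()]
        queries.append((text.strip(), expected))
    return queries


corpus = parse_corpus('[{"id": "a1", "text": "hello world"}, {"id": "a2", "text": "goodbye"}]')
queries = parse_queries("greeting | a1\nfarewell | a2, a1")
# Every expected ID should exist in the corpus, otherwise recall can never reach 1.
missing = [i for _, ids in queries for i in ids if i not in corpus]
print(corpus, queries, missing)
```

Checking for labels that match no document, as the last lines do, catches the most common setup mistake before any metric is computed.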

When To Use It

  • Comparing chunking strategies (fixed-size vs semantic vs propositions) on the same corpus
  • Choosing between BM25 / dense / hybrid for a specific domain (medical vs legal vs casual)
  • Debugging “why does my RAG miss this query?” — load the failed queries, see what it retrieved, eyeball the corpus
  • Setting K — sweep K=3, 5, 10, 20 and see at what point you saturate recall
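
The K sweep in the last bullet amounts to recomputing mean recall at each cutoff over the same rankings. A minimal Python sketch, with made-up rankings and a hypothetical `sweep_recall` helper, might look like this:

```python
def recall_at_k(ranked, expected, k):
    """Fraction of expected IDs that appear in the first k results."""
    expected = set(expected)
    return len(expected.intersection(ranked[:k])) / len(expected)


def sweep_recall(results, ks=(3, 5, 10, 20)):
    """Mean recall@K for each K; results is a list of (ranked_ids, expected_ids)."""
    return {k: sum(recall_at_k(r, e, k) for r, e in results) / len(results) for k in ks}


# Hypothetical rankings: each list should hold at least max(ks) results.
results = [
    (list(range(20)), [1, 7]),
    (list(range(20)), [4, 12]),
]
curve = sweep_recall(results)
for k, value in curve.items():
    print(f"recall@{k} = {value:.2f}")
```

Recall can only grow with K, so the useful signal is where the curve flattens: beyond that point, a larger K mostly adds context length without adding relevant documents.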

How semantic mode works

When you first click “Semantic” or “Hybrid”, or hit “Load model now”, the app dynamically imports @xenova/transformers@2.17.2 from the jsDelivr CDN. The library then downloads Xenova/all-MiniLM-L6-v2 (the Transformers.js-compatible ONNX port of the popular sentence-transformers model), about 30 MB of WASM + ONNX weights. The browser caches the files after the first download (Transformers.js uses the browser Cache API by default), so later sessions skip the download.

Embedding happens locally in your browser. Your corpus and queries are never sent anywhere.

Notes on the model: embeddings are 384-dimensional, mean-pooled, and L2-normalized. Because every vector has unit length, cosine similarity reduces to a dot product. Good baseline for English; for multilingual see Xenova/paraphrase-multilingual-MiniLM-L12-v2. For better quality at higher cost see Xenova/all-mpnet-base-v2 (~110 MB).
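
The pooling and normalization steps can be illustrated without the model itself. The Python sketch below uses toy 3-dimensional token vectors in place of the real 384-dimensional output and ignores attention masking, which a real mean-pooling step would apply to padding tokens. All helper names are illustrative.

```python
import math


def mean_pool(token_vectors):
    """Average per-token vectors into a single sentence vector."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy token vectors standing in for the model's per-token output.
doc = l2_normalize(mean_pool([[1.0, 2.0, 0.0], [3.0, 0.0, 4.0]]))
query = l2_normalize(mean_pool([[2.0, 1.0, 1.0]]))

# With unit-length vectors the cosine denominator is 1, so both values agree.
print(dot(doc, query), cosine(doc, query))
```

Skipping the two square roots per comparison is a small saving for one pair, but it adds up when every query is scored against the whole corpus.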

What’s NOT Included (and why)

  • End-to-end faithfulness eval: this tool measures retrieval, not generation. RAG quality has two stages, and they fail differently. For “did the LLM use the retrieved docs correctly?”, use RAGAS, TruLens, or a custom LLM-as-judge.
  • Larger embedding models: kept to MiniLM for the bandwidth/quality tradeoff. Transformers.js supports more; advanced users can fork and swap.
  • Re-rankers: cross-encoder re-rankers (e.g. bge-reranker-large) are out of scope for v1. Manual workaround: export the top-K from this app, feed it to your re-ranker offline, and compare metrics.
  • Persistent state: your corpus and queries don’t persist across reloads. The ~30 MB embedding model is cached by the browser at the library level (Cache API); that’s Transformers.js behavior, not app behavior.
