Paste a corpus + labelled queries → measure recall@K, precision@K, MRR, nDCG for BM25 (and optionally semantic) retrieval. Local, free, self-hosted alternative to RAGAS / TruLens.
🧪 Local eval harness. Corpus + queries never leave the browser. Honest about scope: this is intrinsic retrieval evaluation (does the retriever find the right docs?), not end-to-end answer evaluation (does the model's final answer match?).
Corpus
One document per line, OR paste a JSON array of strings, OR paste a JSON array of {"id":..., "text":...}. IDs default to the 0-indexed line number.
Test queries
One per line, format: query | expected_id_1, expected_id_2, .... IDs are 0-indexed (or match the id field if your corpus is JSON with ids).
As-is, no warranty. These apps are free under their listed license and run entirely in your browser. Use at your own risk — don't blame me if your PC catches fire, your dog runs away, or the math turns out wrong. Verify anything that actually matters. None of this is professional financial, medical, legal, or engineering advice.
Three retrieval modes, five metrics, instant per-query breakdown:
all-MiniLM-L6-v2 (Xenova/transformers.js). Lazy-loaded from jsDelivr on first use, then ~10-50 ms per query in your browser.
Metrics computed across the whole eval set:
| Metric | Definition |
|---|---|
| MRR | Mean Reciprocal Rank — average of 1/rank-of-first-correct |
| Recall@K | fraction of expected docs found in top-K, averaged across queries |
| Precision@K | fraction of top-K that were expected, averaged across queries |
| Hit Rate | fraction of queries with at least one correct doc in top-K |
| nDCG@K | discounted cumulative gain normalized by ideal ordering |
The per-query breakdown shows the actual top-K retrieved documents with hits highlighted, plus an “only failed” toggle for fast debugging of queries with recall@K = 0.
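The definitions in the table can be sketched in a few lines of Python. This is a minimal illustration, not the app's own code: the `evaluate` function and its input shape are hypothetical, relevance is treated as binary, a query with no hit in the top-K contributes a reciprocal rank of 0, and precision always divides by K. Those choices are assumptions, so the app's numbers may differ at the edges.

```python
import math

def evaluate(runs, k=5):
    """runs: list of (retrieved_ids, expected_ids) pairs, one per query.
    retrieved_ids is ranked best-first; expected_ids is a set of correct IDs."""
    totals = {"mrr": 0.0, "recall": 0.0, "precision": 0.0, "hit_rate": 0.0, "ndcg": 0.0}
    for retrieved, expected in runs:
        # 1 where the doc at that rank is expected, else 0 (binary relevance).
        hits = [1 if doc_id in expected else 0 for doc_id in retrieved[:k]]
        # Reciprocal rank of the first correct doc; 0 if none in the top-K (assumption).
        rr = next((1.0 / (i + 1) for i, h in enumerate(hits) if h), 0.0)
        # DCG discounts each hit by log2(rank + 1); the ideal puts all expected docs first.
        dcg = sum(h / math.log2(i + 2) for i, h in enumerate(hits))
        ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(expected), k)))
        totals["mrr"] += rr
        totals["recall"] += sum(hits) / len(expected) if expected else 0.0
        totals["precision"] += sum(hits) / k
        totals["hit_rate"] += 1.0 if any(hits) else 0.0
        totals["ndcg"] += dcg / ideal if ideal else 0.0
    n = len(runs) or 1  # avoid dividing by zero on an empty eval set
    return {name: value / n for name, value in totals.items()}

# Two toy queries: the first finds one of two expected docs at rank 2,
# the second finds nothing.
runs = [
    (["2", "0", "5"], {"0", "1"}),
    (["3", "4", "1"], {"9"}),
]
metrics = evaluate(runs, k=3)
```

With these two toy queries, MRR comes out at 0.25 and hit rate at 0.5, because only the first query has a hit and that hit sits at rank 2.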
Corpus — three accepted forms:
{id, text}, with your own IDs.
Queries, one per line: query text | expected_id_1, expected_id_2, ...
The IDs you put on the right of the | must match either the line-index of your corpus or the explicit id field in your JSON corpus.
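If you want to check your input before pasting it, the formats above can be parsed roughly as follows. This is a sketch under assumptions, not the app's parser: `parse_corpus` and `parse_queries` are hypothetical names, all IDs are compared as strings, blank lines still count toward the line index, and a query line is split on its last `|`.

```python
import json

def parse_corpus(raw):
    """Return {doc_id: text} from one of the accepted corpus forms."""
    raw = raw.strip()
    if raw.startswith("["):
        try:
            items = json.loads(raw)
        except json.JSONDecodeError:
            items = None  # not valid JSON: fall back to one document per line
        if isinstance(items, list):
            corpus = {}
            for index, item in enumerate(items):
                if isinstance(item, dict):
                    # Explicit id if present, otherwise the array position.
                    corpus[str(item.get("id", index))] = str(item.get("text", ""))
                else:
                    corpus[str(index)] = str(item)
            return corpus
    # Plain text: the ID is the 0-indexed line number.
    return {str(i): line for i, line in enumerate(raw.splitlines())}

def parse_queries(raw):
    """Return [(query_text, {expected_id, ...}), ...] from 'query | id1, id2' lines."""
    queries = []
    for line in raw.splitlines():
        if "|" not in line:
            continue  # skip blank or malformed lines (assumption)
        text, _, ids = line.rpartition("|")
        expected = {part.strip() for part in ids.split(",") if part.strip()}
        queries.append((text.strip(), expected))
    return queries

corpus = parse_corpus(
    '[{"id": "a", "text": "BM25 ranks by term overlap"},'
    ' {"id": "b", "text": "Embeddings capture meaning"}]'
)
queries = parse_queries("what is bm25 | a\nsemantic search | b, a")

# Expected IDs that do not exist in the corpus usually indicate an indexing mistake.
unknown = [doc_id for _, expected in queries for doc_id in expected if doc_id not in corpus]
```

An empty `unknown` list means every expected ID resolves to a document, which is worth confirming before reading anything into the metrics.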
When you first click “Semantic” or “Hybrid”, or hit “Load model now”, the app dynamically imports @xenova/transformers@2.17.2 from the jsDelivr CDN. The library downloads Xenova/all-MiniLM-L6-v2 (the JS port of the popular sentence-transformers model), about 30 MB of WASM + ONNX weights. The browser caches the files locally after the first download, so later uses skip the download.
Embedding happens locally in your browser. Your corpus and queries are never sent anywhere.
Notes on the model: 384-dimensional, mean-pooled + L2-normalized. Cosine similarity reduces to dot product. Good baseline for English; for multilingual see Xenova/paraphrase-multilingual-MiniLM-L12-v2. For better quality at higher cost see Xenova/all-mpnet-base-v2 (~110 MB).
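To see why cosine similarity reduces to a dot product, here is a small sketch using toy 3-dimensional vectors in place of real 384-dimensional embeddings. The vectors, IDs, and helper names are made up for illustration. Once both vectors have unit length, the denominator of the cosine formula is 1, so the plain dot product gives the same score.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vector, vector))
    return [x / norm for x in vector]

def cosine(a, b):
    """Full cosine similarity on raw, un-normalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy stand-ins for embeddings (hypothetical values).
query = [0.2, -1.0, 3.0]
docs = {"d0": [0.1, -0.9, 2.5], "d1": [3.0, 0.5, -0.2]}

# Normalize once, then rank by dot product alone.
q = l2_normalize(query)
normalized = {doc_id: l2_normalize(vec) for doc_id, vec in docs.items()}
ranking = sorted(docs, key=lambda doc_id: dot(q, normalized[doc_id]), reverse=True)
```

In practice this means document embeddings can be normalized once up front, and each query then needs only one dot product per document.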
bge-reranker-large) are out of scope for v1. Manual workaround: export top-K from this app, feed to your re-ranker offline, compare metrics.
For informational purposes only. Not financial, medical, or legal advice. You are solely responsible for how you use these tools.