Evaluations

Track answer quality across regression suites and ad-hoc probes.

Retrieval evaluation

Runs the benchmark queries through the production retrieval path and scores MRR, nDCG, and keyword coverage. Thresholds match the eval prototype.

Click “Run evaluation” to start.
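The three retrieval metrics above can be sketched as plain functions. This is a minimal illustration, not the production scorer: it assumes binary relevance labels per ranked result and a flat keyword list per query, and the function names are hypothetical.

```python
import math

def mrr(ranked_relevance):
    """Mean reciprocal rank: ranked_relevance is a list per query of
    0/1 relevance flags in rank order; score is 1/rank of first hit."""
    total = 0.0
    for flags in ranked_relevance:
        for rank, rel in enumerate(flags, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)

def ndcg(flags, k=10):
    """Binary-relevance nDCG@k: DCG of the ranking divided by the DCG
    of the ideal (relevant-first) ordering of the same flags."""
    dcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(flags[:k], start=1))
    ideal = sorted(flags, reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

def keyword_coverage(retrieved_text, keywords):
    """Fraction of expected keywords found in the concatenated
    retrieved text (case-insensitive substring match)."""
    text = retrieved_text.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords) if keywords else 1.0
```

A pass/fail threshold check would then compare each aggregate score against the configured minimum.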

Answer evaluation

Runs the benchmark queries through the full pipeline and uses an LLM-as-judge to score accuracy, completeness, and relevance on a 1–5 scale.

Click “Run evaluation” to start.
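The LLM-as-judge step boils down to building a rubric prompt and parsing per-dimension scores out of the judge's reply. A minimal sketch, assuming the judge is asked to reply with one "dimension: score" line per dimension; the prompt wording, dimension names as literals, and function names are illustrative, and the actual model call is omitted.

```python
import re

# Dimensions scored on a 1-5 scale, per the evaluation description.
DIMENSIONS = ("accuracy", "completeness", "relevance")

def build_judge_prompt(question, reference, answer):
    """Assemble a rubric prompt asking the judge model to rate the
    answer on each dimension, one 'dimension: score' line per reply."""
    return (
        "Rate the answer on a 1-5 scale for each dimension.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {answer}\n"
        "Reply with one 'dimension: score' line for each of: "
        + ", ".join(DIMENSIONS) + "."
    )

def parse_judge_scores(reply):
    """Extract integer scores 1-5 per dimension from the judge's reply;
    dimensions the judge skipped are simply absent from the result."""
    scores = {}
    for dim in DIMENSIONS:
        m = re.search(rf"{dim}\s*[:=]\s*([1-5])\b", reply, re.IGNORECASE)
        if m:
            scores[dim] = int(m.group(1))
    return scores
```

In a real run, `build_judge_prompt` output would be sent to the judge model and its reply fed to `parse_judge_scores`; averaging the parsed scores across the benchmark gives the per-dimension results.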