Evaluations
Track answer quality across regression suites and ad-hoc probes.
Retrieval evaluation
Runs the benchmark queries through the production retrieval stack and scores MRR, nDCG, and keyword coverage. Pass/fail thresholds match the eval prototype.
Click “Run evaluation” to start.
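The three retrieval metrics can be sketched as follows. This is an illustrative implementation, not the production scoring code: the exact cutoffs, relevance labels, and keyword-matching rules used by the benchmark are assumptions here.

```python
import math

def mrr(ranked_relevant: list[list[int]]) -> float:
    # Mean reciprocal rank: for each query, take 1/rank of the first
    # relevant hit (0 if none), then average over all queries.
    total = 0.0
    for labels in ranked_relevant:
        for rank, rel in enumerate(labels, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_relevant)

def ndcg(labels: list[int], k: int = 10) -> float:
    # Normalized discounted cumulative gain at k: DCG of the ranking
    # divided by the DCG of the ideal (sorted) ranking.
    dcg = sum(rel / math.log2(rank + 1)
              for rank, rel in enumerate(labels[:k], start=1))
    ideal = sorted(labels, reverse=True)
    idcg = sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg else 0.0

def keyword_coverage(text: str, keywords: list[str]) -> float:
    # Fraction of expected keywords that appear in the retrieved text
    # (case-insensitive substring match, an assumption for this sketch).
    hits = sum(1 for kw in keywords if kw.lower() in text.lower())
    return hits / len(keywords) if keywords else 1.0
```

A run would compute these per query against labeled relevance judgments, then compare the averages to the prototype's thresholds.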
Answer evaluation
Runs the benchmark through the full answer pipeline, then uses an LLM-as-judge to score each answer for accuracy, completeness, and relevance on a 1–5 scale.
Click “Run evaluation” to start.
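One way the judge step might be wired up is sketched below. The prompt template and score parsing are hypothetical (the real rubric and judge model call are not shown here); the sketch only illustrates constraining the judge to a fixed reply format and extracting the three 1–5 scores.

```python
import re

# Hypothetical rubric prompt: asks the judge model for a rigidly
# formatted reply so the scores can be parsed reliably.
JUDGE_PROMPT = """\
Rate the answer against the reference on three criteria,
each on a 1-5 scale. Reply with exactly three lines:
accuracy: <n>
completeness: <n>
relevance: <n>

Question: {question}
Reference: {reference}
Answer: {answer}
"""

def parse_judge_scores(reply: str) -> dict[str, int]:
    # Extract the three 1-5 scores from the judge model's reply;
    # raise if any criterion is missing so a malformed reply fails loudly.
    scores: dict[str, int] = {}
    for criterion in ("accuracy", "completeness", "relevance"):
        m = re.search(rf"{criterion}:\s*([1-5])", reply, re.IGNORECASE)
        if not m:
            raise ValueError(f"judge reply missing {criterion} score")
        scores[criterion] = int(m.group(1))
    return scores
```

Failing fast on unparseable replies keeps a flaky judge from silently skewing suite-level averages.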