Evaluations

Track answer quality across regression suites and ad-hoc probes.

Retrieval evaluation

Runs the benchmark queries through the production retrieval path and scores MRR, nDCG, and keyword coverage. Thresholds match the eval prototype.

Click “Run evaluation” to start.
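The three retrieval metrics above can be sketched as plain functions. This is a minimal illustration, not the production scorer: it assumes binary relevance labels per ranked result and a flat keyword list per query, and the function names are hypothetical.

```python
import math

def mrr(ranked_relevance):
    """Mean reciprocal rank: ranked_relevance is a list per query of
    0/1 relevance flags in rank order; score is 1/rank of first hit."""
    total = 0.0
    for flags in ranked_relevance:
        for rank, rel in enumerate(flags, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)

def ndcg(flags, k=10):
    """Binary-relevance nDCG@k: DCG of the ranking divided by the DCG
    of the ideal (relevant-first) ordering of the same flags."""
    dcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(flags[:k], start=1))
    ideal = sorted(flags, reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

def keyword_coverage(retrieved_text, keywords):
    """Fraction of expected keywords found in the concatenated
    retrieved text (case-insensitive substring match)."""
    text = retrieved_text.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords) if keywords else 1.0
```

A pass/fail threshold check would then compare each aggregate score against the configured minimum.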

Answer evaluation

Runs the benchmark queries through the full pipeline and uses an LLM-as-judge to score accuracy, completeness, and relevance on a 1–5 scale.

Click “Run evaluation” to start.
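The LLM-as-judge step boils down to building a rubric prompt and parsing per-dimension scores out of the judge's reply. A minimal sketch, assuming the judge is asked to reply with one "dimension: score" line per dimension; the prompt wording, dimension names as literals, and function names are illustrative, and the actual model call is omitted.

```python
import re

# Dimensions scored on a 1-5 scale, per the evaluation description.
DIMENSIONS = ("accuracy", "completeness", "relevance")

def build_judge_prompt(question, reference, answer):
    """Assemble a rubric prompt asking the judge model to rate the
    answer on each dimension, one 'dimension: score' line per reply."""
    return (
        "Rate the answer on a 1-5 scale for each dimension.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {answer}\n"
        "Reply with one 'dimension: score' line for each of: "
        + ", ".join(DIMENSIONS) + "."
    )

def parse_judge_scores(reply):
    """Extract integer scores 1-5 per dimension from the judge's reply;
    dimensions the judge skipped are simply absent from the result."""
    scores = {}
    for dim in DIMENSIONS:
        m = re.search(rf"{dim}\s*[:=]\s*([1-5])\b", reply, re.IGNORECASE)
        if m:
            scores[dim] = int(m.group(1))
    return scores
```

In a real run, `build_judge_prompt` output would be sent to the judge model and its reply fed to `parse_judge_scores`; averaging the parsed scores across the benchmark gives the per-dimension results.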