# Eval

## Synopsis

```sh
kbolt eval run [--file <path>]
kbolt eval import beir --dataset <name> --source <dir> --output <dir> [--collection <name>]
```
## What eval does

`eval` is the benchmark surface for retrieval quality.

Use it to:

- run an evaluation from an `eval.toml` manifest
- import a BEIR dataset into a local benchmark corpus plus manifest
## run

Use `run` to evaluate the current index against an eval manifest:
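```sh
kbolt eval run --file eval.toml
```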
Without `--file`, kbolt loads `eval.toml` from the config directory.
Important rules:

- the top-level `--space` flag is rejected for `eval`; set scope inside each eval case instead
- each case must include a non-empty `query`
- each case must include at least one `judgments` entry
- each case must include at least one judgment with `relevance > 0`
- judgment paths must be unique within each case
- referenced collections must already exist and have indexed chunks
### Minimal manifest shape

```toml
[[cases]]
query = "trait object vs generic"
space = "bench"
collections = ["rust"]
judgments = [
  { path = "rust/traits.md", relevance = 2 },
  { path = "rust/generics.md", relevance = 1 },
]
```
Each run reports metrics per search mode, including:

- `keyword`
- `auto`
- `auto+rerank`
- `semantic` (when an embedder is configured)
- `deep-norerank`
- `deep`
## import beir

Use `import beir` to turn an extracted BEIR dataset into:

- a `corpus/` directory with materialized Markdown documents
- an `eval.toml` manifest
Example, assuming the extracted SciFact dataset lives at `/tmp/beir/scifact` (an illustrative path):
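```sh
# --source points at the extracted dataset; the path here is illustrative
kbolt eval import beir --dataset scifact --source /tmp/beir/scifact --output /tmp/scifact-bench
```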
### Required source layout

The source directory must contain:
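A minimal sketch, assuming kbolt consumes the standard extracted BEIR layout (file names per the upstream BEIR distribution):

```text
corpus.jsonl   # documents, one JSON object per line
queries.jsonl  # queries, one JSON object per line
qrels/
  test.tsv     # relevance judgments for the test split
```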
This command always imports the BEIR test split.
### Import rules

- `--output` must point to an empty directory, or to a directory that does not exist yet
- `--collection` defaults to the dataset name
- imported corpus files are written as `<document-id>.md`
- the generated eval cases use the default benchmark space `bench`
After import, the usual path is:

```sh
kbolt space add bench
kbolt --space bench collection add /tmp/scifact-bench/corpus --name scifact --no-index
kbolt --space bench update --collection scifact
kbolt eval run --file /tmp/scifact-bench/eval.toml
```