Custom Evals

Custom evals let you benchmark models on your domain-specific data. Create tasks in MTEB v2 format and run them alongside standard benchmarks.

Custom tasks use the MTEB v2 format with three files:

  • corpus.jsonl: the documents to search, one JSON object per line, e.g. {"_id": "doc1", "title": "optional", "text": "document text"}
  • queries.jsonl: the queries to evaluate, e.g. {"_id": "q1", "text": "query text"}
  • qrels/test.tsv: relevance judgments on a 0-3 scale, one tab-separated query-id<tab>corpus-id<tab>score line per judgment

Example corpus.jsonl:

{"_id": "doc1", "title": "ML Basics", "text": "Machine learning uses algorithms to learn from data."}
{"_id": "doc2", "text": "The weather forecast predicts rain tomorrow."}

Example queries.jsonl:

{"_id": "q1", "text": "What is machine learning?"}
{"_id": "q2", "text": "How do neural networks work?"}

Example qrels/test.tsv (columns are tab-separated):

q1 doc1 3
q1 doc2 0

Scores follow TREC conventions: 3 = highly relevant, 2 = relevant, 1 = marginally relevant, 0 = not relevant.
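
Before running an eval, it can help to sanity-check that the three files agree with one another. The sketch below is illustrative only (it is not part of the eval tooling) and assumes the 3-column qrels layout shown above:

import csv
import json
from pathlib import Path

def validate_task(task_dir: str) -> None:
    """Check that every qrels row references a known query and document."""
    task = Path(task_dir)
    corpus_ids = {
        json.loads(line)["_id"]
        for line in (task / "corpus.jsonl").read_text().splitlines()
        if line.strip()
    }
    query_ids = {
        json.loads(line)["_id"]
        for line in (task / "queries.jsonl").read_text().splitlines()
        if line.strip()
    }
    with (task / "qrels" / "test.tsv").open() as f:
        for query_id, corpus_id, score in csv.reader(f, delimiter="\t"):
            assert query_id in query_ids, f"unknown query id: {query_id}"
            assert corpus_id in corpus_ids, f"unknown corpus id: {corpus_id}"
            assert 0 <= int(score) <= 3, f"score out of range: {score}"

validate_task("evals/my-domain-task")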

Tasks use namespace prefixes to identify their source:

  • mteb/: MTEB built-in tasks, e.g. mteb/NFCorpus
  • beir/: BEIR benchmark tasks (via MTEB), e.g. beir/SciFact
  • custom/: custom tasks from the evals/ directory, e.g. custom/my-domain-task

The custom/ namespace maps to the evals/ directory in your project root.

Create a directory structure under evals/:

evals/
  my-domain-task/
    corpus.jsonl
    queries.jsonl
    qrels/
      test.tsv
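
If you prefer to create this layout programmatically, a short script like the following works. It is a convenience sketch (not part of the project tooling) that writes the example data from above; adjust the task name and contents:

import json
from pathlib import Path

task = Path("evals/my-domain-task")
(task / "qrels").mkdir(parents=True, exist_ok=True)

# One JSON object per line.
(task / "corpus.jsonl").write_text(
    json.dumps({"_id": "doc1", "title": "ML Basics", "text": "Machine learning uses algorithms to learn from data."}) + "\n"
)
(task / "queries.jsonl").write_text(
    json.dumps({"_id": "q1", "text": "What is machine learning?"}) + "\n"
)

# Tab-separated relevance judgments: query-id, corpus-id, score.
(task / "qrels" / "test.tsv").write_text("q1\tdoc1\t3\n")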

Run your custom task with either path syntax:

# Using custom/ namespace prefix
mise run eval BAAI/bge-m3 -t custom/my-domain-task --type quality
# Using direct path
mise run eval BAAI/bge-m3 -t evals/my-domain-task --type quality

The loader auto-detects custom tasks by checking for the custom/ prefix or evals/ path.
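
The exact detection logic lives in the loader, but conceptually it is a small mapping from task names to local paths. A hypothetical version (names are assumptions for illustration, not the real implementation):

from pathlib import Path

EVALS_DIR = Path("evals")  # the project-root directory behind the custom/ namespace

def resolve_custom_task(task_name: str) -> Path | None:
    """Return the local task directory for custom tasks, or None for mteb/ and beir/ tasks."""
    if task_name.startswith("custom/"):
        return EVALS_DIR / task_name.removeprefix("custom/")
    if task_name.startswith("evals/"):
        return Path(task_name)
    return None

print(resolve_custom_task("custom/my-domain-task"))  # evals/my-domain-task
print(resolve_custom_task("mteb/NFCorpus"))          # None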

The loader looks for qrels in this order: test.tsv, train.tsv, dev.tsv, then falls back to qrels.tsv at the task root.

evals/
  my-task/
    corpus.jsonl
    queries.jsonl
    qrels/
      test.tsv   # Used by default
      train.tsv  # For training set evaluation
      dev.tsv    # For development set

Both 3-column and 4-column (TREC) qrels formats are supported:

# 3-column: query-id, corpus-id, score
q1 doc1 3
# 4-column (TREC): query-id, 0, corpus-id, score
q1 0 doc1 3
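
A plausible sketch of that lookup order and of handling both column layouts (illustrative only, not the actual loader code):

import csv
from pathlib import Path

def load_qrels(task_dir: str) -> dict[str, dict[str, int]]:
    """Return {query_id: {corpus_id: score}} from the first qrels file found."""
    task = Path(task_dir)
    candidates = [task / "qrels" / name for name in ("test.tsv", "train.tsv", "dev.tsv")]
    candidates.append(task / "qrels.tsv")  # fallback at the task root
    path = next((p for p in candidates if p.exists()), None)
    if path is None:
        raise FileNotFoundError(f"no qrels file found under {task}")
    qrels: dict[str, dict[str, int]] = {}
    with path.open() as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) == 4:      # TREC layout: query-id, 0, corpus-id, score
                query_id, _, corpus_id, score = row
            else:                  # 3-column layout: query-id, corpus-id, score
                query_id, corpus_id, score = row
            qrels.setdefault(query_id, {})[corpus_id] = int(score)
    return qrels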

Log evaluation results to Weights & Biases for experiment tracking. W&B works well for comparing model configurations and for A/B testing.

# Basic logging
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality \
--wandb-project sie-evals
# With team/entity
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality \
--wandb-project sie-evals --wandb-entity my-team

Install wandb first:

pip install wandb
wandb login

W&B dashboard tips:

  • Filter by model tag to compare different models
  • Use parallel coordinates to visualize metric trade-offs
  • Compare runs with different LoRA adapters
  • Filter by task tag to see performance on specific benchmarks
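
Under the hood, logging to W&B amounts to creating a run with the model and task as config/tags and logging the metrics dict. A minimal sketch (project, tags, metric names, and values are placeholders, not the integration's actual output):

import wandb

results = {"ndcg_at_10": 0.712, "recall_at_100": 0.894}  # placeholder metric values

run = wandb.init(
    project="sie-evals",
    entity="my-team",  # optional team/entity
    config={"model": "BAAI/bge-m3", "task": "mteb/NFCorpus"},
    tags=["BAAI/bge-m3", "mteb/NFCorpus"],  # tags drive the dashboard filters mentioned above
)
run.log(results)
run.finish()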

Log to MLflow for self-hosted experiment tracking. MLflow works with local storage or a remote tracking server.

# Local tracking (saves to ./mlruns)
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality \
--mlflow-experiment embedding-evals
# Remote MLflow server
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality \
--mlflow-experiment embedding-evals \
--mlflow-uri http://mlflow.internal:5000

Install mlflow first:

pip install mlflow

MLflow notes:

  • Parameters are flattened automatically (nested dicts become dot notation; see the sketch below)
  • Artifacts are stored in the configured artifact store (local, S3, GCS, or Azure)
  • Run URLs only work with a tracking server, not local file storage
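
To illustrate the flattening note above: nested parameters become dot-separated keys before they are logged. A minimal sketch (the helper and the values are illustrative, not the actual implementation):

import mlflow

def flatten(params: dict, prefix: str = "") -> dict:
    """Turn {"model": {"name": "bge-m3"}} into {"model.name": "bge-m3"}."""
    flat = {}
    for key, value in params.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

mlflow.set_tracking_uri("http://mlflow.internal:5000")  # omit for local ./mlruns storage
mlflow.set_experiment("embedding-evals")

with mlflow.start_run():
    mlflow.log_params(flatten({"model": {"name": "BAAI/bge-m3"}, "task": "mteb/NFCorpus"}))
    mlflow.log_metrics({"ndcg_at_10": 0.712})  # placeholder value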