Custom Evals

Custom evals let you benchmark models on your domain-specific data. Create tasks in MTEB v2 format and run them alongside standard benchmarks.

Custom tasks use the MTEB v2 format with three files:

  • corpus.jsonl: the documents to search, one JSON object per line, e.g. {"_id": "doc1", "title": "optional", "text": "document text"}
  • queries.jsonl: the queries to evaluate, e.g. {"_id": "q1", "text": "query text"}
  • qrels/test.tsv: relevance judgments on a 0-3 scale, one tab-separated query-id<tab>corpus-id<tab>score line per judgment

Example corpus.jsonl:

{"_id": "doc1", "title": "ML Basics", "text": "Machine learning uses algorithms to learn from data."}
{"_id": "doc2", "text": "The weather forecast predicts rain tomorrow."}

Example queries.jsonl:

{"_id": "q1", "text": "What is machine learning?"}
{"_id": "q2", "text": "How do neural networks work?"}

Example qrels/test.tsv (columns are tab-separated):

q1 doc1 3
q1 doc2 0

Scores follow TREC conventions: 3 = highly relevant, 2 = relevant, 1 = marginally relevant, 0 = not relevant.
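
Before running an eval, it can help to sanity-check that the three files agree with one another. The sketch below is illustrative only (it is not part of the eval tooling) and assumes the 3-column qrels layout shown above:

import csv
import json
from pathlib import Path

def validate_task(task_dir: str) -> None:
    """Check that every qrels row references a known query and document."""
    task = Path(task_dir)
    corpus_ids = {
        json.loads(line)["_id"]
        for line in (task / "corpus.jsonl").read_text().splitlines()
        if line.strip()
    }
    query_ids = {
        json.loads(line)["_id"]
        for line in (task / "queries.jsonl").read_text().splitlines()
        if line.strip()
    }
    with (task / "qrels" / "test.tsv").open() as f:
        for query_id, corpus_id, score in csv.reader(f, delimiter="\t"):
            assert query_id in query_ids, f"unknown query id: {query_id}"
            assert corpus_id in corpus_ids, f"unknown corpus id: {corpus_id}"
            assert 0 <= int(score) <= 3, f"score out of range: {score}"

validate_task("evals/my-domain-task")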

Tasks use namespace prefixes to identify their source:

  • mteb/: MTEB built-in tasks, e.g. mteb/NFCorpus
  • beir/: BEIR benchmark tasks (via MTEB), e.g. beir/SciFact
  • custom/: custom tasks from the evals/ directory, e.g. custom/my-domain-task

The custom/ namespace maps to the evals/ directory in your project root.

Create a directory structure under evals/:

evals/
  my-domain-task/
    corpus.jsonl
    queries.jsonl
    qrels/
      test.tsv
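
If you prefer to create this layout programmatically, a short script like the following works. It is a convenience sketch (not part of the project tooling) that writes the example data from above; adjust the task name and contents:

import json
from pathlib import Path

task = Path("evals/my-domain-task")
(task / "qrels").mkdir(parents=True, exist_ok=True)

# One JSON object per line.
(task / "corpus.jsonl").write_text(
    json.dumps({"_id": "doc1", "title": "ML Basics", "text": "Machine learning uses algorithms to learn from data."}) + "\n"
)
(task / "queries.jsonl").write_text(
    json.dumps({"_id": "q1", "text": "What is machine learning?"}) + "\n"
)

# Tab-separated relevance judgments: query-id, corpus-id, score.
(task / "qrels" / "test.tsv").write_text("q1\tdoc1\t3\n")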

Run your custom task with either path syntax:

# Using custom/ namespace prefix
mise run eval BAAI/bge-m3 -t custom/my-domain-task --type quality
# Using direct path
mise run eval BAAI/bge-m3 -t evals/my-domain-task --type quality

The loader auto-detects custom tasks by checking for the custom/ prefix or evals/ path.
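
The exact detection logic lives in the loader, but conceptually it is a small mapping from task names to local paths. A hypothetical version (names are assumptions for illustration, not the real implementation):

from pathlib import Path

EVALS_DIR = Path("evals")  # the project-root directory behind the custom/ namespace

def resolve_custom_task(task_name: str) -> Path | None:
    """Return the local task directory for custom tasks, or None for mteb/ and beir/ tasks."""
    if task_name.startswith("custom/"):
        return EVALS_DIR / task_name.removeprefix("custom/")
    if task_name.startswith("evals/"):
        return Path(task_name)
    return None

print(resolve_custom_task("custom/my-domain-task"))  # evals/my-domain-task
print(resolve_custom_task("mteb/NFCorpus"))          # None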

The loader looks for qrels in this order: test.tsv, train.tsv, dev.tsv, then falls back to qrels.tsv at the task root.

evals/
  my-task/
    corpus.jsonl
    queries.jsonl
    qrels/
      test.tsv   # Used by default
      train.tsv  # For training set evaluation
      dev.tsv    # For development set

Both 3-column and 4-column (TREC) qrels formats are supported:

# 3-column: query-id, corpus-id, score
q1 doc1 3
# 4-column (TREC): query-id, 0, corpus-id, score
q1 0 doc1 3
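
A plausible sketch of that lookup order and of handling both column layouts (illustrative only, not the actual loader code):

import csv
from pathlib import Path

def load_qrels(task_dir: str) -> dict[str, dict[str, int]]:
    """Return {query_id: {corpus_id: score}} from the first qrels file found."""
    task = Path(task_dir)
    candidates = [task / "qrels" / name for name in ("test.tsv", "train.tsv", "dev.tsv")]
    candidates.append(task / "qrels.tsv")  # fallback at the task root
    path = next((p for p in candidates if p.exists()), None)
    if path is None:
        raise FileNotFoundError(f"no qrels file found under {task}")
    qrels: dict[str, dict[str, int]] = {}
    with path.open() as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) == 4:      # TREC layout: query-id, 0, corpus-id, score
                query_id, _, corpus_id, score = row
            else:                  # 3-column layout: query-id, corpus-id, score
                query_id, corpus_id, score = row
            qrels.setdefault(query_id, {})[corpus_id] = int(score)
    return qrels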

Log evaluation results to Weights & Biases for experiment tracking. W&B works well for comparing model configurations and for A/B testing.

# Basic logging
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality \
--wandb-project sie-evals
# With team/entity
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality \
--wandb-project sie-evals --wandb-entity my-team

Install wandb first:

pip install wandb
wandb login

W&B dashboard tips:

  • Filter by model tag to compare different models
  • Use parallel coordinates to visualize metric trade-offs
  • Compare runs with different LoRA adapters
  • Filter by task tag to see performance on specific benchmarks
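
Under the hood, logging to W&B amounts to creating a run with the model and task as config/tags and logging the metrics dict. A minimal sketch (project, tags, metric names, and values are placeholders, not the integration's actual output):

import wandb

results = {"ndcg_at_10": 0.712, "recall_at_100": 0.894}  # placeholder metric values

run = wandb.init(
    project="sie-evals",
    entity="my-team",  # optional team/entity
    config={"model": "BAAI/bge-m3", "task": "mteb/NFCorpus"},
    tags=["BAAI/bge-m3", "mteb/NFCorpus"],  # tags drive the dashboard filters mentioned above
)
run.log(results)
run.finish()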

Log to MLflow for self-hosted experiment tracking. MLflow works with local storage or a remote tracking server.

# Local tracking (saves to ./mlruns)
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality \
--mlflow-experiment embedding-evals
# Remote MLflow server
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality \
--mlflow-experiment embedding-evals \
--mlflow-uri http://mlflow.internal:5000

Install mlflow first:

pip install mlflow

MLflow notes:

  • Parameters are flattened automatically (nested dicts become dot notation; see the sketch below)
  • Artifacts are stored in the configured artifact store (local, S3, GCS, or Azure)
  • Run URLs only work with a tracking server, not local file storage
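
To illustrate the flattening note above: nested parameters become dot-separated keys before they are logged. A minimal sketch (the helper and the values are illustrative, not the actual implementation):

import mlflow

def flatten(params: dict, prefix: str = "") -> dict:
    """Turn {"model": {"name": "bge-m3"}} into {"model.name": "bge-m3"}."""
    flat = {}
    for key, value in params.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

mlflow.set_tracking_uri("http://mlflow.internal:5000")  # omit for local ./mlruns storage
mlflow.set_experiment("embedding-evals")

with mlflow.start_run():
    mlflow.log_params(flatten({"model": {"name": "BAAI/bge-m3"}, "task": "mteb/NFCorpus"}))
    mlflow.log_metrics({"ndcg_at_10": 0.712})  # placeholder value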