Custom Evals
Custom evals let you benchmark models on your domain-specific data. Create tasks in MTEB v2 format and run them alongside standard benchmarks.
Custom Task Format
Custom tasks use the MTEB v2 format with three files:
| File | Format | Description |
|---|---|---|
| `corpus.jsonl` | `{"_id": "doc1", "title": "optional", "text": "document text"}` | Documents to search |
| `queries.jsonl` | `{"_id": "q1", "text": "query text"}` | Queries to evaluate |
| `qrels/test.tsv` | `query-id<tab>corpus-id<tab>score` | Relevance judgments (0-3 scale) |
Example corpus.jsonl:
{"_id": "doc1", "title": "ML Basics", "text": "Machine learning uses algorithms to learn from data."}{"_id": "doc2", "text": "The weather forecast predicts rain tomorrow."}Example queries.jsonl:
{"_id": "q1", "text": "What is machine learning?"}{"_id": "q2", "text": "How do neural networks work?"}Example qrels/test.tsv:
q1 doc1 3q1 doc2 0Scores follow TREC conventions: 3 = highly relevant, 2 = relevant, 1 = marginally relevant, 0 = not relevant.
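If you build tasks from existing data, it can be convenient to generate the three files from a script. Below is a minimal standard-library sketch (not a tool provided by this project) that writes the example task shown above:

```python
import json
from pathlib import Path

task_dir = Path("evals/my-domain-task")
(task_dir / "qrels").mkdir(parents=True, exist_ok=True)

corpus = [
    {"_id": "doc1", "title": "ML Basics", "text": "Machine learning uses algorithms to learn from data."},
    {"_id": "doc2", "text": "The weather forecast predicts rain tomorrow."},
]
queries = [{"_id": "q1", "text": "What is machine learning?"}]
# (query-id, corpus-id, score) triples on the 0-3 TREC scale
qrels = [("q1", "doc1", 3), ("q1", "doc2", 0)]

# One JSON object per line (JSONL)
with open(task_dir / "corpus.jsonl", "w") as f:
    for doc in corpus:
        f.write(json.dumps(doc) + "\n")

with open(task_dir / "queries.jsonl", "w") as f:
    for query in queries:
        f.write(json.dumps(query) + "\n")

# Tab-separated relevance judgments
with open(task_dir / "qrels" / "test.tsv", "w") as f:
    for query_id, corpus_id, score in qrels:
        f.write(f"{query_id}\t{corpus_id}\t{score}\n")
```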
Task Namespaces
Tasks use namespace prefixes to identify their source:
| Namespace | Description | Example |
|---|---|---|
| `mteb/` | MTEB built-in tasks | `mteb/NFCorpus` |
| `beir/` | BEIR benchmark tasks (via MTEB) | `beir/SciFact` |
| `custom/` | Custom tasks from the evals/ directory | `custom/my-domain-task` |
The custom/ namespace maps to the evals/ directory in your project root.
Adding Custom Tasks
Create a directory structure under evals/:
```
evals/
  my-domain-task/
    corpus.jsonl
    queries.jsonl
    qrels/
      test.tsv
```

Run your custom task with either path syntax:
```bash
# Using custom/ namespace prefix
mise run eval BAAI/bge-m3 -t custom/my-domain-task --type quality

# Using direct path
mise run eval BAAI/bge-m3 -t evals/my-domain-task --type quality
```

The loader auto-detects custom tasks by checking for the custom/ prefix or evals/ path.
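Conceptually, that detection is just a prefix check. The sketch below is illustrative only, not the project's actual loader code; `resolve_custom_task_dir` is a hypothetical helper name:

```python
from pathlib import Path

EVALS_DIR = Path("evals")  # project-root evals/ directory

def resolve_custom_task_dir(task_spec: str) -> Path | None:
    """Return the task directory for custom/... or evals/... specs, else None."""
    if task_spec.startswith("custom/"):
        return EVALS_DIR / task_spec.removeprefix("custom/")
    if task_spec.startswith("evals/"):
        return Path(task_spec)
    return None  # mteb/ and beir/ tasks are handled by MTEB itself

print(resolve_custom_task_dir("custom/my-domain-task"))  # evals/my-domain-task
print(resolve_custom_task_dir("mteb/NFCorpus"))          # None
```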
Multiple Splits
The loader looks for qrels in this order: test.tsv, train.tsv, dev.tsv, then falls back to qrels.tsv at the task root.
```
evals/
  my-task/
    corpus.jsonl
    queries.jsonl
    qrels/
      test.tsv   # Used by default
      train.tsv  # For training set evaluation
      dev.tsv    # For development set
```
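If you need to replicate this lookup order in your own tooling, it amounts to taking the first candidate path that exists. A small illustrative sketch (not the loader's actual implementation):

```python
from pathlib import Path

def find_qrels(task_dir: Path) -> Path:
    """Pick the qrels file following the documented lookup order."""
    candidates = [
        task_dir / "qrels" / "test.tsv",
        task_dir / "qrels" / "train.tsv",
        task_dir / "qrels" / "dev.tsv",
        task_dir / "qrels.tsv",  # fallback at the task root
    ]
    for path in candidates:
        if path.exists():
            return path
    raise FileNotFoundError(f"No qrels found for task at {task_dir}")
```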
TREC Format Support
Both 3-column and 4-column (TREC) qrels formats are supported:
```
# 3-column: query-id, corpus-id, score
q1	doc1	3

# 4-column (TREC): query-id, 0, corpus-id, score
q1	0	doc1	3
```
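The two layouts are easy to tell apart because a TREC line has four fields, with a literal 0 in the second position. A small illustrative parser sketch:

```python
def parse_qrels_line(line: str) -> tuple[str, str, int]:
    """Return (query_id, corpus_id, score) from a 3- or 4-column qrels line."""
    fields = line.split()
    if len(fields) == 4:           # TREC: query-id, 0, corpus-id, score
        query_id, _, corpus_id, score = fields
    else:                          # 3-column: query-id, corpus-id, score
        query_id, corpus_id, score = fields
    return query_id, corpus_id, int(score)

assert parse_qrels_line("q1\tdoc1\t3") == ("q1", "doc1", 3)
assert parse_qrels_line("q1\t0\tdoc1\t3") == ("q1", "doc1", 3)
```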
W&B Integration
Log evaluation results to Weights & Biases for experiment tracking. W&B is ideal for comparing model configurations and A/B testing.
```bash
# Basic logging
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality \
  --wandb-project sie-evals

# With team/entity
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality \
  --wandb-project sie-evals --wandb-entity my-team
```

Install wandb first:
```bash
pip install wandb
wandb login
```

W&B dashboard tips:
- Filter by model tag to compare different models
- Use parallel coordinates to visualize metric trade-offs
- Compare runs with different LoRA adapters
- Filter by task tag to see performance on specific benchmarks
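You can also log results to W&B from your own scripts using the standard wandb API; the project, config values, and metric names below are placeholders, not outputs of the eval CLI:

```python
import wandb

# Placeholder values; swap in your real model, task, and scores.
run = wandb.init(
    project="sie-evals",
    entity="my-team",  # optional team/entity
    config={"model": "BAAI/bge-m3", "task": "mteb/NFCorpus", "type": "quality"},
    tags=["BAAI/bge-m3", "mteb/NFCorpus"],
)
wandb.log({"ndcg_at_10": 0.325, "recall_at_100": 0.612})  # placeholder metrics
run.finish()
```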
MLflow Integration
Log to MLflow for self-hosted experiment tracking. MLflow works with local storage or a remote tracking server.
```bash
# Local tracking (saves to ./mlruns)
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality \
  --mlflow-experiment embedding-evals

# Remote MLflow server
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality \
  --mlflow-experiment embedding-evals \
  --mlflow-uri http://mlflow.internal:5000
```

Install mlflow first:
```bash
pip install mlflow
```

MLflow notes:
- Parameters are flattened automatically (nested dicts become dot notation)
- Artifacts are stored in the configured artifact store (local, S3, GCS, or Azure)
- Run URLs only work with a tracking server, not local file storage
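If you log from your own scripts, the standard MLflow API covers the same ground; the flatten helper below mirrors the dot-notation behavior described above, and the experiment name, URI, run name, and metric values are placeholders:

```python
import mlflow

def flatten(params: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dot-notation keys, e.g. model -> model.name."""
    flat = {}
    for key, value in params.items():
        full_key = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{full_key}."))
        else:
            flat[full_key] = value
    return flat

mlflow.set_tracking_uri("http://mlflow.internal:5000")  # omit for local ./mlruns
mlflow.set_experiment("embedding-evals")

with mlflow.start_run(run_name="bge-m3-nfcorpus"):
    mlflow.log_params(flatten({"model": {"name": "BAAI/bge-m3"}, "task": "mteb/NFCorpus"}))
    mlflow.log_metrics({"ndcg_at_10": 0.325})  # placeholder metric
```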
What’s Next
- Evals Overview - benchmark-driven development philosophy
- Performance Evals - latency and throughput benchmarks