Quality Evaluation
Quality evaluation runs MTEB tasks against your SIE server. It measures retrieval quality using standard metrics like NDCG@10 and MAP@10.
MTEB/BEIR Tasks
SIE supports all MTEB retrieval tasks. Tasks are referenced with a namespace prefix (mteb/ or beir/) and an optional language subset suffix.
```bash
# Standard MTEB retrieval tasks
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality
mise run eval BAAI/bge-m3 -t mteb/NanoFiQA2018Retrieval --type quality

# BEIR namespace
mise run eval BAAI/bge-m3 -t beir/SciFact --type quality

# Multilingual tasks with language subset
mise run eval BAAI/bge-m3 -t mteb/Vidore3HrRetrieval/english --type quality
```

Common retrieval tasks:
| Task | Domain | Size | Description |
|---|---|---|---|
| mteb/NFCorpus | Medical | 3.6K docs | Biomedical literature retrieval |
| mteb/NanoFiQA2018Retrieval | Finance | 57K docs | Financial question answering |
| beir/SciFact | Scientific | 5K docs | Claim verification |
| mteb/MSMARCO | Web | 8.8M docs | Web search queries |
Running Quality Evals
Run quality evaluation with the --type quality flag. The eval harness starts and stops servers automatically.
```bash
# Basic quality evaluation
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality

# Evaluate with a specific profile
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality --profile sparse
```

Output shows scores for each metric:

```
## Evaluating BAAI/bge-m3 on mteb/NFCorpus (quality)
Sources: sie

Source   ndcg_at_10   map_at_10   mrr_at_10
sie      0.3144       0.1174      0.5243
```

Comparing Sources
Compare SIE against other inference backends or published benchmarks using the -s flag.
```bash
# Compare SIE vs TEI
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,tei

# Compare SIE vs published MTEB leaderboard scores
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,benchmark

# Compare SIE vs stored targets
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,targets
```

Available sources:
| Source | Description |
|---|---|
| sie | SIE server (started automatically) |
| tei | HuggingFace Text Embeddings Inference |
| infinity | Infinity inference server |
| fastembed | FastEmbed library |
| benchmark | Published MTEB leaderboard scores |
| targets | Stored targets from model config |
| measurements | Past SIE measurements from model config |
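Sources are given to -s as a comma-separated list, so more than two can be compared in one run. A small sketch, assuming the flag accepts any number of sources:

```bash
# Compare SIE, TEI, and published leaderboard scores side by side
# (assumes -s accepts more than two comma-separated sources)
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,tei,benchmark
```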
Targets in Configs
Each model config stores quality targets under the targets.quality section. Targets come from authoritative sources such as the MTEB leaderboard or comparison runs.
```yaml
targets:
  quality:
    mteb-leaderboard/mteb/NFCorpus:
      ndcg_at_10: 0.3141
      map_at_10: 0.1172
      mrr_at_10: 0.5232
```

The key format is source/namespace/task, where source identifies the origin (e.g., mteb-leaderboard, tei@1.8.3).
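For illustration, a target captured from a TEI comparison run would use a versioned source prefix in the same key format (the metric values below are placeholders, not measurements):

```yaml
targets:
  quality:
    tei@1.8.3/mteb/NFCorpus:   # source/namespace/task
      ndcg_at_10: 0.3140       # placeholder values for illustration only
      map_at_10: 0.1170
      mrr_at_10: 0.5230
```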
Measurements from SIE runs are stored separately under measurements.quality:
```yaml
measurements:
  quality:
    sie@11a9c5d/default/mteb/NFCorpus:
      ndcg_at_10: 0.31437
      map_at_10: 0.11743
      mrr_at_10: 0.5243
```

Saving Targets
Capture results from any source and save them as targets using --save-targets.
```bash
# Save TEI results as quality targets
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality --save-targets tei

# Save MTEB benchmark scores as targets
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality --save-targets benchmark
```

Save SIE results as measurements (for tracking your own baselines):

```bash
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality --save-measurements sie
```

Saved metrics include ndcg_at_10, map_at_10, and mrr_at_10. The source identifier and git commit hash are recorded for traceability.
CI Integration
Use --check-targets in CI to catch quality regressions. The command exits non-zero if SIE scores fall below targets.
```bash
# CI command: fails if quality regresses
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,targets --check-targets
```

SIE must achieve at least 99% of the target score (configurable via quality_margin); for a target ndcg_at_10 of 0.3141, the pass threshold is 0.3141 × 0.99 ≈ 0.3110. Example output:

```
PASS: ndcg_at_10: 0.3144 >= 0.3110 (target: 0.3141)
PASS: map_at_10: 0.1174 >= 0.1160 (target: 0.1172)

Target check PASSED
```

For stricter regression detection against past SIE runs, use --check-measurements:
```bash
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,measurements --check-measurements
```

This uses a 98% margin, detecting regressions in your own implementation.
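As one possible way to wire the check into a pipeline, the sketch below shows a minimal GitHub Actions job that runs the target check. The workflow layout and the mise setup step (jdx/mise-action) are assumptions, not part of SIE; adapt the runner, model, and task to your project.

```yaml
# Hypothetical CI job; assumes mise and the eval tasks are available in the
# repository and the runner can host the SIE server for the chosen model.
jobs:
  quality-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: jdx/mise-action@v2   # installs mise and project tooling
      - name: Check quality targets
        run: >-
          mise run eval BAAI/bge-m3 -t mteb/NFCorpus
          --type quality -s sie,targets --check-targets
```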
What’s Next
- Performance Evaluation - Measure throughput and latency
- Model Catalog - Supported models and their targets