
Quality Evaluation

Quality evaluation runs MTEB tasks against your SIE server. It measures retrieval quality using standard metrics like NDCG@10 and MAP@10.
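
For intuition, NDCG@10 rewards rankings that place relevant documents near the top of the first ten results. A minimal sketch of the formula in plain Python (illustrative only; the eval harness computes these metrics through MTEB):

import math

def ndcg_at_10(relevances):
    # relevances: graded relevance of the retrieved docs, in rank order.
    def dcg(rels):
        # Each gain is discounted by log2(rank + 1); rank 1 counts fully.
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:10]))
    # Simplification: the ideal ordering here uses only the retrieved list;
    # real implementations also account for relevant docs that were missed.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg_at_10([1, 0, 0]))  # 1.0: the relevant doc is ranked first
print(ndcg_at_10([0, 0, 1]))  # 0.5: the same doc at rank 3 scores lower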

SIE supports all MTEB retrieval tasks. Tasks use a namespace format with optional subset filtering.

# Standard MTEB retrieval tasks
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality
mise run eval BAAI/bge-m3 -t mteb/NanoFiQA2018Retrieval --type quality
# BEIR namespace
mise run eval BAAI/bge-m3 -t beir/SciFact --type quality
# Multilingual tasks with language subset
mise run eval BAAI/bge-m3 -t mteb/Vidore3HrRetrieval/english --type quality

Common retrieval tasks:

Task                          Domain      Size       Description
mteb/NFCorpus                 Medical     3.6K docs  Biomedical literature retrieval
mteb/NanoFiQA2018Retrieval    Finance     57K docs   Financial question answering
beir/SciFact                  Scientific  5K docs    Claim verification
mteb/MSMARCO                  Web         8.8M docs  Web search queries

Run quality evaluation with the --type quality flag. The eval harness starts and stops servers automatically.

# Basic quality evaluation
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality
# Evaluate with a specific profile
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality --profile sparse

Output shows scores for each metric:

## Evaluating BAAI/bge-m3 on mteb/NFCorpus (quality)
Sources: sie
Source    ndcg_at_10   map_at_10   mrr_at_10
sie           0.3144      0.1174      0.5243

Compare SIE against other inference backends or published benchmarks using the -s flag.

# Compare SIE vs TEI
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,tei
# Compare SIE vs published MTEB leaderboard scores
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,benchmark
# Compare SIE vs stored targets
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,targets

Available sources:

Source        Description
sie           SIE server (started automatically)
tei           HuggingFace Text Embeddings Inference
infinity      Infinity inference server
fastembed     FastEmbed library
benchmark     Published MTEB leaderboard scores
targets       Stored targets from model config
measurements  Past SIE measurements from model config
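
Conceptually, a multi-source comparison lines up the same metrics keyed by source. A toy Python sketch of such a table (the scores reuse example numbers from this page; the rendering code is ours, not SIE's):

# Scores taken from the example output and targets shown on this page.
results = {
    "sie":       {"ndcg_at_10": 0.3144, "map_at_10": 0.1174, "mrr_at_10": 0.5243},
    "benchmark": {"ndcg_at_10": 0.3141, "map_at_10": 0.1172, "mrr_at_10": 0.5232},
}
metrics = ["ndcg_at_10", "map_at_10", "mrr_at_10"]
print(f"{'Source':<12}" + "".join(f"{m:>12}" for m in metrics))
for source, scores in results.items():
    print(f"{source:<12}" + "".join(f"{scores[m]:>12.4f}" for m in metrics))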

Each model config stores quality targets under the targets.quality section. Targets come from authoritative sources like the MTEB leaderboard or comparison runs.

packages/sie_server/models/baai-bge-m3.yaml
targets:
  quality:
    mteb-leaderboard/mteb/NFCorpus:
      ndcg_at_10: 0.3141
      map_at_10: 0.1172
      mrr_at_10: 0.5232

The key format is source/namespace/task, where source identifies the origin of the scores (e.g., mteb-leaderboard or tei@1.8.3).
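
Only the first path segment is the source; everything after it is the namespaced task. A quick illustration (the helper name is ours, not part of SIE):

def split_quality_key(key):
    # Split on the first '/' only: the task itself contains slashes.
    source, task = key.split("/", 1)
    return source, task

print(split_quality_key("mteb-leaderboard/mteb/NFCorpus"))  # ('mteb-leaderboard', 'mteb/NFCorpus')
print(split_quality_key("tei@1.8.3/mteb/NFCorpus"))         # ('tei@1.8.3', 'mteb/NFCorpus')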

Measurements from SIE runs are stored separately under measurements.quality:

measurements:
  quality:
    sie@11a9c5d/default/mteb/NFCorpus:
      ndcg_at_10: 0.31437
      map_at_10: 0.11743
      mrr_at_10: 0.5243

Capture results from any source and save them as targets using --save-targets.

# Save TEI results as quality targets
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality --save-targets tei
# Save MTEB benchmark scores as targets
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality --save-targets benchmark

Save SIE results as measurements (for tracking your own baselines):

mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality --save-measurements sie

Saved metrics include ndcg_at_10, map_at_10, and mrr_at_10. The source identifier and git commit hash are recorded for traceability.

Use --check-targets in CI to catch quality regressions. The command exits non-zero if SIE scores fall below targets.

# CI command: fails if quality regresses
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,targets --check-targets

SIE must achieve at least 99% of the target score (configurable via quality_margin). Example output:

PASS: ndcg_at_10: 0.3144 >= 0.3110 (target: 0.3141)
PASS: map_at_10: 0.1174 >= 0.1160 (target: 0.1172)
Target check PASSED
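
The pass/fail rule amounts to score >= margin * target for every metric. A minimal sketch of that logic in Python (our illustration, not SIE's source; only the margin values come from the text):

def check(measured, targets, margin=0.99):
    # A metric passes when the measured score reaches margin * target.
    ok = True
    for metric, target in targets.items():
        threshold = target * margin
        passed = measured[metric] >= threshold
        print(f"{'PASS' if passed else 'FAIL'}: {metric}: "
              f"{measured[metric]:.4f} >= {threshold:.4f} (target: {target:.4f})")
        ok = ok and passed
    return ok

measured = {"ndcg_at_10": 0.3144, "map_at_10": 0.1174}
targets  = {"ndcg_at_10": 0.3141, "map_at_10": 0.1172}
check(measured, targets)        # 99% margin, as with --check-targets
check(measured, targets, 0.98)  # 98% margin, as with --check-measurements below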

To catch regressions against your own past SIE runs, use --check-measurements:

mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,measurements --check-measurements

This check uses a 98% margin and compares against your own recorded measurements, so it flags regressions introduced by changes to SIE itself.