SIE

Overview

Evals measure model quality and performance. Every model has saved baseline targets, and CI fails when results drift below those targets.

Models break silently. A new dependency, a driver update, or a refactor can degrade quality without triggering errors. SIE solves this with benchmark-driven development:

  1. Capture targets. Run evals and save results as baseline targets in model configs.
  2. Check in CI. Automated pipelines compare current results against saved targets.
  3. Fail on drift. If quality drops below 99% of target (or latency exceeds 250% of target), CI fails.

This catches regressions before they reach production.
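For example, the full loop uses the same eval commands detailed in the rest of this page (the model and task names here are only illustrative):

```sh
# 1. Capture targets: save trusted results as baseline targets in the model config
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality --save-targets sie

# 2-3. Check in CI and fail on drift: compare current results against the saved targets
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,targets --check-targets
```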

Quality evals measure correctness (NDCG, F1, AP). Performance evals measure speed (latency, throughput). They run different harnesses and have different targets.

| Type | Metrics | Use Case |
| --- | --- | --- |
| quality | ndcg@10, map@10, f1, precision, recall | Verify model outputs match expected results |
| perf | p50/p99 latency (ms), throughput (tok/s) | Verify latency SLAs and throughput targets |

Run quality evals after model changes. Run performance evals after infrastructure changes.

The sie-bench CLI runs evaluations. Use mise run eval to invoke it:

```sh
# Quality evaluation
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality

# Performance evaluation
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type perf

# Compare multiple sources
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,tei,benchmark
```

For mise tasks with built-in flags, pass flags directly (no --). Use -- only when a task explicitly forwards raw arguments.

| Option | Description |
| --- | --- |
| -t, --task | Namespaced task (e.g., mteb/NFCorpus, beir/SciFact) |
| --type | Evaluation type: quality or perf |
| -s, --sources | Comma-separated sources to compare (default: sie) |
| -b, --batch-size | Batch size for performance evaluation (default: 1) |
| -c, --concurrency | Concurrency level (default: 16) |
| -p, --profile | Named profile from model config (e.g., sparse, muvera) |
| --save-targets | Save results from a source as targets |
| --check-targets | Exit non-zero if results fall below targets |
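These flags can be combined in a single invocation. As a sketch, a performance run that sets batch size, concurrency, and a profile might look like this (the batch size value is illustrative; the sparse profile is the example from the table above):

```sh
# Performance run with batch size 32, 16 concurrent requests, and the "sparse" profile
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type perf -b 32 -c 16 -p sparse
```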

Sources determine where results come from. The eval harness starts and stops servers automatically.

| Source | Description |
| --- | --- |
| sie | SIE inference server (default) |
| tei | Text Embeddings Inference (Hugging Face) |
| infinity | Infinity embedding server |
| benchmark | Published scores from MTEB leaderboard |
| targets | Saved targets from model config |
| measurements | Past SIE measurements from model config |

Compare SIE against alternatives:

```sh
# Compare SIE vs TEI
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,tei

# Compare SIE vs published benchmark
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,benchmark
```
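Any source from the table combines the same way. For instance, comparing SIE against the Infinity server would presumably look like:

```sh
# Compare SIE vs Infinity
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,infinity
```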

Capture baseline targets from a trusted source:

```sh
# Save SIE results as targets
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality --save-targets sie

# Save measurements for regression detection
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality --save-measurements sie
```

Check current results against saved targets in CI:

```sh
# CI regression check
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,targets --check-targets

# Check against past measurements (tighter margins)
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,measurements --check-measurements
```

The --check-targets flag exits non-zero if SIE results fall below 99% of targets. The --check-measurements flag uses tighter margins (98%) for regression detection.
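In a CI pipeline, these checks typically run as a single step whose non-zero exit fails the job. A minimal sketch, assuming a bash-based CI step (only the eval commands come from this page; the surrounding wiring is illustrative):

```sh
#!/usr/bin/env bash
# Illustrative CI step: fail the build if SIE results regress.
set -euo pipefail

# Quality check against saved targets (fails below 99% of target)
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,targets --check-targets

# Tighter check against past measurements (fails below 98%)
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,measurements --check-measurements
```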