Overview
Evals measure model quality and performance. Every model has targets. CI fails when results drift below those targets.
Philosophy
Models break silently. A new dependency, a driver update, or a refactor can degrade quality without triggering errors. SIE solves this with benchmark-driven development:
- Capture targets. Run evals and save results as baseline targets in model configs.
- Check in CI. Automated pipelines compare current results against saved targets.
- Fail on drift. If quality drops below 99% of target (or latency exceeds 250% of target), CI fails.
This catches regressions before they reach production.
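Concretely, the workflow uses the commands covered in detail below: capture targets once from a trusted run, then gate CI on them. A minimal sketch:

```bash
# Capture baseline targets from a trusted SIE run (once, or after an intentional change)
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality --save-targets sie

# In CI: re-run the eval and fail if results fall below the saved targets
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,targets --check-targets
```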
Quality vs Performance
Quality evals measure correctness (NDCG, F1, AP). Performance evals measure speed (latency, throughput). They run different harnesses and have different targets.
| Type | Metrics | Use Case |
|---|---|---|
| quality | ndcg@10, map@10, f1, precision, recall | Verify model outputs match expected results |
| perf | p50/p99 latency (ms), throughput (tok/s) | Verify latency SLAs and throughput targets |
Run quality evals after model changes. Run performance evals after infrastructure changes.
sie-bench CLI
The sie-bench CLI runs evaluations. Use mise run eval to invoke it:
```bash
# Quality evaluation
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality

# Performance evaluation
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type perf

# Compare multiple sources
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,tei,benchmark
```

For mise tasks with built-in flags, pass flags directly (no --). Use -- only when a task explicitly forwards raw arguments.
Common Options
| Option | Description |
|---|---|
| -t, --task | Namespaced task (e.g., mteb/NFCorpus, beir/SciFact) |
| --type | Evaluation type: quality or perf |
| -s, --sources | Comma-separated sources to compare (default: sie) |
| -b, --batch-size | Batch size for performance evaluation (default: 1) |
| -c, --concurrency | Concurrency level (default: 16) |
| -p, --profile | Named profile from model config (e.g., sparse, muvera) |
| --save-targets | Save results from a source as targets |
| --check-targets | Exit non-zero if results fall below targets |
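These options compose in a single invocation. A hypothetical example (the profile name, batch size, and concurrency values are illustrative, not recommendations):

```bash
# Perf eval of the sparse profile with a larger batch and lower concurrency
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type perf -p sparse -b 32 -c 8
```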
Evaluation Sources
Sources determine where results come from. The eval harness starts and stops servers automatically.
| Source | Description |
|---|---|
| sie | SIE inference server (default) |
| tei | Text Embeddings Inference (Hugging Face) |
| infinity | Infinity embedding server |
| benchmark | Published scores from MTEB leaderboard |
| targets | Saved targets from model config |
| measurements | Past SIE measurements from model config |
Comparing Sources
Compare SIE against alternatives:
```bash
# Compare SIE vs TEI
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,tei

# Compare SIE vs published benchmark
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,benchmark
```
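Any source from the table above can be listed. For example, assuming Infinity serves the same model, SIE can be compared against it the same way:

```bash
# Compare SIE vs Infinity (assumes the model is available in Infinity)
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,infinity
```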
Saving and Checking Targets
Capture baseline targets from a trusted source:
```bash
# Save SIE results as targets
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality --save-targets sie

# Save measurements for regression detection
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality --save-measurements sie
```

Check current results against saved targets in CI:
```bash
# CI regression check
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,targets --check-targets

# Check against past measurements (tighter margins)
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,measurements --check-measurements
```

The --check-targets flag exits non-zero if SIE results fall below 99% of targets. The --check-measurements flag uses tighter margins (98%) for regression detection.
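In CI these checks typically run as one script step, so any non-zero exit fails the job. A minimal sketch, assuming quality and perf targets have already been saved for the model (running --check-targets with --type perf is an assumption based on the latency threshold described under Philosophy):

```bash
#!/usr/bin/env bash
set -euo pipefail  # any failed check fails the job

# Quality regression gate: fails below 99% of saved targets
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,targets --check-targets

# Performance regression gate (assumes perf targets were saved from a --type perf run)
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type perf -s sie,targets --check-targets
```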
What’s Next
- Quality Evals - retrieval, reranking, and extraction quality
- Performance Evals - latency and throughput benchmarks
- Custom Evals - create evaluation tasks for your data