# Performance Evaluation
Performance evaluation measures how fast models process requests under various conditions. The benchmark harness tracks latency percentiles, throughput in tokens per second, and behavior under load.
## Performance Metrics

SIE captures the following metrics during performance benchmarks:
| Metric | Description |
|---|---|
| p50 latency | Median response time in milliseconds |
| p90/p95/p99 latency | Tail latency percentiles for SLA planning |
| min/max latency | Range of observed response times |
| tokens/sec | Processing throughput for corpus and query workloads |
| items/sec | Request throughput (tokens/sec divided by average sequence length) |
Corpus throughput measures document encoding speed. Query throughput measures short-text encoding with `is_query=True`.
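The relationship between these metrics can be sketched as follows. This is an illustrative helper, not part of the SIE harness; the function name and the nearest-rank percentile method are assumptions:

```python
# Sketch of the reported metrics: latency percentiles plus the
# tokens/sec -> items/sec derivation described above (hypothetical helper).
def summarize(latencies_ms, total_tokens, wall_time_s, avg_seq_len):
    """Compute latency percentiles and throughput figures."""
    xs = sorted(latencies_ms)

    def pct(p):
        # Nearest-rank percentile over the observed latencies.
        return xs[min(len(xs) - 1, round(p / 100 * (len(xs) - 1)))]

    tokens_per_sec = total_tokens / wall_time_s
    return {
        "p50": pct(50), "p90": pct(90), "p95": pct(95), "p99": pct(99),
        "min": xs[0], "max": xs[-1],
        "tokens_per_sec": tokens_per_sec,
        # items/sec is tokens/sec divided by average sequence length.
        "items_per_sec": tokens_per_sec / avg_seq_len,
    }
```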
## Running Performance Evals

Use `--type perf` to run performance benchmarks instead of quality evaluations:

```sh
# Performance benchmark on SIE
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type perf

# Compare SIE vs TEI performance
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type perf -s sie,tei

# Save results as baseline measurements
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type perf -s sie --save-measurements sie
```

The eval harness automatically starts and stops servers; do not start Docker containers manually.
### Performance Options

| Option | Default | Description |
|---|---|---|
| `--batch-size` | 1 | Items per request |
| `--concurrency` | 16 | Number of parallel clients |
| `--device` | `cuda:0` | Inference device |
| `--timeout` | 120.0 | Request timeout in seconds |
## Load Testing

For sustained load testing against a cluster, use the `loadtest` command with a YAML scenario file:

```sh
mise run sie-bench -- loadtest scenario.yaml --cluster http://router:8080
```

The load test harness provides a live progress display showing:
- Current and target requests per second
- Rolling p50 and p99 latency
- Success and error counts
- Per-model request distribution
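The rolling p50/p99 figures shown in the live display can be approximated with a bounded sample window. A minimal sketch; the class name and window size are assumptions, not harness internals:

```python
from collections import deque

class RollingLatency:
    """Bounded window of recent latency samples for rolling percentiles."""

    def __init__(self, window=1000):
        # Only the most recent `window` samples are retained.
        self.samples = deque(maxlen=window)

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def percentile(self, p):
        # Nearest-rank percentile over the current window.
        xs = sorted(self.samples)
        return xs[min(len(xs) - 1, round(p / 100 * (len(xs) - 1)))]
```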
### Scenario Configuration

```yaml
models:
  - BAAI/bge-m3
  - BAAI/bge-large-en-v1.5
gpu_types:
  - l4
load_profile:
  pattern: constant
  target_rps: 100
concurrency: 32
duration_s: 300
warmup_s: 30
batch_size: 1
timeout_s: 30.0
```

### Load Patterns

The load test harness supports four traffic patterns:
| Pattern | Behavior |
|---|---|
| constant | Fixed RPS throughout the test duration |
| ramp | Gradually increase from 0 to target RPS over ramp_duration_s |
| step | Step-wise increase at 25%, 50%, 75%, and 100% of target RPS |
| spike | Normal traffic with periodic spikes at spike_multiplier intensity |
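The four patterns can be sketched as a function from elapsed time to target RPS. This is illustrative only; parameter names mirror the scenario fields, but the logic (e.g. equal-width steps) is an assumption, not the harness's actual implementation:

```python
# Hypothetical sketch of the four load patterns described above.
def target_rps_at(t, pattern, target_rps, duration_s,
                  ramp_duration_s=60,
                  step_levels=(0.25, 0.5, 0.75, 1.0),
                  spike_multiplier=3.0, spike_interval_s=60, spike_duration_s=10):
    if pattern == "constant":
        # Fixed RPS throughout the test duration.
        return target_rps
    if pattern == "ramp":
        # Linear increase from 0 to target over ramp_duration_s, then hold.
        return target_rps * min(1.0, t / ramp_duration_s)
    if pattern == "step":
        # Step through each fraction of target RPS in equal-width intervals.
        step = min(int(t / (duration_s / len(step_levels))), len(step_levels) - 1)
        return target_rps * step_levels[step]
    if pattern == "spike":
        # Baseline traffic with periodic bursts at spike_multiplier intensity.
        in_spike = (t % spike_interval_s) < spike_duration_s
        return target_rps * (spike_multiplier if in_spike else 1.0)
    raise ValueError(f"unknown pattern: {pattern}")
```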
### Pattern Examples

```yaml
# Constant load at 100 RPS
load_profile:
  pattern: constant
  target_rps: 100
```

```yaml
# Ramp from 0 to 200 RPS over 60 seconds
load_profile:
  pattern: ramp
  target_rps: 200
  ramp_duration_s: 60
```

```yaml
# Step through 25/50/75/100 RPS
load_profile:
  pattern: step
  target_rps: 100
  step_levels: [0.25, 0.5, 0.75, 1.0]
```

```yaml
# Normal at 50 RPS with 3x spikes every 60 seconds
load_profile:
  pattern: spike
  target_rps: 50
  spike_multiplier: 3.0
  spike_duration_s: 10
  spike_interval_s: 60
```

## Matrix Evaluation

Matrix evaluation runs benchmarks across multiple models, tasks, and GPU types in parallel:
```sh
mise run sie-bench -- matrix config.yaml --cluster http://router:8080 --workers 2
```

### Matrix Configuration

```yaml
models:
  - BAAI/bge-m3
  - model: BAAI/bge-large-en-v1.5
    profiles: all
  - bundle: default
tasks:
  - mteb/NFCorpus
  - mteb/SciFact
gpus:
  - l4
  - a100-80gb
type:
  - quality
  - perf
perf:
  batch_size: 1
  concurrency: 16
  timeout: 120.0
```

Matrix mode creates isolated resource pools per GPU type and runs evaluations concurrently.
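The cross-product expansion and per-GPU pooling can be sketched as follows. This is a hypothetical illustration of the behavior described above (the `run_eval` callable and function name are assumptions), not the harness's code:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Sketch: cross models x tasks x eval types, with an isolated worker
# pool per GPU type so each GPU's jobs run concurrently.
def run_matrix(models, tasks, gpus, eval_types, run_eval, workers=2):
    jobs = list(product(models, tasks, eval_types))
    results = {}
    for gpu in gpus:
        # One pool per GPU type keeps its queue isolated from the others.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = {pool.submit(run_eval, m, t, e, gpu): (m, t, e)
                       for m, t, e in jobs}
            for fut, key in futures.items():
                results[key + (gpu,)] = fut.result()
    return results
```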
### Model Specifications

Models can be specified in three ways:
| Format | Example | Description |
|---|---|---|
| String | `BAAI/bge-m3` | Single model with default profile |
| Dict with profiles | `{model: bge-m3, profiles: all}` | Model with specific or all profiles |
| Bundle | `{bundle: default}` | All models in a bundle |
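A consumer of the config might normalize the three formats like this. A minimal sketch assuming a hypothetical `normalize_model_spec` helper and a `bundles` mapping from bundle name to model list:

```python
# Hypothetical normalizer for the three model-spec formats above.
def normalize_model_spec(spec, bundles):
    """Expand a spec into a list of (model, profiles) pairs."""
    if isinstance(spec, str):
        # Bare string: single model with the default profile.
        return [(spec, "default")]
    if "model" in spec:
        # Dict form: explicit model with specific or all profiles.
        return [(spec["model"], spec.get("profiles", "default"))]
    if "bundle" in spec:
        # Bundle form: expands to every model in the named bundle.
        return [(m, "default") for m in bundles[spec["bundle"]]]
    raise ValueError(f"unrecognized model spec: {spec!r}")
```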
## Load Test Reports

After a load test completes, the harness generates Markdown and JSON reports:

```sh
mise run sie-bench -- loadtest scenario.yaml --cluster http://router:8080 --output ./reports
```

Reports include:
- Configuration summary
- Overall request counts and success rate
- Latency percentiles (p50, p90, p95, p99, min, max, mean)
- Throughput in requests/sec and items/sec
- Per-model breakdown for multi-model tests
- ASCII time-series graphs for throughput and p99 latency
- Error breakdown by type
## What’s Next

- Quality Evaluation - Measure retrieval accuracy with NDCG and MAP metrics
- SDK Reference - Client options for timeout and batch configuration