
# Performance Evaluation

Performance evaluation measures how fast models process requests under various conditions. The benchmark harness tracks latency percentiles, throughput in tokens per second, and behavior under load.

SIE captures the following metrics during performance benchmarks:

| Metric | Description |
| --- | --- |
| p50 latency | Median response time in milliseconds |
| p90/p95/p99 latency | Tail latency percentiles for SLA planning |
| min/max latency | Range of observed response times |
| tokens/sec | Processing throughput for corpus and query workloads |
| items/sec | Request throughput (tokens/sec divided by average sequence length) |
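
For example, a run that sustains 51,200 tokens/sec with an average sequence length of 512 tokens per item is processing 51,200 / 512 = 100 items/sec.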

Corpus throughput measures document encoding speed. Query throughput measures short-text encoding with `is_query=True`.

Use `--type perf` to run performance benchmarks instead of quality evaluations:

```sh
# Performance benchmark on SIE
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type perf

# Compare SIE vs TEI performance
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type perf -s sie,tei

# Save results as baseline measurements
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type perf -s sie --save-measurements sie
```

The eval harness automatically starts and stops servers. Do not manually start Docker containers.

| Option | Default | Description |
| --- | --- | --- |
| `--batch-size` | 1 | Items per request |
| `--concurrency` | 16 | Number of parallel clients |
| `--device` | `cuda:0` | Inference device |
| `--timeout` | 120.0 | Request timeout in seconds |
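
For example, to measure behavior at a heavier load shape than the defaults (the flag values here are illustrative, not recommendations):

```sh
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type perf \
  --batch-size 8 --concurrency 64 --timeout 300.0
```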

For sustained load testing against a cluster, use the `loadtest` command with a YAML scenario file:

```sh
mise run sie-bench -- loadtest scenario.yaml --cluster http://router:8080
```

The load test harness provides a live progress display showing:

- Current and target requests per second
- Rolling p50 and p99 latency
- Success and error counts
- Per-model request distribution

```yaml
# scenario.yaml
models:
  - BAAI/bge-m3
  - BAAI/bge-large-en-v1.5
gpu_types:
  - l4
load_profile:
  pattern: constant
  target_rps: 100
concurrency: 32
duration_s: 300
warmup_s: 30
batch_size: 1
timeout_s: 30.0
```
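
For scale: at a constant 100 RPS over the 300-second test window, a fully successful run issues roughly 100 × 300 = 30,000 measured requests, after 30 seconds of warmup traffic (assuming warmup requests are excluded from the reported statistics).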

The load test harness supports four traffic patterns:

| Pattern | Behavior |
| --- | --- |
| `constant` | Fixed RPS throughout the test duration |
| `ramp` | Gradually increases from 0 to target RPS over `ramp_duration_s` |
| `step` | Step-wise increase at 25%, 50%, 75%, and 100% of target RPS |
| `spike` | Normal traffic with periodic spikes at `spike_multiplier` intensity |

```yaml
# Constant load at 100 RPS
load_profile:
  pattern: constant
  target_rps: 100

# Ramp from 0 to 200 RPS over 60 seconds
load_profile:
  pattern: ramp
  target_rps: 200
  ramp_duration_s: 60

# Step through 25/50/75/100 RPS
load_profile:
  pattern: step
  target_rps: 100
  step_levels: [0.25, 0.5, 0.75, 1.0]

# Normal at 50 RPS with 3x spikes every 60 seconds
load_profile:
  pattern: spike
  target_rps: 50
  spike_multiplier: 3.0
  spike_duration_s: 10
  spike_interval_s: 60
```
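
The harness's scheduling internals aren't shown here, but a minimal sketch of how these four patterns could map elapsed time to an instantaneous RPS target (Python, with hypothetical function and argument names; key names mirror the `load_profile` examples above):

```python
def target_rps_at(t: float, profile: dict, duration_s: float) -> float:
    """Instantaneous RPS target at elapsed second t (illustrative sketch).

    `profile` holds the load_profile keys shown above; `duration_s` is the
    total test duration. This is not the harness's actual API.
    """
    rps = profile["target_rps"]
    pattern = profile["pattern"]

    if pattern == "constant":
        return rps
    if pattern == "ramp":
        # Linear climb from 0 to target over ramp_duration_s, then hold.
        return rps * min(t / profile["ramp_duration_s"], 1.0)
    if pattern == "step":
        # Equal-length phases, one per step level (e.g. 25/50/75/100%).
        levels = profile["step_levels"]
        phase = min(int(t / duration_s * len(levels)), len(levels) - 1)
        return rps * levels[phase]
    if pattern == "spike":
        # Baseline traffic, multiplied for spike_duration_s at the start
        # of every spike_interval_s window.
        in_spike = (t % profile["spike_interval_s"]) < profile["spike_duration_s"]
        return rps * (profile["spike_multiplier"] if in_spike else 1.0)
    raise ValueError(f"unknown pattern: {pattern!r}")
```

With the spike example above, `target_rps_at(65.0, profile, 300.0)` returns 150.0: 65 s falls 5 s into the second 60-second window, inside the 10-second spike.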

Matrix evaluation runs benchmarks across multiple models, tasks, and GPU types in parallel:

```sh
mise run sie-bench -- matrix config.yaml --cluster http://router:8080 --workers 2
```

```yaml
# matrix-config.yaml
models:
  - BAAI/bge-m3
  - model: BAAI/bge-large-en-v1.5
    profiles: all
  - bundle: default
tasks:
  - mteb/NFCorpus
  - mteb/SciFact
gpus:
  - l4
  - a100-80gb
type:
  - quality
  - perf
perf:
  batch_size: 1
  concurrency: 16
  timeout: 120.0
```

Matrix mode creates isolated resource pools per GPU type and runs evaluations concurrently.
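
Conceptually, the matrix is the cross-product of models, tasks, GPU types, and evaluation types. A minimal sketch of the expansion for the config above (illustrative only; the bundle and profile entries are elided, and this is not the harness's internal representation):

```python
from itertools import product

# Axes from matrix-config.yaml above (bundle/profile entries elided).
models = ["BAAI/bge-m3", "BAAI/bge-large-en-v1.5"]
tasks = ["mteb/NFCorpus", "mteb/SciFact"]
gpus = ["l4", "a100-80gb"]
types = ["quality", "perf"]

# 2 x 2 x 2 x 2 = 16 jobs; jobs sharing a GPU type draw on that type's
# isolated resource pool, and pools execute concurrently.
jobs = [
    {"model": m, "task": t, "gpu": g, "type": ty}
    for m, t, g, ty in product(models, tasks, gpus, types)
]
print(len(jobs))  # 16
```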

Models can be specified in three ways:

| Format | Example | Description |
| --- | --- | --- |
| String | `BAAI/bge-m3` | Single model with the default profile |
| Dict with profiles | `{model: bge-m3, profiles: all}` | Model with specific or all profiles |
| Bundle | `{bundle: default}` | All models in a bundle |

After a load test completes, the harness generates Markdown and JSON reports:

```sh
mise run sie-bench -- loadtest scenario.yaml --cluster http://router:8080 --output ./reports
```

Reports include:

- Configuration summary
- Overall request counts and success rate
- Latency percentiles (p50, p90, p95, p99, min, max, mean)
- Throughput in requests/sec and items/sec
- Per-model breakdown for multi-model tests
- ASCII time-series graphs for throughput and p99 latency
- Error breakdown by type