
# Performance Evaluation

Performance evaluation measures how fast models process requests under various conditions. The benchmark harness tracks latency percentiles, throughput in tokens per second, and behavior under load.

SIE captures the following metrics during performance benchmarks:

| Metric | Description |
| --- | --- |
| p50 latency | Median response time in milliseconds |
| p90/p95/p99 latency | Tail latency percentiles for SLA planning |
| min/max latency | Range of observed response times |
| tokens/sec | Processing throughput for corpus and query workloads |
| items/sec | Request throughput (tokens/sec divided by average sequence length) |
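
For example, a run that sustains 51,200 tokens/sec with an average sequence length of 512 tokens per item is processing 51,200 / 512 = 100 items/sec.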

Corpus throughput measures document encoding speed. Query throughput measures short-text encoding with `is_query=True`.

Use `--type perf` to run performance benchmarks instead of quality evaluations:

```sh
# Performance benchmark on SIE
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type perf

# Compare SIE vs TEI performance
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type perf -s sie,tei

# Save results as baseline measurements
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type perf -s sie --save-measurements sie
```

The eval harness automatically starts and stops servers. Do not manually start Docker containers.

| Option | Default | Description |
| --- | --- | --- |
| `--batch-size` | 1 | Items per request |
| `--concurrency` | 16 | Number of parallel clients |
| `--device` | `cuda:0` | Inference device |
| `--timeout` | 120.0 | Request timeout in seconds |
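
For example, to measure behavior at a heavier load shape than the defaults (the flag values here are illustrative, not recommendations):

```sh
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type perf \
  --batch-size 8 --concurrency 64 --timeout 300.0
```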

For sustained load testing against a cluster, use the `loadtest` command with a YAML scenario file:

```sh
mise run sie-bench -- loadtest scenario.yaml --cluster http://router:8080
```

The load test harness provides a live progress display showing:

- Current and target requests per second
- Rolling p50 and p99 latency
- Success and error counts
- Per-model request distribution

```yaml
# scenario.yaml
models:
  - BAAI/bge-m3
  - BAAI/bge-large-en-v1.5
gpu_types:
  - l4
load_profile:
  pattern: constant
  target_rps: 100
concurrency: 32
duration_s: 300
warmup_s: 30
batch_size: 1
timeout_s: 30.0
```
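
For scale: at a constant 100 RPS over the 300-second test window, a fully successful run issues roughly 100 × 300 = 30,000 measured requests, after 30 seconds of warmup traffic (assuming warmup requests are excluded from the reported statistics).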

The load test harness supports four traffic patterns:

| Pattern | Behavior |
| --- | --- |
| `constant` | Fixed RPS throughout the test duration |
| `ramp` | Gradually increases from 0 to target RPS over `ramp_duration_s` |
| `step` | Step-wise increase at 25%, 50%, 75%, and 100% of target RPS |
| `spike` | Normal traffic with periodic spikes at `spike_multiplier` intensity |

```yaml
# Constant load at 100 RPS
load_profile:
  pattern: constant
  target_rps: 100

# Ramp from 0 to 200 RPS over 60 seconds
load_profile:
  pattern: ramp
  target_rps: 200
  ramp_duration_s: 60

# Step through 25/50/75/100 RPS
load_profile:
  pattern: step
  target_rps: 100
  step_levels: [0.25, 0.5, 0.75, 1.0]

# Normal at 50 RPS with 3x spikes every 60 seconds
load_profile:
  pattern: spike
  target_rps: 50
  spike_multiplier: 3.0
  spike_duration_s: 10
  spike_interval_s: 60
```
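
The harness's scheduling internals aren't shown here, but a minimal sketch of how these four patterns could map elapsed time to an instantaneous RPS target (Python, with hypothetical function and argument names; key names mirror the `load_profile` examples above):

```python
def target_rps_at(t: float, profile: dict, duration_s: float) -> float:
    """Instantaneous RPS target at elapsed second t (illustrative sketch).

    `profile` holds the load_profile keys shown above; `duration_s` is the
    total test duration. This is not the harness's actual API.
    """
    rps = profile["target_rps"]
    pattern = profile["pattern"]

    if pattern == "constant":
        return rps
    if pattern == "ramp":
        # Linear climb from 0 to target over ramp_duration_s, then hold.
        return rps * min(t / profile["ramp_duration_s"], 1.0)
    if pattern == "step":
        # Equal-length phases, one per step level (e.g. 25/50/75/100%).
        levels = profile["step_levels"]
        phase = min(int(t / duration_s * len(levels)), len(levels) - 1)
        return rps * levels[phase]
    if pattern == "spike":
        # Baseline traffic, multiplied for spike_duration_s at the start
        # of every spike_interval_s window.
        in_spike = (t % profile["spike_interval_s"]) < profile["spike_duration_s"]
        return rps * (profile["spike_multiplier"] if in_spike else 1.0)
    raise ValueError(f"unknown pattern: {pattern!r}")
```

With the spike example above, `target_rps_at(65.0, profile, 300.0)` returns 150.0: 65 s falls 5 s into the second 60-second window, inside the 10-second spike.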

Matrix evaluation runs benchmarks across multiple models, tasks, and GPU types in parallel:

```sh
mise run sie-bench -- matrix config.yaml --cluster http://router:8080 --workers 2
```

```yaml
# matrix-config.yaml
models:
  - BAAI/bge-m3
  - model: BAAI/bge-large-en-v1.5
    profiles: all
  - bundle: default
tasks:
  - mteb/NFCorpus
  - mteb/SciFact
gpus:
  - l4
  - a100-80gb
type:
  - quality
  - perf
perf:
  batch_size: 1
  concurrency: 16
  timeout: 120.0
```

Matrix mode creates isolated resource pools per GPU type and runs evaluations concurrently.
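
Conceptually, the matrix is the cross-product of models, tasks, GPU types, and evaluation types. A minimal sketch of the expansion for the config above (illustrative only; the bundle and profile entries are elided, and this is not the harness's internal representation):

```python
from itertools import product

# Axes from matrix-config.yaml above (bundle/profile entries elided).
models = ["BAAI/bge-m3", "BAAI/bge-large-en-v1.5"]
tasks = ["mteb/NFCorpus", "mteb/SciFact"]
gpus = ["l4", "a100-80gb"]
types = ["quality", "perf"]

# 2 x 2 x 2 x 2 = 16 jobs; jobs sharing a GPU type draw on that type's
# isolated resource pool, and pools execute concurrently.
jobs = [
    {"model": m, "task": t, "gpu": g, "type": ty}
    for m, t, g, ty in product(models, tasks, gpus, types)
]
print(len(jobs))  # 16
```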

Models can be specified in three ways:

| Format | Example | Description |
| --- | --- | --- |
| String | `BAAI/bge-m3` | Single model with the default profile |
| Dict with profiles | `{model: bge-m3, profiles: all}` | Model with specific or all profiles |
| Bundle | `{bundle: default}` | All models in a bundle |

After a load test completes, the harness generates Markdown and JSON reports:

```sh
mise run sie-bench -- loadtest scenario.yaml --cluster http://router:8080 --output ./reports
```

Reports include:

- Configuration summary
- Overall request counts and success rate
- Latency percentiles (p50, p90, p95, p99, min, max, mean)
- Throughput in requests/sec and items/sec
- Per-model breakdown for multi-model tests
- ASCII time-series graphs for throughput and p99 latency
- Error breakdown by type