Performance Tuning
SIE provides several tuning parameters that affect throughput, latency, and resource usage. This guide covers the main configuration options.
Batching Parameters
Batching groups requests to maximize GPU utilization. Three parameters control batch formation:
max_batch_cost
Maximum total cost per batch. For text, cost equals token count. Default: 16384 tokens.

```bash
# Environment variable
export SIE_MAX_BATCH_COST=32768
```

Higher values improve throughput at the cost of latency. GPU memory limits how high you can go.
max_batch_wait_ms
Maximum time to wait for more requests before processing a batch. Default: 10ms.

```bash
# Environment variable
export SIE_MAX_BATCH_WAIT_MS=20
```

Lower values reduce latency for sparse traffic. Higher values improve batching efficiency under load.
max_batch_requests
Maximum number of requests per batch. Default: 64.

```bash
# Environment variable
export SIE_MAX_BATCH_REQUESTS=128
```

This is a secondary limit. Cost-based batching typically triggers first for text workloads.
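To make the relationship between the three limits concrete, here is a minimal sketch of a "seal the batch" check. The names and structure are illustrative assumptions, not SIE's actual scheduler code:

```python
import time

# Illustrative sketch only: names and defaults mirror the documented
# parameters, but this is not SIE's actual batching implementation.
def should_seal_batch(
    batch_cost: int,
    batch_size: int,
    oldest_enqueue_time: float,
    max_batch_cost: int = 16384,
    max_batch_requests: int = 64,
    max_batch_wait_ms: int = 10,
) -> bool:
    """Return True once any of the three batching limits is reached."""
    waited_ms = (time.monotonic() - oldest_enqueue_time) * 1000
    return (
        batch_cost >= max_batch_cost          # cost limit (token count for text)
        or batch_size >= max_batch_requests   # request-count limit
        or waited_ms >= max_batch_wait_ms     # wait-time limit
    )
```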
Tuning Strategy
For low-latency workloads, reduce max_batch_wait_ms to 5ms or less. For high-throughput batch processing, increase both max_batch_cost and max_batch_wait_ms.
Memory Thresholds
SIE uses reactive LRU eviction to manage GPU memory. No static VRAM budget is required.
Pressure Threshold
When memory usage exceeds this percentage, the least-recently-used model is evicted. Default: 90%.

```bash
# Environment variable
export SIE_MEMORY_PRESSURE_THRESHOLD_PERCENT=85
```

Lower values keep more headroom for inference spikes. Higher values allow more models to stay loaded.
How Eviction Works
The memory manager checks pressure at two points:
- Before loading: If above threshold, evict LRU model first
- After each batch: Background check for gradual memory growth
Models are tracked by last-use time. The oldest model is evicted first.
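As a simplified sketch of that decision, eviction can be pictured as a loop that pops the least-recently-used entry until usage drops back below the threshold. The helper names here are assumptions, not SIE's actual API; the touch method from memory.py below shows how the last-use ordering is actually maintained:

```python
# Simplified, illustrative sketch; get_memory_usage_percent() and evict()
# are assumed helper names, not SIE's actual API.
def ensure_headroom(manager, threshold_percent: float = 90.0) -> None:
    """Evict least-recently-used models until usage drops below the threshold."""
    while manager.get_memory_usage_percent() > threshold_percent and manager._models:
        # The OrderedDict keeps models in last-use order,
        # so the first key is the LRU model.
        lru_model = next(iter(manager._models))
        manager.evict(lru_model)
```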
```python
# From memory.py - LRU tracking
def touch(self, model_name: str) -> None:
    if model_name in self._models:
        self._models[model_name].touch()
        self._models.move_to_end(model_name)
```

Device-Specific Behavior
Memory tracking adapts to your hardware:
| Device | Memory Source |
|---|---|
| CUDA | NVML device memory query |
| MPS | PyTorch allocated memory |
| CPU | System RAM via psutil |
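For reference, the per-device readings in the table can be obtained roughly as follows. This is a sketch that calls pynvml, torch, and psutil directly, not SIE's exact implementation:

```python
import psutil
import torch

def memory_usage_percent(device: str) -> float:
    """Approximate used-memory percentage per device type (illustrative sketch)."""
    if device == "cuda":
        import pynvml
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        return 100.0 * info.used / info.total
    if device == "mps":
        # PyTorch reports bytes allocated by its MPS allocator; unified memory
        # makes total system RAM a rough but workable denominator.
        allocated = torch.mps.current_allocated_memory()
        return 100.0 * allocated / psutil.virtual_memory().total
    # CPU: system RAM via psutil
    return psutil.virtual_memory().percent
```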
Attention Backend
The attention implementation affects inference speed significantly.
Available Backends
| Backend | Requirements | Speedup |
|---|---|---|
| flash_attention_2 | Ampere+ GPU, flash-attn package | 2-4x |
| sdpa | PyTorch 2.0+ | 1.5-2x |
| eager | Any | Baseline |
Configuration
```bash
# Auto-select best available (default)
export SIE_ATTENTION_BACKEND=auto
```

```bash
# Force specific backend
export SIE_ATTENTION_BACKEND=flash_attention_2
export SIE_ATTENTION_BACKEND=sdpa
```

Auto mode selects Flash Attention 2 if available, then SDPA, then eager.
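That selection order can be sketched as follows. The function name and structure are assumptions for illustration, not SIE's actual code:

```python
import torch

def select_attention_backend(requested: str = "auto") -> str:
    """Pick an attention implementation in the order described above (sketch)."""
    if requested != "auto":
        return requested
    try:
        import flash_attn  # noqa: F401
        # Flash Attention 2 needs an Ampere-or-newer GPU (compute capability 8.0+).
        if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8:
            return "flash_attention_2"
    except ImportError:
        pass
    if hasattr(torch.nn.functional, "scaled_dot_product_attention"):  # PyTorch 2.0+
        return "sdpa"
    return "eager"
```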
Flash Attention Requirements
Flash Attention 2 requires:
- CUDA compute capability 8.0+ (Ampere: A100, RTX 30xx, RTX 40xx)
- The flash-attn package installed
- FP16 or BF16 compute precision (not FP32)
If requirements are not met, the server falls back to SDPA automatically.
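A combined check of these requirements, including the precision constraint, might look like the sketch below. The function name is assumed for illustration and is not SIE's actual code:

```python
import torch

def meets_flash_attn_requirements(dtype: torch.dtype) -> bool:
    """Check the three requirements listed above (illustrative sketch)."""
    try:
        import flash_attn  # noqa: F401
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    major, _ = torch.cuda.get_device_capability()
    # Compute capability 8.0+ and half-precision compute are both required.
    return major >= 8 and dtype in (torch.float16, torch.bfloat16)
```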
Compute Precision
Control the precision used for model inference:

```bash
# Options: float16, bfloat16, float32
export SIE_DEFAULT_COMPUTE_PRECISION=float16
```

| Precision | Memory | Speed | Compatibility |
|---|---|---|---|
| float16 | Low | Fast | All CUDA GPUs |
| bfloat16 | Low | Fast | Ampere+, MPS, CPU |
| float32 | High | Slow | All devices |
BF16 offers better numerical stability than FP16 for some models. FP32 is mainly for debugging.
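At load time the setting maps onto a torch dtype. The mapping below is standard PyTorch; the loader call itself is an illustrative assumption rather than SIE's actual model-loading code:

```python
import torch
from transformers import AutoModel

# Map the SIE_DEFAULT_COMPUTE_PRECISION value to a torch dtype.
DTYPES = {
    "float16": torch.float16,
    "bfloat16": torch.bfloat16,
    "float32": torch.float32,
}

def load_model(model_name: str, precision: str = "float16"):
    """Load a model with the requested compute precision (sketch, not SIE's loader)."""
    return AutoModel.from_pretrained(model_name, torch_dtype=DTYPES[precision])
```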
Preprocessing Workers
Tokenization and image processing run in a CPU thread pool.

```bash
# Environment variable
export SIE_PREPROCESSOR_WORKERS=8
```

Default: min(CPU count, 8). Increase for high request rates. Decrease on memory-constrained systems.
The thread pool is shared across all models. Both tokenization and image preprocessing use the same pool.
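A minimal sketch of such a shared pool, assuming the documented default of min(CPU count, 8) workers; the names are illustrative, not SIE's actual module:

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch of a shared preprocessing pool (not SIE's actual code).
default_workers = min(os.cpu_count() or 1, 8)
workers = int(os.environ.get("SIE_PREPROCESSOR_WORKERS", default_workers))
preprocessing_pool = ThreadPoolExecutor(max_workers=workers)

# Both tokenization and image preprocessing would be submitted to this one pool,
# e.g. future = preprocessing_pool.submit(tokenize_fn, text)
```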
Environment Variables
All tuning parameters can be set via environment variables with the SIE_ prefix:
| Variable | Default | Description |
|---|---|---|
| SIE_MAX_BATCH_COST | 16384 | Max total cost (tokens) per batch |
| SIE_MAX_BATCH_REQUESTS | 64 | Max requests per batch |
| SIE_MAX_BATCH_WAIT_MS | 10 | Max wait time (ms) |
| SIE_MAX_CONCURRENT_REQUESTS | 512 | Request queue size |
| SIE_MEMORY_PRESSURE_THRESHOLD_PERCENT | 90 | Eviction trigger (%) |
| SIE_PREPROCESSOR_WORKERS | 8 | CPU thread pool size |
| SIE_ATTENTION_BACKEND | auto | Attention implementation |
| SIE_DEFAULT_COMPUTE_PRECISION | float16 | Model precision |
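One common pattern for surfacing prefixed variables like these is a settings class. The sketch below is an assumption about how such a mapping could look, not SIE's actual configuration module:

```python
from pydantic_settings import BaseSettings, SettingsConfigDict

# Illustrative assumption only: class and field names are not SIE's actual config code.
class TuningSettings(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="SIE_")

    max_batch_cost: int = 16384
    max_batch_requests: int = 64
    max_batch_wait_ms: int = 10
    max_concurrent_requests: int = 512
    memory_pressure_threshold_percent: int = 90
    preprocessor_workers: int = 8
    attention_backend: str = "auto"
    default_compute_precision: str = "float16"
```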
Benchmarking Changes
Use the eval harness to measure the impact of tuning changes:
```bash
# Performance benchmark
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type perf -s sie
```

```bash
# Compare before/after
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type perf -s sie,targets
```

The perf eval reports throughput (items/sec), latency percentiles, and GPU utilization.
See the Evals documentation for the full benchmarking workflow.
What’s Next
- Request Lifecycle - how batching and memory work together
- Evals - benchmark your configuration changes