Performance Tuning
SIE provides several tuning parameters that affect throughput, latency, and resource usage. This guide covers the main configuration options.
Batching Parameters
Batching groups requests to maximize GPU utilization. Three parameters control batch formation:
max_batch_cost
Maximum total cost per batch. For text, cost equals token count. Default: 16384 tokens.

```bash
# Environment variable
export SIE_MAX_BATCH_COST=32768
```

Higher values improve throughput at the cost of latency. GPU memory limits how high you can go.
max_batch_wait_ms
Maximum time to wait for more requests before processing a batch. Default: 10ms.

```bash
# Environment variable
export SIE_MAX_BATCH_WAIT_MS=20
```

Lower values reduce latency for sparse traffic. Higher values improve batching efficiency under load.
max_batch_requests
Maximum number of requests per batch. Default: 64.

```bash
# Environment variable
export SIE_MAX_BATCH_REQUESTS=128
```

This is a secondary limit. Cost-based batching typically triggers first for text workloads.
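To make the relationship between the three limits concrete, here is a minimal sketch of a "seal the batch" check. The names and structure are illustrative assumptions, not SIE's actual scheduler code:

```python
import time

# Illustrative sketch only: names and defaults mirror the documented
# parameters, but this is not SIE's actual batching implementation.
def should_seal_batch(
    batch_cost: int,
    batch_size: int,
    oldest_enqueue_time: float,
    max_batch_cost: int = 16384,
    max_batch_requests: int = 64,
    max_batch_wait_ms: int = 10,
) -> bool:
    """Return True once any of the three batching limits is reached."""
    waited_ms = (time.monotonic() - oldest_enqueue_time) * 1000
    return (
        batch_cost >= max_batch_cost          # cost limit (token count for text)
        or batch_size >= max_batch_requests   # request-count limit
        or waited_ms >= max_batch_wait_ms     # wait-time limit
    )
```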
Tuning Strategy
For low-latency workloads, reduce max_batch_wait_ms to 5ms or less. For high-throughput batch processing, increase both max_batch_cost and max_batch_wait_ms.
Memory Thresholds
SIE uses reactive LRU eviction to manage GPU memory. No static VRAM budget is required.
Pressure Threshold
When memory usage exceeds this percentage, the least-recently-used model is evicted. Default: 90%.

```bash
# Environment variable
export SIE_MEMORY_PRESSURE_THRESHOLD_PERCENT=85
```

Lower values keep more headroom for inference spikes. Higher values allow more models to stay loaded.
How Eviction Works
The memory manager checks pressure at two points:
- Before loading: If above threshold, evict LRU model first
- After each batch: Background check for gradual memory growth
Models are tracked by last-use time. The oldest model is evicted first.
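As a simplified sketch of that decision, eviction can be pictured as a loop that pops the least-recently-used entry until usage drops back below the threshold. The helper names here are assumptions, not SIE's actual API; the touch method from memory.py below shows how the last-use ordering is actually maintained:

```python
# Simplified, illustrative sketch; get_memory_usage_percent() and evict()
# are assumed helper names, not SIE's actual API.
def ensure_headroom(manager, threshold_percent: float = 90.0) -> None:
    """Evict least-recently-used models until usage drops below the threshold."""
    while manager.get_memory_usage_percent() > threshold_percent and manager._models:
        # The OrderedDict keeps models in last-use order,
        # so the first key is the LRU model.
        lru_model = next(iter(manager._models))
        manager.evict(lru_model)
```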
```python
# From memory.py - LRU tracking
def touch(self, model_name: str) -> None:
    if model_name in self._models:
        self._models[model_name].touch()
        self._models.move_to_end(model_name)
```

Device-Specific Behavior
Memory tracking adapts to your hardware:
| Device | Memory Source |
|---|---|
| CUDA | NVML device memory query |
| MPS | PyTorch allocated memory |
| CPU | System RAM via psutil |
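For reference, the per-device readings in the table can be obtained roughly as follows. This is a sketch that calls pynvml, torch, and psutil directly, not SIE's exact implementation:

```python
import psutil
import torch

def memory_usage_percent(device: str) -> float:
    """Approximate used-memory percentage per device type (illustrative sketch)."""
    if device == "cuda":
        import pynvml
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        return 100.0 * info.used / info.total
    if device == "mps":
        # PyTorch reports bytes allocated by its MPS allocator; unified memory
        # makes total system RAM a rough but workable denominator.
        allocated = torch.mps.current_allocated_memory()
        return 100.0 * allocated / psutil.virtual_memory().total
    # CPU: system RAM via psutil
    return psutil.virtual_memory().percent
```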
Attention Backend
The attention implementation affects inference speed significantly.
Available Backends
| Backend | Requirements | Speedup |
|---|---|---|
| flash_attention_2 | Ampere+ GPU, flash-attn package | 2-4x |
| sdpa | PyTorch 2.0+ | 1.5-2x |
| eager | Any | Baseline |
Configuration
```bash
# Auto-select best available (default)
export SIE_ATTENTION_BACKEND=auto
```

```bash
# Force specific backend
export SIE_ATTENTION_BACKEND=flash_attention_2
export SIE_ATTENTION_BACKEND=sdpa
```

Auto mode selects Flash Attention 2 if available, then SDPA, then eager.
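That selection order can be sketched as follows. The function name and structure are assumptions for illustration, not SIE's actual code:

```python
import torch

def select_attention_backend(requested: str = "auto") -> str:
    """Pick an attention implementation in the order described above (sketch)."""
    if requested != "auto":
        return requested
    try:
        import flash_attn  # noqa: F401
        # Flash Attention 2 needs an Ampere-or-newer GPU (compute capability 8.0+).
        if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8:
            return "flash_attention_2"
    except ImportError:
        pass
    if hasattr(torch.nn.functional, "scaled_dot_product_attention"):  # PyTorch 2.0+
        return "sdpa"
    return "eager"
```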
Flash Attention Requirements
Flash Attention 2 requires:
- CUDA compute capability 8.0+ (Ampere: A100, RTX 30xx, RTX 40xx)
- The flash-attn package installed
- FP16 or BF16 compute precision (not FP32)
If requirements are not met, the server falls back to SDPA automatically.
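A combined check of these requirements, including the precision constraint, might look like the sketch below. The function name is assumed for illustration and is not SIE's actual code:

```python
import torch

def meets_flash_attn_requirements(dtype: torch.dtype) -> bool:
    """Check the three requirements listed above (illustrative sketch)."""
    try:
        import flash_attn  # noqa: F401
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    major, _ = torch.cuda.get_device_capability()
    # Compute capability 8.0+ and half-precision compute are both required.
    return major >= 8 and dtype in (torch.float16, torch.bfloat16)
```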
Compute Precision
Control the precision used for model inference:

```bash
# Options: float16, bfloat16, float32
export SIE_DEFAULT_COMPUTE_PRECISION=float16
```

| Precision | Memory | Speed | Compatibility |
|---|---|---|---|
| float16 | Low | Fast | All CUDA GPUs |
| bfloat16 | Low | Fast | Ampere+, MPS, CPU |
| float32 | High | Slow | All devices |
BF16 offers better numerical stability than FP16 for some models. FP32 is mainly for debugging.
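At load time the setting maps onto a torch dtype. The mapping below is standard PyTorch; the loader call itself is an illustrative assumption rather than SIE's actual model-loading code:

```python
import torch
from transformers import AutoModel

# Map the SIE_DEFAULT_COMPUTE_PRECISION value to a torch dtype.
DTYPES = {
    "float16": torch.float16,
    "bfloat16": torch.bfloat16,
    "float32": torch.float32,
}

def load_model(model_name: str, precision: str = "float16"):
    """Load a model with the requested compute precision (sketch, not SIE's loader)."""
    return AutoModel.from_pretrained(model_name, torch_dtype=DTYPES[precision])
```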
Preprocessing Workers
Tokenization and image processing run in a CPU thread pool.

```bash
# Environment variable
export SIE_PREPROCESSOR_WORKERS=8
```

Default: min(CPU count, 8). Increase for high request rates. Decrease on memory-constrained systems.
The thread pool is shared across all models. Both tokenization and image preprocessing use the same pool.
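A minimal sketch of such a shared pool, assuming the documented default of min(CPU count, 8) workers; the names are illustrative, not SIE's actual module:

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch of a shared preprocessing pool (not SIE's actual code).
default_workers = min(os.cpu_count() or 1, 8)
workers = int(os.environ.get("SIE_PREPROCESSOR_WORKERS", default_workers))
preprocessing_pool = ThreadPoolExecutor(max_workers=workers)

# Both tokenization and image preprocessing would be submitted to this one pool,
# e.g. future = preprocessing_pool.submit(tokenize_fn, text)
```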
Environment Variables
All tuning parameters can be set via environment variables with the SIE_ prefix:
| Variable | Default | Description |
|---|---|---|
| SIE_MAX_BATCH_COST | 16384 | Max total cost (tokens) per batch |
| SIE_MAX_BATCH_REQUESTS | 64 | Max requests per batch |
| SIE_MAX_BATCH_WAIT_MS | 10 | Max wait time (ms) |
| SIE_MAX_CONCURRENT_REQUESTS | 512 | Request queue size |
| SIE_MEMORY_PRESSURE_THRESHOLD_PERCENT | 90 | Eviction trigger (%) |
| SIE_PREPROCESSOR_WORKERS | 8 | CPU thread pool size |
| SIE_ATTENTION_BACKEND | auto | Attention implementation |
| SIE_DEFAULT_COMPUTE_PRECISION | float16 | Model precision |
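One common pattern for surfacing prefixed variables like these is a settings class. The sketch below is an assumption about how such a mapping could look, not SIE's actual configuration module:

```python
from pydantic_settings import BaseSettings, SettingsConfigDict

# Illustrative assumption only: class and field names are not SIE's actual config code.
class TuningSettings(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="SIE_")

    max_batch_cost: int = 16384
    max_batch_requests: int = 64
    max_batch_wait_ms: int = 10
    max_concurrent_requests: int = 512
    memory_pressure_threshold_percent: int = 90
    preprocessor_workers: int = 8
    attention_backend: str = "auto"
    default_compute_precision: str = "float16"
```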
Benchmarking Changes
Use the eval harness to measure the impact of tuning changes:
```bash
# Performance benchmark
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type perf -s sie
```

```bash
# Compare before/after
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type perf -s sie,targets
```

The perf eval reports throughput (items/sec), latency percentiles, and GPU utilization.
See the Evals documentation for the full benchmarking workflow.
What’s Next
- Request Lifecycle - how batching and memory work together
- Evals - benchmark your configuration changes