# Configuration
SIE uses environment variables for server configuration. CLI arguments override environment variables, which override defaults.
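This precedence can be sketched with a small resolution helper. The function below is illustrative, not SIE's actual code:

```python
import os

def resolve_setting(cli_value, env_var, default):
    """Return the effective value: CLI argument > environment variable > default."""
    if cli_value is not None:
        return cli_value
    return os.environ.get(env_var, default)

# Example: no CLI flag given, SIE_DEVICE set in the environment
os.environ["SIE_DEVICE"] = "cuda:0"
print(resolve_setting(None, "SIE_DEVICE", "auto"))   # cuda:0
print(resolve_setting("cpu", "SIE_DEVICE", "auto"))  # cpu (CLI wins)
```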
## Server Configuration

Core settings for device selection, model loading, and server behavior.
| Variable | Default | Description |
|---|---|---|
| `SIE_DEVICE` | `auto` | Inference device. Options: `auto` (detect GPU), `cuda`, `cuda:0`, `mps`, `cpu` |
| `SIE_MODELS_DIR` | `./models` | Path to the model configs directory. Supports local paths, `s3://`, or `gs://` URLs |
| `SIE_MODEL_FILTER` | None | Comma-separated list of model names to load. If unset, all models are available |
| `SIE_GPU_TYPE` | Auto-detected | Override the detected GPU type for routing (e.g., `l4`, `a100-80gb`, `h100`) |
## Cache Configuration

Control where model weights are stored and retrieved.
| Variable | Default | Description |
|---|---|---|
| `SIE_LOCAL_CACHE` | `HF_HOME` | Local cache directory for model weights |
| `SIE_CLUSTER_CACHE` | None | Cluster cache URL for shared weights (`s3://` or `gs://`) |
| `SIE_HF_FALLBACK` | `true` | Enable HuggingFace Hub fallback for weight downloads |
Cache resolution order:

1. Local cache (`SIE_LOCAL_CACHE`)
2. Cluster cache (`SIE_CLUSTER_CACHE`)
3. HuggingFace Hub (if `SIE_HF_FALLBACK=true`)
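The fallback chain can be sketched as follows. `resolve_weights`, `fetch_cluster`, and `fetch_hub` are hypothetical names for illustration, not SIE's actual API:

```python
from pathlib import Path

def resolve_weights(model_name, local_cache, fetch_cluster=None, fetch_hub=None):
    """Sketch of the cache resolution order: local cache, then cluster
    cache, then HuggingFace Hub fallback."""
    local_path = Path(local_cache) / model_name
    if local_path.exists():
        return str(local_path), "local"          # 1. local cache hit
    if fetch_cluster is not None:                # 2. SIE_CLUSTER_CACHE (s3:// or gs://)
        fetched = fetch_cluster(model_name)
        if fetched is not None:
            return fetched, "cluster"
    if fetch_hub is not None:                    # 3. SIE_HF_FALLBACK=true
        return fetch_hub(model_name), "hub"
    raise FileNotFoundError(f"weights for {model_name} not found in any cache")
```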
## Batching Configuration

Control request batching behavior for GPU efficiency.
| Variable | Default | Description |
|---|---|---|
| `SIE_MAX_BATCH_REQUESTS` | 64 | Maximum requests per batch |
| `SIE_MAX_BATCH_WAIT_MS` | 10 | Maximum milliseconds to wait for a batch to fill |
| `SIE_MAX_CONCURRENT_REQUESTS` | 512 | Maximum concurrent requests (queue size) |
Tuning guidance:

- Increase `SIE_MAX_BATCH_REQUESTS` for higher throughput on high-memory GPUs
- Decrease `SIE_MAX_BATCH_WAIT_MS` for lower latency at the cost of smaller batches
- Set `SIE_MAX_CONCURRENT_REQUESTS` based on expected burst traffic
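The interplay of the two batch limits can be sketched as a size- and time-bounded collector: a batch closes at `SIE_MAX_BATCH_REQUESTS` items or after `SIE_MAX_BATCH_WAIT_MS`, whichever comes first. This is an illustrative sketch, not SIE's actual batcher:

```python
import queue
import time

def collect_batch(q, max_requests=64, max_wait_ms=10):
    """Block for the first request, then keep adding requests until the
    batch is full or the wait deadline passes."""
    deadline = time.monotonic() + max_wait_ms / 1000.0
    batch = [q.get()]                      # block for the first request
    while len(batch) < max_requests:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                          # deadline hit: ship a partial batch
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break                          # queue drained before the deadline
    return batch
```

A larger `max_wait_ms` trades single-request latency for fuller (more GPU-efficient) batches, which is exactly the tuning trade-off described above.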
## Memory Configuration

Control memory pressure thresholds and LRU eviction behavior.
| Variable | Default | Description |
|---|---|---|
| `SIE_MEMORY_PRESSURE_THRESHOLD_PCT` | 85 | VRAM usage percent that triggers LRU eviction (0-100) |
| `SIE_MEMORY_CHECK_INTERVAL_S` | 1.0 | Background memory monitor interval in seconds |
How LRU eviction works:

- A background monitor checks memory usage every `SIE_MEMORY_CHECK_INTERVAL_S` seconds
- When usage exceeds `SIE_MEMORY_PRESSURE_THRESHOLD_PCT`, the least-recently-used model is evicted
- Evicted models are re-loaded on demand when the next request for them arrives
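One cycle of such a monitor can be sketched as below. All names (`check_memory_pressure`, the callbacks, the timestamp map) are hypothetical, not SIE's real internals:

```python
def check_memory_pressure(get_vram_pct, loaded_models, evict, threshold_pct=85):
    """One monitor cycle: while VRAM usage exceeds the threshold, evict the
    least-recently-used model. loaded_models maps model name -> last-use
    timestamp; evicted models would be re-loaded on the next request."""
    evicted = []
    while get_vram_pct() > threshold_pct and loaded_models:
        lru = min(loaded_models, key=loaded_models.get)  # oldest last use
        evict(lru)
        del loaded_models[lru]
        evicted.append(lru)
    return evicted
```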
## Logging Configuration

Control log format and verbosity.
| Variable | Default | Description |
|---|---|---|
| `SIE_LOG_JSON` | `false` | Enable structured JSON logging for Loki compatibility |
JSON log format includes structured fields:
```json
{
  "timestamp": "2025-12-18T10:30:00Z",
  "level": "INFO",
  "logger": "sie_server.core.registry",
  "message": "Inference completed",
  "model": "bge-m3",
  "request_id": "abc123",
  "trace_id": "def456",
  "latency_ms": 45.2
}
```
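Because each line is standalone JSON, log consumers can filter and correlate records with plain `json.loads`. A hypothetical consumer script:

```python
import json

# One SIE log line in the structured format shown above.
line = ('{"timestamp": "2025-12-18T10:30:00Z", "level": "INFO", '
        '"logger": "sie_server.core.registry", "message": "Inference completed", '
        '"model": "bge-m3", "request_id": "abc123", "trace_id": "def456", '
        '"latency_ms": 45.2}')

record = json.loads(line)
# Filter slow inferences and emit the trace id for correlation with traces.
if record["level"] == "INFO" and record["latency_ms"] > 40:
    print(record["trace_id"], record["model"])  # def456 bge-m3
```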
## Tracing Configuration

Enable OpenTelemetry distributed tracing.
| Variable | Default | Description |
|---|---|---|
| `SIE_TRACING_ENABLED` | `false` | Enable OpenTelemetry tracing |
When tracing is enabled, SIE respects standard OpenTelemetry environment variables:
| Variable | Default | Description |
|---|---|---|
| `OTEL_SERVICE_NAME` | `sie-server` | Service name in traces |
| `OTEL_TRACES_EXPORTER` | `otlp` | Exporter type (`otlp`, `console`, `none`) |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | `http://localhost:4317` | OTLP collector endpoint |
| `OTEL_TRACES_SAMPLER` | `always_on` | Sampling strategy |
| `OTEL_TRACES_SAMPLER_ARG` | 1.0 | Sampling rate (for the `traceidratio` sampler) |
## Performance Configuration

Advanced settings for compute precision and preprocessing.
| Variable | Default | Description |
|---|---|---|
| `SIE_PREPROCESSOR_WORKERS` | 4 | Number of preprocessing worker threads |
| `SIE_IMAGE_WORKERS` | 4 | Image preprocessing worker threads (for VLMs) |
| `SIE_ATTENTION_BACKEND` | `auto` | Attention implementation: `auto`, `flash_attention_2`, `sdpa`, `eager` |
| `SIE_DEFAULT_COMPUTE_PRECISION` | `float16` | Default compute precision: `float16`, `bfloat16`, `float32` |
| `SIE_INSTRUMENTATION` | `false` | Enable detailed batch statistics for debugging |
## LoRA Configuration

Control LoRA adapter loading behavior.
| Variable | Default | Description |
|---|---|---|
| `SIE_MAX_LORAS_PER_MODEL` | 10 | Maximum LoRA adapters to keep loaded per model |
When the limit is reached, the least-recently-used LoRA adapter is evicted.
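A capacity-capped LRU cache of this kind can be sketched with an `OrderedDict`. `AdapterCache` and `load_fn` are illustrative names, not SIE's real API:

```python
from collections import OrderedDict

class AdapterCache:
    """Keep at most max_loras adapters loaded; loading one more evicts
    the least-recently-used entry (mirroring SIE_MAX_LORAS_PER_MODEL)."""

    def __init__(self, max_loras=10):
        self.max_loras = max_loras
        self._adapters = OrderedDict()          # name -> adapter, oldest first

    def get(self, name, load_fn):
        if name in self._adapters:
            self._adapters.move_to_end(name)    # mark as recently used
        else:
            if len(self._adapters) >= self.max_loras:
                self._adapters.popitem(last=False)  # evict LRU adapter
            self._adapters[name] = load_fn(name)
        return self._adapters[name]
```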
## Example: Production Configuration

```sh
# High-throughput production setup
export SIE_DEVICE=cuda
export SIE_MODELS_DIR=s3://my-bucket/models/
export SIE_CLUSTER_CACHE=s3://my-bucket/weights/
export SIE_LOCAL_CACHE=/mnt/nvme/cache

# Batching optimized for A100-80GB
export SIE_MAX_BATCH_REQUESTS=128
export SIE_MAX_BATCH_WAIT_MS=5
export SIE_MAX_CONCURRENT_REQUESTS=1024

# Memory management
export SIE_MEMORY_PRESSURE_THRESHOLD_PCT=90

# Observability
export SIE_LOG_JSON=true
export SIE_TRACING_ENABLED=true
export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317
```

## Example: Development Configuration
```sh
# Local development setup
export SIE_DEVICE=mps  # or cuda, cpu
export SIE_MODELS_DIR=./models

# Lower batching for faster iteration
export SIE_MAX_BATCH_REQUESTS=8
export SIE_MAX_BATCH_WAIT_MS=1

# Debug logging
export SIE_INSTRUMENTATION=true
```

## What’s Next
- CLI Reference - Command-line options that map to these variables
- HTTP API Reference - Endpoints exposed by the configured server