# Configuration
SIE uses environment variables for server configuration. CLI arguments override environment variables, which override defaults.
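This precedence can be sketched with a small resolution helper. The function below is illustrative, not SIE's actual code:

```python
import os

def resolve_setting(cli_value, env_var, default):
    """Return the effective value: CLI argument > environment variable > default."""
    if cli_value is not None:
        return cli_value
    return os.environ.get(env_var, default)

# Example: no CLI flag given, SIE_DEVICE set in the environment
os.environ["SIE_DEVICE"] = "cuda:0"
print(resolve_setting(None, "SIE_DEVICE", "auto"))   # cuda:0
print(resolve_setting("cpu", "SIE_DEVICE", "auto"))  # cpu (CLI wins)
```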
## Server Configuration

Core settings for device selection, model loading, and server behavior.
| Variable | Default | Description |
|---|---|---|
| `SIE_DEVICE` | `auto` | Inference device. Options: `auto` (detect GPU), `cuda`, `cuda:0`, `mps`, `cpu` |
| `SIE_MODELS_DIR` | `./models` | Path to the model configs directory. Supports local paths, `s3://`, or `gs://` URLs |
| `SIE_MODEL_FILTER` | None | Comma-separated list of model names to load. If unset, all models are available |
| `SIE_GPU_TYPE` | Auto-detected | Override the detected GPU type for routing (e.g., `l4`, `a100-80gb`, `h100`) |
## Cache Configuration

Control where model weights are stored and retrieved.
| Variable | Default | Description |
|---|---|---|
| `SIE_LOCAL_CACHE` | `HF_HOME` | Local cache directory for model weights |
| `SIE_CLUSTER_CACHE` | None | Cluster cache URL for shared weights (`s3://` or `gs://`) |
| `SIE_HF_FALLBACK` | `true` | Enable HuggingFace Hub fallback for weight downloads |
Cache resolution order:

1. Local cache (`SIE_LOCAL_CACHE`)
2. Cluster cache (`SIE_CLUSTER_CACHE`)
3. HuggingFace Hub (if `SIE_HF_FALLBACK=true`)
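The fallback chain can be sketched as follows. `resolve_weights`, `fetch_cluster`, and `fetch_hub` are hypothetical names for illustration, not SIE's actual API:

```python
from pathlib import Path

def resolve_weights(model_name, local_cache, fetch_cluster=None, fetch_hub=None):
    """Sketch of the cache resolution order: local cache, then cluster
    cache, then HuggingFace Hub fallback."""
    local_path = Path(local_cache) / model_name
    if local_path.exists():
        return str(local_path), "local"          # 1. local cache hit
    if fetch_cluster is not None:                # 2. SIE_CLUSTER_CACHE (s3:// or gs://)
        fetched = fetch_cluster(model_name)
        if fetched is not None:
            return fetched, "cluster"
    if fetch_hub is not None:                    # 3. SIE_HF_FALLBACK=true
        return fetch_hub(model_name), "hub"
    raise FileNotFoundError(f"weights for {model_name} not found in any cache")
```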
## Batching Configuration

Control request batching behavior for GPU efficiency.
| Variable | Default | Description |
|---|---|---|
| `SIE_MAX_BATCH_REQUESTS` | 64 | Maximum requests per batch |
| `SIE_MAX_BATCH_WAIT_MS` | 10 | Maximum milliseconds to wait for a batch to fill |
| `SIE_MAX_CONCURRENT_REQUESTS` | 512 | Maximum concurrent requests (queue size) |
Tuning guidance:

- Increase `SIE_MAX_BATCH_REQUESTS` for higher throughput on high-memory GPUs
- Decrease `SIE_MAX_BATCH_WAIT_MS` for lower latency at the cost of smaller batches
- Set `SIE_MAX_CONCURRENT_REQUESTS` based on expected burst traffic
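The interplay of the two batch limits can be sketched as a size- and time-bounded collector: a batch closes at `SIE_MAX_BATCH_REQUESTS` items or after `SIE_MAX_BATCH_WAIT_MS`, whichever comes first. This is an illustrative sketch, not SIE's actual batcher:

```python
import queue
import time

def collect_batch(q, max_requests=64, max_wait_ms=10):
    """Block for the first request, then keep adding requests until the
    batch is full or the wait deadline passes."""
    deadline = time.monotonic() + max_wait_ms / 1000.0
    batch = [q.get()]                      # block for the first request
    while len(batch) < max_requests:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                          # deadline hit: ship a partial batch
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break                          # queue drained before the deadline
    return batch
```

A larger `max_wait_ms` trades single-request latency for fuller (more GPU-efficient) batches, which is exactly the tuning trade-off described above.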
## Memory Configuration

Control memory pressure thresholds and LRU eviction behavior.
| Variable | Default | Description |
|---|---|---|
| `SIE_MEMORY_PRESSURE_THRESHOLD_PCT` | 85 | VRAM usage percent that triggers LRU eviction (0-100) |
| `SIE_MEMORY_CHECK_INTERVAL_S` | 1.0 | Background memory monitor interval in seconds |
How LRU eviction works:

- A background monitor checks memory usage every `SIE_MEMORY_CHECK_INTERVAL_S` seconds
- When usage exceeds `SIE_MEMORY_PRESSURE_THRESHOLD_PCT`, the least-recently-used model is evicted
- Evicted models are re-loaded on demand when the next request for them arrives
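One cycle of such a monitor can be sketched as below. All names (`check_memory_pressure`, the callbacks, the timestamp map) are hypothetical, not SIE's real internals:

```python
def check_memory_pressure(get_vram_pct, loaded_models, evict, threshold_pct=85):
    """One monitor cycle: while VRAM usage exceeds the threshold, evict the
    least-recently-used model. loaded_models maps model name -> last-use
    timestamp; evicted models would be re-loaded on the next request."""
    evicted = []
    while get_vram_pct() > threshold_pct and loaded_models:
        lru = min(loaded_models, key=loaded_models.get)  # oldest last use
        evict(lru)
        del loaded_models[lru]
        evicted.append(lru)
    return evicted
```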
## Logging Configuration

Control log format and verbosity.
| Variable | Default | Description |
|---|---|---|
| `SIE_LOG_JSON` | `false` | Enable structured JSON logging for Loki compatibility |
JSON log format includes structured fields:
```json
{
  "timestamp": "2025-12-18T10:30:00Z",
  "level": "INFO",
  "logger": "sie_server.core.registry",
  "message": "Inference completed",
  "model": "bge-m3",
  "request_id": "abc123",
  "trace_id": "def456",
  "latency_ms": 45.2
}
```
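Because each line is standalone JSON, log consumers can filter and correlate records with plain `json.loads`. A hypothetical consumer script:

```python
import json

# One SIE log line in the structured format shown above.
line = ('{"timestamp": "2025-12-18T10:30:00Z", "level": "INFO", '
        '"logger": "sie_server.core.registry", "message": "Inference completed", '
        '"model": "bge-m3", "request_id": "abc123", "trace_id": "def456", '
        '"latency_ms": 45.2}')

record = json.loads(line)
# Filter slow inferences and emit the trace id for correlation with traces.
if record["level"] == "INFO" and record["latency_ms"] > 40:
    print(record["trace_id"], record["model"])  # def456 bge-m3
```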
## Tracing Configuration

Enable OpenTelemetry distributed tracing.
| Variable | Default | Description |
|---|---|---|
| `SIE_TRACING_ENABLED` | `false` | Enable OpenTelemetry tracing |
When tracing is enabled, SIE respects standard OpenTelemetry environment variables:
| Variable | Default | Description |
|---|---|---|
| `OTEL_SERVICE_NAME` | `sie-server` | Service name in traces |
| `OTEL_TRACES_EXPORTER` | `otlp` | Exporter type (`otlp`, `console`, `none`) |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | `http://localhost:4317` | OTLP collector endpoint |
| `OTEL_TRACES_SAMPLER` | `always_on` | Sampling strategy |
| `OTEL_TRACES_SAMPLER_ARG` | 1.0 | Sampling rate (for the `traceidratio` sampler) |
## Performance Configuration

Advanced settings for compute precision and preprocessing.
| Variable | Default | Description |
|---|---|---|
| `SIE_PREPROCESSOR_WORKERS` | 4 | Number of preprocessing worker threads |
| `SIE_IMAGE_WORKERS` | 4 | Image preprocessing worker threads (for VLMs) |
| `SIE_ATTENTION_BACKEND` | `auto` | Attention implementation: `auto`, `flash_attention_2`, `sdpa`, `eager` |
| `SIE_DEFAULT_COMPUTE_PRECISION` | `float16` | Default compute precision: `float16`, `bfloat16`, `float32` |
| `SIE_INSTRUMENTATION` | `false` | Enable detailed batch statistics for debugging |
## LoRA Configuration

Control LoRA adapter loading behavior.
| Variable | Default | Description |
|---|---|---|
| `SIE_MAX_LORAS_PER_MODEL` | 10 | Maximum LoRA adapters to keep loaded per model |
When the limit is reached, the least-recently-used LoRA adapter is evicted.
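A capacity-capped LRU cache of this kind can be sketched with an `OrderedDict`. `AdapterCache` and `load_fn` are illustrative names, not SIE's real API:

```python
from collections import OrderedDict

class AdapterCache:
    """Keep at most max_loras adapters loaded; loading one more evicts
    the least-recently-used entry (mirroring SIE_MAX_LORAS_PER_MODEL)."""

    def __init__(self, max_loras=10):
        self.max_loras = max_loras
        self._adapters = OrderedDict()          # name -> adapter, oldest first

    def get(self, name, load_fn):
        if name in self._adapters:
            self._adapters.move_to_end(name)    # mark as recently used
        else:
            if len(self._adapters) >= self.max_loras:
                self._adapters.popitem(last=False)  # evict LRU adapter
            self._adapters[name] = load_fn(name)
        return self._adapters[name]
```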
## Example: Production Configuration

```sh
# High-throughput production setup
export SIE_DEVICE=cuda
export SIE_MODELS_DIR=s3://my-bucket/models/
export SIE_CLUSTER_CACHE=s3://my-bucket/weights/
export SIE_LOCAL_CACHE=/mnt/nvme/cache

# Batching optimized for A100-80GB
export SIE_MAX_BATCH_REQUESTS=128
export SIE_MAX_BATCH_WAIT_MS=5
export SIE_MAX_CONCURRENT_REQUESTS=1024

# Memory management
export SIE_MEMORY_PRESSURE_THRESHOLD_PCT=90

# Observability
export SIE_LOG_JSON=true
export SIE_TRACING_ENABLED=true
export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317
```

## Example: Development Configuration
```sh
# Local development setup
export SIE_DEVICE=mps  # or cuda, cpu
export SIE_MODELS_DIR=./models

# Lower batching for faster iteration
export SIE_MAX_BATCH_REQUESTS=8
export SIE_MAX_BATCH_WAIT_MS=1

# Debug logging
export SIE_INSTRUMENTATION=true
```

## What’s Next
- CLI Reference - Command-line options that map to these variables
- HTTP API Reference - Endpoints exposed by the configured server