SIE Configuration

SIE uses environment variables for server configuration. CLI arguments override environment variables, which override defaults.

Core settings for device selection, model loading, and server behavior.

| Variable | Default | Description |
| --- | --- | --- |
| SIE_DEVICE | auto | Inference device. Options: auto (detect GPU), cuda, cuda:0, mps, cpu |
| SIE_MODELS_DIR | ./models | Path to model configs directory. Supports local paths, s3://, or gs:// URLs |
| SIE_MODEL_FILTER | None | Comma-separated list of model names to load. If unset, all models are available |
| SIE_GPU_TYPE | Auto-detected | Override detected GPU type for routing (e.g., l4, a100-80gb, h100) |
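
For example, to pin the server to one GPU and load only a subset of models (the model names below are illustrative, not shipped defaults):

export SIE_DEVICE=cuda:0
export SIE_MODELS_DIR=./models
# Load only these models; any other configs in the directory are not served
export SIE_MODEL_FILTER=bge-m3,my-reranker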

Control where model weights are stored and retrieved.

| Variable | Default | Description |
| --- | --- | --- |
| SIE_LOCAL_CACHE | HF_HOME | Local cache directory for model weights |
| SIE_CLUSTER_CACHE | None | Cluster cache URL for shared weights (s3:// or gs://) |
| SIE_HF_FALLBACK | true | Enable HuggingFace Hub fallback for weight downloads |

Cache resolution order:

  1. Local cache (SIE_LOCAL_CACHE)
  2. Cluster cache (SIE_CLUSTER_CACHE)
  3. HuggingFace Hub (if SIE_HF_FALLBACK=true)
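
For example, an air-gapped deployment might serve weights only from the local and cluster caches and disable the Hub fallback (the path and bucket name are illustrative):

export SIE_LOCAL_CACHE=/mnt/nvme/sie-cache
export SIE_CLUSTER_CACHE=s3://example-bucket/weights/
# Disable the HuggingFace Hub fallback; weights must come from the caches above
export SIE_HF_FALLBACK=false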

Control request batching behavior for GPU efficiency.

| Variable | Default | Description |
| --- | --- | --- |
| SIE_MAX_BATCH_REQUESTS | 64 | Maximum requests per batch |
| SIE_MAX_BATCH_WAIT_MS | 10 | Maximum milliseconds to wait for a batch to fill |
| SIE_MAX_CONCURRENT_REQUESTS | 512 | Maximum concurrent requests (queue size) |

Tuning guidance:

  • Increase SIE_MAX_BATCH_REQUESTS for higher throughput on high-memory GPUs
  • Decrease SIE_MAX_BATCH_WAIT_MS for lower latency at the cost of smaller batches
  • Set SIE_MAX_CONCURRENT_REQUESTS based on expected burst traffic
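
As a latency-oriented starting point (the numbers below are illustrative, not tuned recommendations):

# Smaller batches and a shorter wait reduce per-request latency at the cost of throughput
export SIE_MAX_BATCH_REQUESTS=16
export SIE_MAX_BATCH_WAIT_MS=2
# Size the request queue for the bursts you expect
export SIE_MAX_CONCURRENT_REQUESTS=256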

Control memory pressure thresholds and LRU eviction behavior.

| Variable | Default | Description |
| --- | --- | --- |
| SIE_MEMORY_PRESSURE_THRESHOLD_PCT | 85 | VRAM usage percent that triggers LRU eviction (0-100) |
| SIE_MEMORY_CHECK_INTERVAL_S | 1.0 | Background memory monitor interval in seconds |

How LRU eviction works:

  1. Background monitor checks memory usage every SIE_MEMORY_CHECK_INTERVAL_S seconds
  2. When usage exceeds SIE_MEMORY_PRESSURE_THRESHOLD_PCT, the least-recently-used model is evicted
  3. Evicted models are reloaded on demand when the next request for them arrives
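
For example, a node that packs many models into limited VRAM might evict earlier and poll more often (values are illustrative):

# Trigger LRU eviction at 75% VRAM usage instead of the default 85%
export SIE_MEMORY_PRESSURE_THRESHOLD_PCT=75
# Check memory usage twice per second
export SIE_MEMORY_CHECK_INTERVAL_S=0.5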

Control log format and verbosity.

| Variable | Default | Description |
| --- | --- | --- |
| SIE_LOG_JSON | false | Enable structured JSON logging for Loki compatibility |

The JSON log format includes structured fields, for example:

{
  "timestamp": "2025-12-18T10:30:00Z",
  "level": "INFO",
  "logger": "sie_server.core.registry",
  "message": "Inference completed",
  "model": "bge-m3",
  "request_id": "abc123",
  "trace_id": "def456",
  "latency_ms": 45.2
}
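
With JSON logging enabled, the output is easy to filter with standard tools such as jq. The sie-server command below is an assumed entry point; substitute however you actually start the server:

export SIE_LOG_JSON=true
# "sie-server" is a placeholder for your server command; keep only warnings and errors
sie-server 2>&1 | jq 'select(.level == "WARNING" or .level == "ERROR")'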

Enable OpenTelemetry distributed tracing.

| Variable | Default | Description |
| --- | --- | --- |
| SIE_TRACING_ENABLED | false | Enable OpenTelemetry tracing |

When tracing is enabled, SIE respects standard OpenTelemetry environment variables:

| Variable | Default | Description |
| --- | --- | --- |
| OTEL_SERVICE_NAME | sie-server | Service name in traces |
| OTEL_TRACES_EXPORTER | otlp | Exporter type (otlp, console, none) |
| OTEL_EXPORTER_OTLP_ENDPOINT | http://localhost:4317 | OTLP collector endpoint |
| OTEL_TRACES_SAMPLER | always_on | Sampling strategy |
| OTEL_TRACES_SAMPLER_ARG | 1.0 | Sampling rate (for the traceidratio sampler) |
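
For example, you might print spans to the console while debugging locally, or export to a collector and sample a fraction of traces in production (the collector endpoint is illustrative):

# Local debugging: print spans to stdout instead of exporting them
export SIE_TRACING_ENABLED=true
export OTEL_TRACES_EXPORTER=console

# Production: export to a collector and sample 10% of traces
export SIE_TRACING_ENABLED=true
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
export OTEL_TRACES_SAMPLER=traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1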

Advanced settings for compute precision and preprocessing.

| Variable | Default | Description |
| --- | --- | --- |
| SIE_PREPROCESSOR_WORKERS | 4 | Number of preprocessing worker threads |
| SIE_IMAGE_WORKERS | 4 | Image preprocessing worker threads (for VLMs) |
| SIE_ATTENTION_BACKEND | auto | Attention implementation: auto, flash_attention_2, sdpa, eager |
| SIE_DEFAULT_COMPUTE_PRECISION | float16 | Default compute precision: float16, bfloat16, float32 |
| SIE_INSTRUMENTATION | false | Enable detailed batch statistics for debugging |
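
For example, on GPUs that support bfloat16 and FlashAttention 2 (Ampere or newer), you might opt into both explicitly; whether a given model can use them depends on its architecture:

# Prefer bfloat16 math and the FlashAttention 2 backend where supported
export SIE_DEFAULT_COMPUTE_PRECISION=bfloat16
export SIE_ATTENTION_BACKEND=flash_attention_2
# Scale preprocessing workers with available CPU cores (illustrative value)
export SIE_PREPROCESSOR_WORKERS=8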

Control LoRA adapter loading behavior.

| Variable | Default | Description |
| --- | --- | --- |
| SIE_MAX_LORAS_PER_MODEL | 10 | Maximum LoRA adapters to keep loaded per model |

When the limit is reached, the least-recently-used LoRA adapter is evicted.
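
For example, a multi-tenant deployment that serves many adapters per base model can raise the limit, at the cost of extra memory per resident adapter (the value is illustrative):

# Keep up to 32 adapters resident per base model before LRU eviction kicks in
export SIE_MAX_LORAS_PER_MODEL=32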


# High-throughput production setup
export SIE_DEVICE=cuda
export SIE_MODELS_DIR=s3://my-bucket/models/
export SIE_CLUSTER_CACHE=s3://my-bucket/weights/
export SIE_LOCAL_CACHE=/mnt/nvme/cache
# Batching optimized for A100-80GB
export SIE_MAX_BATCH_REQUESTS=128
export SIE_MAX_BATCH_WAIT_MS=5
export SIE_MAX_CONCURRENT_REQUESTS=1024
# Memory management
export SIE_MEMORY_PRESSURE_THRESHOLD_PCT=90
# Observability
export SIE_LOG_JSON=true
export SIE_TRACING_ENABLED=true
export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317

# Local development setup
export SIE_DEVICE=mps # or cuda, cpu
export SIE_MODELS_DIR=./models
# Lower batching for faster iteration
export SIE_MAX_BATCH_REQUESTS=8
export SIE_MAX_BATCH_WAIT_MS=1
# Detailed batch statistics for debugging
export SIE_INSTRUMENTATION=true