Overview

SIE deploys as a single container with no external dependencies. Start with Docker for simplicity. Graduate to Kubernetes when you need scaling or high availability.

Choose your deployment path based on your requirements:

Do you need horizontal scaling or HA?
├─ No → Docker (single instance)
│   └─ Need GPU? → Add --gpus all
└─ Yes → Kubernetes
    ├─ Single GPU type? → Basic Helm deployment
    └─ Multi-GPU pools? → Elastic cloud with router

Start with Docker. Most teams run SIE as a single container for months before needing Kubernetes. The multi-model architecture means one container serves all your embedding workloads.

| Deployment | Best For | Scaling | Effort |
| --- | --- | --- | --- |
| Docker | Development, small production | Vertical (bigger GPU) | Minutes |
| Docker Compose | Multi-container setups | Vertical | Minutes |
| Kubernetes (Helm) | Production, HA required | Horizontal replicas | Hours |
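For the Docker Compose path, a minimal compose file can wrap the same image. The sketch below is an assumption about a reasonable layout, not an official compose file: the image, port, and cache mount come from this page, while the service name and the NVIDIA GPU reservation block are illustrative (drop the deploy block on CPU-only hosts).

# Minimal sketch: write a compose file and start SIE with a GPU reservation
cat > docker-compose.yml <<'EOF'
services:
  sie:
    image: ghcr.io/superlinked/sie:latest
    ports:
      - "8080:8080"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface   # persistent model cache
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
EOF
docker compose up -d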

SIE runs on CPU, but GPU acceleration is strongly recommended for production.

| Component | Minimum | Recommended |
| --- | --- | --- |
| CPU | 4 cores | 8+ cores |
| RAM | 16 GB | 32+ GB |
| GPU | None (CPU mode) | NVIDIA T4 or better |
| VRAM | N/A | 16+ GB |
| Disk | 50 GB | 100+ GB (model cache) |
GPU sizing by workload:

| Workload | GPU | Notes |
| --- | --- | --- |
| Development | None or T4 | CPU works for testing |
| Small production | T4 / L4 | 1-10 models, low traffic |
| Medium production | A10G / L40S | 10-50 models, moderate traffic |
| High throughput | A100 / H100 | Maximum performance |
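Before settling on a GPU tier, confirm that containers on the host can see the GPU at all. The check below is the standard NVIDIA Container Toolkit sanity test, not an SIE-specific command, and assumes the toolkit is installed on the host:

# General check: verify Docker can pass the GPU through to a container
docker run --rm --gpus all ubuntu nvidia-smi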

SIE supports multiple GPU vendors and CPU inference:

| Device | Flag | Notes |
| --- | --- | --- |
| NVIDIA CUDA | --device cuda | Recommended for production |
| Apple Silicon | --device mps | M1/M2/M3 for local development |
| CPU | --device cpu | Fallback, significantly slower |
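The device flag is most naturally passed to the same sie-server serve entry point shown later on this page. Combining --device with model selection as below is an assumption about how the flags compose; only the individual flags are documented here:

# Sketch: force CPU inference on a host without a GPU (flag combination assumed)
docker run -p 8080:8080 ghcr.io/superlinked/sie:latest \
  sie-server serve --device cpu -m BAAI/bge-m3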

Pull and run the official image:

# CPU only
docker run -p 8080:8080 ghcr.io/superlinked/sie:latest
# With GPU (recommended)
docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie:latest

The server starts on port 8080. Models load on first request.

Common startup options:
# Custom port
docker run --gpus all -p 3000:8080 ghcr.io/superlinked/sie:latest
# Specific models only (faster startup)
docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie:latest \
sie-server serve -m BAAI/bge-m3,BAAI/bge-reranker-v2-m3
# Persistent model cache (skip re-downloads)
docker run --gpus all -p 8080:8080 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
ghcr.io/superlinked/sie:latest
# Custom bundle for specific dependencies
docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie:latest \
sie-server serve -b gte-qwen2

Verify the server is running:

# Health endpoint
curl http://localhost:8080/healthz
# List available models
curl http://localhost:8080/v1/models
# Test encoding
curl -X POST http://localhost:8080/v1/encode/BAAI/bge-m3 \
-H "Content-Type: application/json" \
-d '{"items": [{"text": "Hello world"}]}'
Stay with Docker when:

  • Single GPU is sufficient for your throughput
  • You do not require high availability
  • You are in development or early production
  • Vertical scaling (bigger GPU) meets your needs

Move to Kubernetes when any of the following apply (a minimal Helm sketch follows this list):

  • You need horizontal scaling (multiple replicas)
  • High availability is required (pod failover)
  • You have multiple GPU types to utilize
  • You need automated scaling based on load
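The basic Helm deployment from the decision tree typically reduces to a repo add and an install. The chart repository URL, chart name, and values keys below are hypothetical placeholders for illustration; substitute the ones your SIE distribution actually publishes.

# Hypothetical chart location and values -- replace with the published SIE chart details
helm repo add sie https://charts.example.com/sie
helm repo update
helm install sie sie/sie \
  --set replicaCount=2 \
  --set image.repository=ghcr.io/superlinked/sie \
  --set image.tag=latest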