Overview

SIE deploys as a single container with no external dependencies. Start with Docker for simplicity. Graduate to Kubernetes when you need scaling or high availability.

Choose your deployment path based on your requirements:

Do you need horizontal scaling or HA?
├─ No → Docker (single instance)
│   └─ Need GPU? → Add --gpus all
└─ Yes → Kubernetes
    ├─ Single GPU type? → Basic Helm deployment
    └─ Multi-GPU pools? → Elastic cloud with router

Start with Docker. Most teams run SIE as a single container for months before needing Kubernetes. The multi-model architecture means one container serves all your embedding workloads.

| Deployment | Best For | Scaling | Effort |
| --- | --- | --- | --- |
| Docker | Development, small production | Vertical (bigger GPU) | Minutes |
| Docker Compose | Multi-container setups | Vertical | Minutes |
| Kubernetes (Helm) | Production, HA required | Horizontal replicas | Hours |
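For the Docker Compose path, a minimal compose file can wrap the same image. The sketch below is an assumption about a reasonable layout, not an official compose file: the image, port, and cache mount come from this page, while the service name and the NVIDIA GPU reservation block are illustrative (drop the deploy block on CPU-only hosts).

# Minimal sketch: write a compose file and start SIE with a GPU reservation
cat > docker-compose.yml <<'EOF'
services:
  sie:
    image: ghcr.io/superlinked/sie:latest
    ports:
      - "8080:8080"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface   # persistent model cache
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
EOF
docker compose up -d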

SIE runs on CPU, but GPU acceleration is strongly recommended for production.

| Component | Minimum | Recommended |
| --- | --- | --- |
| CPU | 4 cores | 8+ cores |
| RAM | 16 GB | 32+ GB |
| GPU | None (CPU mode) | NVIDIA T4 or better |
| VRAM | N/A | 16+ GB |
| Disk | 50 GB | 100+ GB (model cache) |
GPU sizing by workload:

| Workload | GPU | Notes |
| --- | --- | --- |
| Development | None or T4 | CPU works for testing |
| Small production | T4 / L4 | 1-10 models, low traffic |
| Medium production | A10G / L40S | 10-50 models, moderate traffic |
| High throughput | A100 / H100 | Maximum performance |
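Before settling on a GPU tier, confirm that containers on the host can see the GPU at all. The check below is the standard NVIDIA Container Toolkit sanity test, not an SIE-specific command, and assumes the toolkit is installed on the host:

# General check: verify Docker can pass the GPU through to a container
docker run --rm --gpus all ubuntu nvidia-smi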

SIE supports multiple GPU vendors and CPU inference:

| Device | Flag | Notes |
| --- | --- | --- |
| NVIDIA CUDA | --device cuda | Recommended for production |
| Apple Silicon | --device mps | M1/M2/M3 for local development |
| CPU | --device cpu | Fallback, significantly slower |
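The device flag is most naturally passed to the same sie-server serve entry point shown later on this page. Combining --device with model selection as below is an assumption about how the flags compose; only the individual flags are documented here:

# Sketch: force CPU inference on a host without a GPU (flag combination assumed)
docker run -p 8080:8080 ghcr.io/superlinked/sie:latest \
  sie-server serve --device cpu -m BAAI/bge-m3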

Pull and run the official image:

# CPU only
docker run -p 8080:8080 ghcr.io/superlinked/sie:latest
# With GPU (recommended)
docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie:latest

The server starts on port 8080. Models load on first request.

Common startup options:
# Custom port
docker run --gpus all -p 3000:8080 ghcr.io/superlinked/sie:latest
# Specific models only (faster startup)
docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie:latest \
sie-server serve -m BAAI/bge-m3,BAAI/bge-reranker-v2-m3
# Persistent model cache (skip re-downloads)
docker run --gpus all -p 8080:8080 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
ghcr.io/superlinked/sie:latest
# Custom bundle for specific dependencies
docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie:latest \
sie-server serve -b gte-qwen2

Verify the server is running:

# Health endpoint
curl http://localhost:8080/healthz
# List available models
curl http://localhost:8080/v1/models
# Test encoding
curl -X POST http://localhost:8080/v1/encode/BAAI/bge-m3 \
-H "Content-Type: application/json" \
-d '{"items": [{"text": "Hello world"}]}'
Stay with Docker when:

  • Single GPU is sufficient for your throughput
  • You do not require high availability
  • You are in development or early production
  • Vertical scaling (bigger GPU) meets your needs

Move to Kubernetes when any of the following apply (a minimal Helm sketch follows this list):

  • You need horizontal scaling (multiple replicas)
  • High availability is required (pod failover)
  • You have multiple GPU types to utilize
  • You need automated scaling based on load
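The basic Helm deployment from the decision tree typically reduces to a repo add and an install. The chart repository URL, chart name, and values keys below are hypothetical placeholders for illustration; substitute the ones your SIE distribution actually publishes.

# Hypothetical chart location and values -- replace with the published SIE chart details
helm repo add sie https://charts.example.com/sie
helm repo update
helm install sie sie/sie \
  --set replicaCount=2 \
  --set image.repository=ghcr.io/superlinked/sie \
  --set image.tag=latest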