
What is SIE?

SIE is an inference server for small models (80+ supported). It exposes three primitives: encode (text and images to vectors), score (query-document relevance), and extract (entities and structure).

Start with the Quickstart to get your first vectors in 2 minutes. The API Reference and SDK Reference cover the full interface.

# Start the server first (shell):
#   docker run -p 8080:8080 ghcr.io/superlinked/sie:latest

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# encode: text → vectors
result = client.encode("BAAI/bge-m3", Item(text="your text"))
print(result["dense"].shape)  # (1024,)

# score: query + items → ranked results
query = Item(text="What is machine learning?")
items = [Item(text="ML learns from data."), Item(text="The weather is nice.")]
scores = client.score("BAAI/bge-reranker-v2-m3", query, items)

# extract: text → entities
result = client.extract(
    "urchade/gliner_multi-v2.1",
    Item(text="Tim Cook leads Apple."),
    labels=["person", "org"],
)

LLM inference tools optimize for one large model across multiple GPUs. Small-model inference is the opposite problem: you run many models (encoders, rerankers, extractors) on one GPU and switch between them quickly.

What makes SIE different:

  1. Compute engine abstraction. SIE wraps PyTorch, SGLang, and Flash Attention behind three primitives. The server picks the best engine per model automatically.

  2. Multi-model GPU sharing. Load many models on one GPU with LRU eviction; one server instance serves any model at query time (a minimal sketch of the eviction idea follows this list).

  3. Laptop to cloud. Same codebase runs locally and in production Kubernetes.

  4. Validated correctness. Every model has quality and latency targets checked in CI (an illustrative check is sketched below).