
Model Adapters

Adapters are thin wrappers that connect model families to the inference engine. Each adapter implements a standard protocol for loading, unloading, and running inference. This enables SIE to support 80+ models with consistent behavior.

An adapter wraps a specific model architecture or library. It handles:

  • Loading model weights onto a device (CPU, CUDA, MPS)
  • Inference via encode(), score(), or extract() methods
  • Unloading with proper memory cleanup

One adapter can serve many models. For example, SentenceTransformerDenseAdapter works with all-MiniLM, E5, BGE, and hundreds of other compatible models.

Every adapter exposes the same core lifecycle (a protocol sketch follows the list):

  • Capabilities to declare input/output support
  • Dimensions for output shapes
  • Load/Unload for device placement and cleanup
  • Encode/Score/Extract for inference
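
The real interface is defined in the SIE codebase; as a rough sketch only, the protocol could look like this (method names follow the lifecycle above, the signatures are illustrative assumptions):

from typing import Any, Protocol

class ModelAdapter(Protocol):
    def capabilities(self) -> dict[str, Any]: ...   # input/output modalities, can_score, can_extract
    def dimensions(self) -> dict[str, int]: ...     # dense / sparse / multivector output sizes
    def load(self, model_id: str, device: str) -> None: ...
    def unload(self) -> None: ...
    def encode(self, inputs: list[str]) -> Any: ...
    def score(self, query: str, documents: list[str]) -> list[float]: ...
    def extract(self, inputs: list[str], labels: list[str]) -> Any: ...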

Each adapter declares its capabilities:

Field         Type        Description
inputs        list[str]   Supported input modalities: "text", "image", "audio"
outputs       list[str]   Output types: "dense", "sparse", "multivector"
can_score     bool        Supports reranking via score()
can_extract   bool        Supports extraction via extract()

Capabilities are static metadata in the model config and adapter implementation.
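
As an illustration (not taken from a real config), a text-only dense embedding model would declare something like:

capabilities = {
    "inputs": ["text"],
    "outputs": ["dense"],
    "can_score": False,
    "can_extract": False,
}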

Adapters report output dimensions for validation and client usage:

Field         Description
dense         Dense vector dimensionality (e.g., 1024)
sparse        Vocabulary size for sparse vectors
multivector   Per-token embedding dimension
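
For example, a model such as BAAI/bge-m3, which produces all three output types, might report something like the following (the numbers are illustrative):

dimensions = {
    "dense": 1024,        # dense vector dimensionality
    "sparse": 250002,     # vocabulary size for sparse weights
    "multivector": 1024,  # per-token embedding dimension
}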

Adapters use different compute backends depending on model architecture:

Flash Attention with variable-length sequences eliminates padding waste: batches are packed into a single unpadded tensor and attention is computed with flash_attn_varlen_func, so no work is spent on padding tokens (a minimal sketch appears below).

Benefits:

  • Higher throughput (no wasted compute on padding)
  • Lower memory usage (no padded tensors)
  • 20-40% speedup on typical workloads

Used by: BertFlashAdapter, Qwen2FlashAdapter, SPLADEFlashAdapter, ColBERTAdapter
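
A minimal sketch of the varlen pattern, assuming the flash-attn package is installed and CUDA is available; the tensor shapes, head counts, and names here are illustrative, not SIE internals:

import torch
from flash_attn import flash_attn_varlen_func

# Three sequences of lengths 5, 3, and 7 packed into one tensor: 15 tokens, no padding.
cu_seqlens = torch.tensor([0, 5, 8, 15], dtype=torch.int32, device="cuda")  # cumulative lengths
total_tokens, num_heads, head_dim = 15, 12, 64

q = torch.randn(total_tokens, num_heads, head_dim, dtype=torch.float16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

# Attention runs over each packed sequence independently; padded positions never exist.
out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=7, max_seqlen_k=7,
)
print(out.shape)  # (15, 12, 64)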

SGLang provides memory-efficient inference for large LLM embedding models (4B+ parameters). It pre-allocates the KV cache so memory stays bounded and requests do not hit out-of-memory errors under concurrent load; an example request is sketched below.

Benefits:

  • Stable memory usage with concurrent requests
  • Handles 4B-8B parameter models reliably
  • LoRA adapter support via HTTP API

Used by: SGLangEmbeddingAdapter for Qwen3-Embedding-4B, GTE-Qwen2-7B, etc.
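
For illustration, querying an SGLang server launched in embedding mode uses its OpenAI-compatible HTTP API; the port, model name, and input below are placeholders, and the response follows the OpenAI embeddings format:

import requests

resp = requests.post(
    "http://localhost:30000/v1/embeddings",  # SGLang's OpenAI-compatible endpoint
    json={"model": "Qwen/Qwen3-Embedding-4B", "input": ["what is dense retrieval?"]},
    timeout=30,
)
embedding = resp.json()["data"][0]["embedding"]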

Standard PyTorch with scaled dot-product attention (SDPA), running inference through off-the-shelf libraries such as sentence-transformers (usage sketch below).

Benefits:

  • Broadest compatibility
  • Works on CPU, CUDA, and MPS
  • Simple debugging

Used by: SentenceTransformerDenseAdapter, CrossEncoderAdapter, CLIPAdapter
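
The SDPA path is essentially plain library usage. For example, the kind of sentence-transformers call that SentenceTransformerDenseAdapter wraps looks like this:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cpu")
embeddings = model.encode(["adapters wrap model families"], normalize_embeddings=True)
print(embeddings.shape)  # (1, 384)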

Dense embedding adapters:

Adapter                           Compute   Models
SentenceTransformerDenseAdapter   SDPA      all-MiniLM, BGE-base, GTE-multilingual
BertFlashAdapter                  Flash     E5-v2 series, BERT-based models
Qwen2FlashAdapter                 Flash     stella_en_1.5B_v5, GTE-Qwen2 series
SGLangEmbeddingAdapter            SGLang    Qwen3-Embedding-4B/8B, E5-Mistral-7B
BGEM3Adapter                      SDPA      BAAI/bge-m3 (dense, sparse, multivector)
NoMicFlashAdapter                 Flash     nomic-embed-text-v2-moe
XLMRobertaFlashAdapter            Flash     multilingual-e5-large, XLM-R models
RoPEFlashAdapter                  Flash     Models with rotary position embeddings

Sparse embedding adapters:

Adapter                            Compute   Models
SentenceTransformerSparseAdapter   SDPA      sentence-transformers SparseEncoder models
SPLADEFlashAdapter                 Flash     SPLADE-v3, OpenSearch Neural Sparse
BGEM3Adapter                       SDPA      BAAI/bge-m3 sparse output

Multi-vector (ColBERT-style) adapters:

Adapter                         Compute   Models
ColBERTAdapter                  Flash     jina-colbert-v2, colbertv2.0, answerai-colbert-small
ColBERTModernBERTFlashAdapter   Flash     GTE-ModernColBERT-v1, Reason-ModernColBERT
ColBERTRotaryFlashAdapter       Flash     ColBERT models with RoPE

Reranker (cross-encoder) adapters:

Adapter                              Compute   Models
CrossEncoderAdapter                  SDPA      BGE-reranker, Jina-reranker, MS-MARCO
BertFlashCrossEncoderAdapter         Flash     BERT-based rerankers
JinaFlashCrossEncoderAdapter         Flash     jina-reranker-v2-base-multilingual
ModernBERTFlashCrossEncoderAdapter   Flash     gte-reranker-modernbert-base
Qwen2FlashCrossEncoderAdapter        Flash     Qwen2-based rerankers

Multimodal adapters:

Adapter               Modality       Models
CLIPAdapter           Text + Image   openai/clip-vit-base-patch32, LAION CLIP
SigLIPAdapter         Text + Image   google/siglip-so400m-patch14
ColPaliAdapter        Image          vidore/colpali-v1.3-hf
ColQwen2Adapter       Image          vidore/colqwen2.5-v0.2
NemoColEmbedAdapter   Image          nvidia/llama-nemoretriever-colembed-3b

Extraction and classification adapters:

Adapter                    Task                       Models
GLiNERAdapter              Zero-shot NER              gliner_multi-v2.1, NuNER_Zero
GLiRELAdapter              Relation extraction        glirel-large-v0
GLiClassAdapter            Classification             gliclass-base-v1.0
Florence2Adapter           Document understanding     Florence-2-base, Florence-2-large
DonutAdapter               Document parsing           donut-base-finetuned-docvqa
GroundingDinoAdapter       Object detection           grounding-dino-tiny, grounding-dino-base
OwlV2Adapter               Zero-shot detection        owlv2-base-patch16-ensemble
NLIClassificationAdapter   Zero-shot classification   deberta-v3-large-zeroshot-v2.0

Adapters must fully release GPU memory in unload() so that LRU eviction is safe. A typical implementation:

import gc

import torch

def unload(self) -> None:
    """Drop the model and release cached GPU memory."""
    if self._model is not None:
        del self._model
    self._model = None
    device, self._device = self._device, None  # remember the device before clearing it
    # Release GPU memory held by the freed tensors
    gc.collect()
    if device and device.startswith("cuda"):
        torch.cuda.empty_cache()
    elif device == "mps":
        torch.mps.empty_cache()

The registry tracks memory usage via memory_footprint() for LRU eviction.
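
The implementation of memory_footprint() is not shown here; a plausible version for a PyTorch-backed adapter simply sums parameter and buffer bytes:

def memory_footprint(self) -> int:
    """Approximate bytes held by the loaded model (illustrative sketch)."""
    if self._model is None:
        return 0
    params = sum(p.numel() * p.element_size() for p in self._model.parameters())
    buffers = sum(b.numel() * b.element_size() for b in self._model.buffers())
    return params + buffers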

Some adapters support dynamic LoRA adapter loading:

def supports_lora(self) -> bool:
    """Return True if this adapter supports LoRA."""
    ...

def load_lora(self, lora_path: str) -> int:
    """Load a LoRA adapter, return memory usage."""
    ...

def set_active_lora(self, lora_name: str | None) -> None:
    """Switch active LoRA before inference."""
    ...

SGLang adapters use the HTTP API for LoRA switching. PEFT-based adapters use the PEFTLoRAMixin for in-process loading.
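
As a rough sketch of the in-process (PEFT) path, load_lora() and set_active_lora() could be implemented along these lines; using the LoRA path as the adapter name is an assumption for illustration, not the behavior of SIE's actual PEFTLoRAMixin:

from peft import PeftModel

def load_lora(self, lora_path: str) -> int:
    """Attach a LoRA to the loaded base model and return its approximate size in bytes."""
    if isinstance(self._model, PeftModel):
        # A LoRA is already attached; load this one alongside it
        self._model.load_adapter(lora_path, adapter_name=lora_path)
    else:
        self._model = PeftModel.from_pretrained(self._model, lora_path, adapter_name=lora_path)
    lora_params = (p for n, p in self._model.named_parameters() if "lora_" in n)
    return sum(p.numel() * p.element_size() for p in lora_params)

def set_active_lora(self, lora_name: str | None) -> None:
    """Switch which LoRA is applied on the next forward pass."""
    if lora_name is not None:
        self._model.set_adapter(lora_name)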

To add support for new model architectures, see Adding Models.

The typical workflow (a skeletal example follows the steps):

  1. Identify the model architecture (BERT, Qwen2, custom)
  2. Choose a compute backend (SDPA, Flash, SGLang)
  3. Implement the adapter protocol
  4. Create a model config in packages/sie_server/models/
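
As a sketch only, a minimal SDPA-backed dense adapter following this workflow might look like the class below; the class and method names mirror the lifecycle described earlier and are illustrative, not the exact SIE interfaces:

from sentence_transformers import SentenceTransformer

class MyDenseAdapter:
    """Hypothetical adapter for a sentence-transformers dense model."""

    def __init__(self) -> None:
        self._model: SentenceTransformer | None = None
        self._device: str | None = None

    def capabilities(self) -> dict:
        return {"inputs": ["text"], "outputs": ["dense"],
                "can_score": False, "can_extract": False}

    def dimensions(self) -> dict:
        return {"dense": 384}

    def load(self, model_id: str, device: str) -> None:
        self._model = SentenceTransformer(model_id, device=device)
        self._device = device

    def encode(self, texts: list[str]):
        return self._model.encode(texts, normalize_embeddings=True)

    def unload(self) -> None:
        self._model = None
        self._device = None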