
Model Adapters

Adapters are thin wrappers that connect model families to the inference engine. Each adapter implements a standard protocol for loading, unloading, and running inference. This enables SIE to support 80+ models with consistent behavior.

An adapter wraps a specific model architecture or library. It handles:

  • Loading model weights onto a device (CPU, CUDA, MPS)
  • Inference via encode(), score(), or extract() methods
  • Unloading with proper memory cleanup

One adapter can serve many models. For example, SentenceTransformerDenseAdapter works with all-MiniLM, E5, BGE, and hundreds of other compatible models.

Every adapter exposes the same core lifecycle (a protocol sketch follows the list):

  • Capabilities to declare input/output support
  • Dimensions for output shapes
  • Load/Unload for device placement and cleanup
  • Encode/Score/Extract for inference
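
The real interface is defined in the SIE codebase; as a rough sketch only, the protocol could look like this (method names follow the lifecycle above, the signatures are illustrative assumptions):

from typing import Any, Protocol

class ModelAdapter(Protocol):
    def capabilities(self) -> dict[str, Any]: ...   # input/output modalities, can_score, can_extract
    def dimensions(self) -> dict[str, int]: ...     # dense / sparse / multivector output sizes
    def load(self, model_id: str, device: str) -> None: ...
    def unload(self) -> None: ...
    def encode(self, inputs: list[str]) -> Any: ...
    def score(self, query: str, documents: list[str]) -> list[float]: ...
    def extract(self, inputs: list[str], labels: list[str]) -> Any: ...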

Each adapter declares its capabilities:

Field         Type        Description
inputs        list[str]   Supported input modalities: "text", "image", "audio"
outputs       list[str]   Output types: "dense", "sparse", "multivector"
can_score     bool        Supports reranking via score()
can_extract   bool        Supports extraction via extract()

Capabilities are static metadata in the model config and adapter implementation.
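
As an illustration (not taken from a real config), a text-only dense embedding model would declare something like:

capabilities = {
    "inputs": ["text"],
    "outputs": ["dense"],
    "can_score": False,
    "can_extract": False,
}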

Adapters report output dimensions for validation and client usage:

Field         Description
dense         Dense vector dimensionality (e.g., 1024)
sparse        Vocabulary size for sparse vectors
multivector   Per-token embedding dimension
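
For example, a model such as BAAI/bge-m3, which produces all three output types, might report something like the following (the numbers are illustrative):

dimensions = {
    "dense": 1024,        # dense vector dimensionality
    "sparse": 250002,     # vocabulary size for sparse weights
    "multivector": 1024,  # per-token embedding dimension
}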

Adapters use different compute backends depending on model architecture:

Flash Attention with variable-length sequences eliminates padding waste: batches are packed into a single unpadded tensor and attention is computed with flash_attn_varlen_func, so no work is spent on padding tokens (a minimal sketch appears below).

Benefits:

  • Higher throughput (no wasted compute on padding)
  • Lower memory usage (no padded tensors)
  • 20-40% speedup on typical workloads

Used by: BertFlashAdapter, Qwen2FlashAdapter, SPLADEFlashAdapter, ColBERTAdapter
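
A minimal sketch of the varlen pattern, assuming the flash-attn package is installed and CUDA is available; the tensor shapes, head counts, and names here are illustrative, not SIE internals:

import torch
from flash_attn import flash_attn_varlen_func

# Three sequences of lengths 5, 3, and 7 packed into one tensor: 15 tokens, no padding.
cu_seqlens = torch.tensor([0, 5, 8, 15], dtype=torch.int32, device="cuda")  # cumulative lengths
total_tokens, num_heads, head_dim = 15, 12, 64

q = torch.randn(total_tokens, num_heads, head_dim, dtype=torch.float16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

# Attention runs over each packed sequence independently; padded positions never exist.
out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=7, max_seqlen_k=7,
)
print(out.shape)  # (15, 12, 64)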

SGLang provides memory-efficient inference for large LLM embedding models (4B+ parameters). It pre-allocates the KV cache so memory stays bounded and requests do not hit out-of-memory errors under concurrent load; an example request is sketched below.

Benefits:

  • Stable memory usage with concurrent requests
  • Handles 4B-8B parameter models reliably
  • LoRA adapter support via HTTP API

Used by: SGLangEmbeddingAdapter for Qwen3-Embedding-4B, GTE-Qwen2-7B, etc.
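
For illustration, querying an SGLang server launched in embedding mode uses its OpenAI-compatible HTTP API; the port, model name, and input below are placeholders, and the response follows the OpenAI embeddings format:

import requests

resp = requests.post(
    "http://localhost:30000/v1/embeddings",  # SGLang's OpenAI-compatible endpoint
    json={"model": "Qwen/Qwen3-Embedding-4B", "input": ["what is dense retrieval?"]},
    timeout=30,
)
embedding = resp.json()["data"][0]["embedding"]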

Standard PyTorch with scaled dot-product attention (SDPA), running inference through off-the-shelf libraries such as sentence-transformers (usage sketch below).

Benefits:

  • Broadest compatibility
  • Works on CPU, CUDA, and MPS
  • Simple debugging

Used by: SentenceTransformerDenseAdapter, CrossEncoderAdapter, CLIPAdapter
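
The SDPA path is essentially plain library usage. For example, the kind of sentence-transformers call that SentenceTransformerDenseAdapter wraps looks like this:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cpu")
embeddings = model.encode(["adapters wrap model families"], normalize_embeddings=True)
print(embeddings.shape)  # (1, 384)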

Dense embedding adapters:

Adapter                           Compute   Models
SentenceTransformerDenseAdapter   SDPA      all-MiniLM, BGE-base, GTE-multilingual
BertFlashAdapter                  Flash     E5-v2 series, BERT-based models
Qwen2FlashAdapter                 Flash     stella_en_1.5B_v5, GTE-Qwen2 series
SGLangEmbeddingAdapter            SGLang    Qwen3-Embedding-4B/8B, E5-Mistral-7B
BGEM3Adapter                      SDPA      BAAI/bge-m3 (dense, sparse, multivector)
NoMicFlashAdapter                 Flash     nomic-embed-text-v2-moe
XLMRobertaFlashAdapter            Flash     multilingual-e5-large, XLM-R models
RoPEFlashAdapter                  Flash     Models with rotary position embeddings

Sparse embedding adapters:

Adapter                            Compute   Models
SentenceTransformerSparseAdapter   SDPA      sentence-transformers SparseEncoder models
SPLADEFlashAdapter                 Flash     SPLADE-v3, OpenSearch Neural Sparse
BGEM3Adapter                       SDPA      BAAI/bge-m3 sparse output

Multi-vector (ColBERT-style) adapters:

Adapter                         Compute   Models
ColBERTAdapter                  Flash     jina-colbert-v2, colbertv2.0, answerai-colbert-small
ColBERTModernBERTFlashAdapter   Flash     GTE-ModernColBERT-v1, Reason-ModernColBERT
ColBERTRotaryFlashAdapter       Flash     ColBERT models with RoPE

Reranker (cross-encoder) adapters:

Adapter                              Compute   Models
CrossEncoderAdapter                  SDPA      BGE-reranker, Jina-reranker, MS-MARCO
BertFlashCrossEncoderAdapter         Flash     BERT-based rerankers
JinaFlashCrossEncoderAdapter         Flash     jina-reranker-v2-base-multilingual
ModernBERTFlashCrossEncoderAdapter   Flash     gte-reranker-modernbert-base
Qwen2FlashCrossEncoderAdapter        Flash     Qwen2-based rerankers

Multimodal adapters:

Adapter               Modality       Models
CLIPAdapter           Text + Image   openai/clip-vit-base-patch32, LAION CLIP
SigLIPAdapter         Text + Image   google/siglip-so400m-patch14
ColPaliAdapter        Image          vidore/colpali-v1.3-hf
ColQwen2Adapter       Image          vidore/colqwen2.5-v0.2
NemoColEmbedAdapter   Image          nvidia/llama-nemoretriever-colembed-3b

Extraction and classification adapters:

Adapter                    Task                       Models
GLiNERAdapter              Zero-shot NER              gliner_multi-v2.1, NuNER_Zero
GLiRELAdapter              Relation extraction        glirel-large-v0
GLiClassAdapter            Classification             gliclass-base-v1.0
Florence2Adapter           Document understanding     Florence-2-base, Florence-2-large
DonutAdapter               Document parsing           donut-base-finetuned-docvqa
GroundingDinoAdapter       Object detection           grounding-dino-tiny, grounding-dino-base
OwlV2Adapter               Zero-shot detection        owlv2-base-patch16-ensemble
NLIClassificationAdapter   Zero-shot classification   deberta-v3-large-zeroshot-v2.0

Adapters must fully release GPU memory in unload() so that LRU eviction is safe. A typical implementation:

import gc

import torch

def unload(self) -> None:
    """Drop the model and release cached GPU memory."""
    if self._model is not None:
        del self._model
    self._model = None
    device, self._device = self._device, None  # remember the device before clearing it
    # Release GPU memory held by the freed tensors
    gc.collect()
    if device and device.startswith("cuda"):
        torch.cuda.empty_cache()
    elif device == "mps":
        torch.mps.empty_cache()

The registry tracks memory usage via memory_footprint() for LRU eviction.
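
The implementation of memory_footprint() is not shown here; a plausible version for a PyTorch-backed adapter simply sums parameter and buffer bytes:

def memory_footprint(self) -> int:
    """Approximate bytes held by the loaded model (illustrative sketch)."""
    if self._model is None:
        return 0
    params = sum(p.numel() * p.element_size() for p in self._model.parameters())
    buffers = sum(b.numel() * b.element_size() for b in self._model.buffers())
    return params + buffers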

Some adapters support dynamic LoRA adapter loading:

def supports_lora(self) -> bool:
    """Return True if this adapter supports LoRA."""
    ...

def load_lora(self, lora_path: str) -> int:
    """Load a LoRA adapter, return memory usage."""
    ...

def set_active_lora(self, lora_name: str | None) -> None:
    """Switch active LoRA before inference."""
    ...

SGLang adapters use the HTTP API for LoRA switching. PEFT-based adapters use the PEFTLoRAMixin for in-process loading.
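
As a rough sketch of the in-process (PEFT) path, load_lora() and set_active_lora() could be implemented along these lines; using the LoRA path as the adapter name is an assumption for illustration, not the behavior of SIE's actual PEFTLoRAMixin:

from peft import PeftModel

def load_lora(self, lora_path: str) -> int:
    """Attach a LoRA to the loaded base model and return its approximate size in bytes."""
    if isinstance(self._model, PeftModel):
        # A LoRA is already attached; load this one alongside it
        self._model.load_adapter(lora_path, adapter_name=lora_path)
    else:
        self._model = PeftModel.from_pretrained(self._model, lora_path, adapter_name=lora_path)
    lora_params = (p for n, p in self._model.named_parameters() if "lora_" in n)
    return sum(p.numel() * p.element_size() for p in lora_params)

def set_active_lora(self, lora_name: str | None) -> None:
    """Switch which LoRA is applied on the next forward pass."""
    if lora_name is not None:
        self._model.set_adapter(lora_name)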

To add support for new model architectures, see Adding Models.

The typical workflow (a skeletal example follows the steps):

  1. Identify the model architecture (BERT, Qwen2, custom)
  2. Choose a compute backend (SDPA, Flash, SGLang)
  3. Implement the adapter protocol
  4. Create a model config in packages/sie_server/models/
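
As a sketch only, a minimal SDPA-backed dense adapter following this workflow might look like the class below; the class and method names mirror the lifecycle described earlier and are illustrative, not the exact SIE interfaces:

from sentence_transformers import SentenceTransformer

class MyDenseAdapter:
    """Hypothetical adapter for a sentence-transformers dense model."""

    def __init__(self) -> None:
        self._model: SentenceTransformer | None = None
        self._device: str | None = None

    def capabilities(self) -> dict:
        return {"inputs": ["text"], "outputs": ["dense"],
                "can_score": False, "can_extract": False}

    def dimensions(self) -> dict:
        return {"dense": 384}

    def load(self, model_id: str, device: str) -> None:
        self._model = SentenceTransformer(model_id, device=device)
        self._device = device

    def encode(self, texts: list[str]):
        return self._model.encode(texts, normalize_embeddings=True)

    def unload(self) -> None:
        self._model = None
        self._device = None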