# Model Adapters
Adapters are thin wrappers that connect model families to the inference engine. Each adapter implements a standard protocol for loading, unloading, and running inference. This enables SIE to support 80+ models with consistent behavior.
## What Are Adapters

An adapter wraps a specific model architecture or library. It handles:
- Loading model weights onto a device (CPU, CUDA, MPS)
- Inference via encode(), score(), or extract() methods
- Unloading with proper memory cleanup
One adapter can serve many models. For example, SentenceTransformerDenseAdapter works with all-MiniLM, E5, BGE, and hundreds of other compatible models.
## Adapter Protocol

Every adapter exposes the same core lifecycle:
- Capabilities to declare input/output support
- Dimensions for output shapes
- Load/Unload for device placement and cleanup
- Encode/Score/Extract for inference
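
A condensed sketch of what this lifecycle could look like as a typing.Protocol. The class name and exact signatures below are illustrative assumptions, not the SIE source:

```python
from typing import Any, Protocol


class ModelAdapter(Protocol):
    """Illustrative adapter lifecycle; the real SIE signatures may differ."""

    def capabilities(self) -> dict[str, Any]: ...   # declared inputs/outputs
    def dimensions(self) -> dict[str, int]: ...     # output shapes for validation
    def load(self, device: str) -> None: ...        # place weights on cpu/cuda/mps
    def unload(self) -> None: ...                   # release weights and GPU memory
    def encode(self, inputs: list[Any]) -> Any: ... # dense/sparse/multivector outputs
    def score(self, query: str, documents: list[str]) -> list[float]: ...  # reranking
    def extract(self, inputs: list[Any], labels: list[str]) -> Any: ...    # NER, classification, etc.
```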
### Capabilities

Each adapter declares its capabilities:

| Field | Type | Description |
|---|---|---|
| inputs | list[str] | Supported input modalities: “text”, “image”, “audio” |
| outputs | list[str] | Output types: “dense”, “sparse”, “multivector” |
| can_score | bool | Supports reranking via score() |
| can_extract | bool | Supports extraction via extract() |

Capabilities are static metadata in the model config and adapter implementation.
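
For illustration only, a capabilities record along these lines could be expressed as a small dataclass (the class name and defaults here are assumptions, not SIE code):

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AdapterCapabilities:
    """Static capability metadata mirroring the fields in the table above."""
    inputs: list[str] = field(default_factory=lambda: ["text"])
    outputs: list[str] = field(default_factory=lambda: ["dense"])
    can_score: bool = False
    can_extract: bool = False


# e.g. a multi-output adapter such as BGEM3Adapter would declare text input
# with dense, sparse, and multivector outputs.
bge_m3_caps = AdapterCapabilities(outputs=["dense", "sparse", "multivector"])
```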
### Dimensions

Adapters report output dimensions for validation and client usage:

| Field | Description |
|---|---|
| dense | Dense vector dimensionality (e.g., 1024) |
| sparse | Vocabulary size for sparse vectors |
| multivector | Per-token embedding dimension |
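
Continuing the same sketch, a dimensions record might look like this (field names follow the table; the concrete numbers are illustrative):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdapterDimensions:
    """Output shape metadata reported by an adapter."""
    dense: int | None = None        # dense vector dimensionality
    sparse: int | None = None       # vocabulary size for sparse vectors
    multivector: int | None = None  # per-token embedding dimension


# Illustrative values for a model that emits all three output types.
dims = AdapterDimensions(dense=1024, sparse=250_002, multivector=1024)
```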
## Compute Engines

Adapters use different compute backends depending on model architecture:
### Flash Attention 2

Flash Attention with variable-length sequences eliminates padding waste: the adapter uses flash_attn_varlen_func to pack sequences and process them without any padding tokens.
Benefits:
- Higher throughput (no wasted compute on padding)
- Lower memory usage (no padded tensors)
- 20-40% speedup on typical workloads
Used by: BertFlashAdapter, Qwen2FlashAdapter, SPLADEFlashAdapter, ColBERTAdapter
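
A minimal sketch of the varlen call itself, outside of any SIE adapter, assuming a CUDA GPU and the flash-attn package: two sequences of different lengths are packed back to back, with cu_seqlens marking the boundaries instead of padding.

```python
import torch
from flash_attn import flash_attn_varlen_func

# Two sequences (lengths 5 and 9) packed into one tensor with no padding tokens.
seq_lens = [5, 9]
total_tokens, n_heads, head_dim = sum(seq_lens), 12, 64

q = torch.randn(total_tokens, n_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Cumulative boundaries tell the kernel where each sequence starts and ends.
cu_seqlens = torch.tensor([0, 5, 14], dtype=torch.int32, device="cuda")
max_seqlen = max(seq_lens)

out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
)
# out: (total_tokens, n_heads, head_dim) -- attention computed per sequence, no padding anywhere.
```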
### SGLang

SGLang provides memory-efficient inference for large LLM embedding models (4B+ parameters). It pre-allocates the KV cache to prevent OOM under concurrent load.
Benefits:
- Stable memory usage with concurrent requests
- Handles 4B-8B parameter models reliably
- LoRA adapter support via HTTP API
Used by: SGLangEmbeddingAdapter for Qwen3-Embedding-4B, GTE-Qwen2-7B, etc.
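
As a rough illustration of the HTTP path (the host, port, and model name below are assumptions), a client can request embeddings from a running SGLang server through its OpenAI-compatible /v1/embeddings endpoint:

```python
import requests

# Assumes an SGLang server is already serving an embedding model on localhost:30000.
resp = requests.post(
    "http://localhost:30000/v1/embeddings",
    json={
        "model": "Qwen/Qwen3-Embedding-4B",
        "input": ["What is the capital of France?"],
    },
    timeout=30,
)
resp.raise_for_status()
embedding = resp.json()["data"][0]["embedding"]
print(len(embedding))  # dimensionality of the dense vector
```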
### PyTorch with SDPA

Standard PyTorch with Scaled Dot-Product Attention. It runs models through native transformers libraries such as sentence-transformers.
Benefits:
- Broadest compatibility
- Works on CPU, CUDA, and MPS
- Simple debugging
Used by: SentenceTransformerDenseAdapter, CrossEncoderAdapter, CLIPAdapter
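
This is roughly what the SDPA path wraps; a plain sentence-transformers call works the same way on CPU, CUDA, or MPS (the device choice below is illustrative):

```python
from sentence_transformers import SentenceTransformer

# Standard sentence-transformers usage; the adapter adds lifecycle management on top.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cpu")
vectors = model.encode(["hello world", "model adapters"], normalize_embeddings=True)
print(vectors.shape)  # (2, 384) for this model
```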
## Adapter Catalog

### Dense Embedding Adapters

| Adapter | Compute | Models |
|---|---|---|
| SentenceTransformerDenseAdapter | SDPA | all-MiniLM, BGE-base, GTE-multilingual |
| BertFlashAdapter | Flash | E5-v2 series, BERT-based models |
| Qwen2FlashAdapter | Flash | stella_en_1.5B_v5, GTE-Qwen2 series |
| SGLangEmbeddingAdapter | SGLang | Qwen3-Embedding-4B/8B, E5-Mistral-7B |
| BGEM3Adapter | SDPA | BAAI/bge-m3 (dense, sparse, multivector) |
| NoMicFlashAdapter | Flash | nomic-embed-text-v2-moe |
| XLMRobertaFlashAdapter | Flash | multilingual-e5-large, XLM-R models |
| RoPEFlashAdapter | Flash | Models with rotary position embeddings |
### Sparse Embedding Adapters

| Adapter | Compute | Models |
|---|---|---|
| SentenceTransformerSparseAdapter | SDPA | sentence-transformers SparseEncoder models |
| SPLADEFlashAdapter | Flash | SPLADE-v3, OpenSearch Neural Sparse |
| BGEM3Adapter | SDPA | BAAI/bge-m3 sparse output |
### Multi-Vector Adapters (ColBERT)

| Adapter | Compute | Models |
|---|---|---|
| ColBERTAdapter | Flash | jina-colbert-v2, colbertv2.0, answerai-colbert-small |
| ColBERTModernBERTFlashAdapter | Flash | GTE-ModernColBERT-v1, Reason-ModernColBERT |
| ColBERTRotaryFlashAdapter | Flash | ColBERT models with RoPE |
### Reranker Adapters

| Adapter | Compute | Models |
|---|---|---|
| CrossEncoderAdapter | SDPA | BGE-reranker, Jina-reranker, MS-MARCO |
| BertFlashCrossEncoderAdapter | Flash | BERT-based rerankers |
| JinaFlashCrossEncoderAdapter | Flash | jina-reranker-v2-base-multilingual |
| ModernBERTFlashCrossEncoderAdapter | Flash | gte-reranker-modernbert-base |
| Qwen2FlashCrossEncoderAdapter | Flash | Qwen2-based rerankers |
### Vision Adapters

| Adapter | Modality | Models |
|---|---|---|
| CLIPAdapter | Text + Image | openai/clip-vit-base-patch32, LAION CLIP |
| SigLIPAdapter | Text + Image | google/siglip-so400m-patch14 |
| ColPaliAdapter | Image | vidore/colpali-v1.3-hf |
| ColQwen2Adapter | Image | vidore/colqwen2.5-v0.2 |
| NemoColEmbedAdapter | Image | nvidia/llama-nemoretriever-colembed-3b |
### Extraction Adapters

| Adapter | Task | Models |
|---|---|---|
| GLiNERAdapter | Zero-shot NER | gliner_multi-v2.1, NuNER_Zero |
| GLiRELAdapter | Relation extraction | glirel-large-v0 |
| GLiClassAdapter | Classification | gliclass-base-v1.0 |
| Florence2Adapter | Document understanding | Florence-2-base, Florence-2-large |
| DonutAdapter | Document parsing | donut-base-finetuned-docvqa |
| GroundingDinoAdapter | Object detection | grounding-dino-tiny, grounding-dino-base |
| OwlV2Adapter | Zero-shot detection | owlv2-base-patch16-ensemble |
| NLIClassificationAdapter | Zero-shot classification | deberta-v3-large-zeroshot-v2.0 |
## Memory Management

Adapters must fully release GPU memory in unload() so that LRU eviction is safe.

```python
def unload(self) -> None:
    """Release model weights and return GPU memory to the pool."""
    device = self._device

    if self._model is not None:
        del self._model
        self._model = None
    self._device = None

    # Release GPU memory
    import gc
    gc.collect()
    if device and device.startswith("cuda"):
        torch.cuda.empty_cache()
    elif device == "mps":
        torch.mps.empty_cache()
```

The registry tracks memory usage via memory_footprint() for LRU eviction.
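
A memory_footprint() implementation could, for example, sum parameter and buffer sizes; the sketch below is an assumption about how such a method might be written, not the SIE source:

```python
def memory_footprint(self) -> int:
    """Approximate bytes held by the loaded model (parameters + buffers)."""
    if self._model is None:
        return 0
    params = sum(p.numel() * p.element_size() for p in self._model.parameters())
    buffers = sum(b.numel() * b.element_size() for b in self._model.buffers())
    return params + buffers
```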
## LoRA Support

Some adapters support dynamic LoRA adapter loading:

```python
def supports_lora(self) -> bool:
    """Return True if this adapter supports LoRA."""
    ...

def load_lora(self, lora_path: str) -> int:
    """Load a LoRA adapter, return memory usage."""
    ...

def set_active_lora(self, lora_name: str | None) -> None:
    """Switch active LoRA before inference."""
    ...
```

SGLang adapters use the HTTP API for LoRA switching. PEFT-based adapters use the PEFTLoRAMixin for in-process loading.
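
For the PEFT path, in-process loading and switching could look roughly like the following (this uses the public peft API on an already-loaded base model; PEFTLoRAMixin's actual internals are not shown in this document):

```python
from peft import PeftModel
from transformers import AutoModel

# Load the base model, then wrap it with a first LoRA adapter.
# The LoRA paths below are placeholders, not real repositories.
base_model = AutoModel.from_pretrained("BAAI/bge-base-en-v1.5")
model = PeftModel.from_pretrained(base_model, "path/to/lora-a", adapter_name="lora-a")

# Register a second adapter, then pick which one is active before inference.
model.load_adapter("path/to/lora-b", adapter_name="lora-b")
model.set_adapter("lora-a")
```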
## Writing Custom Adapters

To add support for a new model architecture, see Adding Models.
The typical workflow:
- Identify the model architecture (BERT, Qwen2, custom)
- Choose a compute backend (SDPA, Flash, SGLang)
- Implement the adapter protocol (a minimal skeleton is sketched below)
- Create a model config in `packages/sie_server/models/`
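
For the "implement the adapter protocol" step, a minimal skeleton might look like the following, assuming a sentence-transformers-backed dense model; the class and method names follow the protocol described earlier, but none of this is SIE source code:

```python
import torch
from sentence_transformers import SentenceTransformer


class MyDenseAdapter:
    """Illustrative dense-embedding adapter skeleton."""

    def __init__(self, model_id: str) -> None:
        self._model_id = model_id
        self._model: SentenceTransformer | None = None
        self._device: str | None = None

    def load(self, device: str) -> None:
        self._model = SentenceTransformer(self._model_id, device=device)
        self._device = device

    def encode(self, texts: list[str]) -> torch.Tensor:
        assert self._model is not None, "call load() before encode()"
        return self._model.encode(texts, convert_to_tensor=True)

    def unload(self) -> None:
        device = self._device
        self._model = None
        self._device = None
        if device and device.startswith("cuda"):
            torch.cuda.empty_cache()
```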
## What’s Next

- Adding Models - configure new models
- Model Catalog - supported encode models