# LoRA Adapters
LoRA (Low-Rank Adaptation) lets you customize embedding models for specific domains. Instead of fine-tuning all model weights, LoRA trains small adapter layers. This reduces training cost and enables swapping adapters at inference time.
## What is LoRA
LoRA freezes the base model and injects trainable low-rank matrices into attention layers. A typical LoRA adapter is 1-5% of the base model size. Multiple LoRA adapters can share the same base model, switching between domains without reloading weights.
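Conceptually, a frozen weight matrix `W` is augmented with a scaled low-rank product, `W_eff = W + (alpha / r) * B @ A`, where only `A` and `B` are trained. A minimal NumPy sketch with illustrative (hypothetical) dimensions:

```python
import numpy as np

d, k, r = 1024, 1024, 16           # hypothetical hidden sizes and LoRA rank
alpha = 32                         # LoRA scaling hyperparameter

W = np.random.randn(d, k)          # frozen base weight (never updated)
A = np.random.randn(r, k) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))               # trainable low-rank factor, zero-initialized

# Effective weight = base + scaled low-rank update.
# Trainable params: d*r + r*k = 32,768 vs. d*k = 1,048,576 (~3% of the layer).
W_eff = W + (alpha / r) * (B @ A)
```

Zero-initializing `B` makes the adapter a no-op at the start of training, which is the standard LoRA initialization.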
Benefits:
- Train domain-specific embeddings with minimal data
- Share base model across multiple adapters
- Hot-swap adapters per request
- Reduce GPU memory vs separate fine-tuned models
## Quick Example
```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# Use a LoRA adapter for domain-specific embeddings
result = client.encode(
    "BAAI/bge-m3",
    Item(text="breach of fiduciary duty"),
    options={"lora": "org/bge-m3-legal-lora"},
)
```

## PEFT LoRA (Dynamic Loading)
Most SIE adapters use PEFT (Parameter-Efficient Fine-Tuning) for LoRA support. PEFT provides dynamic loading and hot reload capabilities.
How it works:
- First request with a LoRA triggers async loading
- PEFT wraps the base model with adapter layers
- Subsequent requests use the loaded adapter instantly
- Multiple LoRAs can be loaded simultaneously
```python
# First request: triggers LoRA load (may take a few seconds)
result = client.encode("BAAI/bge-m3", Item(text="legal query"), options={"lora": "org/legal-lora"})

# Subsequent requests: instant (adapter already loaded)
result = client.encode("BAAI/bge-m3", Item(text="another query"), options={"lora": "org/legal-lora"})

# Switch to a different LoRA
result = client.encode("BAAI/bge-m3", Item(text="medical query"), options={"lora": "org/medical-lora"})
```

PEFT adapters support hot reload. Loading a new LoRA does not block ongoing inference requests.
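Because the first request for each adapter pays the load cost, you can warm adapters before routing production traffic to them. A minimal sketch using only the `encode` call shown above (the adapter names are placeholders):

```python
# Pre-warm adapters so later requests hit already-loaded weights.
for lora in ["org/legal-lora", "org/medical-lora"]:
    client.encode("BAAI/bge-m3", Item(text="warmup"), options={"lora": lora})
```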
## SGLang LoRA (Pre-loaded)
For LLM-based embedding models (4B+ parameters), SIE uses SGLang. SGLang requires LoRA adapters to be pre-loaded at server startup.
Configure in model config YAML:
```yaml
name: Qwen/Qwen3-Embedding-8B
adapter: sglang
adapter_options_loadtime:
  lora_paths:
    legal: org/qwen3-legal-lora
    medical: /path/to/medical-adapter
  max_loras_per_batch: 8
```

Use at request time:
```python
# Select pre-loaded LoRA by name
result = client.encode(
    "Qwen/Qwen3-Embedding-8B",
    Item(text="legal document"),
    options={"lora": "legal"},
)
```

SGLang handles mixed-LoRA batching internally via S-LoRA. Requests with different LoRAs can batch together.
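Because mixed-LoRA requests can share a batch, concurrent callers do not need to be grouped by adapter. A sketch of issuing mixed-adapter requests in parallel, assuming `SIEClient` can be shared across threads:

```python
from concurrent.futures import ThreadPoolExecutor

# Requests targeting different pre-loaded LoRAs; SGLang can batch them together.
jobs = [("legal contract clause", "legal"), ("clinical trial summary", "medical")]

with ThreadPoolExecutor() as pool:
    futures = [
        pool.submit(
            client.encode,
            "Qwen/Qwen3-Embedding-8B",
            Item(text=text),
            options={"lora": lora},
        )
        for text, lora in jobs
    ]
    results = [f.result() for f in futures]
```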
## Configuring LoRA
### Via Request Options
Pass `lora` in the `options` parameter:
```python
result = client.encode(
    "BAAI/bge-m3",
    Item(text="query"),
    options={"lora": "org/my-lora-adapter"},
)
```

### Via Profiles
Define LoRA adapters as profiles in your model config. This simplifies client code and enables named presets.
```yaml
name: BAAI/bge-m3
profiles:
  legal:
    instruction: "Given a legal query, retrieve relevant case law"
    lora: org/bge-m3-legal-lora
  medical:
    instruction: "Retrieve medical research for this query"
    lora: org/bge-m3-medical-lora
```

Use the profile by name:
```python
result = client.encode(
    "BAAI/bge-m3",
    Item(text="breach of contract"),
    profile="legal",
)
```

### HTTP API
```bash
curl -X POST http://localhost:8080/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{"items": [{"text": "legal query"}], "params": {"options": {"lora": "org/legal-lora"}}}'
```

## LoRA Eviction
SIE limits the number of loaded LoRA adapters per model to manage GPU memory. When this limit is reached, the least recently used (LRU) adapter is evicted.
Configuration:
```yaml
max_loras_per_model: 10  # Default: 10 adapters per model
```

Or via environment variable:
```bash
SIE_MAX_LORAS_PER_MODEL=20
```

Eviction behavior:
- New LoRA request triggers eviction if limit reached
- Oldest unused adapter is unloaded first
- Evicted adapters reload automatically on next request
- Base model remains loaded (only adapter weights evicted)
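As a toy illustration of the policy described above (not SIE's actual implementation), the adapter cache behaves like an LRU map keyed by adapter name:

```python
from collections import OrderedDict

class LoraCache:
    """Toy LRU illustration of adapter eviction -- not SIE internals."""

    def __init__(self, max_loras=10):
        self.max_loras = max_loras
        self.loaded = OrderedDict()            # adapter name -> weights

    def get(self, name):
        if name in self.loaded:
            self.loaded.move_to_end(name)      # mark as most recently used
            return self.loaded[name]
        if len(self.loaded) >= self.max_loras:
            self.loaded.popitem(last=False)    # evict least recently used
        self.loaded[name] = f"weights for {name}"  # stand-in for a real load
        return self.loaded[name]               # reload happens transparently
```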
Each LoRA adds approximately 1-5% of base model memory. Monitor GPU memory if loading many adapters.
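As a rough sizing exercise (all numbers below are illustrative assumptions, not measurements):

```python
# Back-of-envelope adapter memory, assuming fp16 (2 bytes/param)
# and a 2% adapter from the 1-5% rule of thumb above.
base_params = 8e9        # e.g., an 8B-parameter embedding model
adapter_fraction = 0.02  # assumed adapter size relative to the base model
bytes_per_param = 2      # fp16

adapter_mib = base_params * adapter_fraction * bytes_per_param / 2**20
print(f"~{adapter_mib:.0f} MiB per adapter")  # ~305 MiB with these numbers
```

At the default limit of 10 adapters per model, that is roughly 3 GiB of additional GPU memory under these assumptions.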
## Supported Adapters
| Adapter Type | LoRA Support | Hot Reload | Notes |
|---|---|---|---|
| PEFT-based (sentence-transformers, BGE-M3, etc.) | Yes | Yes | Dynamic loading |
| SGLang (LLM embeddings) | Yes | No | Pre-loaded at startup |
| ColBERT | No | - | Not yet supported |
| CLIP/SigLIP | No | - | Not yet supported |
## What’s Next
- Model Catalog - see which models support LoRA
- Profiles - bundle LoRA with other options