LoRA Adapters

LoRA (Low-Rank Adaptation) lets you customize embedding models for specific domains. Instead of fine-tuning all model weights, LoRA trains small adapter layers. This reduces training cost and enables swapping adapters at inference time.

LoRA freezes the base model and injects trainable low-rank matrices into attention layers. A typical LoRA adapter is 1-5% of the base model size. Multiple LoRA adapters can share the same base model, switching between domains without reloading weights.
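To make this concrete, here is a minimal PyTorch-style sketch of the idea (illustrative only, not SIE's internals): the base projection stays frozen while two small matrices learn a low-rank update.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper around a frozen linear layer."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the base weights
        # Low-rank factors: A projects down to rank r, B projects back up
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base projection plus the scaled low-rank update
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

The two factors hold r * (in_features + out_features) parameters instead of in_features * out_features, which is where the 1-5% size figure comes from.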

Benefits:

  • Train domain-specific embeddings with minimal data
  • Share base model across multiple adapters
  • Hot-swap adapters per request
  • Reduce GPU memory vs separate fine-tuned models
Example:

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# Use a LoRA adapter for domain-specific embeddings
result = client.encode(
    "BAAI/bge-m3",
    Item(text="breach of fiduciary duty"),
    options={"lora": "org/bge-m3-legal-lora"},
)

Most SIE adapters use PEFT (Parameter-Efficient Fine-Tuning) for LoRA support. PEFT provides dynamic loading and hot reload capabilities.

How it works:

  1. First request with a LoRA triggers async loading
  2. PEFT wraps the base model with adapter layers
  3. Subsequent requests use the loaded adapter instantly
  4. Multiple LoRAs can be loaded simultaneously

# First request: triggers the LoRA load (may take a few seconds)
result = client.encode("BAAI/bge-m3", Item(text="legal query"),
                       options={"lora": "org/legal-lora"})

# Subsequent requests: instant (adapter already loaded)
result = client.encode("BAAI/bge-m3", Item(text="another query"),
                       options={"lora": "org/legal-lora"})

# Switch to a different LoRA
result = client.encode("BAAI/bge-m3", Item(text="medical query"),
                       options={"lora": "org/medical-lora"})

PEFT adapters support hot reload. Loading a new LoRA does not block ongoing inference requests.
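A minimal sketch of what that means in practice, assuming the client can be shared across threads and reusing the adapter names from the example above:

from concurrent.futures import ThreadPoolExecutor

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

def encode(text: str, lora: str):
    return client.encode("BAAI/bge-m3", Item(text=text), options={"lora": lora})

# org/legal-lora is already loaded; org/medical-lora is not
with ThreadPoolExecutor(max_workers=2) as pool:
    # Triggers an async load of the new adapter...
    medical = pool.submit(encode, "medical query", "org/medical-lora")
    # ...while requests against the already-loaded adapter keep serving
    legal = pool.submit(encode, "legal query", "org/legal-lora")
    legal.result()    # returns without waiting for the medical load
    medical.result()  # completes once the new adapter is loaded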

For LLM-based embedding models (4B+ parameters), SIE uses SGLang. SGLang requires LoRA adapters to be pre-loaded at server startup.

Configure in model config YAML:

name: Qwen/Qwen3-Embedding-8B
adapter: sglang
adapter_options_loadtime:
  lora_paths:
    legal: org/qwen3-legal-lora
    medical: /path/to/medical-adapter
  max_loras_per_batch: 8

Use at request time:

# Select a pre-loaded LoRA by name
result = client.encode(
    "Qwen/Qwen3-Embedding-8B",
    Item(text="legal document"),
    options={"lora": "legal"},
)

SGLang handles mixed-LoRA batching internally via S-LoRA. Requests with different LoRAs can batch together.
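For example, requests issued concurrently with different named adapters can land in the same batch. A sketch, again assuming a thread-shareable client and the adapter names configured above:

from concurrent.futures import ThreadPoolExecutor

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

queries = [
    ("contract dispute", "legal"),
    ("drug interactions", "medical"),
    ("case law search", "legal"),
]

# Concurrent requests with different LoRAs; SGLang may batch them together
with ThreadPoolExecutor(max_workers=len(queries)) as pool:
    futures = [
        pool.submit(client.encode, "Qwen/Qwen3-Embedding-8B",
                    Item(text=text), options={"lora": name})
        for text, name in queries
    ]
    results = [f.result() for f in futures]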

To select an adapter on any request, pass lora in the options parameter:

result = client.encode(
    "BAAI/bge-m3",
    Item(text="query"),
    options={"lora": "org/my-lora-adapter"},
)

Define LoRA adapters as profiles in your model config. This simplifies client code and enables named presets.

name: BAAI/bge-m3
profiles:
  legal:
    instruction: "Given a legal query, retrieve relevant case law"
    lora: org/bge-m3-legal-lora
  medical:
    instruction: "Retrieve medical research for this query"
    lora: org/bge-m3-medical-lora

Use the profile by name:

result = client.encode(
    "BAAI/bge-m3",
    Item(text="breach of contract"),
    profile="legal",
)

The same option works over the HTTP API:

curl -X POST http://localhost:8080/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{"items": [{"text": "legal query"}], "params": {"options": {"lora": "org/legal-lora"}}}'

SIE limits the number of loaded LoRA adapters per model to manage GPU memory. When this limit is reached, the least recently used (LRU) adapter is evicted.

Configuration:

# engine.yaml
max_loras_per_model: 10  # Default: 10 adapters per model

Or via environment variable:

SIE_MAX_LORAS_PER_MODEL=20

Eviction behavior:

  • New LoRA request triggers eviction if limit reached
  • Oldest unused adapter is unloaded first
  • Evicted adapters reload automatically on next request
  • Base model remains loaded (only adapter weights evicted)
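
Conceptually, the loaded adapters behave like a small LRU cache keyed by adapter name. A minimal illustrative sketch of the policy (not SIE's actual implementation):

from collections import OrderedDict

class LoraCache:
    """Illustrative LRU policy for loaded adapters (not SIE internals)."""
    def __init__(self, max_loras: int = 10):
        self.max_loras = max_loras
        self.loaded: OrderedDict[str, object] = OrderedDict()

    def get(self, name: str):
        if name in self.loaded:
            self.loaded.move_to_end(name)  # mark as most recently used
            return self.loaded[name]
        if len(self.loaded) >= self.max_loras:
            # Evict the least recently used adapter; it reloads
            # automatically the next time it is requested
            self.loaded.popitem(last=False)
        self.loaded[name] = self._load(name)  # base model stays resident
        return self.loaded[name]

    def _load(self, name: str):
        return f"adapter-weights:{name}"  # placeholder for a real load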

Each LoRA adds approximately 1-5% of the base model's memory footprint, so ten resident adapters at 3% each add roughly 30% on top of the base model. Monitor GPU memory if you load many adapters.

Adapter Type                                      | LoRA Support | Hot Reload | Notes
PEFT-based (sentence-transformers, BGE-M3, etc.)  | Yes          | Yes        | Dynamic loading
SGLang (LLM embeddings)                           | Yes          | No         | Pre-loaded at startup
ColBERT                                           | No           | -          | Not yet supported
CLIP/SigLIP                                       | No           | -          | Not yet supported