LoRA Adapters

LoRA (Low-Rank Adaptation) lets you customize embedding models for specific domains. Instead of fine-tuning all model weights, LoRA trains small adapter layers. This reduces training cost and enables swapping adapters at inference time.

LoRA freezes the base model and injects trainable low-rank matrices into attention layers. A typical LoRA adapter is 1-5% of the base model size. Multiple LoRA adapters can share the same base model, switching between domains without reloading weights.
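To make this concrete, here is a minimal PyTorch-style sketch of the idea (illustrative only, not SIE's internals): the base projection stays frozen while two small matrices learn a low-rank update.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper around a frozen linear layer."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the base weights
        # Low-rank factors: A projects down to rank r, B projects back up
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base projection plus the scaled low-rank update
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

The two factors hold r * (in_features + out_features) parameters instead of in_features * out_features, which is where the 1-5% size figure comes from.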

Benefits:

  • Train domain-specific embeddings with minimal data
  • Share base model across multiple adapters
  • Hot-swap adapters per request
  • Reduce GPU memory vs separate fine-tuned models
Example:

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# Use a LoRA adapter for domain-specific embeddings
result = client.encode(
    "BAAI/bge-m3",
    Item(text="breach of fiduciary duty"),
    options={"lora": "org/bge-m3-legal-lora"},
)

Most SIE adapters use PEFT (Parameter-Efficient Fine-Tuning) for LoRA support. PEFT provides dynamic loading and hot reload capabilities.

How it works:

  1. First request with a LoRA triggers async loading
  2. PEFT wraps the base model with adapter layers
  3. Subsequent requests use the loaded adapter instantly
  4. Multiple LoRAs can be loaded simultaneously

# First request: triggers the LoRA load (may take a few seconds)
result = client.encode("BAAI/bge-m3", Item(text="legal query"),
                       options={"lora": "org/legal-lora"})

# Subsequent requests: instant (adapter already loaded)
result = client.encode("BAAI/bge-m3", Item(text="another query"),
                       options={"lora": "org/legal-lora"})

# Switch to a different LoRA
result = client.encode("BAAI/bge-m3", Item(text="medical query"),
                       options={"lora": "org/medical-lora"})

PEFT adapters support hot reload. Loading a new LoRA does not block ongoing inference requests.
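A minimal sketch of what that means in practice, assuming the client can be shared across threads and reusing the adapter names from the example above:

from concurrent.futures import ThreadPoolExecutor

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

def encode(text: str, lora: str):
    return client.encode("BAAI/bge-m3", Item(text=text), options={"lora": lora})

# org/legal-lora is already loaded; org/medical-lora is not
with ThreadPoolExecutor(max_workers=2) as pool:
    # Triggers an async load of the new adapter...
    medical = pool.submit(encode, "medical query", "org/medical-lora")
    # ...while requests against the already-loaded adapter keep serving
    legal = pool.submit(encode, "legal query", "org/legal-lora")
    legal.result()    # returns without waiting for the medical load
    medical.result()  # completes once the new adapter is loaded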

For LLM-based embedding models (4B+ parameters), SIE uses SGLang. SGLang requires LoRA adapters to be pre-loaded at server startup.

Configure in model config YAML:

name: Qwen/Qwen3-Embedding-8B
adapter: sglang
adapter_options_loadtime:
  lora_paths:
    legal: org/qwen3-legal-lora
    medical: /path/to/medical-adapter
  max_loras_per_batch: 8

Use at request time:

# Select a pre-loaded LoRA by name
result = client.encode(
    "Qwen/Qwen3-Embedding-8B",
    Item(text="legal document"),
    options={"lora": "legal"},
)

SGLang handles mixed-LoRA batching internally via S-LoRA. Requests with different LoRAs can batch together.
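For example, requests issued concurrently with different named adapters can land in the same batch. A sketch, again assuming a thread-shareable client and the adapter names configured above:

from concurrent.futures import ThreadPoolExecutor

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

queries = [
    ("contract dispute", "legal"),
    ("drug interactions", "medical"),
    ("case law search", "legal"),
]

# Concurrent requests with different LoRAs; SGLang may batch them together
with ThreadPoolExecutor(max_workers=len(queries)) as pool:
    futures = [
        pool.submit(client.encode, "Qwen/Qwen3-Embedding-8B",
                    Item(text=text), options={"lora": name})
        for text, name in queries
    ]
    results = [f.result() for f in futures]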

To select an adapter on any request, pass lora in the options parameter:

result = client.encode(
    "BAAI/bge-m3",
    Item(text="query"),
    options={"lora": "org/my-lora-adapter"},
)

Define LoRA adapters as profiles in your model config. This simplifies client code and enables named presets.

name: BAAI/bge-m3
profiles:
  legal:
    instruction: "Given a legal query, retrieve relevant case law"
    lora: org/bge-m3-legal-lora
  medical:
    instruction: "Retrieve medical research for this query"
    lora: org/bge-m3-medical-lora

Use the profile by name:

result = client.encode(
    "BAAI/bge-m3",
    Item(text="breach of contract"),
    profile="legal",
)

The same option works over the HTTP API:

curl -X POST http://localhost:8080/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{"items": [{"text": "legal query"}], "params": {"options": {"lora": "org/legal-lora"}}}'

SIE limits the number of loaded LoRA adapters per model to manage GPU memory. When this limit is reached, the least recently used (LRU) adapter is evicted.

Configuration:

# engine.yaml
max_loras_per_model: 10  # Default: 10 adapters per model

Or via environment variable:

SIE_MAX_LORAS_PER_MODEL=20

Eviction behavior:

  • New LoRA request triggers eviction if limit reached
  • Oldest unused adapter is unloaded first
  • Evicted adapters reload automatically on next request
  • Base model remains loaded (only adapter weights evicted)
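
Conceptually, the loaded adapters behave like a small LRU cache keyed by adapter name. A minimal illustrative sketch of the policy (not SIE's actual implementation):

from collections import OrderedDict

class LoraCache:
    """Illustrative LRU policy for loaded adapters (not SIE internals)."""
    def __init__(self, max_loras: int = 10):
        self.max_loras = max_loras
        self.loaded: OrderedDict[str, object] = OrderedDict()

    def get(self, name: str):
        if name in self.loaded:
            self.loaded.move_to_end(name)  # mark as most recently used
            return self.loaded[name]
        if len(self.loaded) >= self.max_loras:
            # Evict the least recently used adapter; it reloads
            # automatically the next time it is requested
            self.loaded.popitem(last=False)
        self.loaded[name] = self._load(name)  # base model stays resident
        return self.loaded[name]

    def _load(self, name: str):
        return f"adapter-weights:{name}"  # placeholder for a real load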

Each LoRA adds approximately 1-5% of the base model's memory footprint, so ten resident adapters at 3% each add roughly 30% on top of the base model. Monitor GPU memory if you load many adapters.

Adapter Type                                      | LoRA Support | Hot Reload | Notes
PEFT-based (sentence-transformers, BGE-M3, etc.)  | Yes          | Yes        | Dynamic loading
SGLang (LLM embeddings)                           | Yes          | No         | Pre-loaded at startup
ColBERT                                           | No           | -          | Not yet supported
CLIP/SigLIP                                       | No           | -          | Not yet supported