
Multi-modal

Multimodal models encode images and text into a shared vector space. Search images with text queries, or find similar images directly. SIE supports CLIP, SigLIP, and document models like ColPali.

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# Encode an image
with open("photo.jpg", "rb") as f:
    image_bytes = f.read()

result = client.encode(
    "openai/clip-vit-base-patch32",
    Item(images=[{"data": image_bytes, "format": "jpeg"}])
)
print(f"Dense vector: {len(result['dense'])} dims")  # 512

The SDK accepts images in multiple formats:

# From bytes with a format hint
result = client.encode(
    "openai/clip-vit-base-patch32",
    Item(images=[{"data": image_bytes, "format": "jpeg"}])
)

# From a file (read bytes)
with open("image.png", "rb") as f:
    result = client.encode(
        "openai/clip-vit-base-patch32",
        Item(images=[{"data": f.read(), "format": "png"}])
    )

# Multiple images (embeddings are averaged)
result = client.encode(
    "openai/clip-vit-base-patch32",
    Item(images=[
        {"data": img1_bytes, "format": "jpeg"},
        {"data": img2_bytes, "format": "jpeg"},
    ])
)

Supported formats: JPEG, PNG, WebP, BMP, GIF (first frame).
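When loading files from disk, a small helper can derive the format hint from the file extension. This is a minimal sketch; format_from_path is a hypothetical helper, not part of the SDK, and the mapping below covers only the formats listed above.

from pathlib import Path

# Hypothetical helper (not part of sie_sdk): map a file extension to a format hint.
_EXT_TO_FORMAT = {
    ".jpg": "jpeg", ".jpeg": "jpeg",
    ".png": "png", ".webp": "webp",
    ".bmp": "bmp", ".gif": "gif",
}

def format_from_path(path: str) -> str:
    ext = Path(path).suffix.lower()
    if ext not in _EXT_TO_FORMAT:
        raise ValueError(f"Unsupported image format: {ext}")
    return _EXT_TO_FORMAT[ext]

result = client.encode(
    "openai/clip-vit-base-patch32",
    Item(images=[{
        "data": open("photo.webp", "rb").read(),
        "format": format_from_path("photo.webp"),
    }])
)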

CLIP and SigLIP encode text and images into the same vector space:

# Index images
image_embeddings = []
for image_path in image_paths:
    with open(image_path, "rb") as f:
        result = client.encode(
            "openai/clip-vit-base-patch32",
            Item(images=[{"data": f.read(), "format": "jpeg"}])
        )
    image_embeddings.append(result["dense"])

# Store in a vector database
for i, embedding in enumerate(image_embeddings):
    vector_db.insert(id=f"img-{i}", vector=embedding)

# Search with a text query
query_result = client.encode(
    "openai/clip-vit-base-patch32",
    Item(text="a cat sitting on a couch")
)

# Find similar images
results = vector_db.search(query_result["dense"], top_k=10)
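If you don't have a vector database handy, you can score the text query against the image embeddings directly. A minimal NumPy sketch; it assumes the dense vectors are plain lists of floats and normalizes them before the dot product, so it works whether or not the server returns unit-length vectors.

import numpy as np

def cosine_similarity(a, b):
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank indexed images against the text query from the example above
scores = [cosine_similarity(query_result["dense"], emb) for emb in image_embeddings]
top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:10]
print([f"img-{i}" for i in top])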

Search for visually similar images:

# Encode the reference image
with open("reference.jpg", "rb") as f:
    ref_result = client.encode(
        "openai/clip-vit-base-patch32",
        Item(images=[{"data": f.read(), "format": "jpeg"}])
    )

# Find similar images in your database
similar = vector_db.search(ref_result["dense"], top_k=10)

SigLIP often outperforms CLIP on image-text matching:

result = client.encode(
    "google/siglip-so400m-patch14-384",
    Item(images=[{"data": image_bytes, "format": "jpeg"}])
)
print(f"Dense vector: {len(result['dense'])} dims")  # 1152

SigLIP is trained with a sigmoid loss (rather than CLIP's contrastive softmax loss), which can improve fine-grained image-text matching.
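Because SigLIP also has a text tower, you can match captions against an image the same way as with CLIP. This sketch assumes the SigLIP model accepts Item(text=...) just like the CLIP example above, and computes cosine similarity client-side:

import numpy as np

image_result = client.encode(
    "google/siglip-so400m-patch14-384",
    Item(images=[{"data": image_bytes, "format": "jpeg"}])
)

captions = ["a cat sitting on a couch", "a dog running on a beach"]
caption_vectors = [
    client.encode("google/siglip-so400m-patch14-384", Item(text=c))["dense"]
    for c in captions
]

# Normalize and score each caption against the image
img = np.asarray(image_result["dense"], dtype=np.float32)
img /= np.linalg.norm(img)
for caption, vec in zip(captions, caption_vectors):
    v = np.asarray(vec, dtype=np.float32)
    v /= np.linalg.norm(v)
    print(f"{caption}: {float(img @ v):.3f}")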

ColPali encodes document page images directly—no OCR needed. The model “sees” layout, tables, and figures:

# Encode a PDF page rendered as an image
result = client.encode(
    "vidore/colpali-v1.3-hf",
    Item(images=[{"data": page_image_bytes, "format": "png"}]),
    output_types=["multivector"]
)

# ColPali returns multi-vector (per-patch) embeddings
print(f"Patches: {result['multivector'].shape[0]}")

ColPali is ColBERT-style: multi-vector output, MaxSim scoring.
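MaxSim scores a query against a page by taking, for each query token vector, the maximum dot product over all page-patch vectors, then summing those maxima. A minimal NumPy sketch; it assumes the query is encoded with Item(text=...) and output_types=["multivector"] (not shown above) and that both multivector outputs are arrays of shape (tokens or patches, 128):

import numpy as np

def maxsim(query_vectors, page_vectors):
    # query_vectors: (num_query_tokens, dim), page_vectors: (num_patches, dim)
    q = np.asarray(query_vectors, dtype=np.float32)
    p = np.asarray(page_vectors, dtype=np.float32)
    sims = q @ p.T                        # (num_query_tokens, num_patches)
    return float(sims.max(axis=1).sum())  # best patch per token, summed

query = client.encode(
    "vidore/colpali-v1.3-hf",
    Item(text="quarterly revenue table"),
    output_types=["multivector"]
)
score = maxsim(query["multivector"], result["multivector"])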

Model                                    Dimensions   Resolution   Notes
openai/clip-vit-base-patch32             512          224          Fast, general
openai/clip-vit-large-patch14            768          224          Higher quality
google/siglip-so400m-patch14-384         1152         384          Best quality
laion/CLIP-ViT-H-14-laion2B-s32B-b79K    1024         224          Large-scale trained
vidore/colpali-v1.3-hf                   128 (multi)  448          Document pages

Images are base64-encoded in HTTP requests. The server defaults to msgpack. For JSON:

curl -X POST http://localhost:8080/v1/encode/openai/clip-vit-base-patch32 \
-H "Content-Type: application/json" \
-H "Accept: application/json" \
-d '{"items": [{"images": [{"data": "'$(base64 -w0 photo.jpg)'", "format": "jpeg"}]}]}'