Multi-modal
Multimodal models encode images and text into a shared vector space. Search images with text queries, or find similar images directly. SIE supports CLIP, SigLIP, and document models like ColPali.
Quick Example
```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# Encode an image
with open("photo.jpg", "rb") as f:
    image_bytes = f.read()

result = client.encode(
    "openai/clip-vit-base-patch32",
    Item(images=[{"data": image_bytes, "format": "jpeg"}])
)

print(f"Dense vector: {len(result['dense'])} dims")  # 512
```

```typescript
import { SIEClient } from "@sie/sdk";
import { readFileSync } from "fs";

const client = new SIEClient("http://localhost:8080");

// Encode an image
const imageBytes = readFileSync("photo.jpg");

const result = await client.encode(
  "openai/clip-vit-base-patch32",
  { images: [imageBytes] }
);

console.log(`Dense vector: ${result.dense?.length} dims`); // 512

await client.close();
```

Image Input Formats
The SDK accepts images in multiple formats:

```python
# From bytes with format hint
result = client.encode(
    "openai/clip-vit-base-patch32",
    Item(images=[{"data": image_bytes, "format": "jpeg"}])
)

# From file (read bytes)
with open("image.png", "rb") as f:
    result = client.encode(
        "openai/clip-vit-base-patch32",
        Item(images=[{"data": f.read(), "format": "png"}])
    )

# Multiple images (averaged)
result = client.encode(
    "openai/clip-vit-base-patch32",
    Item(images=[
        {"data": img1_bytes, "format": "jpeg"},
        {"data": img2_bytes, "format": "jpeg"},
    ])
)
```

```typescript
import { readFileSync } from "fs";

// From file (read bytes)
const imageBytes = readFileSync("photo.jpg");
const result = await client.encode(
  "openai/clip-vit-base-patch32",
  { images: [imageBytes] }
);

// Multiple images (averaged)
const img1 = readFileSync("image1.jpg");
const img2 = readFileSync("image2.jpg");
const multiResult = await client.encode(
  "openai/clip-vit-base-patch32",
  { images: [img1, img2] }
);
```

Supported formats: JPEG, PNG, WebP, BMP, GIF (first frame).
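When an Item contains several images, the encode call returns a single averaged vector. If you prefer to keep per-image vectors and pool them yourself, the sketch below shows the idea (illustrative only; it assumes simple mean pooling, and NumPy is not an SDK dependency):

```python
import numpy as np

# Encode each image separately, then mean-pool the dense vectors client-side
vectors = []
for img_bytes in (img1_bytes, img2_bytes):
    r = client.encode(
        "openai/clip-vit-base-patch32",
        Item(images=[{"data": img_bytes, "format": "jpeg"}])
    )
    vectors.append(r["dense"])

pooled = np.mean(np.array(vectors, dtype=np.float32), axis=0)
```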
Text-to-Image Search
CLIP and SigLIP encode text and images into the same vector space:

```python
# Index images
image_embeddings = []
for image_path in image_paths:
    with open(image_path, "rb") as f:
        result = client.encode(
            "openai/clip-vit-base-patch32",
            Item(images=[{"data": f.read(), "format": "jpeg"}])
        )
    image_embeddings.append(result["dense"])

# Store in vector database
for i, embedding in enumerate(image_embeddings):
    vector_db.insert(id=f"img-{i}", vector=embedding)

# Search with text query
query_result = client.encode(
    "openai/clip-vit-base-patch32",
    Item(text="a cat sitting on a couch")
)

# Find similar images
results = vector_db.search(query_result["dense"], top_k=10)
```

```typescript
import { readFileSync } from "fs";

// Index images
const imageEmbeddings: Float32Array[] = [];
for (const imagePath of imagePaths) {
  const imageBytes = readFileSync(imagePath);
  const result = await client.encode(
    "openai/clip-vit-base-patch32",
    { images: [imageBytes] }
  );
  if (result.dense) {
    imageEmbeddings.push(result.dense);
  }
}

// Store in vector database
for (let i = 0; i < imageEmbeddings.length; i++) {
  await vectorDb.insert({ id: `img-${i}`, vector: imageEmbeddings[i] });
}

// Search with text query
const queryResult = await client.encode(
  "openai/clip-vit-base-patch32",
  { text: "a cat sitting on a couch" }
);

// Find similar images
const results = await vectorDb.search(queryResult.dense!, 10);
```
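The vector_db / vectorDb above stands in for whatever vector store you use. If you just want to sanity-check results without one, cosine similarity over the embeddings is enough. A minimal in-memory sketch (NumPy is not part of the SDK; image_embeddings, image_paths, and query_result come from the Python example above):

```python
import numpy as np

# L2-normalize the image matrix and the query so dot products are cosine similarities
image_matrix = np.array(image_embeddings, dtype=np.float32)
image_matrix /= np.linalg.norm(image_matrix, axis=1, keepdims=True)

query = np.array(query_result["dense"], dtype=np.float32)
query /= np.linalg.norm(query)

# Rank images by similarity to the text query and keep the top 10
scores = image_matrix @ query
top_k = np.argsort(scores)[::-1][:10]
for idx in top_k:
    print(image_paths[idx], float(scores[idx]))
```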
Image-to-Image Search
Search for visually similar images:

```python
# Encode reference image
with open("reference.jpg", "rb") as f:
    ref_result = client.encode(
        "openai/clip-vit-base-patch32",
        Item(images=[{"data": f.read(), "format": "jpeg"}])
    )

# Find similar images in your database
similar = vector_db.search(ref_result["dense"], top_k=10)
```

```typescript
import { readFileSync } from "fs";

// Encode reference image
const refImage = readFileSync("reference.jpg");
const refResult = await client.encode(
  "openai/clip-vit-base-patch32",
  { images: [refImage] }
);

// Find similar images in your database
const similar = await vectorDb.search(refResult.dense!, 10);
```

SigLIP Models
SigLIP often outperforms CLIP on image-text matching:

```python
result = client.encode(
    "google/siglip-so400m-patch14-384",
    Item(images=[{"data": image_bytes, "format": "jpeg"}])
)

print(f"Dense vector: {len(result['dense'])} dims")  # 1152
```

```typescript
const result = await client.encode(
  "google/siglip-so400m-patch14-384",
  { images: [imageBytes] }
);

console.log(`Dense vector: ${result.dense?.length} dims`); // 1152
```

SigLIP uses a sigmoid loss (vs. CLIP's softmax contrastive loss), which can improve fine-grained matching.
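For reference, the sigmoid objective from the SigLIP paper scores each image-text pair in the batch independently instead of normalizing over the whole batch:

$$
\mathcal{L} = -\frac{1}{|B|} \sum_{i=1}^{|B|} \sum_{j=1}^{|B|} \log \sigma\!\big(z_{ij}\,(t\,\mathbf{x}_i \cdot \mathbf{y}_j + b)\big),
\qquad z_{ij} = \begin{cases} 1 & i = j \\ -1 & i \neq j \end{cases}
$$

where $\mathbf{x}_i$ and $\mathbf{y}_j$ are the normalized image and text embeddings and $t$, $b$ are a learned temperature and bias. This is a training-time detail; at encode time you still get one dense vector per input.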
Document Search with ColPali
ColPali encodes document page images directly—no OCR needed. The model “sees” layout, tables, and figures:

```python
# Encode a PDF page as image
result = client.encode(
    "vidore/colpali-v1.3-hf",
    Item(images=[{"data": page_image_bytes, "format": "png"}]),
    output_types=["multivector"]
)

# ColPali returns multi-vector (per-patch) embeddings
print(f"Patches: {result['multivector'].shape[0]}")
```

```typescript
// Encode a PDF page as image
const result = await client.encode(
  "vidore/colpali-v1.3-hf",
  { images: [pageImageBytes] },
  { outputTypes: ["multivector"] }
);

// ColPali returns multi-vector (per-patch) embeddings
console.log(`Patches: ${result.multivector?.length}`);
```

ColPali is ColBERT-style: multi-vector output, MaxSim scoring.
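The snippets above only produce embeddings; ranking uses MaxSim (late interaction): for each query vector, take its best match among a page's patch vectors, then sum over the query vectors. A minimal NumPy sketch, not part of the SDK, where query_mv and page_mvs are hypothetical multivector outputs (the query encoded as text with the same ColPali model, the pages encoded as above):

```python
import numpy as np

def maxsim_score(query_vectors: np.ndarray, page_vectors: np.ndarray) -> float:
    """ColBERT-style MaxSim: best match per query vector, summed."""
    # L2-normalize so dot products are cosine similarities
    q = query_vectors / np.linalg.norm(query_vectors, axis=1, keepdims=True)
    p = page_vectors / np.linalg.norm(page_vectors, axis=1, keepdims=True)
    sims = q @ p.T                        # (query tokens, page patches)
    return float(sims.max(axis=1).sum())

# Score the query against every indexed page and take the best one
scores = [maxsim_score(query_mv, page_mv) for page_mv in page_mvs]
best_page = int(np.argmax(scores))
```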
Vision Models
| Model | Dimensions | Resolution (px) | Notes |
|---|---|---|---|
| openai/clip-vit-base-patch32 | 512 | 224 | Fast, general |
| openai/clip-vit-large-patch14 | 768 | 224 | Higher quality |
| google/siglip-so400m-patch14-384 | 1152 | 384 | Best quality |
| laion/CLIP-ViT-H-14-laion2B-s32B-b79K | 1024 | 224 | Large-scale trained |
| vidore/colpali-v1.3-hf | 128 (multi) | 448 | Document pages |
HTTP API
Images are base64-encoded in HTTP requests. The server defaults to msgpack; for JSON, set the Content-Type and Accept headers explicitly:

```bash
curl -X POST http://localhost:8080/v1/encode/openai/clip-vit-base-patch32 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{"items": [{"images": [{"data": "'$(base64 -w0 photo.jpg)'", "format": "jpeg"}]}]}'
```
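The same request from Python, mirroring the curl payload (illustrative; the endpoint path and JSON field names are taken directly from the example above):

```python
import base64
import requests

# Base64-encode the image bytes for the JSON body
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

resp = requests.post(
    "http://localhost:8080/v1/encode/openai/clip-vit-base-patch32",
    headers={"Content-Type": "application/json", "Accept": "application/json"},
    json={"items": [{"images": [{"data": image_b64, "format": "jpeg"}]}]},
)
print(resp.json())
```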
What’s Next
- Multi-vector embeddings - ColPali uses multivector output
- Dense embeddings - text-only encoding