# HTTP API Reference
This reference documents all HTTP endpoints exposed by the SIE server.
## Endpoint Summary

| Endpoint | Method | Purpose |
|---|---|---|
| `/v1/encode/:model` | POST | Generate embeddings |
| `/v1/score/:model` | POST | Rerank items |
| `/v1/extract/:model` | POST | Extract entities and structured data |
| `/v1/models` | GET | List available models |
| `/v1/models/:model` | GET | Get model details |
| `/v1/embeddings` | POST | OpenAI-compatible embeddings |
| `/healthz` | GET | Liveness probe |
| `/readyz` | GET | Readiness probe |
| `/metrics` | GET | Prometheus metrics |
| `/ws/status` | WebSocket | Real-time worker status |
## Wire Format

SIE defaults to msgpack for efficient binary serialization. This preserves numpy arrays natively and produces ~37% smaller payloads than JSON.

Content negotiation:

- `Content-Type: application/msgpack` for requests
- `Accept: application/msgpack` for responses (default)
- `Accept: application/json` falls back to JSON
When using JSON, arrays are converted to lists.
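As a sketch, here is a msgpack round trip using the `requests` and `msgpack` packages (both assumed installed; depending on how the server packs arrays, numeric fields may come back as plain lists or require the `msgpack-numpy` extension to decode):

```python
import msgpack
import requests

# Encode the request body as msgpack and ask for a msgpack response.
payload = {"items": [{"text": "Hello, world!"}]}
resp = requests.post(
    "http://localhost:8080/v1/encode/BAAI/bge-m3",
    data=msgpack.packb(payload),
    headers={
        "Content-Type": "application/msgpack",
        "Accept": "application/msgpack",
    },
)
resp.raise_for_status()

# Decode the binary response body back into Python objects.
result = msgpack.unpackb(resp.content)
print(result["items"][0]["dense"]["dims"])
```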
## POST /v1/encode/:model

Generate embeddings for input items. Supports dense, sparse, and multi-vector outputs.

### Request Schema
```python
class EncodeRequest(TypedDict, total=False):
    items: list[Item]            # Required: items to encode
    params: EncodeParams         # Optional: encoding parameters

class EncodeParams(TypedDict, total=False):
    output_types: list[str]      # 'dense', 'sparse', 'multivector'
    instruction: str             # Task instruction for query encoding
    output_dtype: str            # 'float32', 'float16', 'int8', 'binary'
    options: dict[str, Any]      # Profile, LoRA, runtime options

class Item(TypedDict, total=False):
    id: str                      # Client-provided ID (echoed back)
    text: str                    # Text content
    images: list[ImageInput]     # Image bytes with format hint

class ImageInput(TypedDict, total=False):
    data: bytes                  # Image bytes
    format: str                  # 'jpeg', 'png', 'webp'
```

### Response Schema
```python
class EncodeResponse(TypedDict, total=False):
    model: str                   # Model name used
    items: list[EncodeResult]    # One result per input item
    timing: TimingInfo           # Server-side timing breakdown

class EncodeResult(TypedDict, total=False):
    id: str                      # Echoed item ID
    dense: DenseVector           # Dense embedding
    sparse: SparseVector         # Sparse embedding
    multivector: MultiVector     # Per-token embeddings

class DenseVector(TypedDict, total=False):
    dims: int                    # Vector dimensionality
    dtype: str                   # 'float32', 'float16', 'int8', 'binary'
    values: list[float]          # Vector values

class SparseVector(TypedDict, total=False):
    dims: int                    # Vocabulary size
    dtype: str                   # Data type
    indices: list[int]           # Non-zero dimension indices
    values: list[float]          # Values at those indices

class MultiVector(TypedDict, total=False):
    token_dims: int              # Per-token embedding dimension
    num_tokens: int              # Number of tokens
    dtype: str                   # Data type
    values: list[list[float]]    # Token embeddings
```

### Request Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `items` | `list[Item]` | Required | Items to encode |
| `params.output_types` | `list[str]` | `["dense"]` | Output types to return |
| `params.instruction` | `str` | `None` | Instruction prefix for query encoding |
| `params.output_dtype` | `str` | `"float32"` | Output precision |
| `params.options` | `dict` | `None` | Runtime options (profile, lora, etc.) |
### Examples

Basic encoding:

```bash
curl -X POST http://localhost:8080/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "items": [{"text": "Hello, world!"}]
  }'
```

Multiple output types:
```bash
curl -X POST http://localhost:8080/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "items": [{"text": "Search query"}],
    "params": {
      "output_types": ["dense", "sparse"],
      "instruction": "Represent this query for retrieval:"
    }
  }'
```

Response:
{ "model": "BAAI/bge-m3", "items": [ { "dense": { "dims": 1024, "dtype": "float32", "values": [0.0234, -0.0891, 0.1234, ...] }, "sparse": { "dims": 250002, "dtype": "float32", "indices": [101, 2023, 5789, ...], "values": [0.45, 0.32, 0.28, ...] } } ]}POST /v1/score/:model
## POST /v1/score/:model

Rerank items against a query using a cross-encoder model.

### Request Schema

```python
class ScoreRequest(TypedDict, total=False):
    query: Item                  # Required: query to score against
    items: list[Item]            # Required: items to score
    instruction: str             # Optional instruction
    options: dict[str, Any]      # Runtime options
```

### Response Schema
```python
class ScoreResponse(TypedDict, total=False):
    model: str
    query_id: str | None         # Echoed query ID
    scores: list[ScoreEntry]     # Sorted by score descending

class ScoreEntry(TypedDict):
    item_id: str | None          # Echoed item ID
    score: float                 # Relevance score
    rank: int                    # Position (0 = most relevant)
```

### Example
```bash
curl -X POST http://localhost:8080/v1/score/BAAI/bge-reranker-v2-m3 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "query": {"text": "What is machine learning?"},
    "items": [
      {"id": "doc-1", "text": "ML uses algorithms to learn from data."},
      {"id": "doc-2", "text": "The weather is sunny today."}
    ]
  }'
```

Response:
{ "model": "BAAI/bge-reranker-v2-m3", "scores": [ {"item_id": "doc-1", "score": 0.891, "rank": 0}, {"item_id": "doc-2", "score": 0.023, "rank": 1} ]}POST /v1/extract/:model
## POST /v1/extract/:model

Extract structured data from items: entities, relations, classifications, or vision outputs.

### Request Schema

```python
class ExtractRequest(TypedDict, total=False):
    items: list[Item]            # Required: items to extract from
    params: ExtractParams        # Optional: extraction parameters

class ExtractParams(TypedDict, total=False):
    labels: list[str]            # Entity types for NER
    output_schema: dict          # JSON schema for structured extraction
    instruction: str             # Task instruction
    options: dict[str, Any]      # Runtime options
```

### Response Schema
```python
class ExtractResponse(TypedDict, total=False):
    model: str
    items: list[ExtractResult]

class ExtractResult(TypedDict, total=False):
    id: str
    entities: list[Entity]                  # NER results
    relations: list[Relation]               # Relation extraction
    classifications: list[Classification]
    objects: list[DetectedObject]           # Object detection
    data: dict[str, Any]                    # Structured extraction results

class Entity(TypedDict, total=False):
    text: str                    # Extracted span
    label: str                   # Entity type
    score: float                 # Confidence (0-1)
    start: int                   # Start character offset
    end: int                     # End character offset
    bbox: list[int]              # Bounding box [x, y, w, h] (images)

class Relation(TypedDict):
    head: str                    # Source entity
    tail: str                    # Target entity
    relation: str                # Relation type
    score: float                 # Confidence

class Classification(TypedDict):
    label: str                   # Class label
    score: float                 # Probability
```

### Example
```bash
curl -X POST http://localhost:8080/v1/extract/urchade/gliner_multi-v2.1 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "items": [{"text": "Tim Cook is the CEO of Apple Inc."}],
    "params": {
      "labels": ["person", "organization", "role"]
    }
  }'
```

Response:
{ "model": "urchade/gliner_multi-v2.1", "items": [ { "id": "item-0", "entities": [ {"text": "Tim Cook", "label": "person", "score": 0.93, "start": 0, "end": 8}, {"text": "CEO", "label": "role", "score": 0.88, "start": 16, "end": 19}, {"text": "Apple Inc", "label": "organization", "score": 0.95, "start": 23, "end": 32} ] } ]}GET /v1/models
## GET /v1/models

List all available models with their capabilities.

### Response Schema

```python
class ModelsListResponse(BaseModel):
    models: list[ModelInfo]

class ModelInfo(BaseModel):
    name: str                            # Model name
    inputs: list[str]                    # Supported inputs: text, image
    outputs: list[str]                   # Supported outputs: dense, sparse, multivector
    dims: dict[str, int]                 # Dimensions per output type
    loaded: bool                         # Whether model is in GPU memory
    max_sequence_length: int             # Maximum tokens
    profiles: dict[str, ProfileInfo]     # Available profiles

class ProfileInfo(BaseModel):
    is_default: bool                     # Whether this is the default profile
    output_types: list[str]              # Output types enabled by this profile
    output_similarity: dict[str, str]    # Similarity metrics per output type
```

### Example
```bash
curl -H "Accept: application/json" http://localhost:8080/v1/models
```

Response:
{ "models": [ { "name": "BAAI/bge-m3", "inputs": ["text"], "outputs": ["dense", "sparse", "multivector"], "dims": {"dense": 1024, "sparse": 250002, "multivector": 1024}, "loaded": true, "max_sequence_length": 8192, "profiles": {} }, { "name": "BAAI/bge-reranker-v2-m3", "inputs": ["text"], "outputs": ["score"], "dims": {}, "loaded": false, "max_sequence_length": 8192, "profiles": {} } ]}POST /v1/embeddings (OpenAI Compatible)
## POST /v1/embeddings (OpenAI Compatible)

Drop-in replacement for OpenAI’s embeddings API.

### Example

```bash
curl -X POST http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "model": "BAAI/bge-m3",
    "input": ["Hello, world!"]
  }'
```

Response:
{ "object": "list", "model": "BAAI/bge-m3", "data": [ { "object": "embedding", "index": 0, "embedding": [0.0234, -0.0891, ...] } ], "usage": { "prompt_tokens": 3, "total_tokens": 3 }}Works with OpenAI SDK, LangChain’s OpenAIEmbeddings, and other compatible clients.
## Health Endpoints

### GET /healthz

Liveness probe. Returns 200 if the server process is running.

```bash
curl http://localhost:8080/healthz
# "ok"
```

### GET /readyz

Readiness probe. Returns 200 if the server is ready to accept traffic.

```bash
curl http://localhost:8080/readyz
# "ok"
```
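A common deployment pattern is to poll `/readyz` before routing traffic. A minimal sketch (the helper name and timeout values are illustrative; assumes the `requests` package):

```python
import time
import requests

def wait_until_ready(base_url: str, timeout_s: float = 60.0) -> None:
    """Poll /readyz until the server reports ready or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if requests.get(f"{base_url}/readyz", timeout=2).status_code == 200:
                return
        except requests.ConnectionError:
            pass  # Server not up yet; keep polling.
        time.sleep(1)
    raise TimeoutError("server never became ready")
```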
## GET /metrics

Prometheus metrics endpoint.

### Available Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| `sie_requests_total` | Counter | model, endpoint, status | Total request count |
| `sie_request_duration_seconds` | Histogram | model, endpoint, phase | Latency by phase |
| `sie_batch_size` | Histogram | model | Batch size distribution |
| `sie_tokens_processed_total` | Counter | model | Total tokens processed |
| `sie_queue_depth` | Gauge | model | Pending items per model |
| `sie_model_loaded` | Gauge | model, device | Model load status (1/0) |
| `sie_model_memory_bytes` | Gauge | model, device | GPU memory per model |
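Outside of a Prometheus scrape, the exposition text can also be parsed directly. A sketch using the `prometheus_client` parser (package assumed installed):

```python
import requests
from prometheus_client.parser import text_string_to_metric_families

body = requests.get("http://localhost:8080/metrics").text

# Print the per-model queue depth gauge.
for family in text_string_to_metric_families(body):
    if family.name == "sie_queue_depth":
        for sample in family.samples:
            print(sample.labels["model"], sample.value)
```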
## WebSocket /ws/status

Real-time worker status stream. Sends updates every 200ms.

### Message Schema
```python
{
  "timestamp": float,              # Unix timestamp
  "gpu": str,                      # GPU type (e.g., "l4", "a100-80gb")
  "loaded_models": list[str],      # Currently loaded models
  "server": {
    "version": str,
    "uptime_seconds": int,
    "user": str,
    "working_dir": str,
    "pid": int
  },
  "gpus": [                        # Per-GPU metrics
    {
      "index": int,
      "name": str,
      "gpu_type": str,             # Normalized type (e.g., "l4", "a100-80gb")
      "utilization_percent": float,
      "memory_used_bytes": int,
      "memory_total_bytes": int,
      "memory_threshold_pct": float,
      "temperature_c": int
    }
  ],
  "models": [                      # Per-model status
    {
      "name": str,
      "state": str,                # "loaded", "loading", "unloading", "available"
      "device": str | None,
      "memory_bytes": int,
      "queue_depth": int,
      "queue_pending_items": int,
      "config": {...}              # Model configuration
    }
  ],
  "counters": {...},               # Prometheus counter metrics
  "histograms": {...}              # Prometheus histogram metrics
}
```

Example client:

```javascript
const ws = new WebSocket("ws://localhost:8080/ws/status");
ws.onmessage = (event) => {
  const status = JSON.parse(event.data);
  console.log(`GPU utilization: ${status.gpus[0].utilization_percent}%`);
};
```
## Error Responses

All endpoints return consistent error responses:
{ "detail": { "code": "MODEL_NOT_FOUND", "message": "Model 'unknown-model' not found" }}Error Codes
| Code | HTTP Status | Description |
|---|---|---|
| `MODEL_NOT_FOUND` | 404 | Requested model doesn’t exist |
| `INVALID_INPUT` | 400 | Invalid request format |
| `MODEL_NOT_LOADED` | 503 | Model is not loaded or still loading |
| `LORA_LOADING` | 503 | LoRA adapter is loading (retry with `Retry-After` header) |
| `QUEUE_FULL` | 503 | Server overloaded, request queue is full |
| `DEPENDENCY_CONFLICT` | 409 | Model requires different bundle/dependencies |
| `INFERENCE_ERROR` | 500 | Error during model inference |
| `INTERNAL_ERROR` | 500 | Unexpected server error |
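Since the 503 codes are transient and `LORA_LOADING` ships a `Retry-After` header, clients typically retry with backoff. A hedged sketch (the helper name and backoff values are illustrative):

```python
import time
import requests

def post_with_retry(url: str, json_body: dict, max_attempts: int = 5) -> dict:
    """Retry transient 503s, honoring Retry-After when the server sends it."""
    for attempt in range(max_attempts):
        resp = requests.post(url, json=json_body,
                             headers={"Accept": "application/json"})
        if resp.status_code != 503:
            resp.raise_for_status()  # Non-retryable errors raise immediately.
            return resp.json()
        # Fall back to exponential backoff if Retry-After is absent.
        delay = float(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(delay)
    raise RuntimeError("server still unavailable after retries")
```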
## Response Headers

Timing and tracing information is included in response headers:

| Header | Description |
|---|---|
| `X-Total-Time` | Total request time (ms) |
| `X-Queue-Time` | Time waiting in queue (ms) |
| `X-Tokenization-Time` | Preprocessing time (ms) |
| `X-Inference-Time` | GPU inference time (ms) |
| `X-Postprocessing-Time` | Postprocessing time (ms), only if > 0 |
| `X-Trace-ID` | OpenTelemetry trace ID for distributed tracing |
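A quick sketch of reading the timing breakdown off a response (assumes the `requests` package; `X-Postprocessing-Time` may be absent, per the table above):

```python
import requests

resp = requests.post(
    "http://localhost:8080/v1/encode/BAAI/bge-m3",
    json={"items": [{"text": "Hello, world!"}]},
    headers={"Accept": "application/json"},
)

# Header values are milliseconds.
for header in ("X-Total-Time", "X-Queue-Time", "X-Tokenization-Time",
               "X-Inference-Time", "X-Postprocessing-Time"):
    if header in resp.headers:
        print(f"{header}: {resp.headers[header]} ms")
```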