# HTTP API Reference
This reference documents all HTTP endpoints exposed by the SIE server.
## Endpoint Summary

| Endpoint | Method | Purpose |
|---|---|---|
| `/v1/encode/:model` | POST | Generate embeddings |
| `/v1/score/:model` | POST | Rerank items |
| `/v1/extract/:model` | POST | Extract entities and structured data |
| `/v1/models` | GET | List available models |
| `/v1/models/:model` | GET | Get model details |
| `/v1/embeddings` | POST | OpenAI-compatible embeddings |
| `/healthz` | GET | Liveness probe |
| `/readyz` | GET | Readiness probe |
| `/metrics` | GET | Prometheus metrics |
| `/ws/status` | WebSocket | Real-time worker status |
## Wire Format

SIE defaults to msgpack for efficient binary serialization. This preserves numpy arrays natively and produces ~37% smaller payloads than JSON.

Content negotiation:

- `Content-Type: application/msgpack` for requests
- `Accept: application/msgpack` for responses (default)
- `Accept: application/json` falls back to JSON
When using JSON, arrays are converted to lists.
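As a sketch, here is a msgpack round trip using the `requests` and `msgpack` packages (both assumed installed; depending on how the server packs arrays, numeric fields may come back as plain lists or require the `msgpack-numpy` extension to decode):

```python
import msgpack
import requests

# Encode the request body as msgpack and ask for a msgpack response.
payload = {"items": [{"text": "Hello, world!"}]}
resp = requests.post(
    "http://localhost:8080/v1/encode/BAAI/bge-m3",
    data=msgpack.packb(payload),
    headers={
        "Content-Type": "application/msgpack",
        "Accept": "application/msgpack",
    },
)
resp.raise_for_status()

# Decode the binary response body back into Python objects.
result = msgpack.unpackb(resp.content)
print(result["items"][0]["dense"]["dims"])
```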
## POST /v1/encode/:model

Generate embeddings for input items. Supports dense, sparse, and multi-vector outputs.

### Request Schema
```python
class EncodeRequest(TypedDict, total=False):
    items: list[Item]            # Required: items to encode
    params: EncodeParams         # Optional: encoding parameters

class EncodeParams(TypedDict, total=False):
    output_types: list[str]      # 'dense', 'sparse', 'multivector'
    instruction: str             # Task instruction for query encoding
    output_dtype: str            # 'float32', 'float16', 'int8', 'binary'
    options: dict[str, Any]      # Profile, LoRA, runtime options

class Item(TypedDict, total=False):
    id: str                      # Client-provided ID (echoed back)
    text: str                    # Text content
    images: list[ImageInput]     # Image bytes with format hint

class ImageInput(TypedDict, total=False):
    data: bytes                  # Image bytes
    format: str                  # 'jpeg', 'png', 'webp'
```

### Response Schema
```python
class EncodeResponse(TypedDict, total=False):
    model: str                   # Model name used
    items: list[EncodeResult]    # One result per input item
    timing: TimingInfo           # Server-side timing breakdown

class EncodeResult(TypedDict, total=False):
    id: str                      # Echoed item ID
    dense: DenseVector           # Dense embedding
    sparse: SparseVector         # Sparse embedding
    multivector: MultiVector     # Per-token embeddings

class DenseVector(TypedDict, total=False):
    dims: int                    # Vector dimensionality
    dtype: str                   # 'float32', 'float16', 'int8', 'binary'
    values: list[float]          # Vector values

class SparseVector(TypedDict, total=False):
    dims: int                    # Vocabulary size
    dtype: str                   # Data type
    indices: list[int]           # Non-zero dimension indices
    values: list[float]          # Values at those indices

class MultiVector(TypedDict, total=False):
    token_dims: int              # Per-token embedding dimension
    num_tokens: int              # Number of tokens
    dtype: str                   # Data type
    values: list[list[float]]    # Token embeddings
```

### Request Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `items` | `list[Item]` | Required | Items to encode |
| `params.output_types` | `list[str]` | `["dense"]` | Output types to return |
| `params.instruction` | `str` | `None` | Instruction prefix for query encoding |
| `params.output_dtype` | `str` | `"float32"` | Output precision |
| `params.options` | `dict` | `None` | Runtime options (profile, lora, etc.) |
### Examples

Basic encoding:

```bash
curl -X POST http://localhost:8080/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "items": [{"text": "Hello, world!"}]
  }'
```

Multiple output types:
```bash
curl -X POST http://localhost:8080/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "items": [{"text": "Search query"}],
    "params": {
      "output_types": ["dense", "sparse"],
      "instruction": "Represent this query for retrieval:"
    }
  }'
```

Response:
{ "model": "BAAI/bge-m3", "items": [ { "dense": { "dims": 1024, "dtype": "float32", "values": [0.0234, -0.0891, 0.1234, ...] }, "sparse": { "dims": 250002, "dtype": "float32", "indices": [101, 2023, 5789, ...], "values": [0.45, 0.32, 0.28, ...] } } ]}POST /v1/score/:model
## POST /v1/score/:model

Rerank items against a query using a cross-encoder model.

### Request Schema

```python
class ScoreRequest(TypedDict, total=False):
    query: Item                  # Required: query to score against
    items: list[Item]            # Required: items to score
    instruction: str             # Optional instruction
    options: dict[str, Any]      # Runtime options
```

### Response Schema
```python
class ScoreResponse(TypedDict, total=False):
    model: str
    query_id: str | None         # Echoed query ID
    scores: list[ScoreEntry]     # Sorted by score descending

class ScoreEntry(TypedDict):
    item_id: str | None          # Echoed item ID
    score: float                 # Relevance score
    rank: int                    # Position (0 = most relevant)
```

### Example
```bash
curl -X POST http://localhost:8080/v1/score/BAAI/bge-reranker-v2-m3 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "query": {"text": "What is machine learning?"},
    "items": [
      {"id": "doc-1", "text": "ML uses algorithms to learn from data."},
      {"id": "doc-2", "text": "The weather is sunny today."}
    ]
  }'
```

Response:
{ "model": "BAAI/bge-reranker-v2-m3", "scores": [ {"item_id": "doc-1", "score": 0.891, "rank": 0}, {"item_id": "doc-2", "score": 0.023, "rank": 1} ]}POST /v1/extract/:model
## POST /v1/extract/:model

Extract structured data from items: entities, relations, classifications, or vision outputs.

### Request Schema

```python
class ExtractRequest(TypedDict, total=False):
    items: list[Item]            # Required: items to extract from
    params: ExtractParams        # Optional: extraction parameters

class ExtractParams(TypedDict, total=False):
    labels: list[str]            # Entity types for NER
    output_schema: dict          # JSON schema for structured extraction
    instruction: str             # Task instruction
    options: dict[str, Any]      # Runtime options
```

### Response Schema
```python
class ExtractResponse(TypedDict, total=False):
    model: str
    items: list[ExtractResult]

class ExtractResult(TypedDict, total=False):
    id: str
    entities: list[Entity]                  # NER results
    relations: list[Relation]               # Relation extraction
    classifications: list[Classification]
    objects: list[DetectedObject]           # Object detection
    data: dict[str, Any]                    # Structured extraction results

class Entity(TypedDict, total=False):
    text: str                    # Extracted span
    label: str                   # Entity type
    score: float                 # Confidence (0-1)
    start: int                   # Start character offset
    end: int                     # End character offset
    bbox: list[int]              # Bounding box [x, y, w, h] (images)

class Relation(TypedDict):
    head: str                    # Source entity
    tail: str                    # Target entity
    relation: str                # Relation type
    score: float                 # Confidence

class Classification(TypedDict):
    label: str                   # Class label
    score: float                 # Probability
```

### Example
```bash
curl -X POST http://localhost:8080/v1/extract/urchade/gliner_multi-v2.1 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "items": [{"text": "Tim Cook is the CEO of Apple Inc."}],
    "params": {
      "labels": ["person", "organization", "role"]
    }
  }'
```

Response:
{ "model": "urchade/gliner_multi-v2.1", "items": [ { "id": "item-0", "entities": [ {"text": "Tim Cook", "label": "person", "score": 0.93, "start": 0, "end": 8}, {"text": "CEO", "label": "role", "score": 0.88, "start": 16, "end": 19}, {"text": "Apple Inc", "label": "organization", "score": 0.95, "start": 23, "end": 32} ] } ]}GET /v1/models
## GET /v1/models

List all available models with their capabilities.

### Response Schema

```python
class ModelsListResponse(BaseModel):
    models: list[ModelInfo]

class ModelInfo(BaseModel):
    name: str                            # Model name
    inputs: list[str]                    # Supported inputs: text, image
    outputs: list[str]                   # Supported outputs: dense, sparse, multivector
    dims: dict[str, int]                 # Dimensions per output type
    loaded: bool                         # Whether model is in GPU memory
    max_sequence_length: int             # Maximum tokens
    profiles: dict[str, ProfileInfo]     # Available profiles

class ProfileInfo(BaseModel):
    is_default: bool                     # Whether this is the default profile
    output_types: list[str]              # Output types enabled by this profile
    output_similarity: dict[str, str]    # Similarity metrics per output type
```

### Example
```bash
curl -H "Accept: application/json" http://localhost:8080/v1/models
```

Response:
{ "models": [ { "name": "BAAI/bge-m3", "inputs": ["text"], "outputs": ["dense", "sparse", "multivector"], "dims": {"dense": 1024, "sparse": 250002, "multivector": 1024}, "loaded": true, "max_sequence_length": 8192, "profiles": {} }, { "name": "BAAI/bge-reranker-v2-m3", "inputs": ["text"], "outputs": ["score"], "dims": {}, "loaded": false, "max_sequence_length": 8192, "profiles": {} } ]}POST /v1/embeddings (OpenAI Compatible)
## POST /v1/embeddings (OpenAI Compatible)

Drop-in replacement for OpenAI’s embeddings API.

### Example

```bash
curl -X POST http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "model": "BAAI/bge-m3",
    "input": ["Hello, world!"]
  }'
```

Response:
{ "object": "list", "model": "BAAI/bge-m3", "data": [ { "object": "embedding", "index": 0, "embedding": [0.0234, -0.0891, ...] } ], "usage": { "prompt_tokens": 3, "total_tokens": 3 }}Works with OpenAI SDK, LangChain’s OpenAIEmbeddings, and other compatible clients.
## Health Endpoints

### GET /healthz

Liveness probe. Returns 200 if the server process is running.

```bash
curl http://localhost:8080/healthz
# "ok"
```

### GET /readyz

Readiness probe. Returns 200 if the server is ready to accept traffic.

```bash
curl http://localhost:8080/readyz
# "ok"
```
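A common deployment pattern is to poll `/readyz` before routing traffic. A minimal sketch (the helper name and timeout values are illustrative; assumes the `requests` package):

```python
import time
import requests

def wait_until_ready(base_url: str, timeout_s: float = 60.0) -> None:
    """Poll /readyz until the server reports ready or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if requests.get(f"{base_url}/readyz", timeout=2).status_code == 200:
                return
        except requests.ConnectionError:
            pass  # Server not up yet; keep polling.
        time.sleep(1)
    raise TimeoutError("server never became ready")
```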
## GET /metrics

Prometheus metrics endpoint.

### Available Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| `sie_requests_total` | Counter | model, endpoint, status | Total request count |
| `sie_request_duration_seconds` | Histogram | model, endpoint, phase | Latency by phase |
| `sie_batch_size` | Histogram | model | Batch size distribution |
| `sie_tokens_processed_total` | Counter | model | Total tokens processed |
| `sie_queue_depth` | Gauge | model | Pending items per model |
| `sie_model_loaded` | Gauge | model, device | Model load status (1/0) |
| `sie_model_memory_bytes` | Gauge | model, device | GPU memory per model |
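Outside of a Prometheus scrape, the exposition text can also be parsed directly. A sketch using the `prometheus_client` parser (package assumed installed):

```python
import requests
from prometheus_client.parser import text_string_to_metric_families

body = requests.get("http://localhost:8080/metrics").text

# Print the per-model queue depth gauge.
for family in text_string_to_metric_families(body):
    if family.name == "sie_queue_depth":
        for sample in family.samples:
            print(sample.labels["model"], sample.value)
```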
## WebSocket /ws/status

Real-time worker status stream. Sends updates every 200ms.

### Message Schema
```python
{
  "timestamp": float,              # Unix timestamp
  "gpu": str,                      # GPU type (e.g., "l4", "a100-80gb")
  "loaded_models": list[str],      # Currently loaded models
  "server": {
    "version": str,
    "uptime_seconds": int,
    "user": str,
    "working_dir": str,
    "pid": int
  },
  "gpus": [                        # Per-GPU metrics
    {
      "index": int,
      "name": str,
      "gpu_type": str,             # Normalized type (e.g., "l4", "a100-80gb")
      "utilization_percent": float,
      "memory_used_bytes": int,
      "memory_total_bytes": int,
      "memory_threshold_pct": float,
      "temperature_c": int
    }
  ],
  "models": [                      # Per-model status
    {
      "name": str,
      "state": str,                # "loaded", "loading", "unloading", "available"
      "device": str | None,
      "memory_bytes": int,
      "queue_depth": int,
      "queue_pending_items": int,
      "config": {...}              # Model configuration
    }
  ],
  "counters": {...},               # Prometheus counter metrics
  "histograms": {...}              # Prometheus histogram metrics
}
```

Example client:

```javascript
const ws = new WebSocket("ws://localhost:8080/ws/status");
ws.onmessage = (event) => {
  const status = JSON.parse(event.data);
  console.log(`GPU utilization: ${status.gpus[0].utilization_percent}%`);
};
```
## Error Responses

All endpoints return consistent error responses:
{ "detail": { "code": "MODEL_NOT_FOUND", "message": "Model 'unknown-model' not found" }}Error Codes
| Code | HTTP Status | Description |
|---|---|---|
| `MODEL_NOT_FOUND` | 404 | Requested model doesn’t exist |
| `INVALID_INPUT` | 400 | Invalid request format |
| `MODEL_NOT_LOADED` | 503 | Model is not loaded or still loading |
| `LORA_LOADING` | 503 | LoRA adapter is loading (retry with `Retry-After` header) |
| `QUEUE_FULL` | 503 | Server overloaded, request queue is full |
| `DEPENDENCY_CONFLICT` | 409 | Model requires different bundle/dependencies |
| `INFERENCE_ERROR` | 500 | Error during model inference |
| `INTERNAL_ERROR` | 500 | Unexpected server error |
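Since the 503 codes are transient and `LORA_LOADING` ships a `Retry-After` header, clients typically retry with backoff. A hedged sketch (the helper name and backoff values are illustrative):

```python
import time
import requests

def post_with_retry(url: str, json_body: dict, max_attempts: int = 5) -> dict:
    """Retry transient 503s, honoring Retry-After when the server sends it."""
    for attempt in range(max_attempts):
        resp = requests.post(url, json=json_body,
                             headers={"Accept": "application/json"})
        if resp.status_code != 503:
            resp.raise_for_status()  # Non-retryable errors raise immediately.
            return resp.json()
        # Fall back to exponential backoff if Retry-After is absent.
        delay = float(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(delay)
    raise RuntimeError("server still unavailable after retries")
```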
## Response Headers

Timing and tracing information is included in response headers:

| Header | Description |
|---|---|
| `X-Total-Time` | Total request time (ms) |
| `X-Queue-Time` | Time waiting in queue (ms) |
| `X-Tokenization-Time` | Preprocessing time (ms) |
| `X-Inference-Time` | GPU inference time (ms) |
| `X-Postprocessing-Time` | Postprocessing time (ms), only if > 0 |
| `X-Trace-ID` | OpenTelemetry trace ID for distributed tracing |
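A quick sketch of reading the timing breakdown off a response (assumes the `requests` package; `X-Postprocessing-Time` may be absent, per the table above):

```python
import requests

resp = requests.post(
    "http://localhost:8080/v1/encode/BAAI/bge-m3",
    json={"items": [{"text": "Hello, world!"}]},
    headers={"Accept": "application/json"},
)

# Header values are milliseconds.
for header in ("X-Total-Time", "X-Queue-Time", "X-Tokenization-Time",
               "X-Inference-Time", "X-Postprocessing-Time"):
    if header in resp.headers:
        print(f"{header}: {resp.headers[header]} ms")
```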